How to tag corrupted data in a dataframe after an error has been raised

I have a large dataframe containing, amongst other things, a (Norwegian) social security number. It is possible to get the date of birth out of this number via a special algorithm, but every now and then an illegal social security number creeps into the database and corrupts the calculation. What I would like to do is tag every row that has an illegal social security number, along with a log message showing the error raised.

import pandas as pd
from datetime import date

sample_data = pd.DataFrame({
    'id': [1, 2, 3],
    'sec_num': [19790116, 19480631, 19861220]  # 19480631 is illegal: June has no 31st day
})

# The actual algorithm transforming the sec number is more complicated;
# this is just for illustration purposes
def int2date(argdate: int):
    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError:
        raise ValueError("Value:{0} not a legal date.".format(argdate))

The error message ends up on every row because of the way you're populating the dataframe:

sample_data['error_msg'] = str(e)
Assigning a scalar like str(e) to a column broadcasts the same value to every row, so one bad record tags the whole frame. This is probably the most efficient way to do it instead:
def int2date(argdate: int):
    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError as e:
        pass  # you could write the row and the error to your logs here

# rows where int2date swallowed the error contain None, so isnull() flags them
df['date_of_birth'] = df.sec_num.apply(int2date)
df['is_in_error'] = df.date_of_birth.isnull()
However, if you also want to write the error messages into the dataframe, you can use the following approach, although it may be much slower (there may well be faster solutions):
df['date_of_birth'] = None
df['error_msg'] = None
df['is_in_error'] = False

# this loop needs the variant of int2date that raises ValueError,
# not the one above that swallows it
for i, row in df.iterrows():
    try:
        date_of_birth = int2date(row['sec_num'])
        df.at[i, 'date_of_birth'] = date_of_birth  # .at replaces the removed set_value
    except ValueError as e:
        df.at[i, 'is_in_error'] = True
        df.at[i, 'error_msg'] = str(e)

This handles each row separately and only writes the error to the matching index instead of updating the entire column.
You are in the realm of handling large data, and throwing exceptions out of a loop is often not the best idea there, because it will normally abort the loop, which you presumably do not want. A typical approach is therefore to use a function which does not throw the exception but returns it instead:
def int2date(argdate: int):
    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError:
        return ValueError("Value:{0} not a legal date.".format(argdate))
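With that variant the whole column can still be filled in one vectorised pass and the tagging derived afterwards. Below is a minimal sketch of one way to consume the returned exceptions; the mask/where wiring is an assumption for illustration, not part of the original answer:

results = sample_data['sec_num'].apply(int2date)  # each entry is a date or a ValueError
sample_data['is_in_error'] = results.map(lambda r: isinstance(r, ValueError))
# keep the date where the row is fine, NaN where it is not
sample_data['date_of_birth'] = results.mask(sample_data['is_in_error'])
# keep the stringified exception only for the rows that failed
sample_data['error_msg'] = results.where(sample_data['is_in_error']).map(
    lambda r: str(r) if isinstance(r, ValueError) else None
)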
To count the number of certain values in a data frame after a group by, this would do the job:

>>> df.groupby('Office').agg(number_unique=('LocationType', 'count'))
Alternatively, use the value_counts method on the column that you need (Office):

myfilteredInfo['Office'].value_counts()
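A self-contained sketch of both approaches; the Office/LocationType frame below is invented purely for illustration:

import pandas as pd

df = pd.DataFrame({
    'Office': ['Oslo', 'Oslo', 'Bergen'],
    'LocationType': ['HQ', 'Branch', 'Branch'],
})

# named aggregation: one row per office, counting LocationType entries
print(df.groupby('Office').agg(number_unique=('LocationType', 'count')))

# value_counts: how often each office appears
print(df['Office'].value_counts())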
Column lookups are case sensitive, so selecting a column whose name does not match exactly fails. Asking for 'country' raises:

KeyError: 'country'

Printing the actual column names shows the capitalised spelling:

['Country', 'Age', 'Salary', 'Purchased']
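A minimal reproduction of that mismatch; the frame below is invented for illustration:

import pandas as pd

df = pd.DataFrame({'Country': ['NO'], 'Age': [42], 'Salary': [50000], 'Purchased': [True]})

print(list(df.columns))  # ['Country', 'Age', 'Salary', 'Purchased']
# df['country']          # raises KeyError: 'country'
print(df['Country'])     # the exact, capitalised name works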
index_col specifies the column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.

Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
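A short sketch of both settings; the inline CSV string is made up to mirror the examples below:

import pandas as pd
from io import StringIO

data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# use the first column as the row index
df = pd.read_csv(StringIO(data), index_col=0)

# force pandas NOT to use the first column as the index
df2 = pd.read_csv(StringIO(data), index_col=False)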
The usecols argument allows you to select any subset of the columns in a file, either using the column names, position numbers or a callable:

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [4]: pd.read_csv(StringIO(data))
Out[4]:
  col1 col2 col3
0    a    b    1
1    a    b    2
2    c    d    3

In [5]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"])
Out[5]:
  col1 col3
0    a    1
1    a    2
2    c    3
In [6]: data = "col1,col2,col3\na,b,1"

In [7]: df = pd.read_csv(StringIO(data))

In [8]: df.columns = [f"pre_{col}" for col in df.columns]

In [9]: df
Out[9]:
  pre_col1 pre_col2 pre_col3
0        a        b        1
In [10]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [11]: pd.read_csv(StringIO(data))
Out[11]:
  col1 col2 col3
0    a    b    1
1    a    b    2
2    c    d    3

In [12]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
Out[12]:
  col1 col2 col3
0    a    b    2
In [13]: import numpy as np

In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"

In [15]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11

In [16]: df = pd.read_csv(StringIO(data), dtype=object)

In [17]: df
Out[17]:
   a   b   c    d
0  1   2   3    4
1  5   6   7    8
2  9  10  11  NaN

In [18]: df["a"][0]
Out[18]: '1'

In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})

In [20]: df.dtypes
Out[20]:
a      int64
b     object
c    float64
d      Int64
dtype: object
In [21]: data = "col_1\n1\n2\n'A'\n4.22"
In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str})
In [23]: df
Out[23]:
  col_1
0     1
1     2
2   'A'
3  4.22
In [24]: df["col_1"].apply(type).value_counts()
Out[24]:
<class 'str'>    4
Name: col_1, dtype: int64
In [25]: df2 = pd.read_csv(StringIO(data))
In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce")
In [27]: df2
Out[27]:
   col_1
0   1.00
1   2.00
2    NaN
3   4.22
In [28]: df2["col_1"].apply(type).value_counts()
Out[28]:
<class 'float'>    4
Name: col_1, dtype: int64