Parsing a large number of dates with pandas - scalability - performance drops faster than linear

Look at this comparison:

In [507]: fn
Out[507]: 'D:\\download\\slow.csv.tar.gz'

In [508]: fn2
Out[508]: 'D:\\download\\slow_filtered.csv.gz'

In [509]: %timeit df = pd.read_csv(fn, parse_dates=['from'], index_col=0)
1 loop, best of 3: 15.7 s per loop

In [510]: %timeit df2 = pd.read_csv(fn2, parse_dates=['from'], index_col=0)
1 loop, best of 3: 399 ms per loop

In [511]: len(df)
Out[511]: 99831

In [512]: len(df2)
Out[512]: 99831

In [513]: df.dtypes
Out[513]:
from    object
dtype: object

In [514]: df2.dtypes
Out[514]:
from    datetime64[ns]
dtype: object

The only difference between those two DataFrames is in row #36867, which I've manually corrected in the D:\\download\\slow_filtered.csv.gz file:

In [518]: df.iloc[36867]
Out[518]:
from    20124-10-20 10:12:00
Name: 36867, dtype: object

In [519]: df2.iloc[36867]
Out[519]:
from    2014-10-20 10:12:00
Name: 36867, dtype: datetime64[ns]
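
One hedged way to sidestep this (not part of the original answer): read the offending column as plain strings and convert it afterwards in a single vectorized pass with an explicit format, coercing anything unparseable (such as the 20124-10-20 value above) to NaT, which also makes the bad rows easy to locate:

import pandas as pd

# fn is the archive from the session above; 'from' is the problem column.
df = pd.read_csv(fn, index_col=0, dtype={'from': str})

# One vectorized conversion; malformed entries become NaT instead of forcing
# pandas into a slow element-by-element parse of the whole column.
df['from'] = pd.to_datetime(df['from'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# The NaT rows point straight at the values that need manual correction.
print(df[df['from'].isna()])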

Setup:

start_ts = '2000-01-01 00:00:00'

pd.DataFrame({
    'date': pd.date_range(start_ts, freq='1S', periods=10**4)
}).to_csv('d:/temp/10k.csv', index=False)

pd.DataFrame({
    'date': pd.date_range(start_ts, freq='1S', periods=10**5)
}).to_csv('d:/temp/100k.csv', index=False)

pd.DataFrame({
    'date': pd.date_range(start_ts, freq='1S', periods=10**6)
}).to_csv('d:/temp/1m.csv', index=False)

pd.DataFrame({
    'date': pd.date_range(start_ts, freq='1S', periods=10**7)
}).to_csv('d:/temp/10m.csv', index=False)

dt_parser = lambda x: pd.to_datetime(x, format="%Y-%m-%d %H:%M:%S")
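
The dt_parser above is presumably meant to be handed to read_csv. A hedged sketch of how that would look (the file name is one of the generated files from the setup; on pandas 2.0+ the date_parser argument is deprecated in favour of date_format):

# Passing an explicit parser/format spares pandas from guessing the layout
# of every value, which is usually where the date-parsing time goes.
df = pd.read_csv('d:/temp/1m.csv', parse_dates=['date'], date_parser=dt_parser)

# Equivalent on pandas >= 2.0:
df = pd.read_csv('d:/temp/1m.csv', parse_dates=['date'],
                 date_format='%Y-%m-%d %H:%M:%S')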

Okay -- based on the discussion in the comments and in the chat room, it seems that there is a problem with the OP's data: using the code below, he is unable to reproduce his own error:

import pandas as pd
import datetime
from time import time

format_string = '%Y-%m-%d %H:%M:%S'
base_dt = datetime.datetime(2016, 1, 1)
exponent_range = range(2, 8)

def dump(number_records):
    print('now dumping %s records' % number_records)
    dts = pd.date_range(base_dt, periods=number_records, freq='1s')
    df = pd.DataFrame({
        'date': [dt.strftime(format_string) for dt in dts]
    })
    df.to_csv('%s_records.csv' % number_records)

def test(number_records):
    start = time()
    pd.read_csv('%s_records.csv' % number_records, parse_dates=['date'])
    end = time()
    print(number_records, end - start)

def main():
    for i in exponent_range:
        number_records = 10 ** i
        dump(number_records)
        test(number_records)

if __name__ == '__main__':
    main()

Suggestion : 2


First, install and import the Vaex Python library as shown below.

#!pip install vaex
import vaex

Let's see an example. You can download the dataset I'm using from here.

# Reading data from local disk
df = vaex.open('yellow_tripdata_2020-01.hdf5')

Vaex will read the CSV in chunks and convert each chunk to a temporary HDF5 file, which is then concatenated into a single HDF5 file. You can control the size of the individual chunks with the chunk_size argument.

# Converting the CSV into HDF5 and reading the dataframe
%time df = vaex.from_csv('yellow_tripdata_2020-01.csv', convert=True)
df
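
A hedged sketch of the chunk_size argument mentioned above (the value is an arbitrary illustration, not from the original post):

# Convert the CSV to HDF5 in chunks of 1_000_000 rows; the temporary chunk
# files are concatenated into a single HDF5 file next to the CSV.
df = vaex.from_csv('yellow_tripdata_2020-01.csv', convert=True,
                   chunk_size=1_000_000)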

Vaex also evaluates lazily: it does not actually perform the operation or read through the whole dataset unless necessary (unlike pandas). For example, if you write an expression like df['passenger_count'].mean, the actual computation does not happen; Vaex just notes down what it will have to compute. A Vaex expression object is created instead, and when printed it shows some preview values. This saves a significant amount of memory.

df['passenger_count'].mean

Let’s have a look at another lazy computation example.

import numpy as np
np.sqrt(df.passenger_count ** 2 + df.trip_distance ** 2)
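
The line above only builds a lazy expression. As a hedged sketch (assuming the df opened earlier and Vaex's evaluate/aggregation API), the computation actually runs only when concrete values are requested:

# Nothing is computed when the expression is defined.
dist = np.sqrt(df.passenger_count ** 2 + df.trip_distance ** 2)

# Work happens only when results are needed:
print(df.mean(dist))       # out-of-core aggregation over the expression
arr = df.evaluate(dist)    # materialise the expression as a NumPy array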

Suggestion : 3


Read the log with read_csv, parse the date column with to_datetime, then parse id and prod, and finally merge everything back together with concat:

import pandas as pd
import io

temp = u"""[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352
[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352
[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352
[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352
[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352"""

# change io.StringIO(temp) to 'filename.csv' to read from an actual log file
df = pd.read_csv(io.StringIO(temp), sep=r"\s+", engine='python', header=None,
                 names=['date', 'get', 'data', 'http', 'no1', 'no2'])

# format codes - http://strftime.org/
df['date'] = pd.to_datetime(df['date'].str.strip('[]'), format="%d/%b/%Y:%H:%M:%S")

# split the request string on '='
df1 = pd.DataFrame([x.split('=') for x in df['data'].tolist()], columns=['c', 'id', 'prod'])

# split the id part on '&'
df2 = pd.DataFrame([x.split('&') for x in df1['id'].tolist()], columns=['id', 'no3'])
print(df)

                 date   get                      data      http  no1   no2
0 2012-01-02 09:07:32  "GET  /click?id=162&prod=5475  HTTP/1.1"  200  4352
1 2012-01-02 09:07:32  "GET  /click?id=162&prod=5475  HTTP/1.1"  200  4352
2 2012-01-02 09:07:32  "GET  /click?id=162&prod=5475  HTTP/1.1"  200  4352
3 2012-01-02 09:07:32  "GET  /click?id=162&prod=5475  HTTP/1.1"  200  4352
4 2012-01-02 09:07:32  "GET  /click?id=162&prod=5475  HTTP/1.1"  200  4352

print(df1)

           c        id  prod
0  /click?id  162&prod  5475
1  /click?id  162&prod  5475
2  /click?id  162&prod  5475
3  /click?id  162&prod  5475
4  /click?id  162&prod  5475

print(df2)

    id   no3
0  162  prod
1  162  prod
2  162  prod
3  162  prod
4  162  prod

df = pd.concat([df['date'], df1['prod'], df2['id']], axis=1)
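
As a hedged alternative to the two intermediate split DataFrames, the id and prod fields could be pulled out in one pass with a regular expression via str.extract. This is a sketch, not part of the original answer, and it operates on the df returned by read_csv (before the concat above replaces it):

# Extract id and prod directly from the request path using named groups.
extracted = df['data'].str.extract(r'id=(?P<id>\d+)&prod=(?P<prod>\d+)')
result = pd.concat([df['date'], extracted], axis=1)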

Suggestion : 4

For datetime64[ns] types, NaT represents missing values. This is a pseudo-native sentinel value that can be represented by NumPy in a singular dtype (datetime64[ns]). pandas objects provide compatibility between NaT and NaN.

Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to represent scalar missing values. At this moment, it is used in the nullable integer, boolean and dedicated string data types as the missing value indicator.

You can insert missing values by simply assigning to containers. The actual missing value used will be chosen based on the dtype.

Because NaN is a float, a column of integers with even one missing value is cast to floating-point dtype (see Support for integer NA for more). pandas provides a nullable integer array, which can be used by explicitly requesting the dtype:

In [1]: df = pd.DataFrame(
   ...:     np.random.randn(5, 3),
   ...:     index=["a", "c", "e", "f", "h"],
   ...:     columns=["one", "two", "three"],
   ...: )
   ...:

In [2]: df["four"] = "bar"

In [3]: df["five"] = df["one"] > 0

In [4]: df
Out[4]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True

In [5]: df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])

In [6]: df2
Out[6]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True

In [7]: df2["one"]
Out[7]:
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64

In [8]: pd.isna(df2["one"])
Out[8]:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2["four"].notna()
Out[9]:
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

In [10]: df2.isna()
Out[10]:
     one    two  three   four   five
a  False  False  False  False  False
b   True   True   True   True   True
c  False  False  False  False  False
d   True   True   True   True   True
e  False  False  False  False  False
f  False  False  False  False  False
g   True   True   True   True   True
h  False  False  False  False  False

In [11]: None == None  # noqa: E711
Out[11]: True

In [12]: np.nan == np.nan
Out[12]: False

In [13]: df2["one"] == np.nan
Out[13]:
a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
Name: one, dtype: bool
In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]:
0       1
1       2
2    <NA>
3       4
dtype: Int64

In [15]: df2 = df.copy()

In [16]: df2["timestamp"] = pd.Timestamp("20120101")

In [17]: df2
Out[17]:
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In [18]: df2.loc[["a", "c", "h"], ["one", "timestamp"]] = np.nan

In [19]: df2
Out[19]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [20]: df2.dtypes.value_counts()
Out[20]:
float64           3
object            1
bool              1
datetime64[ns]    1
dtype: int64
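
To round this off, a small hedged sketch (not part of the pandas docs excerpt above) of pd.NA as the single missing-value marker for the nullable extension dtypes mentioned at the start of this suggestion:

import pandas as pd

# pd.NA is the missing-value indicator for the nullable extension dtypes.
s_int  = pd.Series([1, None, 3], dtype="Int64")            # nullable integer
s_bool = pd.Series([True, None, False], dtype="boolean")   # nullable boolean
s_str  = pd.Series(["a", None, "c"], dtype="string")       # dedicated string dtype

# pd.isna() recognises NaN, NaT and pd.NA uniformly.
print(s_int)
print(pd.isna(s_bool))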