lazy loading csv with pandas


The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem would be to process the CSV in chunks (by specifying the chunksize parameter):

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize = chunksize):
    process(chunk)  # placeholder: handle each DataFrame chunk here

With a chunksize, read_csv returns a TextFileReader which (since pandas 1.2) can also be used as a context manager:

chunksize = 10 ** 6
with pd.read_csv(filename, chunksize = chunksize) as reader:
    for chunk in reader:
        process(chunk)  # placeholder: handle each DataFrame chunk here
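As a concrete sketch of this pattern (the file name and column names here are made up for illustration), each chunk can be reduced as it streams in, so only the running aggregate ever sits in memory:

```python
import pandas as pd

# Write a tiny CSV so the example is self-contained
pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]}).to_csv("demo.csv", index=False)

total = None
with pd.read_csv("demo.csv", chunksize=2) as reader:
    for chunk in reader:
        part = chunk.groupby("key")["val"].sum()  # reduce each chunk
        total = part if total is None else total.add(part, fill_value=0)

print(total.to_dict())  # {'a': 4, 'b': 6}
```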

For large data I recommend the library "dask":

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')

From my projects, another excellent library is datatable.

# Datatable python library
import datatable as dt
df = dt.fread("s3://.../2018-*-*.csv")

I proceeded like this:

chunks = pd.read_table('aphro.csv', chunksize = 1000000, sep = ';',
    names = ['lat', 'long', 'rf', 'date', 'slno'], index_col = 'slno',
    header = None, parse_dates = ['date'])

df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)

You can read in the data as chunks and save each chunk as pickle.

import pandas as pd
import pickle

in_path = ""
#Path where the large file is
out_path = ""
#Path to save the pickle files to
chunk_size = 400000 #size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path, sep = separator, chunksize = chunk_size,
   low_memory = False)

for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i + 1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)

In the next step you read in the pickles and append each pickle to your desired dataframe.

import glob
pickle_path = ""
#Same Path as out_path i.e.where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

df = pd.concat((pd.read_pickle(p) for p in data_p_files), ignore_index = True)

Every value you store requires a memory allocation whose size depends on its data type. As a baseline, the values below are the ranges of the standard C integer types:

The maximum value of UNSIGNED CHAR = 255
The minimum value of SHORT INT = -32768
The maximum value of SHORT INT = 32767
The minimum value of INT = -2147483648
The maximum value of INT = 2147483647
The minimum value of CHAR = -128
The maximum value of CHAR = 127
The minimum value of LONG = -9223372036854775808
The maximum value of LONG = 9223372036854775807
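The same ranges apply to NumPy's fixed-width integer dtypes, which you can query with np.iinfo before choosing a dtype for a column:

```python
import numpy as np

# Inspect the representable range of each fixed-width integer dtype
for dtype in (np.int8, np.uint8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)
```

For example, a column whose values stay between 0 and 255 fits in uint8 and needs one byte per value instead of the eight that the default int64 uses.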

You can pass the dtype parameter to pandas read methods as a dict of the form {column: type}:

import numpy as np
import pandas as pd

df_dtype = {
   "column_1": int,
   "column_2": str,
   "column_3": np.int16,
   "column_4": np.uint8,
   "column_n": np.float32,
}

df = pd.read_csv('path/to/file', dtype = df_dtype)
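To see the effect of choosing smaller dtypes, compare the memory usage of the default int64 columns against downcast ones (the column names and data here are illustrative):

```python
import io
import numpy as np
import pandas as pd

# Generate a small in-memory CSV with two integer columns
csv_data = "a,b\n" + "\n".join(f"{i % 100},{i % 250}" for i in range(1000))

df_default = pd.read_csv(io.StringIO(csv_data))  # columns load as int64
df_small = pd.read_csv(io.StringIO(csv_data),
                       dtype={"a": np.int8, "b": np.uint8})  # 1 byte per value

print(df_default.memory_usage(deep=True).sum())
print(df_small.memory_usage(deep=True).sum())  # much smaller for the data columns
```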

The functions read_csv and read_table are almost the same, but read_table defaults to a tab delimiter, so you must pass the delimiter "," explicitly when you use read_table on a comma-separated file.
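A quick check of the difference, using an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

data = "x,y\n1,2\n3,4\n"

df_csv = pd.read_csv(io.StringIO(data))             # sep=',' is the default
df_tab = pd.read_table(io.StringIO(data), sep=",")  # sep='\t' is the default, so override it

print(df_csv.equals(df_tab))  # True
```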

def get_from_action_data(fname, chunk_size = 100000):
    reader = pd.read_csv(fname, header = 0, iterator = True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")

    df_ac = pd.concat(chunks, ignore_index = True)
    return df_ac

Suggestion : 2

Luckily pandas.read_csv() is one of the "richest" methods in the library, and its behavior can be fine-tuned to a great extent. One minor shortfall of read_csv() is that it cannot skip rows based on their content, i.e. it is not possible to filter the dataset by value while loading the csv; for that you have to load the rows first and filter afterwards.

Memory issues with pandas read_csv() have been around for a long time, so one of the best workarounds for large datasets is loading in chunks. Note: loading data in chunks is actually slower than reading the whole file directly, since you need to concat the chunks again, but it lets you load files of tens of GBs easily.

If your CSV file does not have a header (column names), you can tell read_csv() in two ways: pass the argument header=None, or pass the argument names, which implicitly sets header=None.
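One way to load only part of a file is the skiprows parameter, which also accepts a callable: it is evaluated against each row index, so you can skip rows by position (though not by their values). A small sketch:

```python
import io
import pandas as pd

# Ten data rows holding the values 0..9
data = "x\n" + "\n".join(str(i) for i in range(10))

# Keep the header (row 0) and every third row of the file
df = pd.read_csv(io.StringIO(data), skiprows=lambda i: i != 0 and i % 3 != 0)
print(df["x"].tolist())  # [2, 5, 8]
```

Because the callable only sees row positions, filtering by value still requires loading the rows and filtering afterwards.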

import linecache
from pandas import DataFrame

filename = "large.csv"
indices = [12, 24, 36]

li = []
for i in indices:
    li.append(linecache.getline(filename, i).rstrip().split(','))
dataframe = DataFrame(li)

Suggestion : 3

pandas.read_stata , pandas.read_excel , pandas.read_table , pandas.read_pickle

>>> pd.read_csv('data.csv')