The error shows that the machine does not have enough memory to read the entire
CSV into a DataFrame at once. Assuming you do not need the whole dataset in
memory at the same time, one way to avoid the problem is to process the CSV in
chunks, by specifying the chunksize parameter:
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
read_csv with chunksize returns a context manager, to be used like so:
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)
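The process function above is left undefined; a minimal sketch, assuming you only need a filtered subset of each chunk (the column name "value" is just a placeholder), could look like this:

filtered_parts = []

def process(chunk):
    # Hypothetical filter: keep only the rows you actually need from each chunk
    filtered_parts.append(chunk[chunk["value"] > 0])

After the loop, pd.concat(filtered_parts, ignore_index=True) gives you a single, much smaller DataFrame.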
For large data I recommend you use the library dask, e.g.:
# Dask DataFrames implement the Pandas API
import dask.dataframe as dd

df = dd.read_csv('s3://.../2018-*-*.csv')
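Note that Dask DataFrames are lazy: read_csv only builds a task graph, and nothing is actually loaded until you ask for a result, so reductions fit in memory even when the raw CSV does not. A minimal sketch (the filename and the column name "rf" are assumptions for illustration):

import dask.dataframe as dd

df = dd.read_csv('large.csv')        # lazy: builds a task graph, reads nothing yet
mean_rf = df['rf'].mean().compute()  # compute() processes the file in partitions and returns a plain number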
In my projects, another library that has worked even better is datatable.
# datatable Python library
import datatable as dt

df = dt.fread("s3://.../2018-*-*.csv")
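fread returns a datatable Frame rather than a pandas DataFrame; if the rest of your code expects pandas, you can convert it, at the cost of materializing the result in memory (filename here is just an example):

import datatable as dt

dt_frame = dt.fread("large.csv")  # multi-threaded reader, returns a datatable Frame
df = dt_frame.to_pandas()         # convert to pandas only if the result fits in memory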
I proceeded like this:
chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'],
                       index_col='slno', header=None, parse_dates=['date'])

df = pd.DataFrame()
%time df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
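One thing to watch with this pattern: pd.concat only stacks the per-chunk results, so a (lat, long, year) group that spans several chunks appears more than once in the concatenated frame. A second aggregation merges those duplicates; this is an addition, assuming the index level names produced by the groupby above:

# Merge groups that were split across chunk boundaries
df = df.groupby(level=['lat', 'long', 'date']).sum()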
You can read in the data in chunks and save each chunk as a pickle file.
import pandas as pd
import pickle

in_path = ""         # Path where the large file is
out_path = ""        # Path to save the pickle files to
chunk_size = 400000  # Size of chunks depends on your available memory
separator = "~"

reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                     low_memory=False)

for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i + 1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
In the next step you read the pickles back in and combine them into your desired DataFrame.
import glob
import pandas as pd

pickle_path = ""  # Same path as out_path, i.e. where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([pd.read_pickle(p) for p in data_p_files], ignore_index=True)
Regarding data structures: every value you store requires a memory allocation, and the size of that allocation depends on its type. At a basic level, refer to the ranges below (the values are for the C programming language's integer types):
The maximum value of UNSIGNED CHAR = 255
The minimum value of SHORT INT     = -32768
The maximum value of SHORT INT     = 32767
The minimum value of INT           = -2147483648
The maximum value of INT           = 2147483647
The minimum value of CHAR          = -128
The maximum value of CHAR          = 127
The minimum value of LONG          = -9223372036854775808
The maximum value of LONG          = 9223372036854775807
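The NumPy dtypes used by pandas have the same kind of limits, and you can inspect them directly before choosing a type:

import numpy as np

print(np.iinfo(np.int16))    # min = -32768, max = 32767
print(np.iinfo(np.uint8))    # min = 0, max = 255
print(np.finfo(np.float32))  # precision and range of 32-bit floats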
You can pass the dtype parameter to the pandas read methods as a dict of the form {column: type}:
import numpy as np
import pandas as pd

df_dtype = {
    "column_1": int,
    "column_2": str,
    "column_3": np.int16,
    "column_4": np.uint8,
    # ...
    "column_n": np.float32
}

df = pd.read_csv('path/to/file', dtype=df_dtype)
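After loading with explicit dtypes you can check how much memory the DataFrame actually uses; comparing the figure with and without the dtype argument shows the saving:

# Per-column memory footprint in bytes; deep=True also counts string contents
print(df.memory_usage(deep=True))
print(df.memory_usage(deep=True).sum() / 1024 ** 2, "MB")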
The functions read_csv and read_table are almost the same, but read_table defaults to a tab delimiter, so you must pass the delimiter "," explicitly when you use read_table on a CSV file.
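In other words, for a comma-separated file the two calls below are equivalent (the filename is just an example):

import pandas as pd

df1 = pd.read_csv("large.csv")             # comma is the default separator
df2 = pd.read_table("large.csv", sep=",")  # read_table defaults to tab, so set sep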
def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    df_ac = pd.concat(chunks, ignore_index=True)
    return df_ac
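A usage sketch, assuming the file actually contains user_id and type columns (the filename is hypothetical):

df_actions = get_from_action_data("actions.csv", chunk_size=100000)
print(df_actions.shape)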
If you only need a few specific rows, you can read just those lines with linecache and build a DataFrame from them:

import linecache
from pandas import DataFrame

filename = "large.csv"
indices = [12, 24, 36]

li = []
for i in indices:
    li.append(linecache.getline(filename, i).rstrip().split(','))

dataframe = DataFrame(li)