HDF5 file grows in size after overwriting the pandas DataFrame


I'm trying to overwrite a pandas DataFrame stored in an HDF5 file. Each time I do this, the file grows even though the stored frame's content is the same. If I use mode='w', I lose all the other records. Is this a bug, or am I missing something?

import pandas

df = pandas.read_csv('1.csv')
for i in range(100):
    # Reopen the store and overwrite the same key on every iteration
    store = pandas.HDFStore('tmp.h5')
    store.put('TMP', df)
    store.close()
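
One likely explanation, going by the pandas and PyTables documentation rather than anything stated in the question: HDF5 does not automatically reclaim the space of a node that is removed or overwritten, so repeated writes to the same key tend to grow the file. A minimal sketch of one workaround is to copy every key into a fresh file and swap it in. The helper name repack_hdf and the ".repacked" suffix are made up for illustration.

```python
import os

import pandas as pd


def repack_hdf(path, tmp_path=None):
    """Rewrite every key of `path` into a fresh file, then swap it in.

    Hypothetical helper, not part of pandas. Note that put() writes the
    default 'fixed' format, so nodes stored with format='table' would be
    rewritten as fixed.
    """
    tmp_path = tmp_path or path + ".repacked"
    with pd.HDFStore(path, mode="r") as src, pd.HDFStore(tmp_path, mode="w") as dst:
        for key in src.keys():
            dst.put(key, src.get(key))
    # Replace the bloated file with the compact copy.
    os.replace(tmp_path, path)
```

Calling repack_hdf('tmp.h5') after the loop above should leave a file that holds the same keys without the accumulated dead space. PyTables also ships a ptrepack command-line tool intended for the same job.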

Suggestion : 2

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects, which can be accessed as a group or as individual objects.

DataFrame.to_hdf writes the contained data to an HDF5 file using HDFStore. With format='table', the data is written as a PyTables Table structure, which may perform worse but allows more flexible operations such as searching or selecting subsets of the data. To add another DataFrame or Series to an existing HDF file, use append mode and a different key.

>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},
...                   index=['a', 'b', 'c'])
>>> df.to_hdf('data.h5', key='df', mode='w')
>>> s = pd.Series([1, 2, 3, 4])
>>> s.to_hdf('data.h5', key='s')
>>> pd.read_hdf('data.h5', 'df')
   A  B
a  1  4
b  2  5
c  3  6
>>> pd.read_hdf('data.h5', 's')
0    1
1    2
2    3
3    4
dtype: int64
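
To illustrate the "more flexible operations" that the table format allows, here is a minimal sketch that writes a frame with format="table", appends a row to the same key, and reads back only a subset. The function name write_and_query and the extra row are made up for illustration; note that appending rows to one key (append=True) is a different operation from adding a second object under a different key.

```python
import pandas as pd


def write_and_query(path):
    """Write a frame in table format, append a row, and query a subset."""
    df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])
    # format="table" writes a PyTables Table; data_columns=True makes the
    # columns usable in `where` expressions.
    df.to_hdf(path, key="df", mode="w", format="table", data_columns=True)

    # Appending rows to an existing key is only supported for the table format.
    extra = pd.DataFrame({"A": [7], "B": [8]}, index=["d"])
    extra.to_hdf(path, key="df", mode="a", format="table", append=True,
                 data_columns=True)

    # Read back only the rows that satisfy the condition.
    subset = pd.read_hdf(path, key="df", where="A > 1")
    everything = pd.read_hdf(path, key="df")
    return subset, everything
```

With the default fixed format, neither the append nor the where query would be available, which is the trade-off the description above hints at.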

Suggestion : 3

Related questions:

  • Storing Pandas DataFrame to HDF5 with both axes MultiIndexed
  • Joining MultiIndexed Pandas Dataframes for Another MultiIndexed Dataframe
  • MultiIndexed Pandas Dataframe to HDF5 - MemoryError
  • Merge Pandas Multiindexed DataFrame with Singleindexed Pandas DataFrame

Here's what I would try (can't test this at the moment):

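import pandas

# Obtain the 400k x 5000 DF
data = (
    pandas.concat([df1, df2], axis=1, join='inner')
    .rename_axis(['columns'], axis=1)
    .stack()  # inferred step: long format, so reset_index() below yields column 0
)
# Keep only the non-zero cells
final = data[data != 0].reset_index().rename(columns={0: 'values'})
store = pandas.HDFStore("store.h5")
store['ComboDF'] = final
store.close()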

Then when you read it back to pandas, you can set the index, unstack the dataframe, and then fillna(0) to get all of your zeroes back.

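data = (
    load_hdf5_data("store.h5")['ComboDF']  # or however you do this
    .set_index(index_cols)  # index_cols (placeholder): the original index levels plus 'columns'
    .unstack(level='columns')
    .fillna(0)
)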
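
The round trip described above can be checked on a toy frame without touching HDF5 at all. This is only a sketch: the frame wide, its labels, and its values are invented, and the reindex step is an addition to cover a case the answer does not mention, namely that rows or columns consisting entirely of zeros disappear from the long form and have to be restored explicitly.

```python
import pandas as pd

# Small stand-in for the large, mostly-zero frame (values are made up).
wide = pd.DataFrame(
    [[0, 5, 0], [7, 0, 0]],
    index=pd.Index(["r1", "r2"], name="row"),
    columns=pd.Index(["c1", "c2", "c3"], name="columns"),
)

# Long form: one row per non-zero cell.
stacked = wide.stack()
long_form = stacked[stacked != 0].rename("values").reset_index()

# Back to wide form: cells that were dropped come back as zeros.
restored = (
    long_form.set_index(["row", "columns"])["values"]
    .unstack(level="columns")
    # Column c3 was all zeros, so it is absent after unstack; bring it back.
    .reindex(index=wide.index, columns=wide.columns)
    .fillna(0)
    .astype(wide.dtypes.iloc[0])
)
```

Only long_form would need to be written to the store, which is where the memory saving comes from when most cells are zero.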

You could try the following:

# prefix is a string of your choice
df.columns = [prefix + str(col) for col in df.columns]

Suggestion : 4

Groups are the container mechanism by which HDF5 files are organized. From a Python perspective, they operate somewhat like dictionaries: the “keys” are the names of group members, and the “values” are the members themselves (Group and Dataset objects).

What happens when you assign an object to a name in a group depends on the type of object being assigned. For NumPy arrays or other data, the default is to create an HDF5 dataset. Assigning an existing Group or Dataset creates a new link to it. Note that this is not a copy of the dataset! Like hard links in a UNIX file system, objects in an HDF5 file can be stored in multiple groups.

The h5py.SoftLink class exists to allow creation of soft links in the file. Soft links only serve as containers for a path; they are not related in any way to a particular file.

>>> f = h5py.File('foo.hdf5', 'w')
>>> f.name
'/'
>>> grp = f.create_group("bar")
>>> grp.name
'/bar'
>>> subgrp = grp.create_group("baz")
>>> subgrp.name
'/bar/baz'
>>> grp2 = f.create_group("/some/long/path")
>>> grp2.name
'/some/long/path'
>>> grp3 = f['/some/long']
>>> grp3.name
'/some/long'
>>> myds = subgrp["MyDS"]
>>> missing = subgrp["missing"]
KeyError: "Name doesn't exist (Symbol table: Object not found)"
>>> del subgrp["MyDS"]
>>> grp["name"] = 42
>>> out = grp["name"]
>>> out
<HDF5 dataset "name": shape (), type "<i8">
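
The session above does not show the hard-link and soft-link behaviour that the text describes, so here is a minimal sketch of both. The file name links_demo.hdf5, the link names alias and soft, and the array contents are made up; the file uses the in-memory core driver so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory file: core driver with no backing store.
with h5py.File("links_demo.hdf5", "w", driver="core", backing_store=False) as f:
    subgrp = f.create_group("bar").create_group("baz")

    # Assigning a NumPy array creates a dataset.
    subgrp["MyDS"] = np.arange(5)
    myds = subgrp["MyDS"]

    # Hard link: the same dataset becomes reachable under a second name.
    f["alias"] = myds
    same_object = f["alias"] == myds

    # Soft link: stores only a path, which is resolved on access.
    f["soft"] = h5py.SoftLink("/bar/baz/MyDS")
    soft_values = f["soft"][...].tolist()

    # Deleting one name removes that link only; the data survives via the alias.
    del subgrp["MyDS"]
    still_there = f["alias"][...].tolist()

    # The soft link now points at a path that no longer exists.
    try:
        f["soft"]
        soft_is_dangling = False
    except KeyError:
        soft_is_dangling = True
```

The contrast between still_there and soft_is_dangling is the point: a hard link keeps the object alive, while a soft link is just a stored path.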

Suggestion : 5

load_args and save_args configure how a third-party library (e.g. pandas for CSVDataSet) loads data from or saves data to a file.

fs_args is used to configure the interaction with a filesystem. All the top-level parameters of fs_args (except open_args_load and open_args_save) are passed to the underlying filesystem class. The open_args_load and open_args_save parameters are passed to the filesystem's open method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively.

You might come across a situation where you would like to read the same file using two different dataset implementations. Use transcoding when you want to load and save the same file, via its specified filepath, using different DataSet implementations.

test_dataset:
  type: ...
  fs_args:
    project: test_project

test_dataset:
  type: ...
  fs_args:
    open_args_load:
      mode: "rb"
      encoding: "utf-8"

test_dataset:
  type: pandas.CSVDataSet
  save_args:
    index: False
    encoding: "utf-8"

bikes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/bikes.csv

cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .

boats:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/boats.csv.gz
  load_args:
    sep: ','
    compression: 'gzip'
  fs_args:
    open_args_load:
      mode: 'rb'
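
To make the role of load_args and save_args concrete, here is a rough sketch in plain pandas. It assumes, as the description above suggests, that a CSV dataset forwards load_args to pandas.read_csv and save_args to DataFrame.to_csv; the sample data and the parse_dates argument are invented for the illustration, and in-memory buffers stand in for real files.

```python
import io

import pandas as pd

# Same keys as in the configuration above, written as Python dicts.
load_args = {"sep": ","}
save_args = {"index": False, "date_format": "%Y-%m-%d %H:%M", "decimal": "."}

# Made-up CSV content standing in for a file on disk.
raw = io.StringIO("when,price\n2021-01-02 03:04:05,1.5\n")

# Loading: the load arguments become keyword arguments of read_csv.
df = pd.read_csv(raw, parse_dates=["when"], **load_args)

# Saving: the save arguments become keyword arguments of to_csv.
out = io.StringIO()
df.to_csv(out, **save_args)
csv_text = out.getvalue()
```

Here date_format trims the seconds from the timestamp on save and index=False drops the row index, which is the kind of effect those configuration entries are meant to control.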