Please note that almost any NumPy operation is broadcastable like this, so in practice there is probably no need for an explicit cartesian product:
# shared dimensions:
sh = a.shape[1:]
aba = (a[:, None, None] + b[None, :, None] - a[None, None, :]).reshape(-1, *sh)
aba
# array([[ 0.  ,  0.03],
#        [-1.  ,  0.16],
#        [ 1.  , -0.1 ],
#        [ 0.  ,  0.03]])
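For reference, the snippet above assumes a and b are the small 2-D float arrays used later in the question; a minimal setup (my addition, not part of the original answer) would be:

import numpy as np

a = np.array([[0., -0.02],
              [1., -0.15]])
b = np.array([[0., 0.03]])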
Broadcasting: when, for example, two arrays are added, their shapes do not have to be identical, only compatible, thanks to broadcasting. Broadcasting is in a sense a generalization of adding scalars to arrays:
    [[2],              [[7],     [[2],
7 +  [3],   equiv to    [7],  +   [3],
     [4]]               [7]]      [4]]
Broadcasting:
               [[4],               [[1, 2, 3],      [[4, 4, 4],
[[1, 2, 3]] +   [5],    equiv to    [1, 2, 3],  +    [5, 5, 5],
                [6]]                [1, 2, 3]]       [6, 6, 6]]
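These two equivalences can be checked directly in NumPy (a minimal sketch, using the arrays from the diagrams above):

import numpy as np

col = np.array([[2], [3], [4]])
print(7 + col)       # same result as np.full((3, 1), 7) + col

col2 = np.array([[4], [5], [6]])
row = np.array([[1, 2, 3]])
print(row + col2)    # same result as np.tile(row, (3, 1)) + np.tile(col2, (1, 3))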
An alternate solution is to create a cartesian product of indices (which is easier, as solutions for cartesian products of 1D arrays exist):
idx = cartesian_product(np.arange(len(a)), np.arange(len(b)) + len(a), np.arange(len(a)))
And then use fancy indexing to create the output array:
x = np.concatenate((a, b))
result = x[idx.ravel(), :].reshape(*idx.shape, -1)
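cartesian_product is not defined in the snippet above; one common recipe for the cartesian product of 1-D integer arrays (an assumed stand-in, there are several such recipes) would make the whole pipeline look like this:

import numpy as np

def cartesian_product(*arrays):
    # build an open mesh of indices and stack it into rows of index combinations
    grid = np.meshgrid(*arrays, indexing='ij')
    return np.stack(grid, axis=-1).reshape(-1, len(arrays))

a = np.array([[0., -0.02], [1., -0.15]])
b = np.array([[0., 0.03]])

idx = cartesian_product(np.arange(len(a)), np.arange(len(b)) + len(a), np.arange(len(a)))
x = np.concatenate((a, b))
result = x[idx.ravel(), :].reshape(*idx.shape, -1)   # shape (4, 3, 2)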
Size of the result data
size_in_GB = A.shape[0]**2 * A.shape[1] * B.shape[0] * size_of_datatype / 1e9
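For example, with the toy shapes used in the writing example below (a of shape (500, 50), b of shape (40, 50)) and 8-byte float64 values, the result already occupies:

size_in_GB = 500**2 * 50 * 40 * 8 / 1e9   # = 4.0 GB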
The code calculating the results is from @PaulPanzer.
import numpy as np
import tables  # imported to register the blosc filter
import h5py as h5
import h5py_cache as h5c

a = np.arange(500 * 50).reshape(500, 50)
b = np.arange(40 * 50).reshape(40, 50)

# The blosc compression_opts tuple isn't well documented,
# have a look at https://github.com/Blosc/hdf5-blosc
# compression_opts[4]: compression level 0..9
# compression_opts[5]: shuffle
# compression_opts[6]: compressor (I guess that's lz4)
compression_opts = (0, 0, 0, 0, 9, 1, 1)

File_Name_HDF5 = 'Test.h5'

f = h5c.File(File_Name_HDF5, 'w', chunk_cache_mem_size=1024**2 * 300)
dset = f.create_dataset('Data',
                        shape=(a.shape[0]**2 * b.shape[0], a.shape[1]),
                        dtype='d',
                        chunks=(a.shape[0] * b.shape[0], 1),
                        compression=32001,
                        compression_opts=compression_opts,
                        shuffle=False)

# Write the data
sh = a.shape[1:]
for i in range(a.shape[0]):
    aba = (a[i] + b[:, None] - a).reshape(-1, *sh)
    dset[i * a.shape[0] * b.shape[0]:(i + 1) * a.shape[0] * b.shape[0]] = aba

f.close()
Reading the data
File_Name_HDF5 = 'Test.h5'

f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2 * 300)
dset = f['Data']
chunks_size = 500
# Iterate over the first dimension
for i in range(0, dset.shape[0], chunks_size):
    data = dset[i:i + chunks_size, :]  # avoid excessive calls to the hdf5 library
    # Do something with the data
f.close()

f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2 * 300)
dset = f['Data']
# Iterate over the second dimension
for i in range(dset.shape[1]):
    # fancy indexing e.g. [:, i] will be much slower
    # use np.expand_dims or in this case np.squeeze after the read operation from the dset
    # if you want to have the same result as [:, i] (1-dim array)
    data = dset[:, i:i + 1]
    # Do something with the data
f.close()
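A small sanity check (my addition, assuming the arrays a, b and the 'Test.h5' file from the writing example above are still in scope): the first block of rows on disk should match the in-memory broadcast for i = 0.

f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2 * 300)
dset = f['Data']
block = dset[0:a.shape[0] * b.shape[0], :]                    # rows written for i = 0
expected = (a[0] + b[:, None] - a).reshape(-1, *a.shape[1:])
print(np.allclose(block, expected))                           # should print True
f.close()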
This is a good starting point for avoiding a MemoryError. As a short-term fix you can train your model using a subset of the data available to you and discard the rest; doing so is a shame, however.

Somewhat related topics and suggestions involve parsing smaller pieces of the image, which is fine, but reassembling the entire image in numpy seems impossible, because an array of that size, even with a very small data type, can never exist. Can you prove your 20GB claim with a code example?

The function works as expected as long as my array fits in memory. But my use case requires me to work with huge data, and I get a MemoryError at the np.empty() line since it is unable to allocate the required memory. I am working with circa 20GB of data at the moment, and this might increase in the future.

I will be using these arrays for further processing, and ideally I would not like to store them in files at this stage. So the memmap / h5py format may not help, although I am not sure of this.
import numpy as np

def cartesian_2d(arrays, out=None):
    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.shape[0] for x in arrays])
    if out is None:
        out = np.empty([n, len(arrays), arrays[0].shape[1]], dtype=dtype)

    m = n // arrays[0].shape[0]
    out[:, 0] = np.repeat(arrays[0], m, axis=0)
    if arrays[1:]:
        cartesian_2d(arrays[1:], out=out[0:m, 1:, :])
        for j in range(1, arrays[0].shape[0]):
            out[j * m:(j + 1) * m, 1:] = out[0:m, 1:]
    return out


a = [[0, -0.02], [1, -0.15]]
b = [[0, 0.03]]

result = cartesian_2d([a, b, a])
# array([[[ 0.  , -0.02],
#         [ 0.  ,  0.03],
#         [ 0.  , -0.02]],
#
#        [[ 0.  , -0.02],
#         [ 0.  ,  0.03],
#         [ 1.  , -0.15]],
#
#        [[ 1.  , -0.15],
#         [ 0.  ,  0.03],
#         [ 0.  , -0.02]],
#
#        [[ 1.  , -0.15],
#         [ 0.  ,  0.03],
#         [ 1.  , -0.15]]])
result[:, 0, :] + result[:, 1, :] - result[:, 2, :]
# array([[ 0.  ,  0.03],
#        [-1.  ,  0.16],
#        [ 1.  , -0.1 ],
#        [ 0.  ,  0.03]])
The product method in itertools provides an implementation of the Cartesian product that, when run with many arguments, quickly gives out-of-memory errors. The current implementation creates a lot of unnecessary lists in this situation. A more appropriate implementation would use dynamic programming to avoid these out-of-memory issues.
The trouble is that itertools.product accepts iterators, and there is no guaranteed way of "restarting" an arbitrary iterator in Python. Consider:

>>> a = iter([1, 2, 3])
>>> b = iter([4, 5, 6])
>>> next(a)
1
>>> next(b)
4
>>> from itertools import product
>>> list(product(a, b))
[(2, 5), (2, 6), (3, 5), (3, 6)]

Since there's no way to get back to items you've already consumed, the current approach is to consume all of the iterators to begin with and store their items in arrays, then lazily produce tuples of the items at the right indices of those arrays. Perhaps one could consume lazily from the iterators, say, only filling up the pools as they're needed and not storing the contents of the first iterator, but this would mean sometimes producing a product iterator that was doomed to cause a memory error eventually. If you really need this behavior you could do this in Python:

def lazy_product(*iterables):
    if not iterables:
        yield ()
        return
    it0 = iterables[0]
    for x in it0:
        print(f"{x=}")
        for rest in lazy_product(*iterables[1:]):
            print(f"{rest=}")
            yield (x,) + rest

The above could surely be optimized, as maybe you're suggesting, but this would be a backward-incompatible change for itertools.product.
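A rough usage sketch of lazy_product (my addition, not from the thread; the print calls above are just tracing and can be dropped): unlike itertools.product, it does not materialize its first argument into a pool, so the first tuple is available immediately. The later arguments, however, must be re-iterable sequences, because they are walked once per element of the first argument.

gen = lazy_product(range(10**6), [0, 1])   # range objects / lists, not one-shot iterators
print(next(gen))                           # (0, 0)
print(next(gen))                           # (0, 1)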
Possibly related: https://bugs.python.org/issue10109
Henry, I'm not clear at all about what you're saying. Please give at least one specific, concrete example of the behavior you're objecting to, and specify the behavior you want to see instead.
Hey, Tim, I just wanted a note of clarification. I was working on an approximation algorithm for a network science tool that is being released soon. Initially, I relied on itertools.product(), but when I moved to testing on non-trivial graphs, I immediately had out-of-memory errors. This was even on the nodes with large memory on the computing cluster that I was using. The issue is that a lot of extra lists are added as the number of iterables passed to product grows, although the actual number of elements in the iterables is relatively small.

Here is the workaround that I used to handle many arguments:

import numpy as np
from typing import Iterable

def product(*args) -> Iterable:
    '''
    Takes in several iterables and returns a generator that yields
    the cartesian product of the elements in the iterables
    :param args: sets which you would like the cartesian product of
    :return: generator
    '''
    if len(args) == 1 and type(args) == list:
        return args
    numArgs = len(args)
    indices = np.zeros(numArgs, dtype=np.uint)
    checkLen = len(args[0])
    while indices[0] < checkLen:
        curr_term = [args[0][indices[0]]]
        for i in range(1, numArgs):
            curr_term.append(args[i][indices[i]])
        indices[-1] += np.uint(1)
        for i in range(1, numArgs):
            updated = False
            if indices[-i] >= len(args[-i]):
                indices[-i] = np.uint(0)
                indices[-i - 1] += np.uint(1)
                updated = True
            if not updated:
                break
        yield curr_term
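A quick check of how this workaround behaves (my own sketch, assuming the product() definition above): note that it yields lists rather than tuples and requires indexable sequences, not arbitrary iterators.

print(list(product([1, 2], [3, 4])))
# [[1, 3], [1, 4], [2, 3], [2, 4]]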
Henry, I have to ask again: please give at least one specific, concrete example of the behavior you're objecting to. English isn't helping, and I still have no idea what your complaint is. What I'm looking for:

for i in itertools.product(???):
    pass

where you replace the ??? with executable code (preferably not using numpy or any other extension package) that provokes the MemoryError you're talking about.

For example, here I'm constructing a case with a million arguments. There's no problem at all:

>>> import itertools
>>> args = [range(100) for i in range(1_000_000)]
>>> for i in itertools.product(*args):
...     print(len(i))
[and it prints 1000000 over & over until I get bored and kill it]

Note if it _did_ provoke a problem, we probably wouldn't care - there are no plausible realistic use cases for passing a million arguments to product().

You haven't given us a way to reproduce your complaint, or even a clue as to the number of arguments you're talking about. The above code was my best guess as to what you _might_ be talking about. But since there's no MemoryError in sight, apparently not.

I'm suspecting this may be an XY problem [1], and especially because you posted "a solution" instead of an actual problem ;-)

[1] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
> when I moved to testing on non-trivial graphs, I immediately had
> Out of Memory Errors.

I'm going to hazard a guess that the input to product() was a graph traversal iterator that got trapped in an undetected cycle. Feeding an infinite iterator into product() would eat all available memory, just as it would with list().
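To make that concrete (my own illustration, not from the thread): itertools.product() turns every input into a finite pool before yielding anything, so an unbounded iterator is consumed without end.

import itertools

# WARNING: do not uncomment and run - product() tries to build a pool from
# count() first, so it never yields and memory grows without bound.
# list(itertools.product(itertools.count(), [1, 2]))

# list() on the same iterator fails the same way:
# list(itertools.count())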
The range argument of numpy.histogramdd can now contain None values to indicate that the range for the corresponding axis should be computed from the data. Previously, this could not be specified on a per-axis basis.

The type comparison functions have been made consistent with the new sort order of nans. Searchsorted now works with sorted arrays containing nan values.

This release has seen a lot of refactoring and features many bug fixes, improved code organization, and better cross-platform compatibility. Not all of these improvements will be visible to users, but they should help make maintenance easier going forward.
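A minimal sketch of the per-axis range behavior described above (the sample data and bin counts are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
sample = rng.random((100, 2))

# Fix the range for the first axis; None lets the range of the second
# axis be computed from the data.
hist, edges = np.histogramdd(sample, bins=(4, 5), range=[(0.0, 1.0), None])
print(hist.shape)   # (4, 5)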
# Previously (incorrect negatively signed zero):
>>> np.zeros(10)//1
array([-0., -0., -0., -0., -0., -0., -0., -0., -0., -0.])

# Now (correct positively signed zero):
>>> np.zeros(10)//1
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
npy_intp const fixed_dims[] = {1, 2, 3};
// no longer complains that the const-qualifier is discarded
npy_intp size = PyArray_MultiplyList(fixed_dims, 3);
NPY_BLAS_ORDER=openblas python setup.py build
NPY_LAPACK_ORDER=openblas python setup.py build
def foo(): pass