How is the dtype of a NumPy array calculated internally?


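For example, the result dtype is chosen from the inputs, not from the computed result. (The session below shows NumPy 1.x behavior; under the NumPy 2 promotion rules, a Python integer that does not fit the array's dtype, such as 60000 for int16, raises an OverflowError instead of upcasting.)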

>>> import numpy as np
>>> A = np.full((2, 2), 30000, 'i2')
>>> A
array([[30000, 30000],
       [30000, 30000]], dtype=int16)
>>> # 1
>>> A + 30000
array([[-5536, -5536],
       [-5536, -5536]], dtype=int16)
>>> # 2
>>> A + 60000
array([[90000, 90000],
       [90000, 90000]], dtype=int32)
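The rule that the result dtype follows the input dtypes rather than the computed values can be checked with np.result_type, which reports the promoted dtype without doing any arithmetic. This is a minimal sketch that uses only array operands, so it should not depend on how a given NumPy version treats Python scalars; the variable names are illustrative and not from the original answer.

```python
import numpy as np

a16 = np.full((2, 2), 30000, dtype=np.int16)
b16 = np.full((2, 2), 30000, dtype=np.int16)
b32 = np.full((2, 2), 30000, dtype=np.int32)

# The promoted dtype is decided from the operand dtypes alone.
same_width = np.result_type(a16, b16)   # int16
mixed_width = np.result_type(a16, b32)  # int32

# int16 + int16 stays int16, so 60000 wraps around to -5536.
wrapped = a16 + b16
# int16 + int32 is computed in int32, so the sum is exact.
exact = a16 + b32
```

Neither result dtype depends on whether the sum happens to fit; only the operand dtypes matter.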

Also, and more directly related to your question, type promotion only applies out-of-place, not in-place:

>>> # out-of-place
>>> A_new = A + 60000
>>> A_new
array([[90000, 90000],
       [90000, 90000]], dtype=int32)
>>> # in-place
>>> A += 60000
>>> A
array([[24464, 24464],
       [24464, 24464]], dtype=int16)
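The same contrast can be reproduced with an explicit int32 array as the second operand, which avoids any dependence on how Python scalars are promoted. This is only a sketch; delta is a hypothetical name introduced here.

```python
import numpy as np

A = np.full((2, 2), 30000, dtype=np.int16)
delta = np.full((2, 2), 60000, dtype=np.int32)

# Out-of-place: a new array is allocated with the promoted dtype.
A_new = A + delta

# In-place: the result is cast back into A's existing int16 buffer,
# so 90000 wraps to 90000 - 65536 = 24464.
A += delta
```

The in-place form cannot change A's dtype because the buffer that holds the data has already been allocated with 2-byte items.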

or

>>> # out-of-place
>>> A_new = np.where([[0, 0], [0, 1]], 60000, A)
>>> A_new
array([[30000, 30000],
       [30000, 60000]], dtype=int32)
>>> # in-place
>>> A[1, 1] = 60000
>>> A
array([[30000, 30000],
       [30000, -5536]], dtype=int16)
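If the larger value has to be stored, one option is to widen the array first and assign afterwards. The following is a small sketch under that assumption (A_wide is an illustrative name), not the only way to do it.

```python
import numpy as np

A = np.full((2, 2), 30000, dtype=np.int16)

# astype returns a new array; the original int16 buffer is untouched.
A_wide = A.astype(np.int32)
A_wide[1, 1] = 60000  # fits, because the new buffer holds int32 items
```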

Suggestion : 2

So, my first question is: how is this calculated? Does NumPy pick a datatype suited to the largest element and use it for all the elements? If so, doesn't that waste space, since the 2 in the second array is then stored as a 64-bit integer?

dtype : data-type, optional. The desired data-type for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence. This argument can only be used to 'upcast' the array. For downcasting, use the .astype(t) method.

Note that for NumPy to be fast it is essential that all elements of an array be of the same size. Otherwise, how would you quickly locate the 1000th element, say? Also, mixing types wouldn't save all that much space, since you would have to store the type of every single element on top of the raw data.

It should be noted that the documentation quoted above is not entirely accurate: for integer arrays, for example, the system (C) default integer is preferred over smaller integer types, as is evident from your example.

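import numpy as np

t = np.array([2, 2])
t.dtype  # the platform default integer (int64 on most 64-bit Linux/macOS builds)
t = np.array([2, 22222222222])
t.dtype  # int64, because 22222222222 does not fit in 32 bits
t = np.array([2, 2])
t[0] = 222222222222222  # in-place assignment keeps t's dtype; it never upcasts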
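To make the memory question concrete, the sketch below compares item sizes and shows how a narrower dtype can be requested explicitly. The variable names are illustrative; np.min_scalar_type is used only to show what the smallest sufficient dtype for a value would be, which is not what np.array picks by default.

```python
import numpy as np

small = np.array([2, 2])
big = np.array([2, 22222222222])   # second value needs more than 32 bits

# One dtype for the whole array: every item has the same itemsize.
big_itemsize = big.dtype.itemsize  # 8 bytes per element

# Requesting a narrower dtype explicitly saves memory when values allow it.
compact = np.array([2, 2], dtype=np.int8)
compact_bytes = compact.nbytes     # 2 elements * 1 byte

# np.min_scalar_type reports the smallest dtype able to hold a value.
smallest_for_2 = np.min_scalar_type(2)
```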

Suggestion : 3

An ndarray is a (usually fixed-size) multidimensional container of items of the same type and size. The number of dimensions and items in an array is defined by its shape, which is a tuple of N non-negative integers that specify the sizes of each dimension. The type of items in the array is specified by a separate data-type object (dtype), one of which is associated with each ndarray.

An array is considered aligned if the memory offsets for all elements and the base offset itself are multiples of self.itemsize. Understanding memory alignment leads to better performance on most hardware.

An array object represents a multidimensional, homogeneous array of fixed-size items.

Arithmetic and comparison operations on ndarrays are defined as element-wise operations, and generally yield ndarray objects as results.

>>> x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
>>> type(x)
<class 'numpy.ndarray'>
>>> x.shape
(2, 3)
>>> x.dtype
dtype('int32')
>>> # The element of x in the *second* row, *third* column, namely, 6.
>>> x[1, 2]
6
>>> y = x[:, 1]
>>> y
array([2, 5], dtype=int32)
>>> y[0] = 9  # this also changes the corresponding element in x
>>> y
array([9, 5], dtype=int32)
>>> x
array([[1, 9, 3],
       [4, 5, 6]], dtype=int32)
>>> x = np.arange(27).reshape((3, 3, 3))
>>> x
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
>>> x.sum(axis=0)
array([[27, 30, 33],
       [36, 39, 42],
       [45, 48, 51]])
>>> # for sum, axis is the first keyword, so we may omit it,
>>> # specifying only its value
>>> x.sum(0), x.sum(1), x.sum(2)
(array([[27, 30, 33],
        [36, 39, 42],
        [45, 48, 51]]),
 array([[ 9, 12, 15],
        [36, 39, 42],
        [63, 66, 69]]),
 array([[ 3, 12, 21],
        [30, 39, 48],
        [57, 66, 75]]))
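The points about fixed-size items and alignment can be inspected directly on an array. The sketch below uses the itemsize, nbytes, strides and flags attributes of ndarray; the derived names (item_bytes, offset and so on) are illustrative.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

item_bytes = x.itemsize             # 4 bytes for every int32 element
total_bytes = x.nbytes              # 6 elements * 4 bytes
row_stride, col_stride = x.strides  # bytes to step along each axis

# Fixed-size items make element lookup simple pointer arithmetic:
# the byte offset of x[1, 2] is 1 * row_stride + 2 * col_stride.
offset = 1 * row_stride + 2 * col_stride

is_aligned = x.flags.aligned
```

Because every item occupies the same number of bytes, locating any element is a constant-time calculation from the strides.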

Suggestion : 4

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array.

If casting were to fail for some reason (like a string that cannot be converted to float64), a ValueError will be raised. Here I was a bit lazy and wrote float instead of np.float64; NumPy aliases the Python types to its own equivalent data dtypes.

Let’s consider an example where we have some data in an array and an array of names with duplicates. I’m going to use here the randn function in numpy.random to generate some random normally distributed data.

When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indexes and slices, you get lower dimensional slices.

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:

In [19]: data1 = [6, 7.5, 8, 0, 1]

In [20]: arr1 = np.array(data1)

In [21]: arr1
Out[21]: array([6. , 7.5, 8. , 0. , 1. ])
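Note that the list above mixes integers with one float, and the resulting array is entirely floating point. A short sketch of that inference, with illustrative variable names:

```python
import numpy as np

ints_only = np.array([6, 7, 8, 0, 1])
mixed = np.array([6, 7.5, 8, 0, 1])

# A single float in the input is enough to make the whole array float64.
mixed_dtype = mixed.dtype
# The same promotion, asked for directly on the Python types.
promoted = np.result_type(int, float)
```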

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In [22]: data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]

In [23]: arr2 = np.array(data2)

In [24]: arr2
Out[24]:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape inferred from the data. We can confirm this by inspecting the ndim and shape attributes:

In [25]: arr2.ndim
Out[25]: 2

In [26]: arr2.shape
Out[26]: (2, 4)

In addition to np.array, there are a number of other functions for creating new arrays. As examples, zeros and ones create arrays of 0s or 1s, respectively, with a given length or shape. empty creates an array without initializing its values to any particular value. To create a higher dimensional array with these methods, pass a tuple for the shape:

In [29]: np.zeros(10)
Out[29]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [30]: np.zeros((3, 6))
Out[30]:
array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [31]: np.empty((2, 3, 2))
Out[31]:
array([[[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]]])
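These creation functions default to float64 unless a dtype is passed, and np.empty in particular makes no promise about its contents even though the output above happens to show zeros. A minimal sketch (names are illustrative):

```python
import numpy as np

z = np.zeros((3, 6))                 # float64 unless told otherwise
o = np.ones((2, 3), dtype=np.int16)  # dtype can be requested explicitly
e = np.empty((2, 3, 2))              # contents are arbitrary until assigned

e[...] = 0.0  # fill before reading; np.empty does not initialize values
```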

arange is an array-valued version of the built-in Python range function:

In [32]: np.arange(15)
Out[32]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:

In [33]: arr1 = np.array([1, 2, 3], dtype=np.float64)

In [34]: arr2 = np.array([1, 2, 3], dtype=np.int32)

In [35]: arr1.dtype
Out[35]: dtype('float64')

In [36]: arr2.dtype
Out[36]: dtype('int32')
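The metadata carried by a dtype object can be read from its attributes. The sketch below looks at name, kind and itemsize for the two arrays above; the tuple names are illustrative.

```python
import numpy as np

arr1 = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.array([1, 2, 3], dtype=np.int32)

# The dtype object carries the metadata needed to interpret the raw bytes.
f_info = (arr1.dtype.name, arr1.dtype.kind, arr1.dtype.itemsize)
i_info = (arr2.dtype.name, arr2.dtype.kind, arr2.dtype.itemsize)

# Same three values, different storage cost.
sizes = (arr1.nbytes, arr2.nbytes)
```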

You can explicitly convert or cast an array from one dtype to another using ndarray’s astype method:

In [37]: arr = np.array([1, 2, 3, 4, 5])

In [38]: arr.dtype
Out[38]: dtype('int64')

In [39]: float_arr = arr.astype(np.float64)

In [40]: float_arr.dtype
Out[40]: dtype('float64')

In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer dtype, the decimal part will be truncated:

In [41]: arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

In [42]: arr
Out[42]: array([ 3.7, -1.2, -2.6,  0.5, 12.9, 10.1])

In [43]: arr.astype(np.int32)
Out[43]: array([ 3, -1, -2,  0, 12, 10], dtype=int32)
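Truncation is not the same as rounding, which matters for negative values and for values just below the next integer. If rounding is what you want, one possible approach (an addition here, not part of the original text) is to call np.rint before casting:

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = arr.astype(np.int32)         # drops the fractional part
rounded = np.rint(arr).astype(np.int32)  # rounds to the nearest integer first
```

Note that np.rint rounds exact halves to the nearest even integer, so 0.5 becomes 0.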

You can also use another array’s dtype attribute:

In [46]: int_array = np.arange(10)

In [47]: calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)

In [48]: int_array.astype(calibers.dtype)
Out[48]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

There are shorthand type code strings you can also use to refer to a dtype:

In [49]: empty_uint32 = np.empty(8, dtype='u4')

In [50]: empty_uint32
Out[50]:
array([         0, 1075314688,          0, 1075707904,          0,
       1075838976,          0, 1072693248], dtype=uint32)
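The short codes map onto the named dtypes, which can be checked by comparison. This sketch covers three common codes; the dictionary and variable names are illustrative.

```python
import numpy as np

# Each code is a kind character followed by the item size in bytes.
codes = {"i2": np.int16, "u4": np.uint32, "f8": np.float64}
matches = {code: np.dtype(code) == expected for code, expected in codes.items()}

# np.zeros gives defined contents, unlike the np.empty call above.
zeros_uint32 = np.zeros(8, dtype="u4")
```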

Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operation between equal-size arrays is applied element-wise:

In [51]: arr = np.array([[1., 2., 3.], [4., 5., 6.]])

In [52]: arr
Out[52]:
array([[1., 2., 3.],
       [4., 5., 6.]])

In [53]: arr * arr
Out[53]:
array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

In [54]: arr - arr
Out[54]:
array([[0., 0., 0.],
       [0., 0., 0.]])

Arithmetic operations with scalars propagate the scalar argument to each element in the array:

In [55]: 1 / arr
Out[55]:
array([[1.    , 0.5   , 0.3333],
       [0.25  , 0.2   , 0.1667]])

In [56]: arr ** 0.5
Out[56]:
array([[1.    , 1.4142, 1.7321],
       [2.    , 2.2361, 2.4495]])
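To see what vectorization replaces, the sketch below computes the same element-wise product once as a single expression and once with explicit loops. The loop version is shown for comparison only; looped is an illustrative name.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

# Vectorized: one expression, no explicit Python loop.
squared = arr * arr

# The equivalent element-by-element loop, for comparison only.
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = arr[i, j] * arr[i, j]

# A scalar operand is applied to every element.
reciprocal = 1 / arr
```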

Comparisons between arrays of the same size yield boolean arrays:

In [57]: arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

In [58]: arr2
Out[58]:
array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

In [59]: arr2 > arr
Out[59]:
array([[False,  True, False],
       [ True, False,  True]])

NumPy array indexing is a rich topic, as there are many ways you may want to select a subset of your data or individual elements. One-dimensional arrays are simple; on the surface they act similarly to Python lists:

In [60]: arr = np.arange(10)

In [61]: arr
Out[61]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [62]: arr[5]
Out[62]: 5

In [63]: arr[5:8]
Out[63]: array([5, 6, 7])

In [64]: arr[5:8] = 12

In [65]: arr
Out[65]: array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

Unlike slices of Python lists, array slices are views on the original array rather than copies. To give an example of this, I first create a slice of arr:

In [66]: arr_slice = arr[5:8]

In [67]: arr_slice
Out[67]: array([12, 12, 12])

Now, when I change values in arr_slice, the mutations are reflected in the original array arr:

In [68]: arr_slice[1] = 12345

In [69]: arr
Out[69]:
array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])
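Whether two arrays share a buffer can be checked with np.shares_memory, and an independent copy can be made with the copy method. A brief sketch contrasting the two (arr_copy is an illustrative name):

```python
import numpy as np

arr = np.arange(10)
arr_slice = arr[5:8]        # a view: no data is copied
arr_copy = arr[5:8].copy()  # an independent copy

arr_slice[1] = 12345        # writes through to arr
arr_copy[0] = -1            # leaves arr untouched

shares_view = np.shares_memory(arr, arr_slice)
shares_copy = np.shares_memory(arr, arr_copy)
```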

With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:

In [72]: arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

In [73]: arr2d[2]
Out[73]: array([7, 8, 9])

Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements. So these are equivalent:

In [74]: arr2d[0][2]
Out[74]: 3

In [75]: arr2d[0, 2]
Out[75]: 3

Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax:

In [88]: arr
Out[88]: array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

In [89]: arr[1:6]
Out[89]: array([ 1,  2,  3,  4, 64])

Consider the two-dimensional array from before, arr2d. Slicing this array is a bit different:

In [90]: arr2d
Out[90]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [91]: arr2d[:2]
Out[91]:
array([[1, 2, 3],
       [4, 5, 6]])

You can pass multiple slices just like you can pass multiple indexes:

In [92]: arr2d[:2, 1:]
Out[92]:
array([[2, 3],
       [5, 6]])

Similarly, I can select the third column but only the first two rows like so:

In [94]: arr2d[:2, 2]
Out[94]: array([3, 6])

See Figure 4-2 for an illustration. Note that a colon by itself means to take the entire axis, so you can slice only higher dimensional axes by doing:

In [95]: arr2d[:, :1]
Out[95]:
array([[1],
       [4],
       [7]])
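The effect of each indexing style on the number of dimensions is easiest to see from the resulting shapes. A small sketch using the same arr2d, with illustrative result names:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

both_slices = arr2d[:2, 1:]   # two slices keep two dimensions
int_and_slice = arr2d[:2, 2]  # an integer index drops that axis
column_kept = arr2d[:, :1]    # a length-1 slice keeps the axis
```

Here both_slices has shape (2, 2), int_and_slice has shape (2,), and column_kept has shape (3, 1).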

Let’s consider an example where we have some data in an array and an array of names with duplicates. I’m going to use here the randn function in numpy.random to generate some random normally distributed data:

In [98]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

In [99]: data = np.random.randn(7, 4)

In [100]: names
Out[100]: array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

In [101]: data
Out[101]:
array([[ 0.0929,  0.2817,  0.769 ,  1.2464],
       [ 1.0072, -1.2962,  0.275 ,  0.2289],
       [ 1.3529,  0.8864, -2.0016, -0.3718],
       [ 1.669 , -0.4386, -0.5397,  0.477 ],
       [ 3.2489, -1.0212, -0.5771,  0.1241],
       [ 0.3026,  0.5238,  0.0009,  1.3438],
       [-0.7135, -0.8312, -2.3702, -1.8608]])

Suppose each name corresponds to a row in the data array and we wanted to select all the rows with corresponding name 'Bob'. Like arithmetic operations, comparisons (such as ==) with arrays are also vectorized. Thus, comparing names with the string 'Bob' yields a boolean array:

In [102]: names == 'Bob'
Out[102]: array([ True, False, False,  True, False, False, False])

This boolean array can be passed when indexing the array:

In [103]: data[names == 'Bob']
Out[103]:
array([[ 0.0929,  0.2817,  0.769 ,  1.2464],
       [ 1.669 , -0.4386, -0.5397,  0.477 ]])

To select everything but 'Bob', you can either use != or negate the condition using ~:

In [106]: names != 'Bob'
Out[106]: array([False,  True,  True, False,  True,  True,  True])

In [107]: data[~(names == 'Bob')]
Out[107]:
array([[ 1.0072, -1.2962,  0.275 ,  0.2289],
       [ 1.3529,  0.8864, -2.0016, -0.3718],
       [ 3.2489, -1.0212, -0.5771,  0.1241],
       [ 0.3026,  0.5238,  0.0009,  1.3438],
       [-0.7135, -0.8312, -2.3702, -1.8608]])

The ~ operator can be useful when you want to invert a general condition:

In [108]: cond = names == 'Bob'

In [109]: data[~cond]
Out[109]:
array([[ 1.0072, -1.2962,  0.275 ,  0.2289],
       [ 1.3529,  0.8864, -2.0016, -0.3718],
       [ 3.2489, -1.0212, -0.5771,  0.1241],
       [ 0.3026,  0.5238,  0.0009,  1.3438],
       [-0.7135, -0.8312, -2.3702, -1.8608]])
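Because the data above is random, the exact numbers differ on every run. The sketch below repeats the boolean selection on a deterministic stand-in array (an assumption made here for reproducibility) and also shows that, unlike slicing, boolean indexing returns a copy.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
# Deterministic stand-in for the random data: row i holds 4*i .. 4*i + 3.
data = np.arange(28).reshape(7, 4)

cond = names == 'Bob'     # boolean array, one entry per row
bob_rows = data[cond]     # rows 0 and 3
other_rows = data[~cond]  # the remaining five rows

# Boolean indexing returns a copy, so edits do not reach data.
bob_rows[0, 0] = -1
```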