Quick way to find the previous instance of a value in a pandas DataFrame or NumPy array?

Grouping by OrderId and shifting Price within each group is faster than looking up each row individually:

datafile['Prev_Price'] = datafile.groupby('OrderId')['Price'].shift(fill_value=0)

It returns:

   Price   Qty  OrderId  Prev_Price
0  26690  3000  1213772           0
1  26700  3000  1215673           0
2  26705  6000  1216656           0
3  26700  3000  1213772       26690
4  26710  3000  1215673       26700
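
As a runnable sketch of the grouped shift, the snippet below rebuilds the five sample rows (the DataFrame construction is added here for illustration). It also shows the fallback for pandas older than 0.24.0, where fill_value is not available: shift without it, then fill the resulting NaN values. Filling only the new column, rather than calling fillna(0) on the whole frame, is a choice made for this sketch.

```python
import pandas as pd

# Sample rows; the construction itself is illustrative.
datafile = pd.DataFrame({
    'Price': [26690, 26700, 26705, 26700, 26710],
    'Qty': [3000, 3000, 6000, 3000, 3000],
    'OrderId': [1213772, 1215673, 1216656, 1213772, 1215673],
})

# Shift Price down by one row within each OrderId group.
# The first row of each group has no predecessor and gets 0.
datafile['Prev_Price'] = datafile.groupby('OrderId')['Price'].shift(fill_value=0)

# Fallback without fill_value: shift leaves NaN, which is then replaced by 0.
# The NaN step turns the values into floats, so cast back to integers.
prev_fallback = (
    datafile.groupby('OrderId')['Price'].shift().fillna(0).astype('int64')
)

print(datafile)
print((prev_fallback == datafile['Prev_Price']).all())
```

Both variants rely on the rows already being in the order in which "previous" is meant; shift only looks one row back within each group.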

Suggestion : 2

I have a large data set (millions of rows) which I read into a pandas DataFrame called datafile.

Note: fill_value is a valid argument of pandas.DataFrame.shift since pandas 0.24.0. For older versions, don't pass the argument and replace NaN values later using datafile.fillna(0).

On a short DataFrame like the one posted, this method is actually slower. The speedup was measured on bigger DataFrames.

Each row has an Order ID number, which is non-unique. So my datafile looks something like this:

Price   Qty  OrderId
26690  3000  1213772
26700  3000  1215673
26705  6000  1216656
26700  3000  1213772
26710  3000  1215673

Now, what I want is, for each row, to get the OrderId, find the previous occurrence of that OrderId in the DataFrame, get the corresponding price, and populate it in a new column "Prev_Price". If no previous occurrence is found, keep the value as 0. So my output should look like this:

Price   Qty  OrderId  Prev_Price
26690  3000  1213772           0
26700  3000  1215673           0
26705  6000  1216656           0
26700  3000  1213772       26690
26710  3000  1215673       26700

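I tried using numpy and wrote this function:

def getPrevPrice_np(x):
    # datanp is assumed to be the DataFrame as a NumPy array:
    # column 0 is Price, column 2 is OrderId.
    try:
        # Price of the last earlier row with the same OrderId as row x
        return list(datanp[np.where(datanp[0:x, 2] == datanp[x, 2])][:, 0])[-1]
    except IndexError:
        # No earlier row has this OrderId
        return 0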
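
The function above rescans all earlier rows on every call, so applying it to each row does work that grows quadratically with the row count. As a hedged sketch of a vectorized NumPy alternative (not from the original thread; the function name prev_price_by_order and its parameters are hypothetical), a stable sort by OrderId places each row directly after its previous occurrence:

```python
import numpy as np

def prev_price_by_order(prices, order_ids, fill_value=0):
    """Return, for each row, the price of the previous row with the same id."""
    prices = np.asarray(prices)
    order_ids = np.asarray(order_ids)

    # A stable sort keeps rows with equal ids in their original order.
    order = np.argsort(order_ids, kind="stable")
    sorted_ids = order_ids[order]
    sorted_prices = prices[order]

    # In sorted order, a row's predecessor is the previous occurrence
    # of its id exactly when both rows carry the same id.
    prev_sorted = np.full(len(prices), fill_value, dtype=prices.dtype)
    if len(prices) > 1:
        same = sorted_ids[1:] == sorted_ids[:-1]
        prev_sorted[1:][same] = sorted_prices[:-1][same]

    # Scatter the results back to the original row order.
    result = np.empty_like(prev_sorted)
    result[order] = prev_sorted
    return result

prices = [26690, 26700, 26705, 26700, 26710]
order_ids = [1213772, 1215673, 1216656, 1213772, 1215673]
prev_prices = prev_price_by_order(prices, order_ids)
print(prev_prices)
```

The sort dominates the cost, so this scales as O(n log n) rather than rescanning the array once per row.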

Suggestion : 3

Last Updated : 26 Jul, 2020

Use pandas.DataFrame.tail to get the last n rows.
Syntax:

df.tail(n)

Alternatively, use pandas.DataFrame.iloc to get the last n rows. It works like list slicing.
Syntax:

df.iloc[-n:]
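
A minimal sketch comparing the two forms on a small, made-up DataFrame (the column name and values are illustrative only):

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({'Price': [26690, 26700, 26705, 26700, 26710]})
n = 2

last_by_tail = df.tail(n)    # last n rows via tail
last_by_iloc = df.iloc[-n:]  # last n rows via positional slicing

print(last_by_tail)
print(last_by_tail.equals(last_by_iloc))
```

For positive n the two calls select the same rows; tail is simply the more readable spelling.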

Suggestion : 4

Replace values where the condition is False.

Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used.

>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

>>> s.where(s > 1, 10)
0    10
1    10
2     2
3     3
4     4
dtype: int64
>>> s.mask(s > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64

>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> df
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
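
The examples above pass boolean Series and scalars; the callable forms of cond and other described earlier can be sketched as follows (the lambdas are illustrative choices, not taken from the documentation excerpt):

```python
import pandas as pd

s = pd.Series(range(5))

# cond and other are both callables evaluated on s:
# keep values greater than 1, replace the rest with ten times the value.
result = s.where(lambda x: x > 1, lambda x: x * 10)

# mask is the mirror image: replace where the condition is True.
same_via_mask = s.mask(lambda x: x <= 1, lambda x: x * 10)

print(result.tolist())
```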

Suggestion : 5

We can use df['col_name'].values to get a column as a NumPy array, then index into that array to get a cell value; for instance, df["Duration"].values[3]. DataFrame.loc[] gets a cell value by row label and column name.

# Below are quick examples
# Using loc[]. Get cell value by row label & column name
print(df.loc['r4']['Duration'])
# Positional access on the row Series goes through .iloc
print(df.loc['r4'].iloc[2])

# Using iloc[]. Get cell value by row position & column name or position
print(df.iloc[3]['Duration'])
print(df.iloc[3, 2])

# Using DataFrame.at[]
print(df.at['r4', 'Duration'])
print(df.at[df.index[3], 'Duration'])

# Using DataFrame.iat[]
print(df.iat[3, 2])

# Get a cell value from the column's NumPy array
print(df["Duration"].values[3])

# Get cell value from last row
print(df.iloc[-1, 2])
print(df.iloc[-1]['Duration'])
print(df.at[df.index[-1], 'Duration'])

Now, let’s create a DataFrame with a few rows and columns and execute some examples and validate the results. Our DataFrame contains column names Courses, Fee, Duration, Discount.

import pandas as pd
technologies = {
   'Courses': ["Spark", "PySpark", "Hadoop", "Python", "pandas"],
   'Fee': [24000, 25000, 25000, 24000, 24000],
   'Duration': ['30day', '50days', '55days', '40days', '60days'],
   'Discount': [1000, 2300, 1000, 1200, 2500]
}
index_labels = ['r1', 'r2', 'r3', 'r4', 'r5']
df = pd.DataFrame(technologies, index = index_labels)
print(df)

This yields the output below.

    Courses    Fee  Duration  Discount
r1    Spark  24000     30day      1000
r2  PySpark  25000    50days      2300
r3   Hadoop  25000    55days      1000
r4   Python  24000    40days      1200
r5   pandas  24000    60days      2500

From the above examples, df.loc['r4'] returns a pandas Series, and selecting 'Duration' from it yields the output below.

40days

If you want to get a cell value by column number or index position, use DataFrame.iloc[]. Index positions run from 0 to length-1. To refer to the last column, use -1 as the column position.

# Using iloc[]. Get cell value by row position & column name or position
print(df.iloc[3]['Duration'])
print(df.iloc[3].iloc[2])
print(df.iloc[3, 2])
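
As a self-contained sketch, the accessors discussed above can be checked against each other on the same sample DataFrame (the variable names by_loc, by_iloc and so on are introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Courses': ["Spark", "PySpark", "Hadoop", "Python", "pandas"],
        'Fee': [24000, 25000, 25000, 24000, 24000],
        'Duration': ['30day', '50days', '55days', '40days', '60days'],
        'Discount': [1000, 2300, 1000, 1200, 2500],
    },
    index=['r1', 'r2', 'r3', 'r4', 'r5'],
)

by_loc = df.loc['r4', 'Duration']     # row label, column name
by_iloc = df.iloc[3, 2]               # row position, column position
by_at = df.at['r4', 'Duration']       # scalar access by labels
by_iat = df.iat[3, 2]                 # scalar access by positions
by_values = df['Duration'].values[3]  # via the column's NumPy array
last_row = df.iloc[-1, 2]             # Duration in the last row

print(by_loc, by_iloc, by_at, by_iat, by_values, last_row)
```

at and iat are meant for single-cell access, while loc and iloc also accept slices and lists.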

Suggestion : 6

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion.

Whenever you see “array”, “NumPy array”, or “ndarray” in the text, with few exceptions they all refer to the same thing: the ndarray object.

NumPy array indexing is a rich topic, as there are many ways you may want to select a subset of your data or individual elements. One-dimensional arrays are simple; on the surface they act similarly to Python lists.

As a simple example, suppose we wished to evaluate the function sqrt(x^2 + y^2) across a regular grid of values. The np.meshgrid function takes two 1D arrays and produces two 2D matrices corresponding to all pairs of (x, y) in the two arrays.

In [13]: data1 = [6, 7.5, 8, 0, 1]

In [14]: arr1 = np.array(data1)

In [15]: arr1
Out[15]: array([6., 7.5, 8., 0., 1.])

In [27]: arr1 = np.array([1, 2, 3], dtype=np.float64)

In [28]: arr2 = np.array([1, 2, 3], dtype=np.int32)

In [29]: arr1.dtype
Out[29]: dtype('float64')

In [30]: arr2.dtype
Out[30]: dtype('int32')

In [45]: arr = np.array([[1., 2., 3.], [4., 5., 6.]])

In [46]: arr
Out[46]:
array([[1., 2., 3.],
       [4., 5., 6.]])

In [47]: arr * arr
Out[47]:
array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

In [48]: arr - arr
Out[48]:
array([[0., 0., 0.],
       [0., 0., 0.]])

In [51]: arr = np.arange(10)

In [52]: arr
Out[52]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [53]: arr[5]
Out[53]: 5

In [54]: arr[5:8]
Out[54]: array([5, 6, 7])

In [55]: arr[5:8] = 12

In [56]: arr
Out[56]: array([0, 1, 2, 3, 4, 12, 12, 12, 8, 9])

# Intermediate steps are omitted here; they modify arr, which is why index 5 now holds 64.
In [75]: arr[1:6]
Out[75]: array([1, 2, 3, 4, 64])

In [83]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

In [84]: data = np.random.randn(7, 4)

In [85]: names
Out[85]:
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'],
      dtype='|S4')

In [86]: data
Out[86]:
array([[-0.048 ,  0.5433, -0.2349,  1.2792],
       [-0.268 ,  0.5465,  0.0939, -2.0445],
       [-0.047 , -2.026 ,  0.7719,  0.3103],
       [ 2.1452,  0.8799, -0.0523,  0.0672],
       [-1.0023, -0.1698,  1.1503,  1.7289],
       [ 0.1913,  0.4544,  0.4519,  0.5535],
       [ 0.5994,  0.8174, -0.9297, -1.2564]])
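
The np.meshgrid description above has no accompanying code in this excerpt. A minimal sketch, assuming a small illustrative grid from -2 to 2 (the variable names points, xs, ys and z are chosen here):

```python
import numpy as np

# Five equally spaced points: [-2, -1, 0, 1, 2]
points = np.arange(-2, 3)

# xs varies along columns, ys along rows; together they cover every (x, y) pair.
xs, ys = np.meshgrid(points, points)

# Evaluate sqrt(x^2 + y^2) on the whole grid at once.
z = np.sqrt(xs ** 2 + ys ** 2)

print(xs.shape, ys.shape, z.shape)
print(z[2, 2])
```

The center of the grid corresponds to (0, 0), so z is zero there and grows with distance from the center.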

Suggestion : 7

To work the examples, you’ll need matplotlib installed in addition to NumPy.

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy dimensions are called axes.

To create sequences of numbers, NumPy provides the arange function which is analogous to the Python built-in range, but returns an array.

When the indexed array a is multidimensional, a single array of indices refers to the first dimension of a. The following example shows this behavior by converting an image of labels into a color image using a palette.

[[1., 0., 0.],
 [0., 1., 2.]]

>>> import numpy as np
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
>>> a.shape
(3, 5)
>>> a.ndim
2
>>> a.dtype.name
'int64'
>>> a.itemsize
8
>>> a.size
15
>>> type(a)
<class 'numpy.ndarray'>
>>> b = np.array([6, 7, 8])
>>> b
array([6, 7, 8])
>>> type(b)
<class 'numpy.ndarray'>

>>> import numpy as np
>>> a = np.array([2, 3, 4])
>>> a
array([2, 3, 4])
>>> a.dtype
dtype('int64')
>>> b = np.array([1.2, 3.5, 5.1])
>>> b.dtype
dtype('float64')

>>> a = np.array(1, 2, 3, 4)    # WRONG
Traceback (most recent call last):
  ...
TypeError: array() takes from 1 to 2 positional arguments but 4 were given
>>> a = np.array([1, 2, 3, 4])  # RIGHT

>>> b = np.array([(1.5, 2, 3), (4, 5, 6)])
>>> b
array([[1.5, 2. , 3. ],
       [4. , 5. , 6. ]])

>>> c = np.array([[1, 2], [3, 4]], dtype=complex)
>>> c
array([[1.+0.j, 2.+0.j],
       [3.+0.j, 4.+0.j]])
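
The arange function and the palette example mentioned in the text are not shown in this excerpt. A minimal sketch of both, with the specific palette colors and label image chosen here for illustration:

```python
import numpy as np

# arange works like range but returns an array.
steps = np.arange(10, 30, 5)

# Each palette row is an RGB color.
palette = np.array([
    [0, 0, 0],        # black
    [255, 0, 0],      # red
    [0, 255, 0],      # green
    [0, 0, 255],      # blue
    [255, 255, 255],  # white
])

# Each entry of the label image is an index into the palette's first dimension.
image = np.array([
    [0, 1, 2, 0],
    [0, 3, 4, 0],
])

# Indexing with the label image replaces every label with its RGB row.
color_image = palette[image]

print(steps)
print(color_image.shape)
```

The result has the shape of the index array followed by the remaining palette dimension, here (2, 4, 3).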