pandas: find first occurrences of elements that appear in a certain column


Use idxmax on df.A.ne('a'), which returns the index label of the first True in the boolean mask:

df.A.ne('a').idxmax()

3
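This works because ne('a') produces a boolean mask, and idxmax returns the label of the first maximal (i.e. first True) value. A minimal self-contained sketch, assuming the example frame from the question:

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

mask = df.A.ne('a')        # [False, False, False, True, True]
first_idx = mask.idxmax()  # index label of the first True
print(first_idx)           # 3
```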

Or the NumPy equivalent:

(df.A.values != 'a').argmax()

3

However, if A has already been sorted, then we can use searchsorted

df.A.searchsorted('a', side = 'right')

array([3])
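Note that searchsorted is a binary search, so it is fast but only valid when the column is sorted; a quick sketch on the NumPy side:

```python
import numpy as np

a = np.array(['a', 'a', 'a', 'b', 'b'])  # already sorted
# position just past the last 'a', i.e. the index of the first non-'a'
pos = a.searchsorted('a', side='right')
print(pos)  # 3
```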

I found that there is a first_valid_index function for pandas DataFrames that will do the job; it can be used as follows:

df[df.A != 'a'].first_valid_index()

3

However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:

df.loc[df.A != 'a', 'A'].index[0]

Below I compare the total time (in seconds) of repeating each calculation 100 times, for these two options and all of the approaches above:

                           total_time_sec   ratio wrt fastest algo
searchsorted numpy:                0.0007                     1.00
argmax numpy:                      0.0009                     1.29
for loop:                          0.0045                     6.43
searchsorted pandas:               0.0075                    10.71
idxmax pandas:                     0.0267                    38.14
index[0]:                          0.0295                    42.14
first_valid_index pandas:          0.1181                   168.71

The code to produce these results is below:

import timeit

import numpy as np
import pandas as pd

# code snippet to be executed only once
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({
    "A": ['a', 'a', 'a', 'b', 'b'],
    "B": [1] * 5
})
'''

# code snippets whose execution time is to be measured
mycode_set = ["df[df.A != 'a'].first_valid_index()"]
message = ["first_valid_index pandas: "]

mycode_set.append("df.loc[df.A != 'a', 'A'].index[0]")
message.append("index[0]: ")

mycode_set.append("df.A.ne('a').idxmax()")
message.append("idxmax pandas: ")

mycode_set.append("(df.A.values != 'a').argmax()")
message.append("argmax numpy: ")

mycode_set.append("df.A.searchsorted('a', side='right')")
message.append("searchsorted pandas: ")

mycode_set.append("df.A.values.searchsorted('a', side='right')")
message.append("searchsorted numpy: ")

mycode_set.append('''
for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        ans = index
        break
''')
message.append("for loop: ")

total_time_in_sec = []
for i in range(len(mycode_set)):
    mycode = mycode_set[i]
    total_time_in_sec.append(
        np.round(timeit.timeit(setup=mysetup, stmt=mycode, number=100), 4))

output = pd.DataFrame(total_time_in_sec, index=message,
                      columns=['total_time_sec'])
output["ratio wrt fastest algo"] = \
    np.round(output.total_time_sec / output["total_time_sec"].min(), 2)

output = output.sort_values(by="total_time_sec")
display(output)

For the larger dataframe:

mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({
    "A": lt,
    "B": [1] * n
})
'''

Use pandas groupby() to group by a column (or a list of columns), then first() to get the first value in each group.

import pandas as pd

df = pd.DataFrame({
   "A": ['a', 'a', 'a', 'b', 'b'],
   "B": [1] * 5
})

# Group df by column and get the first value in each group
grouped_df = df.groupby("A").first()

# Reset the index to match the original format
first_values = grouped_df.reset_index()

print(first_values)

   A  B
0  a  1
1  b  1

Let's say we have:

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

And we want to find the first item different than a and c, we do:

n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
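Self-contained, the above looks like this (argmax returns the position of the first True in the combined mask):

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
# first position that is neither 'a' nor 'c'
n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
print(n)  # 4 -> s[4] == 'b'
```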

Times:

import numpy as np
import pandas as pd
from datetime import datetime

ITERS = 1000

def pandas_multi_condition(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = s[(s != 'a') & (s != 'c')].index[0]
    print(n)
    print(datetime.now() - ts)

def numpy_bitwise_and(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
    print(n)
    print(datetime.now() - ts)

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

print('pandas_multi_condition():')
pandas_multi_condition(s)
print()
print('numpy_bitwise_and():')
numpy_bitwise_and(s)

If you just want to find the first instance without scanning the entire dataframe, you can short-circuit with a for loop.

df = pd.DataFrame({
   "A": ['a', 'a', 'a', 'b', 'b'],
   "B": [1] * 5
})
for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        print(index)
        break

You can also iterate over dataframe rows (it is slow) and implement your own logic to get the value you want:

def getMaxIndex(df, col):
    # avoid shadowing the built-in max()
    max_val = -999999
    rtn_index = 0
    for index, row in df.iterrows():
        if row[col] > max_val:
            max_val = row[col]
            rtn_index = index
    return rtn_index
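For a numeric column, pandas already provides this as Series.idxmax, which avoids iterrows entirely; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"B": [1, 5, 3]})
# index label of the largest value in column B
print(df["B"].idxmax())  # 1
```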

Suggestion : 2

Last Updated : 23 Dec, 2021

To count how many times a specific value occurs in a column, index the result of value_counts() with that value.

Syntax:

data['column_name'].value_counts()[value]

To bucket a numeric column into intervals and count each bucket, pass the bins parameter.

Syntax:

data['column_name'].value_counts(bins=n)
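A small self-contained sketch (toy data and a hypothetical column name, not from the original article) of the specific-value lookup:

```python
import pandas as pd

data = pd.DataFrame({"grade": ["A", "B", "A", "C", "A", "B"]})
# count occurrences of the single value "A"
print(data["grade"].value_counts()["A"])  # 3
```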

Suggestion : 3

Here’s how to count occurrences (unique values) and missing values in a column in a Pandas dataframe with value_counts().

import pandas as pd

# URL to .csv file
data_url = 'https://vincentarelbundock.github.io/Rdatasets/csv/carData/Arrests.csv'
# Reading the data
df = pd.read_csv(data_url, index_col=0)

Here’s how to count occurrences (unique values) in a column in Pandas dataframe:

# pandas count distinct values in column
df['sex'].value_counts()

Now, as with many Pandas methods, value_counts() has a couple of parameters that we may find useful at times. For example, if we want to reorder the output so that the counted values (male and female, in this case) are shown in ascending order of count, we can use the ascending parameter and set it to True:

# pandas count unique values ascending:
df['sex'].value_counts(ascending=True)

Here’s a code example to get the number of unique values as well as how many missing values there are (df_na here is a version of the dataframe that contains missing values):

# Counting occurrences as well as missing values:
df_na['sex'].value_counts(dropna=False)

Now that we have counted the unique values in a column we will continue by using another parameter of the value_counts() method: normalize. Here’s how we get the relative frequencies of men and women in the dataset:

df['sex'].value_counts(normalize=True)
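A minimal sketch with toy data (instead of the Arrests dataset) showing the relative frequencies that normalize=True produces:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["Male", "Male", "Female", "Male"]})
freq = df["sex"].value_counts(normalize=True)
print(freq["Male"], freq["Female"])  # 0.75 0.25
```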

Suggestion : 4


This is a gnarly one. Compute each rule independently and combine the results. Code below:

# meets "greater than or equal to 3" rule
df['m'] = df['payment'].str.extract(r'(\d+)').astype(float).ge(3)  # temp column
a = df[df['m']]

# meets BBB, CCC rule
b = df[df['code1'].isin(["BBB", "CCC"]) |
       df['code2'].isin(["BBB", "CCC"]) |
       df['code3'].isin(["BBB", "CCC"])].drop_duplicates(
           subset=['code1', 'code2', 'code3'], keep='first')

# meets unique row rule
c = df.drop_duplicates(subset=['ID'], keep='first')

# combine a, b, c and drop duplicates
df1 = pd.concat([a, b, c], axis=0).drop_duplicates(
    subset=['code1', 'code2', 'code3'], keep='first').drop(columns='m')

print(df1)

   ID active_date   datestamp code1 code2 code3 payment
5   3  18/01/2020  15/05/2020   CCC   BBB   AAA       4
8   4  20/01/2020  25/04/2020   AAA     .     .       3
2   1  01/01/2020  12/06/2020   BBB   AAA     .       2
4   2  10/01/2020           .     .     .     .       .
10  5  24/01/2020  06/05/2020   DDD     .     .       1

For convenience, let's replace . with NaN and convert the payment column from object to float:

df.replace(to_replace='.', value=np.nan, inplace=True)
df.payment = df.payment.astype(float)

Now assign column which stores True/False depending on the conditions specified.

cond = (df[['code1', 'code2', 'code3']].isin(['BBB', 'CCC']).any(axis=1) |
        (df.payment >= 3))
df['cond'] = cond

This checks the first two conditions. Now let's flip cond to True for every row of the IDs where no condition is satisfied (so any row of that group may be chosen):

df['cond'] = cond | df.groupby('ID')['cond'].transform(lambda x: ~np.any(x))
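The transform trick in isolation, on hypothetical toy data: for each ID group with no True in cond, every row of that group is flipped to True, while groups that already have a satisfying row are left alone:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ID 1 has a satisfying row, ID 2 does not
df = pd.DataFrame({"ID": [1, 1, 2, 2],
                   "cond": [False, True, False, False]})
# ~np.any(x) is True only for groups with no True; transform broadcasts
# that per-group scalar back onto each row of the group
df["cond"] = df["cond"] | df.groupby("ID")["cond"].transform(lambda x: ~np.any(x))
print(df["cond"].tolist())  # [False, True, True, True]
```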

Output:

>>> df
   ID active_date   datestamp code1 code2 code3 payment
2   1  01/01/2020  12/06/2020   BBB   AAA     .       2
4   2  10/01/2020           .     .     .     .       .
5   3  18/01/2020  15/05/2020   CCC   BBB   AAA       4
8   4  20/01/2020  25/04/2020   AAA     .     .       3
10  5  24/01/2020  06/05/2020   DDD     .     .       1