pandas: how to get first positive number?

First filter the rows with fast boolean indexing, then use groupby + first:

df = df[df['b'] > 10].groupby('id', as_index=False).first()
print(df)
   id  a   b
0   1 -3  12
1   2  4  23

The solution is a bit more complicated if some group has no value greater than 10. In that case the mask has to be expanded with duplicated so that ids occurring only once are kept:

print(df)
    a   b  id
0   7   6   3  <- no value b > 10 for id = 3
1  10   6   1
2   6  -3   1
3  -3  12   1
4   4  23   2
5  12  11   2
6   3  -5   2

mask = ~df['id'].duplicated(keep=False) | (df['b'] > 10)
df = df[mask].groupby('id', as_index=False).first()
print(df)
   id  a   b
0   1 -3  12
1   2  4  23
2   3  7   6
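
The duplicated mask above only rescues ids that occur exactly once. A minimal sketch of a more general fallback, assuming the same column names (a, b, id) as the sample data, keeps every row of a group that has no b above the threshold. The helper name first_above is hypothetical and not part of the original answer.

```python
import pandas as pd

# Sample data from the answer above
df = pd.DataFrame({
    'a': [7, 10, 6, -3, 4, 12, 3],
    'b': [6, 6, -3, 12, 23, 11, -5],
    'id': [3, 1, 1, 1, 2, 2, 2],
})


def first_above(df, thresh=10):
    """First row per id with b > thresh; falls back to the group's first row."""
    hit = df['b'] > thresh
    # True for every row whose group contains at least one hit
    group_has_hit = hit.groupby(df['id']).transform('any')
    # Keep the hits, plus all rows of groups without any hit
    mask = hit | ~group_has_hit
    return df[mask].groupby('id', as_index=False).first()


result = first_above(df)
print(result)
```

On the sample data this gives the same three rows as the duplicated-based mask, but it should also cover groups with several rows and no match.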

Timings:

import numpy as np
import pandas as pd

# [2000000 rows x 3 columns]
np.random.seed(123)
N = 2000000
df = pd.DataFrame({
    'id': np.random.randint(10000, size=N),
    'a': np.random.randint(10, size=N),
    'b': np.random.randint(15, size=N)
})
# print(df)

In [284]: %timeit (df[df['b'] > 10].groupby('id', as_index=False).first())
10 loops, best of 3: 67.6 ms per loop

In [285]: %timeit (df.query("b > 10").groupby('id').head(1))
10 loops, best of 3: 107 ms per loop

In [286]: %timeit (df[df['b'] > 10].groupby('id').head(1))
10 loops, best of 3: 90 ms per loop

In [287]: %timeit df.query("b > 10").groupby('id', as_index=False).first()
10 loops, best of 3: 83.3 ms per loop

# without sorting, a bit faster
In [288]: %timeit (df[df['b'] > 10].groupby('id', as_index=False, sort=False).first())
10 loops, best of 3: 62.9 ms per loop

In [146]: df.query("b > 10").groupby('id').head(1)
Out[146]:
   a   b  id
3 -3  12   1
4  4  23   2
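
The timings compare first() and head(1) as if they were interchangeable. A small sketch, using a hypothetical smaller frame than the 2,000,000-row benchmark, suggests how the two results differ in shape (head(1) keeps the original row labels and order of appearance) and that they hold the same values once aligned:

```python
import numpy as np
import pandas as pd

# Smaller, hypothetical stand-in for the benchmark frame
rng = np.random.RandomState(123)
N = 10_000
df = pd.DataFrame({
    'id': rng.randint(100, size=N),
    'a': rng.randint(10, size=N),
    'b': rng.randint(15, size=N),
})

filtered = df[df['b'] > 10]

# One row per id, ids sorted, fresh 0..n-1 index
by_first = filtered.groupby('id', as_index=False).first()

# One row per id, original row labels, order of first appearance
by_head = filtered.groupby('id').head(1)

# Sort by id and reset the index so the two results can be compared
aligned = by_head.sort_values('id').reset_index(drop=True)[by_first.columns]
same = np.array_equal(aligned.to_numpy(), by_first.to_numpy())
print(same)
```

Note that first() skips missing values column by column, so the two can diverge when the data contains NaN; the integer data here has none.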

For the case where the last column (id) is already sorted, here's a NumPy solution using np.searchsorted:

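def numpy_searchsorted(df, thresh=10):
    # Assumes the 'id' column is sorted in ascending order.
    # Columns are selected by name so the result does not depend on column order.
    a = df.values
    m = df['b'].values > thresh
    mask_idx = np.flatnonzero(m)

    # ids of the rows that pass the threshold
    b = df['id'].values[mask_idx]
    # first occurrence of each id among those rows
    unq_ids = b[np.concatenate(([True], b[1:] != b[:-1]))]
    idx = np.searchsorted(b, unq_ids)
    out = a[mask_idx[idx]]
    return pd.DataFrame(out, columns=df.columns)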

Runtime test -

In [2]: np.random.seed(123)
   ...: N = 2000000
   ...: df = pd.DataFrame({
   ...:     'id': np.sort(np.random.randint(10000, size=N)),
   ...:     'a': np.random.randint(10, size=N),
   ...:     'b': np.random.randint(15, size=N)
   ...: })
   ...:

# @MaxU's soln
In [3]: %timeit df.query("b > 10").groupby('id').head(1)
10 loops, best of 3: 44.8 ms per loop

# @jezrael's best soln that assumes last col as sorted too
In [4]: %timeit (df[df['b'] > 10].groupby('id', as_index=False, sort=False).first())
10 loops, best of 3: 30.1 ms per loop

# Proposed in this post
In [5]: %timeit numpy_searchsorted(df)
100 loops, best of 3: 12.6 ms per loop
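
As a cross-check for any of the variants timed above, one more idiom that is not part of the original answers is to filter and then call drop_duplicates on the id column, which keeps the first row per id. This is only a sketch on a small hypothetical frame; it was not included in the timings.

```python
import pandas as pd

# Small hypothetical frame with the id column sorted
df = pd.DataFrame({
    'a': [10, 6, -3, 4, 12, 3, 7],
    'b': [6, -3, 12, 23, 11, -5, 6],
    'id': [1, 1, 1, 2, 2, 2, 3],
})

# Keep rows with b > 10, then the first remaining row for each id
first_hits = df[df['b'] > 10].drop_duplicates('id')
print(first_hits)
```

Like head(1), this keeps the original row labels, and ids without any matching row (id 3 here) are absent from the result.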

Suggestion : 2

You can check which values in the dataframe are greater than 0, and take the idxmax (first True), then divide column-wise by the resulting values.

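# Row label of the first positive value in each column from 'a' onwards
ix = df.loc[:, 'a':].gt(0).idxmax()
# Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0,
# so this line only runs on older pandas versions
df.loc[:, 'a':] = df.loc[:, 'a':].div(df.lookup(ix.values, ix.index))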

    name     a         b    c
0   jack -2.50 -2.000000 -5.0
1   bill -1.50 -1.000000 -2.5
2    ray -0.75 -4.000000 -4.5
3    pew  1.00 -7.666667 -1.0
4  shaun  3.00  1.000000  1.0
5  mitch  0.75  1.666667  1.0
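
Because DataFrame.lookup is no longer available in pandas 2.0 and later, here is a minimal sketch of the same idea using NumPy integer indexing instead. The input frame is hypothetical (the original input is not shown in the answer), and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical input; the original answer does not show its starting frame
df = pd.DataFrame({
    'name': ['jack', 'bill', 'ray', 'pew'],
    'a': [-5.0, -3.0, 2.0, 4.0],
    'b': [-2.0, 4.0, -8.0, 2.0],
})

vals = df.loc[:, 'a':]

# Row label of the first positive value per column.
# If a column has no positive value, idxmax returns the first row label.
ix = vals.gt(0).idxmax()

# Translate labels to positions and pick one value per column
rows = df.index.get_indexer(ix.to_numpy())
cols = np.arange(vals.shape[1])
first_pos = vals.to_numpy()[rows, cols]

# Divide each column by its first positive value
scaled = vals.div(first_pos)
result = df.copy()
result[vals.columns] = scaled
print(result)
```

After the division, the first positive entry of each column becomes 1.0, which matches the pattern in the output shown above.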

Here is another way:

m = df.set_index('name')
final = m.div(m.mask(m.lt(0)).bfill().iloc[0]).reset_index()

    name     a         b    c
0   jack -2.50 -2.000000 -5.0
1   bill -1.50 -1.000000 -2.5
2    ray -0.75 -4.000000 -4.5
3    pew  1.00 -7.666667 -1.0
4  shaun  3.00  1.000000  1.0
5  mitch  0.75  1.666667  1.0

Suggestion : 3

Last Updated : 08 Aug, 2022

Input: list1 = [12, -7, 5, 64, -14]
Output: [12, 5, 64]

Input: list2 = [12, 14, -95, 3]
Output: [12, 14, 3]

Output:

11 0 45 66

[12, 5, 64]

[12, 5, 64]

11 45 66
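
The code for this suggestion did not survive, so the following is only a sketch of how the two Input/Output examples for list1 and list2 could be reproduced in plain Python, together with a way to stop at the first positive number. The function names positives and first_positive are hypothetical.

```python
list1 = [12, -7, 5, 64, -14]
list2 = [12, 14, -95, 3]


def positives(nums):
    """Return all strictly positive numbers, in their original order."""
    return [n for n in nums if n > 0]


def first_positive(nums, default=None):
    """Return the first strictly positive number, or default if there is none."""
    return next((n for n in nums if n > 0), default)


print(positives(list1))
print(positives(list2))
print(first_positive([-7, 5, 64]))
```

Using next with a generator stops scanning at the first match, which matters for long lists.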

Suggestion : 4

Suppose we have a sorted list of distinct integers of size n. We have to find the first positive number in the range [1, n+1] that is not present in the array.

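class Solution:
    def solve(self, arr):
        # Single pass; correct only when arr is sorted in ascending order,
        # as the problem statement assumes
        target = 1
        for i in arr:
            if i == target:
                target += 1
        return target


ob = Solution()
nums = [0, 5, 1]
print(ob.solve(nums))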

Input

[0, 5, 1]

Output

2
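
The single-pass loop above depends on the input being sorted. As a sketch for input that may be unsorted (such as the sample list itself), a set-based variant can be used instead. The function name first_missing_positive is hypothetical and this is not the original author's method.

```python
def first_missing_positive(arr):
    """Smallest positive integer not present in arr (arr may be unsorted)."""
    seen = set(arr)
    target = 1
    # Count upwards until a value is not in the set
    while target in seen:
        target += 1
    return target


print(first_missing_positive([0, 5, 1]))
print(first_missing_positive([2, 1]))
```

For [2, 1] the loop-based version would return 2, which is present in the list, while the set-based version does not depend on the order of the elements.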

Suggestion : 5

Series.idxmin: Return the row label of the minimum value. If multiple values equal the minimum, the first row label with that value is returned. If skipna is False and there is an NA value in the data, the function returns nan.

This method is the Series version of ndarray.argmin. This method returns the label of the minimum, while ndarray.argmin returns the position. To get the position, use series.values.argmin().

>>> s = pd.Series(data=[1, None, 4, 1],
...               index=['A', 'B', 'C', 'D'])
>>> s
A    1.0
B    NaN
C    4.0
D    1.0
dtype: float64
>>> s.idxmin()
'A'
>>> s.idxmin(skipna=False)
nan
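
To tie this back to the original question, the sibling method idxmax can be applied to a boolean mask to locate the first positive value in a Series. This is a minimal sketch; the helper name first_positive_label and the sample data are hypothetical.

```python
import pandas as pd


def first_positive_label(s):
    """Label of the first strictly positive value in s, or None if there is none."""
    positive = s > 0
    # idxmax on a boolean Series returns the label of the first True;
    # the any() check guards against a Series with no positive value
    return positive.idxmax() if positive.any() else None


s = pd.Series([-2.0, -1.0, 3.0, 0.5], index=['A', 'B', 'C', 'D'])
label = first_positive_label(s)
value = s.loc[label]
print(label, value)
```

Without the any() guard, idxmax on an all-False mask would return the first label, which could be mistaken for a match.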