Fastest way to compute a function on DataFrame slices by column value (Python pandas)

Perform a groupby on 'id_col' and then a transform, passing the function 'min'. This returns a result aligned to your original df, so you can add it as a new column:

In[13]:

   df = pd.DataFrame({
      "id_col": [0, 0, 0, 1, 1, 1, 2, 2, 2],
      "val_col": [0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]
   })
   df['offset'] = df.groupby('id_col').transform('min')
   df

Out[13]:
   id_col  val_col  offset
0       0      0.1     0.1
1       0      0.2     0.1
2       0      0.3     0.1
3       1      0.6     0.4
4       1      0.4     0.4
5       1      0.5     0.4
6       2      0.2     0.0
7       2      0.1     0.0
8       2      0.0     0.0
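transform is not limited to 'min': any named reducer or callable is computed per group and broadcast back to the original row positions. A minimal sketch of the same pattern (my own small example frame, not from the original answer) with 'mean' and a custom range function:

```python
import pandas as pd

df = pd.DataFrame({
    "id_col": [0, 0, 1, 1],
    "val_col": [1.0, 3.0, 10.0, 20.0],
})

# The per-group result is repeated for every row of that group,
# so the output always has the same length and index as df.
df["grp_mean"] = df.groupby("id_col")["val_col"].transform("mean")
df["grp_range"] = df.groupby("id_col")["val_col"].transform(lambda s: s.max() - s.min())

print(df["grp_mean"].tolist())   # [2.0, 2.0, 15.0, 15.0]
print(df["grp_range"].tolist())  # [2.0, 2.0, 10.0, 10.0]
```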

timings

In[15]:

   def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min):
       for rid in set(df[id_col].values):
           df.loc[df[id_col] == rid, offset_col] = f(df[df[id_col] == rid][val_col])
       return df

   %timeit apply_by_id_value(df)
   %timeit df.groupby('id_col').transform('min')

100 loops, best of 3: 8.12 ms per loop
100 loops, best of 3: 5.99 ms per loop

For an 800,000-row df I get the following timings:

1 loops, best of 3: 611 ms per loop
1 loops, best of 3: 438 ms per loop
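As a quick sanity check (my addition, not part of the original answer), the looped version and the groupby/transform version can be verified to agree on the example frame:

```python
import pandas as pd

df = pd.DataFrame({
    "id_col": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "val_col": [0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]
})

# Loop-based version from the answer above
def apply_by_id_value(df, id_col="id_col", val_col="val_col",
                      offset_col="offset", f=min):
    for rid in set(df[id_col].values):
        df.loc[df[id_col] == rid, offset_col] = f(df[df[id_col] == rid][val_col])
    return df

looped = apply_by_id_value(df.copy())["offset"]

# Vectorized groupby/transform version
vectorized = df.groupby("id_col")["val_col"].transform("min")

print(looped.tolist() == vectorized.tolist())  # True
```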


Suggestion : 3

This answer works through the same approach end to end as a standalone script:

   import pandas as pd

   # create data frame
   df = pd.DataFrame({
      "id_col": [0, 0, 0, 1, 1, 1, 2, 2, 2],
      "val_col": [0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]
   })

   print(df.head(10))
   # output:
   #    id_col  val_col
   # 0       0      0.1
   # 1       0      0.2
   # 2       0      0.3
   # 3       1      0.6
   # 4       1      0.4
   # 5       1      0.5
   # 6       2      0.2
   # 7       2      0.1
   # 8       2      0.0

   def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min):
       for rid in set(df[id_col].values):
           df.loc[df[id_col] == rid, offset_col] = f(df[df[id_col] == rid][val_col])
       return df

   df = apply_by_id_value(df)
   print(df.head(10))
   # output:
   #    id_col  val_col  offset
   # 0       0      0.1     0.1
   # 1       0      0.2     0.1
   # 2       0      0.3     0.1
   # 3       1      0.6     0.4
   # 4       1      0.4     0.4
   # 5       1      0.5     0.4
   # 6       2      0.2     0.0
   # 7       2      0.1     0.0
   # 8       2      0.0     0.0

Suggestion : 4

Overview of Jupyter Notebooks

# Make sure pandas is loaded
import pandas as pd

# Read in the survey CSV
surveys_df = pd.read_csv("data/surveys.csv")
# TIP: use the .head() method we saw earlier to make output shorter

# Method 1: select a 'subset' of the data using the column name
surveys_df['species_id']

# Method 2: use the column name as an 'attribute'; gives the same output
surveys_df.species_id

# Creates an object, surveys_species, that only contains the `species_id` column
surveys_species = surveys_df['species_id']

# Select the species and plot columns from the DataFrame
surveys_df[['species_id', 'plot_id']]

# What happens when you flip the order?
surveys_df[['plot_id', 'species_id']]

# What happens if you ask for a column that doesn't exist?
surveys_df['speciess']

# Create a list of numbers:
a = [1, 2, 3, 4, 5]
a[0]
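The last column selection fails because 'speciess' is misspelled. A minimal sketch of what happens (using a small hand-built stand-in frame, since the lesson's data/surveys.csv is not included here): selecting a column name that does not exist raises a KeyError.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; the real lesson reads data/surveys.csv
surveys_df = pd.DataFrame({"species_id": ["NL", "DM"], "plot_id": [2, 3]})

try:
    surveys_df['speciess']  # misspelled column name
except KeyError:
    print("no column named 'speciess'")
```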

Suggestion : 5

The correct way to swap column values is by using raw values. Also note that a slice object with labels such as 'a':'f' includes both the start and the stop when they are present in the index, contrary to usual Python slices (see Slicing with labels and Endpoints are inclusive). With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels:

In[1]: dates = pd.date_range('1/1/2000', periods=8)

In[2]: df = pd.DataFrame(np.random.randn(8, 4),
  ...:                   index=dates, columns=['A', 'B', 'C', 'D'])
  ...:

In[3]: df
Out[3]:
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

In[4]: s = df['A']

In[5]: s[dates[5]]
Out[5]: -0.6736897080883706

In[6]: df
Out[6]:
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

In[7]: df[['B', 'A']] = df[['A', 'B']]

In[8]: df
Out[8]:
                   A         B         C         D
2000-01-01 -0.282863  0.469112 -1.509059 -1.135632
2000-01-02 -0.173215  1.212112  0.119209 -1.044236
2000-01-03 -2.104569 -0.861849 -0.494929  1.071804
2000-01-04 -0.706771  0.721555 -1.039575  0.271860
2000-01-05  0.567020 -0.424972  0.276232 -1.087401
2000-01-06  0.113648 -0.673690 -1.478427  0.524988
2000-01-07  0.577046  0.404705 -1.715002 -1.039268
2000-01-08 -1.157892 -0.370647 -1.344312  0.844885

In[9]: df[['A', 'B']]
Out[9]:
                   A         B
2000-01-01 -0.282863  0.469112
2000-01-02 -0.173215  1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771  0.721555
2000-01-05  0.567020 -0.424972
2000-01-06  0.113648 -0.673690
2000-01-07  0.577046  0.404705
2000-01-08 -1.157892 -0.370647

In[10]: df.loc[:, ['B', 'A']] = df[['A', 'B']]

In[11]: df[['A', 'B']]
Out[11]:
                   A         B
2000-01-01 -0.282863  0.469112
2000-01-02 -0.173215  1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771  0.721555
2000-01-05  0.567020 -0.424972
2000-01-06  0.113648 -0.673690
2000-01-07  0.577046  0.404705
2000-01-08 -1.157892 -0.370647

In[12]: df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()

In[13]: df[['A', 'B']]
Out[13]:
                   A         B
2000-01-01  0.469112 -0.282863
2000-01-02  1.212112 -0.173215
2000-01-03 -0.861849 -2.104569
2000-01-04  0.721555 -0.706771
2000-01-05 -0.424972  0.567020
2000-01-06 -0.673690  0.113648
2000-01-07  0.404705  0.577046
2000-01-08 -0.370647 -1.157892

In[14]: sa = pd.Series([1, 2, 3], index=list('abc'))

In[15]: dfa = df.copy()

Suggestion : 6

Updated: September 15, 2020

np.mean(arrayname)
object_name.method()
function(object_name)
# Import packages
import os

import matplotlib.pyplot as plt
import pandas as pd
import earthpy as et
# URL for .csv with avg monthly precip data
avg_monthly_precip_url = "https://ndownloader.figshare.com/files/12710618"

# Download file
et.data.get_data(url = avg_monthly_precip_url)
'/root/earth-analytics/data/earthpy-downloads/avg-precip-months-seasons.csv'