You can isolate values in a DataFrame using loc. What gets returned is a Series, which can be indexed like a list; use [0] to get the first occurrence in the Series.
import pandas as pd

times = [
    '2019-05-18 01:15:28',
    '2019-05-18 01:28:11',
    '2019-05-18 01:36:36',
    '2019-05-18 01:39:47',
    '2019-05-18 01:53:32',
    '2019-05-18 02:05:37'
]
a = [9, 7, 7, 5, 12, 12]
df = pd.DataFrame({
    'times': times,
    'a': a
})
df['times'] = pd.to_datetime(df['times'])
pd.Timedelta(df.loc[df.a == 12, 'times'].values[0] - df.loc[df.a == 7, 'times'].values[0])
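Equivalently, .iloc[0] can be used on the filtered Series instead of .values[0]; it returns a pandas Timestamp directly, so the subtraction already yields a Timedelta. A minimal sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'times': pd.to_datetime([
        '2019-05-18 01:15:28', '2019-05-18 01:28:11',
        '2019-05-18 01:36:36', '2019-05-18 01:39:47',
        '2019-05-18 01:53:32', '2019-05-18 02:05:37',
    ]),
    'a': [9, 7, 7, 5, 12, 12],
})

# .iloc[0] on the filtered Series returns a Timestamp,
# so the subtraction is already a pd.Timedelta
delta = df.loc[df.a == 12, 'times'].iloc[0] - df.loc[df.a == 7, 'times'].iloc[0]
print(delta)  # 0 days 00:25:21
```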
Or we can break that code apart for readability's sake and do the calculations on new variables:
import pandas as pd

times = [
    '2019-05-18 01:15:28',
    '2019-05-18 01:28:11',
    '2019-05-18 01:36:36',
    '2019-05-18 01:39:47',
    '2019-05-18 01:53:32',
    '2019-05-18 02:05:37'
]
a = [9, 7, 7, 5, 12, 12]
df = pd.DataFrame({
    'times': times,
    'a': a
})
df['times'] = pd.to_datetime(df['times'])
end = df.loc[df.a == 12, 'times'].values[0]
start = df.loc[df.a == 7, 'times'].values[0]
pd.Timedelta(end - start)
Sample:
import pandas as pd

times = [
    '2019-05-18 01:15:28',
    '2019-05-18 01:28:11',
    '2019-05-18 01:36:36',
    '2019-05-18 01:39:47',
    '2019-05-18 01:53:32',
    '2019-05-18 02:05:37'
]
a = [7, 7, 12, 7, 12, 7]
df = pd.DataFrame({
    'times': pd.to_datetime(times),
    'A': a
})
print(df)
                times   A
0 2019-05-18 01:15:28   7
1 2019-05-18 01:28:11   7
2 2019-05-18 01:36:36  12
3 2019-05-18 01:39:47   7
4 2019-05-18 01:53:32  12
5 2019-05-18 02:05:37   7
First create a default index and filter to rows with 7 and 12 only:
df = df.reset_index(drop = True)
df1 = df[df['A'].isin([7, 12])]
Then keep only the first value of each consecutive run by comparing with the shifted column:
df1 = df1[df1['A'].ne(df1['A'].shift())]
print(df1)
                times   A
0 2019-05-18 01:15:28   7
2 2019-05-18 01:36:36  12
3 2019-05-18 01:39:47   7
4 2019-05-18 01:53:32  12
5 2019-05-18 02:05:37   7
Then pair the rows up: after the deduplication the 7s sit at even positions and the 12s at odd positions (a trailing 7 has no matching 12, so trim it before subtracting):
out7 = df1.iloc[::2]
out12 = df1.iloc[1::2]
out7 = out7.iloc[:len(out12)]
And finally subtract:
df['Time_difference'] = out12['times'] - out7['times'].to_numpy()
df['Time_difference'] = df['Time_difference'].fillna(pd.Timedelta(0))
print(df)
                times   A Time_difference
0 2019-05-18 01:15:28   7        00:00:00
1 2019-05-18 01:28:11   7        00:00:00
2 2019-05-18 01:36:36  12        00:21:08
3 2019-05-18 01:39:47   7        00:00:00
4 2019-05-18 01:53:32  12        00:13:45
5 2019-05-18 02:05:37   7        00:00:00
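The steps above can be collected into one self-contained sketch. Note the pairing via iloc[::2] / iloc[1::2] assumes the deduplicated rows strictly alternate 7, 12, 7, 12, ...; the trim of a trailing unpaired 7 is my own addition to keep the lengths aligned:

```python
import pandas as pd

times = pd.to_datetime([
    '2019-05-18 01:15:28', '2019-05-18 01:28:11', '2019-05-18 01:36:36',
    '2019-05-18 01:39:47', '2019-05-18 01:53:32', '2019-05-18 02:05:37',
])
df = pd.DataFrame({'times': times, 'A': [7, 7, 12, 7, 12, 7]})

# keep only 7s and 12s, then the first of each consecutive run
df1 = df[df['A'].isin([7, 12])]
df1 = df1[df1['A'].ne(df1['A'].shift())]

# pair each 7 (even positions) with the following 12 (odd positions);
# a trailing 7 with no matching 12 is trimmed
out7 = df1.iloc[::2]
out12 = df1.iloc[1::2]
out7 = out7.iloc[:len(out12)]

# subtract start times from end times; unpaired rows get 0
df['Time_difference'] = out12['times'] - out7['times'].to_numpy()
df['Time_difference'] = df['Time_difference'].fillna(pd.Timedelta(0))
print(df)
```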
- (df["A"] == 7).cumsum() assigns each row to a group that starts at a 7
- for each group of 7s, if the group contains a 12, subtract the group's first row from its first row with a 12
- if not, pass the timestamp of the group's first row on to the next group until a 12 is found
import pandas as pd
import numpy as np

np.random.seed(10)
date_range = pd.date_range("25-9-2019", "27-9-2019", freq="3H")
df = pd.DataFrame({
    'Time': date_range,
    'A': np.random.choice([5, 7, 12], len(date_range))
})
df["Seven"] = (df["A"] == 7).cumsum()
# display(df)

pass_to_next_group = {"val": None}

def diff(group):
    group["Diff"] = 0
    loc = group.index[group["A"] == 12]
    time_a = pass_to_next_group["val"] if pass_to_next_group["val"] else group["Time"].iloc[0]
    pass_to_next_group["val"] = None
    if group.name > 0 and len(loc) > 0:
        group.loc[loc[0], "Diff"] = time_a - group.loc[loc[0], "Time"]
    else:
        pass_to_next_group["val"] = time_a
    return group

df.groupby("Seven").apply(diff)
Output
[1] "Original DataFrame"
col1 col3
1 8 2021-05-08 08:32:07
2 8 2021-07-18 00:21:07
3 7 2020-11-28 23:32:09
4 6 2021-05-11 18:32:07
5 7 2021-05-08 08:32:07
# A tibble: 5 x 3
# Groups:   col1 [3]
   col1 col3                diff
  <int> <dttm>              <drtn>
1 6 2021-05-11 18:32:07 NA secs
2 7 2020-11-28 23:32:09 NA secs
3 7 2021-05-08 08:32:07 13856398 secs
4 8 2021-05-08 08:32:07 NA secs
5 8 2021-07-18 00:21:07 6104940 secs
Output
[1] "Original DataFrame"
  col1 col3
1    7 2021-05-08 08:32:07
2    6 2021-07-18 00:21:07
3    8 2020-11-28 23:32:09
4    7 2021-05-11 18:32:07
5    6 2021-05-08 08:32:07
[1] "Modified DataFrame"
  col1 col3                    diff
1    7 2021-05-08 08:32:07        0
2    6 2021-07-18 00:21:07 -6104940
3    8 2020-11-28 23:32:09        0
4    7 2021-05-11 18:32:07   295200
5    6 2021-05-08 08:32:07        0
By default, the Pandas diff method will calculate the difference between subsequent rows, though it does offer us flexibility in terms of how we calculate our differences. Let’s take a look at the method and at the two arguments that it offers:
# Understanding the Pandas diff method
df.diff(
    periods=1,  # periods to shift for calculating the difference
    axis=0,     # the axis to calculate the difference on
)
In order to follow along with this tutorial, feel free to load the dataframe below by copying and pasting the code into your favourite code editor. Of course, feel free to use your own data, though your results will, of course, vary.
# Loading a Sample Pandas Dataframe
import pandas as pd

df = pd.DataFrame.from_dict({
    'Date': pd.date_range('2022-01-01', '2022-01-11'),
    'Sales': [198, 123, 973, 761, 283, 839, 217, 666, 601, 992, 205]
})

print(df.head())

# Returns:
#         Date  Sales
# 0 2022-01-01    198
# 1 2022-01-02    123
# 2 2022-01-03    973
# 3 2022-01-04    761
# 4 2022-01-05    283
The Pandas diff method allows us to easily subtract two rows in a Pandas Dataframe. By default, Pandas will calculate the difference between subsequent rows. Let’s see how we can use the method to calculate the difference between rows of the Sales column:
# Calculating the difference between two rows
df['Sales'] = df['Sales'].diff()

print(df.head())

# Returns:
#         Date  Sales
# 0 2022-01-01    NaN
# 1 2022-01-02  -75.0
# 2 2022-01-03  850.0
# 3 2022-01-04 -212.0
# 4 2022-01-05 -478.0
Let’s see how we can calculate the difference between rows seven periods apart:
# Changing the periodicity of the row differences
df['Sales Difference'] = df['Sales'].diff(periods=7)

print(df.head(10))

# Returns:
#         Date  Sales  Sales Difference
# 0 2022-01-01    198               NaN
# 1 2022-01-02    123               NaN
# 2 2022-01-03    973               NaN
# 3 2022-01-04    761               NaN
# 4 2022-01-05    283               NaN
# 5 2022-01-06    839               NaN
# 6 2022-01-07    217               NaN
# 7 2022-01-08    666             468.0
# 8 2022-01-09    601             478.0
# 9 2022-01-10    992              19.0
To make column-wise differences more meaningful, let’s add a second column to our dataframe:
# Calculating differences across columns
import pandas as pd

df = pd.DataFrame.from_dict({
    'Sales January': [198, 123, 973, 761, 283, 839, 217, 666, 601, 992, 205],
    'Sales February': [951, 556, 171, 113, 797, 720, 570, 724, 153, 277, 932]
})

df = df.diff(axis=1)
print(df.head())

# Returns:
#    Sales January  Sales February
# 0            NaN             753
# 1            NaN             433
# 2            NaN            -802
# 3            NaN            -648
# 4            NaN             514
First, open a Jupyter notebook and import the pandas library. Then use the Pandas read_csv() function to load the data. I’ve created a CSV file containing some dates you can use for this project, which is hosted on my GitHub page. If you print the head() you’ll see that the data comprise three columns: the order_id, the order_date, and the despatch_date from an ecommerce website. We’ll be using Pandas to calculate how long it took the operations team to despatch each order placed. Next, check the format of the data: run df.info() to return the details on the columns and data types present in the dataframe. As with many such datasets, the dates are actually stored as object data types, or strings, so we’ll need to convert them to datetime objects before we can use them in our calculations. We can then check the data types of the columns by running df.info() again; as you can see from the output, the order_date and despatch_date columns are now datetime objects.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/dates.csv')
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 100 non-null int64
1 order_date 100 non-null object
2 despatch_date 100 non-null object
dtypes: int64(1), object(2)
memory usage: 2.5+ KB
df['order_date'] = pd.to_datetime(df['order_date'], errors = 'coerce')
df['despatch_date'] = pd.to_datetime(df['despatch_date'], errors = 'coerce')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 100 non-null int64
1 order_date 100 non-null datetime64[ns]
2 despatch_date 100 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1)
memory usage: 2.5 KB
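With both columns converted, the despatch time the tutorial sets out to measure is a plain column subtraction, which yields a timedelta64[ns] column. A minimal sketch with a hypothetical two-order sample standing in for the CSV (the despatch_time column name is my own choice):

```python
import pandas as pd

# hypothetical sample standing in for the CSV loaded above
df = pd.DataFrame({
    'order_id': [1, 2],
    'order_date': pd.to_datetime(['2021-05-01 10:00:00', '2021-05-02 09:30:00']),
    'despatch_date': pd.to_datetime(['2021-05-03 16:00:00', '2021-05-02 15:30:00']),
})

# subtracting two datetime columns gives a timedelta64[ns] column
df['despatch_time'] = df['despatch_date'] - df['order_date']
print(df['despatch_time'])
# 0   2 days 06:00:00
# 1   0 days 06:00:00
```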
If performance is important, avoid aggregation and groupby because they are slow; it is better to create Response and Request Series with a MultiIndex and subtract the Timestamps. sort_index should also help with performance:
#if necessary
#df['Timestamp'] = pd.to_timedelta(df['Timestamp'])
cols = ['Service', 'Command', 'Message_ID']
s1 = df[df['Message_Type'] == 'Response'].set_index(cols)['Timestamp'].sort_index()
s2 = df[df['Message_Type'] == 'Request'].set_index(cols)['Timestamp'].sort_index()
df1 = s1.sub(s2).reset_index()
print(df1)
      Service     Command  Message_ID Timestamp
0  FoodOrders  SeeStock()         125  00:00:02
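A self-contained sketch of the same idea, with hypothetical sample data (the column names mirror the snippet above; the exact timestamps are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Service': ['FoodOrders', 'FoodOrders'],
    'Command': ['SeeStock()', 'SeeStock()'],
    'Message_Type': ['Request', 'Response'],
    'Message_ID': [125, 125],
    'Timestamp': pd.to_datetime(['2019-05-18 01:15:28', '2019-05-18 01:15:30']),
})

cols = ['Service', 'Command', 'Message_ID']

# one Series per message type, indexed by the identifying columns
s1 = df[df['Message_Type'] == 'Response'].set_index(cols)['Timestamp'].sort_index()
s2 = df[df['Message_Type'] == 'Request'].set_index(cols)['Timestamp'].sort_index()

# aligned subtraction: response time per (Service, Command, Message_ID)
df1 = s1.sub(s2).reset_index()
print(df1)  # the Timestamp column now holds 0 days 00:00:02
```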
You can also time it manually with this snippet from another post:
import time
start = time.time()
print("hello")
end = time.time()
print(end - start)
If you use a Jupyter notebook, you can try something like this:
%timeit df.sort_values('Time').groupby(['Service', 'Command', 'Message_Type', 'Message_ID']).apply(lambda x: x.iloc[1]['Time'] - x.iloc[0]['Time'])
On my sample data I get this output:
2.97 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
diff() calculates the difference of a DataFrame element compared with another element in the DataFrame (by default, the element in the previous row). The periods argument gives the number of periods to shift for calculating the difference and accepts negative values; axis selects whether to take the difference over rows (0) or columns (1). For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to the current dtype in the DataFrame, however the dtype of the result is always float64.
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]})
>>> df
a b c
0 1 1 1
1 2 1 4
2 3 2 9
3 4 3 16
4 5 5 25
5 6 8 36
>>> df.diff()
a b c
0 NaN NaN NaN
1 1.0 0.0 3.0
2 1.0 1.0 5.0
3 1.0 1.0 7.0
4 1.0 2.0 9.0
5 1.0 3.0 11.0
>>> df.diff(axis=1)
    a  b   c
0 NaN  0   0
1 NaN -1   3
2 NaN -1   7
3 NaN -1  13
4 NaN  0  20
5 NaN  2  28

>>> df.diff(periods=3)
     a    b     c
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  3.0  2.0  15.0
4  3.0  4.0  21.0
5  3.0  6.0  27.0

>>> df.diff(periods=-1)
     a    b     c
0 -1.0  0.0  -3.0
1 -1.0 -1.0  -5.0
2 -1.0 -1.0  -7.0
3 -1.0 -2.0  -9.0
4 -1.0 -3.0 -11.0
5  NaN  NaN   NaN
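As the docstring notes, for boolean dtypes diff uses operator.xor() rather than subtraction, so each element is compared with the previous one for inequality. A small illustrative sketch:

```python
import pandas as pd

# for boolean data, diff() is element XOR previous element:
# first value is NaN, then True exactly where the value changed
s = pd.Series([True, True, False, True])
print(s.diff())
```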
>>> df = pd.DataFrame({'a': [1, 0]}, dtype=np.uint8)
>>> df.diff()
a
0 NaN
1 255.0
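The 255.0 above comes from unsigned-integer overflow: 0 - 1 wraps around in uint8 before the result is converted to float. Casting to a signed type first avoids this (a small sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 0]}, dtype=np.uint8)

# the subtraction 0 - 1 wraps to 255 in uint8
print(df.diff())
#        a
# 0    NaN
# 1  255.0

# cast to a signed integer type first to get the expected -1
print(df.astype('int64').diff())
#      a
# 0  NaN
# 1 -1.0
```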