Try this:
df['total_orders'] = df.groupby('city')['order_id'].transform('count')
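To see what `transform('count')` does, here is a minimal, self-contained sketch; the city/order data is made up for illustration, but the column names match the snippet above:

```python
import pandas as pd

# Hypothetical sample data matching the column names above.
df = pd.DataFrame({
    'city': ['NYC', 'NYC', 'LA', 'LA', 'LA'],
    'order_id': [1, 2, 3, 4, 5],
})

# transform('count') broadcasts each city's order count back onto every row,
# unlike .agg('count'), which would return one row per city.
df['total_orders'] = df.groupby('city')['order_id'].transform('count')
```

Every NYC row gets 2 and every LA row gets 3, so the frame keeps all five rows.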
April 15, 2019
import pandas as pd
import seaborn as sns
cm = sns.light_palette("lightgreen", as_cmap=True)
data = {
'close_date': ["2012-08-01", "2012-08-01", "2012-08-01", "2012-08-02", "2012-08-03", "2012-08-04", "2012-08-05", "2012-08-07"],
'seller_name': ["Lara", "Julia", "Julia", "Emily", "Julia", "Lara", "Julia", "Julia"]
}
df = pd.DataFrame(data)
df
df['close_date'] = pd.to_datetime(df['close_date'])
df['rank_seller_by_close_date'] = df.groupby('seller_name')['close_date'].rank(method='first')
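As a sanity check on a smaller, made-up slice of the same data: `rank(method='first')` numbers rows 1, 2, … within each group, and when rows are already sorted by date it coincides with `cumcount() + 1`:

```python
import pandas as pd

df = pd.DataFrame({
    'close_date': pd.to_datetime(["2012-08-01", "2012-08-01", "2012-08-02"]),
    'seller_name': ["Lara", "Julia", "Julia"],
})

# rank(method='first') assigns 1, 2, ... by value within each seller's group;
# because this data is sorted by date, it matches the positional cumcount() + 1.
rank_ = df.groupby('seller_name')['close_date'].rank(method='first')
cum_ = df.groupby('seller_name').cumcount() + 1
```

Both give Lara's single sale rank 1 and Julia's two sales ranks 1 and 2.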
You can use the SQL PARTITION BY clause with the OVER clause to specify the column over which aggregation is performed. PARTITION BY attaches the aggregated value to each record in the table: if the table has 15 records, a query using PARTITION BY also returns 15 rows. GROUP BY, on the other hand, returns one row per group.

The OVER clause defines a window, a user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use the OVER clause with functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top-N-per-group results.

PARTITION BY does not affect the number of rows returned, but it changes how a window function's result is calculated. GROUP BY normally reduces the number of rows returned by rolling them up and calculating averages or sums for each group.
# user$raw, 6 rows:
name   | number_of_registered_entities
User_1 | 8
User_2 | 10
User_3 | 8
User_2 | 1
User_3 | 5
User_1 | 7

# SQL query 1, GROUP BY:
SELECT name, SUM(number_of_registered_entities) AS entitysum
FROM user$raw
GROUP BY name

# Output 1, 3 rows:
name   | entitysum
User_1 | 15
User_2 | 11
User_3 | 13

# SQL query 2, PARTITION BY:
SELECT name, SUM(number_of_registered_entities) OVER (PARTITION BY name) AS entitysum
FROM user$raw

# Output 2, 6 rows:
name   | entitysum
User_1 | 15
User_1 | 15
User_2 | 11
User_2 | 11
User_3 | 13
User_3 | 13
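The same GROUP BY vs PARTITION BY contrast can be sketched in pandas using the toy user$raw table: `groupby(...).sum()` mirrors GROUP BY (one row per name), while `groupby(...).transform('sum')` mirrors SUM() OVER (PARTITION BY name), broadcasting the per-name sum onto all six rows:

```python
import pandas as pd

# Same toy table as user$raw above.
raw = pd.DataFrame({
    'name': ['User_1', 'User_2', 'User_3', 'User_2', 'User_3', 'User_1'],
    'number_of_registered_entities': [8, 10, 8, 1, 5, 7],
})

# GROUP BY analogue: one row per name.
grouped = raw.groupby('name')['number_of_registered_entities'].sum()

# PARTITION BY analogue: the per-name sum attached to every original row.
raw['entitysum'] = raw.groupby('name')['number_of_registered_entities'].transform('sum')
```

`grouped` has 3 rows; `raw` still has 6, each carrying its group's total.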
You can use groupby('ID')[value].shift(1) to access the previous value in the same ID group.
import pandas as pd

df = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'b', 'b', 'b'],
    'time': [1, 2, 3, 1, 4, 5],
    'status': ['x', 'y', 'z', 'xx', 'yy', 'zz']
})
df['previous_time'] = df.groupby('ID')['time'].shift(1)
df['previous_status'] = df.groupby('ID')['status'].shift(1)
df = df.dropna()
df['duration'] = df['time'] - df['previous_time']  # change this line to calculate duration between times instead
df['status_change'] = df['previous_status'] + '-' + df['status']
print(df[['ID', 'duration', 'status_change']].to_markdown(index=False))
You can use a groupby.diff and groupby.shift:
out = (df
    .assign(**{
        'Duration(min)': pd.to_datetime(df['Timestamp'], dayfirst=False)
            .groupby(df['ID'])
            .diff(-1).dt.total_seconds()  # diff in seconds to next time in group
            .div(60),                     # convert to minutes
        'Status change': df.groupby('ID')['Status'].shift(-1) + '-' + df['Status']
    })
    .dropna(subset='Duration(min)')       # get rid of empty rows
    [['ID', 'Duration(min)', 'Status change']]
)
Output:
ID Duration(min) Status change
0 A 126.0 In Progress - Run Ended
1 A 1.0 Prepared - In Progress
3 B 117.0 In Progress - Run Ended
4 B 503.0 Prepared - In Progress
import pandas as pd
import numpy as np
#create a dataframe
df = pd.DataFrame({
'date': ['2013-04-01', '2013-04-01', '2013-04-01', '2013-04-02', '2013-04-02'],
'user_id': ['0001', '0001', '0002', '0002', '0002'],
'duration': [30, 15, 20, 15, 30]
})
df
# per day, sum the total duration and count the number of distinct users
df = df.groupby("date").agg({
    "duration": "sum",
    "user_id": "nunique"
})
df
On a DataFrame, we obtain a GroupBy object by calling groupby(). We could naturally group by either the A or B columns, or both. With the GroupBy object in hand, iterating through the grouped data is very natural and functions similarly to itertools.groupby(). pandas objects can be split on any of their axes; the abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you may do the following:
SELECT Column1, Column2, AVG(Column3), SUM(Column4)
FROM SomeTable
GROUP BY Column1, Column2
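The pandas counterpart of that SQL query is a groupby followed by per-column aggregation. SomeTable and its column names are placeholders from the SQL above, filled here with made-up values:

```python
import pandas as pd

# Hypothetical SomeTable; Column1/Column2 are the grouping keys.
some_table = pd.DataFrame({
    'Column1': ['x', 'x', 'y'],
    'Column2': ['a', 'a', 'b'],
    'Column3': [1.0, 3.0, 5.0],
    'Column4': [10, 20, 30],
})

# Group by the two key columns, then average Column3 and sum Column4 per group.
out = some_table.groupby(['Column1', 'Column2']).agg(
    Column3=('Column3', 'mean'),
    Column4=('Column4', 'sum'),
)
```

Like SQL's GROUP BY, the result has one row per (Column1, Column2) pair.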
In [1]: df = pd.DataFrame(
   ...:     [
   ...:         ("bird", "Falconiformes", 389.0),
   ...:         ("bird", "Psittaciformes", 24.0),
   ...:         ("mammal", "Carnivora", 80.2),
   ...:         ("mammal", "Primates", np.nan),
   ...:         ("mammal", "Carnivora", 58),
   ...:     ],
   ...:     index=["falcon", "parrot", "lion", "monkey", "leopard"],
   ...:     columns=("class", "order", "max_speed"),
   ...: )

In [2]: df
Out[2]:
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0
# default is axis=0
In [3]: grouped = df.groupby("class")

In [4]: grouped = df.groupby("order", axis="columns")

In [5]: grouped = df.groupby(["class", "order"])
In [6]: df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )

In [7]: df
Out[7]:
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860
In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]:
            C         D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
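As noted above, iterating through a GroupBy object works much like itertools.groupby(): each step yields the group key and the sub-frame of rows for that key. A minimal sketch with a cut-down version of the frame:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0],
})

# Each iteration yields (group_key, sub-DataFrame for that key);
# groups come back sorted by key by default.
keys = []
sizes = []
for name, group in df.groupby('A'):
    keys.append(name)
    sizes.append(len(group))
```

Here 'bar' contributes one row and 'foo' two, and the keys arrive sorted.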
In [13]: def get_letter_type(letter):
   ....:     if letter.lower() in 'aeiou':
   ....:         return 'vowel'
   ....:     else:
   ....:         return 'consonant'
   ....:

In [14]: grouped = df.groupby(get_letter_type, axis=1)
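Grouping by a function applies it to each label to produce the group names. Since recent pandas versions deprecate axis=1 in groupby, a version-safe sketch (with made-up numeric data; the column labels A, B, C, D match the example above) transposes, groups the former column labels, and transposes back:

```python
import pandas as pd

def get_letter_type(letter):
    # Maps each column label to a group name.
    return 'vowel' if letter.lower() in 'aeiou' else 'consonant'

# Hypothetical numeric data under the same column labels as above.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]})

# Equivalent to df.groupby(get_letter_type, axis=1).sum() on older pandas:
# transpose, group the (former) column labels by the function, sum, transpose back.
out = df.T.groupby(get_letter_type).sum().T
```

Column 'A' lands in the vowel group; 'B', 'C', and 'D' are summed into the consonant group.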