replace
'U' with NaN
then you want the logic of groupby
+ last
:
#df = df.sort_values(['PA', 'date'])
df.replace('U', np.NaN).groupby('PA').last()
date grade_conc grade_rebar grade_mason grade_work grade_timber grade_steel grade_total
PA
1 2018 - 10 - 15 A NR NR NR B Z NR
2 2018 - 10 - 15 Z NR NR NR B A NR
I am using ffill
with tail
df = df.sort_values(['date'])
df = df.mask(df == 'U')
df.groupby('PA').ffill().groupby('PA').tail(1)
Out[277]:
PA date grade_conc...grade_timber grade_steel grade_total
2 1 2018 - 10 - 15 A...B Z NR
5 2 2018 - 10 - 15 Z...B A NR[2 rows x 9 columns]
Or drop_duplicates
df.groupby('PA').ffill().drop_duplicates('PA', keep = 'last')
Maybe this using groupby
, apply
, replace
, ffill
and finally tail
:
print(df.groupby('PA', as_index = False).apply(lambda x: x.replace('U', np.nan).ffill().tail(1)))
Output:
PA date grade_conc grade_rebar grade_mason grade_work\
0 2 1 2018 - 10 - 15 A NR NR NR
1 5 2 2018 - 10 - 15 Z NR NR NR
grade_timber grade_steel grade_total
0 2 B Z NR
1 5 B A NR
Another aggregation example is to compute the number of unique values of each group. This is similar to the value_counts function, except that it only counts unique values.,Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the size method. It returns a Series whose index are the group names and whose values are the sizes of each group.,Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.,pandas Index objects support duplicate values. If a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group and thus the output of aggregation functions will only contain unique index values:
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
In[1]: df = pd.DataFrame(
...: [
...: ("bird", "Falconiformes", 389.0),
...: ("bird", "Psittaciformes", 24.0),
...: ("mammal", "Carnivora", 80.2),
...: ("mammal", "Primates", np.nan),
...: ("mammal", "Carnivora", 58),
...:
],
...: index = ["falcon", "parrot", "lion", "monkey", "leopard"],
...: columns = ("class", "order", "max_speed"),
...: )
...:
In[2]: df
Out[2]:
class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
#
default is axis = 0
In[3]: grouped = df.groupby("class")
In[4]: grouped = df.groupby("order", axis = "columns")
In[5]: grouped = df.groupby(["class", "order"])
In[6]: df = pd.DataFrame(
...: {
...: "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
...: "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
...: "C": np.random.randn(8),
...: "D": np.random.randn(8),
...:
}
...: )
...:
In[7]: df
Out[7]:
A B C D
0 foo one 0.469112 - 0.861849
1 bar one - 0.282863 - 2.104569
2 foo two - 1.509059 - 0.494929
3 bar three - 1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two - 0.173215 - 0.706771
6 foo one 0.119209 - 1.039575
7 foo three - 1.044236 0.271860
In[8]: grouped = df.groupby("A")
In[9]: grouped = df.groupby(["A", "B"])
In[10]: df2 = df.set_index(["A", "B"])
In[11]: grouped = df2.groupby(level = df2.index.names.difference(["B"]))
In[12]: grouped.sum()
Out[12]:
C D
A
bar - 1.591710 - 1.739537
foo - 0.752861 - 1.402938
In[13]: def get_letter_type(letter):
....: if letter.lower() in 'aeiou':
....: return 'vowel'
....:
else:
....: return 'consonant'
....:
In[14]: grouped = df.groupby(get_letter_type, axis = 1)
To get the unique values in multiple columns of a dataframe, we can merge the contents of those columns to create a single series object and then can call unique() function on that series object i.e. ,To fetch the unique values in column ‘Age’ of the above created dataframe, we will call unique() function on the column i.e. ,Suppose instead of getting the name of unique values in a column, if we are interested in count of unique elements in a column then we can use series.unique() function i.e. ,In this article we will discuss how to find unique elements in a single, multiple or each column of a dataframe.
It returns the a numpy array of unique elements in series object.
Series.unique(self)
Series.nunique(self, dropna = True)
First of all, create a dataframe,
# List of Tuples empoyees = [('jack', 34, 'Sydney', 5), ('Riti', 31, 'Delhi', 7), ('Aadi', 16, np.NaN, 11), ('Mohit', 31, 'Delhi', 7), ('Veena', np.NaN, 'Delhi', 4), ('Shaunak', 35, 'Mumbai', 5), ('Shaun', 35, 'Colombo', 11) ] # Create a DataFrame object empDfObj = pd.DataFrame(empoyees, columns = ['Name', 'Age', 'City', 'Experience'], index = ['a', 'b', 'c', 'd', 'e', 'f', 'g']) print("Contents of the Dataframe : ") print(empDfObj)
Suppose instead of getting the name of unique values in a column, if we are interested in count of unique elements in a column then we can use series.unique() function i.e.
# Count unique values in column 'Age' of the dataframe uniqueValues = empDfObj['Age'].nunique() print('Number of unique values in column "Age" of the dataframe : ') print(uniqueValues)
Using nunique() with default arguments doesn’t include NaN while counting the unique elements, if we want to include NaN too then we need to pass the dropna argument i.e.
# Count unique values in column 'Age' including NaN uniqueValues = empDfObj['Age'].nunique(dropna = False) print('Number of unique values in column "Age" including NaN') print(uniqueValues)
Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns. This is Python’s closest equivalent to dplyr’s group_by + summarise logic. Here’s a quick example of how to group on one or multiple columns and summarise data with aggregation functions using Pandas.,Applying multiple aggregation functions to a single column will result in a multiindex. Working with multi-indexed columns is a pain and I’d recommend flattening this after aggregating by renaming the new columns.,It’s simple to extend this to work with multiple grouping variables. Say you want to summarise player age by team AND position. You can do this by passing a list of column names to groupby instead of a single string value.,You’ll also see that your grouping column is now the dataframe’s index. Reset your index to make this easier to work with later on.
import pandas as pd
data = {
"Team": ["Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees"],
"Pos": ["Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher", "Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher"],
"Age": [24, 28, 40, 22, 29, 33, 31, 26, 21, 36, 25, 31]
}
df = pd.DataFrame(data)
print(df)
# group by Team, get mean, min, and max value of Age for each value of Team. grouped_single = df.groupby('Team').agg({ 'Age': ['mean', 'min', 'max'] }) print(grouped_single)
# rename columns grouped_single.columns = ['age_mean', 'age_min', 'age_max'] # reset index to get grouped columns back grouped_single = grouped_single.reset_index() print(grouped_single)
grouped_multiple = df.groupby(['Team', 'Pos']).agg({
'Age': ['mean', 'min', 'max']
})
grouped_multiple.columns = ['age_mean', 'age_min', 'age_max']
grouped_multiple = grouped_multiple.reset_index()
print(grouped_multiple)
You can also select column values based on another DataFrame column value by using DataFrame.loc[] property. The .loc[] property explains how to access a group of rows and columns by label(s) or a boolean array. Here, the condition can just be selecting rows and columns, but it can also be used to filter DataFrames. These filtered DataFrame can have values applied to them.,In this article, I will explain how to extract column values based on another column of pandas DataFrame using different ways, these can be used to can create conditional columns on padas DataFrame.,In this article, you have learned how to extract column values of pandas DataFrame based on another column by using DataFrame.loc[], DataFrame.iloc[], DataFrame.query(), DataFrame.values[] methods with simple examples.,Another method to extract columns of pandas DataFrame based on another column by using DataFrame.item() method.
# Below are some quick examples. # Extract column values by using DataFrame.loc[] property. df2 = df.loc[df['Fee'] == 30000, 'Courses'] # To get First Element by using.iloc[] method. df2 = df.loc[df['Fee'] == 30000, 'Courses'].iloc[0] # Extract column values by DataFrame.item() method df2 = df.loc[df['Fee'] == 30000, 'Courses'].item() # Using DataFrame.query() method extract column values. df2 = df.query('Fee==25000')['Courses'] # Using DataFrame.values() property. df2 = df[df['Fee'] == 22000]['Courses'].values[0] # Other example. df2 = df[df['Fee'] == 22000]['Courses']
Now, let’s create a Pandas DataFrame with a few rows and columns and execute the above examples. Our DataFrame contains column names Courses
, Fee
, Duration
, and Discount
.
# Create Pandas DataFrame. import pandas as pd import numpy as np technologies = { 'Courses': ["Spark", "PySpark", "Python", "pandas"], 'Fee': [20000, 25000, 22000, 30000], 'Duration': ['30days', '40days', '35days', '50days'], 'Discount': [1000, 2300, 1200, 2000] } index_labels = ['r1', 'r2', 'r3', 'r4'] df = pd.DataFrame(technologies, index = index_labels) print(df)
Yields below output.
Courses Fee Duration Discount r1 Spark 20000 30 days 1000 r2 PySpark 25000 40 days 2300 r3 Python 22000 35 days 1200 r4 pandas 30000 50 days 2000
You can also select column values based on another DataFrame column value by using DataFrame.loc[]
property. The .loc[]
property explains how to access a group of rows and columns by label(s) or a boolean array. Here, the condition can just be selecting rows and columns, but it can also be used to filter DataFrames. These filtered DataFrame can have values applied to them.
# Extract column values by using DataFrame.loc[] property. df2 = df.loc[df['Fee'] == 30000, 'Courses'] print(df2)
# To get First Element by using.iloc[] method. df2 = df.loc[df['Fee'] == 30000, 'Courses'].iloc[0] print(df2) # Output: pandas