pandas drop duplicates of one column with criteria


Firstly sort on column 'B':

df.sort_values('B', inplace = True)

df
Out[24]:
            A      B
5   239616418  name1
7   239616428  name1
10  239616429  name1
1   239616414  name2
0   239616412    NaN
2   239616417    NaN
3   239616417    NaN
4   239616417    NaN
6   239616418    NaN
8   239616429    NaN
9   239616429    NaN

Then drop duplicates w.r.t. column 'A':

df.drop_duplicates('A', inplace = True)

df
Out[26]:
            A      B
5   239616418  name1
7   239616428  name1
10  239616429  name1
1   239616414  name2
0   239616412    NaN
2   239616417    NaN

You can re-sort the data frame by index to get exactly what you want:

df.sort_index(inplace = True)

df
Out[30]:
            A      B
0   239616412    NaN
1   239616414  name2
2   239616417    NaN
5   239616418  name1
7   239616428  name1
10  239616429  name1

If you want to drop any duplicates, this should work. The sort places all valid entries before the NaNs, so they take precedence in the drop_duplicates logic (keep='first' is the default).

df.loc[df['B'] == 'none', 'B'] = np.nan
df = df.sort_values(['A', 'B']).drop_duplicates(subset = 'A')
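To see why the sort order matters, here is a minimal sketch with made-up data: sort_values places NaN last within each key (na_position='last' is the default), so the named row for each 'A' value comes first and survives the drop.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each 'A' key appears once with a name and once with NaN
df = pd.DataFrame({
    "A": [239616417, 239616417, 239616418, 239616418],
    "B": [np.nan, "name1", np.nan, "name1"],
})

# Within each 'A' group the named row sorts before NaN,
# so drop_duplicates(keep='first') keeps the valid value
df = df.sort_values(["A", "B"]).drop_duplicates(subset="A")

print(df)
```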

If you'd rather keep duplicate valid values, you could do something like this, which splits the data into null and not-null parts and recombines them.

valids = df.dropna().drop_duplicates()

invalids = df[pd.isnull(df['B'])].drop_duplicates()
invalids = invalids[~invalids['A'].isin(valids['A'])]

df = pd.concat([valids, invalids])
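A small end-to-end run of the split/recombine approach, on made-up data: a key with a valid 'B' keeps only its valid row, while keys that only ever have NaN survive as NaN rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2, 3],
    "B": ["x", "x", np.nan, np.nan, np.nan],
})

# Rows with a valid 'B', exact duplicates collapsed
valids = df.dropna().drop_duplicates()

# NaN rows, kept only for keys that never had a valid value
invalids = df[pd.isnull(df["B"])].drop_duplicates()
invalids = invalids[~invalids["A"].isin(valids["A"])]

result = pd.concat([valids, invalids])
print(result)
```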

Suggestion : 2


import pandas as pd
import numpy as np
raw_data = {
   'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
   'age': [20, 19, 22, 21],
   'favorite_color': ['blue', 'blue', 'yellow', "green"],
   'grade': [88, 92, 95, 70]
}
df = pd.DataFrame(raw_data, index = ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'])
df
# here we should drop Al Jennings' record from the df,
# since his favorite color, blue, duplicates Willard Morris's
df = df.drop_duplicates(subset = 'favorite_color', keep = "first")
df

Suggestion : 3

This section shows how to drop duplicate rows from a pandas DataFrame using pandas.DataFrame.drop_duplicates(), and how to combine DataFrame.apply() with a lambda function for the same purpose, with examples. It also covers the function's syntax and how to keep the last occurrence instead of the first.

# Below are quick example
# keep first duplicate row
df2 = df.drop_duplicates()

# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep = 'first')

# keep last duplicate row
df2 = df.drop_duplicates(keep = 'last')

# Remove all duplicate rows
df2 = df.drop_duplicates(keep = False)

# Delete duplicate rows based on specific columns
df2 = df.drop_duplicates(subset = ["Courses", "Fee"], keep = False)

# Drop duplicate rows in place
df.drop_duplicates(inplace = True)

# Using DataFrame.apply() and a lambda function
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset = ['Courses', 'Fee'], keep = 'first')
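One caveat with the apply/lambda version above: it returns the lower-cased string copy of the frame, not the original rows. A sketch (with made-up data) that dedupes case-insensitively but keeps the original values, by reusing the surviving index on the original frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "spark", "Python"],
    "Fee": [20000, 20000, 22000],
})

# Lower-case a throwaway copy so 'Spark' and 'spark' compare equal,
# then select the surviving row labels from the original frame
lowered = df.apply(lambda x: x.astype(str).str.lower())
df2 = df.loc[lowered.drop_duplicates(subset=["Courses", "Fee"], keep="first").index]

print(df2)
```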

Below is the syntax of the DataFrame.drop_duplicates() function that removes duplicate rows from the pandas DataFrame.

# Syntax of drop_duplicates
DataFrame.drop_duplicates(subset = None, keep = 'first', inplace = False, ignore_index = False)
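Of these parameters, ignore_index (added in pandas 1.0) is the least obvious: it relabels the surviving rows 0..n-1 instead of keeping their original index labels. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2, 2, 3]})

# Default: surviving rows keep their original labels
kept = df.drop_duplicates()
print(kept.index.tolist())  # [0, 2, 4]

# ignore_index=True renumbers the result from zero
relabeled = df.drop_duplicates(ignore_index=True)
print(relabeled.index.tolist())  # [0, 1, 2]
```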

Now, let’s create a DataFrame with a few duplicate rows on columns. Our DataFrame contains the columns Courses, Fee, Duration, and Discount.

import pandas as pd
import numpy as np
technologies = {
   'Courses': ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
   'Fee': [20000, 25000, 22000, 30000, 22000, 20000, 30000],
   'Duration': ['30days', '40days', '35days', '50days', '35days', '30days', '50days'],
   'Discount': [1000, 2300, 1200, 2000, 1200, 1000, 2000]
}
df = pd.DataFrame(technologies)
print(df)

You can use DataFrame.drop_duplicates() without any arguments to drop rows with the same values on all columns. It uses the default values subset=None and keep='first'. The example below returns four rows after removing duplicate rows from our DataFrame.

# keep first duplicate row
df2 = df.drop_duplicates()
print(df2)

# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep = 'first')
print(df2)

Yields the output below.

   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000

Suggestion : 4

The keep parameter controls which duplicate values are removed: 'first' drops duplicates except for the first occurrence (the default), 'last' drops duplicates except for the last occurrence, and False drops all duplicated entries. The related Index.duplicated() method indicates duplicate Index values instead of dropping them.

>>> idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'])
>>> idx.drop_duplicates(keep = 'first')
Index(['lama', 'cow', 'beetle', 'hippo'], dtype='object')
>>> idx.drop_duplicates(keep = 'last')
Index(['cow', 'beetle', 'lama', 'hippo'], dtype='object')
>>> idx.drop_duplicates(keep = False)
Index(['cow', 'beetle', 'hippo'], dtype='object')
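The related Index.duplicated() mentioned above returns a boolean mask rather than a filtered index, which is handy for selecting or counting duplicates. A short sketch on the same index:

```python
import pandas as pd

idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'])

# True marks entries already seen earlier in the index
mask = idx.duplicated(keep='first')
print(mask.tolist())  # [False, False, True, False, True, False]
```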