Select rows with duplicate observations in pandas


Well, this would solve the case for you:

df[df.duplicated('Column Name', keep=False)]

Use the duplicated() method of DataFrame (the old `cols` argument was renamed to `subset` in pandas 0.17):

df.duplicated(subset=[...])

You can use (`take_last=True` is the deprecated spelling of `keep='last'`; a single call with `keep=False` is equivalent):

df[df.duplicated(subset=[...]) | df.duplicated(subset=[...], keep='last')]

or, you can use groupby and filter:

df.groupby([...]).filter(lambda g: len(g) > 1)
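A minimal sketch comparing the two approaches on a toy frame (column names assumed for illustration) — both should select the same duplicate rows:

```python
import pandas as pd

# Toy frame: rows 0 and 1 are exact duplicates of each other
df = pd.DataFrame({'a': [1, 1, 2, 3], 'b': ['x', 'x', 'y', 'z']})

# duplicated(keep=False) flags every member of each duplicate group
via_mask = df[df.duplicated(keep=False)]

# groupby + filter keeps only the groups with more than one row
via_filter = df.groupby(list(df.columns)).filter(lambda g: len(g) > 1)

print(via_mask.index.tolist())  # rows flagged by the mask approach
print(via_filter.index.tolist())  # rows kept by the filter approach
```

Both return rows 0 and 1 here; the groupby route is handier when you want to do further per-group work, while the boolean mask is usually faster.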

Suggestion : 2

duplicated() returns a boolean Series denoting duplicate rows. Only the columns passed via subset are considered for identifying duplicates; by default, all of the columns are used. For each set of duplicated values, the first occurrence is marked False and all others True (controlled by the keep parameter).

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
>>> df.duplicated()
0 False
1 True
2 False
3 False
4 False
dtype: bool
>>> df.duplicated(keep='last')
0 True
1 False
2 False
3 False
4 False
dtype: bool
>>> df.duplicated(keep=False)
0 True
1 True
2 False
3 False
4 False
dtype: bool
>>> df.duplicated(subset=['brand'])
0 False
1 True
2 False
3 True
4 True
dtype: bool
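The subset parameter also accepts several columns at once; a short sketch extending the frame from the example above:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# Consider only the brand/style pair when marking duplicates,
# ignoring the rating column entirely
mask = df.duplicated(subset=['brand', 'style'])
print(mask.tolist())  # [False, True, False, False, True]
```

Rows 1 and 4 are flagged because their (brand, style) pairs repeat earlier rows, even though row 4's rating differs from row 3's.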

Suggestion : 3

September 16, 2021

Creating a DataFrame

# Create a DataFrame
import pandas as pd
data_df = {
   'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],

   'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
      'Full-time Employee', 'Part-time Employee', 'Part-time Employee', 'Full-time Employee'
   ],

   'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
      'Administration', 'Technical', 'Marketing', 'Administration'
   ]
}

df = pd.DataFrame(data_df)
df

When you directly use the DataFrame.duplicated() function, the default values will be passed to the parameters for searching duplicate rows in the DataFrame.

# Use the DataFrame.duplicated() method to
# return a series of boolean values
bool_series = df.duplicated()
print(bool_series)
0 False
1 False
2 False
3 False
4 True
5 False
6 True
7 False
dtype: bool
# Use the keep parameter to consider all instances of a row to be duplicates
bool_series = df.duplicated(keep=False)
print('Boolean series:')
print(bool_series)
print('\n')
print('DataFrame after removing all the instances of the duplicate rows:')
# The `~` sign is used for negation. It changes the boolean
# value True to False and False to True.
df[~bool_series]
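The mask-and-negate pattern above gives the same result as drop_duplicates(keep=False); a minimal sketch on a shortened version of the frame to illustrate:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Arpit', 'Riya', 'Arpit'],
    'Department': ['Administration', 'Marketing', 'Administration'],
})

# Negating the keep=False mask drops every instance of a duplicate row...
via_mask = df[~df.duplicated(keep=False)]

# ...which is exactly what drop_duplicates(keep=False) does
via_drop = df.drop_duplicates(keep=False)

print(via_mask.equals(via_drop))  # True
```

Only the Riya row survives here, since both Arpit rows are instances of the same duplicate.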

This is extremely useful, as you might be interested in finding duplicate values based on only a few columns.

# Use the subset parameter to search for duplicate
# values only in the Name column of the DataFrame
bool_series = df.duplicated(subset='Name')

print('Boolean series:')
print(bool_series)
print('\n')
print('DataFrame after removing duplicates found in the Name column:')
df[~bool_series]
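To see which Name values are causing rows to be flagged, a quick sketch using value_counts (the data is the same Name column as above):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman',
                            'Arpit', 'Rohan', 'Riya', 'Sakshi']})

# Any value with a count above 1 is what duplicated(subset='Name') flags
counts = df['Name'].value_counts()
print(sorted(counts[counts > 1].index))  # ['Arpit', 'Riya']
```

This is a handy sanity check before deciding whether to drop the flagged rows.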

Suggestion : 4

Pandas DataFrame.duplicated() is used to get/find/select a list of all duplicate rows, on all columns or on selected ones. Duplicate rows means rows having matching values on all of the compared columns. You can set keep=False in duplicated() to get all the duplicate rows without eliminating any of them.

If you are in a hurry, below are some quick examples of how to get a list of all duplicate rows in a pandas DataFrame.

# Below are quick examples
# Select duplicate rows except first occurrence based on all columns
df2 = df[df.duplicated()]

# Select all duplicate rows based on all columns
df2 = df[df.duplicated(keep=False)]

# Select duplicate rows except last occurrence based on all columns
df2 = df[df.duplicated(keep='last')]

# Get list of duplicate rows using a single column
df2 = df[df['Courses'].duplicated()]

# Get list of duplicate rows based on the 'Courses' column
df2 = df[df.duplicated('Courses')]

# Get list of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee', 'Duration']].duplicated()]

# Get list of duplicate rows based on a list of column names
df2 = df[df.duplicated(['Courses', 'Fee', 'Duration'])]
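The two single-column spellings in the list above are interchangeable; a quick sketch on toy data to confirm:

```python
import pandas as pd

df = pd.DataFrame({'Courses': ['Spark', 'PySpark', 'Spark'],
                   'Fee': [20000, 25000, 20000]})

# Masking on the Series and passing the column name to DataFrame.duplicated
# flag the same rows
a = df[df['Courses'].duplicated()]
b = df[df.duplicated('Courses')]
print(a.equals(b))  # True
```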
import pandas as pd
technologies = {
   'Courses': ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
   'Fee': [20000, 25000, 22000, 30000, 22000, 20000, 30000],
   'Duration': ['30days', '40days', '35days', '50days', '40days', '30days', '50days'],
   'Discount': [1000, 2300, 1200, 2000, 2300, 1000, 2000]
}
df = pd.DataFrame(technologies)
print(df)

Yields below output.

   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
4   Python  22000   40days      2300
5    Spark  20000   30days      1000
6   pandas  30000   50days      2000
Selecting duplicate rows except the first occurrence with df2 = df[df.duplicated()] yields:

  Courses    Fee Duration  Discount
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000

You can set 'keep=False' in the duplicated function to get all the duplicate items without eliminating duplicate rows.

# Select all duplicate rows based on all columns
df2 = df[df.duplicated(keep=False)]
print(df2)
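With keep=False the flagged rows come back in their original order, so duplicate groups can be scattered; sorting the result by the compared columns puts each group together. A sketch using the same technologies data as above:

```python
import pandas as pd

technologies = {
    'Courses': ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
    'Fee': [20000, 25000, 22000, 30000, 22000, 20000, 30000],
    'Duration': ['30days', '40days', '35days', '50days', '40days', '30days', '50days'],
    'Discount': [1000, 2300, 1200, 2000, 2300, 1000, 2000],
}
df = pd.DataFrame(technologies)

# Sort the keep=False result so each duplicate group appears together
dups = df[df.duplicated(keep=False)].sort_values(list(df.columns))
print(sorted(dups.index))  # the Spark pair (0, 5) and the pandas pair (3, 6)
```

Only rows 0/5 and 3/6 match on every column; the two Python rows differ in Duration and Discount, so they are not flagged.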