First read in only the column you want to deduplicate on, in this case 'id'. Then build a boolean mask of the rows that are NOT duplicates: DataFrame.duplicated() flags the rows that are duplicates, and ~ inverts that. This gives us our 'dupemask'.
dupemask = ~df.duplicated(subset = ['id'])
Then create an iterator to read the file in chunks. Loop over the iterator and assign a new index to each chunk so that the chunk's rows line up with their positions in the full-length 'dupemask'. That mask can then be used to keep only the rows that aren't duplicates; a fuller, self-contained sketch follows the loop below.
for i, df in enumerate(chunked_data_iterator):
df.index = range(i * chunksize, i * chunksize + len(df.index))
df = df[dupemask]
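Putting the pieces together, a minimal self-contained sketch of the whole workflow could look like the following (the file names 'data.csv' and 'deduped.csv', the 'id' column, and the chunk size are assumptions for illustration):

import pandas as pd

chunksize = 100000                      # assumed chunk size
src, dst = 'data.csv', 'deduped.csv'    # assumed input/output paths

# Pass 1: read only the 'id' column and mark the rows that are NOT duplicates.
ids = pd.read_csv(src, usecols=['id'])
dupemask = ~ids.duplicated(subset=['id'])

# Pass 2: re-read the full file in chunks, keeping only non-duplicate rows.
chunked_data_iterator = pd.read_csv(src, chunksize=chunksize)
for i, df in enumerate(chunked_data_iterator):
    # Align the chunk's index with its position in the file (and in dupemask).
    df.index = range(i * chunksize, i * chunksize + len(df.index))
    df = df[dupemask]
    # Write the header for the first chunk only, then append.
    df.to_csv(dst, mode='w' if i == 0 else 'a', header=(i == 0), index=False)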
pandas.read_csv() reads a comma-separated values (csv) file into a DataFrame, a two-dimensional data structure with labeled axes, and optionally supports iterating over or breaking the file into chunks.
>>> pd.read_csv('data.csv')
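For example, passing chunksize makes read_csv return an iterator of DataFrames instead of loading the whole file at once (the file name and chunk size here are only illustrative):

>>> reader = pd.read_csv('data.csv', chunksize=10000)
>>> for chunk in reader:
...     print(len(chunk))  # each chunk is a DataFrame of at most 10000 rows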
By using the pandas.DataFrame.drop_duplicates() method you can drop/remove/delete duplicate rows from a DataFrame, either on all columns or on a selected subset of columns. This article explains several ways to drop duplicate rows from a pandas DataFrame with examples, using DataFrame.drop_duplicates() as well as DataFrame.apply() with a lambda function.
# Below are quick examples

# Keep first duplicate row
df2 = df.drop_duplicates()

# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep = 'first')

# Keep last duplicate row
df2 = df.drop_duplicates(keep = 'last')

# Remove all duplicate rows
df2 = df.drop_duplicates(keep = False)

# Delete duplicate rows based on specific columns
df2 = df.drop_duplicates(subset = ["Courses", "Fee"], keep = False)

# Drop duplicate rows in place
df.drop_duplicates(inplace = True)

# Using DataFrame.apply() and lambda function
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset = ['Courses', 'Fee'], keep = 'first')
Below is the syntax of the DataFrame.drop_duplicates() function that removes duplicate rows from the pandas DataFrame.

# Syntax of drop_duplicates
DataFrame.drop_duplicates(subset = None, keep = 'first', inplace = False, ignore_index = False)
Now, let’s create a DataFrame with a few duplicate rows on columns. Our DataFrame contains the column names Courses, Fee, Duration, and Discount.
import pandas as pd
import numpy as np
technologies = {
'Courses': ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
'Fee': [20000, 25000, 22000, 30000, 22000, 20000, 30000],
'Duration': ['30days', '40days', '35days', '50days', '35days', '30days', '50days'],
'Discount': [1000, 2300, 1200, 2000, 1200, 1000, 2000]
}
df = pd.DataFrame(technologies)
print(df)
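Running print(df) displays the source DataFrame with its duplicate rows; with pandas' default display settings the output looks roughly like this:

   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
4   Python  22000   35days      1200
5    Spark  20000   30days      1000
6   pandas  30000   50days      2000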
You can use DataFrame.drop_duplicates() without any arguments to drop rows with the same values on all columns. It takes the default values subset=None and keep='first'. The example below returns four rows after removing the duplicate rows from our DataFrame.
# Keep first duplicate row
df2 = df.drop_duplicates()
print(df2)

# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep = 'first')
print(df2)
Yields below output.
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
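To drop duplicates on selected columns instead, the subset variant from the quick examples can be applied to the same DataFrame; with keep = False, every row that is duplicated on Courses and Fee is removed, leaving only the PySpark row (output shown roughly as pandas would print it).

# Remove all rows duplicated on the Courses and Fee columns
df2 = df.drop_duplicates(subset = ["Courses", "Fee"], keep = False)
print(df2)

   Courses    Fee Duration  Discount
1  PySpark  25000   40days      2300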
By Justin Walgran on August 30th, 2019
We tell Dedupe about the fields in our data we want to compare and the type of each field. “Type” in this context does not refer to a traditional programming language data type but rather how the data in the field should be interpreted and compared. The dedupe documentation includes a list of these types. For the OAR, our field configuration is simple:
fields = [
    {
        'field': 'country',
        'type': 'Exact'
    },
    {
        'field': 'name',
        'type': 'String'
    },
    {
        'field': 'address',
        'type': 'String'
    },
]
Here is an abbreviated example of the interactive training session for our facility data:
root@8174a7f28009:/usr/local/src/dedupe# python oar_dedupe_example.py
importing data...
starting interactive labeling...

country : kh
name : grace glory(cambodia) garment ltd
address : national road 4, prey kor village, kandal

country : kh
name : grace glory(cambodia) garment ltd
address : preykor village lum hach commune angsnoul district kandal province cambodia

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y

country : cn
name : h u zh o u ti anbao d r e ss c o ltd
address : west of bus station 318 road nanxun town huzhou city zhejiang province

country : cn
name : jiaxing realm garment fashion co.ltd.
address : no.619 shuanglong road xinfeng town nanhu district jiaxing city zhejiang 314005

1/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
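For context, a console-labeling session like the one above takes only a few calls to set up. The following is a minimal sketch, assuming the dedupe 2.x API, a hypothetical load_facilities() helper that returns a dict of record dicts keyed by ID, and an arbitrary 0.5 clustering threshold:

import dedupe

# Records keyed by ID, e.g. {'1': {'country': 'kh', 'name': '...', 'address': '...'}, ...}
# load_facilities() is a hypothetical helper, not part of the dedupe library.
data = load_facilities()

deduper = dedupe.Dedupe(fields)    # 'fields' as defined above
deduper.prepare_training(data)     # sample candidate pairs for labeling
dedupe.console_label(deduper)      # the interactive y/n/u/f session shown above
deduper.train()

# Group records into clusters of likely duplicates, with per-record confidences.
clusters = deduper.partition(data, 0.5)
for record_ids, scores in clusters:
    print(record_ids, scores)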
In production the performance of our Dedupe models has met our highest expectations. Here are a few examples of some challenging matches that our models have handled.
confidence | 0.50
name       | Manufacturing Sportwear JSC - Thai Binh Br - Fty No 8
name       | Manufacturing Sportwear JSC.(Thai Binh No.8)
address    | Xuan Quang Industrial Cluster, Dong Xuan Commune Thai Binh Thai Binh Vietnam
address    | Lot With Area 51765.9 M2 Xuan Quang Industrial Cluster Dong Xuan Commune Dong Hung District Thai Binh
Above is an example of a low-scoring match that would be presented to the contributor for confirmation. The trained model picked up on the similarities between the significantly different addresses, but we need some additional domain expertise to confirm that these are likely two distinct facilities.
confidence | 0.61
name       | Orient Craft Limited
name       | Orient Craft Ltd(Freshtex)
address    | Plot No.15, Sector 5, IMT Manesar 122050 Gurgaon Gurgaon Haryana
address    | Plot No 15 Sector - 5 Imt Manesar Gurgaon - 122050
The relatively low score on these very similar records highlights the fact that our model has learned that names with a suffix have an increased probability of not matching.
confidence | 0.89
name       | Anhui Footforward Socks Co., Ltd.
name       | Anhui Footforward Socks Co.Ltd.
address    | West Baiyang Road, Economic Development Zone, , Huaibei, Suixi, Anhui
address    | West Baiyang Road, Suixi Economic Development Zone, Huaibei, Anhui, China