Apparel brands, multi-stakeholder initiatives, and other participants in the supply chain maintain their own lists of facilities. These lists are often compiled by staff without standard formatting and can often contain duplicate or otherwise "dirty" data. When attempting to compare lists compiled by different organizations and find the facilities common to them, we are quickly confronted with several issues.

Dedupe is a Python library that uses supervised machine learning and statistical techniques to efficiently identify multiple references to the same real-world entity. Dedupe takes its name from its primary application: looking through a single set of records and attempting to find duplicates. The workflow of the OAR involves comparing a new set of records submitted by a contributor to an existing set of mapped facilities. Dedupe has first-class support for this type of workflow through its gazetteer matcher.

Amazingly, there is not a great deal of configuration required beyond a list of fields. While choosing field types is important, the power and efficiency of the library is derived not from static configuration but from interactive training. Dedupe can match large lists accurately because it uses blocking and active learning to intelligently reduce the amount of work required.

We tell Dedupe about the fields in our data we want to compare and the type of each field. "Type" in this context does not refer to a traditional programming language data type but to how the data in the field should be interpreted and compared. The dedupe documentation includes a list of these types. For the OAR, our field configuration is simple:
fields = [
    {'field': 'country', 'type': 'Exact'},
    {'field': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
]
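To build intuition for what these types mean, here is a rough stand-in for how an 'Exact' field and a 'String' field might each contribute to a comparison. This is an illustrative sketch only: dedupe's real comparators (such as affine gap string distance) and its learned weights are more sophisticated, and `compare` is an invented helper, not part of the dedupe API.

```python
from difflib import SequenceMatcher

def compare(record_a, record_b, fields):
    # Hypothetical sketch: an 'Exact' field compares as 0/1, while a
    # 'String' field yields a fuzzy similarity ratio via difflib.
    scores = {}
    for f in fields:
        a, b = record_a[f['field']], record_b[f['field']]
        if f['type'] == 'Exact':
            scores[f['field']] = 1.0 if a == b else 0.0
        else:  # 'String'
            scores[f['field']] = SequenceMatcher(None, a, b).ratio()
    return scores

fields = [
    {'field': 'country', 'type': 'Exact'},
    {'field': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
]

a = {'country': 'kh', 'name': 'grace glory (cambodia) garment ltd',
     'address': 'national road 4, prey kor village, kandal'}
b = {'country': 'kh', 'name': 'grace glory cambodia garment ltd',
     'address': 'preykor village, kandal province, cambodia'}
print(compare(a, b, fields))
```

The exact country match contributes a clean 1.0, while the two string fields produce graded similarities that a trained model can weigh.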
Here is an abbreviated example of the interactive training session for our facility data:
root@8174a7f28009:/usr/local/src/dedupe# python oar_dedupe_example.py
importing data...
starting interactive labeling...

country : kh
name : grace glory(cambodia) garment ltd
address : national road 4, prey kor village, kandal

country : kh
name : grace glory(cambodia) garment ltd
address : preykor village lum hach commune angsnoul district kandal province cambodia

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y

country : cn
name : huzhou tianbao dress co ltd
address : west of bus station 318 road nanxun town huzhou city zhejiang province

country : cn
name : jiaxing realm garment fashion co.ltd.
address : no.619 shuanglong road xinfeng town nanhu district jiaxing city zhejiang 314005

1/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
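Blocking, mentioned earlier, is what keeps this tractable: rather than comparing every record to every other, only records that share a blocking key are compared at all. A minimal illustration of the idea follows; the key used here (country plus the first word of the name) is invented for this sketch, whereas dedupe learns effective blocking predicates during training.

```python
from collections import defaultdict
from itertools import combinations

records = {
    1: {'country': 'kh', 'name': 'grace glory garment ltd'},
    2: {'country': 'kh', 'name': 'grace glory (cambodia) garment ltd'},
    3: {'country': 'cn', 'name': 'jiaxing realm garment fashion co ltd'},
    4: {'country': 'cn', 'name': 'huzhou tianbao dress co ltd'},
}

# Group records by a simple blocking key so that only records sharing
# a key are ever compared against each other.
blocks = defaultdict(list)
for record_id, record in records.items():
    key = (record['country'], record['name'].split()[0])
    blocks[key].append(record_id)

candidate_pairs = [pair
                   for ids in blocks.values()
                   for pair in combinations(sorted(ids), 2)]
print(candidate_pairs)  # [(1, 2)] -- six possible pairs reduced to one
```

Even on four records the all-pairs count drops from six to one; on lists of tens of thousands of facilities, this reduction is what makes matching feasible.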
In production the performance of our Dedupe models has met our highest expectations. Here are a few examples of some challenging matches that our models have handled.
confidence | 0.50
name       | Manufacturing Sportwear JSC - Thai Binh Br - Fty No 8
name       | Manufacturing Sportwear JSC.(Thai Binh No.8)
address    | Xuan Quang Industrial Cluster, Dong Xuan Commune Thai Binh Thai Binh Vietnam
address    | Lot With Area 51765.9 M2 Xuan Quang Industrial Cluster Dong Xuan Commune Dong Hung District Thai Binh
Above is an example of a low-scoring match that would be presented to the contributor for confirmation. The trained model picked up on the similarities between the significantly different addresses, but we need some additional domain expertise to confirm whether these are in fact two distinct facilities.
confidence | 0.61
name       | Orient Craft Limited
name       | Orient Craft Ltd(Freshtex)
address    | Plot No.15, Sector 5, IMT Manesar 122050 Gurgaon Gurgaon Haryana
address    | Plot No 15 Sector - 5 Imt Manesar Gurgaon - 122050
The relatively low score on these very similar records highlights the fact that our model has learned that names with a suffix, such as "(Freshtex)" here, have an increased probability of not matching.
confidence | 0.89
name       | Anhui Footforward Socks Co., Ltd.
name       | Anhui Footforward Socks Co.Ltd.
address    | West Baiyang Road, Economic Development Zone, Huaibei, Suixi, Anhui
address    | West Baiyang Road, Suixi Economic Development Zone, Huaibei, Anhui, China
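In a workflow like the OAR's, scores such as these feed a simple routing decision: accept high-confidence matches automatically and send the rest to a human for confirmation. A sketch of that routing, using the three examples above (the 0.80 cutoff is purely illustrative, not the OAR's actual threshold):

```python
AUTO_ACCEPT = 0.80  # illustrative cutoff, not a value from the OAR

matches = [
    {'name': 'Manufacturing Sportwear JSC.(Thai Binh No.8)', 'confidence': 0.50},
    {'name': 'Orient Craft Ltd(Freshtex)', 'confidence': 0.61},
    {'name': 'Anhui Footforward Socks Co.Ltd.', 'confidence': 0.89},
]

# Route high-confidence matches straight through; queue the rest
# for confirmation by the contributor.
accepted = [m for m in matches if m['confidence'] >= AUTO_ACCEPT]
needs_review = [m for m in matches if m['confidence'] < AUTO_ACCEPT]

print(len(accepted), len(needs_review))  # 1 2
```

Choosing the cutoff is a precision/recall trade-off: a higher threshold sends more borderline pairs to reviewers, while a lower one risks auto-accepting false matches.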
dedupe is a Python library that uses machine learning to perform fuzzy matching, deduplication, and entity resolution quickly on structured data. Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation, Learnable Similarity Functions and their Application to Record Linkage and Clustering.
pip install dedupe
mkvirtualenv dedupe
git clone git://github.com/dedupeio/dedupe.git
cd dedupe
pip install "numpy>=1.9"
pip install -r requirements.txt
cython src/*.pyx
pip install -e .
pytest

workon dedupe
python -m pip install -e ./benchmarks
python benchmarks/benchmarks/canonical.py
python benchmarks/benchmarks/canonical_matching.py
dedupe is a library that uses machine learning to perform de-duplication and entity resolution quickly on structured data. dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.

- machine learning - reads in human-labeled data to automatically create optimum weights and blocking rules
- csvdedupe - a command line tool for de-duplicating and linking CSV files. Read about it on Source Knight-Mozilla OpenNews.
pip install dedupe
By Prasad Kulkarni, Sep 11, 2021
- pandas_dedupe
- recordlinkage

pip install recordlinkage
pip install pandas_dedupe
This is a built-in dataset of the recordlinkage library. Let’s read the febrl data into a dataframe.
from recordlinkage.datasets import load_febrl1

df_febrl = load_febrl1()
df_febrl.head()
Here is the training code:
import pandas_dedupe

df_febrl_dedup = pandas_dedupe.dedupe_dataframe(
    df_febrl,
    list(df_febrl.columns),
    canonicalize=True,
    sample_size=1,
)
Once the training is finished, let’s get the cluster id and confidence from the trained results:
df_febrl_dedup_final = df_febrl_dedup[[
    'given_name',
    'surname',
    'street_number',
    'address_1',
    'address_2',
    'suburb',
    'postcode',
    'state',
    'date_of_birth',
    'soc_sec_id',
    'cluster id',
    'confidence',
]]
df_febrl_dedup_final.sort_values(['cluster id']).head(50)
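Once `cluster id` and `confidence` are attached, rows that share a cluster id were judged to be the same person, so duplicates can be surfaced by grouping on that column. A self-contained sketch with toy data (the column names follow pandas_dedupe's output convention; the records themselves are invented):

```python
import pandas as pd

# Toy frame shaped like pandas_dedupe output: rows sharing a
# 'cluster id' were judged to refer to the same person.
df = pd.DataFrame({
    'given_name': ['mitchell', 'mitchell', 'harley', 'harley'],
    'surname': ['green', 'green', 'mccarthy', 'mccarthy'],
    'cluster id': [0, 0, 1, 1],
    'confidence': [0.97, 0.95, 0.99, 0.88],
})

# Any cluster with more than one member contains duplicates.
sizes = df.groupby('cluster id').size()
duplicates = df[df['cluster id'].isin(sizes[sizes > 1].index)]
print(duplicates.sort_values(['cluster id', 'confidence'],
                             ascending=[True, False]))
```

Sorting by cluster id and descending confidence, as the snippet above does, puts the most certain member of each cluster first, which is convenient when picking a canonical record.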
This code demonstrates how to use dedupe with a comma separated values (CSV) file. All operations are performed in memory, so it will run very quickly on datasets up to ~10,000 rows.

We start with a CSV file containing our messy data. In this example, it is listings of early childhood education centers in Chicago compiled from several different sources.

partition will return sets of records that dedupe believes are all referring to the same entity.

Read in our data from the CSV file and create a dictionary of records, where the key is a unique record ID and each value is a dict of the record's fields.
import os
import csv
import re
import logging
import optparse

import dedupe
from unidecode import unidecode
def preProcess(column):
    """Do a little bit of data cleaning: collapse whitespace and
    newlines, strip quotes, and lowercase. Unidecode transliterates
    any non-ASCII characters."""
    column = unidecode(column)
    column = re.sub(' +', ' ', column)
    column = re.sub('\n', ' ', column)
    column = column.strip().strip('"').strip("'").lower().strip()
    if not column:
        column = None
    return column
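To see what this normalization buys us, here is the same sequence of steps applied to a messy value. The unidecode transliteration is omitted so this sketch needs no third-party package; `normalize` is a stand-in name, and the sample input is invented.

```python
import re

def normalize(column):
    # Same steps as preProcess, minus the unidecode transliteration:
    # collapse runs of spaces and newlines, strip surrounding quotes,
    # lowercase, and map empty strings to None.
    column = re.sub(' +', ' ', column)
    column = re.sub('\n', ' ', column)
    column = column.strip().strip('"').strip("'").lower().strip()
    return column or None

print(normalize('  "Chicago  Early\nLearning"  '))  # chicago early learning
print(normalize('   '))  # None
```

Mapping empty strings to None matters downstream: dedupe treats None as a missing value rather than comparing against an empty string.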
def readData(filename):
    """Read in our data from a CSV file and create a dictionary of
    records, where the key is a unique record ID and each value is
    a dict of cleaned field values."""
    data_d = {}
    with open(filename) as f:
        reader = csv.DictReader(f)
        for row in reader:
            clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
            row_id = int(row['Id'])
            data_d[row_id] = dict(clean_row)
    return data_d
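readData produces the `{record id: record dict}` shape that dedupe expects as input. A self-contained illustration with an in-memory CSV (the rows are invented, and a trivial strip-and-lowercase cleaner stands in for preProcess so the sketch has no unidecode dependency):

```python
import csv
import io

# An in-memory stand-in for the messy CSV file.
raw = io.StringIO(
    'Id,Site name,Address\n'
    '1,"Chicago Early Learning","123 W Example St"\n'
    '2,"CHICAGO EARLY LEARNING","123 West Example Street"\n'
)

# Build {record id: record dict}, the shape dedupe's readers produce.
data_d = {}
reader = csv.DictReader(raw)
for row in reader:
    clean_row = {k: (v.strip().lower() or None) for k, v in row.items()}
    data_d[int(row['Id'])] = clean_row

print(data_d[1])
print(data_d[2])
```

After cleaning, records 1 and 2 differ only in the abbreviated address, exactly the kind of near-duplicate pair dedupe's partition step is meant to catch.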
if __name__ == '__main__':