In the previous blog in this series (Part 1), I discussed the overall concepts of deduplication, available tools, and the challenges involved in solving problems where we only have short text as names.
In this blog, I will delve into how we can use custom machine-learning methods to group similar company names when no other contextual information is available, or when the available context is inconsistent.
The overall algorithmic pipeline will resemble other open-source solutions in the literature, incorporating processing steps such as indexing, pairing, feature extraction, pair binary classification, and grouping.
- Analysis and Name Cleaning
- Pairing
- Extracting features
- Model Training & Inferencing
- Grouping
- Evaluation
- Productionalization
By constructing a modular pipeline, we can employ more suitable and highly customizable steps, rather than relying solely on the configurations provided by the listed frameworks. Custom code in each step lets us improve the accuracy and performance of each component independently.
Preprocessing
While text processing may require different steps depending on variations in the data, some commonly applicable steps are listed below.
To detect and clean industry/sector specific keywords:
- Split each name into words and compute the frequency of each word across all names
- Inspect the most frequent words and filter them out as required
To filter and keep only relevant name portions:
- Parse using probablepeople or python-nameparser
For cleaning location entities:
- Use iso3166 package for country name removal
- Use the usaddress package to parse addresses and remove address variations
For correcting spelling mistakes:
- Use TextBlob package
This is the step where we will spend the majority of our time. Enhanced cleaning will help improve the similarity scores of related records and eliminate noise from irrelevant portions. It’s important to note that under-cleaning is preferable to over-cleaning, as over-cleaning can merge unrelated texts into the same group because too little remains to differentiate them.
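As a rough illustration, here is a minimal sketch of a few of the cleaning steps listed above, assuming the data sits in a pandas DataFrame with a name column; the sample names, stopword list, and token-level country matching are purely illustrative (multi-word country names would need phrase matching).

```python
import re
from collections import Counter

import pandas as pd
from iso3166 import countries

# Hypothetical input data with a raw "name" column
df = pd.DataFrame({"name": ["Acme Pharma Holdings Inc.",
                            "Acme Pharma Holdings Incorporated",
                            "Beta Chemicals GmbH Germany"]})

# 1. Word-frequency analysis to spot industry/sector keywords worth removing
word_counts = Counter(w for name in df["name"] for w in name.lower().split())
print(word_counts.most_common(10))  # inspect manually and build a stopword list

custom_stopwords = {"inc.", "incorporated", "gmbh", "holdings"}  # example list
country_names = {c.name.lower() for c in countries}  # single-token match only

def clean_name(name: str) -> str:
    """Lowercase, drop stopword/country tokens, and strip punctuation."""
    tokens = [t for t in name.lower().split()
              if t not in custom_stopwords and t not in country_names]
    return re.sub(r"[^a-z0-9 ]", "", " ".join(tokens)).strip()

df["clean_name"] = df["name"].apply(clean_name)
print(df["clean_name"].tolist())  # both Acme rows collapse to "acme pharma"
```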
Pairing
This step is essential for generating candidate record pairs for feature extraction. It significantly reduces computational complexity from a quadratic N² to a more manageable N × M (with M being the window size), which is linear in N for a fixed window.
Various other options are available, such as Q-gram-based indexing, suffix-array-based indexing, Sorted Neighbourhood, canopy clustering, and mapping-based indexing. For Sorted Neighbourhood blocking, we can use the recordlinkage package or Magellan (py_entitymatching).
While these techniques can significantly reduce the search space and computational complexity, we may observe the exclusion of some important pairs in both training and test data, leading to a loss of accuracy. The quality of pairs can also be a challenge if the key is not constructed correctly with the appropriate cleaning and processing. I recommend conducting a key-pairs loss analysis and comparing performance at various thresholds to minimize this loss.
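A minimal sketch of Sorted Neighbourhood blocking with the recordlinkage package; the clean_name column and window size are assumptions for illustration:

```python
import pandas as pd
import recordlinkage

df = pd.DataFrame({"clean_name": ["acme pharma", "acme pharm", "beta chemicals",
                                  "beta chemical", "gamma retail"]})

# Sorted Neighbourhood: sort on the key column and only pair records
# that fall within a sliding window over the sorted order
indexer = recordlinkage.Index()
indexer.sortedneighbourhood("clean_name", window=5)

candidate_pairs = indexer.index(df)  # MultiIndex of (record_i, record_j) pairs
print(len(candidate_pairs), "candidate pairs vs.", len(df) * (len(df) - 1) // 2, "full pairs")
```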
Feature Extraction
We can use the textdistance package, which contains 30+ algorithms spanning edit-based, sequence-based, and phonetic similarity functions, plus simple measures such as prefix/suffix and exact match.
Additional similarity functions can be explored in the python-string-similarity package.
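A minimal sketch of turning one candidate pair into a feature vector with textdistance; the specific measures below are just examples of the available algorithm families:

```python
import textdistance

def pair_features(a: str, b: str) -> dict:
    """A few edit-based, token-based, and simple similarities for one pair."""
    return {
        "jaro_winkler": textdistance.jaro_winkler.normalized_similarity(a, b),
        "levenshtein": textdistance.levenshtein.normalized_similarity(a, b),
        "jaccard_tokens": textdistance.jaccard.normalized_similarity(a.split(), b.split()),
        "prefix": textdistance.prefix.normalized_similarity(a, b),
        "exact": float(a == b),
    }

print(pair_features("acme pharma holdings", "acme pharma"))
```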
Some selected features may not improve performance but can consume significant computational resources. These can be eliminated post-training using feature-importance plots or sensitivity analysis.
While structural similarity can be quite helpful, especially for longer texts, short names may be expressed as synonyms, which requires understanding their meanings. Semantic similarity using language models (e.g., Transformers) can help in these cases by providing richer deep feature representations.
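One way to add such a semantic feature is with a pretrained sentence-embedding model, for example via the sentence-transformers package; the model name below is only a commonly used default, not a recommendation specific to this pipeline:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example pretrained model

emb = model.encode(["international business machines", "ibm corporation"],
                   convert_to_tensor=True)
semantic_sim = util.cos_sim(emb[0], emb[1]).item()  # add as an extra pair feature
print(semantic_sim)
```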
Model Training
We can experiment with various ML algorithms, from traditional to deep learning ones. But, given that we have already extracted the features, starting with a classical ensemble model like RandomForest or XGBoost is a good option.
We must carefully split the dataset to capture diverse variations across domains in the training data while ensuring that the test data remains separate from the training set.
We can evaluate the model’s performance using precision metrics and fine-tune it to reduce false positives.
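A minimal sketch of training and evaluating such a pair classifier; the placeholder data, column names, and group-aware split are assumptions, used here to illustrate keeping related records out of the test set:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GroupShuffleSplit

feature_cols = ["jaro_winkler", "levenshtein", "jaccard_tokens", "prefix", "semantic_sim"]

# Placeholder pair data; in practice this comes from the feature-extraction step
rng = np.random.default_rng(0)
pairs_df = pd.DataFrame({c: rng.random(500) for c in feature_cols})
pairs_df["is_match"] = (pairs_df["jaro_winkler"] > 0.7).astype(int)
pairs_df["domain"] = rng.integers(0, 25, 500)  # true group / domain of each pair

# Group-aware split so records from the same domain don't leak into the test set
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(pairs_df, groups=pairs_df["domain"]))
train, test = pairs_df.iloc[train_idx], pairs_df.iloc[test_idx]

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(train[feature_cols], train["is_match"])

proba = clf.predict_proba(test[feature_cols])[:, 1]
preds = proba >= 0.5  # raise the threshold to trade recall for fewer false positives
print("precision:", precision_score(test["is_match"], preds),
      "recall:", recall_score(test["is_match"], preds))
```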
Grouping
Why do we need grouping? The pairing step before feature extraction transformed the data domain from the original N records to N×M candidate pairs (with M being the window size). However, we need our results on the original per-record basis. To achieve this, we create groups from the ML predictions, taking into account the predicted probability of similarity between pairs.
To remap the pairs from the N² or N×M pair domain back to individual records, we can employ graph-partitioning methods or clustering techniques for community detection. Additionally, we can utilize algorithms available in packages like dedupe (such as hierarchical clustering) or recordlinkage (including K-Means clustering).
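A minimal sketch of one simple remapping approach, connected components over the pair-prediction graph using networkx; the column names and the 0.5 threshold are assumptions:

```python
import networkx as nx
import pandas as pd

# Hypothetical output of ML inference: one row per candidate pair
pair_preds = pd.DataFrame({
    "left":  ["acme pharma", "acme pharm", "beta chemicals"],
    "right": ["acme pharm", "acme pharma holdings", "beta chemical"],
    "match_proba": [0.95, 0.88, 0.91],
})

graph = nx.Graph()
graph.add_nodes_from(pd.concat([pair_preds["left"], pair_preds["right"]]))
matches = pair_preds[pair_preds["match_proba"] >= 0.5]
graph.add_edges_from(zip(matches["left"], matches["right"]))

# Each connected component becomes one predicted group
groups = {name: gid for gid, comp in enumerate(nx.connected_components(graph))
          for name in comp}
print(groups)  # record name -> group id
```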
Evaluation
In this step, we evaluate the overall performance of the pipeline in grouping similar records and report metrics as % of cases where the predicted group name matches the actual one. Additionally, we can report at the group level, whether the group content matches exactly or partially, even if the group head names don’t match.
To determine the confidence probability of each group, consider the strength of the weakest link within each group. This approach will allow us to calibrate the threshold according to the required precision.
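A minimal sketch of the weakest-link idea; the pair table below, with a precomputed group id per pair, is hypothetical:

```python
import pandas as pd

# Hypothetical intra-group pairs after grouping, with their predicted probabilities
pair_preds = pd.DataFrame({
    "left":  ["acme pharma", "acme pharm", "beta chem"],
    "right": ["acme pharm", "acme labs", "beta chemicals"],
    "match_proba": [0.95, 0.62, 0.91],
    "group": [1, 1, 2],
})

# Weakest link: a group's confidence is the minimum probability among its pairs
group_confidence = pair_preds.groupby("group")["match_proba"].min()
print(group_confidence)  # e.g. group 1 -> 0.62; calibrate against the required precision
```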
Productionalization
Creating a first-cut POC is easy, but deduplication usually doesn’t get solved in one go. Businesses have varied thresholds for the acceptable accuracy of the algorithms in production. Achieving that accuracy in a specific context necessitates robust testing to ensure generalization across various domains.
To analyze which of these groups require more attention, we can compare the number of names in the expected group to the ML-predicted groups and check where names are common to both groups. We can further segregate matches based on cases where the group content is the same and cases where only the name matches but the group content is not the same. Below are some cases we can examine.
- Check where group content matches but the expected and predicted names don’t match
- Check which predicted groups have significantly different sizes than expected
- Check which group sizes are similar but constituent names are different
- Check which original groups are split by ML
- Check which groups are merged by ML
Split-and-merge analysis will enable us to detect whether we are over-cleaning or whether more cleaning is required.
During POCs, we can start by annotating a subset for initial ML and evaluation. However, when we delve into the entire category, we may discover cases that were incorrectly annotated due to the underlying assumptions of certain fields not always holding true. Consequently, we may need to review annotations multiple times based on learned consistency patterns.
In production, while it’s possible to predict the group of a single new record by sampling a few records from existing groups and creating pairs for ML features, it is best achieved through batch predictions for all records and then creating a dictionary from the outputs. The dictionary can be updated periodically based on the frequency of new record arrivals. Usually, a weekly batch job or on-demand updating of the dictionary is required for support.
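A minimal sketch of how such a lookup dictionary might be used at prediction time; the cleaning function, dictionary contents, and fallback path are all assumptions:

```python
import re

# Built by the periodic batch job: cleaned record name -> canonical group name
group_lookup = {
    "acme pharma": "ACME PHARMA",
    "acme pharm": "ACME PHARMA",
    "beta chemicals": "BETA CHEMICALS",
}

def clean_name(name: str) -> str:
    # Must match the cleaning used when the dictionary was built
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

def resolve_group(raw_name: str) -> str | None:
    key = clean_name(raw_name)
    if key in group_lookup:
        return group_lookup[key]
    # Fallback for unseen names: pair against sampled members of existing groups,
    # score with the ML classifier, and queue the record for the next batch refresh
    return None

print(resolve_group("Acme Pharma"))      # -> "ACME PHARMA"
print(resolve_group("Acme Pharma Inc"))  # -> None until the next dictionary refresh
```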
Detecting Grouping Issues
To check why a given name is part of a predicted group, we can first list the names predicted to be similar to each record in the test data by querying the pair-prediction output of ML inference for rows where the record appears on either side of the pair.
To find which groups were split by ML, group by each true group and check which ones span more than one predicted group. Similarly, to identify which groups were merged by ML, group by each predicted group and check which ones contain more than one true group, as in the sketch below. Initially, we should focus on cases where the expected group size is one to detect merging issue patterns.
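A minimal sketch of these split/merge checks with pandas groupby, assuming a results table with one row per record and hypothetical true_group / pred_group columns:

```python
import pandas as pd

results = pd.DataFrame({
    "name":       ["acme pharma", "acme pharm", "acme labs", "beta chem", "beta chemicals"],
    "true_group": ["ACME",        "ACME",       "ACME",      "BETA",      "BETA"],
    "pred_group": [1,             1,            2,           3,           3],
})

# Split: a true group whose records land in more than one predicted group
splits = results.groupby("true_group")["pred_group"].nunique()
print("split groups:", splits[splits > 1].index.tolist())   # -> ['ACME']

# Merge: a predicted group containing records from more than one true group
merges = results.groupby("pred_group")["true_group"].nunique()
print("merged groups:", merges[merges > 1].index.tolist())  # -> []
```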
Sometimes the ML prediction can be correct but there can be labelling mistakes. These issues can be detected by examining predicted groups with high probabilities and manually validating the manual group against nearby manual group names sorted alphabetically. Sorting is one of the best tools for detecting issues, and it can also be applied to predicted group names to surface inconsistent manual groups.
Best Practices / Helpful Tips:
The complexity of the solution will depend on the complexity and variability of the data, but a structured approach can expedite the path to production.
- Data cleaning delivers the most significant gains, but a proper analysis of the variations is a prerequisite
- Prioritize tasks based on the largest gain in accuracy with minimal coding efforts
- Track experiments by committing the changes for each experiment to git with descriptive messages
- Keep the code clean and review it well before merging into the git main branch
- To find possible annotation issues, use predictions of multiple algorithms, and check if the majority vote matches the labels.
Conclusion
Variations in data across industries always trigger new challenges. But the suggested method can assist you in gaining fresh and practical insights into the potential for duplications based solely on textual data fields, particularly for names of organizations, individuals, or products. ML-based solutions can learn intricate patterns and excel when tuned meticulously through structured analysis.