Thursday, August 17, 2023

Address similarity detection using bags of words in Python


Principles

Bags of words allow texts to be compared by turning each text into a vector of word counts. The distance between each pair of vectors can then be computed to measure how close the texts are. Finally, we connect the close elements in a graph so that there is a formal relation between them, making it explicit that they are similar.
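To make the principle concrete, here is a small self-contained sketch on two toy addresses (invented for the example, not taken from any dataset):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Two toy addresses that share most of their words
toy = ["12 rue de la paix paris france",
       "12 rue de la paix 75002 paris france"]

# Turn each address into a vector of word counts (its bag of words)
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(toy)

# A cosine distance close to 0 means the two addresses are nearly identical
print(cosine_distances(vectors)[0, 1])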
 


 

Loading the dataset

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

# I use a corpus that is not the real address data, in order not to divulge the dataset I used
corpus = pd.read_csv("addresses.csv")

Preprocessing data

Before converting the addresses into bags of words, we need to perform some transformations. Typically, our dataset contains addresses from various countries, and the country name can be spelled differently depending on the language of the emitting country.
 

  
import re
# Read a file where countries are mapped to their English translation
countries = pd.read_csv("countries.csv", index_col=0)

# Select the countries for which a substitution is needed
replacements = countries[countries.originalcountry != countries.tobename].copy()
replacements.originalcountry = "\\s" + replacements.originalcountry + "[\\s$]"
replacements.tobename = " " + replacements.tobename

replacements = replacements.set_index('originalcountry')
replacements.tobename = replacements.tobename.replace("\\s", "", regex=True)
# The replacement function needs a dictionary, so we take that representation of the dataframe
replacements = replacements.tobename.to_dict()
# Add the company-type abbreviations to the replacements
replacements.update({"s.a.r.l": "sarl"})
replacements.update({"s.a.u": "sau"})
replacements.update({"s.a.s": "sas"})
replacements.update({"s.a": "sa"})
replacements.update({"limited": "ltd"})

# The classic recipe to combine all replacements into a single compiled pattern
rep = dict((re.escape(k), v) for k, v in replacements.items())
# Python 3 renamed dict.iteritems to dict.items, so use rep.items() on recent versions
pattern = re.compile("|".join(rep.keys()))
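The snippet above builds the combined pattern, but the substitution step itself is not shown. A minimal sketch of that missing step, assuming the address text sits in a column named "address" (a hypothetical name, adapt it to the real dataset): because the keys of replacements already embed regular-expression fragments (the "\s" prefix), they can be handed directly to pandas as regex patterns.

# Lowercase the addresses so the lowercase abbreviation patterns can match,
# then apply every replacement; "address" is an assumed column name
corpus["address"] = (corpus["address"]
                     .str.lower()
                     .replace(replacements, regex=True))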

    

Converting into bags of words



# The vectorizer converts the sentences into a bag-of-words representation
vectorizer = CountVectorizer(strip_accents="ascii")
# fit_transform expects an iterable of strings, so we pass the address column
# ("address" is an assumed column name: adapt it to the actual dataset)
features = vectorizer.fit_transform(corpus["address"]).todense()
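A quick sanity check that can be useful at this point: inspect the shape of the matrix and a few tokens of the learned vocabulary to make sure the addresses were tokenised as expected.

# Number of addresses and size of the vocabulary
print(features.shape)
# First tokens extracted by the vectorizer
# (on scikit-learn older than 1.0, use get_feature_names() instead)
print(vectorizer.get_feature_names_out()[:20])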
  
  

Computing the distance

from sklearn.metrics.pairwise import pairwise_distances

distances = pairwise_distances(features, metric='cosine')
np.save("cosinedistances", distances)
print("Save complete")

Identifying the synonyms 
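Following the graph idea from the principles section, one way to identify the synonyms is to connect two addresses whenever their cosine distance is below a threshold, then read each connected component of that graph as a group of equivalent addresses. A minimal sketch, where the 0.3 threshold is purely illustrative:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

distances = np.load("cosinedistances.npy")

# Build an adjacency matrix: two addresses are connected when they are close enough
threshold = 0.3
adjacency = csr_matrix(distances < threshold)

# Each connected component is a group of addresses considered equivalent
n_groups, labels = connected_components(adjacency, directed=False)
print(n_groups, "groups of similar addresses")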

Considerations on performance
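Computing and storing the full pairwise distance matrix grows quadratically with the number of addresses, in both time and memory. One possible way to keep it tractable is to keep the bag-of-words matrix sparse and only query neighbours within a radius, for example with scikit-learn's NearestNeighbors; the radius below is illustrative, and "address" is the same assumed column name as above.

from sklearn.neighbors import NearestNeighbors

# Keep the bag-of-words matrix sparse instead of calling .todense()
sparse_features = vectorizer.transform(corpus["address"])

nn = NearestNeighbors(metric="cosine")
nn.fit(sparse_features)

# For each address, retrieve only the neighbours within the radius,
# instead of materialising the full n x n distance matrix
neighbor_distances, neighbor_indices = nn.radius_neighbors(radius=0.3)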


 

