Thursday, August 17, 2023

Address similarity detection using bags of words in Python


Principles

Bags of words allow texts to be compared by turning each text into a vector of word counts. The distance between each pair of vectors can then be computed to measure how close the texts are. Finally, we connect the close elements in a graph so that there is a formal relation between them, making it explicit that they are similar.
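To make the principle concrete, here is a small self-contained sketch on two toy addresses (invented for the example, not taken from any dataset):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Two toy addresses that share most of their words
toy = ["12 rue de la paix paris france",
       "12 rue de la paix 75002 paris france"]

# Turn each address into a vector of word counts (its bag of words)
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(toy)

# A cosine distance close to 0 means the two addresses are nearly identical
print(cosine_distances(vectors)[0, 1])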
 


 

Loading the dataset

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

# I use a corpus that is not the real address data, in order not to divulge the dataset I used
corpus = pd.read_csv("addresses.csv")

Preprocessing data

Before converting the addresses into bags of words, we need to perform some transformations. Typically, our dataset contains addresses from various countries, and the country name can be spelled differently depending on the language of the emitting country.
 

  
import re
# Read a file where countries are mapped to their English translation
countries = pd.read_csv("countries.csv", index_col=0)

# Select the countries for which a substitution is needed
replacements = countries[countries.originalcountry != countries.tobename].copy()
replacements.originalcountry = "\\s" + replacements.originalcountry + "[\\s$]"
replacements.tobename = " " + replacements.tobename

replacements = replacements.set_index('originalcountry')
replacements.tobename = replacements.tobename.replace("\\s", "", regex=True)
# The replacement function needs a dictionary, so we take that representation of the dataframe
replacements = replacements.tobename.to_dict()
# Add the company-type abbreviations to the replacements
replacements.update({"s.a.r.l": "sarl"})
replacements.update({"s.a.u": "sau"})
replacements.update({"s.a.s": "sas"})
replacements.update({"s.a": "sa"})
replacements.update({"limited": "ltd"})

# The classic recipe to combine all replacements into a single compiled pattern
rep = dict((re.escape(k), v) for k, v in replacements.items())
# Python 3 renamed dict.iteritems to dict.items, so use rep.items() on recent versions
pattern = re.compile("|".join(rep.keys()))
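The snippet above builds the combined pattern, but the substitution step itself is not shown. A minimal sketch of that missing step, assuming the address text sits in a column named "address" (a hypothetical name, adapt it to the real dataset): because the keys of replacements already embed regular-expression fragments (the "\s" prefix), they can be handed directly to pandas as regex patterns.

# Lowercase the addresses so the lowercase abbreviation patterns can match,
# then apply every replacement; "address" is an assumed column name
corpus["address"] = (corpus["address"]
                     .str.lower()
                     .replace(replacements, regex=True))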

    

Converting into bags of words



# The vectorizer converts the sentences into a bag-of-words representation
vectorizer = CountVectorizer(strip_accents="ascii")
# fit_transform expects an iterable of strings, so we pass the address column
# ("address" is an assumed column name: adapt it to the actual dataset)
features = vectorizer.fit_transform(corpus["address"]).todense()
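A quick sanity check that can be useful at this point: inspect the shape of the matrix and a few tokens of the learned vocabulary to make sure the addresses were tokenised as expected.

# Number of addresses and size of the vocabulary
print(features.shape)
# First tokens extracted by the vectorizer
# (on scikit-learn older than 1.0, use get_feature_names() instead)
print(vectorizer.get_feature_names_out()[:20])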
  
  

Computing the distance

from sklearn.metrics.pairwise import pairwise_distances

distances = pairwise_distances(features, metric='cosine')
np.save("cosinedistances", distances)
print("Save complete")

Identifying the synonyms 
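Following the graph idea from the principles section, one way to identify the synonyms is to connect two addresses whenever their cosine distance is below a threshold, then read each connected component of that graph as a group of equivalent addresses. A minimal sketch, where the 0.3 threshold is purely illustrative:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

distances = np.load("cosinedistances.npy")

# Build an adjacency matrix: two addresses are connected when they are close enough
threshold = 0.3
adjacency = csr_matrix(distances < threshold)

# Each connected component is a group of addresses considered equivalent
n_groups, labels = connected_components(adjacency, directed=False)
print(n_groups, "groups of similar addresses")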

Considerations on performance
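Computing and storing the full pairwise distance matrix grows quadratically with the number of addresses, in both time and memory. One possible way to keep it tractable is to keep the bag-of-words matrix sparse and only query neighbours within a radius, for example with scikit-learn's NearestNeighbors; the radius below is illustrative, and "address" is the same assumed column name as above.

from sklearn.neighbors import NearestNeighbors

# Keep the bag-of-words matrix sparse instead of calling .todense()
sparse_features = vectorizer.transform(corpus["address"])

nn = NearestNeighbors(metric="cosine")
nn.fit(sparse_features)

# For each address, retrieve only the neighbours within the radius,
# instead of materialising the full n x n distance matrix
neighbor_distances, neighbor_indices = nn.radius_neighbors(radius=0.3)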


 

