Principles
A bag-of-words representation allows texts to be compared by turning each text into a vector of word counts. The distance between each pair of vectors can then be computed to measure their proximity. Finally, close elements can be connected in a graph, which gives a formal relation between the elements and makes their similarity explicit.
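To make this concrete, here is a minimal, self-contained sketch on three toy addresses (the strings are invented for illustration): two nearly identical addresses end up close in cosine distance, while an unrelated one is at the maximum distance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import pairwise_distances

# Two nearly identical addresses and one unrelated string
docs = ["12 main street paris france",
        "12 main st paris france",
        "99 elm avenue toronto canada"]

# Turn each text into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Cosine distance: 0 means identical word distributions, 1 means no shared words
d = pairwise_distances(X, metric="cosine")
print(d[0, 1])  # small: the two Paris addresses share most of their words
print(d[0, 2])  # 1.0: no words in common
```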
Loading the dataset
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np
# I use a placeholder file here in order not to divulge the dataset I actually used
corpus=pd.read_csv("addresses.csv")
Preprocessing data
Before we convert the addresses into a bag of words, we need to perform some transformations. Our dataset contains addresses from various countries, and the country name can differ depending on the language of the emitting country.
import re
# Read a file where countries are mapped to their English translation
countries=pd.read_csv("countries.csv",index_col=0)
# Select the countries for which a substitution is needed
replacements=countries[countries.originalcountry!=countries.tobename].copy()
# Match the country when preceded by whitespace and followed by whitespace or end of string
# (inside a character class, `$` is a literal, so the earlier `[\s$]` would not match end of string)
replacements.originalcountry="\\s"+replacements.originalcountry+"(\\s|$)"
replacements.tobename=" "+replacements.tobename
replacements=replacements.set_index('originalcountry')
replacements.tobename=replacements.tobename.replace("\s","",regex=True)
# The replacement function needs a dictionary, so we build that representation from the dataframe
replacements=replacements.tobename.to_dict()
# Replace common company-type abbreviations
replacements.update({"s.a.r.l":"sarl"})
replacements.update({"s.a.u":"sau"})
replacements.update({"s.a.s":"sas"})
replacements.update({"s.a":"sa"})
replacements.update({"limited":"ltd"})
# Build a single regex that matches any key of the replacement dictionary
rep = dict((re.escape(k), v) for k, v in replacements.items())
pattern = re.compile("|".join(rep.keys()))
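The compiled pattern still has to be applied to each address. A minimal, self-contained sketch of that step follows; the toy mapping and sample string are invented for illustration (the real `rep` dictionary is built from countries.csv above), and the matched text is re-escaped for the dictionary lookup, mirroring how the keys were built.

```python
import re

# Toy replacement mapping, standing in for the dict built from countries.csv
rep = {re.escape(k): v for k, v in {"france ": "fr ", "germany ": "de "}.items()}
pattern = re.compile("|".join(rep.keys()))

def normalize(text):
    # Look up the matched text (re-escaped) in the replacement dictionary
    return pattern.sub(lambda m: rep[re.escape(m.group(0))], text)

print(normalize("10 rue x paris france germany office"))
```

On real data, this function would be mapped over the address column before vectorizing.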
Converting into bags of words
# The vectorizer converts the addresses into a bag-of-words representation
vectorizer = CountVectorizer(strip_accents="ascii")
# fit_transform expects an iterable of strings, so pass the column that holds the addresses
# (the sparse output can be fed to pairwise_distances directly, which is far more memory-efficient)
features = vectorizer.fit_transform(corpus.iloc[:, 0])
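To see what `fit_transform` actually produces, here is a small illustrative run on two invented addresses: the vectorizer learns a vocabulary and emits one row of word counts per address.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus to illustrate the learned vocabulary and the count matrix
toy = ["10 avenue foch paris", "10 avenue foch lyon"]
v = CountVectorizer(strip_accents="ascii")
X = v.fit_transform(toy)

print(sorted(v.vocabulary_))  # the learned token vocabulary
print(X.toarray())            # one row of word counts per address
```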
Computing the distance
from sklearn.metrics.pairwise import pairwise_distances
distances=pairwise_distances(features,metric='cosine')
np.save("cosinedistances",distances)
print("Distances saved")
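The principles above mention connecting close elements in a graph, which the code stops short of. A minimal sketch of that final step, on invented toy addresses and with a distance threshold that would need tuning on real data, is to keep every pair whose cosine distance falls below the threshold:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import pairwise_distances

# Toy addresses; the threshold is an assumption to tune on real data
docs = ["12 main street paris", "12 main st paris", "99 elm avenue toronto"]
X = CountVectorizer().fit_transform(docs)
distances = pairwise_distances(X, metric="cosine")

threshold = 0.5
# Keep each close pair once, using the upper triangle of the distance matrix
i, j = np.where(np.triu(distances < threshold, k=1))
edges = list(zip(i.tolist(), j.tolist()))
print(edges)  # pairs of address indices considered similar
```

The resulting edge list can be fed to any graph library to group similar addresses into connected components.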