vendredi 11 décembre 2015

How to implement an intuitive search engine in SQL Server


Introduction and principle

What we call  an intuitive search engine is a search engine like google able to find what you have in mind by a few word entered in a text input.

Ituition


The first thing we must understand is what is intuition. It might seems a little theoretical as a start, but this is actually the key to make a good search engine and it comes before the technology:
  • Intuition hardly algorithmic (but it works) : You wont be intuitive by thinking a lot to complex queries
  • Intuition is based on experience :  Implementing a good intuitive search engine is not straight forward, it takes works, but also time. You need to take feedback of your users and check what they type in the search box to get to know what is intuitive. You will be intuitive by being iterative and spanning the development with several intermediate phases. Ideally, you go with an agile methodology.
  • Intuition depends on context : What is actually intuitive depends often on the context, so there is no standard way to make the best search engine. So once for all, forget about 'standards', this wont be the shortest way to intuitive and don’t expect to get out the box solution. There are of course some frameworks in the market to do a search engine, but this is not because you integrate a sophisticated search engine that you will make a good search engine. The most important thing is to understand the principles that lay beyond the scene. For instance Lucene is a good crawler and search engine, but if you spend your time on its integration issue, you will miss the point.

How human mind stores the complexity

Also, intuitive relies on how human mind. Sometimes, you would not find a key,but then you think with who you were the day before and then you will remember were your keys are. Isn't it weird ? It is like if all the information were all mixed up in a blob and your brain was inspecting this mess to get a clue on a precise information. This 'blobuscular' vision of thing is the key to search engine. You have to pass with an intermediate, messy representation of the information. Fortunately, the notion of blob exists in database. So here is the first step:  gather all the relevant data of your data object model into a blob.



 



It is very alike that your application is not as omnivorous as Google. A search engine of this kind will  be centered on an entity. Identify this entity and consider the various data related to that central object.

For instance, let's consider a document. You want it to be found by its title, by its content. But also, it may be found by its author, reviewer or by the organization that issued the document. 









Intuitive search engine must be smart, meaning that relevance of the data depends also on its nature. So you cannot assume the same importance (weight) on the different component of you object and  you must weight the related properties.Going to the document example, this would be.

For instance:
  • Document title is weighted 10 (most important criteria)
  • Oganisation issuing the doc is weighted 5 (medium importance)
  • Author is weighted 5 (medium importance)
  • Reviewer is weighted 5 (medium importance)
  • Document content is weighted 2 (least important)


Then, you should determine if you search engine is capable of phonetizing the item. At the end, you should have something like:



Now this is the functional part of the analysis, now lets go to the technical implementation.

Technical implementation


A search engine breaks down in two parts:

  • Indexation
  • Search

Indexation


Indexation is the operation that converts an object in to a hash.



From the table assign the weight to the object part you will build hashes for each object. The key to a smart indexation is that you not use a single hash for all the part of your object.

This is from this that you will build your hashes for each of your entity.



Now, you have two ways to perform your indexation:
  • Transactional (actually it is always pseudo transactionnal)
  • Asynchronous

Transactional(pseudo transaction): When an indexation is pseudo transactional, this means that if a users update an object. This is immediately propagated into the hash, so the actual data is always consistent with its indexation. Transactional indexing should be your first option because, this mean you indexed data is up to date anytime. If you are indexing data that are issued, modified by your system, that is to say if you 'own' the data you are indexing you should go for this except if you could have volumetry and performance issue. Conversely, if your indexation requires first a crawling (like google does), you wont be able to have such a consistancy between the data you have indexed and the actual data.

Asynchronous option: is when you do not own the data, or if you need to crawl the data.

Microsoft SQL Server propose some full text indexed column with built-in functions for language specific.




Search 

Search is the most complex part, it should produce an ordered list of items by its relevance. Relevance is tedious to specify, as a start, we can consider the following relevance criteria:


- If it contains match of the words that were typed into the search input
- If the word matching are relevant in a general context. For instance, if you type: 'the man and the sea' into the input field, you might consider only the two words 'man' and 'sea', ignoring the word 'the' and 'and' (that bring little information). We call that type of word, noise.
- If the word searched is a code or a general word. For instance, if you are looking for 'patch KB5636554', it is very alike that you know exactly what you are looking for. Therefore, if you type: 'update KB5636554', it is best to display in first 'patch KB5636554' rather than 'update KB666666', because the word 'KB5636554' is more significant than 'update' in that context.

- If the word in the input fields matches some important information of the data of the object searched. For instance if there is a matching word in the title, then the entry returned is more important that the entry stored in the content.
- If the same word with slight differences is found in a document. For instance, if you look for 'men and seas', it can make sense to retrieve 'the old man and the sea'
- If the word sounds like the one you typed. For instance if you type 'emingway', you should have the
books written by 'hemingway'

As you can see there could many rule. Already implementing those one, you should get a good sort of search engine.

Technical implementation

In details the technical implementation will go in three steps.

Pre-parse
Often, it will be more easier to make an ad-hoc processing by coding it (configuration could be not sufficient)
Build the query (To be repeated for each word of the query) you will get a list of document id


Sort: compute the score of each of the entry to sort the entry returned