BLACK SQUARE WEB SOLUTIONS

Full service digital strategy planning and implementation.

Creating a Custom Search Engine Part 2 - Searching

This is part two of our article on how to create a custom search engine for your website. Read Part 1, which deals with preparation and indexing, first.

Searching

MySQL offers two distinct types of full-text searching: natural language searching and Boolean searching. Both use the MATCH ... AGAINST (...) syntax. For specific information on how to use this syntax, see http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html.

Natural Language Searching

Natural language searching takes the search phrase as entered and finds the text that most resembles it. It takes into account all the words in the phrase (except stop words), how many times they appear in each item, and how rare they are across all the indexed rows. It ignores any grouping of keywords in the search phrase (with inverted commas or brackets), and also ignores any attempt to give one keyword more weight than another. It simply takes the phrase as-is and searches for it. This form of search is intended for searchers who simply type their query, either as a list of words or as a fully grammatical question.

When the MATCH ... AGAINST expression appears in the WHERE clause and there is no explicit ORDER BY, natural language results come back ordered by relevance score, highest first, and the scores are meaningful enough to use directly. So a natural language search will pretty much do all the work for you. If you’re using paging, you can safely add a LIMIT clause to return only the subset of the result set you need.
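
As a minimal sketch of a paged natural language query, assuming the contentindex table and content column used in the example further down, a FULLTEXT index on that column, and a hypothetical page size of 10:

```sql
-- Assumes a FULLTEXT index exists on contentindex.content.
-- 'custom search engine' stands in for whatever the searcher typed.
SELECT id, content,
       MATCH (content) AGAINST ('custom search engine') AS score
FROM contentindex
WHERE MATCH (content) AGAINST ('custom search engine')
ORDER BY score DESC   -- explicit, so the order does not rely on the default
LIMIT 20, 10;         -- skip 20 rows, return the next 10 (page 3 at 10 per page)
```

Repeating the MATCH expression in the select list is only there to expose the score to the application; it can be dropped if the score is not displayed.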

Boolean Searching

Invariably, however, a client will want their search to more accurately emulate Google, and that means you need to also implement Boolean searching. Boolean searches allow the searcher to indicate which keywords in the search phrase are more or less important, and to group keywords together into phrases. A Boolean search has very different rules for searching, and needs to be handled quite differently. See http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html for specific details on the rules and operators used by Boolean searches.
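
For illustration only, a Boolean query against the same assumed contentindex table might look like this. The search phrase is made up, and the operators shown are a small subset of those listed in the manual:

```sql
-- "+" marks a word that must be present, "-" a word that must be absent.
-- A term with no operator is optional but raises the rank of rows containing it.
-- Double quotes group words into an exact phrase.
SELECT id, content
FROM contentindex
WHERE MATCH (content)
      AGAINST ('+search -google "custom engine"' IN BOOLEAN MODE);
```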

Luckily, Boolean searches use a small, defined set of non-word characters as operators, so it is easy to decide whether or not to use Boolean searching; if the search phrase contains one or more of these significant characters, use Boolean searching, otherwise use natural language searching.
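
This check usually lives in application code, before the query is built, but as a rough sketch it can also be expressed in MySQL itself. The @phrase variable is hypothetical, and the character list is the operator set from the MySQL 5.0 manual (+ - < > ( ) ~ * and the double quote):

```sql
-- Hypothetical user input.
SET @phrase = '+search -google';

-- use_boolean_mode is 1 when the phrase contains any Boolean operator
-- character, otherwise 0. The hyphen is matched by a separate alternative
-- so that it cannot be read as a range inside the bracket list.
SELECT @phrase REGEXP '[+<>()~*"]|-' AS use_boolean_mode;
```

Note that under this rule a hyphenated word such as "full-text" also switches the search to Boolean mode, so you may want to apply a stricter test to the hyphen (for example, only when it starts a word).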

Unfortunately, Boolean searches do not return useful relevance scores unless the search phrase is long. The average search phrase used to search a website is two words long, and in this case, the Boolean relevance scores are all pretty much the same, making it almost impossible to order the results in a useful way. There are several ways to tackle this issue:

Use Natural Language Relevance

The most commonly suggested solution to this problem on the internet is to use the natural language relevance scores to order your results. Essentially, you use the MATCH ... AGAINST syntax twice: in the WHERE clause in Boolean mode, and in the select list in natural language mode. You then order the whole lot by the score field, as in this example:

SELECT id, content, MATCH (content) AGAINST ('keyword') AS score
FROM contentindex
WHERE MATCH (content) AGAINST ('keyword' IN BOOLEAN MODE)
ORDER BY score DESC

This approach will provide an order, but it won’t provide an accurate idea of relevance. It will only return results that pass the Boolean filter, but the order will be that of the Natural Language full-text index, which doesn’t take into account any of the Boolean operators. You may as well do a Natural Language search in the first place.

Implement Your Own Index

A second option is to implement your own word index, and include more accurate relevance scores in it. To do this you need to maintain a second table when indexing. The first table contains the content to be searched, for use with Natural Language searches. The second contains a complete list of all the words used in the content, one word per row. Each word is linked to its content item by ID, and is given a weight according to how often it appears in that item. You can also give more weight to words that appear in certain parts of the content (e.g. in the title, or within heading tags). This is the easy part.
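
A minimal sketch of what that second table could look like. The table name, column names and weighting scheme (1 per occurrence in the body, 5 per occurrence in the title) are all assumptions made for illustration:

```sql
-- Hypothetical word index: one row per distinct word per content item.
CREATE TABLE wordindex (
    item_id INT UNSIGNED NOT NULL,  -- ID of the content item in the main table
    word    VARCHAR(64)  NOT NULL,  -- a single lower-cased word
    weight  INT UNSIGNED NOT NULL,  -- occurrence count, boosted by position
    PRIMARY KEY (item_id, word),
    KEY idx_word (word)             -- lets lookups by word avoid a table scan
);

-- Rows the indexer might write for content item 42.
INSERT INTO wordindex (item_id, word, weight) VALUES
    (42, 'search', 8),  -- once in the title (5) plus three times in the body (3)
    (42, 'engine', 6),  -- once in the title (5) plus once in the body (1)
    (42, 'mysql',  2);  -- twice in the body
```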

The hard part is to build the query. You will need to manually parse the search phrase, and construct a query that implements all the elements of the Boolean search. This becomes particularly difficult with grouped phrases, and you may need to perform several nested queries to achieve accurate results. Remember to include the weight with every selection, so you can order your results by relevance. You can still use full-text searching to identify the words, but it may be quicker and more accurate to use simple string comparisons. You can also use SOUNDEX to cater for misspellings, plurals, etc.
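
To give a feel for what such a hand-built query involves, here is a rough sketch for the phrase +search engine -google. It assumes a hypothetical wordindex table with item_id, word and weight columns (one row per word per item), and it does not attempt grouped phrases at all:

```sql
SELECT w.item_id, SUM(w.weight) AS score
FROM wordindex AS w
WHERE w.word IN ('search', 'engine')       -- words that contribute to the score
  AND NOT EXISTS (                         -- "-google": drop items containing it
        SELECT 1
        FROM wordindex AS x
        WHERE x.item_id = w.item_id
          AND x.word = 'google')
GROUP BY w.item_id
HAVING SUM(w.word = 'search') > 0          -- "+search": the word must be present
ORDER BY score DESC;
```

Every extra operator in the search phrase adds another condition of this kind, which is where the complexity and the query cost come from.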

The problem with this, of course, is that it is really difficult to implement. It will require a lot of time to develop and test the algorithm, and may well produce very resource-intensive searches.

Compromise

The solution I’ve chosen for my implementations is a compromise. I have implemented a rough, simple version of my own index, which is not very accurate, doesn’t take grouped phrases into account well, and ignores negative results. It does, however, return me a list of content items which contain the words in the search phrase, and provides a weighted score for each one.

I use this to order a normal Boolean search result set to produce a useful result. This takes advantage of the far more advanced Boolean search algorithm used by the MySQL server, and the speed with which it runs, and combines it with a custom weighting system which allows me to place more or less importance on words according to where they appear in the content.
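
The exact query is not shown here, but one way to combine the two could look like the following sketch. It assumes the contentindex table from the earlier example plus a hypothetical wordindex table (item_id, word, weight), with the list of plain words extracted from the search phrase by the application:

```sql
SELECT c.id, c.content,
       (SELECT COALESCE(SUM(w.weight), 0)      -- custom weighted score
        FROM wordindex AS w
        WHERE w.item_id = c.id
          AND w.word IN ('search', 'engine')   -- words taken from the phrase
       ) AS score
FROM contentindex AS c
-- MySQL's Boolean search decides which rows match ...
WHERE MATCH (c.content) AGAINST ('+search engine -google' IN BOOLEAN MODE)
-- ... and the custom weights decide the order.
ORDER BY score DESC
LIMIT 0, 10;
```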

Next: Part 3 - Displaying the Results
