Overblog
Edit post Follow this blog Administration + Create my blog
BLACK SQUARE WEB SOLUTIONS

Full service digital strategy planning and implementation.

Creating a Custom Search Engine Part 1 - Indexing

I wrote a post a while ago about adding search to your site, where I waxed lyrical about the benefits of using Google to provide these search functions. That still stands. Use Google if you can. Google does, however, have certain limitations. It requires that all the pages be available to the public, for example, and it will only index whole pages. If you have information behind a login, or you have more specific search requirements (eg: you wish to limit the search to a database of products), you will need to provide a custom search.

Getting started

If the dataset you’re searching is relatively small, you will do best to organise some kind of categorisation, and allow your users to simply browse by category. For small datasets (anything less than several thousand items), with short descriptions (500 words or less), a keyword search is far more likely to turn up no results than it is to turn up anything useful. Rather, tag or categorise each item, and allow the searchers to filter their results with tags and categories that actually exist.

If, however, your data is too large for a simple category filter, or your category filters become too large and complicated, you will need to implement a keyword search. You will not be able to implement anything like Google offers. You just won’t. Moderate your expectations now. Google has tens of thousands of servers in multiple data centres around the world dedicated to delivering fast, efficient search results. You have one computer, and you are most likely sharing that with hundreds of other websites. Google has spent billions of dollars on an army of the brightest minds of our generation, who have spent decades (ok, a decade) refining their search algorithm to make it as efficient, accurate and useful as possible. You have you, and me, and probably about two weeks to make this work. So, we’ll do our best, but Google won’t be worrying about us as competition.

To create the custom search, we’re going to use the MySQL server to its full extent. MySQL has a form of index called a Full Text index, which can be used with special SQL commands to perform full text searches, and that’s exactly what we want. If you use a different database (eg: SQL Server), you should look up the documentation to find out if they also have full text indexing, and if so, how it works. All the big databases have the facility, but each works slightly differently.

Now, to create a custom search engine, you need to perform two tasks: indexing and searching. The indexing gets the information into the database in a form that will best be searched, and searching, obviously, gets the information out.

Indexing

The exact manner of indexing will depend on your data. It is usually a good idea to separate the searchable data from the other data in your database, so you aren’t putting pressure on the same data from two separate directions. You should also remember that the smaller the data you have stored, the faster and more efficient your searches. So, if you have a database of products, each of which has a title and an HTML formatted description, you might create a second table which contains one column for the ID of the product, a second for the title and a third for the description, stripped of all HTML tags. The HTML is for presentation only, and the tags will just bulk up the search data, making the search less efficient. Similarly, if you have a list of articles, you may keep a second table with columns for each article’s ID,  title, keywords, description and the article itself, stripped of HTML tags.

You should always provide a function somewhere for the administrator to re-build the index completely. This will be useful initially, when you have to build the first index -  you never know when something might go wrong -  and when the index gets out of synch with the data. A good search system, however, will update itself automatically, so you should identify every function that might affect the data, and update the index accordingly at the same time. So, for example, if an item is deleted from the database, it should also be deleted from the index.

You may also decide you need a word index of your data. The word index takes all the searchable text, and breaks it up into individual words. It then builds a table linking each word to the item ID, along with a weight. The weight is essentially the number of times the word appears in the text, but you may choose to assign more weight to words from different sources. So, for example, you may choose to assign a score of 3 to each word in the title, but only 1 to each word in the description. Then, if a word appears once in the title, and three times in the description, it would have a weight score of 6. This technique is used to give more weight, or relevance, to words that appear in the title of an article, assuming that these are more important. The word index will be used for Boolean searches (discussed below).

You will need to maintain a list of stop words for your search system. Stop words are words that are ignored by the search engine when performing the search – simple or common words like “and”, “with” or “because”. These have important meaning in conversation, but the search engine is not intelligent enough to understand the meaning of these words, and, if they are considered significant in a search, too many irrelevant results will be returned. MySQL maintains its own list of stop words, which you can edit only if you have direct access to the server installation (i.e. you have your own hosting server – shared servers do not allow you to modify the stop word list). If you create a word index, however, you will need your own stop word list to filter out any stop words before they get into the index. This will significantly reduce the size of your index, and improve the efficiency of its use. You will also need it if you parse your search phrases yourself. MySQL publish their stop word list (http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html), and I simply created a stop word table, and populated it from that published list.

Next: Part 2 - Searching

Share this post
Repost0
To be informed of the latest articles, subscribe:
Comment on this post
B
[...] This is part two of our article on how to create a custom search engine for your website. Read Part 1, which deals with preparation and indexing, [...]
Reply