Creating a Custom Search Engine Part 3 - Displaying Results

July 6 2011

This is part three of our article on how to create a custom search engine for your website. If you haven’t already, start with Part 1, which deals with preparation and indexing, and Part 2, which deals with searching.

Displaying the Results

As with all things on a website, presentation is important. You are doing this to provide a useful service to your users, so you need to present the results in a sensible way in order to fulfil your goals; it doesn’t help to build the best search engine in the world if the results you present are confusing and unreadable.

There are several standard techniques for presenting search results, generally copied from Google, that will affect your development: paging, individualised results and contextual descriptions.

Paging

Paging is simply the practice of presenting a limited number of results on a page (usually between 10 and 20), and providing navigation to get more. This seems simple, and isn’t terribly difficult to develop on the front end. All you need to do is limit the number of results presented, keep track of where you are in the resultset, and provide a set of links that allow your user to request results from a different part of the resultset. In the back end, however, you have a number of things to consider.
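
The front-end bookkeeping described above can be sketched in a few lines. This is a minimal illustration in Python, not a prescribed implementation: the function names, the "/search" URL and the "q" and "page" parameter names are all assumptions made for the example.

```python
import math
from urllib.parse import urlencode


def page_window(total_results, page, per_page=10):
    """Work out where a 1-based page number falls in the resultset."""
    total_pages = max(1, math.ceil(total_results / per_page))
    page = min(max(1, page), total_pages)  # clamp out-of-range requests
    return {
        "page": page,
        "total_pages": total_pages,
        "offset": (page - 1) * per_page,  # index of the first result to show
        "has_prev": page > 1,
        "has_next": page < total_pages,
    }


def page_links(base_url, query, total_pages):
    """Return (page_number, url) pairs, one per page of results."""
    return [
        (n, f"{base_url}?{urlencode({'q': query, 'page': n})}")
        for n in range(1, total_pages + 1)
    ]


# Hypothetical request: 45 matches, user asks for page 3, 10 per page.
window = page_window(total_results=45, page=3, per_page=10)
links = page_links("/search", "custom search", window["total_pages"])
```

Clamping the page number means a hand-edited URL asking for page 99 simply shows the last page rather than an empty one.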

First of all, you need to decide whether to fetch the entire resultset from the database and work with that, or to fetch just the required results using the LIMIT clause. Using LIMIT means you’re working with much smaller bodies of data, and will usually make the system more efficient. You will, however, need to run a second query to count the total number of matches, because you cannot build your paging without that figure. So, you will need to compare the memory usage of the full resultsets against the time and CPU resources used by the two queries, and decide which way to go.
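
To make the trade-off concrete, here is a rough sketch of the "fetch only what you need" approach: one query counts the matches and a second fetches a single page. It uses Python’s built-in sqlite3 module so that it runs anywhere; the "pages" table, its columns and the LIKE match are stand-ins for your own MySQL schema and full-text search, not something taken from the earlier parts of this article.

```python
import sqlite3


def search_page(conn, term, page, per_page=10):
    """Return (total_matches, rows_for_this_page) using two queries."""
    pattern = f"%{term}%"
    # Query 1: how many results are there in total? Needed for paging.
    total = conn.execute(
        "SELECT COUNT(*) FROM pages WHERE body LIKE ?", (pattern,)
    ).fetchone()[0]
    # Query 2: fetch only the rows that will be displayed.
    rows = conn.execute(
        "SELECT id, title FROM pages WHERE body LIKE ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (pattern, per_page, (page - 1) * per_page),
    ).fetchall()
    return total, rows


# Illustrative data: 25 pages that all mention "search".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO pages (title, body) VALUES (?, ?)",
    [(f"Page {i}", "notes on search engines") for i in range(1, 26)],
)

total, rows = search_page(conn, "search", page=3)
```

With 25 matches and 10 per page, the third page holds only the last five rows, while the count still reports all 25.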

If you use the compromise method for Boolean searches that I mentioned in Part 2 of this article, your choice is easy; you have to collect the complete resultset, because you need to order the results yourself.

If you fetch the entire resultset, you will also need to decide whether to perform the search with every request, or to store the results in a session variable. Using the session variable will make requests for further pages in the resultset much faster, but it will take up significant memory. Remember that session variables typically remain in memory for 20 to 30 minutes after the user’s last request, depending on your server’s session timeout, so even a small number of concurrent searches can make your session memory very large.

A common technique is to cap the number of results that will be fetched. Most users won’t go anywhere beyond the 10th page of search results, so it is fairly safe to place a limit of 100 or 200 (depending on whether you display 10 or 20 results per page) on your resultsets.
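
The session-storage option and the cap on results can be sketched together. The class below is a hypothetical in-memory stand-in for your platform’s session store, written in Python for illustration only: it keeps a capped resultset per session and query, and drops it a fixed time after the search was run. Real session handling (and its timeout behaviour) will differ between platforms.

```python
import time


class ResultCache:
    """Keep capped resultsets per (session, query) for a limited time."""

    def __init__(self, ttl_seconds=20 * 60, max_results=200, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max_results = max_results
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # (session_id, query) -> (expires_at, results)

    def get_or_search(self, session_id, query, run_search):
        """Return cached results, or call run_search(query) and cache them."""
        self._evict_expired()
        key = (session_id, query)
        if key in self._entries:
            return self._entries[key][1]
        results = run_search(query)[: self._max_results]  # apply the cap
        self._entries[key] = (self._clock() + self._ttl, results)
        return results

    def _evict_expired(self):
        """Remove entries whose time limit has passed."""
        now = self._clock()
        for key in [k for k, (exp, _) in self._entries.items() if exp <= now]:
            del self._entries[key]

    def __len__(self):
        return len(self._entries)
```

A request for page two of the same search then becomes a list slice on the cached results rather than a new database query, at the cost of holding up to 200 items per active search in memory.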

Individualised Results

It may seem obvious, but it’s important to provide the user with as much information as possible about the search results. At a minimum, you should provide a title, description and link to the result. You should also consider including the URL of the page, the date it was published (or last modified), an icon or thumbnail image representing the result if possible, and any other information that the user might be able to use to make a decision on the relevance of this result. If your results might come from several sources, you should clearly identify the source of the result as well.

You may also want to extend your search engine to include filters in addition to the basic keyword search. So, for example, if your site has four different sections, you may allow the searcher to filter their results by section. This is only really necessary for large sites.

All this information about each result will create very large resultsets, so we recommend only collecting the information for the ten or twenty results that will actually be displayed. If your search uses the LIMIT clause, and only fetches the required results anyway, this is a moot point. If not, your resultset should consist of only the minimum information about each result (usually an ID, possibly a source identifier, and a relevance score). Once your paging has decided which items to display, run another query to fetch all the required information for just those results.
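
Here is one way the two-step approach might look, again as a Python sketch using sqlite3 in place of MySQL. The "ranked" list stands for the minimal resultset (ID and relevance score) produced by your search; the "pages" table and its columns are invented for the example.

```python
import sqlite3


def fetch_details(conn, ranked, page, per_page=10):
    """Fetch full details only for the results on the requested page.

    ranked is the minimal resultset: (id, score) pairs, best first.
    """
    start = (page - 1) * per_page
    page_items = ranked[start:start + per_page]
    if not page_items:
        return []
    ids = [item_id for item_id, _score in page_items]
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title, url, modified FROM pages WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) does not preserve order, so restore the relevance ranking.
    return [by_id[i] + (score,) for i, score in page_items if i in by_id]


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, url TEXT, modified TEXT)"
)
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?)",
    [(i, f"Title {i}", f"/page-{i}", f"2011-07-0{i}") for i in range(1, 6)],
)

ranked = [(3, 0.9), (1, 0.7), (5, 0.4), (2, 0.2)]
first_page = fetch_details(conn, ranked, page=1, per_page=2)
```

The detail query only ever touches as many rows as you display per page, however large the ranked list is.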

Contextual Descriptions

Contextual descriptions are, I believe, another Google innovation. Previously, search engines typically used the description meta-tag data as the result description, and were happy with that. Google decided (correctly) that it was much more useful to show the part of the page copy that actually contains the user’s search term.

The problem with contextual descriptions, of course, is the added processing required to identify them. You will need to load the entire content of the page again in order to do a second search for the key phrase, and identify where to extract the information. You cannot rely on the MySQL search functions here – you need to write all your own logic to handle the case where the exact phrase doesn’t actually appear in the text at all. This can get quite complicated as you try to find the maximum relevance for the description.

At an absolute minimum, you should first search for the whole phrase. If that doesn’t work, strip out all non-word characters and stop words, split the phrase into words, and search for each in turn until you find an instance of one in the text. You then extract the copy around that match (consider using 150 characters before and after) and strip off any incomplete words at either end. If none of the words can be found, use the description meta-tag, or the first 300 characters of the copy. This is a minimum; you may want to work on this to identify more relevant descriptions (eg: by taking grouping operators into account).

Remember that the copy you’re searching should not contain any HTML markup. Tip: a tag-stripping function such as PHP’s strip_tags removes all HTML tags and replaces them with nothing. This means that words separated only by tags, and not by whitespace, will be concatenated, which may distort your search and the resulting description. Consider using a regular expression to replace each tag with a space instead.
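
The minimum procedure above can be sketched as follows. This is an illustrative Python version under several assumptions: the stop-word list is a tiny placeholder, the tag-matching regular expression is deliberately naive, and matching is a plain case-insensitive substring search, so "oak" would also match inside "cloak".

```python
import re

STOP_WORDS = {"a", "an", "and", "in", "of", "on", "or", "the", "to"}  # placeholder list


def strip_tags_with_space(html):
    """Replace each tag with a space so adjacent words are not joined."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def contextual_description(html, phrase, radius=150, fallback_len=300,
                           meta_description=None):
    """Return the copy surrounding the first match of the phrase or its words."""
    text = strip_tags_with_space(html)
    # Try the whole phrase first, then each non-stop word in turn.
    words = re.sub(r"[^\w\s]", " ", phrase.lower()).split()
    candidates = [phrase.strip()] + [w for w in words if w not in STOP_WORDS]
    for candidate in candidates:
        if not candidate:
            continue
        match = re.search(re.escape(candidate), text, flags=re.IGNORECASE)
        if match is None:
            continue
        start = max(0, match.start() - radius)
        end = min(len(text), match.end() + radius)
        # Trim a partial word at the front, if the cut landed inside one.
        if start > 0 and text[start - 1] != " ":
            gap = text.find(" ", start, match.start())
            if gap != -1:
                start = gap + 1
        # Trim a partial word at the back in the same way.
        if end < len(text) and text[end] != " ":
            gap = text.rfind(" ", match.end(), end)
            if gap != -1:
                end = gap
        return text[start:end].strip()
    # Nothing matched: fall back to the meta description or the opening copy.
    return meta_description or text[:fallback_len]


sample = ("<p>Our shop sells</p><p>handmade oak furniture</p> "
          "and delivers it nationwide every week.")
snippet = contextual_description(sample, "oak furniture", radius=12)
```

The small radius in the usage line is only there to make the trimming visible; with the suggested 150 characters you would get the whole sentence back from a page this short.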

You should also consider highlighting every instance of a word in the search phrase that appears in either the title or the description of the result. This can easily be done by looping through the words, and performing a regular expression replacement.
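
As a sketch of the highlighting step, the Python function below wraps each search word in a tag. Note one deliberate difference from the loop described above: it combines all the words into a single regular expression and makes one pass, which avoids accidentally matching inside markup inserted by an earlier pass. The function name, the choice of the strong tag and the whole-word matching are assumptions for the example.

```python
import html
import re


def highlight(text, phrase, tag="strong"):
    """Wrap whole-word matches of any search word in <tag>, escaping the rest."""
    words = sorted({w.lower() for w in re.findall(r"\w+", phrase)})
    if not words:
        return html.escape(text)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    # With one capturing group, split() alternates plain text and matches,
    # so every odd-numbered piece is a matched word.
    pieces = pattern.split(text)
    out = []
    for index, piece in enumerate(pieces):
        escaped = html.escape(piece)  # the text is plain copy, so escape it for output
        out.append(f"<{tag}>{escaped}</{tag}>" if index % 2 else escaped)
    return "".join(out)


marked = highlight("Custom search & Search tips", "search")
```

Because matching is on whole words, "search" will not highlight "searching"; drop the word-boundary markers if you prefer partial matches.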

Conclusion

The implementation of a custom search engine described here will probably be effective on most websites, but there is enormous potential for improvement. The MySQL full-text searching methods can easily be extended, and the technique of building your own indexes allows for all sorts of refinements. You are limited only by your timeframe, budget and imagination.