After my previous blog post I received some nice and useful feedback about the post itself and also about the GX WebManager search engine, thank you for that! I also received a lot of questions. Two questions I received several times are:
Both questions will be addressed today, but both won’t be fully answered until my third blog post about this topic. I promise I won’t keep you waiting as long as this time :- ). Oh - and another comment I received is to use more images. Alrighty!
Let’s start off with some theory. To judge whether a document in a result set is relevant, you have to look at it from a higher perspective. This is also where two new terms kick in: “recall” and “precision”.
An example: say you have 100 documents in your index, and by manually reviewing them one by one you know that 16 of them are in some way relevant to the word “WebManager”. Recall is the share of those relevant documents that your query actually returns; precision is the share of the returned documents that are relevant. If you query your index with the term “WebManager” and get 24 results, of which 12 are among the 16 good results, then you have:
A recall of 12/16 = 0.75 = 75% and a precision of 12/24 = 0.5 = 50%
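The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the definitions: the function name and the document ids are made up for the example, and the sets simply mirror the numbers above (16 relevant documents, 24 returned, 12 in common).

```python
def recall_and_precision(relevant, returned):
    """Return (recall, precision) for a set of relevant and a set of returned document ids."""
    relevant = set(relevant)
    returned = set(returned)
    hits = relevant & returned  # relevant documents that were actually returned
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(returned) if returned else 0.0
    return recall, precision


# Hypothetical ids: documents 0..15 are the 16 relevant ones.
relevant_docs = set(range(16))
# The query returns 24 documents: 12 relevant ones (0..11) and 12 others (50..61).
returned_docs = set(range(12)) | set(range(50, 62))

recall, precision = recall_and_precision(relevant_docs, returned_docs)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.75 precision=0.50
```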
All these calculations are mainly interesting for scientific studies about search engines, but what these terms show is actually quite important: having good results is not only about finding the most relevant document, but just as much about leaving out the documents that are not relevant, and also about making sure you don’t miss relevant documents that are not even returned in the search results.
So, suppose you have done as much as possible to get all the relevant documents returned in a result set and you know that on average 80% of the returned documents are relevant. The next step is to order these results by their relevance. This is called relevance ranking.
There are many ways to decide whether document A is more important than document B and it mainly depends on the ranking algorithm that is used. Most classic search engines (and also the GX WebManager search engine) use a boolean method plus some extra methods and/or logic. This boils down to searching for queried words in the index and ranking the results based on the number of found query terms plus a possible extra calculation based on custom parameters or business logic.
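To make the idea of “ranking on the number of found query terms” concrete, here is a minimal sketch. It is not how the GX WebManager search engine is implemented; the function name, the toy documents and the whitespace tokenization are all assumptions for the sake of the example.

```python
def rank_by_matched_terms(documents, query):
    """Rank documents by how many distinct query terms they contain."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        doc_terms = set(text.lower().split())
        matched = len(query_terms & doc_terms)
        if matched > 0:  # boolean part: a document must match at least one term
            scored.append((doc_id, matched))
    # Most matched terms first; ties are broken on document id for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# Hypothetical mini index.
documents = {
    "doc-a": "WebManager search engine tuning",
    "doc-b": "WebManager release notes",
    "doc-c": "holiday schedule",
}

ranking = rank_by_matched_terms(documents, "webmanager search")
print(ranking)  # [('doc-a', 2), ('doc-b', 1)]
```

A real engine would add the “extra calculation” mentioned above on top of this count, which is where the metadata priorities and taxonomy come in.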
Google appliance
Other search engines, such as Google, use totally different ranking algorithms. Google basically ranks its documents based on the number of incoming links to a document. The more incoming links, the more important the site or document. This is however quite a problem when using Google for intranet environments, because intranet content is not available to the outside world and therefore receives no links from it. The smaller the total number of documents, the harder it is to use the Google PageRank algorithm, which means even Google has to fall back on other search algorithms. By talking to people who use the Google Mini or the Google Search Appliance I noticed that customers are sometimes surprised by the search results, which are not always as good as they’re used to on Google.com. The problem is that Google can’t use the power, depth and reach of the internet in intranet environments, which is why I, and also industry analysts, advise thinking this through carefully before moving to the Google Mini or the appliance.
Another example is Autonomy, an enterprise search engine. They use a combination of natural language algorithms and so-called ‘Bayesian network’ models. This basically means that Autonomy sees words as nodes in a network that have relations. Based on these relations they can predict which words have meaning, how important they are and how they are related to other words. Interesting, or not?
Returning to the classic search engines and the GX WebManager search engine, there are two very useful features to steer the ranking algorithm: the first is to use metadata priorities and the second is to use a taxonomy to cluster document sets and result sets.
When indexing content, not everything is stored in one large text field; a lot of metadata is stored in separate fields. This is very useful because someone who’s familiar with the data can judge whether certain metadata fields are more likely to contain relevant information than others. By applying a factor to metadata fields you can steer the ranking algorithm. For example, say you have 5 metadata fields: ‘title’, ‘description’, ‘author’, ‘keywords’ and ‘main content’. You could decide that the title is a factor 2 more important than the description and main content, that the author is not important (because there are other ways to search for authors, for example) and that the keywords are a best bet (because they are manually picked, so they are surely relevant), so they get a factor 5. You have to decide for your organization which metadata you need in order to make this work. My next blog post will also explain how to use this in the GX WebManager search engine to tune your results.
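As a rough sketch of what such factors do to a score, the example above could look like this in Python. The field names and factors come from the example; the scoring function, the sample documents and the simple term matching are assumptions and not the GX WebManager implementation.

```python
# Factors from the example: title counts double, keywords are a best bet,
# author is ignored. Description and main content are the baseline (1.0).
FIELD_WEIGHTS = {
    "title": 2.0,
    "description": 1.0,
    "main_content": 1.0,
    "author": 0.0,
    "keywords": 5.0,
}


def weighted_score(doc, query, weights=FIELD_WEIGHTS):
    """Sum, per metadata field, the number of matched query terms times the field factor."""
    query_terms = set(query.lower().split())
    score = 0.0
    for field, weight in weights.items():
        field_terms = set(doc.get(field, "").lower().split())
        score += weight * len(query_terms & field_terms)
    return score


# Hypothetical documents.
doc_keyword = {"title": "Intranet guide", "keywords": "search", "main_content": "general information"}
doc_title = {"title": "Search tips", "main_content": "how to search"}
doc_author = {"title": "Notes", "author": "search"}

print(weighted_score(doc_keyword, "search"))  # 5.0 (keyword match only)
print(weighted_score(doc_title, "search"))    # 3.0 (title 2.0 + main content 1.0)
print(weighted_score(doc_author, "search"))   # 0.0 (author field is ignored)
```

The document with the manually picked keyword outranks the one that only mentions the term in its title and body, which is exactly the steering effect described above.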
Users of a content management system are very lucky, because most enterprise CMSes contain a taxonomy. A taxonomy is a hierarchical categorization.
Example taxonomy
Taxonomies can be helpful in several ways when it comes to relevance ranking. By indexing the taxonomy information that is related to certain content, you can treat it as additional metadata that is also hierarchically structured. You could use this to add a factor based on the location in the hierarchy. For example, say you have a taxonomy with the following structure: Vehicles > Cars > Fiat > Fiat 500. When a document is related to ‘Fiat 500’ you could assign a factor 5 to that level, a factor 3 to ‘Fiat’, a factor 1.5 to ‘Cars’, and so on. This way, when a user searches for ‘cars’ he will see a lot of cars, including the Fiat 500, and when a user searches for ‘Fiat 500’ he will almost certainly see this document in the top positions.
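A small sketch of this level-based factor, again in Python. The factors 5, 3 and 1.5 are the ones from the example; the factor of 1.0 for levels further up (such as ‘Vehicles’) is my own assumption to fill in the “and so on”, and the function name is made up.

```python
# Factors counted from the document's own taxonomy node upwards:
# own node, parent, grandparent.
LEVEL_FACTORS = [5.0, 3.0, 1.5]
DEFAULT_FACTOR = 1.0  # assumption: any higher ancestor gets a neutral factor


def taxonomy_boost(doc_path, query):
    """Return the factor for a query that matches a node on the document's taxonomy path."""
    wanted = query.strip().lower()
    # Walk from the most specific node back up to the root.
    for distance, node in enumerate(reversed(doc_path)):
        if node.lower() == wanted:
            if distance < len(LEVEL_FACTORS):
                return LEVEL_FACTORS[distance]
            return DEFAULT_FACTOR
    return 0.0  # the query does not match the taxonomy path at all


fiat_500_doc = ["Vehicles", "Cars", "Fiat", "Fiat 500"]

print(taxonomy_boost(fiat_500_doc, "Fiat 500"))  # 5.0
print(taxonomy_boost(fiat_500_doc, "Fiat"))      # 3.0
print(taxonomy_boost(fiat_500_doc, "cars"))      # 1.5
```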
A second way to use a taxonomy is to enable searching within result sets, or in other words: to use clustering. Clustering here is not the IT term for multiple servers working together; it means grouping sets of documents. Clusters can be mapped to a taxonomy, which allows you to narrow down your search based on that taxonomy. An example: let’s say you search for ‘cars’ and get many, many results. The interface could provide you with a list of brands, based on the taxonomy. The user might be interested in Fiat, and by selecting Fiat the user will only see results related to Fiat.
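The narrowing-down step could be sketched like this. The result set and the extra taxonomy entries (‘Fiat Panda’, ‘Volvo’) are hypothetical, and grouping on a fixed taxonomy level is just one simple way to build the list of brands.

```python
from collections import defaultdict


def cluster_by_level(results, level):
    """Group result document ids by the taxonomy node found at the given depth (0 = root)."""
    clusters = defaultdict(list)
    for doc_id, path in results:
        if len(path) > level:  # skip documents that are not classified that deep
            clusters[path[level]].append(doc_id)
    return dict(clusters)


# Hypothetical result set for the query 'cars': (document id, taxonomy path).
results = [
    ("doc-1", ["Vehicles", "Cars", "Fiat", "Fiat 500"]),
    ("doc-2", ["Vehicles", "Cars", "Fiat", "Fiat Panda"]),
    ("doc-3", ["Vehicles", "Cars", "Volvo", "Volvo V70"]),
    ("doc-4", ["Vehicles", "Cars"]),
]

# Level 2 is the brand level in Vehicles > Cars > brand > model.
brands = cluster_by_level(results, level=2)
print(sorted(brands))   # ['Fiat', 'Volvo']  -> the list of brands shown to the user
print(brands["Fiat"])   # ['doc-1', 'doc-2'] -> what remains after selecting Fiat
```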
So now we know more about what influences our search results and we even know some things we can do to influence them. Next time we put this into practice and learn how to analyze, think, act, analyze again, and so on. In other words: we learn how to tune our search engine.
Meanwhile: keep sending your thoughts and questions to martinvm [at] gx.nl
Martin van Mierloo is Product Manager and has many years of experience with GX WebManager. Martin writes about the GX WebManager roadmap, new product features and WCMS-related topics.