Search quest [2/3] - Relevance

May 28, 2008

After my previous blog post I received some nice and useful feedback about the post itself and also about the GX WebManager search engine, thank you for that! I also received a lot of questions. Two questions I received several times are:

  1. What about Google/Google Mini/Google Custom/Google appliance? How does it compare to your search engine and can we and should we use it?
  2. What can we do to improve our search results?

Both questions will be addressed today, but neither will be fully answered until my third blog post on this topic. I promise I won’t keep you waiting as long as this time :-). Oh, and another comment I received was to use more images. Alrighty!

Recall & Precision

Let’s start off with some theory. To judge whether a document in a result set is relevant, you have to look at it from a higher perspective. This is also where two new terms kick in: “recall” and “precision”.

  • Recall is the percentage of all relevant documents that you retrieved.
  • Precision is the percentage of retrieved documents that are relevant.

An example: say you have 100 documents in your index, and by manually reviewing them one by one you know that 16 of them are in some way relevant to the word “WebManager”. If you query your index with the term “WebManager” and get 24 results, of which 12 are among the 16 relevant documents, then you have:

A recall of 12/16 = 0.75 = 75% and a precision of 12/24 = 0.5 = 50%
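The same arithmetic can be written down in a few lines of Python. This is only a sketch: the function name and the document ids are made up for illustration, chosen so that the numbers match the example above.

```python
def recall_and_precision(relevant, retrieved):
    """Return (recall, precision) for two sets of document ids."""
    hits = relevant & retrieved  # relevant documents that were actually returned
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical ids: documents 1..16 are relevant, the query returns 5..28,
# so 24 results come back and 12 of them (5..16) are relevant.
relevant = set(range(1, 17))
retrieved = set(range(5, 29))

recall, precision = recall_and_precision(relevant, retrieved)
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
# prints: recall = 0.75, precision = 0.50
```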


All these calculations are mainly interesting for scientific studies about search engines, but what these terms show is actually quite important: having good results is not only about finding the most relevant document, but just as much about leaving out the documents that are not relevant, and also about making sure you don’t miss relevant documents that are not even returned in the search results.

Relevance ranking

So, suppose you have done as much as possible to get all the relevant documents returned in a result set and you know that on average 80% of the returned documents are relevant. The next step is to order these results based on their relevance. This is called relevance ranking.

There are many ways to decide whether document A is more important than document B, and the outcome mainly depends on the ranking algorithm that is used. Most classic search engines (including the GX WebManager search engine) use a Boolean method plus some extra methods and/or logic. This boils down to searching for the queried words in the index and ranking the results based on the number of query terms found, plus a possible extra calculation based on custom parameters or business logic.
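As a rough illustration of that idea, the sketch below matches documents with a Boolean OR and ranks them by the number of distinct query terms found. It is not how the GX WebManager search engine is implemented; the function name and the sample documents are assumptions made up for this example.

```python
def rank_by_matched_terms(query, documents):
    """Rank documents by how many distinct query terms they contain."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        matched = len(terms & words)
        if matched:  # Boolean OR: at least one query term must be present
            scored.append((doc_id, matched))
    # Highest number of matched terms first; ties are ordered by id
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical mini index: id -> text
documents = {
    "a": "GX WebManager search engine tuning",
    "b": "WebManager release notes",
    "c": "company picnic photos",
}
ranking = rank_by_matched_terms("WebManager search", documents)
print(ranking)
# prints: [('a', 2), ('b', 1)]
```

Document "c" contains none of the query terms, so it is left out of the result set altogether. The "extra calculation" mentioned above would be an additional term in the sort key.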



[Image: Google appliance]

Other search engines, Google for example, use totally different ranking algorithms. Google basically ranks its documents based on the incoming links to a document: the more incoming links, the more important the site or document. This is quite a problem when using Google in intranet environments, because intranet content is not available to the outside world and therefore receives no links from it. The smaller the total number of documents, the harder it is to use the Google PageRank algorithm, which means even Google has to fall back on other search algorithms. Talking to people who use Google Mini or the Google appliance, I noticed that customers are sometimes surprised by the search results, which are not always as good as they’re used to on Google.com. The problem is that within an intranet Google can’t use the power, depth and reach of the internet, which is why I, along with industry analysts, advise you to think carefully about this before moving to Google Mini or the appliance.
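To get a feel for link-based ranking, here is a simplified power-iteration version of the PageRank idea. It is a sketch under assumptions (a damping factor of 0.85, a fixed number of iterations and a tiny made-up link graph), not Google's actual implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its rank evenly over its outgoing links
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical mini-site: three pages link to "home", which links to "news".
links = {
    "home": ["news"],
    "news": ["home"],
    "products": ["home"],
    "contact": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph "home" ends up on top because it has the most incoming links, while "products" and "contact" have none and share the lowest score. With only a handful of pages and links, as on a small intranet, there is little link information to separate the pages.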

Another example is Autonomy, an enterprise search engine. They use a combination of natural language algorithms and so-called ‘Bayesian network’ models. This basically means that Autonomy sees words as nodes in a network that have relations. Based on these relations they can predict which words have meaning, how important they are and how they are related to other words. Interesting, or not?

Returning to the classic search engines and the GX WebManager search engine, there are two very useful features to steer the ranking algorithm: the first is to use metadata priorities and the second is to use a taxonomy to cluster document sets and result sets.

Metadata

When indexing content, not everything is stored in one large text field; a lot of metadata is stored in separate fields. This is very useful, because someone who’s familiar with the data can judge whether certain metadata fields are more likely to contain relevant information than others. By applying a factor to metadata fields you can steer the ranking algorithm. For example, suppose you have 5 metadata fields: ‘title’, ‘description’, ‘author’, ‘keywords’ and ‘main content’. You could decide that the title is a factor 2 more important than the description and the main content, that the author is not important (because there are other ways to search for authors, for example), and that the keywords are a best bet (they are manually picked, so they are surely relevant) and get a factor 5. You have to decide for your organization which metadata you need in order to make this work. My next blog post will also explain how to use this in the GX WebManager search engine to tune your results.
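A minimal sketch of such field factors, using the numbers from the example above. The field names, the scoring function and the sample documents are assumptions for illustration; how this is configured in the GX WebManager search engine is not shown here.

```python
# Factors from the example above; the field names are illustrative.
FIELD_FACTORS = {
    "title": 2.0,
    "description": 1.0,
    "main_content": 1.0,
    "author": 0.0,    # not important for ranking
    "keywords": 5.0,  # manually picked "best bets"
}


def weighted_score(query, document, factors=FIELD_FACTORS):
    """Sum, per field, (number of query terms found) * (field factor)."""
    terms = set(query.lower().split())
    score = 0.0
    for field, factor in factors.items():
        words = set(document.get(field, "").lower().split())
        score += factor * len(terms & words)
    return score


# Hypothetical documents that each mention "search" in a different field
doc_keyword_hit = {"title": "Tuning guide", "keywords": "search relevance"}
doc_content_hit = {"title": "Release notes", "main_content": "search fixes"}
doc_author_hit = {"title": "Holiday plans", "author": "search team"}

for doc in (doc_keyword_hit, doc_content_hit, doc_author_hit):
    print(doc["title"], weighted_score("search", doc))
# prints:
# Tuning guide 5.0
# Release notes 1.0
# Holiday plans 0.0
```

The same word counts for five times as much in the keywords as in the main content, and not at all in the author field.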

Taxonomy

Users of a content management system are very lucky, because most enterprise CMSes contain a taxonomy. A taxonomy is a hierarchical categorization.

[Image: Example taxonomy]

Taxonomies can be helpful in several ways when it comes to relevance ranking. By indexing the taxonomy information that is related to certain content, you can treat it as additional metadata that is also hierarchically structured. You could use this to add a factor based on the location in the hierarchy. For example, say you have a taxonomy with the following structure: Vehicles > Cars > Fiat > Fiat 500. When a document is related to ‘Fiat 500’ you could assign a factor 5 to that level, a factor 3 to ‘Fiat’, a factor 1.5 to ‘Cars’, etc. This way, when a user searches for ‘cars’ they will see a lot of cars, including the Fiat 500, and when a user searches for ‘Fiat 500’ they will almost certainly see this document in the top positions.
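The factor per taxonomy level can be sketched like this. The factors 5, 3 and 1.5 come from the example above. The default of 1.0 for levels further up (such as ‘Vehicles’), the exact-match comparison and the function name are my own assumptions to keep the sketch small.

```python
# Factors by distance from the document's own category, taken from the
# example above: own category 5, parent 3, grandparent 1.5.
LEVEL_FACTORS = [5.0, 3.0, 1.5]
DEFAULT_FACTOR = 1.0  # assumption: the example gives no value for higher levels


def taxonomy_score(query, taxonomy_path):
    """Score a document by the taxonomy levels whose name equals the query."""
    query = query.lower()
    score = 0.0
    # Walk from the document's own category up to the root
    for distance, category in enumerate(reversed(taxonomy_path)):
        if category.lower() == query:
            if distance < len(LEVEL_FACTORS):
                score += LEVEL_FACTORS[distance]
            else:
                score += DEFAULT_FACTOR
    return score


fiat_500_doc = ["Vehicles", "Cars", "Fiat", "Fiat 500"]
fiat_doc = ["Vehicles", "Cars", "Fiat"]

print(taxonomy_score("Fiat 500", fiat_500_doc))  # own category: 5.0
print(taxonomy_score("cars", fiat_500_doc))      # grandparent category: 1.5
print(taxonomy_score("Fiat", fiat_doc))          # own category: 5.0
print(taxonomy_score("Fiat", fiat_500_doc))      # parent category: 3.0
```

A search for ‘Fiat’ therefore ranks the document about Fiat in general above the document about the Fiat 500, while both still show up.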

A second way to use a taxonomy is to enable searching within result sets, or in other words: to use clustering. Clustering here is not the IT term for multiple servers; it means grouping sets of documents. Clusters can be mapped to a taxonomy, which allows you to narrow down your search based on that taxonomy. An example: let’s say you search for ‘cars’ and get many, many results. The interface could provide you with a list of brands, based on the taxonomy. The user might be interested in Fiat, and by selecting Fiat the user will only see results related to Fiat.
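Narrowing down a result set by brand could look roughly like the sketch below. The result ids, the second brand and the assumption that the brand sits at the third level of every taxonomy path are all made up for this example.

```python
from collections import defaultdict


def cluster_by_level(results, level):
    """Group result ids by the category found at one taxonomy level."""
    clusters = defaultdict(list)
    for doc_id, path in results.items():
        if len(path) > level:
            clusters[path[level]].append(doc_id)
    return dict(clusters)


# Hypothetical results for the query 'cars': id -> taxonomy path
results = {
    "doc1": ["Vehicles", "Cars", "Fiat", "Fiat 500"],
    "doc2": ["Vehicles", "Cars", "Fiat"],
    "doc3": ["Vehicles", "Cars", "Volvo"],
}

brands = cluster_by_level(results, level=2)  # level 2 holds the brand
print(sorted(brands))   # the list of brands the interface could offer
print(brands["Fiat"])   # what remains after the user selects Fiat
# prints:
# ['Fiat', 'Volvo']
# ['doc1', 'doc2']
```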

Next time

So now we know more about what influences our search results, and we even know some things we can do to influence them. Next time we put this into practice and learn how to analyze, think, act, analyze again, and so on. In other words: we learn how to tune our search engine.

Meanwhile: keep sending your thoughts and questions to martinvm [at] gx.nl




About the Author

Martin van Mierloo is Product Manager and has many years of experience with GX WebManager. Martin writes about the GX WebManager roadmap, new product features and WCMS-related topics.


© 2010 GX creative online development B.V.

Disclaimer

This website (GXdeveloperweb.com) may discuss or contain opinions, (sample) coding, software or other information that does not include GX official interfaces, instructions or guidelines and therefore is not supported by GX. Changes made based on this information are not supported.  GX will not be held liable for any damages caused by using or misusing the information, software, instructions, code or methods suggested on this website, and anyone using these methods does so at his/her own risk. GX offers no guarantees and assumes no responsibility or liability of any type with respect to the content of this website, including any liability resulting from incompatibility between the content of this website and the materials and services offered by GX. By using this website you will not hold, or seek to hold, GX responsible or liable with respect to the content of this website.