Word splitters

In LIFTI, a word splitter is a class that implements IWordSplitter. These classes break a piece of text into the individual words that should be contained in the index. There are currently three word splitters defined in the LIFTI assembly:

  • WordSplitter (Currently the default)
  • StemmingWordSplitter
  • XmlWordSplitter

Word splitters are used in two parts of the LIFTI code:

  1. During the indexing process - to split the text associated with an item into its constituent words.
  2. During the searching process - search words are also passed through a splitter, ensuring that any adjustments made to a word when it was indexed are also applied to the search criteria.

You can set the word splitter on an index using the WordSplitter property:

index.WordSplitter = new StemmingWordSplitter();

If you need different behaviour to be applied to the indexing and searching process, then you can additionally set the SearchWordSplitter property. If SearchWordSplitter is left null it defaults to the value held against the WordSplitter property, meaning that you won’t normally need to set it. The XmlWordSplitter section of this page shows an example of when you might use the SearchWordSplitter property.

WordSplitter

This is the default word splitter that a LIFTI full text index is constructed with. It breaks a string apart on whitespace. Characters that are not letters or digits are treated as whitespace, apart from apostrophes, which are simply ignored. This means that the following statement:

Simon's latest phrase is "Simon’s space-monkeys rule"

Would be broken into the following words:

  • Simons
  • latest
  • phrase
  • is
  • space
  • monkeys
  • rule

Each word is returned only once, as an instance of SplitWord, which also provides an array of integers indicating the positions in the document at which the word was encountered. For the word “Simons” above, that array would contain 0 and 4.
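
The rules above can be sketched in a few lines of C#. This is only an illustration of the described behaviour, not LIFTI's own WordSplitter code: the SplitSketch class and its Split method are hypothetical names, a plain dictionary stands in for SplitWord, and the case of each word is left untouched.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative only: mimics the splitting rules described above.
// SplitSketch is a hypothetical name, not part of the LIFTI assembly.
public static class SplitSketch
{
    // Returns each distinct word once, with the word positions at which it occurred.
    public static Dictionary<string, List<int>> Split(string text)
    {
        var words = new Dictionary<string, List<int>>();
        var current = new StringBuilder();
        var position = 0;

        // The appended space flushes the final word without special-casing it.
        foreach (var c in text + " ")
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(c);
            }
            else if (c == '\'' || c == '\u2019')
            {
                // Apostrophes (straight or typographic) are dropped
                // without ending the current word.
            }
            else if (current.Length > 0)
            {
                // Any other character acts like whitespace and ends the word.
                var word = current.ToString();
                current.Clear();

                List<int> positions;
                if (!words.TryGetValue(word, out positions))
                {
                    positions = new List<int>();
                    words.Add(word, positions);
                }
                positions.Add(position++);
            }
        }

        return words;
    }
}
```

Passing the example statement to SplitSketch.Split gives seven distinct words, with "Simons" recorded at positions 0 and 4.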

StemmingWordSplitter

The stemming word splitter breaks words apart using the same rules as the standard word splitter. Each individual word is subsequently processed using an implementation of a Porter stemmer which strips off word suffixes. For example, connection, connects and connecting all become connect.

Although stemming adds a small processing cost to each word, it makes the index more useful when the text being indexed is free-form English, as a search for one form of a word also matches its other forms and the user is more likely to get relevant results. Additionally, it has the effect of making the index smaller, as the number and length of indexed words will be reduced.

Currently only English words are officially supported, as the stemming algorithms differ between languages.

XmlWordSplitter

The XML word splitter is designed to only return words that appear in text content contained within nodes in an XML document.

Consider the following XML:

<document type="phrase">
    The quick brown <emphasis>fox</emphasis> jumped
    over the lazy <emphasis>dog</emphasis>.
</document>

When run through the XML word splitter, the following words (and associated positions) will be returned:

Word      Word index
THE       0, 6
QUICK     1
BROWN     2
FOX       3
JUMPED    4
OVER      5
LAZY      7
DOG       8

Notice that the word “the” is reported at positions 0 and 6 – the word index is relative to the first word in the document, regardless of whether there are XML elements that interrupt the flow of the text.
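
The way the word index carries on across element boundaries can be sketched with the standard System.Xml.XmlReader class. This is an assumption-laden illustration rather than LIFTI's XmlWordSplitter code: XmlSplitSketch is a hypothetical name, a crude separator list stands in for the inner word splitter, and words are upper-cased only to match the table above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Illustrative only: XmlSplitSketch is a hypothetical name,
// not part of the LIFTI assembly.
public static class XmlSplitSketch
{
    public static Dictionary<string, List<int>> Split(string xml)
    {
        var words = new Dictionary<string, List<int>>();
        var position = 0; // shared across every text node in the document

        // Crude stand-in for the word splitter applied to each chunk of text.
        var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                // Element names and attributes are skipped; only text content is split.
                if (reader.NodeType != XmlNodeType.Text &&
                    reader.NodeType != XmlNodeType.CDATA)
                {
                    continue;
                }

                var chunks = reader.Value.Split(separators, StringSplitOptions.RemoveEmptyEntries);
                foreach (var chunk in chunks)
                {
                    var word = chunk.ToUpperInvariant();

                    List<int> positions;
                    if (!words.TryGetValue(word, out positions))
                    {
                        positions = new List<int>();
                        words.Add(word, positions);
                    }
                    positions.Add(position++);
                }
            }
        }

        return words;
    }
}
```

Because the position counter lives outside the loop over nodes, the text inside each emphasis element is numbered in sequence with the text around it, and the attribute value "phrase" never reaches the word list.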

One thing to consider when using the XmlWordSplitter is that you probably won’t want to use it when searching, because search text is usually plain text rather than XML.

When you construct an XmlWordSplitter instance, you need to specify the type of word splitting behaviour that should be applied to the chunks of text within the XML document itself.

This example shows how you would configure an index to use an XmlWordSplitter that stems the words that it returns using a StemmingWordSplitter:

var wordSplitter = new StemmingWordSplitter();
index.WordSplitter = new XmlWordSplitter(wordSplitter);
index.SearchWordSplitter = wordSplitter;

This example uses the SearchWordSplitter property of the index so that indexing and searching use different word splitting behaviours: indexed text is parsed as XML and then stemmed, while search text is only stemmed. If you really wanted to search using XML text as well, then you wouldn’t need to specify SearchWordSplitter.

Last edited Jun 9, 2011 at 3:54 PM by MikeGoatly, version 3
