Word splitters
In LIFTI, a word splitter is a class that implements IWordSplitter. These classes break a piece of text into the individual words that should be contained in the index. The LIFTI assembly currently defines three word splitters:
- WordSplitter (currently the default)
- StemmingWordSplitter
- XmlWordSplitter
Word splitters are used in two parts of the LIFTI code:
- During the indexing process, to split the text associated with an item into its constituent words.
- During the searching process, where search words are also passed through a splitter. This ensures that any adjustments made to a word when it was indexed are also applied to the search criteria.
You can set the word splitter on an index using the WordSplitter property:
index.WordSplitter = new StemmingWordSplitter();
If you need different behaviour for indexing and searching, you can additionally set the SearchWordSplitter property. If SearchWordSplitter is left null, it defaults to the value of the WordSplitter property, so you won’t normally need to set it. The XmlWordSplitter section of this page shows an example of when you might use the SearchWordSplitter property.
WordSplitter
This is the default word splitter that a LIFTI full text index is constructed with. It breaks a string apart on whitespace. Characters that are not letters or digits are treated as whitespace, apart from apostrophes, which are removed without splitting the word. This means that the following statement:
Simon's latest phrase is "Simon's space-monkeys rule"
would be broken into the following words:
- Simons
- latest
- phrase
- is
- space
- monkeys
- rule
Each word is returned only once, as an instance of SplitWord, which also provides an array of integers indicating the word positions in the document at which that word was encountered. For the word “Simons” above, that array would contain 0 and 4.
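The rules above can be sketched in a few lines of C#. This is only an illustration of the described behaviour, not LIFTI's actual implementation: SplitSketch is a hypothetical helper, it returns a dictionary rather than SplitWord instances, and it leaves out any case normalisation the real splitter may apply.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative sketch only: SplitSketch is a hypothetical helper, not LIFTI's WordSplitter.
public static class SplitSketch
{
    // Returns each distinct word once, mapped to the word positions at which it occurred.
    public static Dictionary<string, List<int>> Split(string text)
    {
        var words = new Dictionary<string, List<int>>();
        var current = new StringBuilder();
        var position = 0;

        // The appended space flushes the final word.
        foreach (var c in text + " ")
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(c);
            }
            else if (c == '\'' || c == '\u2019')
            {
                // Apostrophes are dropped without ending the current word.
            }
            else if (current.Length > 0)
            {
                // Any other character acts as whitespace and ends the word.
                var word = current.ToString();
                if (!words.TryGetValue(word, out var positions))
                {
                    positions = new List<int>();
                    words[word] = positions;
                }
                positions.Add(position++);
                current.Clear();
            }
        }

        return words;
    }
}
```

Calling SplitSketch.Split on the statement above yields seven distinct words, with “Simons” recorded at positions 0 and 4.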
StemmingWordSplitter
The stemming word splitter breaks words apart using the same rules as the standard word splitter. Each individual word is subsequently processed using an implementation of a Porter stemmer, which strips off word suffixes. For example, connection, connects and connecting all become connect.
Although stemming incurs a small processing cost, it makes the index more useful when the text being indexed is free-form English, because a search for one form of a word also matches its other forms, so the user is more likely to get relevant results. It also makes the index smaller, as both the number and the length of indexed words are reduced.
Currently only English words are officially supported, as the stemming algorithms differ between languages.
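To illustrate why stemming helps searching, here is a deliberately naive suffix stripper. It is not the Porter algorithm and not LIFTI's code: StemSketch and its suffix list are hypothetical, chosen only so that the three example words above reduce to the same stem.

```csharp
using System;

// Illustrative only: a naive suffix stripper, NOT the Porter algorithm used by StemmingWordSplitter.
public static class StemSketch
{
    // Hypothetical suffix list, just large enough for the example words.
    private static readonly string[] Suffixes = { "ing", "ion", "s" };

    public static string Stem(string word)
    {
        foreach (var suffix in Suffixes)
        {
            // Require at least three remaining characters so short words are left alone.
            if (word.Length - suffix.Length >= 3 &&
                word.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
            {
                return word.Substring(0, word.Length - suffix.Length);
            }
        }

        return word;
    }
}
```

Because both indexed text and search words pass through the same reduction, a search for “connecting” ends up looking for “connect”, which is also what “connection” and “connects” were indexed as.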
XmlWordSplitter
The XML word splitter is designed to only return words that appear in text content contained within nodes in an XML document.
Consider the following XML:
<document type="phrase">
The quick brown <emphasis>fox</emphasis> jumped
over the lazy <emphasis>dog</emphasis>.
</document>
When run through the XML word splitter, the following words (and associated positions) will be returned:
| Word | Word index |
| THE | 0, 6 |
| QUICK | 1 |
| BROWN | 2 |
| FOX | 3 |
| JUMPED | 4 |
| OVER | 5 |
| LAZY | 7 |
| DOG | 8 |
Notice that the word “the” is reported at positions 0 and 6. The word index is relative to the first word in the document, regardless of whether XML elements interrupt the flow of the text.
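The position counting described here can be sketched with System.Xml.Linq. This is an assumption-laden illustration rather than LIFTI's implementation: XmlSplitSketch is a hypothetical helper, it upper-cases words to match the table, and it uses a simple letters-and-digits pattern instead of delegating to another word splitter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Illustrative sketch only: XmlSplitSketch is a hypothetical helper, not LIFTI's XmlWordSplitter.
public static class XmlSplitSketch
{
    public static Dictionary<string, List<int>> Split(string xml)
    {
        var words = new Dictionary<string, List<int>>();
        var position = 0; // shared across all text nodes, so elements do not reset it

        // Visit only text nodes, in document order; element names and attributes are skipped.
        foreach (var textNode in XDocument.Parse(xml).DescendantNodes().OfType<XText>())
        {
            // Simplification: a word is any run of letters and digits.
            foreach (Match match in Regex.Matches(textNode.Value, @"[\p{L}\p{Nd}]+"))
            {
                var word = match.Value.ToUpperInvariant();
                if (!words.TryGetValue(word, out var positions))
                {
                    positions = new List<int>();
                    words[word] = positions;
                }
                positions.Add(position++);
            }
        }

        return words;
    }
}
```

Run against the XML above, the sketch reports THE at positions 0 and 6 and DOG at position 8, and the type="phrase" attribute contributes no words. One limitation of this simplification is that a word split across an element boundary would be counted as two words.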
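One thing to consider when using the XmlWordSplitter is that you probably won’t want search text to be treated as XML, in which case you should also set the SearchWordSplitter property.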
When you construct an XmlWordSplitter instance, you need to specify the type of word splitting behaviour that should be applied to the chunks of text within the XML document itself.
This example shows how you would configure an index to use an XmlWordSplitter that stems the words that it returns using a StemmingWordSplitter:
var wordSplitter = new StemmingWordSplitter();
index.WordSplitter = new XmlWordSplitter(wordSplitter);
index.SearchWordSplitter = wordSplitter;
Here the SearchWordSplitter property of the index is used to give indexing and searching different word splitting behaviours: XML is parsed only when indexing, while stemming is applied in both cases. If you really wanted to search using XML text as well, then you wouldn’t need to specify SearchWordSplitter.