Solr Text Parsing

Documentation on the ways in which Solr parses and tokenizes text.

The basic concepts for Solr text analysis go something like this. An “analyzer” in Solr is applied to every field – both at index time, to incoming text as documents are created, and at query time, to the text the user types in when searching. An analyzer, in turn, is composed of a tokenizer – which turns a string into a stream of tokens – and zero or more filters, which then process that token stream.
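
As a concrete illustration, here is a minimal sketch of how an analyzer chain is declared in a Solr schema. The field type name “text_example” and this particular tokenizer/filter combination are hypothetical, chosen just to show the shape, not a recommendation:

    <!-- A hypothetical field type: one tokenizer, then filters applied in order. -->
    <fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- The tokenizer turns the raw string into a stream of tokens... -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- ...and each filter then transforms that stream in turn. -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
      </analyzer>
    </fieldType>

A field type can also declare separate <analyzer type="index"> and <analyzer type="query"> chains when the two sides need to differ.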

Tokenizers

The following tokenizers are available:

  1. StandardTokenizerFactory: This splits on whitespace and punctuation and discards the delimiters, except that periods not followed by whitespace are kept as part of the token (since they’re probably part of something like a web address). It splits words on UAX#29 boundaries (the fancy Unicode standard that handles text segmentation in languages that do not split on whitespace). See the example after this list.
  2. ClassicTokenizerFactory: This behaves much the same way, but without the UAX#29 word-boundary rules – so it’s still splitting on delimiters and discarding them, and it recognizes e-mail addresses and Internet hostnames as single tokens. It also has the weird behavior of, when splitting on a hyphen, looking to see whether the word contains a number, and if so, not splitting on the hyphen (so product codes like “m37-xq” stay intact).
  3. KeywordTokenizer: Does literally nothing to the field; it makes one token containing the entire input string.
  4. LetterTokenizer: Makes tokens out of all contiguous strings of letters, discarding everything else.
  5. LowerCaseTokenizer: The same as LetterTokenizer, but it also lowercases every token.
  6. WhitespaceTokenizer: Splits only on whitespace, so all punctuation stays attached to the tokens.
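
To make the differences concrete, here is the sample sentence used in the Solr reference guide, with the output of three of these tokenizers (treat the exact tokens as illustrative):

    Input:      Please, email john.doe@foo.com by 03-09, re: m37-xq.

    Standard:   [Please] [email] [john.doe] [foo.com] [by] [03] [09] [re] [m37] [xq]
    Classic:    [Please] [email] [john.doe@foo.com] [by] [03-09] [re] [m37-xq]
    Whitespace: [Please,] [email] [john.doe@foo.com] [by] [03-09,] [re:] [m37-xq.]

Note that Classic keeps the e-mail address and the number-containing hyphenated words intact, while Whitespace keeps everything, punctuation included.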

For Sciveyor, none of these tokenizers is straightforwardly ideal. The perfect answer would be a UAX#29 tokenizer without the strange period or hyphenation behavior. That doesn’t seem to exist in current versions of Solr.

That means the only way to preserve “weird” tokens that have hyphens and whatnot is to use WhitespaceTokenizer, as far as I can tell. This is therefore what we’ve chosen to do in the majority of our fields.
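
For illustration, a field type built this way might look like the following sketch (the name “text_ws_example” is hypothetical; the real definitions are covered in our Solr schema documentation):

    <fieldType name="text_ws_example" class="solr.TextField">
      <analyzer>
        <!-- Split only on whitespace, so hyphenated and other "weird" tokens survive. -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Any further cleanup (casing, word splitting) is then left to filters. -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>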

Filters

Now, filters.

  1. Stemming: There are a number of stemming filters. EnglishMinimalStemFilter removes only plurals. EnglishPossessiveFilter removes only apostrophe-s possessives. PorterStemFilter does proper Porter stemming. SnowballPorterFilter is the language-configurable version of that, implementing a newer revision of the Porter algorithm.
  2. LowerCaseFilter does what it sounds like.
  3. StopFilter removes stopwords.
  4. WordDelimiterGraphFilter is a complex Swiss-army-knife filter (a sample configuration follows this list). It can do any or all of the following, each independently configurable:
    1. Split words at CamelCase and/or hyphen delimiters ([“CamelCase hot-spot”] becomes [“Camel”, “Case”, “hot”, “spot”])
    2. Split numeric strings at delimiters ([“1947-32”] becomes [“1947”, “32”])
    3. Split words on transitions from alpha to numeric ([“BigBlaster3000”] becomes [“Big”, “Blaster”, “3000”])
    4. Split off possessive ’s endings from words
    5. Also produce tokens that result from joining together those word parts ([“hot-spot”] becomes [“hot-spot”, “hotspot”])
    6. Produce tokens that result from joining number parts ([“174-78”] becomes [“17478”])
    7. Produce tokens that result from mixed concatenation ([“Big-Blaster-3000”] becomes [“BigBlaster3000”])
    8. When creating tokens in any of the ways listed in 1-7, either preserve the original tokens or remove them (if preserving, then, [“hot-spot”] might become [“hot-spot”, “hot”, “spot”, “hotspot”])
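
As an illustration, here is one hypothetical configuration of this filter. The attribute names are Solr’s real ones; the comments map each attribute to the numbered options above (the combinations we actually use are in the schema documentation):

    <!-- splitOnCaseChange, generateWordParts : option 1 (split at CamelCase/hyphens) -->
    <!-- generateNumberParts                  : option 2 (split numeric strings)      -->
    <!-- splitOnNumerics                      : option 3 (alpha/numeric transitions)  -->
    <!-- stemEnglishPossessive                : option 4 (strip possessive 's)        -->
    <!-- catenateWords                        : option 5 (join word parts)            -->
    <!-- catenateNumbers                      : option 6 (join number parts)          -->
    <!-- catenateAll                          : option 7 (mixed concatenation)        -->
    <!-- preserveOriginal                     : option 8 (keep the original token)    -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            splitOnCaseChange="1" generateWordParts="1"
            generateNumberParts="1" splitOnNumerics="1"
            stemEnglishPossessive="1"
            catenateWords="1" catenateNumbers="0" catenateAll="0"
            preserveOriginal="1"/>

With these particular settings, [“hot-spot”] would come out as [“hot-spot”, “hot”, “spot”, “hotspot”].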

Obviously, these options can be combined in a huge number of ways. Information about the combinations currently in use in our Solr schema can be found in our Solr schema documentation.