Schema

The Sciveyor Solr schema.

Note that while we try to keep this documentation as up to date as possible, the authoritative XML source for our Solr schema can be found in our Solr Docker image repository.

Non-Text Field Types

We store a few kinds of data that don’t receive any kind of text parsing at all:

  • int: Integer value
  • date: Date value
  • string: String, stored verbatim (used for IDs, etc.)
  • strings: Multi-valued string
  • string_facet: String, stored verbatim but available for faceted queries
  • strings_facet: Multi-valued string_facet
  • text_page_numbers: A special field type for page numbers that splits them on hyphens, so that a search for either the start or the end of a page range produces results
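
To illustrate the idea behind text_page_numbers, here is a minimal Python sketch of hyphen splitting. It is not the Solr analyzer itself; the function names are hypothetical, and it assumes a plain ASCII hyphen is the only separator.

```python
def page_number_tokens(pages: str) -> list[str]:
    """Split a page range on hyphens, dropping empty pieces."""
    # Assumption: only the ASCII hyphen separates the start and end pages.
    return [part.strip() for part in pages.split("-") if part.strip()]


def matches_page(query: str, pages: str) -> bool:
    """Return True if the query equals the start or end of the range."""
    return query in page_number_tokens(pages)
```

Under these assumptions, a stored value of "123-145" can be found by searching for either 123 or 145, but not for a partial number such as 14.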

Text Field Types

There are three levels of tokenization implemented in the current Sciveyor Solr schema. (For more information about the tokenizers and filters that are discussed here, see our documentation on Solr text parsing.)

Note that we evaluated Solr’s support for OpenNLP text tagging and lemmatization, but it proved too slow for our use case: imports were taking around 1–3 seconds per document, an unacceptable amount of time for a corpus of multiple millions of articles.

  1. text_raw: Splitting only, preserving the precise original tokens, including punctuation, capitalization, etc. Breaks on whitespace and applies a sanity check, deleting all tokens longer than 1024 characters. Filter chain: WhitespaceTokenizer + LengthFilter
  2. text_clean: Same as 1, then removes English possessives, lowercases, removes stop words, and finally removes any character that is not a letter or digit (matched using the Unicode properties \p{L} and \p{N}). This is the “basic clean text” version of the corpus, and probably the most generally useful one; it is just a bag of words. Filter chain: WhitespaceTokenizer + LengthFilter + EnglishPossessiveFilter + LowerCaseFilter + StopFilter + PatternReplaceFilter [[^\p{L}\p{N}] -> ""]
  3. text_stem: Same as 1, but also splits on transitions from alpha to numeric within a word. Then performs the same extra parsing as in 2 (removing possessives, lowercasing, removing stop words), removes all characters that are not letters, and finally passes the result through the Porter stemmer. Filter chain: WhitespaceTokenizer + LengthFilter + WordDelimiterGraphFilter [stemEnglishPossessive + generateWordParts - splitOnCaseChange] + LowerCaseFilter + StopFilter + PatternReplaceFilter [\P{L} -> ""] + SnowballPorterFilter [English]
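
The first two levels can be approximated outside Solr with a short Python sketch, which may help when reasoning about what a query will match. This is only an illustration under several assumptions: the stop-word list is a tiny hypothetical stand-in for the real one, str.isalnum is used as an approximation of the \p{L}\p{N} character classes, tokens left empty after cleanup are dropped for simplicity, and the word-delimiter and stemming steps of text_stem are not reproduced.

```python
MAX_TOKEN_LENGTH = 1024  # length limit taken from the text_raw description

# Assumption: a tiny illustrative stop-word list, not the one Solr uses.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def text_raw(text: str) -> list[str]:
    """Split on whitespace, then drop tokens longer than the limit."""
    return [tok for tok in text.split() if len(tok) <= MAX_TOKEN_LENGTH]


def strip_possessive(token: str) -> str:
    """Remove a trailing 's written with a straight or curly apostrophe."""
    if len(token) >= 2 and token[-1] in "sS" and token[-2] in "'\u2019":
        return token[:-2]
    return token


def text_clean(text: str) -> list[str]:
    """Possessives, lowercasing, stop words, then keep letters and digits."""
    tokens = []
    for token in text_raw(text):
        token = strip_possessive(token).lower()
        if token in STOP_WORDS:
            continue
        # str.isalnum approximates the \p{L}\p{N} classes used in Solr.
        token = "".join(ch for ch in token if ch.isalnum())
        if token:  # simplification: skip tokens emptied by the cleanup
            tokens.append(token)
    return tokens
```

For the input "The cell's DNA, re-sequenced." this sketch keeps the four original tokens at the raw level and reduces them to cell, dna, and resequenced at the clean level. Because stop words are removed before punctuation is stripped, a token such as "the," would survive as "the" in this sketch.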

At the moment, all of these text fields only appear suffixed with _en to indicate that they are specialized for English-language content. Creating a multi-lingual corpus would require different stop-word lists and stemming choices, of course.

Fields

The fields in this schema are closely related to those in the Sciveyor JSON schema that we use to store our internal journal data, and may make more sense if you have already consulted that documentation.

First, a few fields are included for internal Solr purposes:

Field          Type         Description
_version_      int          Internal Solr document versioning
_root_         string       Nested document support
_nest_path_    _nest_path_  Nested document support
_nest_parent_  string       Nested document support

Then, the document fields themselves. For further information about many of these values, consult the JSON schema:

Field              Type               Description
schema             string             URL to document schema, currently always https://data.sciveyor.com/schema
version            int                The version number of the document schema
id                 string             Unique document identifier
doi                string
externalIds        strings
license            string
license_clean      text_clean_en      copied from license
licenseUrl         string
dataSource         string
dataSource_clean   text_clean_en      copied from dataSource
dataSourceVersion  int
type               string_facet
title              string_facet
title_clean        text_clean_en      copied from title
title_stem         text_stem_en       copied from title
name               string_facet       present only in authors
name_clean         text_clean_en      copied from name
first              string_facet       present only in authors
middle             string             present only in authors
last               string_facet       present only in authors
prefix             string             present only in authors
suffix             string             present only in authors
affiliation        string_facet       present only in authors
affiliation_clean  text_clean_en      copied from affiliation
date               date
dateElectronic     date
dateAccepted       date
dateReceived       date
journal            string_facet
journal_clean      text_clean_en      copied from journal
volume             string
number             string
pages              text_page_numbers
keywords           strings_facet
tags               strings_facet
abstract           text_raw
abstract_clean     text_clean_en      copied from abstract
abstract_stem      text_stem_en       copied from abstract
fullText           text_raw
fullText_clean     text_clean_en      copied from fullText, term vectors available
fullText_stem      text_stem_en       copied from fullText, term vectors available
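
As an illustration of how these fields might be used together, the following Python sketch builds a request URL for Solr's standard select handler: it searches the stemmed full text, requests a few stored fields, and facets on journal. The host, core name, and function name are hypothetical, and the sketch assumes that journal can be used in a filter query; the search term is passed through without any query-syntax escaping.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: hypothetical Solr host and core name, for illustration only.
SOLR_SELECT_URL = "http://localhost:8983/solr/sciveyor/select"


def build_search_url(term: str, journal: Optional[str] = None, rows: int = 10) -> str:
    """Build a select URL that searches fullText_stem and facets on journal."""
    params = [
        ("q", f"fullText_stem:{term}"),       # stemmed full-text search
        ("fl", "id,doi,title,journal,date"),  # fields to return
        ("rows", str(rows)),
        ("facet", "true"),
        ("facet.field", "journal"),
    ]
    if journal is not None:
        # Escape backslashes and quotes so the phrase stays well-formed.
        escaped = journal.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'journal:"{escaped}"'))
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"
```

Searching a _stem field trades precision for recall; pointing the q parameter at fullText_clean instead would match only the exact cleaned token.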

Notes

Note that Solr’s uniqueKey is the id value; there is thus no possibility of finding two documents in the corpus with the same value of id.

Fields described as “copied” hold the same text as their source field, tokenized in a different way, so that several pre-analyzed versions of the text are available for analysis.

The way in which Solr handles nested child documents (in our case, the records for authors that are found within each document) means that attributes of authors are also present as fields in the schema in general. They have been noted above as “present only in authors.” A “document” that represents an author is stored with type set to author.
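
The nesting described above can be pictured with a small Python sketch. The sample values, the authors key, the child id, and the parent's type value are all hypothetical; only the field names and the author type come from the schema. The query helper uses Solr's block join parent query parser to find articles through their author records, and it assumes that parent documents are exactly those without a _nest_path_ value.

```python
# Hypothetical sample record; only the field names come from the schema.
article = {
    "id": "doc-1",
    "type": "article",  # assumption: illustrative parent type value
    "title": "An Example Article",
    "authors": [        # assumption: children nested under this key
        {
            "id": "doc-1-author-1",  # assumption: children carry their own id
            "type": "author",
            "name": "Ada Lovelace",
            "first": "Ada",
            "last": "Lovelace",
        },
    ],
}


def parent_query_for_author(last_name: str) -> str:
    """Build a block join query returning parents of matching authors."""
    escaped = last_name.replace("\\", "\\\\").replace('"', '\\"')
    # Assumption: parents are the documents that have no _nest_path_ value.
    return f'{{!parent which="*:* -_nest_path_:*"}}last:"{escaped}"'
```

Since last is stored verbatim, the match in this sketch is exact and case-sensitive; a tokenized search on author names would go through name_clean instead.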