Schema
Note that while we try to keep this documentation as up to date as possible, the authoritative XML source for our Solr schema can be found in our Solr docker image repository.
Non-Text Field Types
We store a few kinds of data that don’t receive any kind of text parsing at all:
- int: Integer value
- date: Date value
- string: String, stored verbatim (used for IDs, etc.)
- strings: Multi-valued string
- string_facet: String, stored verbatim but available for faceted queries
- strings_facet: Multi-valued string_facet
- text_page_numbers: A special field type for page numbers that will split them on hyphens, to allow for searches for the start or end of a page range to produce results
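A field type like text_page_numbers could be defined along the following lines. This is a hypothetical sketch, not the authoritative definition (see the Solr docker image repository for that); the choice of tokenizer here is an assumption:

```xml
<!-- Hypothetical sketch: splits a page range such as "117-123" into the
     tokens "117" and "123", so a search for either endpoint matches. -->
<fieldType name="text_page_numbers" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- PatternTokenizerFactory splits input on the given regex (hyphens) -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="-"/>
  </analyzer>
</fieldType>
```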
Text Field Types
There are three levels of tokenization implemented in the current Sciveyor Solr schema. (For more information about the tokenizers and filters that are discussed here, see our documentation on Solr text parsing.)
Note that we evaluated Solr’s support for OpenNLP text tagging and lemmatization, but it proved to be too slow for our use case (imports were taking around 1–3 seconds per document, an unacceptable amount of time for a corpus of several million articles).
- text_raw: Splitting only, preserving the precise original tokens with punctuation, capitalization, etc. Break on whitespace and apply a sanity check that deletes all tokens longer than 1024 characters. (WhitespaceTokenizer + LengthFilter)
- text_clean: Same as text_raw, then removing English possessives, lowercasing, removing stop words, and finally removing any character that is not a letter or digit (matched using the Unicode properties \p{L} and \p{N}). This is the “basic clean text” version of the corpus, and probably the most useful in normal circumstances – it is just a bag of words. (WhitespaceTokenizer + LengthFilter + EnglishPossessiveFilter + LowerCaseFilter + StopFilter + PatternReplaceFilter [[^\p{L}\p{N}] -> ""])
- text_stem: Same as text_raw, but also splitting on transitions from alphabetic to numeric characters within a word. Then perform the same extra parsing as in text_clean (removing possessives, lowercasing, removing stop words), remove all characters that are not letters, and finally pass tokens through the Porter stemmer. (WhitespaceTokenizer + LengthFilter + WordDelimiterGraphFilter [stemEnglishPossessive + generateWordParts - splitOnCaseChange] + LowerCaseFilter + StopFilter + PatternReplaceFilter [\P{L} -> ""] + SnowballPorterFilter [English])
At the moment, all of these text field types appear only with an _en suffix, indicating that they are specialized for English-language content. Creating a multi-lingual corpus would, of course, require different stop-word lists and stemming choices.
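As an illustration, the text_clean_en chain described above could be expressed in Solr schema XML roughly as follows. This is a sketch, not the authoritative definition – attribute values such as the length limits and the stop-word file name are assumptions, and the real XML lives in our Solr docker image repository:

```xml
<!-- Sketch of the text_clean_en analyzer chain; attribute values here
     (min/max lengths, stop-word file) are illustrative assumptions. -->
<fieldType name="text_clean_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="1024"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[^\p{L}\p{N}]" replacement=""/>
  </analyzer>
</fieldType>
```

The other two chains differ only in the filters listed with each type above.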
Fields
The fields in this schema are closely related to those in the Sciveyor JSON schema that we use to store our internal journal data, and perhaps will make more sense if you have already consulted that documentation.
First, a few fields are included for internal Solr purposes:
Field | Type | Description
---|---|---
_version_ | int | Internal Solr document versioning
_root_ | string | Nested document support
_nest_path_ | _nest_path_ | Nested document support
_nest_parent_ | string | Nested document support
Then, the document fields themselves. For further information about many of these values, consult the JSON schema:
Field | Type | Description
---|---|---
schema | string | URL to document schema, currently always https://data.sciveyor.com/schema
version | int | The version number of the document schema
id | string | Unique document identifier
doi | string |
externalIds | strings |
license | string |
license_clean | text_clean_en | copied from license
licenseUrl | string |
dataSource | string |
dataSource_clean | text_clean_en | copied from dataSource
dataSourceVersion | int |
type | string_facet |
title | string_facet |
title_clean | text_clean_en | copied from title
title_stem | text_stem_en | copied from title
name | string_facet | present only in authors
name_clean | text_clean_en | copied from name
first | string_facet | present only in authors
middle | string | present only in authors
last | string_facet | present only in authors
prefix | string | present only in authors
suffix | string | present only in authors
affiliation | string_facet | present only in authors
affiliation_clean | text_clean_en | copied from affiliation
date | date |
dateElectronic | date |
dateAccepted | date |
dateReceived | date |
journal | string_facet |
journal_clean | text_clean_en | copied from journal
volume | string |
number | string |
pages | text_page_numbers |
keywords | strings_facet |
tags | strings_facet |
abstract | text_raw |
abstract_clean | text_clean_en | copied from abstract
abstract_stem | text_stem_en | copied from abstract
fullText | text_raw |
fullText_clean | text_clean_en | copied from fullText, term vectors available
fullText_stem | text_stem_en | copied from fullText, term vectors available
Notes
Note that Solr’s uniqueKey is the id value; it is therefore impossible for two documents in the corpus to share the same value of id.
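In the schema XML, this is a one-line declaration:

```xml
<uniqueKey>id</uniqueKey>
```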
Fields described as “copied” are available in multiple versions, each tokenized in a different way, providing several pre-analyzed versions of the same text for analysis.
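In Solr, such copies are produced with copyField directives; the following sketch shows the idea for the title field (illustrative only – the full set of rules is in the schema XML in our docker image repository):

```xml
<!-- Copy the raw title into two differently-analyzed companion fields -->
<copyField source="title" dest="title_clean"/>
<copyField source="title" dest="title_stem"/>
```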
The way in which Solr handles nested child documents (in our case, the records for authors found within each document) means that author attributes are also present as fields in the schema at large; these have been noted above as “present only in authors.” A “document” that represents an author is stored with type set to author.
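To make the nesting concrete, a document with one author child might look roughly like this when indexed. The specific field values and the name of the child-document key are illustrative assumptions; only the use of type set to author is fixed by the schema:

```json
{
  "id": "example-article-1",
  "type": "article",
  "title": "An Example Article",
  "authors": [
    {
      "id": "example-article-1/author-1",
      "type": "author",
      "first": "Ada",
      "last": "Lovelace"
    }
  ]
}
```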