[1. Units annotated] [2. ANNIS and TXM versions]
[3. Text in other languages]
1. Units annotated
There are four basic annotated units in the OGR:
- Segments: A single phonological segment. See Annotation 2: Segments.
- Words: A syntactically independent sequence of Segments.
Clitics are treated as Words. See Annotation 3: Words.
- Syllables: A sequence of Segments consisting minimally of a vocalic nucleus.
Since Words are defined syntactically rather than prosodically, a Syllable can
contain Segments from more than one Word.
See Annotation 4: Syllables and meter.
- Lines: A sequence of both Words and Syllables forming a line of verse.
Units larger than the Word (e.g. laisses or paragraphs, manuscript pagination) are
encoded using TEI markup.
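The relation between the units can be sketched as spans over a shared sequence of Segments. This is an illustration only, not the OGR's actual data format: the `Span` class, the `overlapping` helper and the example "l'ami" are invented here, and plain letters stand in for phonological segments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    layer: str  # "Word", "Syllable" or "Line"
    start: int  # index of the first Segment covered (inclusive)
    end: int    # index just past the last Segment covered (exclusive)


# Invented example, not taken from the corpus: "l'ami".
segments = ["l", "a", "m", "i"]
words = [Span("Word", 0, 1), Span("Word", 1, 4)]              # l' | ami
syllables = [Span("Syllable", 0, 2), Span("Syllable", 2, 4)]  # la | mi
line = Span("Line", 0, 4)


def overlapping(span, others):
    """Return the spans in `others` that share at least one Segment with `span`."""
    return [o for o in others if o.start < span.end and span.start < o.end]


# The first Syllable ("la") takes Segments from both Words,
# because the clitic l' counts as a Word of its own.
first_syllable_words = overlapping(syllables[0], words)
```

Here `first_syllable_words` holds both Words, while the second Syllable overlaps only the second Word.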
2. ANNIS and TXM versions
Annotation differs between the ANNIS and TXM versions.
- ANNIS supports multiple layers of tokenization, and all units are represented as separate spans.
- TXM supports only word-level tokenization, and all annotation is (as far as possible) realized as word-level tags.
This annotation guide refers to the ANNIS version unless otherwise specified.
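The difference can be illustrated by flattening span annotation into word-level tags. The sketch below is not the conversion used to build the TXM version; the function name, the `syllables` tag and the example spans are assumptions made for illustration.

```python
def word_level_tags(words, syllables):
    """Give each Word a tag listing the Syllables it overlaps.

    `words` and `syllables` are lists of (start, end) Segment offsets,
    with `end` exclusive. Syllables are numbered from 1. The tag name
    "syllables" is hypothetical.
    """
    tags = []
    for w_start, w_end in words:
        ids = [
            str(number)
            for number, (s_start, s_end) in enumerate(syllables, start=1)
            if s_start < w_end and w_start < s_end
        ]
        tags.append({"syllables": ",".join(ids)})
    return tags


# Invented example: Words l' | ami, Syllables la | mi.
tags = word_level_tags(words=[(0, 1), (1, 4)], syllables=[(0, 2), (2, 4)])
```

Each Word now carries the numbers of the Syllables it touches, so a Syllable shared by two Words appears in the tags of both.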
3. Text in other languages
Passages which are not written in Gallo-Romance, most of which are in
Latin, are not annotated, with the following exceptions:
- All Words in verse texts are phonologically and metrically annotated.
A simplified transcription is used for Latin words.
- Borrowed Words within clauses written in Romance are POS-tagged
and lemmatized.
- In Jonas, some Words written in shorthand are transcribed in
Latin orthography but interpreted as Romance words. These are POS-tagged
and lemmatized as if they were Romance terms.
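The exceptions above can be summarized as a small decision function. This is a reading aid under stated assumptions, not part of the OGR tooling: the function name, the parameter names and the layer labels are invented here.

```python
def layers_for_non_romance_word(in_verse, borrowed_in_romance_clause,
                                jonas_shorthand_as_romance):
    """Return the annotation layers applied to a Word in a non-Romance passage.

    All names and layer labels are hypothetical. An empty set means the
    Word is left unannotated.
    """
    layers = set()
    if in_verse:
        # Verse texts: every Word gets phonological and metrical annotation
        # (Latin words with a simplified transcription).
        layers |= {"phonology", "meter"}
    if borrowed_in_romance_clause or jonas_shorthand_as_romance:
        # Borrowed Words in Romance clauses, and Jonas shorthand read as
        # Romance, are POS-tagged and lemmatized.
        layers |= {"pos", "lemma"}
    return layers
```

For example, a Latin Word in a prose passage that meets none of the conditions yields an empty set, while a Latin Word in verse yields the phonological and metrical layers only.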