Commit 525cee4

Add README files.
1 parent cdea4ed commit 525cee4

5 files changed, 160 insertions(+), 164 deletions(-)

README.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+# Syntacticus treebank data
+
+Raw annotated data for the treebanks in the Syntacticus collection.
+
+Releases of the collection are hosted on
+[Github](https://github.com/syntacticus/syntacticus-treebank-data).
+
+## Data formats
+
+The texts in the collection are available in two formats:
+
+1. PROIEL XML: These files are the authoritative source files and the only ones
+that contain all available annotation. They contain the complete morphological,
+syntactic and information-structure annotation, as well as the complete text,
+including punctuation, section headers etc. The schema is defined in
+[`proiel.xsd`](https://github.com/syntacticus/syntacticus-treebank-data/blob/master/proiel.xsd).
+
+2. [CoNLL-X format](http://nextens.uvt.nl/depparse-wiki/DataFormat)
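
For readers who want to work with the second format listed in the README above, the following is a minimal sketch of a CoNLL-X reader. It assumes the standard CoNLL-X layout (ten tab-separated fields per token, with a blank line between sentences). The `read_conllx` helper and the two sample tokens are hypothetical illustrations and are not taken from the repository.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-X shared-task format.
FIELDS = ["id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel"]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dicts."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} columns, got {len(columns)}: {line!r}"
            )
        sentence.append(dict(zip(FIELDS, columns)))
    # Emit the last sentence if the input lacks a trailing blank line.
    if sentence:
        yield sentence


# Hypothetical two-token sample; tags and relations are placeholders.
SAMPLE = (
    "1\tGallia\tGallia\tN\tNe\t_\t2\tsub\t_\t_\n"
    "2\test\tsum\tV\tV-\t_\t0\tpred\t_\t_\n"
    "\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
```

To read a real file, pass an open file handle instead of `SAMPLE.splitlines()`. Because the README states that only the PROIEL XML files contain all available annotation, a reader like this should be treated as a convenience for dependency-level work rather than a complete view of the data.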

iswoc/README.md

Lines changed: 28 additions & 47 deletions
@@ -1,59 +1,40 @@
-The ISWOC Treebank
-==================
+## The ISWOC Treebank
 
 The _ISWOC Treebank_ is a dependency treebank with morphosyntactic and
-information-structure annotation. It includes texts in several older
-Indo-European languages and is freely available under a [Creative Commons
-Attribution-NonCommercial-ShareAlike 3.0 License](
-http://creativecommons.org/licenses/by-nc-sa/3.0/us/).
+information-structure annotation.
+
+It includes texts in several older Indo-European languages and is freely
+available under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0
+License](https://creativecommons.org/licenses/by-nc-sa/4.0/).
 
 Please cite as
 
 > Bech, Kristin and Kristine Eide. 2014. The ISWOC corpus. Department of Literature, Area Studies and European Languages, University of Oslo. http://iswoc.github.com.
 
-Releases of the ISWOC Treebank are hosted on
-[Github](https://github.com/iswoc/iswoc-treebank).
+Please see the XML files for detailed metadata and a full list of contributors.
 
-Contents
---------
+### Contents
 
 The following texts are included in this release of the treebank:
 
+(The _size_ column in the table below shows the number of annotated tokens in a
+text. The number of tokens will be slightly larger than the number of words in
+the original printed edition as some words have been split into multiple tokens
+and some tokens have been inserted during annotation.)
 Text | Language | Filename | Size
----- | -------- | -------- | ----
-Ælfric's Lives of Saints | Old English | æls | 3137 tokens
-Apollonius of Tyre | Old English | apt | 5541 tokens
-Anglo-Saxon Chronicles | Old English | chrona | 5939 tokens
-Orosius | Old English | or | 1728 tokens
-West-Saxon Gospels | Old English | wscp | 13061 tokens
-La Vie Saint Eustace | Old French | eustace | 2340 tokens
-Crónica Geral de Espanha 2-12 | Portuguese | cge1 | 12074 tokens
-Crónica Geral de Espanha 155-167 | Portuguese | cge2 | 10547 tokens
-Décadas Livro 5, VIII, 9-14 | Portuguese | coutdec-v-8 | 13794 tokens
-Crónica de Alfonso XI | Spanish | alfonso-xi | 7942 tokens
-Crónica de España | Spanish | ce | 4627 tokens
-El Conde Lucanor | Spanish | cdeluc | 17551 tokens
-Estoria de Espanna I | Spanish | ee1 | 9488 tokens
-General Estoria parte IV Daniel | Spanish | ge4 | 9233 tokens
-Libro delos claros varones | Spanish | varones | 5820 tokens
-
-
-(The 'size' column in the table above shows the number of annotated tokens in
-a text. The number of tokens will be slightly larger than the number of words
-in the original printed edition as some words have been split into multiple
-tokens and some tokens have been inserted during annotation.)
-
-Please see the XML files for detailed metadata and a full list of contributors.
-
-Data formats
-------------
-
-The texts are available on two formats:
-
-1. PROIEL XML: These files are the authoritative source files and the only ones
-that contain all available annotation. They contain the complete morphological,
-syntactic and information-structure annotation, as well as the complete text,
-including punctuation, section headers etc. The schema is defined in
-[`proiel.xsd`](https://github.com/proiel/proiel-treebank/blob/master/proiel.xsd).
-
-2. [CoNLL-X format](http://nextens.uvt.nl/depparse-wiki/DataFormat)
+----------------------------------------------------|---------------------|-------------|---------------
+Ælfric's Lives of Saints | Old English | æls | 3,137 tokens
+Crónica de Alfonso XI | Spanish | alfonso-xi | 7,941 tokens
+Apollonius of Tyre | Old English | apt | 5,541 tokens
+El Conde Lucanor | Spanish | cdeluc | 17,553 tokens
+Crónica de España | Spanish | ce | 4,627 tokens
+Crónica Geral de Espanha 2-12 (ed. Lindley 1951) | Portuguese | cge1 | 12,074 tokens
+Crónica Geral de Espanha 155-167 (ed. Lindley 1951) | Portuguese | cge2 | 10,547 tokens
+Anglo-Saxon Chronicles | Old English | chrona | 5,939 tokens
+Décadas Livro 5, VIII, 9-14 (ed. 1. 1947) | Portuguese | coutdec-v-8 | 13,974 tokens
+Estoria de Espanna I | Spanish | ee1 | 9,488 tokens
+La Vie Saint Eustace | Old French | eustace | 2,340 tokens
+General Estoria parte IV Daniel | Spanish | ge4 | 9,289 tokens
+Orosius | Old English | or | 1,728 tokens
+Libro delos claros varones | Spanish | varones | 5,820 tokens
+West-Saxon Gospels | Old English | wscp | 13,061 tokens

menotec/README.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+## Menotec
+
+### Contents
+
+The following texts are included in this release of the treebank:
+
+(The _size_ column in the table below shows the number of annotated tokens in a
+text. The number of tokens will be slightly larger than the number of words in
+the original printed edition as some words have been split into multiple tokens
+and some tokens have been inserted during annotation.)
+Text | Language | Filename | Size
+----------------------------------------------------|---------------------|-------------|---------------
+Konungs skuggsjá (in AM 243 bα fol, Old Norw., ca. 1275) (ed. Holm-Olsen 1945) | Old Norse | am243 | 44 tokens
+The Old Norwegian homily book (in AM 619 4to, Old Norw., ca. 1200-1225) (ed. Indrebø 1931) | Old Norse | hom | 60,822 tokens
+Landslǫg Magnúss Hákonarsónar (in Holm perg 34 4to, Old Norw., ca. 1275) | Old Norse | mll | 56,889 tokens
+Óláfs saga ins helga (in Upps DG 8 II, Old Norw., ca. 1225-1250) (ed. Johnsen 1922) | Old Norse | olavssaga | 42,830 tokens
+Pamphilus saga (in Upps DG 4-7, Old Norw., ca. 1270) | Old Norse | pamphilus | 4,254 tokens
+Strengleikar (in Upps DG 4-7, Old Norw., ca. 1270) (ed. Keyser 1850) | Old Norse | strleik | 38,549 tokens

proiel/README.md

Lines changed: 29 additions & 45 deletions
@@ -1,46 +1,20 @@
-The PROIEL Treebank
-===================
+## The PROIEL Treebank
 
 The _PROIEL Treebank_ is a dependency treebank with morphosyntactic and
-information-structure annotation. It includes texts in several ancient
-Indo-European languages and is freely available under a [Creative Commons
-Attribution-NonCommercial-ShareAlike 3.0 License](
-http://creativecommons.org/licenses/by-nc-sa/3.0/us/).
+information-structure annotation.
+
+It includes texts in several ancient Indo-European languages and is freely
+available under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0
+License](https://creativecommons.org/licenses/by-nc-sa/4.0/).
 
 Please cite as
 
 > Dag T. T. Haug and Marius L. Jøhndal. 2008. 'Creating a Parallel Treebank of the Old Indo-European Bible Translations'. In Caroline Sporleder and Kiril Ribarov (eds.). Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008) (2008), pp. 27-34.
 
-Releases of the PROIEL Treebank are hosted on
-[Github](https://github.com/proiel/proiel-treebank).
-
-Contents
---------
-
-The following texts are included in this release of the treebank:
-
-Text | Language | Filename | Size
-----------------------------------------------------|---------------------|-------------|---------------
-The Greek New Testament (ed. Tischendorf 1869) | Ancient Greek | greek-nt | 140,763 tokens
-The Armenian New Testament (ed. Künzle 1984) | Classical Armenian | armenian-nt | 23,513 tokens
-The Gothic Bible (ed. Streitberg 1919) | Gothic | gothic-nt | 57,211 tokens
-Codex Marianus (ed. Jagić 1883) | Old Church Slavonic | marianus | 58,269 tokens
-Jerome's Vulgate | Latin | latin-nt | 112,454 tokens
-Caesar, Commentarii belli Gallici (ed. Holmes 1914) | Latin | caes-gal | 28,607 tokens
-Cicero, De officiis (ed. Miller 1913) | Latin | cic-off | 10,644 tokens
-Cicero, Epistulae ad Atticum (ed. Purser 1901) | Latin | cic-att | 42,855 tokens
-Palladius, Opus agriculturae (ed. Schmitt 1898) | Latin | pal-agr | 12,148 tokens
-Peregrinatio Aetheriae (ed. Heraeus 1908) | Latin | per-aeth | 18,356 tokens
-Herodotus, Histories (ed. Godley 1920) | Ancient Greek | hdt | 85,080 tokens
-Sphrantzes, Chronicles (post-1453) (ed. Grecu 1966) | Ancient Greek | chron | 24,612 tokens
-
-(The 'size' column in the table above shows the number of annotated tokens in
-a text. The number of tokens will be slightly larger than the number of words
-in the original printed edition as some words have been split into multiple
-tokens and some tokens have been inserted during annotation.)
-
 Please see the XML files for detailed metadata and a full list of contributors.
 
+### Completeness
+
 Some sentences have not yet been annotated. This is an overview of where in the
 texts unannotated sentences occur:
 
@@ -64,17 +38,27 @@ Sections or section ranges in which there are gaps:
 * `marianus`: MATT 5, MARK 16, LUKE 2, LUKE 24, JOHN 1-2, JOHN 18, JOHN 20
 * `pal-agr`: 1.4-1.12, 1.35-1.40, 2.3, 2.9-2.23, 3.9-3.10
 
-These gaps will be completed in future releases.
+These gaps may be closed in future releases.
 
-Data formats
-------------
+### Contents
 
-The texts are available on two formats:
-
-1. PROIEL XML: These files are the authoritative source files and the only ones
-that contain all available annotation. They contain the complete morphological,
-syntactic and information-structure annotation, as well as the complete text,
-including punctuation, section headers etc. The schema is defined in
-[`proiel.xsd`](https://github.com/proiel/proiel-treebank/blob/master/proiel.xsd).
+The following texts are included in this release of the treebank:
 
-2. [CoNLL-X format](http://nextens.uvt.nl/depparse-wiki/DataFormat)
+(The _size_ column in the table below shows the number of annotated tokens in a
+text. The number of tokens will be slightly larger than the number of words in
+the original printed edition as some words have been split into multiple tokens
+and some tokens have been inserted during annotation.)
+Text | Language | Filename | Size
+----------------------------------------------------|---------------------|-------------|---------------
+The Armenian New Testament (ed. Künzle 1984) | Classical Armenian | armenian-nt | 23,513 tokens
+Commentarii belli Gallici (ed. Holmes 1914) | Latin | caes-gal | 28,657 tokens
+Chronicles (post-1453) (ed. Grecu 1966) | Ancient Greek | chron | 24,612 tokens
+Epistulae ad Atticum (ed. Purser 1901) | Latin | cic-att | 47,528 tokens
+De officiis (ed. Miller 1913) | Latin | cic-off | 11,995 tokens
+The Gothic Bible (ed. Streitberg 1919) | Gothic | gothic-nt | 57,212 tokens
+The Greek New Testament (ed. Tischendorf 1869) | Ancient Greek | greek-nt | 140,773 tokens
+Histories (ed. Godley 1920) | Ancient Greek | hdt | 85,166 tokens
+Jerome's Vulgate | Latin | latin-nt | 112,454 tokens
+Codex Marianus (ed. Jagić 1883) | Church Slavic | marianus | 64,138 tokens
+Opus agriculturae (ed. Schmitt 1898) | Latin | pal-agr | 12,148 tokens
+Peregrinatio Aetheriae (ed. Heraeus 1908) | Latin | per-aeth | 18,356 tokens
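
The _size_ column in the contents tables above stores token counts as text such as `23,513 tokens`. As a rough sketch of how those figures could be read programmatically, the snippet below parses pipe-delimited rows and totals the counts. The `parse_sizes` helper is hypothetical, and the sample reuses only two rows from the PROIEL table, so the total is for those two texts, not for the whole treebank.

```python
from typing import Dict


def parse_sizes(table: str) -> Dict[str, int]:
    """Map each filename in a README contents table to its token count."""
    sizes: Dict[str, int] = {}
    for line in table.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the header row, the separator row, and any non-table line.
        if len(cells) != 4 or not cells[3].endswith("tokens"):
            continue
        filename = cells[2]
        # "23,513 tokens" -> "23,513" -> 23513
        sizes[filename] = int(cells[3].split()[0].replace(",", ""))
    return sizes


# Header, separator, and two rows copied from the PROIEL contents table.
TABLE = """\
Text | Language | Filename | Size
-----|----------|----------|-----
The Armenian New Testament (ed. Künzle 1984) | Classical Armenian | armenian-nt | 23,513 tokens
Jerome's Vulgate | Latin | latin-nt | 112,454 tokens
"""

sizes = parse_sizes(TABLE)
total = sum(sizes.values())
```

The helper assumes that no cell contains a literal `|` character, which holds for the rows shown in this commit.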
