- The PROIEL Treebank
- ===================
+ ## The PROIEL Treebank

The _PROIEL Treebank_ is a dependency treebank with morphosyntactic and
- information-structure annotation. It includes texts in several ancient
- Indo-European languages and is freely available under a [Creative Commons
- Attribution-NonCommercial-ShareAlike 3.0 License](
- http://creativecommons.org/licenses/by-nc-sa/3.0/us/).
+ information-structure annotation.
+
+ It includes texts in several ancient Indo-European languages and is freely
+ available under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0
+ License](https://creativecommons.org/licenses/by-nc-sa/4.0/).

Please cite as

> Dag T. T. Haug and Marius L. Jøhndal. 2008. 'Creating a Parallel Treebank of the Old Indo-European Bible Translations'. In Caroline Sporleder and Kiril Ribarov (eds.). Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008) (2008), pp. 27-34.

- Releases of the PROIEL Treebank are hosted on
- [Github](https://github.com/proiel/proiel-treebank).
-
- Contents
- --------
-
- The following texts are included in this release of the treebank:
-
- Text | Language | Filename | Size
- ----------------------------------------------------|---------------------|-------------|---------------
- The Greek New Testament (ed. Tischendorf 1869) | Ancient Greek | greek-nt | 140,763 tokens
- The Armenian New Testament (ed. Künzle 1984) | Classical Armenian | armenian-nt | 23,513 tokens
- The Gothic Bible (ed. Streitberg 1919) | Gothic | gothic-nt | 57,211 tokens
- Codex Marianus (ed. Jagić 1883) | Old Church Slavonic | marianus | 58,269 tokens
- Jerome's Vulgate | Latin | latin-nt | 112,454 tokens
- Caesar, Commentarii belli Gallici (ed. Holmes 1914) | Latin | caes-gal | 28,607 tokens
- Cicero, De officiis (ed. Miller 1913) | Latin | cic-off | 10,644 tokens
- Cicero, Epistulae ad Atticum (ed. Purser 1901) | Latin | cic-att | 42,855 tokens
- Palladius, Opus agriculturae (ed. Schmitt 1898) | Latin | pal-agr | 12,148 tokens
- Peregrinatio Aetheriae (ed. Heraeus 1908) | Latin | per-aeth | 18,356 tokens
- Herodotus, Histories (ed. Godley 1920) | Ancient Greek | hdt | 85,080 tokens
- Sphrantzes, Chronicles (post-1453) (ed. Grecu 1966) | Ancient Greek | chron | 24,612 tokens
-
- (The 'size' column in the table above shows the number of annotated tokens in
- a text. The number of tokens will be slightly larger than the number of words
- in the original printed edition as some words have been split into multiple
- tokens and some tokens have been inserted during annotation.)
-
Please see the XML files for detailed metadata and a full list of contributors.

+ ### Completeness
+
Some sentences have not yet been annotated. This is an overview of where in the
texts unannotated sentences occur:

@@ -64,17 +38,27 @@ Sections or section ranges in which there are gaps:
* `marianus`: MATT 5, MARK 16, LUKE 2, LUKE 24, JOHN 1-2, JOHN 18, JOHN 20
* `pal-agr`: 1.4-1.12, 1.35-1.40, 2.3, 2.9-2.23, 3.9-3.10

- These gaps will be completed in future releases.
+ These gaps may be closed in future releases.

- Data formats
- ------------
+ ### Contents

- The texts are available on two formats:
-
- 1. PROIEL XML: These files are the authoritative source files and the only ones
- that contain all available annotation. They contain the complete morphological,
- syntactic and information-structure annotation, as well as the complete text,
- including punctuation, section headers etc. The schema is defined in
- [`proiel.xsd`](https://github.com/proiel/proiel-treebank/blob/master/proiel.xsd).
+ The following texts are included in this release of the treebank:

- 2. [CoNLL-X format](http://nextens.uvt.nl/depparse-wiki/DataFormat)
+ (The _size_ column in the table below shows the number of annotated tokens in a
+ text. The number of tokens will be slightly larger than the number of words in
+ the original printed edition as some words have been split into multiple tokens
+ and some tokens have been inserted during annotation.)
+ Text | Language | Filename | Size
+ ----------------------------------------------------|---------------------|-------------|---------------
+ The Armenian New Testament (ed. Künzle 1984) | Classical Armenian | armenian-nt | 23,513 tokens
+ Commentarii belli Gallici (ed. Holmes 1914) | Latin | caes-gal | 28,657 tokens
+ Chronicles (post-1453) (ed. Grecu 1966) | Ancient Greek | chron | 24,612 tokens
+ Epistulae ad Atticum (ed. Purser 1901) | Latin | cic-att | 47,528 tokens
+ De officiis (ed. Miller 1913) | Latin | cic-off | 11,995 tokens
+ The Gothic Bible (ed. Streitberg 1919) | Gothic | gothic-nt | 57,212 tokens
+ The Greek New Testament (ed. Tischendorf 1869) | Ancient Greek | greek-nt | 140,773 tokens
+ Histories (ed. Godley 1920) | Ancient Greek | hdt | 85,166 tokens
+ Jerome's Vulgate | Latin | latin-nt | 112,454 tokens
+ Codex Marianus (ed. Jagić 1883) | Church Slavic | marianus | 64,138 tokens
+ Opus agriculturae (ed. Schmitt 1898) | Latin | pal-agr | 12,148 tokens
+ Peregrinatio Aetheriae (ed. Heraeus 1908) | Latin | per-aeth | 18,356 tokens