This package contains utility functions for importing into R corpora in various formats, namely interlinearized corpora or dictionaries produced by descriptive linguistics software such as SIL Toolbox or SIL Fieldworks.
All functions that read interlinearized texts return a list of data frames, where each data frame corresponds to a unit (text, sentence, word, morpheme) and each row describes an occurrence of that unit. The set of tables is relational: in each data frame, some columns hold IDs pointing to rows in the other data frames, so you can join morphemes to the words, sentences or texts they belong to.
This pivot format allows for various kinds of reshaping in R (for instance, grouping morphemes by words) as well as conversion between formats.
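As a rough illustration of the kind of reshaping mentioned above, the following sketch groups morphemes by word. It does not call the package: the `morphemes` table here is a small hand-built stand-in (its column names `txt` and `gls` are simplified assumptions), so it only shows the general idea using base R.

```r
# Toy table mimicking the relational structure described above
# (hand-built for illustration, not read from a real corpus).
morphemes <- data.frame(
  word_id    = c(1, 2, 3, 3, 3),
  morphem_id = 1:5,
  txt        = c("a", "otoiso", "naham", "-we", "-lo"),
  gls        = c("I", "tomorrow", "Naham", "-MASC.SING", "COMITATIVE"),
  stringsAsFactors = FALSE
)

# Group morphemes by word: one row per word, with the forms and
# glosses of its morphemes pasted together, separated by spaces.
by_word <- aggregate(
  morphemes[, c("txt", "gls")],
  by  = list(word_id = morphemes$word_id),
  FUN = paste, collapse = " "
)
by_word
```

The same pattern should apply to the real tables, provided the grouping column is the ID that points to the larger unit.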
EMELD is an XML vocabulary introduced in Baden Hughes, Steven Bird and Catherine Bow, Encoding and Presenting Interlinear Text Using XML Technologies [http://www.aclweb.org/anthology/U03-1008]. It is used by SIL Fieldworks as an export format.
corpuspath <- system.file("exampleData", "tuwariInterlinear.xml", package="interlineaR")
corpus <- read.emeld(corpuspath, vernacular.languages="tww")
The returned object is a named list. Each slot of the list contains a data.frame. They are named ‘morphemes’, ‘words’, ‘sentences’ and ‘texts’ (unless some of them have been discarded through the function arguments). Each row in a data frame describes an occurrence of a linguistic unit (text, sentence, word or morpheme).
Let’s look at the first rows of the “morphemes” data.frame:
head(corpus$morphemes)
text_id | sentence_id | word_id | morphem_id | type | txt.tww | cf.tww | gls.en | msa.en | hn.en |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | stem | a | a | I | pers | 2 |
1 | 1 | 2 | 2 | stem | otoiso | otoiso | tomorrow | adv | NA |
1 | 1 | 3 | 3 | stem | naham | naham | Naham | n_Npr | NA |
1 | 1 | 3 | 4 | suffix | -we | -we | -MASC.SING | n_Npr:(NounGenderNumber) | NA |
1 | 1 | 3 | 5 | suffix | -lo | -lo | COMITATIVE | n | 3 |
1 | 1 | 4 | 6 | stem | na | na | to_find | v | NA |
The first columns contain IDs (indicating which text, sentence or word each morpheme belongs to). The other columns contain information extracted from the document. Each column name is made of the field name and the language of the field, separated by a dot. (The parameters of read.emeld allow you to indicate which fields, and in which language(s), you are interested in for each unit.) Each field may be repeated in different languages.
The “words”, “sentences” and “texts” tables are built according to the same principles:
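Since the language code is the part after the last dot, column names can be taken apart with ordinary string functions. The sketch below uses column names copied from the tables in this document, plus base R only; it is an illustration, not a helper provided by the package.

```r
# Column names as they appear in the tables of this document.
cols <- c("txt.tww", "cf.tww", "gls.en", "msa.en", "title.abbreviation.en")

# Split each name at its last dot: field name before, language code after.
field    <- sub("\\.[^.]*$", "", cols)
language <- sub("^.*\\.", "", cols)

data.frame(column = cols, field = field, language = language)

# Keep only the columns in a given language, e.g. English.
english_cols <- cols[language == "en"]
```

Splitting at the last dot rather than the first keeps field names that themselves contain a dot, such as `title.abbreviation`, in one piece.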
head(corpus$words)
text_id | sentence_id | word_id | txt.tww | gls.en | pos.en |
---|---|---|---|---|---|
1 | 1 | 1 | a | NA | NA |
1 | 1 | 2 | otoiso | NA | NA |
1 | 1 | 3 | nahamwelo | NA | NA |
1 | 1 | 4 | na | NA | NA |
1 | 1 | 5 | balusesapo | NA | NA |
1 | 1 | 6 | holotuafemamo | NA | NA |
head(corpus$sentences)
text_id | sentence_id | note.en | segnum.en | gls.en | lit.en |
---|---|---|---|---|---|
1 | 1 | [mainaimwii] -> / malenaimwii/; gros problème : -mwelo, -welo, -we-lo… ? | 1 | I, tomorrow, next week, salim i kam, salim tok i go | NA |
2 | 2 | Cahier : wefemo | 1 | we, today, young men, men, went to work. The work done, at two o’clock, we go to the garden cleaning ground(?), then the night we come back to the house. | NA |
3 | 3 | | 1.1 | Listen! | NA |
3 | 4 | ; | 1.2 | I went downstream with a dog. | NA |
3 | 5 | upaoma - akiapmin. | 1.3 | Downstream, on Tepeso, I saw a crocodile, sleeping deep inside the water. | NA |
3 | 6 | | 1.4 | I shoot the crocodile with a spear on the top of the neck and I get him. | NA |
head(corpus$texts)
text_id | title.en | title.abbreviation.en | source.en | comment.en |
---|---|---|---|---|
1 | 141104_01_T2 (correction dans 2015.III.S18) | 2014T2 | NA | NA |
2 | 141104_02_T3 A day working on the airstrip (correction dans 2015.III.S18) | 2014T3 | NA | NA |
3 | 141104_03_T4 How Martin Sipamo killed a crocodile (correction dans 2015.III.S18) | 2014T4 | NA | NA |
4 | 141105_01_T5 | 2014T5 | NA | NA |
5 | 141105_01_T5com | 2014T5com | NA | NA |
6 | 141105_03_T6 Hurry, the night is coming | 2014T6 | NA | NA |
This set of data frames is similar to the tables of a database: rows from the various tables point to each other through IDs.
Using these IDs, you can build new data frames that combine information from several tables. For instance, a table containing the columns of both the morphemes and the words can be built using:
morphemes_words <- merge(corpus$morphemes, corpus$words[,-c(1,2)], by="word_id", suffixes = c(".morpheme",".word"))
head(morphemes_words)
word_id | text_id | sentence_id | morphem_id | type | txt.tww.morpheme | cf.tww | gls.en.morpheme | msa.en | hn.en | txt.tww.word | gls.en.word | pos.en |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | stem | a | a | I | pers | 2 | a | NA | NA |
2 | 1 | 1 | 2 | stem | otoiso | otoiso | tomorrow | adv | NA | otoiso | NA | NA |
3 | 1 | 1 | 3 | stem | naham | naham | Naham | n_Npr | NA | nahamwelo | NA | NA |
3 | 1 | 1 | 4 | suffix | -we | -we | -MASC.SING | n_Npr:(NounGenderNumber) | NA | nahamwelo | NA | NA |
3 | 1 | 1 | 5 | suffix | -lo | -lo | COMITATIVE | n | 3 | nahamwelo | NA | NA |
4 | 1 | 1 | 6 | stem | na | na | to_find | v | NA | na | NA | NA |
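The same approach can presumably be carried one level up, attaching each morpheme to its sentence. The sketch below uses two small hand-built tables (their contents are shortened stand-ins, not real output) and merges on both `text_id` and `sentence_id`, so that the shared key columns are not duplicated.

```r
# Hand-built stand-ins for corpus$morphemes and corpus$sentences.
morphemes <- data.frame(
  text_id     = c(1, 1, 2),
  sentence_id = c(1, 1, 2),
  morphem_id  = 1:3,
  gls.en      = c("I", "tomorrow", "we"),
  stringsAsFactors = FALSE
)
sentences <- data.frame(
  text_id     = c(1, 2),
  sentence_id = c(1, 2),
  gls.en      = c("I, tomorrow, next week", "we, today, young men"),
  stringsAsFactors = FALSE
)

# Join on both key columns; the remaining column shared by the two
# tables (gls.en) gets a suffix telling which unit it comes from.
morphemes_sentences <- merge(
  morphemes, sentences,
  by = c("text_id", "sentence_id"),
  suffixes = c(".morpheme", ".sentence")
)
morphemes_sentences
```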
Toolbox [https://software.sil.org/toolbox/] is widely used for producing interlinearized corpora. It uses a specific text-based format.
corpuspath <- system.file("exampleData", "tuwariToolbox.txt", package="interlineaR")
corpus <- read.toolbox(corpuspath)
As with read.emeld, the result is a list containing the slots ‘morphemes’, ‘sentences’, ‘words’ and ‘texts’. These slots are data frames, where each row describes an occurrence:
head(corpus$morphemes)
texts_ids | sentences_ids | triplet_ids | morphemes_id | mb | ge | ps |
---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | ta | we | Pr |
1 | 1 | 1 | 2 | samuel | Samuel | Npr |
1 | 1 | 1 | 3 | -we | -M.S | -sfx |
1 | 1 | 1 | 4 | m- | ?- | ?- |
1 | 1 | 1 | 5 | iasa | to_help | v |
1 | 1 | 1 | 6 | -ne | -Part | -mode |
As with read.emeld, optional fields can be declared. For instance, the Kakabe corpus (by Alexandra Vydrina) also contains morpheme glosses in Russian and French:
path <- system.file("exampleData", "kakabe.txt", package="interlineaR")
corpus <- read.toolbox(path, morpheme.fields.suppl = c("gr", "gf"))
head(corpus$morphemes)
texts_ids | sentences_ids | triplet_ids | morphemes_id | mb | ge | gr | gf | ps |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | mùsu | woman | женщина -ART т | femme | n |
1 | 1 | 1 | 2 | -È | -AR | от б | -AR | -mr |
1 | 1 | 1 | 3 | dóo | T one | ыть ма | T un | phn |
1 | 1 | 1 | 4 | bi | be | ниока -AR | être | cop |
1 | 1 | 1 | 5 | bàntará | manioc | T толочь -ART | manioc | n |
1 | 1 | 1 | 6 | -È | LOC |
The ‘sentences’ data frame contains a numeric id, the reference created for each sentence (“ref” field in Toolbox), plus (as with read.emeld) the original text, the free translation and the note (“tx”, “ft” and “nt” fields in Toolbox).
head(corpus$sentences)
texts_ids | sentences_ids | ft | ref | tx |
---|---|---|---|---|
1 | 1 | mùséè dóo bi bàntaráà tùgéè là | ||
1 | 2 | Músa kéle-la báti n na-kɔ̀ri | ||
1 | 3 | kín-na-ma t’ a ladíi | ||
1 | 4 | wálè bi dúfen-na sínàn dé | ||
1 | 5 | dende bi faljɛ-la karaɲɛ tɔ | ||
1 | 6 | kɛɛ syɔ́ɲɛ̀ bi fáa-nden |
The ‘texts’ data frame contains a numeric id and the title (Toolbox ID) of each text.
The LIFT XML vocabulary was introduced by SIL [http://code.google.com/p/lift-standard] and is used by SIL Fieldworks as an export format.
read.lift() produces a list of three data frames: “entries”, “senses” and “examples” (“relations” should be added). This set of tables is linked through IDs, as in a relational database. All the fields of the dictionary, in all languages, can be extracted. The arguments of read.lift() let you manually list the fields you are interested in for each data frame; you can also reduce the fields (columns) to those that have at least one non-empty value with simplify=TRUE.
dicpath <- system.file("exampleData", "tuwariDictionary.lift", package="interlineaR")
dictionary <- read.lift(dicpath, vernacular.languages="tww", simplify=TRUE)
Table of the lexical units:
head(dictionary$entries)
id_LIFT | id | lexical-unit.tww | morph-type | variant.form.tww | variant.morph-type |
---|---|---|---|---|---|
u_002794b9-f063-4c6b-b77d-39b8ecd618d1 | 1 | u | stem | NA | NA |
ia_00909ffb-7e90-4b76-b948-cae56094abc9 | 2 | ia | stem | NA | NA |
totolo_00b3f2c3-bd07-4282-9287-603f2285c720 | 3 | totolo | stem | totolu | stem |
ofa_00dba42e-1a59-40e8-848d-dbe5bc20012e | 4 | ofa | stem | NA | NA |
tia3_013a1198-262e-45c6-9cab-fb6168f4f223 | 5 | tia | stem | NA | NA |
waia3_015e37a5-77f2-4c45-9404-e90a36e48014 | 6 | waia | stem | NA | NA |
head(dictionary$senses)
id_LIFT | id | lexem_id | grammatical-info.value | gloss.en | usage-type | semantic-domain-ddp4 | grammatical-info.traits |
---|---|---|---|---|---|---|---|
582795c9-9350-4e3b-af34-b72e9b5c89aa | 1 | 1 | Noun | fire | | | Noun-infl-class:tano |
5da1286e-07b0-47f2-81ae-5633ca9c875c | 2 | 2 | Noun | talk | | 3.5 Communication | Noun-infl-class:he |
5da1286e-07b0-47f2-81ae-5633ca9c875d | 3 | 2 | Noun | vernacular_language | | 3.5 Communication | Noun-infl-class:he |
5da1286e-07b0-47f2-81ae-5633ca9c875d | 4 | 2 | Noun | word | | 3.5 Communication | Noun-infl-class:he |
ed8e2d65-f3da-4efe-92cc-f381d08f0c08 | 5 | 3 | Noun | island | | 1.2 World | Noun-infl-class:fo |
8c5651bd-27b4-4647-b46b-d7b209a2477f | 6 | 4 | Adverb | now | | 8.4 Time | |
head(dictionary$examples)
id | lexem_id | sense_id | example.form.tww |
---|---|---|---|
1 | 2 | 2 | exemple1.1 |
2 | 2 | 2 | exemple1.2 |
3 | 2 | 3 | exemple2.1 |
4 | 2 | 3 | exemple2.2 |
5 | 2 | 4 | exemple3.1 |
6 | 2 | 4 | exemple3.2 |
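The three tables can be combined in the same way as the corpus tables. The sketch below assumes, from the values shown above, that `sense_id` in the examples table refers to the `id` column of the senses table; the data frames are hand-built copies of a few rows, not the result of an actual read.lift() call.

```r
# A few rows copied by hand from the tables shown above.
senses <- data.frame(
  id       = 2:4,
  lexem_id = c(2, 2, 2),
  gloss.en = c("talk", "vernacular_language", "word"),
  stringsAsFactors = FALSE
)
examples <- data.frame(
  id               = 1:6,
  lexem_id         = 2,
  sense_id         = c(2, 2, 3, 3, 4, 4),
  example.form.tww = c("exemple1.1", "exemple1.2", "exemple2.1",
                       "exemple2.2", "exemple3.1", "exemple3.2"),
  stringsAsFactors = FALSE
)

# Attach the gloss of its sense to each example
# (assumption: examples$sense_id points to senses$id).
examples_senses <- merge(
  examples, senses[, c("id", "gloss.en")],
  by.x = "sense_id", by.y = "id"
)
examples_senses
```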
Some fields in LIFT may be repeated. For instance, several semantic domains can be expressed in the sense element:
<trait name ="semantic-domain-ddp4" value="1.3 Water"/>
<trait name ="semantic-domain-ddp4" value="6.7 Tool"/>
In that case, the values are concatenated, and the column “semantic-domain-ddp4” contains the value “1.3 Water,6.7 Tool”.
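If the individual values are needed again, the concatenated string can be split on the comma. This is a minimal base R sketch on made-up cell contents; it assumes that no individual value contains a comma itself.

```r
# Example cell contents for the "semantic-domain-ddp4" column
# (the first value is the one quoted above, the others are illustrative).
semantic_domain <- c("1.3 Water,6.7 Tool", "3.5 Communication", "")

# One character vector of domains per sense; an empty cell gives character(0).
domains <- strsplit(semantic_domain, ",", fixed = TRUE)
domains[[1]]

# Number of semantic domains recorded for each sense.
lengths(domains)
```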
Some other fields are both repeated and expressed as key-value pairs, reflecting categories created for a particular language. In the following chunk, “Noun-infl-class” and “Noun-infl-class2” are two categories created for the nouns of a given language:
<grammatical-info value="Noun">
<trait name="Noun-infl-class" value="fo"/>
<trait name="Noun-infl-class2" value="hei"/>
</grammatical-info>
In that case, the column “grammatical-info.traits” in the ‘senses’ data frame will contain: “Noun-infl-class:fo,Noun-infl-class2:hei”.
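Such a string can be turned back into named values with two successive splits. The following base R sketch assumes that neither the trait names nor the values contain a comma or a colon.

```r
# The concatenated key-value string quoted above.
traits <- "Noun-infl-class:fo,Noun-infl-class2:hei"

# First split on commas to get the pairs, then on colons to separate
# each trait name from its value.
pairs <- strsplit(strsplit(traits, ",", fixed = TRUE)[[1]], ":", fixed = TRUE)

# Named character vector: names are the trait names, elements the values.
parsed <- setNames(
  vapply(pairs, function(p) p[2], character(1)),
  vapply(pairs, function(p) p[1], character(1))
)
parsed[["Noun-infl-class2"]]
```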