Compilation of Bilingual Multi-Word Terminology Lists from Lexical Resources

Upload two sentence-aligned text files. The files must share the same base name and use two-letter language codes as extensions (e.g. test.en and test.sr). These files are later fed into GIZA++.

Upload Sentence-Aligned Corpus
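
The naming rule above can be checked before uploading. The following is a minimal sketch, not the app's own validation: the function names `check_corpus_pair` and `same_line_count` are hypothetical, and treating "sentence-aligned" as "same number of lines" is an assumption.

```python
from pathlib import Path


def check_corpus_pair(first, second):
    """Return the two language codes if the file names form a valid pair.

    Hypothetical helper: same base name, two-letter alphabetic extensions.
    """
    a, b = Path(first), Path(second)
    if a.stem != b.stem:
        raise ValueError(f"base names differ: {a.stem!r} vs {b.stem!r}")
    codes = (a.suffix.lstrip("."), b.suffix.lstrip("."))
    for code in codes:
        if len(code) != 2 or not code.isalpha():
            raise ValueError(f"extension {code!r} is not a two-letter code")
    if codes[0] == codes[1]:
        raise ValueError("both files use the same language code")
    return codes


def same_line_count(text_a, text_b):
    """Assumed alignment check: both texts have the same number of lines."""
    return len(text_a.splitlines()) == len(text_b.splitlines())
```

For example, `check_corpus_pair("test.en", "test.sr")` returns `("en", "sr")`, while `check_corpus_pair("a.en", "b.sr")` raises a `ValueError`.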

Upload a verticalized list of English terminology. Line format: term|extractor

Upload a List of English Terminology

Upload a verticalized list of Serbian terminology. Line format: term|frequency

Upload a List of Serbian Terminology
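
Both terminology lists use one entry per line with a `|` separator. A minimal parsing sketch follows. The function name `parse_term_list` is hypothetical, and two choices are assumptions rather than documented behaviour: the line is split on the last `|`, and the Serbian frequency is read as an integer.

```python
def parse_term_list(lines, numeric_second_field=False):
    """Parse 'term|value' lines into a dict mapping term to value.

    numeric_second_field=False suits the English list (term|extractor);
    numeric_second_field=True suits the Serbian list (term|frequency).
    Blank lines are skipped; a repeated term keeps its last value.
    """
    entries = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split on the last separator (assumption: the value has no '|').
        term, sep, value = line.rpartition("|")
        if not sep:
            raise ValueError(f"missing '|' separator in line: {line!r}")
        value = value.strip()
        entries[term.strip()] = int(value) if numeric_second_field else value
    return entries
```

The entries used in the usage below are made-up illustrations: `parse_term_list(["pretraživanje informacija|12"], numeric_second_field=True)` yields a dict mapping the term to the integer 12.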

StringL and StringS: loose and strict string matching; Token: matches sets of normalised tokens.

Pick Matching Method
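
The page does not define the three methods precisely, so the following is only an illustrative sketch under stated assumptions: StringS is taken to be exact equality after trimming, StringL to be case-insensitive equality with collapsed whitespace, and Token to be equality of lower-cased word sets. The app's actual normalisation may differ.

```python
import re


def normalise_tokens(term):
    """Assumed normalisation: lower-case word tokens, order and repeats ignored."""
    return frozenset(re.findall(r"\w+", term.lower()))


def match_strict(a, b):
    """Assumed StringS: exact equality after trimming outer whitespace."""
    return a.strip() == b.strip()


def match_loose(a, b):
    """Assumed StringL: ignore case and collapse runs of whitespace."""
    return " ".join(a.lower().split()) == " ".join(b.lower().split())


def match_token(a, b):
    """Assumed Token: compare the sets of normalised tokens."""
    return normalise_tokens(a) == normalise_tokens(b)


# Hypothetical lookup table keyed by the method names shown on the page.
MATCHERS = {"StringS": match_strict, "StringL": match_loose, "Token": match_token}
```

Under these assumptions, "data base" and "Data  Base" match with StringL but not with StringS, and "information retrieval" matches "Retrieval, information" only with Token.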

Run GIZA++ on aligned sentences.

Run GIZA++ on Bilingual Corpus

Download Result

Discard chunks from the previous step that are certainly not from the desired domain, by checking them against the English dictionary (i.e. the list of English MWUs). This is a very rough, bag-of-words based elimination.

Discard Bad Candidates (out-of-vocabulary) from GIZA++ Output
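
One way to picture this rough elimination is sketched below. It assumes that a chunk is kept only if every one of its words occurs somewhere in the English MWU list. That rule, the whitespace tokenisation, and the names `build_vocabulary` and `rough_filter` are illustrative assumptions, not the app's documented logic.

```python
def build_vocabulary(terms):
    """Collect every lower-cased word that occurs in any English term."""
    return {token for term in terms for token in term.lower().split()}


def rough_filter(chunks, vocabulary):
    """Keep chunks whose words are all in the vocabulary (assumed rule).

    Word order is ignored, so a kept chunk need not be a real term.
    """
    kept = []
    for chunk in chunks:
        tokens = chunk.lower().split()
        if tokens and all(token in vocabulary for token in tokens):
            kept.append(chunk)
    return kept
```

With the made-up terms "information retrieval" and "retrieval system", the chunk "retrieval information" survives even though it is not a term, which is why this step only removes out-of-vocabulary candidates and a finer elimination follows later.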

Download Result

Perform spaCy lemmatization on the English chunks and Unitex lemmatization on the Serbian chunks.

Perform Lemmatization on Serbian GIZA++ Chunks

Download Result

After performing fine elimination (by intersecting with the English dictionary), obtain the other results.

Obtain Results

Keep Only Candidates Present in the English Terminology List

Retrieve Intersection with the List of Serbian Extracted MWUs
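
The last two steps can be sketched as two successive filters over lemmatized candidate pairs. This is a simplified illustration: it assumes candidates are (English, Serbian) string pairs and uses exact set membership, whereas the app presumably applies the matching method picked earlier. The function names are hypothetical.

```python
def keep_known_english(pairs, english_terms):
    """Fine elimination: keep pairs whose English side is in the term list."""
    english = set(english_terms)
    return [(en, sr) for en, sr in pairs if en in english]


def intersect_with_serbian(pairs, serbian_terms):
    """Keep pairs whose Serbian side is also among the extracted Serbian MWUs."""
    serbian = set(serbian_terms)
    return [(en, sr) for en, sr in pairs if sr in serbian]
```

Applying `keep_known_english` first and `intersect_with_serbian` second gives the pairs confirmed on both sides; the pairs that pass only the first filter are candidates whose Serbian side was not found by the Serbian extractor.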