Compilation of Bilingual Multi-Word Terminology Lists from Lexical Resources

Upload two sentence-aligned text files. Both files should share the same base name, with the two-letter language code as the extension (e.g. test.en and test.sr). These files are later fed into GIZA++.
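The naming rule above can be checked before upload. The following is only a minimal sketch: the function names are hypothetical and not part of this web app, and the equal-line-count check is an assumption about what "sentence-aligned" means here (line i of one file corresponds to line i of the other).

```python
import re
from pathlib import Path


def check_corpus_pair(path_a: str, path_b: str) -> tuple[str, str]:
    """Return the two language codes if the file names follow the rule above.

    Hypothetical helper: same base name, two-letter lowercase extensions.
    """
    a, b = Path(path_a), Path(path_b)
    if a.stem != b.stem:
        raise ValueError("files must share the same base name")
    codes = []
    for p in (a, b):
        code = p.suffix.lstrip(".")
        if not re.fullmatch(r"[a-z]{2}", code):
            raise ValueError(f"{p.name}: extension must be a two-letter language code")
        codes.append(code)
    if codes[0] == codes[1]:
        raise ValueError("the two files must use different language codes")
    return codes[0], codes[1]


def same_line_count(lines_a: list[str], lines_b: list[str]) -> bool:
    """Assumed alignment check: both sides have the same number of lines."""
    return len(lines_a) == len(lines_b)
```

For example, `check_corpus_pair("test.en", "test.sr")` accepts the pair, while `check_corpus_pair("test.en", "other.sr")` raises a `ValueError`.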

Upload Sentence-Aligned Corpus


Or select existing:


Or select existing:

Upload a verticalized list of English terminology. Line format: term|extractor

Upload a List of English Terminology


Or select existing:

Upload a verticalized list of Serbian terminology. Line format: term|frequency
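Both terminology lists use one entry per line with a `|` separator (term|extractor for English, term|frequency for Serbian). As an illustration only, a reader for these lists might look like the sketch below. The function name and the `numeric_second_field` flag are hypothetical; splitting on the last `|` and skipping blank lines are assumptions, since the page does not say how such cases are handled.

```python
def parse_term_lines(lines, numeric_second_field=False):
    """Parse 'term|value' lines into (term, value) tuples.

    numeric_second_field=True converts the value to int, which suits the
    Serbian 'term|frequency' format; leave it False for 'term|extractor'.
    """
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        # Split on the LAST '|' so a term that itself contains '|' survives.
        term, sep, second = line.rpartition("|")
        if not sep:
            raise ValueError(f"missing '|' separator in line: {line!r}")
        term, second = term.strip(), second.strip()
        entries.append((term, int(second) if numeric_second_field else second))
    return entries
```

Usage: `parse_term_lines(["machine translation|extractorA"])` yields one `(term, extractor)` tuple, and `parse_term_lines(["mašinsko prevođenje|12"], numeric_second_field=True)` yields the frequency as an integer. The extractor name here is a made-up placeholder.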

Upload a List of Serbian Terminology


Or select existing:

String [Loose] and String [Strict]: loose and strict string matching; Token: matches sets of normalised tokens.
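The page does not define the three methods precisely, so the sketch below is one possible reading, not the app's actual implementation. It assumes that token normalisation means lowercasing and dropping punctuation, that strict matching is exact string equality, and that loose matching ignores case and extra whitespace. All function names are hypothetical.

```python
import re


def normalise_tokens(term: str) -> frozenset:
    """Assumed normalisation: lowercase word tokens, punctuation dropped."""
    return frozenset(re.findall(r"\w+", term.lower()))


def match_token(a: str, b: str) -> bool:
    """Token method: the two terms have the same SET of normalised tokens."""
    return normalise_tokens(a) == normalise_tokens(b)


def match_string_strict(a: str, b: str) -> bool:
    """Strict string method (assumed): exact equality."""
    return a == b


def match_string_loose(a: str, b: str) -> bool:
    """Loose string method (assumed): ignore case and repeated whitespace."""
    return " ".join(a.lower().split()) == " ".join(b.lower().split())
```

Under these assumptions, set-based token matching ignores word order, so "translation machine" would match "machine translation" with the Token method but not with either string method.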

Pick Matching Method

Token String [Strict] String [Loose]

Run GIZA++ on aligned sentences.

Run GIZA++ on Bilingual Corpus

Discard chunks from the previous step that are certainly not from the desired domain, by checking them against the English dictionary (i.e. the list of English MWUs). This is a very rough, bag-of-words based elimination.
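A rough elimination of this kind could be sketched as follows. This is an assumption about how the step works, not the app's code: the vocabulary is taken to be every lowercased, whitespace-separated token of the English MWU list, and a chunk is kept only if all of its tokens are in that vocabulary. Function names are hypothetical.

```python
def build_vocabulary(english_terms):
    """Collect every lowercased token that occurs in the English MWU list."""
    vocab = set()
    for term in english_terms:
        vocab.update(term.lower().split())
    return vocab


def discard_out_of_vocabulary(chunks, vocab):
    """Keep a chunk only if it is non-empty and every token is in the vocabulary."""
    kept = []
    for chunk in chunks:
        tokens = chunk.lower().split()
        if tokens and all(tok in vocab for tok in tokens):
            kept.append(chunk)
    return kept
```

The filter is deliberately coarse. With the terms "machine translation" and "language model", the chunk "machine model" passes because each token is known, even though it is not a listed term. Such chunks are left for the later, finer intersection step.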

Discard Bad Candidates (out-of-vocabulary) from GIZA++ Output

Perform spaCy lemmatization on English chunks and Unitex lemmatization on Serbian chunks.

Perform Lemmatization on Serbian GIZA++ Chunks

After performing fine elimination (by intersecting with the English dictionary), obtain the remaining results.

Obtain Results

Keep Only Candidates Present in the English Terminology List
Download Result
Retrieve Intersection with the List of Serbian Extracted MWUs
Download Result
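The two result options above amount to set intersections. As a hedged sketch, assume the candidates are (English chunk, Serbian chunk) pairs of lemmatized strings and that membership is tested by case-insensitive exact match. Neither assumption is confirmed by the page, and the function names are hypothetical.

```python
def keep_known_english(pairs, english_terms):
    """Keep pairs whose English side appears in the English terminology list."""
    known = {t.lower() for t in english_terms}
    return [(en, sr) for en, sr in pairs if en.lower() in known]


def intersect_with_serbian(pairs, serbian_terms):
    """Keep pairs whose Serbian side appears in the Serbian extracted MWU list."""
    known = {t.lower() for t in serbian_terms}
    return [(en, sr) for en, sr in pairs if sr.lower() in known]
```

Applying `keep_known_english` first and `intersect_with_serbian` second leaves only pairs attested on both sides, which would be the strictest of the downloadable results under these assumptions.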