Why terminology extraction?


There is no doubt that terminology plays an important role in many different fields such as translation, standardisation, technical documentation and localisation.

Subject fields such as different sectors of law and industry all have significant amounts of field-specific terminology. In addition, many document initiators might use their own preferred terminology. Researching the specific terms needed to complete any given translation is a time-consuming task.

However, an initial terminology extraction using term extraction tools can save considerable time. Nevertheless, although such tools facilitate extraction, the resulting list of candidate terms must still be verified by a human terminologist or translator. The process of term extraction is therefore computer-aided rather than fully automatic.

Term extraction can be defined as the operation of identifying term candidates in a given text.

It can either be monolingual or multilingual (usually bilingual). Monolingual term extraction attempts to analyse a text or corpus in order to identify candidate terms, while multilingual term extraction analyses existing source texts along with their translations in an attempt to identify potential terms and their equivalents.

Term extraction generally involves four steps: the compilation of a corpus, the extraction of term candidates, the validation of the term candidates and the automatic or semi-automatic creation of terminological records.

The preparation of term extraction projects requires substantial human intervention: the corpus for the extraction has to be prepared, the software has to be set up, and word lists have to be imported and extraction rules created.

There are three main term extraction approaches usually implemented in terminology management: linguistic, statistical, or hybrid.


Term extraction tools using a linguistic approach typically attempt to identify word combinations that match certain morphological or syntactic patterns (e.g. “adjective+noun” or “noun+noun”). For this purpose, parsers, part-of-speech taggers and morphological analysers are used to annotate the content of the corpus. Term candidates are then filtered using different pattern-matching techniques. The linguistic approach is heavily language-dependent because term formation patterns differ from language to language. Consequently, term extraction tools that use a linguistic approach are generally designed to work in a single language (or closely related languages) and cannot easily be extended to other languages. This also makes them poorly suited for integration into TM systems, which are usually language-independent.
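As a minimal sketch of the pattern-matching idea, the following Python snippet filters the bigrams of a tagged token sequence against “adjective+noun” and “noun+noun” patterns. The tiny tag lookup stands in for a real part-of-speech tagger and is purely illustrative:

```python
# Hypothetical tag lexicon; a real tool would run a POS tagger or parser.
POS = {
    "neural": "ADJ", "machine": "NOUN", "translation": "NOUN",
    "is": "VERB", "a": "DET", "recent": "ADJ", "approach": "NOUN",
    "to": "ADP", "automatic": "ADJ",
}

# Term formation patterns to keep, e.g. ADJ+NOUN and NOUN+NOUN.
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}

def extract_candidates(tokens):
    """Return bigrams whose POS sequence matches a term formation pattern."""
    tags = [POS.get(t, "X") for t in tokens]  # "X" = unknown tag
    candidates = []
    for i in range(len(tokens) - 1):
        if (tags[i], tags[i + 1]) in PATTERNS:
            candidates.append(f"{tokens[i]} {tokens[i + 1]}")
    return candidates

tokens = "neural machine translation is a recent approach to automatic translation".split()
print(extract_candidates(tokens))
```

A real tool would support longer, language-specific patterns; the pattern inventory itself is precisely what ties such tools to a particular language.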


Term extraction tools using a statistical approach basically look for repeated sequences of lexical items. Often the frequency threshold, which refers to the number of times that a word or a sequence of words must be repeated to be considered a candidate term, can be specified by the user. The major strength of the statistical approach is its language-independence.
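The core of the statistical approach can be sketched in a few lines of Python: count every n-gram in the corpus and keep those that reach a user-specified frequency threshold (the corpus and threshold here are purely illustrative):

```python
from collections import Counter

def ngram_candidates(tokens, n=2, threshold=2):
    """Count n-grams and keep those occurring at least `threshold` times."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {" ".join(g): c for g, c in counts.items() if c >= threshold}

text = ("the translation memory stores segments and the translation memory "
        "retrieves segments the translation memory is updated")
print(ngram_candidates(text.split(), n=2, threshold=3))
```

Note that the output also contains “the translation”, which is not a term: frequency alone cannot distinguish terms from other recurrent word sequences, while raising the threshold risks missing valid low-frequency terms.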


Although the linguistically-based approach yields better delimited term candidates, it tends to produce too much “noise” (i.e. non-terms, ordinary expressions). On the other hand, a purely statistical approach carries a much higher risk of producing “silence” (i.e. missing candidates that occur with low frequency).

That is why the most common approach in term extraction is the hybrid one, using both statistical and linguistic information. Even though such approaches are mainly statistical, syntactic rules and filters are incorporated so that candidate terms with certain syntactic structures can be picked out.
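A hybrid extractor can be sketched by chaining the two ideas: a frequency filter first, then a syntactic pattern filter on the survivors. As before, the tag lookup and corpus are purely illustrative:

```python
from collections import Counter

# Hypothetical tag lexicon standing in for a real part-of-speech tagger.
POS = {"translation": "NOUN", "memory": "NOUN", "the": "DET",
       "stores": "VERB", "segments": "NOUN", "and": "CONJ"}
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}

def hybrid_candidates(tokens, threshold=2):
    """Keep frequent bigrams that also match a term formation pattern."""
    counts = Counter(tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1))
    return {
        " ".join(g): c for g, c in counts.items()
        if c >= threshold
        and (POS.get(g[0], "X"), POS.get(g[1], "X")) in PATTERNS
    }

tokens = "the translation memory stores segments and the translation memory stores segments".split()
print(hybrid_candidates(tokens))
```

Here the pattern filter discards frequent but non-term-like sequences such as “the translation”, keeping only “translation memory”; this is how linguistic filters reduce the noise of a purely statistical count.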

Besides accuracy in selecting term candidates, other important evaluation criteria for terminology extraction tools are the supported file formats and languages. Not all extraction tools support every format in which texts are available.

Language support is also a major issue. For Western European languages such as English, German or French it is relatively easy to find a good linguistic or hybrid extraction tool. For Eastern European or Asian languages, however, the selection of such tools is very limited.

Terminology extraction tools

Different users, companies and institutions mean different challenges for term extraction projects and different expectations and needs. That is why there is no single “best tool” for term extraction. Every user should run tests before choosing an extraction tool for their projects.

There are many commercial terminology extractors, for example SDL TermExtract, SDL Phrase Finder or Synchroterm.

Some free translation memory systems, such as Similis or Across Personal Edition, also offer excellent built-in automatic terminology extraction.

You can also find free term extraction tools such as TermMine, AntConc or fivefilters.

Most terminology extraction tools provide lists of term candidates that can be validated directly or exported, e.g. as *.txt or *.csv files, for external validation.
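A minimal sketch of such an export, writing candidates and their frequencies as CSV (the column names are an assumption, not a fixed standard):

```python
import csv
import io

# Example candidate list with frequencies (illustrative data).
candidates = {"translation memory": 5, "term extraction": 3}

# Write the candidates to CSV, most frequent first, for external validation.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["candidate", "frequency"])  # hypothetical column names
for term, freq in sorted(candidates.items(), key=lambda kv: -kv[1]):
    writer.writerow([term, freq])

print(buf.getvalue())
```

In practice `buf` would be an open file, and the validated rows could later be re-imported into a termbase.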

Although a number of terminology extraction tools are available, not all of them meet the real needs of translators, interpreters or terminologists. These user groups expect tools that deliver properly delimited term candidates, term recognition and term variant recognition, properties that would make the term validation process less time-consuming and terminology work more effective.