This week we would like to present the master's thesis of Hans Friedrich Witschel.
Its original title reads “Text, Wörter, Morpheme — Möglichkeiten einer automatischen Terminologie-Extraktion”.
Professor Witschel is currently a lecturer at the University of Applied Sciences of North-Switzerland. His thesis won him the GSCL Prize 2005, awarded by the prestigious German Society for Computational Linguistics. The complete summary, translated into English, follows:
This paper deals with a subfield of Text Mining: it seeks to extract information (in this case technical terminology) from natural language text. The thesis argues that in many areas of Text Mining, combining different methods can help to cope with the richness of natural language.
The methods used for terminology extraction are statistical and linguistic (or pattern-based) in nature. To derive them, the characteristics of technical terms that are relevant for their extraction were worked out. For instance, the fact that many technical terms are noun phrases of a certain form could be used directly to search for certain part-of-speech (POS) patterns, while the distribution of terms in technical texts led to a statistical approach (differential analysis). Together with some others, these approaches have been integrated into a procedure that is able to learn from the user’s feedback and to refine the terminology search over several steps.
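As a rough illustration of the two ideas just mentioned, the following Python sketch is one possible reading and not the procedure implemented in the thesis. It assumes text that has already been POS-tagged with the hypothetical tag names `ADJ` and `NOUN`, uses the pattern "any number of adjectives followed by one or more nouns" as a stand-in for the noun-phrase patterns, and scores words by the ratio of their smoothed relative frequency in the technical text to that in a general reference corpus as a stand-in for differential analysis. The function names and the smoothing parameter are likewise assumptions made for this example.

```python
import re
from collections import Counter


def pos_pattern_candidates(tagged):
    """Return phrases whose tags match ADJ* NOUN+ (an assumed pattern).

    `tagged` is a list of (word, tag) pairs from any POS tagger.
    """
    # Encode each tag as one character so a regex can scan the sequence.
    code = "".join({"ADJ": "A", "NOUN": "N"}.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word.lower() for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"A*N+", code)
    ]


def differential_scores(domain_tokens, reference_tokens, smoothing=1.0):
    """Score each domain word by how much more frequent it is in the
    domain text than in the reference corpus (add-k smoothed ratio)."""
    dom = Counter(t.lower() for t in domain_tokens)
    ref = Counter(t.lower() for t in reference_tokens)
    dom_total = sum(dom.values())
    ref_total = sum(ref.values())
    vocab = len(set(dom) | set(ref))
    scores = {}
    for word, count in dom.items():
        p_dom = (count + smoothing) / (dom_total + smoothing * vocab)
        p_ref = (ref[word] + smoothing) / (ref_total + smoothing * vocab)
        scores[word] = p_dom / p_ref
    return scores


# Toy data, invented for this example only.
tagged = [
    ("The", "DET"), ("statistical", "ADJ"), ("language", "NOUN"),
    ("model", "NOUN"), ("uses", "VERB"), ("a", "DET"),
    ("large", "ADJ"), ("corpus", "NOUN"),
]
candidates = pos_pattern_candidates(tagged)

domain = "the parser builds the parse tree the parser".split()
reference = "the cat sat on the mat and the dog".split()
scores = differential_scores(domain, reference)
```

On this toy input the pattern search yields "statistical language model" and "large corpus" as candidates, and "parser" receives a higher score than "the", since it is frequent in the technical text but absent from the reference corpus. A combined procedure of the kind described could, for example, keep only those pattern candidates whose words score above a user-adjustable threshold.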
Several parameters of the procedure have been left variable, so that users can adapt them to their needs. An evaluation on two technical texts from different domains showed that, although the different methods can indeed be integrated well, the optimal values of the adjustable parameters, and even the choice of methods to apply, still depend on both the text and the domain.
This also shows the limitations of the presented approach, and of many Text Mining methods in general: even when several procedures are combined, the multifaceted nature of language makes it impossible to create a system that works equally well for all texts.
Whether this can be tackled by “domain recognition” and subsequent dynamic adjustment of the parameters could not be answered in this thesis and should be the subject of further research.
Dr. Witschel’s complete thesis is available here.
If you are interested in other theses and papers on terminology and linguistics, check out our Theses and Papers section.
Introduced and translated from German by Cosimo Palma, Communication Trainee at the Terminology Coordination Unit of the European Parliament (Luxembourg).