Overview
The field of Natural Language Processing has seen several groundbreaking advances over the past two to three years, with new approaches and Machine Learning models of steadily increasing performance being proposed. During training, these models typically process very large amounts of unstructured text data to learn the characteristics of a language, its words and its sentences. With modern deep learning models, this initial training phase is computationally very expensive, but it has to be executed only once. For specific applications, the pre-trained base model can then be fine-tuned on a much smaller corpus to instill domain-specific knowledge. This approach, called transfer learning, is generally regarded as one of the major methodological achievements that make deep learning technology far more accessible for industry projects.

Nowadays, most researchers release the source code of their models as well as pre-trained base models, both for reproducibility and to facilitate further research. However, most state-of-the-art research is conducted on English text, and consequently the released pre-trained models are only useful for tasks that deal with English text.

An efficient toolchain allows us both to quickly test and apply state-of-the-art NLP models tailored to specific domains in our industry projects and to compute pre-trained German models. Some applications require fine-tuning such a model on domain-specific text such as job posts, CVs or social welfare reports. Additionally, once we have computed high-quality German models, we plan to release them to the public to enable more research.
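To make the fine-tuning step concrete, the following is a minimal sketch of domain-adaptive fine-tuning using the Hugging Face transformers and datasets libraries. It assumes a plain-text domain corpus with one document per line; the model name and file path are illustrative placeholders, not our exact setup.

```python
# Minimal sketch: fine-tune a pre-trained German model on a small
# domain-specific corpus with the masked-language-modeling objective.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder: any pre-trained German base model would work here.
model_name = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load the domain corpus (e.g. job posts), one document per line.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of the tokens; the model learns to predict them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Because only the comparatively cheap fine-tuning pass runs on the small domain corpus, the expensive pre-training on large general-language text is reused as-is, which is the main practical benefit of transfer learning described above.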