Overview
The field of Natural Language Processing has seen several groundbreaking advances over the past two to three years, with new approaches and Machine Learning models of steadily increasing performance being proposed. During training, these models typically process very large amounts of unstructured text data to learn the characteristics of a language, its words and its sentences. With modern deep learning models, this initial training phase is computationally very expensive, but it has to be executed only once. For specific applications, the pre-trained base model can then be fine-tuned on a much smaller corpus to instill domain-specific knowledge. This approach, called transfer learning, is generally regarded as one of the major methodological achievements that make deep learning technology far more accessible for industry projects.

Nowadays, most researchers release the source code of their models as well as pre-trained base models, both for reproducibility and to facilitate further research. However, most state-of-the-art research is conducted on English text, and consequently the released pre-trained models are only useful for tasks that deal with English text.

An efficient toolchain allows us both to quickly test and apply state-of-the-art NLP models tailored to specific domains in our industry projects and to compute pre-trained German models. Some applications require fine-tuning such a model on domain-specific text such as job posts, CVs or social welfare reports. Additionally, once we have computed high-quality German models, we plan to release them to the public to enable more research.
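To make the fine-tuning step concrete, the following is a minimal sketch of domain-adaptive fine-tuning using the Hugging Face transformers and datasets libraries. It assumes a plain-text domain corpus with one document per line; the model name and file path are illustrative placeholders, not our exact setup.

```python
# Minimal sketch: fine-tune a pre-trained German model on a small
# domain-specific corpus with the masked-language-modeling objective.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder: any pre-trained German base model would work here.
model_name = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load the domain corpus (e.g. job posts), one document per line.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of the tokens; the model learns to predict them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Because only the comparatively cheap fine-tuning pass runs on the small domain corpus, the expensive pre-training on large general-language text is reused as-is, which is the main practical benefit of transfer learning described above.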