Extracting information from unstructured text data in the Finnish patient data repository with machine learning methods (Aioli)



Unit at THL: Data and analytics

In this project, we explore the potential of machine learning methods for extracting information from patient record texts that is useful for monitoring, statistical reporting, and research. The research use of textual health data has been limited, because large-scale population studies require the ability to analyse large amounts of data automatically.

Data collected in patient care situations has long been an important source of information for research, for example on the occurrence of diseases based on recorded diagnoses, or on the use of medicines based on medicine registers. The data used in these tasks has generally been in an easily accessible structured format, but for many important topics the structured data is incomplete.


The aim is to transform free-text patient records of the Patient Data Repository (PTA) into a uniform structured format by utilising modern data processing and machine learning methods.

In this pilot project, patient records are used to identify height, weight, and blood pressure data, as well as information about the patient’s smoking status. The broader objective of the project is to develop competence in the processing of textual information and to strengthen the related infrastructure.


In the early stages, we will restrict the data to patients included in the diabetes registry. For these patients, we will identify free-text entries on height, weight, blood pressure and smoking status in the care documents of the Patient Data Repository (PTA). Structured entries may also be used for comparison. The documents are pseudonymised before processing.

A small part of the data will be annotated, which means that project researchers read the texts and identify information relevant to the project. Annotated data is used to train machine learning algorithms.

Several approaches will be used to classify data automatically, such as rule-based methods, supervised learning, and generative language models. We will compare the different approaches by their efficiency and qualitative performance.
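
As an illustration of what the rule-based approach could look like, the following is a minimal sketch that uses regular expressions to pick height, weight, and blood pressure values out of a free-text entry. The patterns, the keywords (Finnish "pituus" and "paino", and the abbreviation "RR" for blood pressure), the function names, and the example sentence are all assumptions made for this illustration. They do not describe the project's actual rules, which would need to cover far more variation in how values are recorded.

```python
import re

# Hypothetical patterns for illustration only; real patient texts vary widely.
NUMBER = r"(\d{2,3}(?:[.,]\d)?)"
PATTERNS = {
    "height_cm": re.compile(
        r"\b(?:pituus|height)\s*:?\s*" + NUMBER + r"\s*cm\b", re.IGNORECASE
    ),
    "weight_kg": re.compile(
        r"\b(?:paino|weight)\s*:?\s*" + NUMBER + r"\s*kg\b", re.IGNORECASE
    ),
    "blood_pressure": re.compile(
        r"\b(?:RR|BP)\s*:?\s*(\d{2,3})\s*/\s*(\d{2,3})\b", re.IGNORECASE
    ),
}


def to_number(raw: str) -> float:
    """Convert a number written with a decimal comma or point to float."""
    return float(raw.replace(",", "."))


def extract_measurements(text: str) -> dict:
    """Return the first height, weight and blood pressure found in text."""
    result = {}
    for field in ("height_cm", "weight_kg"):
        match = PATTERNS[field].search(text)
        if match:
            result[field] = to_number(match.group(1))
    match = PATTERNS["blood_pressure"].search(text)
    if match:
        result["systolic"] = int(match.group(1))
        result["diastolic"] = int(match.group(2))
    return result


# Invented example sentence, not real patient data.
example = "Pituus 172 cm, paino 80,5 kg. RR 135/85."
measurements = extract_measurements(example)
```

Rules of this kind are transparent and need no training data, but they miss phrasings that were not anticipated, which is one reason to compare them with supervised learning and generative language models.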

All data processing takes place in secure computing environments at THL, and the language models used are open-source models that are installed in THL’s environment. The algorithms developed in the project will not be published without assessing data protection risks that may be associated with them.

We will measure the impact of the study by assessing how reliably information can be extracted from the free-text documents. In addition, we will assess how much the extracted information improves the coverage of otherwise available data. The uniformity of recording between different patient information systems will also be assessed.
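
One possible way to quantify reliability and coverage is sketched below. It compares extracted values against annotated values using precision and recall, and computes the share of patients with a known value before and after adding the extracted data. The choice of metrics, the function names, and the toy inputs are assumptions for illustration, not the project's stated evaluation protocol.

```python
def precision_recall(gold: dict, predicted: dict) -> tuple:
    """Compare extracted values with annotated (gold) values per document.

    Both arguments map a document id to a value, or to None when the
    document contains (gold) or yielded (predicted) no value.
    """
    true_positives = sum(
        1 for doc_id, value in predicted.items()
        if value is not None and gold.get(doc_id) == value
    )
    n_predicted = sum(1 for value in predicted.values() if value is not None)
    n_gold = sum(1 for value in gold.values() if value is not None)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    return precision, recall


def coverage_before_after(structured_ids: set, extracted_ids: set,
                          all_ids: set) -> tuple:
    """Share of patients with a value from structured data alone, and
    from structured data combined with values extracted from free text."""
    before = len(structured_ids & all_ids) / len(all_ids)
    after = len((structured_ids | extracted_ids) & all_ids) / len(all_ids)
    return before, after


# Invented toy data: annotated and extracted heights for four documents.
gold = {"a": 172, "b": 180, "c": None, "d": 165}
predicted = {"a": 172, "b": 181, "c": None, "d": None}
precision, recall = precision_recall(gold, predicted)

# Invented toy data: patient ids with a structured or an extracted value.
before, after = coverage_before_after({1, 2}, {2, 3}, {1, 2, 3, 4})
```

In this toy example, one of the two extracted values matches the annotation and one of the three annotated values is missed entirely, while the extracted data raises coverage from two to three of the four patients.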

When publishing the results, care will be taken to ensure that all published information is anonymous and individual persons cannot be identified.

Contact information

Petteri Hovi
Research Manager
tel. +358 29 524 8941
E-mail: [email protected] 

Tuomo Nieminen
tel. +358 29 524 7534
E-mail: [email protected] 

Jokke Häsä
Data Scientist
tel. +358 29 524 8187
E-mail: [email protected] 

Mika Pihlajamäki
Development Manager
tel. +358 29 524 7733
E-mail: [email protected]