Weak Supervision and Clustering-Based Sample Selection for Clinical Named Entity Recognition

Wei Sun, Shaoxiong Ji, Tuulia Denti, Hans Moen, Oleg Kerro, Antti Rannikko, Pekka Marttinen, Miika Koskinen

Research output: Chapter in book/report/conference proceeding; Conference contribution; Scientific; Peer-reviewed

Abstract

One of the central tasks of medical text analysis is to extract and structure meaningful information from plain-text clinical documents. Named Entity Recognition (NER) is a sub-task of information extraction that involves identifying predefined entities from unstructured free text. Notably, NER models require large amounts of human-labeled data to train, but human annotation is costly and laborious and often requires medical training. Here, we aim to overcome the shortage of manually annotated data by introducing a training scheme for NER models that uses an existing medical ontology to assign weak labels to entities and provides enhanced domain-specific model adaptation with in-domain continual pretraining. Due to limited human annotation resources, we develop a specific module to collect a more representative test dataset from the data lake than a random selection. To validate our framework, we invite clinicians to annotate the test set. In this way, we construct two Finnish medical NER datasets based on clinical records retrieved from a hospital’s data lake and evaluate the effectiveness of the proposed methods. The code is available at https://github.com/VRCMF/HAM-net.git.
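
As an illustration of what assigning weak labels from an ontology can look like, the sketch below tags tokens with BIO labels by greedy longest-match lookup in a term lexicon. This is a minimal sketch under stated assumptions, not the authors' implementation: the function `weak_label`, the toy `LEXICON`, the English example sentence, and the label names `SYMPTOM` and `DRUG` are all hypothetical and do not come from the paper or its repository.

```python
from typing import Dict, List, Tuple


def weak_label(tokens: List[str], lexicon: Dict[Tuple[str, ...], str]) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup in a term lexicon.

    `lexicon` maps lowercased token tuples to an entity type. In a real
    setting the entries would be derived from a medical ontology; here
    they are hypothetical toy values.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max((len(term) for term in lexicon), default=0)

    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = lexicon.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical toy lexicon standing in for ontology-derived terms.
LEXICON = {("chest", "pain"): "SYMPTOM", ("aspirin",): "DRUG"}

tokens = "Patient reports chest pain , given aspirin".split()
tags = weak_label(tokens, LEXICON)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Labels produced this way are noisy (exact string matching misses inflected or misspelled terms, which matters for a morphologically rich language such as Finnish), which is why such labels are called weak and why a clinician-annotated test set is still needed for evaluation.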
Original language: English
Title of host publication: Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track. ECML PKDD 2023
Editors: Gianmarco De Francisci Morales, Claudia Perlich, Natali Ruchansky, Nicolas Kourtellis, Elena Baralis, Francesco Bonchi
Number of pages: 16
Place of publication: Cham
Publisher: Springer Nature Switzerland
Publication date: 2023
Pages: 444-459
ISBN (print): 978-3-031-43426-6
ISBN (electronic): 978-3-031-43427-3
DOI
Status: Published - 2023
MoE publication type: A4 Article in a conference publication
Event: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases - Turin, Italy
Duration: 18 Sep 2023 to 22 Sep 2023
https://2023.ecmlpkdd.org/

Publication series

Name: Lecture Notes in Artificial Intelligence
Publisher: Springer Nature
Volume: 14174
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Fields of science

  • 3121 General medicine, internal medicine and other clinical medicine
  • 113 Computer and information sciences
