Weak Supervision and Clustering-Based Sample Selection for Clinical Named Entity Recognition

Wei Sun, Shaoxiong Ji, Tuulia Denti, Hans Moen, Oleg Kerro, Antti Rannikko, Pekka Marttinen, Miika Koskinen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

One of the central tasks of medical text analysis is to extract and structure meaningful information from plain-text clinical documents. Named Entity Recognition (NER) is a sub-task of information extraction that involves identifying predefined entities from unstructured free text. Notably, NER models require large amounts of human-labeled data to train, but human annotation is costly and laborious and often requires medical training. Here, we aim to overcome the shortage of manually annotated data by introducing a training scheme for NER models that uses an existing medical ontology to assign weak labels to entities and provides enhanced domain-specific model adaptation with in-domain continual pretraining. Due to limited human annotation resources, we develop a specific module to collect a more representative test dataset from the data lake than a random selection. To validate our framework, we invite clinicians to annotate the test set. In this way, we construct two Finnish medical NER datasets based on clinical records retrieved from a hospital’s data lake and evaluate the effectiveness of the proposed methods. The code is available at https://github.com/VRCMF/HAM-net.git.
Original language: English
Title of host publication: Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track. ECML PKDD 2023
Editors: Gianmarco De Francisci Morales, Claudia Perlich, Natali Ruchansky, Nicolas Kourtellis, Elena Baralis, Francesco Bonchi
Number of pages: 16
Place of Publication: Cham
Publisher: Springer Nature Switzerland
Publication date: 2023
Pages: 444-459
ISBN (Print): 978-3-031-43426-6
ISBN (Electronic): 978-3-031-43427-3
Publication status: Published - 2023
MoE publication type: A4 Article in conference proceedings
Event: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases - Turin, Italy
Duration: 18 Sept 2023 to 22 Sept 2023
https://2023.ecmlpkdd.org/

Publication series

Name: Lecture Notes in Artificial Intelligence
Publisher: Springer Nature
Volume: 14174
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Fields of Science

  • 3121 General medicine, internal medicine and other clinical medicine
  • 113 Computer and information sciences
