Grandma Karl is 27 years old: automatic pseudonymization of research data

Project: Research project

Project details

Description (abstract)

Accessibility of research data is critical for advances in many research fields, but textual data often cannot
be shared because it contains personal and sensitive information, e.g. names and political opinions. The GDPR
suggests pseudonymization as a solution, but we need to learn more about it before adopting it for processing
research data. This research environment targets several aspects of pseudonymization, aiming to advance Sweden's
work on open access to research data:
1. algorithms to automatically detect, label and pseudonymize personal identifiers in freely written
texts (essays/blogs), focusing on linguistic challenges such as spelling errors, ambiguous entities,
semantic constraints, etc.
2. analysis of the type and number of personal identifiers versus acceptable protection, followed by re-identification
tests to ensure that pseudonymization is effective
3. analysis of the effects of pseudonymization on research data, e.g. on the readability of the resulting
texts, their utility for answering the intended research questions and their applicability to practical scenarios
(e.g. language assessment)
We will use Swedish learner-written essays, collected and manually annotated by us, and generalize to the social
media domain (through available corpora). Natural language processing, machine learning, neural networks and
word embeddings are among the methods we will work with.
Tools and datasets will be openly shared; theoretical and methodological insights will be discussed in articles.
Status: Ongoing
Actual start/end date: 01/01/2023 to 31/12/2028

Fields of science

  • 6121 Languages
  • 113 Computer and information sciences