Project details
Description (abstract)
The project is based on intensive collaboration within a multidisciplinary consortium. The project is led by Professor Heikki Mannila from Helsinki Institute of Information Technology. The third partner in the consortium is the Varieng Centre of Excellence, led by Professor Terttu Nevalainen from the University of Helsinki.
Description
Communication in the modern world is more versatile than ever. Written language can vary from novels and newspaper articles to instant messaging and personal letters, while spoken language ranges from formal speeches and interviews to mobile and face-to-face conversations. The DAMMOC project is about analysing and comparing language use in different contexts. Using state-of-the-art techniques from data mining and information visualisation, we are developing new tools and methods for studying this enormous variety of linguistic communication.
To understand the significance of changes in present-day language, we study corpora of spoken and written texts from both present and past varieties of English. One of our research topics is linguistic complexity. For example, written language is claimed to be more complex than spoken language; however, there is also a trend of colloquialisation, where written language acquires features typically associated with spoken language. We are currently studying how this trend is manifested in early English letters, as measured by the proportion of nouns out of all words in the correspondence. A high percentage of nouns implies higher complexity.
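The noun-proportion measure described above can be illustrated with a minimal sketch. It is not the project's own tooling: it assumes each text has already been tokenised and part-of-speech tagged, that punctuation has been removed, and that nouns carry Penn Treebank-style tags beginning with `NN`. The function name `noun_proportion` and the sample sentence are hypothetical.

```python
from typing import Iterable, Tuple


def noun_proportion(tagged_tokens: Iterable[Tuple[str, str]]) -> float:
    """Return the share of tokens tagged as nouns (0.0 for an empty text).

    Assumption: noun tags start with "NN" (Penn Treebank style:
    NN, NNS, NNP, NNPS). Other tagsets would need a different test.
    """
    total = 0
    nouns = 0
    for _word, tag in tagged_tokens:
        total += 1
        if tag.startswith("NN"):
            nouns += 1
    return nouns / total if total else 0.0


# Hypothetical, hand-tagged example: one noun out of five words.
letter = [
    ("I", "PRP"),
    ("send", "VBP"),
    ("you", "PRP"),
    ("the", "DT"),
    ("book", "NN"),
]
share = noun_proportion(letter)
print(f"Noun proportion: {share:.2f}")
```

Computing this value per letter, and then grouping by decade or by the writer's gender, would give the kind of series that can be compared across time periods.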
Our results indicate that colloquialisation is not a recent phenomenon: the proportion of nouns in our material decreases as we move from the 15th to the 17th century. Furthermore, it seems that this change is led by women. From previous research we know that women use fewer nouns and more pronouns than men in present-day English; our results show that this difference in communicative style may have existed for centuries. In future work, we hope to compare English with other languages, such as Finnish, to determine whether this might be a cross-linguistic trend.
The above is just one example of our work – we are investigating a host of linguistic issues related to complexity, language variation and change. To study these problems, we use information visualisation combined with data analysis methods such as clustering, pattern mining and classification. Challenges for data mining include interaction between linguistic variables, analysis of non-stationary temporal data, and combining and comparing data from different corpora. The development of interactive methods is a key component of our research. The tools and methods we create will be disseminated to linguists and other domain experts as well as computer scientists around the world.
Status | Finished |
---|---|
Actual start/end date | 01/01/2008 → 31/12/2011 |