This project focuses on the automatic detection of stable grammatical and lexical features in n-grams and has two main practical objectives:
1. Development of corpus-based self-study tools for both teachers and students. These language-teaching analysis tools are based on large corpora and use a statistically sound method developed in collaboration with the Dept. of Computer Science and the Dept. of Modern Languages. The aim of the application is to answer questions about how words co-occur: which combinations must be learned by rote and which follow general rules. The tools will be freely available as an online service and a phone application.
2. Development of an application for self-assessment, possibly with a mobile interface. We expect that our analysis methods can also be used to discover divergences between "clean" corpora and learners' texts. By discovering these divergences, we can identify learners' mistakes and their frequencies, group them by type, etc. Although some types of errors will likely not be amenable to our methods, the ability to handle a sizable proportion of errors would have substantial impact and far-reaching applications. For example, it will be possible to rank the types of mistakes by frequency of occurrence. Because the queries are issued against very large corpora, the user will receive comprehensive, automatically assembled information.
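
To make the second objective concrete, the following is a minimal sketch of how divergences between a reference corpus and learner texts might be found and ranked by frequency. It is not the project's actual method, which is not specified here: the whitespace tokenization, the raw-count threshold `min_ref_count`, the function names, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams over an iterable of sentences (naive whitespace tokenization)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def rank_divergences(reference, learner, n=2, min_ref_count=1):
    """Rank learner n-grams seen fewer than min_ref_count times in the reference.

    The result is sorted by learner frequency (highest first), with ties
    broken alphabetically so the order is deterministic.
    """
    ref_counts = count_ngrams(reference, n)
    learner_counts = count_ngrams(learner, n)
    flagged = {
        gram: count
        for gram, count in learner_counts.items()
        if ref_counts[gram] < min_ref_count  # Counter returns 0 for unseen n-grams
    }
    return sorted(flagged.items(), key=lambda item: (-item[1], item[0]))


# Toy data, invented for illustration only.
reference_corpus = [
    "she depends on her parents",
    "it depends on the weather",
    "he is interested in music",
]
learner_texts = [
    "it depends of the weather",
    "she depends of her parents",
    "he is interested in music",
]

ranking = rank_divergences(reference_corpus, learner_texts)
for gram, count in ranking:
    print(" ".join(gram), count)
```

On this toy input the bigram "depends of" is ranked first because it occurs twice in the learner texts and never in the reference. An n-gram that is merely absent from the reference is not necessarily an error, so with a real corpus a statistical association measure would presumably replace the raw threshold used here.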