Cancer genetics research methods in the next-generation sequencing era

Research output: Thesis › Doctoral Thesis › Collection of Articles


Research in cancer genetics aims to detect the genetic causes of the excessive growth of cells, which may subsequently form a tumor and further develop into cancer. The Human Genome Project succeeded in mapping the majority of the human DNA sequence, which enabled modern sequencing technologies, namely next-generation sequencing (NGS), to emerge. The new era of disease genetics research shifted DNA analyses from the laboratory to computer screens. Since then, the massive growth of sequencing data has facilitated the detection of novel disease-causing mutations and thus improved the screening and medical treatment of cancer. However, the exponential growth of sequencing data has brought new challenges for computing. The sheer volume of data is not only expensive to store and maintain, but also highly demanding to process and analyze. Moreover, not only has the amount of sequencing data increased, but new kinds of functional genomics data, which are instrumental in determining the consequences of detected mutations, have also emerged. Continuous software development has therefore become essential to enable the utilization of all produced research data, new and old.
This thesis describes software for the analysis and visualization of NGS data (publication I) that allows the integration of genomic data from various sources. The software, BasePlayer, was designed to meet the need for efficient and user-friendly methods for analyzing and visualizing massive variant data and various other types of genomic data. To this end, we developed a multi-purpose tool for the analysis of genomic data, such as DNA, RNA, ChIP-seq, and DNase data. The capabilities of BasePlayer in the detection of putatively causative variants and in data visualization have already been used in over twenty scientific publications. The applicability of the software is demonstrated in this thesis with two distinct analysis cases, publications II and III.
The second study considered somatic mutations in colorectal cancer (CRC) genomes. By analyzing whole-genome sequencing (WGS) data with BasePlayer, we identified distinct mutation patterns at CTCF/cohesin binding sites (CBSs). The sites were observed to be frequently mutated in CRC, especially in samples with a specific mutational signature. However, the source of the mutation accumulation remained unclear. In contrast, a subset of samples with an ultra-mutator phenotype, caused by a defective polymerase epsilon (POLE) gene, exhibited an inverse pattern at CBSs. We detected the same signal in other, predominantly gastrointestinal, cancers as well. However, we were not able to measure changes in gene expression at the mutated sites, so the role of the CBS mutations in tumorigenesis remains to be elucidated.
The third study considered esophageal squamous cell carcinoma (ESCC), with the objective of detecting predisposing mutations using Finnish Cancer Registry (FCR) data. We performed a clustering analysis of the FCR data, with additional information obtained from the Population Information System of Finland. We detected an enrichment of ESCC in the Karelia region and were able to collect and sequence 30 formalin-fixed paraffin-embedded (FFPE) samples from the region. We reported several candidate genes, of which EP300 and DNAH9 were considered the most interesting. The study not only reported putative genes predisposing to ESCC but also served as a proof of concept for the feasibility of genetic research that combines clustering of FCR data with FFPE exome sequencing.
Original language: English
  • Aaltonen, Lauri, Supervisor
  • Pitkänen, Esa, Supervisor
Place of Publication: Helsinki
Print ISBNs: 978-951-51-5898-7
Electronic ISBNs: 978-951-51-5899-4
Publication status: Published - 2020
MoE publication type: G5 Doctoral dissertation (article)

Bibliographical note

M1 - 79 p. + appendices

Fields of Science

  • Neoplasms
  • +diagnosis
  • +genetics
  • Computational Biology
  • +methods
  • Cell Cycle Proteins
  • Chromatin Immunoprecipitation Sequencing
  • Chromosomal Proteins, Non-Histone
  • Colorectal Neoplasms
  • Data Analysis
  • Data Visualization
  • Early Detection of Cancer
  • Esophageal Neoplasms
  • Esophageal Squamous Cell Carcinoma
  • Exome
  • High-Throughput Nucleotide Sequencing
  • Human Genome Project
  • Mutation Accumulation
  • Regulatory Sequences, Nucleic Acid
  • RNA
  • Software
  • 3111 Biomedicine
