FINEMAP - a statistical software for identifying causal genetic variants

Christian Benner

Research output: Thesis, Doctoral Thesis, Collection of Articles


The explosion of genomic data during the last ten years and the advent of Genome-Wide Association Studies (GWAS) have led to robust statistical associations between thousands of genomic regions and hundreds of phenotypes. However, any one associated genomic region can harbor thousands of correlated genetic variants, which complicates the understanding of the biological mechanisms underlying these associations. To address this problem, this doctoral thesis presents the development of the FINEMAP software for fine-mapping causal variants in these regions.

In 2016, we addressed the computational cost of the exhaustive search strategy used by earlier fine-mapping methods by implementing a Bayesian regression model and an ultrafast stochastic search algorithm in the FINEMAP software. We demonstrated that FINEMAP opens up new opportunities by fine-mapping the High Density Lipoprotein (HDL) cholesterol association at the LIPC locus with 20,000 variants in less than 90 seconds, while exhaustive search would require many years. With extensive simulations we further showed that FINEMAP is as accurate as exhaustive search when the latter can be completed, and achieves even higher accuracy when the latter must be restricted for computational reasons. Thus, FINEMAP is a promising tool for future fine-mapping analyses.

Fine-mapping methods that use GWAS results also require Linkage Disequilibrium (LD) information as input, in the form of estimates of pairwise correlations between variants. Motivated by feedback from FINEMAP users, we investigated in 2017 the consequences of LD misspecification that can occur when publicly available reference genomes are used. We demonstrated both empirically and theoretically that the size of the reference panel needs to scale with the GWAS sample size to produce accurate results, and we provided the LDstore software to help share LD estimates. This finding has important consequences for the application of all fine-mapping methods that use GWAS results from GWAS consortia, in which accurate LD estimates from each participating study are typically not available.

In 2018, we implemented in FINEMAP an approach for estimating how much phenotypic variation can be explained by the causal variants. To demonstrate this, we applied FINEMAP to 110 regions across 51 biomarkers on 5,265 Finnish samples. We compared regional heritability estimation using FINEMAP with both the variance component model BOLT and the fixed-effect model HESS in biomarker-associated regions, showing good concordance among all methods. Through simulations with biobank-scale projects, we also illustrated how violations of model assumptions on polygenicity or unspecified genetic architecture induce inaccuracy in existing heritability estimates, and that this inaccuracy becomes more pronounced as statistical power to identify causal variants increases. Ever-increasing GWAS sample sizes, soon reaching millions of samples, provide unprecedented statistical power to decompose heritability estimates from polygenic models into heritability contributions from causal variants.

In conclusion, this doctoral thesis shows that (1) the computational efficiency and accuracy of FINEMAP make it a promising fine-mapping tool, (2) LD estimates need to be chosen more carefully than previously thought to avoid bias, and (3) large-scale data sets provide new opportunities for fine-mapping to deduce a variant-level picture of regional genetic architecture.
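To give a rough sense of why exhaustive search becomes impractical for a region with 20,000 variants, the following sketch counts how many causal configurations an exhaustive method would have to evaluate. It is only an illustration: the function name `count_configurations` and the cap of five causal variants per region are assumptions made here, not details taken from the thesis or from the FINEMAP software.

```python
from math import comb


def count_configurations(n_variants: int, max_causal: int) -> int:
    """Count non-empty sets of causal variants with at most max_causal members.

    Hypothetical helper for illustration only; not part of FINEMAP.
    """
    return sum(comb(n_variants, k) for k in range(1, max_causal + 1))


if __name__ == "__main__":
    # 20,000 variants matches the LIPC example in the abstract;
    # the upper limit of 5 causal variants is an assumed value.
    for max_causal in range(1, 6):
        total = count_configurations(20_000, max_causal)
        print(f"up to {max_causal} causal variants: {total:.3e} configurations")
```

The count grows combinatorially with the allowed number of causal variants, which is the motivation for exploring only a subset of configurations, as a stochastic search does.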
Original language: English
  • Pirinen, Matti, Supervisor
  • Ripatti, Samuli, Supervisor
Place of Publication: Helsinki
Print ISBNs: 978-951-51-4805-6
Electronic ISBNs: 978-951-51-4806-3
Publication status: Published - 2019
MoE publication type: G5 Doctoral dissertation (article)

Fields of Science

  • Software
  • Genome
  • Genome-Wide Association Study
  • Linkage Disequilibrium
  • Sample Size
  • Genotype
  • Efficiency
  • Probability
  • Lipase/genetics
  • Lipoproteins, HDL
  • Cholesterol, HDL
  • DNA
  • Oligonucleotide Array Sequence Analysis
  • Algorithms
  • Bayes Theorem
  • Genetic Variation
  • Biomarkers
  • Genomics
  • Phenotype
  • Computational Biology/methods
  • 3142 Public health care science, environmental and occupational health
