K-Pax2

Bayesian identification of cluster-defining amino acid positions in large sequence datasets

Alberto Pessia, Yonatan Grad, Sarah Cobey, Juha Santeri Puranen, Jukka Corander

Research output: Contribution to journal, Article, Scientific, peer-reviewed

Abstract

The recent growth in publicly available sequence data has introduced new opportunities for studying microbial evolution and spread. Because the pace of sequence accumulation tends to exceed the pace of experimental studies of protein function and the roles of individual amino acids, statistical tools to identify meaningful patterns in protein diversity are essential. Large sequence alignments from fast-evolving micro-organisms are particularly challenging to dissect using standard tools from phylogenetics and multivariate statistics because biologically relevant functional signals are easily masked by neutral variation and noise. To meet this need, a novel computational method is introduced that is easily executed in parallel using a cluster environment and can handle thousands of sequences with minimal subjective input from the user. The usefulness of this kind of machine learning is demonstrated by applying it to nearly 5000 haemagglutinin sequences of influenza A/H3N2. Antigenic and 3D structural mapping of the results show that the method can recover the major jumps in antigenic phenotype that occurred between 1968 and 2013 and identify specific amino acids associated with these changes. The method is expected to provide a useful tool to uncover patterns of protein evolution.
Original language: English
Journal: Microbial Genomics
Volume: 1
Issue number: 1
Number of pages: 11
ISSN: 2057-5858
DOI: 10.1099/mgen.0.000025
Status: Published - 15 July 2015
OKM publication type: A1 Original article in a scientific journal, peer-reviewed

Fields of Science

  • 112 Statistics and probability
  • 1184 Genetics, developmental biology, physiology

Cite this

Pessia, Alberto ; Grad, Yonatan ; Cobey, Sarah ; Puranen, Juha Santeri ; Corander, Jukka. / K-Pax2 : Bayesian identification of cluster-defining amino acid positions in large sequence datasets. In: Microbial Genomics. 2015 ; Vol. 1, No. 1.
@article{43d1e76eefbc40b894bbcb5fae7c0749,
title = "K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets",
abstract = "The recent growth in publicly available sequence data has introduced new opportunities for studying microbial evolution and spread. Because the pace of sequence accumulation tends to exceed the pace of experimental studies of protein function and the roles of individual amino acids, statistical tools to identify meaningful patterns in protein diversity are essential. Large sequence alignments from fast-evolving micro-organisms are particularly challenging to dissect using standard tools from phylogenetics and multivariate statistics because biologically relevant functional signals are easily masked by neutral variation and noise. To meet this need, a novel computational method is introduced that is easily executed in parallel using a cluster environment and can handle thousands of sequences with minimal subjective input from the user. The usefulness of this kind of machine learning is demonstrated by applying it to nearly 5000 haemagglutinin sequences of influenza A/H3N2. Antigenic and 3D structural mapping of the results show that the method can recover the major jumps in antigenic phenotype that occurred between 1968 and 2013 and identify specific amino acids associated with these changes. The method is expected to provide a useful tool to uncover patterns of protein evolution.",
keywords = "112 Statistics and probability, clustering, 1184 Genetics, developmental biology, physiology, protein evolution, Sequence Analysis",
author = "Alberto Pessia and Yonatan Grad and Sarah Cobey and Puranen, {Juha Santeri} and Jukka Corander",
year = "2015",
month = "7",
day = "15",
doi = "10.1099/mgen.0.000025",
language = "English",
volume = "1",
journal = "Microbial Genomics",
issn = "2057-5858",
publisher = "Microbiology Society",
number = "1",

}

K-Pax2 : Bayesian identification of cluster-defining amino acid positions in large sequence datasets. / Pessia, Alberto; Grad, Yonatan; Cobey, Sarah; Puranen, Juha Santeri; Corander, Jukka.

In: Microbial Genomics, Vol. 1, No. 1, 15.07.2015.


TY  - JOUR
T1  - K-Pax2
T2  - Bayesian identification of cluster-defining amino acid positions in large sequence datasets
AU  - Pessia, Alberto
AU  - Grad, Yonatan
AU  - Cobey, Sarah
AU  - Puranen, Juha Santeri
AU  - Corander, Jukka
PY  - 2015/7/15
Y1  - 2015/7/15
AB  - The recent growth in publicly available sequence data has introduced new opportunities for studying microbial evolution and spread. Because the pace of sequence accumulation tends to exceed the pace of experimental studies of protein function and the roles of individual amino acids, statistical tools to identify meaningful patterns in protein diversity are essential. Large sequence alignments from fast-evolving micro-organisms are particularly challenging to dissect using standard tools from phylogenetics and multivariate statistics because biologically relevant functional signals are easily masked by neutral variation and noise. To meet this need, a novel computational method is introduced that is easily executed in parallel using a cluster environment and can handle thousands of sequences with minimal subjective input from the user. The usefulness of this kind of machine learning is demonstrated by applying it to nearly 5000 haemagglutinin sequences of influenza A/H3N2. Antigenic and 3D structural mapping of the results show that the method can recover the major jumps in antigenic phenotype that occurred between 1968 and 2013 and identify specific amino acids associated with these changes. The method is expected to provide a useful tool to uncover patterns of protein evolution.
KW  - 112 Statistics and probability
KW  - clustering
KW  - 1184 Genetics, developmental biology, physiology
KW  - protein evolution
KW  - Sequence Analysis
U2  - 10.1099/mgen.0.000025
DO  - 10.1099/mgen.0.000025
M3  - Article
VL  - 1
JO  - Microbial Genomics
JF  - Microbial Genomics
SN  - 2057-5858
IS  - 1
ER  - 