SANS: high-throughput retrieval of protein sequences allowing 50% mismatches.

J. Patrik Koskinen, Liisa Holm

Research output: Contribution to journal > Article > Scientific > peer-review

Abstract

MOTIVATION:

The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution: functional and structural information can be transferred between homologous proteins. Sequence similarity searching followed by k-nearest neighbor classification is the most widely used tool to predict the function or structure of anonymous gene products that come out of genome sequencing projects.

RESULTS:

We present a novel word filter, suffix array neighborhood search (SANS), to identify protein sequence similarities in the range of 50-100% identity with sensitivity comparable to BLAST and 10 times the speed of USEARCH. In contrast to these previous approaches, the complexity of the search is proportional only to the length of the query sequence and independent of database size, enabling fast searching and functional annotation into the future despite rapidly expanding databases.

AVAILABILITY AND IMPLEMENTATION:

The software is freely available to non-commercial users from our website http://ekhidna.biocenter.helsinki.fi/downloads/sans.

Original language: English
Journal: Bioinformatics
Volume: 28
Issue number: 18
Pages (from-to): i438-i443
Number of pages: 6
ISSN: 1367-4803
DOIs: 10.1093/bioinformatics/bts417
Publication status: Published - 15 Sep 2012
MoE publication type: A1 Journal article-refereed

Cite this

@article{6a25c5e40e7142b189463ffa8a85895c,
title = "{SANS}: high-throughput retrieval of protein sequences allowing 50{\%} mismatches",
abstract = "MOTIVATION: The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution: functional and structural information can be transferred between homologous proteins. Sequence similarity searching followed by k-nearest neighbor classification is the most widely used tool to predict the function or structure of anonymous gene products that come out of genome sequencing projects. RESULTS: We present a novel word filter, suffix array neighborhood search (SANS), to identify protein sequence similarities in the range of 50-100{\%} identity with sensitivity comparable to BLAST and 10 times the speed of USEARCH. In contrast to these previous approaches, the complexity of the search is proportional only to the length of the query sequence and independent of database size, enabling fast searching and functional annotation into the future despite rapidly expanding databases. AVAILABILITY AND IMPLEMENTATION: The software is freely available to non-commercial users from our website http://ekhidna.biocenter.helsinki.fi/downloads/sans.",
author = "Koskinen, {J. Patrik} and Liisa Holm",
year = "2012",
month = "9",
day = "15",
doi = "10.1093/bioinformatics/bts417",
language = "English",
volume = "28",
pages = "i438--i443",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "18",
}

SANS: high-throughput retrieval of protein sequences allowing 50% mismatches. / Koskinen, J. Patrik; Holm, Liisa.

In: Bioinformatics, Vol. 28, No. 18, 15.09.2012, p. i438-i443.

TY - JOUR

T1 - SANS: high-throughput retrieval of protein sequences allowing 50% mismatches.

AU - Koskinen, J. Patrik

AU - Holm, Liisa

PY - 2012/9/15

Y1 - 2012/9/15

AB - MOTIVATION: The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution: functional and structural information can be transferred between homologous proteins. Sequence similarity searching followed by k-nearest neighbor classification is the most widely used tool to predict the function or structure of anonymous gene products that come out of genome sequencing projects. RESULTS: We present a novel word filter, suffix array neighborhood search (SANS), to identify protein sequence similarities in the range of 50-100% identity with sensitivity comparable to BLAST and 10 times the speed of USEARCH. In contrast to these previous approaches, the complexity of the search is proportional only to the length of the query sequence and independent of database size, enabling fast searching and functional annotation into the future despite rapidly expanding databases. AVAILABILITY AND IMPLEMENTATION: The software is freely available to non-commercial users from our website http://ekhidna.biocenter.helsinki.fi/downloads/sans.

U2 - 10.1093/bioinformatics/bts417

DO - 10.1093/bioinformatics/bts417

M3 - Article

VL - 28

SP - i438

EP - i443

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 18

ER -