Fast splice site detection using information content and feature reduction

biomed - AKMA Baten, SK Halgamuge, BCH Chang




BMC Bioinformatics

BioMed Central

Open Access

Research

Fast splice site detection using information content and feature reduction

AKMA Baten*1, SK Halgamuge1 and BCH Chang2

Address: 1 Biomechanical Engineering Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, Victoria 3010, Australia and 2 Institute of Plant and Microbial Biology, Academia Sinica, Taiwan

Email: AKMA Baten* a.baten@pgrad.unimelb.edu.au; SK Halgamuge saman@unimelb.edu.au; BCH Chang bchang1@gate.sinica.edu.tw

* Corresponding author

from Asia Pacific Bioinformatics Network (APBioNet) Seventh International Conference on Bioinformatics (InCoB2008), Taipei, Taiwan. 20–23 October 2008

Published: 12 December 2008

BMC Bioinformatics 2008, 9(Suppl 12):S8

doi:10.1186/1471-2105-9-S12-S8

Supplement: Seventh International Conference on Bioinformatics (InCoB2008). Editors: Shoba Ranganathan, Wen-Lian Hsu, Ueng-Cheng Yang and Tin Wee Tan. Proceedings.

This article is available from: http://www.biomedcentral.com/1471-2105/9/S12/S8

© 2008 Baten et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Accurate identification of splice sites in DNA sequences plays a key role in the prediction of gene structure in eukaryotes. Many computational methods have already been proposed for the detection of splice sites, and some of them show high prediction accuracy. However, most of these methods are limited by their long computation time when applied to whole-genome sequence data.

Results: In this paper we propose a hybrid algorithm that combines several effective and informative input features with a state-of-the-art support vector machine (SVM). To obtain the input features we employ an information content method based on Shannon's information theory, Shapiro's score scheme, and Markovian probabilities. We also use a feature elimination scheme to remove the less informative features from the input data.

Conclusion: In this study we propose a new feature-based splice site detection method that shows improved acceptor and donor splice site detection in DNA sequences when its performance is compared with various state-of-the-art and well-known methods.
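As a rough illustration of the information content idea named in the abstract, the sketch below computes the per-position information content (in bits) of a set of aligned, equal-length DNA windows, using the common form IC(i) = 2 - H(i), where H(i) is the Shannon entropy of the base frequencies at position i. This is an assumption about the general technique, not the exact feature definition used by the authors; the function name `positional_information_content` and the toy windows are hypothetical.

```python
import math
from collections import Counter


def positional_information_content(sequences):
    """Return the information content (bits) at each position of
    aligned, equal-length DNA sequences.

    Assumed definition (not taken from the paper): IC(i) = 2 - H(i),
    where H(i) is the Shannon entropy of observed base frequencies at
    position i and 2 = log2(4) is the maximum for the alphabet A, C, G, T.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")

    ic = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)
        total = sum(counts.values())
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
        ic.append(2.0 - entropy)
    return ic


# Toy windows, made up for illustration only (not real splice site data).
windows = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA", "TGGTAAGC"]
ic = positional_information_content(windows)
for position, bits in enumerate(ic):
    print(position, round(bits, 3))
```

Positions where every window carries the same base reach the 2-bit maximum, while variable positions score lower. In a feature-based classifier, such per-position scores could be one of several inputs passed on to feature elimination and an SVM.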

Background

Over the past decades, the scientific community has experienced major growth in the amount of sequence data. With the emergence of novel and efficient sequencing technology, DNA sequencing is now much faster. Sequencing of several genomes, including the human genome, has been completed successfully. This massive amount of sequence data demands sophisticated tools for its analysis.

Identifying genes accurately is one of the most important and challenging tasks in bioinformatics, and it requires the prediction of the complete gene structure. Identification of splice sites is the core component of eukaryotic gene finding algorithms. Their success depends on the precise identification of the exon-intron structure and the splice sites. Most eukaryotic protein-coding genes are characterized by exons and introns. Exons are the protein-coding portions of a gene, and they are interrupted by intervening sequences called introns. The border between an
