Resolving the structure of interactomes with hierarchical agglomerative clustering

biomed - Park Yongjin , Bader

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

10 pages

English

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

A propos
Informations
Extrait

Description

Graphs provide a natural framework for visualizing and analyzing networks of many types, including biological networks. Network clustering is a valuable approach for summarizing the structure in large networks, for predicting unobserved interactions, and for predicting functional annotations. Many current clustering algorithms suffer from a common set of limitations: poor resolution of top-level clusters; over-splitting of bottom-level clusters; requirements to pre-define the number of clusters prior to analysis; and an inability to jointly cluster over multiple interaction types. Results A new algorithm, Hierarchical Agglomerative Clustering (HAC), is developed for fast clustering of heterogeneous interaction networks. This algorithm uses maximum likelihood to drive the inference of a hierarchical stochastic block model for network structure. Bayesian model selection provides a principled method for collapsing the fine-structure within the smallest groups, and for identifying the top-level groups within a network. Model scores are additive over independent interaction types, providing a direct route for simultaneous analysis of multiple interaction types. In addition to inferring network structure, this algorithm generates link predictions that with cross-validation provide a quantitative assessment of performance for real-world examples. Conclusions When applied to genome-scale data sets representing several organisms and interaction types, HAC provides the overall best performance in link prediction when compared with other clustering methods and with model-free graph diffusion kernels. Investigation of performance on genome-scale yeast protein interactions reveals roughly 100 top-level clusters, with a long-tailed distribution of cluster sizes. These are in turn partitioned into 1000 fine-level clusters containing 5 proteins on average, again with a long-tailed size distribution. Top-level clusters correspond to broad biological processes, whereas fine-level clusters correspond to discrete complexes. Surprisingly, link prediction based on joint clustering of physical and genetic interactions performs worse than predictions based on individual data sets, suggesting a lack of synergy in current high-throughput data.

Informations

Publié par	biomed
Publié le	01 janvier 2011
Nombre de lectures	5
Langue	English

Extrait

Park and BaderBMC Bioinformatics2011,12(Suppl 1):S44 http://www.biomedcentral.com/14712105/12/S1/S44

R E S E A R C HOpen Access Resolving the structure of interactomes with hierarchical agglomerative clustering 1,2 1,2* Yongjin Park, Joel S Bader FromThe Ninth Asia Pacific Bioinformatics Conference (APBC 2011) Inchon, Korea. 1114 January 2011

Abstract Background:Graphs provide a natural framework for visualizing and analyzing networks of many types, including biological networks. Network clustering is a valuable approach for summarizing the structure in large networks, for predicting unobserved interactions, and for predicting functional annotations. Many current clustering algorithms suffer from a common set of limitations: poor resolution of toplevel clusters; oversplitting of bottomlevel clusters; requirements to predefine the number of clusters prior to analysis; and an inability to jointly cluster over multiple interaction types. Results:A new algorithm, Hierarchical Agglomerative Clustering (HAC), is developed for fast clustering of heterogeneous interaction networks. This algorithm uses maximum likelihood to drive the inference of a hierarchical stochastic block model for network structure. Bayesian model selection provides a principled method for collapsing the finestructure within the smallest groups, and for identifying the toplevel groups within a network. Model scores are additive over independent interaction types, providing a direct route for simultaneous analysis of multiple interaction types. In addition to inferring network structure, this algorithm generates link predictions that with crossvalidation provide a quantitative assessment of performance for realworld examples. Conclusions:When applied to genomescale data sets representing several organisms and interaction types, HAC provides the overall best performance in link prediction when compared with other clustering methods and with modelfree graph diffusion kernels. Investigation of performance on genomescale yeast protein interactions reveals roughly 100 toplevel clusters, with a longtailed distribution of cluster sizes. These are in turn partitioned into 1000 finelevel clusters containing 5 proteins on average, again with a longtailed size distribution. Toplevel clusters correspond to broad biological processes, whereas finelevel clusters correspond to discrete complexes. Surprisingly, link prediction based on joint clustering of physical and genetic interactions performs worse than predictions based on individual data sets, suggesting a lack of synergy in current highthroughput data.

Background Graphs or networks provide an excellent organizing fra mework for representing data from highthroughput experiments that measure interactomes, or genome scale biological interactions: physical interactions between proteins; genetic interactions or specific pheno types such as synthetic lethality between genes; gene regulation interactions between transcription factors and

* Correspondence: joel.bader@jhu.edu 1 Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA Full list of author information is available at the end of the article

genes; and metabolic connections between enzymes and metabolites. In these networks, vertices represent genes, proteins, or other molecules, and edges represent speci fic interaction types [1,2]. An important current challenge is to develop methods to analyze these and other networks, such as social net works [3]. One challenge is to infer network structure by identifying subgroups of related vertices, which in the biological domain may be inferred to have similar func tions. A second challenge is to predict links that might exist but which are not represented in the data. Missing links are prevalent in biological interactomes, where

© 2011 Park and Bader; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Univers
Ebooks
Livres audio
Presse
Podcasts
BD
Documents

Resolving the structure of interactomes with hierarchical agglomerative clustering

YouScribe

Le catalogue

Le service

Les conditions