Resolving the structure of interactomes with hierarchical agglomerative clustering
10 pages
English

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Resolving the structure of interactomes with hierarchical agglomerative clustering

-

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris
Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus
10 pages
English
Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

Description

Graphs provide a natural framework for visualizing and analyzing networks of many types, including biological networks. Network clustering is a valuable approach for summarizing the structure in large networks, for predicting unobserved interactions, and for predicting functional annotations. Many current clustering algorithms suffer from a common set of limitations: poor resolution of top-level clusters; over-splitting of bottom-level clusters; requirements to pre-define the number of clusters prior to analysis; and an inability to jointly cluster over multiple interaction types. Results A new algorithm, Hierarchical Agglomerative Clustering (HAC), is developed for fast clustering of heterogeneous interaction networks. This algorithm uses maximum likelihood to drive the inference of a hierarchical stochastic block model for network structure. Bayesian model selection provides a principled method for collapsing the fine-structure within the smallest groups, and for identifying the top-level groups within a network. Model scores are additive over independent interaction types, providing a direct route for simultaneous analysis of multiple interaction types. In addition to inferring network structure, this algorithm generates link predictions that with cross-validation provide a quantitative assessment of performance for real-world examples. Conclusions When applied to genome-scale data sets representing several organisms and interaction types, HAC provides the overall best performance in link prediction when compared with other clustering methods and with model-free graph diffusion kernels. Investigation of performance on genome-scale yeast protein interactions reveals roughly 100 top-level clusters, with a long-tailed distribution of cluster sizes. These are in turn partitioned into 1000 fine-level clusters containing 5 proteins on average, again with a long-tailed size distribution. Top-level clusters correspond to broad biological processes, whereas fine-level clusters correspond to discrete complexes. Surprisingly, link prediction based on joint clustering of physical and genetic interactions performs worse than predictions based on individual data sets, suggesting a lack of synergy in current high-throughput data.

Informations

Publié par
Publié le 01 janvier 2011
Nombre de lectures 5
Langue English

Extrait

Park and BaderBMC Bioinformatics2011,12(Suppl 1):S44 http://www.biomedcentral.com/14712105/12/S1/S44
R E S E A R C HOpen Access Resolving the structure of interactomes with hierarchical agglomerative clustering 1,2 1,2* Yongjin Park, Joel S Bader FromThe Ninth Asia Pacific Bioinformatics Conference (APBC 2011) Inchon, Korea. 1114 January 2011
Abstract Background:Graphs provide a natural framework for visualizing and analyzing networks of many types, including biological networks. Network clustering is a valuable approach for summarizing the structure in large networks, for predicting unobserved interactions, and for predicting functional annotations. Many current clustering algorithms suffer from a common set of limitations: poor resolution of toplevel clusters; oversplitting of bottomlevel clusters; requirements to predefine the number of clusters prior to analysis; and an inability to jointly cluster over multiple interaction types. Results:A new algorithm, Hierarchical Agglomerative Clustering (HAC), is developed for fast clustering of heterogeneous interaction networks. This algorithm uses maximum likelihood to drive the inference of a hierarchical stochastic block model for network structure. Bayesian model selection provides a principled method for collapsing the finestructure within the smallest groups, and for identifying the toplevel groups within a network. Model scores are additive over independent interaction types, providing a direct route for simultaneous analysis of multiple interaction types. In addition to inferring network structure, this algorithm generates link predictions that with crossvalidation provide a quantitative assessment of performance for realworld examples. Conclusions:When applied to genomescale data sets representing several organisms and interaction types, HAC provides the overall best performance in link prediction when compared with other clustering methods and with modelfree graph diffusion kernels. Investigation of performance on genomescale yeast protein interactions reveals roughly 100 toplevel clusters, with a longtailed distribution of cluster sizes. These are in turn partitioned into 1000 finelevel clusters containing 5 proteins on average, again with a longtailed size distribution. Toplevel clusters correspond to broad biological processes, whereas finelevel clusters correspond to discrete complexes. Surprisingly, link prediction based on joint clustering of physical and genetic interactions performs worse than predictions based on individual data sets, suggesting a lack of synergy in current highthroughput data.
Background Graphs or networks provide an excellent organizing fra mework for representing data from highthroughput experiments that measure interactomes, or genome scale biological interactions: physical interactions between proteins; genetic interactions or specific pheno types such as synthetic lethality between genes; gene regulation interactions between transcription factors and
* Correspondence: joel.bader@jhu.edu 1 Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA Full list of author information is available at the end of the article
genes; and metabolic connections between enzymes and metabolites. In these networks, vertices represent genes, proteins, or other molecules, and edges represent speci fic interaction types [1,2]. An important current challenge is to develop methods to analyze these and other networks, such as social net works [3]. One challenge is to infer network structure by identifying subgroups of related vertices, which in the biological domain may be inferred to have similar func tions. A second challenge is to predict links that might exist but which are not represented in the data. Missing links are prevalent in biological interactomes, where
© 2011 Park and Bader; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Univers Univers
  • Ebooks Ebooks
  • Livres audio Livres audio
  • Presse Presse
  • Podcasts Podcasts
  • BD BD
  • Documents Documents