Word-sense disambiguation in biomedical ontologies
Dissertation
submitted for the academic degree of Doctor rerum naturalium (Dr. rer. nat.)
to the Faculty of Computer Science, Technische Universität Dresden
by
Dipl.-Bioch. M.Sc. Dimitra G Alexopoulou, born 24 November 1981 in Athens, Greece
Reviewers
Prof. Dr.-Ing. Michael Schroeder, Technische Universität Dresden (supervising professor)
Prof. Dr. Udo Hahn, Friedrich-Schiller-Universität Jena
Date of defense: 11 June 2010
Dresden, 8 April 2010
To my parents, George and Maria, and to my grandma, Stefania, who is looking from above
Στους γονείς μου, Γιώργο και Μαρία, και στη γιαγιά μου, Στεφανία, που κοιτάει από ψηλά
Acknowledgments
There are a lot of people who have directly or indirectly helped in making this work possible, and whom I would like to thank. First of all, I would like to thank my supervisor, Professor Michael Schroeder, for giving me the opportunity to work in such an interdisciplinary and international group, guiding me throughout these years, making things look simple in an inspirational way, and caring for us as a group from the scientific side but also from the human side. It has certainly been a great experience for me to work in this group, and I will definitely be present in 2017 when we open the time capsule we all sealed in November 2007.
I would also like to thank Jörg Hakenberg for an inspiring collaboration on Word Sense Disambiguation, Bill Andreopoulos for pushing for publications from the very first months of working together, Thomas Wächter for sharing his enthusiasm while working in the same office, Heiko Dietze for patiently replying to all database-querying related questions, and Andreas Doms for working together on WSD. A big “ευχαριστώ” goes to George Tsatsaronis for providing valuable and constructive feedback on this thesis.
I would like to especially thank Christof Winter for helping me during my first month in Dresden, when I was still “mute”, not speaking a single word of German. He was the person each one of us would like to have around during a fresh start. I can now say “Danke schön” :). In my fresh start, but also over the years, the contribution of Mandy Gläßer was also very important. So “Vielen vielen Dank” goes to her as well. Big thanks also go to the system administrators, Alex Mestiashvili, Nick Dannenberg and Gregor Friedrich, who were available at any time and willing to assist in small and bigger tasks.
I would also like to thank a group of people with whom we started as colleagues and ended up being a lot more: friends, food/beer friends, all-united-against-bad-weather friends, Piled Higher and Deeper friends... I could add a lot of tags to them, but not this time. I think ‘friends’ is enough. So thanks Annalisa Marsico, Anne Tuukkanen, Janine Roy, Conrad Plake, Gihan Dawelbeit, Andreas Henschel.
Special thanks also go to friends with whom I have shared the same passion and frustration for science a bit longer than the PhD years, since we started working on Bioinformatics in Athens: Anna Elefsinioti and Evangelia Petsalakis. More special thanks go to the geographically distant but very, very, very close friends, Rania Limitsiou and Vicky Sagia in Greece and Maria Mirotsou in the US. They know why.
My last but certainly not least thanks go to my parents, George and Maria, and my godmother, Roxanne, for being there anytime and making things sound much easier.
List of Publications
1. Andreopoulos, B., Alexopoulou, D., and Schroeder, M. (2008). Word sense disambiguation in biomedical ontologies with term co-occurrence analysis and document clustering. International Journal of Data Mining and Bioinformatics (Special Issue on Text Mining and Information Retrieval), 2(3), 193–215.
2. Alexopoulou, D., Wächter, T., Pickersgill, L., Eyre, C., and Schroeder, M. (2008). Terminologies for text-mining; an experiment in the lipoprotein metabolism domain. BMC Bioinformatics, 9 Suppl 4, S2.
3. Alexopoulou, D., Andreopoulos, B., Dietze, H., Doms, A., Gandon, F., Hakenberg, J., Khelif, K., Schroeder, M., and Wächter, T. (2009). Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy. BMC Bioinformatics, 10(1), 28.
4. Oliver, H., Diallo, G., de Quincey, E., Kostkova, P., Jawaheer, G., Alexopoulou, D., Habermann, B., Stevens, R., Jupp, S., Khelif, K., Schroeder, M., and Madle, G. (2009). A user-centred evaluation framework for the Sealife semantic web browsers. BMC Bioinformatics, 10, S14. (Special issue dedicated to the SWAT4LS workshop).
Abstract
With the ever-growing volume of biomedical literature, text-mining has emerged as an important technology to support biocuration and search. Word sense disambiguation (WSD), the correct identification of terms in text in the light of ambiguity, is an important problem in text-mining. Since the late 1940s, many approaches based on supervised machine learning (decision trees, naive Bayes, neural networks, support vector machines) and unsupervised machine learning (context clustering, word clustering, co-occurrence graphs) have been developed. Knowledge-based methods that make use of the WordNet computational lexicon have also been developed. However, only a few make use of ontologies, i.e. hierarchical controlled vocabularies, to solve the problem, and none exploit inference over ontologies or metadata from publications. This thesis addresses the WSD problem in biomedical ontologies by proposing different approaches that use ontologies and metadata. The “Closest Sense” method assumes that the ontology defines multiple senses of the term; it computes the shortest path of co-occurring terms in the document to one of these senses. The “Term Cooc” method defines a log-odds ratio for co-occurring terms, including inferred co-occurrences. The “MetaData” approach trains a classifier on metadata; it does not require any ontology, but it does require training data, which the other methods do not. These approaches are compared to each other when applied to a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. Averaged over all approaches and conditions, the success rate is 80%. The MetaData approach performs best, with 96%, when trained on high-quality data; its performance deteriorates as the quality of the training data decreases. The Term Cooc approach performs better on the Gene Ontology (92% success) than on MeSH (73% success), as MeSH is not a strict is-a/part-of hierarchy but rather a loose is-related-to hierarchy. The Closest Sense approach achieves an 80% success rate on average. Furthermore, the thesis showcases applications ranging from ontology design to semantic search where WSD is important.
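To make the idea of path-based disambiguation concrete, the following is a minimal sketch only, not the implementation evaluated in this thesis. It assumes a toy, undirected ontology graph with hypothetical node names, measures distance as the number of edges found by breadth-first search, and scores each candidate sense by the mean distance from the co-occurring terms that can reach it. The function names `shortest_path_length` and `closest_sense`, the scoring by mean distance, and the choice to ignore unreachable terms are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy ontology as an undirected adjacency list (not real GO or MeSH data).
TOY_ONTOLOGY = {
    "biological_process": ["development_bio", "transport"],
    "development_bio": ["biological_process", "embryo_development"],
    "embryo_development": ["development_bio"],
    "transport": ["biological_process"],
    "research_activity": ["development_method"],
    "development_method": ["research_activity", "assay_design"],
    "assay_design": ["development_method"],
}


def shortest_path_length(graph, start, goal):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def closest_sense(graph, senses, context_terms):
    """Pick the sense with the smallest mean distance to the reachable context terms."""
    best_sense, best_score = None, float("inf")
    for sense in senses:
        distances = [shortest_path_length(graph, term, sense) for term in context_terms]
        reachable = [d for d in distances if d is not None]
        if not reachable:
            continue  # assumption: a sense no context term can reach is not a candidate
        score = sum(reachable) / len(reachable)
        if score < best_score:
            best_sense, best_score = sense, score
    return best_sense


# Two hypothetical senses of an ambiguous term, and terms co-occurring in a document.
SENSES = ["development_bio", "development_method"]
chosen = closest_sense(TOY_ONTOLOGY, SENSES, ["embryo_development", "transport"])
```

In this toy setting both context terms lie in the same branch as `development_bio`, so that sense is selected; a document mentioning only `assay_design` would select `development_method` instead.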
Contents
1 Motivation
1.1 Definition of Open Problems
1.1.1 Open Problem 1: Word Sense Disambiguation in Biomedical Corpora
1.1.2 Open Problem 2: Text Mining and WSD in Biomedical Terminologies
1.1.3 Overview