Compressed suffix trees: design, construction, and applications [Elektronische Ressource] / Simon Gog
170 pages
English

Compressed suffix trees: design, construction, and applications [Elektronische Ressource] / Simon Gog

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres
170 pages
English
Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

Description

Universität Ulm |89069Ulm|Germany Fakultät fürIngenieurwissenschaftenund InformatikInstitut für TheoretischeInformatikCompressed Suffix Trees: Design,Construction, and ApplicationsDISSERTATIONzur Erlangung des Doktorgrades Dr. rer. nat.der Fakultät für Ingenieurwissenschaftenund Informatik derUniversität Ulmvorgelegt vonSIMON GOGaus EhingenAugust 2011Amtierender Dekan: Prof. Dr. Klaus DietmayerGutachter: Prof. Dr. Enno OhlebuschProf. Dr. Jacobo ToránProf. Dr. Kunihiko SadakaneTag der Promotion: 24. 11. 20113AcknowledgmentFirst of all I am grateful to my supervisor Enno Ohlebusch for the guidance during myresearch on this work here in Ulm. I also thank Uwe Schöning for providing me a researchassistant position. I would also like to thank Kunihiko Sadakane and Jacobo Torán forrefereeing my thesis.I had a great time here in Ulm during my Ph.D. studies with my colleagues of theInstitute. I enjoyed the daily discussions at lunch and the walk back around the buildingto the office. I am grateful to Adrian Kügel and Stefan Arnold for many discussions aboutAalgorithms, C++ template programming and LT X. I also thank Adrian for proofreadingEthis thesis and Markus Maucher for the suggestion to use the R programming languagefor evaluating my experimental results. I am also grateful to Timo Beller, who conductedthe experiments which are depicted in Table 5.1, and to Rodrigo Cánovas for providingme a patched version of the libcds library.

Sujets

Informations

Publié par
Publié le 01 janvier 2011
Nombre de lectures 20
Langue English
Poids de l'ouvrage 2 Mo

Extrait

Universität Ulm |89069Ulm|Germany Fakultät für
Ingenieurwissenschaften
und Informatik
Institut für Theoretische
Informatik
Compressed Suffix Trees: Design,
Construction, and Applications
DISSERTATION
zur Erlangung des Doktorgrades Dr. rer. nat.
der Fakultät für Ingenieurwissenschaften
und Informatik der
Universität Ulm
vorgelegt von
SIMON GOG
aus Ehingen
August 2011Amtierender Dekan: Prof. Dr. Klaus Dietmayer
Gutachter: Prof. Dr. Enno Ohlebusch
Prof. Dr. Jacobo Torán
Prof. Dr. Kunihiko Sadakane
Tag der Promotion: 24. 11. 20113
Acknowledgment
First of all I am grateful to my supervisor Enno Ohlebusch for the guidance during my
research on this work here in Ulm. I also thank Uwe Schöning for providing me a research
assistant position. I would also like to thank Kunihiko Sadakane and Jacobo Torán for
refereeing my thesis.
I had a great time here in Ulm during my Ph.D. studies with my colleagues of the
Institute. I enjoyed the daily discussions at lunch and the walk back around the building
to the office. I am grateful to Adrian Kügel and Stefan Arnold for many discussions about
A
algorithms, C++ template programming and LT X. I also thank Adrian for proofreading
E
this thesis and Markus Maucher for the suggestion to use the R programming language
for evaluating my experimental results. I am also grateful to Timo Beller, who conducted
the experiments which are depicted in Table 5.1, and to Rodrigo Cánovas for providing
me a patched version of the libcds library.4Contents
1 Introduction 2
2 Basic Concepts 5
2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Machine Model,O(·)-Notation, and Test Environment . . . . . . . 6
2.1.4 Text Compressibility . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 The Suffix Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 The Suffix Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 The Operations Rank and Select . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 The Burrows-Wheeler Transform . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 The Wavelet Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.1 The Basic Wavelet Tree . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.2 Run-Length Wavelet Tree . . . . . . . . . . . . . . . . . . . . . . . 18
2.7 The Backward Search Concept . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 The Longest Common Prefix Array . . . . . . . . . . . . . . . . . . . . . . 24
3 Design 28
3.1 The Big Picture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 CST Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 The Succinct Data Structure Library . . . . . . . . . . . . . . . . . . . . . 35
3.4 The vector Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 int_vector and bit_vector . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 enc_vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 The rank_support and select_support Concept . . . . . . . . . . . . . 38
3.6 The wavelet_tree Concept . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6.1 The Hierarchical Run-Length Wavelet Tree . . . . . . . . . . . . . 41
3.7 The csa Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.7.1 The Implementation of csa_uncompressed . . . . . . . . . . . . . 47
3.7.2 The of csa_sada . . . . . . . . . . . . . . . . . . 47
3.7.3 The Implementation of csa_wt . . . . . . . . . . . . . . . . . . . . 48
3.7.4 Experimental Comparison of CSA Implementations . . . . . . . . . 48
3.8 The bp_support Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.8.1 The New bp_support Data Structure . . . . . . . . . . . . . . . . 56
3.8.2 The Range Min-Max-Tree . . . . . . . . . . . . . . . . . . . . . . . 68
iii Contents
3.8.3 Experimental Comparison of bp_support Implementations . . . . 70
3.9 The cst Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.9.1 Sadakane’s CST Revisited . . . . . . . . . . . . . . . . . . . . . . . 78
3.9.2 Fischer et al.’s Solution Revisited . . . . . . . . . . . . . . . . . . . 80
3.9.3 A Succinct Representation of the LCP-Interval Tree . . . . . . . . 80
3.9.4 Node Type, Iterators, and Members . . . . . . . . . . . . . . . . . 86
3.9.5 Iterators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.10 The lcp Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.10.1 The Tree Compressed LCP Representation . . . . . . . . . . . . . . 91
3.10.2 An Advanced Tree Compressed LCP Representation . . . . . . . . 93
3.10.3 Experimental Comparison of LCP Implementations . . . . . . . . . 95
4 Experimental Comparison of CSTs 100
4.1 Existing Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.2 Our Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.3 Our Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.3.1 The parent(v) Operation . . . . . . . . . . . . . . . . . . . . . . . . 105
4.3.2 The sibling(v) Op . . . . . . . . . . . . . . . . . . . . . . . 105
4.3.3 The ith child(v,1) Operation . . . . . . . . . . . . . . . . . . . . . 105
4.3.4 Depth-First-Search Traversals . . . . . . . . . . . . . . . . . . . . . 109
4.3.5 Calculating Matching Statistics . . . . . . . . . . . . . . . . . . . . 110
4.3.6 The child(v,c) Operation . . . . . . . . . . . . . . . . . . . . . . . 110
4.3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.4 The Anatomy of Selected CSTs . . . . . . . . . . . . . . . . . . . . . . . . 116
5 Construction 119
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.2 CST Construction Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.2.1 CSA Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.2.2 LCP . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2.3 NAV . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.2.4 Fast Semi-External Construction of CSTs . . . . . . . . . . . . . . 121
5.3 A LACA for CST construction . . . . . . . . . . . . . . . . . . . . . . . . 126
5.3.1 Linear Time LACAs Revisited . . . . . . . . . . . . . . . . . . . . 126
5.3.2 Ideas for a LACA in the CST Construction . . . . . . . . . . . . . 128
5.3.3 The First Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3.4 The Second Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.3.5 Experimental Comparison of LACAs . . . . . . . . . . . . . . . . . 134
6 Applications 138
6.1 Calculating the k-th Order Empirical Entropy . . . . . . . . . . . . . . . . 139
6.2 Succinct Range Minimum Query Data Structures . . . . . . . . . . . . . . 141
6.3 Maximal Exact Matches . . . . . . . . . . . . . . . . . . . . . 145
6.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145Contents 1
6.3.2 The New Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.3.3 Implementation and Experiments . . . . . . . . . . . . . . . . . . . 148
7 Conclusion 1511 Introduction
In sequence analysis it is often advantageous to build an index data structure for large
texts, as many tasks – for instance repeated pattern matching – can then be solved in
optimal time. The most popular and important index data structure for many years was
the suffix tree, which was proposed in the early 1970s by Weiner [Wei73] and was later
improved by McCreight [McC76]. Gusfield showed in [Gus97] the power of the suffix tree
by presenting over 20 problems which can be solved in optimal time complexity with
ease by using the suffix tree.
Despite its good properties — the suffix tree can be constructed in linear time for
a text of length n over a constant size alphabet of size σ, and most operations can be
performed in constant time — it has a severe drawback. Its space consumption is at least
17 times the text size in practice [Kur99]. In the early 1990s Gonnet et al. [GBYS92] and
Manber and Myers [MM93] introduced another index data structure called suffix array,
which only occupies nlogn bits which equals four times the text size, when the text uses
32
the ASCII alphabet and the sizen of the text is smaller than 2 . The price for the space
reduction was often a logn factor in the time complexity of the solution. However with
additional tables like the longest common prefix array (LCP) and the child table it is
possible to replace the suffix tree by the so called enhanced suffix array [AKO04], which
requires in the worst case 12 times the text size (nlogn bits for each table).
In the last decade it was shown that the space of the suffix tree can be further improved
by using so called succinct data structures, which have an amazing property: their
representation is compressed but the defined operations can still be performed in an
efficient way. That is we do not have to first decompress the representation to answer an
operation. For

  • Univers Univers
  • Ebooks Ebooks
  • Livres audio Livres audio
  • Presse Presse
  • Podcasts Podcasts
  • BD BD
  • Documents Documents