Compressed suffix trees: design, construction, and applications [Elektronische Ressource] / Simon Gog

universitat_ulm

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

170 pages

English

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

A propos
Informations
Extrait

Description

Sujets

Informatik

Informations

Publié par	universitat_ulm
Publié le	01 janvier 2011
Nombre de lectures	20
Langue	English
Poids de l'ouvrage	2 Mo

Extrait

Universität Ulm |89069Ulm|Germany Fakultät für
Ingenieurwissenschaften
und Informatik
Institut für Theoretische
Informatik
Compressed Suﬃx Trees: Design,
Construction, and Applications
DISSERTATION
zur Erlangung des Doktorgrades Dr. rer. nat.
der Fakultät für Ingenieurwissenschaften
und Informatik der
Universität Ulm
vorgelegt von
SIMON GOG
aus Ehingen
August 2011Amtierender Dekan: Prof. Dr. Klaus Dietmayer
Gutachter: Prof. Dr. Enno Ohlebusch
Prof. Dr. Jacobo Torán
Prof. Dr. Kunihiko Sadakane
Tag der Promotion: 24. 11. 20113
Acknowledgment
First of all I am grateful to my supervisor Enno Ohlebusch for the guidance during my
research on this work here in Ulm. I also thank Uwe Schöning for providing me a research
assistant position. I would also like to thank Kunihiko Sadakane and Jacobo Torán for
refereeing my thesis.
I had a great time here in Ulm during my Ph.D. studies with my colleagues of the
Institute. I enjoyed the daily discussions at lunch and the walk back around the building
to the oﬃce. I am grateful to Adrian Kügel and Stefan Arnold for many discussions about
A
algorithms, C++ template programming and LT X. I also thank Adrian for proofreading
E
this thesis and Markus Maucher for the suggestion to use the R programming language
for evaluating my experimental results. I am also grateful to Timo Beller, who conducted
the experiments which are depicted in Table 5.1, and to Rodrigo Cánovas for providing
me a patched version of the libcds library.4Contents
1 Introduction 2
2 Basic Concepts 5
2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Machine Model,O(·)-Notation, and Test Environment . . . . . . . 6
2.1.4 Text Compressibility . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 The Suﬃx Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 The Suﬃx Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 The Operations Rank and Select . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 The Burrows-Wheeler Transform . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 The Wavelet Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.1 The Basic Wavelet Tree . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.2 Run-Length Wavelet Tree . . . . . . . . . . . . . . . . . . . . . . . 18
2.7 The Backward Search Concept . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 The Longest Common Preﬁx Array . . . . . . . . . . . . . . . . . . . . . . 24
3 Design 28
3.1 The Big Picture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 CST Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 The Succinct Data Structure Library . . . . . . . . . . . . . . . . . . . . . 35
3.4 The vector Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 int_vector and bit_vector . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 enc_vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 The rank_support and select_support Concept . . . . . . . . . . . . . 38
3.6 The wavelet_tree Concept . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6.1 The Hierarchical Run-Length Wavelet Tree . . . . . . . . . . . . . 41
3.7 The csa Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.7.1 The Implementation of csa_uncompressed . . . . . . . . . . . . . 47
3.7.2 The of csa_sada . . . . . . . . . . . . . . . . . . 47
3.7.3 The Implementation of csa_wt . . . . . . . . . . . . . . . . . . . . 48
3.7.4 Experimental Comparison of CSA Implementations . . . . . . . . . 48
3.8 The bp_support Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.8.1 The New bp_support Data Structure . . . . . . . . . . . . . . . . 56
3.8.2 The Range Min-Max-Tree . . . . . . . . . . . . . . . . . . . . . . . 68
iii Contents
3.8.3 Experimental Comparison of bp_support Implementations . . . . 70
3.9 The cst Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.9.1 Sadakane’s CST Revisited . . . . . . . . . . . . . . . . . . . . . . . 78
3.9.2 Fischer et al.’s Solution Revisited . . . . . . . . . . . . . . . . . . . 80
3.9.3 A Succinct Representation of the LCP-Interval Tree . . . . . . . . 80
3.9.4 Node Type, Iterators, and Members . . . . . . . . . . . . . . . . . 86
3.9.5 Iterators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.10 The lcp Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.10.1 The Tree Compressed LCP Representation . . . . . . . . . . . . . . 91
3.10.2 An Advanced Tree Compressed LCP Representation . . . . . . . . 93
3.10.3 Experimental Comparison of LCP Implementations . . . . . . . . . 95
4 Experimental Comparison of CSTs 100
4.1 Existing Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.2 Our Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.3 Our Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.3.1 The parent(v) Operation . . . . . . . . . . . . . . . . . . . . . . . . 105
4.3.2 The sibling(v) Op . . . . . . . . . . . . . . . . . . . . . . . 105
4.3.3 The ith child(v,1) Operation . . . . . . . . . . . . . . . . . . . . . 105
4.3.4 Depth-First-Search Traversals . . . . . . . . . . . . . . . . . . . . . 109
4.3.5 Calculating Matching Statistics . . . . . . . . . . . . . . . . . . . . 110
4.3.6 The child(v,c) Operation . . . . . . . . . . . . . . . . . . . . . . . 110
4.3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.4 The Anatomy of Selected CSTs . . . . . . . . . . . . . . . . . . . . . . . . 116
5 Construction 119
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.2 CST Construction Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.2.1 CSA Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.2.2 LCP . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2.3 NAV . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.2.4 Fast Semi-External Construction of CSTs . . . . . . . . . . . . . . 121
5.3 A LACA for CST construction . . . . . . . . . . . . . . . . . . . . . . . . 126
5.3.1 Linear Time LACAs Revisited . . . . . . . . . . . . . . . . . . . . 126
5.3.2 Ideas for a LACA in the CST Construction . . . . . . . . . . . . . 128
5.3.3 The First Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3.4 The Second Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.3.5 Experimental Comparison of LACAs . . . . . . . . . . . . . . . . . 134
6 Applications 138
6.1 Calculating the k-th Order Empirical Entropy . . . . . . . . . . . . . . . . 139
6.2 Succinct Range Minimum Query Data Structures . . . . . . . . . . . . . . 141
6.3 Maximal Exact Matches . . . . . . . . . . . . . . . . . . . . . 145
6.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145Contents 1
6.3.2 The New Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.3.3 Implementation and Experiments . . . . . . . . . . . . . . . . . . . 148
7 Conclusion 1511 Introduction
In sequence analysis it is often advantageous to build an index data structure for large
texts, as many tasks – for instance repeated pattern matching – can then be solved in
optimal time. The most popular and important index data structure for many years was
the suﬃx tree, which was proposed in the early 1970s by Weiner [Wei73] and was later
improved by McCreight [McC76]. Gusﬁeld showed in [Gus97] the power of the suﬃx tree
by presenting over 20 problems which can be solved in optimal time complexity with
ease by using the suﬃx tree.
Despite its good properties — the suﬃx tree can be constructed in linear time for
a text of length n over a constant size alphabet of size σ, and most operations can be
performed in constant time — it has a severe drawback. Its space consumption is at least
17 times the text size in practice [Kur99]. In the early 1990s Gonnet et al. [GBYS92] and
Manber and Myers [MM93] introduced another index data structure called suﬃx array,
which only occupies nlogn bits which equals four times the text size, when the text uses
32
the ASCII alphabet and the sizen of the text is smaller than 2 . The price for the space
reduction was often a logn factor in the time complexity of the solution. However with
additional tables like the longest common preﬁx array (LCP) and the child table it is
possible to replace the suﬃx tree by the so called enhanced suﬃx array [AKO04], which
requires in the worst case 12 times the text size (nlogn bits for each table).
In the last decade it was shown that the space of the suﬃx tree can be further improved
by using so called succinct data structures, which have an amazing property: their
representation is compressed but the deﬁned operations can still be performed in an
eﬃcient way. That is we do not have to ﬁrst decompress the representation to answer an
operation. For