Query estimation techniques in database systems [Elektronische Ressource] / von Arnd Christian König

universitat_des_saarlandes

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

100 pages

English

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

A propos
Informations
Extrait

Sujets

Informatik

Informations

Publié par	universitat_des_saarlandes
Publié le	01 janvier 2004
Nombre de lectures	15
Langue	English
Poids de l'ouvrage	1 Mo

Extrait

Query Estimation Techniques in Database Systems
Dissertation
zur Erlangung des Grades
Doktor der Ingenieurwissenschaften (Dr.-Ing.)
der Naturwissenschaftlich-Technischen Fakult at I
der Universit at des Saarlandes
von
Diplom-Informatiker
Arnd Christian K onig
SaarbrucÄ ken,
im Dezember 2001Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Synopsis Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Physical Design Problem for Data Synopses . . . . . . . . . . . . . . . . . 5
1.3 Contribution and Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Limitations of This Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related Work 8
2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Query Result Estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.1 Design Space for Data Synopses . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.2 Approximation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.3 A Classiﬁcation of All Approaches . . . . . . . . . . . . . . . . . . . . . . 20
2.4.4 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Physical Design of Data Synopses . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Approximation of a Single Attribute 24
3.1 Approximation of the Attribute-Value Frequencies . . . . . . . . . . . . . . . . . 25
3.1.1 Fitting the Frequency Function within a Bucket . . . . . . . . . . . . . . 26
3.1.2 Computation of the Error for a Single Bucket . . . . . . . . . . . . . . . . 27
3.1.3 Partitioning ofV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Approximation of the Value Density . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Synopsis Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Reconciling Frequency and Density Synopses for One Attribute . . . . . . 36
3.3.3 Multiple Synopses . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.4 Matching Frequency and Density Synopses . . . . . . . . . . . . . . . . . 37
3.3.5 Weighted Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Using Spline Synopses for Query Estimation . . . . . . . . . . . . . . . . . . . . . 38
3.4.1 Projections (and Grouping Queries) . . . . . . . . . . . . . . . . . . . . . 39
3.4.2 Range Selections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.3 The Inherent Diﬃculty of Join Estimation . . . . . . . . . . . . . . . . . . 41
3.4.4 Join Synopses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.5 Integrating Join Synopses . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.6 Experimental Evaluation of Join Synopses . . . . . . . . . . . . . . . . . . 47
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48Table of Contents
3.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.3 Running Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Approximation of Multidimensional Correlation 53
4.1 Multidimensional Partitioning and the “Curse of Dimensionality” . . . . . . . . . 53
4.2 Preserving Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Spline Synopses and Mapping Multidimensional Data . . . . . . . . . . . . . . . 55
4.4 The Sierpinski´ Space-Filling Curve Construction . . . . . . . . . . . . . . . . . . 55
4.4.1 Properties of the Sierpinski´ Curve . . . . . . . . . . . . . . . . . . . . . . 58
4.5 Using Multidimensional Spline Synopses for Query Estimation . . . . . . . . . . 59
4.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Physical Design for Data Synopses 62
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.1 Storage of Multiple Synopses . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.2 The Error Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4 Synopsis Selection and Memory Allocation . . . . . . . . . . . . . . . . . . . . . . 65
5.4.1 Pruning the Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4.2 Selecting the Synopses for a Single Relation R . . . . . . . . . . . . . . . 70
5.4.3 the for all Relations . . . . . . . . . . . . . . . . . . . 72
5.5 Enumeration of all synopses combinations . . . . . . . . . . . . . . . . . . . . . . 72
5.6 Reducing the Computational Overhead. . . . . . . . . . . . . . . . . . . . . . . . 73
5.7 Putting the Pieces Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.7.1 Running Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.8 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.8.1 Base Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.8.2 Skewed and Correlated Data . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.8.3 Query Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6 Conclusion 81
7 Summary of this Thesis 83
8 Zusammenfassung der Arbeit 85
4Abstract
The eﬀectiveness of query optimization in database systems critically depends on the system’s
abilitytoassess theexecutioncostsof diﬀerentqueryexecution plans. Forthis purpose, thesizes
and data distributions of the intermediate results generated during plan execution need to be
estimated as accurately as possible. This estimation requires the maintenanceof statistics on the
data stored in the database, which are referred to as data synopses.
Whiletheproblemofquerycostestimationhasreceivedsigniﬁcantattentionforoveradecade,
it has remained an open issue in practice, because most previous techniques have focused on
singular aspects of the problem such as minimizing the estimation error of a single type of query
and a single data distribution, whereas database management systems generally need to support
a wide range of queries over a number of datasets.
In this thesis I introduce a new technique for query result estimation, which extends existing
techniques in that it oﬀers estimation for all combinations of the three major database operators
selection, projection, and join. The approach is based on separate and independent approxi-
mations of the attribute values contained in a dataset and their frequencies. Through the use
of space-ﬁlling curves, the approach extends to multi-dimensional data, while maintaining its
accuracy and computational properties. The resulting estimation accuracy is competitive with
specialized techniques and superior to the histogram techniques currently implemented in com-
mercial database management systems.
Because data synopses reside in main memory, they compete for available space with the
databasecacheandqueryexecutionbuﬀers. Consequently,thememoryavtodatasynopses
needs to be used eﬃciently. This results in a physical design problem for data synopses, which is
to determine the best set of synopses for a given combination of datasets, queries, and available
memory. Thisthesisintroducesaformalizationoftheproblem,andeﬃcientalgorithmicsolutions.
All discussed techniques are evaluated with regard to their overhead and resulting estimation
accuracy on a variety of synthetic and real-life datasets.Kurzfassung
Die Eﬀektivit at der Anfrage-Optimierung in Datenbanksystemen h angt entscheidend von der
F ahigkeitdesSystemsab,dieKostenderverschiedenenM oglichkeiten,eineAnfrageauszufuhren,Ä
abzusch atzen. Zu diesem Zweck ist es n otig, die Gr oßen und Datenverteilungen der Zwischen-
resultate, die w ahrend der AusfuhrungÄ einer Anfrage generiert werden, so genau wie m oglich zu
sch atzen. Zur L osung dieses Sch atzproblems ben otigt man Statistiken ubÄ er die Daten, welche
in dem Datenbanksystem gespeichert werden; diese Statistiken werden auch als Daten Synopsen
bezeichnet.
Obwohl das Problem der Sch atzung von Anfragekosten innerhalb der letzten 10 Jahre in-
tensiv untersucht wurde, gilt es weiterhin als oﬀen, da viele der vorgeschlagenen Ans atze nur
einen Teilaspekt des Problems betrachten. In den meisten F allen wurden Techniken furÄ das Ab-
sch atzen eines einzelnen Operators auf einer einzelnen Datenverteilung untersucht, wohingegen
DatenbanksystemeinderPraxiseineVielfaltvonAnfragenubÄ erdiverseDatens atzeunterstutzenÄ
mussen.Ä
AusdiesemGrundst