Efficiency in Cluster Database Systems - Dynamic and Workload-Aware Scaling and Allocation [Electronic resource] / Tilmann Rabl. Advisor: Harald Kosch

universitat_passau

244 pages

English

Subjects

Computer science

Information

Published by	universitat_passau
Published on	January 1, 2011
Number of reads	30
Language	English
File size	7 MB

Excerpt

Lehrstuhl für Verteilte Informationssysteme
Fakultät für Informatik und Mathematik
Universität Passau
Doctoral Thesis
Efficiency in Cluster Database Systems
Dynamic and Workload-Aware Scaling and Allocation
Dipl. Inf. Tilmann Rabl
July 21, 2011
Advisor: Prof. Dr. Harald Kosch
Second Advisor: Prof. Lionel Brunie

For Michael Hendrik Rabl

Abstract
Database systems have been vital in all forms of data processing for a long time. In recent
years, the amount of processed data has been growing dramatically, even in small projects.
Nevertheless, database management systems tend to be static in terms of size and performance,
which makes scaling a difficult and expensive task. Because of performance and
especially cost advantages, more and more installed systems have a shared-nothing cluster
architecture. Due to the massive parallelism of the hardware, programming paradigms
from high-performance computing are translated into data processing. Database research
struggles to keep up with this trend. A key feature of traditional database systems is
to provide transparent access to the stored data. This introduces data dependencies and
increases system complexity and inter-process communication. Therefore, many developers
are exchanging this feature for better scalability. However, explicitly managing the
data distribution and data flow requires a deep understanding of the distributed system
and reduces the possibilities for automatic and autonomic optimization. In this thesis
we present an approach for database system scaling and allocation that features good
scalability although it keeps the data distribution transparent.
The first part of this thesis analyzes the challenges and opportunities for self-scaling
database management systems in cluster environments. Scalability is a major concern of
Internet-based applications. Access peaks that overload the application are a financial risk.
Therefore, systems are usually configured to be able to process peaks at any given moment.
As a result, server systems often have a very low utilization. In distributed systems, the
efficiency can be increased by adapting the number of nodes to the current workload. We
propose a processing model and an architecture that allow efficient self-scaling of cluster
database systems. In the second part we consider different allocation approaches. To
increase the efficiency, we present a workload-aware, query-centric model. The approach
is formalized; optimal and heuristic algorithms are presented. The algorithms optimize
the data distribution for local query execution and balance the workload according to
the query history. We present different query classification schemes for different forms
of partitioning. The approach is evaluated for OLTP- and OLAP-style workloads. It is
shown that variants of the approach scale well for both fields of application. The third
part of the thesis considers benchmarks for large, adaptive systems. First, we present a
data generator for cloud-sized applications. Due to its architecture, the data generator can
easily be extended and configured. A key feature is the high degree of parallelism, which
makes linear speedup for arbitrary numbers of nodes possible. To simulate systems with
user interaction, we have analyzed a production online e-learning management system.
Based on our findings, we present a model for workload generation that considers the
temporal dependency of user interaction.
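
To make the idea of workload-aware, query-centric allocation more concrete, the following is a minimal sketch and not the algorithm developed in the thesis. It assumes a simple greedy heuristic: query classes are weighted by how often they appear in a query history, the heaviest class is placed first on the node with the lowest resulting load, and every data fragment a class reads is stored on that node so that its queries can run locally. The function name `allocate`, the query class names, and the fragment names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def allocate(history, fragments_of, num_nodes):
    """Greedy sketch: place query classes on nodes, heaviest class first.

    history      -- list of query class names, one entry per observed query
    fragments_of -- dict: query class name -> set of data fragments it reads
    num_nodes    -- number of nodes currently in the cluster
    """
    loads = [0] * num_nodes                    # queries assigned per node
    placed = [set() for _ in range(num_nodes)]  # fragments stored per node
    assignment = {}

    # Weight each class by its frequency in the history, heaviest first.
    for name, count in Counter(history).most_common():
        needed = fragments_of[name]

        def cost(node):
            # Prefer the lowest resulting load; break ties by the number
            # of fragments that would have to be copied to the node.
            return (loads[node] + count, len(needed - placed[node]))

        node = min(range(num_nodes), key=cost)
        loads[node] += count
        placed[node] |= needed   # replicate fragments for local execution
        assignment[name] = node

    return assignment, placed, loads


# Hypothetical query history and fragment usage (illustrative only).
history = ["orders_by_customer"] * 5 + ["stock_level"] * 3 + ["new_order"] * 2
fragments_of = {
    "orders_by_customer": {"orders", "customer"},
    "stock_level": {"stock"},
    "new_order": {"orders", "stock"},
}

assignment, placed, loads = allocate(history, fragments_of, num_nodes=2)
print(assignment)
print(loads)
```

In this sketch the `orders` fragment ends up on both nodes, which illustrates the trade-off a query-centric allocation has to weigh: replication buys local execution and balanced load at the price of additional storage. Rerunning `allocate` with a different `num_nodes` gives a rough feel for how an allocation would have to change when the cluster is scaled.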

Summary
Database systems have long been the foundation for all kinds of information processing.
In recent years, the volume of data has risen dramatically, even in small projects.
Nevertheless, many database systems are static with respect to their capacity and
processing speed, which makes scaling laborious and expensive. Because of their good
performance, and above all for cost reasons, more and more systems have a shared-nothing
architecture, that is, they consist of independent, loosely coupled compute nodes. Since
this design principle exhibits a very high degree of parallelism, programming paradigms
from classical high-performance computing are increasingly used for information
processing. This trend poses great challenges for database research. One of the
fundamental properties of traditional database systems is transparent access to the
stored data, which allows the user to access the data independently of its internal
organization. The resulting independence leads to dependencies in the data and increases
the complexity of the systems and of the communication between individual processes.
Many developers therefore sacrifice transparency for better scalability. This decision
means that the data organization and the data flow must be handled explicitly, which
limits the possibilities for automatic and autonomous optimization of the system. The
approach to scaling and allocation presented in this thesis preserves transparent access
and is distinguished by its complete automatability and very good scalability.
The first part of this dissertation deals with the challenges and opportunities for
self-scaling database management systems that are operated on computer clusters. Good
scalability is a necessary property for applications that are accessible over the
Internet. Access peaks that overload the application represent a financial risk. Systems
are therefore configured so that they can process potential load peaks at any time. This
usually leads to a very low average utilization of the underlying systems. One way to
counteract this inefficiency is to adapt the number of compute nodes in use to the
current load. This dissertation presents a model and an architecture for query processing
with which database systems on cluster computers can be scaled easily and efficiently.
The second part of the thesis deals with different options for data distribution. To
increase efficiency, a model is used that takes the load in the query stream into
account. The approach is formalized, and optimal and heuristic solutions are presented.
The presented algorithms optimize the data distribution for local execution of all
queries and balance the load across the compute nodes. Different kinds of query
classification are presented, which lead to different kinds of partitioning. The approach
is evaluated for both online transaction processing and online data analysis. The
evaluation shows that the approach scales very well for both fields. The last part of the
thesis presents different techniques for measuring the performance of large, adaptive
systems. First, a data generation approach is shown that makes it possible to generate
very large amounts of data fully in parallel. To simulate the user interaction of online
systems, a production e-learning system was analyzed. Based on the analysis, a model for
generating workloads was created that takes the temporal dependencies of user interaction
into account.

Acknowledgements
This thesis would not have been possible without the encouragement and supervision of
Harald Kosch. I am grateful for his constant support and friendly advice. He allowed me
the room to work in my own way and kept me motivated throughout the thesis. Furthermore,
he gave me the opportunity to work at his chair, which was a great experience.
In the five years at the chair I had the pleasure to work with many friendly people,
whom I count among my friends. I would like to thank Günther Hölbling, who shared a
room with me and with whom I had plenty of fruitful discussions. Mario Döller always
impressed me and encouraged me with his effectiveness and efficiency. Florian Stegmaier
brightened my day and always lent me his ear. Stella Stars, Britta Meixner, and David
Coquil enriched my day with stimulating conversation. Thanks to all the members of the
doctoral college, Christian, Getnet, Hatem, Lyes, Natacha, Tobias, Vanessa, and Zeina. I
would especially like to thank Ingrid Winter, who was like a mother and always kept me
free of nasty paperwork.
I had the chance to advise many students, who helped me with my projects and motivated
me with their effort. I am thankful to all of them; it was a pleasure to work
with each one. I would like to thank Christoph Koch and Marc Pfeffer, who were my
first students and set the bar high for those who followed. Marc built a first prototype for my
thesis project. Bastian Hösch built a second prototype and helped me with the linear
program. Marco Sitzberger examined the periodic behavior of the Stud.IP logs. Andreas
Brandl helped me with the implementation of the final prototype. Christian Dellwo
implemented the Scalileo framework and Niklas Schmidtmer integrated it into my prototype.
Michael Frank implemented PDGF and Manuel Danisch is currently adapting it for the TPC-DI
benchmark. Andreas Lang helped me with the analysis of the Stud.IP logs for the
workload generation. I had many more students, who did a great job on their theses.
I would also like to thank Lionel Brunie for being my supervisor. During th