Massively Parallel Data Processing on Infrastructure as a Service Platforms [Electronic resource] / Daniel Warneke. Advisor: Odej Kao

159 pages

English

Description

Subjects

Computer science

Information

Published by	technische_universitat_berlin
Published on	01 January 2011
Number of reads	15
Language	English
File size	1 MB

Excerpt

Massively Parallel Data Processing on
Infrastructure as a Service Platforms
submitted by
Dipl.-Inf. Daniel Warneke
from Berlin
to Faculty IV - Electrical Engineering and Computer Science
of Technische Universität Berlin
in fulfillment of the requirements for the academic degree
Doctor of Natural Sciences
- Dr. rer. nat. -
approved dissertation
Doctoral committee:
Chair: Prof. Dr. Sahin Albayrak
Reviewers: Prof. Dr. Odej Kao
Prof. Dr. Volker Markl
Prof. Dr. Felix Naumann
Date of the scientific defense: 28 September 2011
Berlin 2011
D 83

Acknowledgement
I would like to take the chance to express my sincere gratitude to a number of people
who accompanied me in the course of my studies and helped shape this thesis.
First and foremost I would like to thank my advisor Odej Kao. He provided me with
a great working environment in his research group and supported my work through
many helpful comments and collaboration opportunities. I also would like to express my
appreciation to Volker Markl and Felix Naumann for agreeing to review this thesis.
Much of the great work atmosphere I experienced at TU Berlin is owed to the many
people I had the opportunity to work with. First of all, I would like to give credit to
the wonderful members of the CIT team (past and present), especially Dominic Battré,
Philipp Berndt, Björn Lohrmann, André Höing, Matthias Hovestadt, and Martin Raack.
They were always open to discussions (on research and non-research related matters)
and provided help on technical problems whenever possible. Moreover, I greatly
appreciated their company during the countless evenings we spent working late together in
the office.
I would also like to thank the many master's students (some of whom are now colleagues) I was
lucky enough to work with, namely Natalia Frejnik, Andreas Kliem, Alexander Stanik,
and Mareike Strüwing. They all helped to develop many valuable ideas of this thesis and
contributed to the prototype implementation. Siddhant Goel was a summer intern at
our research group in 2009 and helped build the foundation for the topology inference
research which is presented in this thesis.
Furthermore, I must not forget to express my gratitude to the DIMA research group of
TU Berlin, in particular Volker Markl, Fabian Hüske, and Stephan Ewen. Volker was
kind enough to take me with this group on several research trips and introduced me
to many interesting people of the database community. Fabian and Stephan provided
very helpful feedback on the design of the Nephele framework and were always open to
discussions on new ideas centering around large-scale data analysis systems.
Finally, I would like to thank my dear family for their love and support, especially my
father Joachim who proofread this thesis, although computer science is most certainly
not his favorite subject.

Abstract
In recent years, Infrastructure as a Service (IaaS) clouds have emerged as a promising
new platform for massively parallel data processing. By eliminating the need for large
upfront capital expenses, operators of IaaS clouds offer their customers the unprecedented
possibility to acquire access to a highly scalable pool of computing resources on a
short-term basis and enable them to execute data analysis applications at a scale which
has traditionally been reserved for large Internet companies and research facilities.
However, despite the growing popularity of these kinds of distributed applications, the
current parallel data processing frameworks, which support the creation and execution
of large-scale data analysis jobs, still stem from the era of dedicated, static compute
clusters and have disregarded the particular characteristics of IaaS platforms so far.
This thesis revisits the design of a parallel data processing framework against the
background of the new possibilities and challenges of IaaS clouds with the objective of
improving the processing efficiency on these platforms in terms of both time and cost. In
particular, the thesis analyzes how parallel data processing frameworks can take
advantage of the cloud’s ability for rapid resource provisioning and presents a new parallel
data processing framework called Nephele, which explicitly exploits these new cloud
features. Moreover, several approaches are presented to reduce the increased risk of I/O
bottlenecks during the job execution which results from the cloud’s use of hardware
virtualization.
In order to underline their effectiveness, all contributions of this thesis are evaluated
through various practical experiments and, whenever possible, contrasted with the state of
the art in the respective field.

Summary (Zusammenfassung)
In recent years, Infrastructure as a Service (IaaS) clouds have developed into a
promising new platform for massively parallel data processing. By removing the need for
high initial investments, operators of IaaS clouds offer their customers the unprecedented
possibility to obtain short-term access to a highly scalable pool of computing resources
and to run data analysis programs on it at a scale that was previously reserved for large
Internet companies and research institutions.
Despite the growing popularity of this form of distributed application, the current data
processing frameworks, which support the creation and execution of these large-scale
data analysis tasks (jobs), still stem from the era of dedicated, static compute clusters
and have so far disregarded the special characteristics of IaaS platforms.
This doctoral thesis revisits the design of a parallel data processing framework against
the background of the new possibilities and challenges of an IaaS cloud, with the goal of
improving the processing efficiency of jobs on this platform in terms of both time and
cost. To this end, the thesis analyzes how a framework for parallel data processing can
use a cloud's capabilities for rapid resource provisioning and then presents a new
processing framework named Nephele, which explicitly exploits these new possibilities of
the cloud. In addition, several approaches are presented for reducing the increased risk
of I/O bottlenecks during job execution, which arises in a cloud from the use of hardware
virtualization.
To demonstrate their effectiveness, all contributions of this doctoral thesis are
evaluated through numerous practical experiments and, where possible, compared with the
current state of the art.

Contents
1. Introduction
1.1. Problem Definition
1.2. Contribution
1.3. Outline of the Thesis
2. Characteristics of Infrastructure as a Service Clouds
2.1. Service Models of IaaS Clouds
2.1.1. Compute Service Models
2.1.2. Storage Service Models
2.1.3. Service Level Agreements
2.2. User Interface to IaaS Clouds
2.3. Performance Characteristics
2.3.1. CPU Performance Characteristics
2.3.2. I/O Performance Characteristics
2.4. Summary
3. Exploiting Dynamic Resource Allocation
3.1. Design Principles
3.2. The Nephele Parallel Data Processing Framework
3.2.1. Architecture
3.2.2. Job Description
3.2.3. Job Scheduling and Execution
3.3. Parallelization and Scheduling Strategies
3.3.1. Finding Suitable Degrees of Parallelism and VM Types
3.3.2. Automatic VM Allocation and Deallocation
3.4. Evaluation
3.4.1. Experiment 1: MapReduce and Hadoop
3.4.2. Experiment 2: MapReduce and Nephele
3.4.3. Experiment 3: DAG and Nephele
3.4.4. Results
3.5. Related Work
3.6. Summary
4. Detecting Bottlenecks in Parallel Data Flow Programs
4.1. Processing Model and Problem Definition
4.1.1. Processing Model
4.1.2. Problem Definition
4.2. Bottleneck Detection Algorithms
4.3. Implementation in Nephele
4.4. Evaluation
4.4.1. Use Case
4.4.2. Results
4.5. Related Work
4.6. Summary
5. Mitigating I/O Variations with Adaptive Compression
5.1. Design Principles
5.2. Adaptive Online Compression in IaaS Clouds
5.2.1. Decision Model
5.2.2. Implementation in Nephele
5.3. Evaluation
5.3.1. Adaptivity