Massively Parallel Data Processing on Infrastructure as a Service Platforms [Electronic resource] / Daniel Warneke. Advisor: Odej Kao
159 pages
English

Published on 1 January 2011

Excerpt

Massively Parallel Data Processing on
Infrastructure as a Service Platforms
submitted by
Dipl.-Inf. Daniel Warneke
from Berlin
to Faculty IV - Electrical Engineering and Computer Science
of Technische Universität Berlin
to obtain the academic degree
Doktor der Naturwissenschaften
- Dr. rer. nat. -
approved dissertation
Doctoral committee:
Chair: Prof. Dr. Sahin Albayrak
Reviewers: Prof. Dr. Odej Kao
Prof. Dr. Volker Markl
Prof. Dr. Felix Naumann
Date of the scientific defense: 28 September 2011
Berlin 2011
D83
Acknowledgement
I would like to take this opportunity to express my sincere gratitude to a number of people
who accompanied me in the course of my studies and helped shape this thesis.
First and foremost I would like to thank my advisor Odej Kao. He provided me with
a great working environment in his research group and supported my work through
many helpful comments and collaboration opportunities. I also would like to express my
appreciation to Volker Markl and Felix Naumann for agreeing to review this thesis.
Much of the great work atmosphere I experienced at TU Berlin is owed to the many
people I had the opportunity to work with. First of all, I would like to give credit to
the wonderful members of the CIT team (past and present), especially Dominic Battré,
Philipp Berndt, Björn Lohrmann, André Höing, Matthias Hovestadt, and Martin Raack.
They were always open to discussions (on research and non-research related matters)
and provided help on technical problems whenever possible. Moreover, I greatly
appreciated their company during the countless evenings we spent working late together
in the office.
I would also like to thank the many master's students (some of whom are now colleagues)
I was lucky enough to work with, namely Natalia Frejnik, Andreas Kliem, Alexander Stanik,
and Mareike Strüwing. They all helped to develop many valuable ideas of this thesis and
contributed to the prototype implementation. Siddhant Goel was a summer intern at
our research group in 2009 and helped build the foundation for the topology inference
research which is presented in this thesis.
Furthermore, I must not forget to express my gratitude to the DIMA research group of
TU Berlin, in particular Volker Markl, Fabian Hüske, and Stephan Ewen. Volker was
kind enough to take me along with this group on several research trips and introduced me
to many interesting people of the database community. Fabian and Stephan provided
very helpful feedback on the design of the Nephele framework and were always open to
discussions on new ideas centering around large-scale data analysis systems.
Finally, I would like to thank my dear family for their love and support, especially my
father Joachim who proofread this thesis, although computer science is most certainly
not his favorite subject.
Abstract
In recent years, Infrastructure as a Service (IaaS) clouds have emerged as a promising
new platform for massively parallel data processing. By eliminating the need for large
upfront capital expenses, operators of IaaS clouds offer their customers the unprecedented
possibility to acquire access to a highly scalable pool of computing resources on a
short-term basis and enable them to execute data analysis applications at a scale which
has traditionally been reserved for large Internet companies and research facilities.
However, despite the growing popularity of these kinds of distributed applications, the
current parallel data processing frameworks, which support the creation and execution
of large-scale data analysis jobs, still stem from the era of dedicated, static compute
clusters and have disregarded the particular characteristics of IaaS platforms so far.
This thesis revisits the design of a parallel data processing framework against the
background of the new possibilities and challenges of IaaS clouds with the objective of
improving the processing efficiency on these platforms in terms of both time and cost. In
particular, the thesis analyzes how parallel data processing frameworks can take advantage
of the cloud's ability for rapid resource provisioning and presents a new parallel
data processing framework called Nephele, which explicitly exploits these new cloud
features. Moreover, several approaches are presented to reduce the increased risk of I/O
bottlenecks during the job execution which results from the cloud's use of hardware
virtualization.
In order to underline their effectiveness, all contributions of this thesis are evaluated
through various practical experiments and, whenever possible, contrasted with the state of
the art in the respective field.
Zusammenfassung (Summary)
In recent years, Infrastructure as a Service (IaaS) clouds have developed into a
promising new platform for massively parallel data processing. By removing the need for
high initial investments, operators of IaaS clouds offer their customers the unprecedented
possibility of obtaining short-term access to a highly scalable pool of computing
resources and of running data analysis programs on it at a scale that was previously
reserved for large Internet companies and research institutions.
Despite the rising popularity of this form of distributed application, the current data
processing frameworks, which support the creation and execution of these large-scale
data analysis jobs, still stem from the era of dedicated, static compute clusters and
have so far disregarded the special characteristics of IaaS platforms.
This doctoral thesis revisits the design of a parallel data processing framework against
the background of the new possibilities and challenges of an IaaS cloud, with the goal of
improving the processing efficiency of jobs on this platform with regard to both time and
cost. To this end, the thesis analyzes how a framework for parallel data processing can
use a cloud's capabilities for rapid resource provisioning and then presents a new
processing framework named Nephele, which explicitly exploits these new possibilities of
the cloud. In addition, several approaches are presented for reducing the increased risk
of I/O bottlenecks during job execution, which arises in a cloud from the use of hardware
virtualization.
To demonstrate their effectiveness, all contributions of this doctoral thesis are
evaluated through numerous practical experiments and, where possible, compared with the
current state of the art.
Contents
1. Introduction
1.1. Problem Definition
1.2. Contribution
1.3. Outline of the Thesis
2. Characteristics of Infrastructure as a Service Clouds
2.1. Service Models of IaaS Clouds
2.1.1. Compute Service Models
2.1.2. Storage Service Models
2.1.3. Service Level Agreements
2.2. User Interface to IaaS Clouds
2.3. Performance Characteristics
2.3.1. CPU Performance Characteristics
2.3.2. I/O Performance Characteristics
2.4. Summary
3. Exploiting Dynamic Resource Allocation
3.1. Design Principles
3.2. The Nephele Parallel Data Processing Framework
3.2.1. Architecture
3.2.2. Job Description
3.2.3. Job Scheduling and Execution
3.3. Parallelization and Scheduling Strategies
3.3.1. Finding Suitable Degrees of Parallelism and VM Types
3.3.2. Automatic VM Allocation and Deallocation
3.4. Evaluation
3.4.1. Experiment 1: MapReduce and Hadoop
3.4.2. Experiment 2: MapReduce and Nephele
3.4.3. Experiment 3: DAG and Nephele
3.4.4. Results
3.5. Related Work
3.6. Summary
4. Detecting Bottlenecks in Parallel Data Flow Programs
4.1. Processing Model and Problem Definition
4.1.1. Processing Model
4.1.2. Problem Definition
4.2. Bottleneck Detection Algorithms
4.3. Implementation in Nephele
4.4. Evaluation
4.4.1. Use Case
4.4.2. Results
4.5. Related Work
4.6. Summary
5. Mitigating I/O Variations with Adaptive Compression
5.1. Design Principles
5.2. Adaptive Online Compression in IaaS Clouds
5.2.1. Decision Model
5.2.2. Implementation in Nephele
5.3. Evaluation
5.3.1. Adaptivity
