Clouds and MapReduce for Scientific Applications

mtoledan - Geoffrey Fox

Introduction

Cloud computing [1] is at the peak of the Gartner technology hype curve [2], but there are good reasons to believe that as it matures it will not disappear into the trough of disillusionment but rather move onto the plateau of productivity, as service-oriented architectures have, for example. Clouds are driven by large commercial markets, where IDC estimates that clouds will represent 14% of IT expenditure in 2012, and there is rapidly growing interest from government and industry. There are several reasons why clouds should be important for large scale scientific computing:

1) Clouds are the largest scale computer centers constructed, and so they have the capacity to be important to large scale science problems as well as those at small scale.
2) Clouds exploit the economies of this scale and so can be expected to be a cost effective approach to computing. Their architecture explicitly addresses the important fault tolerance issue.
3) Clouds are commercially supported, and so one can expect reasonably robust software without the sustainability difficulties seen in the academic software systems critical to much current Cyberinfrastructure.
4) There are 3 major vendors of clouds (Amazon, Google, Microsoft) and many other infrastructure and software cloud technology vendors, including Eucalyptus Systems, which spun off from UC Santa Barbara HPC research. This competition should ensure that clouds develop in a healthy, innovative fashion. Further attention is already being given to cloud standards [3].
5) There are many Cloud research efforts, conferences and other activities, with research cloud infrastructure efforts including Nimbus [4], OpenNebula [5], Sector/Sphere [6] and Eucalyptus [7].
6) There is a growing number of academic and science cloud systems supporting users through NSF Programs for Google/IBM and Microsoft Azure systems. In NSF OCI, FutureGrid [8] will offer a Cloud testbed, and Magellan [9] is a major DoE experimental cloud system. The EU Framework 7 project VENUS-C [10] is just starting.
7) Clouds offer "on-demand" and interactive computing that is more attractive than batch systems to many users.

Listening to some of the talks at the recent Cloud Futures workshop [11], one might imagine that all scientific computing could be performed on clouds. This is not true; rather, the situation is somewhere in the middle, with some important classes of scientific computing being suitable for clouds but others not. The problems with using clouds are well documented and include:

8) The centralized computing model for clouds runs counter to the concept of "bringing the computing to the data", and bringing the "data to a commercial cloud facility" may be slow and expensive.
9) There are many security, legal and privacy issues [12] that often mimic those of the Internet and are especially problematic in areas such as health informatics and where proprietary information could be exposed.
10) The virtualized networking currently used in the virtual machines in today's commercial clouds, and jitter from complex operating system functions, increase synchronization/communication costs. This is especially serious in large scale parallel computing and leads to significant overheads in many MPI applications [13, 14]. Indeed, the usual (and attractive) fault tolerance model for clouds runs counter to the tight synchronization needed in most MPI applications.

Some of these issues can be addressed with customized (private) clouds and enhanced bandwidth from TeraGrid to commercial cloud networks. For example, there could be growing interest in "HPC as a Service" as exemplified by Penguin Computing on Demand. However, it seems likely that clouds will not supplant traditional approaches for very large scale parallel (MPI) jobs in the near future. It is natural to consider a hybrid model with jobs running on either classic HPC systems or clouds, or in fact both, as a given workflow (as in the example below) could well have individual jobs suitable for different parts of this hybrid system. Commercial clouds support "massively parallel" applications, but only those that are loosely coupled and so insensitive to higher synchronization costs.

Let us focus on "massively parallel" or "many task" cloud applications, as these most interestingly "compete" with possible TeraGrid implementations. In this case, the programming model MapReduce [15] describes problems suitable for clouds. This is offered on Amazon clouds and is expected soon on other commercial clouds, while it can be implemented on any cluster using the open source Hadoop [16] software for Linux or the Microsoft Dryad system [17] for Windows clusters. One can compare MPI, MapReduce (with or without virtual machines) and different native cloud implementations and find comparable (within a range of 30%) performance on applications suitable for these paradigms [18]. MapReduce and its extensions offer the most user friendly environment.

One can describe the difference between MPI and MapReduce as follows. In MapReduce, multiple map processes are formed, typically by a domain (data) decomposition familiar from MPI. These run asynchronously, typically writing results to a file system that is consumed by a set of reduce tasks that merge the parallel results in some fashion. This programming model implies straightforward and efficient fault tolerance by re-running failed map or reduce tasks. MPI addresses a more complicated problem architecture with iterative compute-communicate stages and synchronization at the communication phase. This synchronization means, for example, that all processes wait if one is delayed or has failed. This inefficiency is not present in MapReduce, where resources are released when individual map or reduce tasks complete. MPI of course supports general (built-in and user-defined) reductions, so MPI could be used for applications of the MapReduce style. However, the latter offers greater fault tolerance and a user friendly higher level environment, largely stemming from the coarse grain functional programming model implemented as side-effect free tasks. Oversimplifying, MPI supports multiple Map-Reduce stages but MapReduce just one. Correspondingly, clouds support applications that have the loose coupling supported by MapReduce, while classic HPC supports more tightly coupled applications. Research into extensions of MapReduce attempts to bridge these differences [19].

MapReduce covers many high throughput computing applications including "parameter searches". Many data analysis applications, including information retrieval, fit the MapReduce paradigm. In LHC or similar accelerator data, maps consist of Monte Carlo generation or analysis of events, while reduction is construction of histograms by merging those from different maps. In the SAR data analysis of ice sheet observations, maps consist of independent Matlab invocations on different data samples. Life Sciences have many natural candidates for MapReduce, including sequence assembly and the use of BLAST and similar programs. On the other hand, partial differential equation solvers, particle dynamics and linear algebra require the full MPI model for high performance parallel implementation.

Grand Challenge Implications of MapReduce and Clouds

MapReduce and Clouds can be used for some of the applications that are most rapidly growing in importance. Their support seems essential if one is to support large scale data intensive applications. More generally, a more careful analysis of clouds versus traditional environments is needed to quantify the simplistic analysis given above. There is a clear algorithm challenge to design more loosely coupled algorithms that are compatible with the map followed by reduce model of MapReduce or, more generally, with the structure of clouds. This could lead to generalizations of MapReduce which are still compatible with the cloud virtualization and fault tolerance features.

There are many software challenges including MapReduce itself; its extensions (both in functionality and higher level abstractions); and improved workflow systems supporting MapReduce and the linking of clients, clouds and MPI engines. We have noted research challenges in security, and there is also active work in the preparation, management and deployment of program images (appliances) to be loaded into virtual machines. The intrinsic conflict between virtualization and the issues around locality or affinity (between nodes in MPI or between computation and data) needs more research.

On the infrastructure side, we have already discussed the importance of high quality networking between MPI and cloud systems. Another critical area is file systems, where clouds and MapReduce use new approaches that are not clearly compatible with traditional TeraGrid approaches. Support of novel databases such as BigTable across clouds and MPI clusters is probably important. Obviously NSF and the computational science community need to decide on the balance between use of commercial clouds and "private" TeraGrid clouds mimicking Magellan and providing the large scale production facilities for codes prototyped on FutureGrid.
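To make the map followed by reduce model concrete, the accelerator example mentioned earlier (maps generate or analyze events, the reduction merges their histograms) can be sketched in a few lines of Python. This is only an illustrative sketch, not the Hadoop or Dryad API: the names `map_task`, `merge`, `run_with_retry` and `map_reduce`, the bin count, and the use of random numbers as toy "events" are all assumptions made for the illustration, and a local thread pool stands in for cluster workers.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

N_BINS = 10  # hypothetical histogram resolution


def map_task(seed, n_events=1000):
    """Map: generate toy 'events' and histogram them locally (no side effects)."""
    rng = random.Random(seed)  # seeded, so a re-run reproduces the same result
    hist = Counter()
    for _ in range(n_events):
        value = rng.random()  # stand-in for a simulated or measured quantity
        hist[min(int(value * N_BINS), N_BINS - 1)] += 1
    return hist


def run_with_retry(task, arg, attempts=3):
    """Re-run a failed task; this is safe because tasks are side-effect free."""
    for attempt in range(attempts):
        try:
            return task(arg)
        except Exception:
            if attempt == attempts - 1:
                raise


def merge(left, right):
    """Reduce: merge two partial histograms by adding their bin counts."""
    return left + right


def map_reduce(seeds):
    # Threads stand in for cluster workers; the maps never communicate.
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(lambda s: run_with_retry(map_task, s), seeds))
    return reduce(merge, partials, Counter())


histogram = map_reduce(range(8))
print(sorted(histogram.items()))
```

The maps never exchange data with one another, so a slow or failed map delays only its own partial histogram. That is the contrast with the synchronized compute-communicate stages of MPI drawn above.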

Metagenomics - A Grand Challenge Vignette

The study of microbial genomes is complicated by the fact that only a small number of species can be isolated successfully, and the current way forward is metagenomic studies of culture-independent, collective sets of genomes in their natural environments. This requires identification of as many as millions of genes and thousands of species from individual samples. New sequencing technology can provide the required data samples with a throughput of 1 trillion base pairs per day, and this rate will increase. A typical observation and data pipeline is shown in Figure 1, with sequencers producing DNA samples that are assembled and subject to further analysis, including BLAST-like comparison with existing datasets as well as clustering and visualization to identify new gene families. Figure 2 shows initial results from analysis of 30,000 sequences, with clusters identified and visualized using dimension reduction to map to three dimensions with multi-dimensional scaling (MDS). The initial parts of the pipeline fit the MapReduce or many-task Cloud model, but the latter stages involve parallel linear algebra.

Figure 1: Pipeline for analysis of metagenomics data. Labels in the diagram: Instruments; Read Alignment; Internet; FASTA File (N Sequences); Blocking; Form block Pairings; Sequence alignment; Dissimilarity Matrix (N(N-1)/2 values); Pairwise clustering; MDS; Visualization (PlotViz).

Figure 2: Results of 17 clusters for full sample using Sammon's version of MDS for visualization [20].

Figure 3: Time to process a single biology sequence file (458 reads) per core with different frameworks [18].

State of the art MDS and clustering algorithms scale like O(N^2) for N sequences; the total runtime for MDS and clustering is about 2 hours each on a 768 core commodity cluster, obtaining a speedup of about 500 using a hybrid MPI-threading implementation on 24 core nodes. The initial steps can be run on clouds and include the calculation of a distance matrix of N(N-1)/2 independent elements. Million sequence problems of this type will challenge the largest clouds and the largest TeraGrid resources. Figure 3 looks at a related sequence assembly problem and compares performance of MapReduce (Hadoop, DryadLINQ) with and without virtual machines and the basic Amazon and Microsoft clouds. The execution times are similar (range is 30%), showing that this class of algorithm can be effectively run on many different infrastructures, and it makes sense to consider the intrinsic advantages of clouds described above. In recent work we have looked at hierarchical methods to reduce the O(N^2) execution time to O(N log N) or O(N) and allow loosely-coupled cloud implementation, with initial results on interpolation methods presented in [21].
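The N(N-1)/2 distance matrix elements are independent of one another, which is why this step suits the cloud model. A minimal Python sketch of the blocking idea follows. It is an assumption-laden illustration rather than the pipeline's actual code: the `dissimilarity` function is a toy mismatch fraction standing in for a real sequence alignment score, and the function names and block size are hypothetical.

```python
from itertools import combinations_with_replacement


def dissimilarity(a, b):
    """Toy stand-in for an alignment score: fraction of mismatched positions."""
    length = max(len(a), len(b))
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / length if length else 0.0


def make_blocks(n, block_size):
    """Split the indices 0..n-1 into contiguous blocks."""
    return [range(start, min(start + block_size, n))
            for start in range(0, n, block_size)]


def block_task(seqs, rows, cols):
    """One independent task: every pair (i, j) with i < j in a block pairing."""
    return {(i, j): dissimilarity(seqs[i], seqs[j])
            for i in rows for j in cols if i < j}


def distance_matrix(seqs, block_size=2):
    blocks = make_blocks(len(seqs), block_size)
    result = {}
    # Upper-triangle block pairings; each could be dispatched as its own map task.
    for rows, cols in combinations_with_replacement(blocks, 2):
        result.update(block_task(seqs, rows, cols))
    return result


sequences = ["ACGT", "ACGA", "TCGA", "ACG", "GGGT"]
distances = distance_matrix(sequences)
print(len(distances))  # N(N-1)/2 = 10 for N = 5
```

Each call to `block_task` touches only its own pair of blocks, so the loop in `distance_matrix` could be replaced by any many-task scheduler without changing the result.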

References

[1] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia, Above the Clouds: A Berkeley View of Cloud Computing, http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf
[2] Press Release, Gartner's 2009 Hype Cycle Special Report Evaluates Maturity of 1,650 Technologies, http://www.gartner.com/it/page.jsp?id=1124212
[3] Cloud Computing Forum & Workshop, NIST Information Technology Laboratory, Washington DC, May 20 2010, http://www.nist.gov/itl/cloud.cfm
[4] Nimbus Cloud Computing for Science, http://www.nimbusproject.org/
[5] OpenNebula Open Source Toolkit for Cloud Computing, http://www.opennebula.org/
[6] Sector and Sphere Data Intensive Cloud Computing Platform, http://sector.sourceforge.net/doc.html
[7] Eucalyptus Open Source Cloud Software, http://open.eucalyptus.com/
[8] FutureGrid Grid Testbed, http://www.futuregrid.org
[9] Magellan Cloud for Science, http://magellan.alcf.anl.gov/, http://www.nersc.gov/nusers/systems/magellan/
[10] European Framework 7 project starting June 1 2010, VENUS-C Virtual multidisciplinary EnviroNments USing Cloud infrastructure.
[11] Recordings of Presentations, Cloud Futures 2010, Redmond WA, April 8-9 2010, http://research.microsoft.com/en-us/events/cloudfutures2010/videos.aspx
[12] Lockheed Martin Cyber Security Alliance, April 2010, Cloud Computing Whitepaper, http://www.lockheedmartin.com/data/assets/isgs/documents/CloudComputingWhitePaper.pdf
[13] Edward Walker, Benchmarking Amazon EC2 for High Performance Scientific Computing, USENIX ;login, vol. 33(5), Oct 2008, http://www.usenix.org/publications/login/2008-10/openpdfs/walker.pdf
[14] Jaliya Ekanayake, Xiaohong Qiu, Thilina Gunarathne, Scott Beason, Geoffrey Fox, High Performance Parallel Computing with Clouds and Cloud Technologies, to appear as a book chapter to Cloud Computing and Software Services: Theory and Techniques, CRC Press (Taylor and Francis), ISBN-10: 1439803153. http://grids.ucs.indiana.edu/ptliupages/publications/cloud_handbook_final-with-diagrams.pdf
[15] Dean, J. and S. Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113.
[16] Open source MapReduce Apache Hadoop, http://hadoop.apache.org/core/
[17] Jaliya Ekanayake, Thilina Gunarathne, Judy Qiu, Geoffrey Fox, Scott Beason, Jong Youl Choi, Yang Ruan, Seung-Hee Bae, Hui Li, Applicability of DryadLINQ to Scientific Applications, Technical Report, January 30 2010, http://grids.ucs.indiana.edu/ptliupages/publications/DryadReport.pdf
[18] Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications, Proceedings of Emerging Computational Methods for the Life Sciences Workshop of ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, 2010.
[19] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox, Twister: A Runtime for Iterative MapReduce, Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, 2010.
[20] Geoffrey Fox, Xiaohong Qiu, Scott Beason, Jong Youl Choi, Mina Rho, Haixu Tang, Neil Devadasan, Gilbert Liu, Biomedical Case Studies in Data Intensive Computing, Keynote talk at The 1st International Conference on