Clouds and MapReduce for Scientific Applications

Introduction

Cloud computing [1] is at the peak of the Gartner technology hype curve [2], but there are good reasons to believe that as it matures it will not disappear into the trough of disillusionment but rather move onto the plateau of productivity, as have, for example, service oriented architectures. Clouds are driven by large commercial markets, where IDC estimates that clouds will represent 14% of IT expenditure in 2012, and there is rapidly growing interest from government and industry. There are several reasons why clouds should be important for large scale scientific computing:

1) Clouds are the largest scale computer centers constructed, and so they have the capacity to be important to large scale science problems as well as those at small scale.
2) Clouds exploit the economies of this scale and so can be expected to be a cost effective approach to computing. Their architecture explicitly addresses the important fault tolerance issue.
3) Clouds are commercially supported, and so one can expect reasonably robust software without the sustainability difficulties seen from the academic software systems critical to much current cyberinfrastructure.
4) There are 3 major vendors of clouds (Amazon, Google, Microsoft) and many other infrastructure and software cloud technology vendors, including Eucalyptus Systems, which spun off from UC Santa Barbara HPC research. This competition should ensure that clouds develop in a healthy, innovative fashion. Further, attention is already being given to cloud standards [3].
5) There are many cloud research efforts, conferences and other activities, with research cloud infrastructure efforts including Nimbus [4], OpenNebula [5], Sector/Sphere [6] and Eucalyptus [7].
6) There are a growing number of academic and science cloud systems supporting users through NSF programs for Google/IBM and Microsoft Azure systems. In NSF OCI, FutureGrid [8] will offer a cloud testbed, and Magellan [9] is a major DoE experimental cloud system. The EU Framework 7 project VENUS-C [10] is just starting.
7) Clouds offer "on demand" and interactive computing that is more attractive than batch systems to many users.

Listening to some of the talks at the recent Cloud Futures workshop [11], one might imagine that all scientific computing could be performed on clouds. This is not true; rather, the situation is somewhere in the middle, with some important classes of scientific computing being suitable for clouds but others not. The problems with using clouds are well documented and include:

8) The centralized computing model for clouds runs counter to the concept of "bringing the computing to the data", and bringing the "data to a commercial cloud facility" may be slow and expensive.
9) There are many security, legal and privacy issues [12] that often mimic those of the Internet, which are especially problematic in areas such as health informatics and where proprietary information could be exposed.
10) The virtualized networking currently used in the virtual machines in today's commercial clouds and jitter from complex operating system functions increase synchronization/communication costs. This is especially serious in large scale parallel computing and leads to significant overheads in many MPI applications [13, 14]. Indeed, the usual (and attractive) fault tolerance model for clouds runs counter to the tight synchronization needed in most MPI applications.

Some of these issues can be addressed with customized (private) clouds and enhanced bandwidth from TeraGrid to commercial cloud networks. For example, there could be growing interest in "HPC as a Service" as exemplified by Penguin Computing on Demand. However, it seems likely that clouds will not supplant traditional approaches for very large scale parallel (MPI) jobs in the near future. It is natural to consider a hybrid model with jobs running on either classic HPC systems or clouds, or in fact both, as a given workflow (as in the example below) could well have individual jobs suitable for different parts of this hybrid system. Commercial clouds support "massively parallel" applications, but only those that are loosely coupled and so insensitive to higher synchronization costs. Let us focus
on "massively parallel" or "many task" cloud applications, as these most interestingly "compete" with possible TeraGrid implementations. In this case, the programming model MapReduce [15] describes problems suitable for clouds. This is offered on Amazon clouds and is expected soon on other commercial clouds, while it can be implemented on any cluster using the open source Hadoop [16] software for Linux or the Microsoft Dryad system [17] for Windows clusters. One can compare MPI, MapReduce (with or without virtual machines) and different native cloud implementations and find comparable (within a range of 30%) performance on applications suitable for these paradigms [18]. MapReduce and its extensions offer the most user friendly environment.

One can describe the difference between MPI and MapReduce as follows. In MapReduce, multiple map processes are formed, typically by a domain (data) decomposition familiar from MPI; these run asynchronously, typically writing results to a file system that is consumed by a set of reduce tasks that merge the parallel results in some fashion. This programming model implies straightforward and efficient fault tolerance by rerunning failed map or reduce tasks. MPI addresses a more complicated problem architecture with iterative compute-communicate stages with synchronization at the communication phase. This synchronization means, for example, that all processes wait if one is delayed or fails. This inefficiency is not present in MapReduce, where resources are released when individual map or reduce tasks complete. MPI of course supports general (built-in and user defined) reductions, so MPI could be used for applications of the MapReduce style. However, the latter offers greater fault tolerance and a user friendly higher level environment, largely stemming from the coarse grain functional programming model implemented as side-effect free tasks. Oversimplifying, MPI supports multiple MapReduce stages but MapReduce just one. Correspondingly, clouds support applications that have the loose coupling supported by MapReduce, while classic HPC supports more tightly coupled applications. Research into extensions of MapReduce attempts to bridge these differences [19].

MapReduce covers many high throughput computing applications including "parameter searches". Many data analysis applications, including information retrieval, fit the MapReduce paradigm. In LHC or similar accelerator data, maps consist of Monte Carlo generation or analysis of events, while reduction is construction of histograms by merging those from different maps. In the SAR data analysis of ice sheet observations, maps consist of independent Matlab invocations on different data samples. Life Sciences have many natural candidates for MapReduce, including sequence assembly and the use of BLAST and similar programs. On the other hand, partial differential equation solvers, particle dynamics and linear algebra require the full MPI model for high performance parallel implementation.

Grand Challenge Implications of MapReduce and Clouds

MapReduce and clouds can be used for some of the applications that are most rapidly growing in importance. Their support seems essential if one is to support large scale data intensive applications. More generally, a more careful analysis of clouds versus traditional environments is needed to quantify the simplistic analysis given above. There is a clear algorithm challenge to design more loosely coupled algorithms that are compatible with the map followed by reduce model of MapReduce, or more generally with the structure of clouds. This could lead to generalizations of MapReduce which are still compatible with the cloud virtualization and fault tolerance features.

There are many software challenges, including MapReduce itself; its extensions (both in functionality and higher level abstractions); and improved workflow systems supporting MapReduce and the linking of clients, clouds and MPI engines. We have noted research challenges in security, and there is also active work in the preparation, management and deployment of program images (appliances) to be loaded into virtual machines. The intrinsic conflict between virtualization and the issues around locality or affinity (between nodes in MPI or between computation and data) needs more research.

On the infrastructure side, we have already discussed the importance of high quality networking between MPI and cloud systems. Another critical area is file systems, where clouds and MapReduce use new approaches that are not clearly compatible with traditional TeraGrid approaches. Support of novel databases such as BigTable across clouds and MPI clusters is probably important. Obviously NSF and the computational science community need to decide on the balance between use of commercial clouds and "private" TeraGrid clouds mimicking Magellan and providing the large scale production facilities for codes prototyped on FutureGrid.
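
The map-then-reduce pattern described above for accelerator data (independent maps analyze events, the reduction merges their histograms, and a failed task is simply rerun) can be illustrated with a minimal single-machine Python sketch. It does not use Hadoop or Dryad; the names `map_task`, `merge_histograms`, `run_with_retry` and `map_reduce`, and the `bin_width` parameter, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from functools import reduce


def map_task(events, bin_width=10.0):
    """Map: histogram one independent block of event values (no side effects)."""
    return Counter(int(value // bin_width) for value in events)


def merge_histograms(left, right):
    """Reduce: merge two partial histograms by adding their bin counts."""
    return left + right


def run_with_retry(task, block, max_attempts=3):
    """Rerun a failed task; this is safe only because tasks are side-effect free."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(block)
        except Exception:
            if attempt == max_attempts:
                raise


def map_reduce(blocks):
    """Run every map independently, then merge the partial results."""
    partials = [run_with_retry(map_task, block) for block in blocks]
    return reduce(merge_histograms, partials, Counter())


if __name__ == "__main__":
    # Three data blocks, as produced by a domain (data) decomposition.
    print(map_reduce([[1.0, 12.0, 15.0], [3.0, 27.0], [11.0]]))
```

In a real deployment the maps would run asynchronously on separate nodes and write their partial histograms to a file system; the sequential list comprehension here only stands in for that step.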

Metagenomics: A Grand Challenge Vignette

The study of microbial genomes is complicated by the fact that only a small number of species can be isolated successfully, and the current way forward is metagenomic studies of culture-independent, collective sets of genomes in their natural environments. This requires identification of as many as millions of genes and thousands of species from individual samples. New sequencing technology can provide the required data samples with a throughput of 1 trillion base pairs per day, and this rate will increase. A typical observation and data pipeline is shown in Figure 1, with sequencers producing DNA samples that are assembled and subject to further analysis, including BLAST-like comparison with existing datasets as well as clustering and visualization to identify new gene families. Figure 2 shows initial results from analysis of 30,000 sequences, with clusters identified and visualized using dimension reduction to map to three dimensions with multidimensional scaling (MDS). The initial parts of the pipeline fit the MapReduce or many-task cloud model, but the latter stages involve parallel linear algebra.

Figure 1: Pipeline for analysis of metagenomics data. (Diagram labels: Instruments; Read Alignment; Internet; FASTA File, N Sequences; Blocking; Form block Pairings; Sequence alignment; Dissimilarity Matrix, N(N-1)/2 values; Pairwise clustering; MDS; Visualization, Plotviz.)

Figure 2: Results of 17 clusters for the full sample, using Sammon's version of MDS for visualization [20].

Figure 3: Time to process a single biology sequence file (458 reads) per core with different frameworks [18].

State of the art MDS and clustering algorithms scale like O(N^2) for N sequences; the total runtime for MDS and clustering is about 2 hours each on a 768 core commodity cluster, obtaining a speedup of about 500 using a hybrid MPI-threading implementation on 24 core nodes. The initial steps can be run on clouds and include the calculation of a distance matrix of N(N-1)/2 independent elements. Million sequence problems of this type will challenge the largest clouds and the largest TeraGrid resources. Figure 3 looks at a related sequence assembly problem and compares performance of MapReduce (Hadoop, DryadLINQ), with and without virtual machines, and the basic Amazon and Microsoft clouds. The execution times are similar (range is 30%), showing that this class of algorithm can be effectively run on many different infrastructures, and it makes sense to consider the intrinsic advantages of clouds described above. In recent work we have looked at hierarchical methods to reduce the O(N^2) execution time to O(N log N) or O(N) and allow loosely coupled cloud implementation, with initial results on interpolation methods presented in [21].
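
The reason the distance matrix step suits clouds is that its N(N-1)/2 elements are independent, so the work can be cut into block pairings that run as separate tasks. The following Python sketch illustrates that decomposition under stated assumptions: `mismatch_fraction` is a hypothetical placeholder dissimilarity, not the sequence alignment used in the actual pipeline, and `block_pairs`, `block_task` and `distance_matrix` are illustrative names.

```python
def mismatch_fraction(a, b):
    """Placeholder dissimilarity (assumption): share of positions that differ."""
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / length


def block_pairs(n_blocks):
    """Yield each unordered block pairing (i <= j) once; each is one task."""
    for i in range(n_blocks):
        for j in range(i, n_blocks):
            yield i, j


def block_task(sequences, bounds, i, j):
    """Map: distances for all pairs p < q with p in block i and q in block j."""
    (lo_i, hi_i), (lo_j, hi_j) = bounds[i], bounds[j]
    out = {}
    for p in range(lo_i, hi_i):
        # Within a diagonal block (i == j) start at p + 1 to skip p >= q.
        for q in range(max(lo_j, p + 1), hi_j):
            out[(p, q)] = mismatch_fraction(sequences[p], sequences[q])
    return out


def distance_matrix(sequences, block_size):
    """Split N sequences into blocks, run each pairing, merge the results."""
    n = len(sequences)
    bounds = [(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    distances = {}
    for i, j in block_pairs(len(bounds)):
        # Block tasks share no state, so they could run on separate nodes.
        distances.update(block_task(sequences, bounds, i, j))
    return distances


if __name__ == "__main__":
    seqs = ["ACGT", "ACGA", "TTGT", "ACG", "GGGG"]
    d = distance_matrix(seqs, block_size=2)
    print(len(d), "pairs; expected", len(seqs) * (len(seqs) - 1) // 2)
```

The block size only changes how the pairs are grouped into tasks, not the result, which is what makes the granularity tunable to the synchronization cost of the platform.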
References
[1] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia, Above the Clouds: A Berkeley View of Cloud Computing, http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf
[2] Press Release: Gartner's 2009 Hype Cycle Special Report Evaluates Maturity of 1,650 Technologies, http://www.gartner.com/it/page.jsp?id=1124212
[3] Cloud Computing Forum & Workshop, NIST Information Technology Laboratory, Washington DC, May 20, 2010, http://www.nist.gov/itl/cloud.cfm
[4] Nimbus Cloud Computing for Science, http://www.nimbusproject.org/
[5] OpenNebula Open Source Toolkit for Cloud Computing, http://www.opennebula.org/
[6] Sector and Sphere Data Intensive Cloud Computing Platform, http://sector.sourceforge.net/doc.html
[7] Eucalyptus Open Source Cloud Software, http://open.eucalyptus.com/
[8] FutureGrid Grid Testbed, http://www.futuregrid.org
[9] Magellan Cloud for Science, http://magellan.alcf.anl.gov/, http://www.nersc.gov/nusers/systems/magellan/
[10] European Framework 7 project starting June 1, 2010: VENUS-C, Virtual multidisciplinary EnviroNments USing Cloud infrastructure.
[11] Recordings of Presentations, Cloud Futures 2010, Redmond WA, April 8-9, 2010, http://research.microsoft.com/en-us/events/cloudfutures2010/videos.aspx
[12] Lockheed Martin Cyber Security Alliance, Cloud Computing Whitepaper, April 2010, http://www.lockheedmartin.com/data/assets/isgs/documents/CloudComputingWhitePaper.pdf
[13] Edward Walker, Benchmarking Amazon EC2 for High Performance Scientific Computing, USENIX ;login:, vol. 33(5), Oct 2008, http://www.usenix.org/publications/login/2008-10/openpdfs/walker.pdf
[14] Jaliya Ekanayake, Xiaohong Qiu, Thilina Gunarathne, Scott Beason, Geoffrey Fox, High Performance Parallel Computing with Clouds and Cloud Technologies, to appear as a book chapter in Cloud Computing and Software Services: Theory and Techniques, CRC Press (Taylor and Francis), ISBN-10: 1439803153, http://grids.ucs.indiana.edu/ptliupages/publications/cloud_handbook_finalwithdiagrams.pdf
[15] Dean, J. and S. Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113.
[16] Open source MapReduce: Apache Hadoop, http://hadoop.apache.org/core/
[17] Jaliya Ekanayake, Thilina Gunarathne, Judy Qiu, Geoffrey Fox, Scott Beason, Jong Youl Choi, Yang Ruan, Seung-Hee Bae, Hui Li, Applicability of DryadLINQ to Scientific Applications, Technical Report, January 30, 2010, http://grids.ucs.indiana.edu/ptliupages/publications/DryadReport.pdf
[18] Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications, Proceedings of Emerging Computational Methods for the Life Sciences Workshop of ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, 2010.
[19] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox, Twister: A Runtime for Iterative MapReduce, Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, 2010.
[20] Geoffrey Fox, Xiaohong Qiu, Scott Beason, Jong Youl Choi, Mina Rho, Haixu Tang, Neil Devadasan, Gilbert Liu, Biomedical Case Studies in Data Intensive Computing, Keynote talk at The 1st International Conference on