
Clouds and MapReduce for Scientific Applications


Introduction
Cloud computing [1] is at the peak of the Gartner technology hype curve [2], but there are good reasons to believe that as it matures it will not disappear into the trough of disillusionment but rather move onto the plateau of productivity, as have, for example, service-oriented architectures. Clouds are driven by large commercial markets: IDC estimates that clouds will represent 14% of IT expenditure in 2012, and there is rapidly growing interest from government and industry. There are several reasons why clouds should be important for large-scale scientific computing:

1) Clouds are the largest-scale computer centers constructed, and so they have the capacity to be important to large-scale science problems as well as those at small scale.
2) Clouds exploit the economies of this scale and so can be expected to be a cost-effective approach to computing. Their architecture explicitly addresses the important fault tolerance issue.
3) Clouds are commercially supported, and so one can expect reasonably robust software without the sustainability difficulties seen from the academic software systems critical to much current cyberinfrastructure.
4) There are 3 major vendors of clouds (Amazon, Google, Microsoft) and many other infrastructure and software cloud technology vendors, including Eucalyptus Systems, which spun off from UC Santa Barbara HPC research. This competition should ensure that clouds develop in a healthy, innovative fashion. Further, attention is already being given to cloud standards [3].
5) There are many cloud research efforts, conferences and other activities, with research cloud infrastructure efforts including Nimbus [4], OpenNebula [5], Sector/Sphere [6] and Eucalyptus [7].
6) There are a growing number of academic and science cloud systems supporting users through NSF programs for Google/IBM and Microsoft Azure systems. In NSF OCI, FutureGrid [8] will offer a cloud testbed, and Magellan [9] is a major DoE experimental cloud system. The EU Framework 7 project VENUS-C [10] is just starting.
7) Clouds offer "on-demand" and interactive computing that is more attractive than batch systems to many users.

Listening to some of the talks at the recent Cloud Futures workshop [11], one might imagine that all scientific computing could be performed on clouds. This is not true; rather, the situation is somewhere in the middle, with some important classes of scientific computing being suitable for clouds but others not. The problems with using clouds are well documented and include:

8) The centralized computing model for clouds runs counter to the concept of "bringing the computing to the data", and bringing the "data to a commercial cloud facility" may be slow and expensive.
9) There are many security, legal and privacy issues [12] that often mimic those of the Internet, and which are especially problematic in areas such as health informatics and where proprietary information could be exposed.
10) The virtualized networking currently used in the virtual machines in today's commercial clouds, and jitter from complex operating system functions, increase synchronization/communication costs. This is especially serious in large-scale parallel computing and leads to significant overheads in many MPI applications [13, 14]. Indeed, the usual (and attractive) fault tolerance model for clouds runs counter to the tight synchronization needed in most MPI applications.

Some of these issues can be addressed with customized (private) clouds and enhanced bandwidth from TeraGrid to commercial cloud networks. For example, there could be growing interest in "HPC as a Service" as exemplified by Penguin Computing on Demand. However, it seems likely that clouds will not supplant traditional approaches for very large-scale parallel (MPI) jobs in the near future. It is natural to consider a hybrid model with jobs running on either classic HPC systems or clouds, or in fact both, as a given workflow (as in the example below) could well have individual jobs suitable for different parts of this hybrid system.

Commercial clouds support "massively parallel" applications, but only those that are loosely coupled and so insensitive to higher synchronization costs. Let us focus on "massively parallel" or "many task" cloud applications, as these most interestingly "compete" with possible TeraGrid implementations. In this case, the programming model MapReduce [15] describes problems suitable for clouds. This is offered on Amazon clouds and is expected soon on other commercial clouds, while it can be implemented on any cluster using the open source Hadoop [16] software for Linux or the Microsoft Dryad system [17] for Windows clusters. One can compare MPI, MapReduce (with or without virtual machines) and different native cloud implementations and find comparable (within a range of 30%) performance on applications suitable for these paradigms [18]. MapReduce and its extensions offer the most user-friendly environment.

One can describe the difference between MPI and MapReduce as follows. In MapReduce, multiple map processes are formed, typically by a domain (data) decomposition familiar from MPI; these run asynchronously, typically writing results to a file system, where they are consumed by a set of reduce tasks that merge parallel results in some fashion. This programming model implies straightforward and efficient fault tolerance by rerunning failed map or reduce tasks. MPI addresses a more complicated problem architecture with iterative compute-communicate stages, with synchronization at the communication phase. This synchronization means, for example, that all processes wait if one is delayed or fails. This inefficiency is not present in MapReduce, where resources are released when individual map or reduce tasks complete. MPI of course supports general (built-in and user-defined) reductions, so MPI could be used for applications of the MapReduce style. However, the latter offers greater fault tolerance and a user-friendly higher-level environment, largely stemming from the coarse-grain functional programming model implemented as side-effect-free tasks. Oversimplifying, MPI supports multiple MapReduce stages but MapReduce just one. Correspondingly, clouds support applications that have the loose coupling supported by MapReduce, while classic HPC supports more tightly coupled applications. Research into extensions of MapReduce attempts to bridge these differences [19].

MapReduce covers many high-throughput computing applications, including "parameter searches". Many data analysis applications, including information retrieval, fit the MapReduce paradigm. In LHC or similar accelerator data, maps consist of Monte Carlo generation or analysis of events, while reduction is construction of histograms by merging those from different maps. In the SAR data analysis of ice sheet observations, maps consist of independent Matlab invocations on different data samples. Life Sciences have many natural candidates for MapReduce, including sequence assembly and the use of BLAST and similar programs. On the other hand, partial differential equation solvers, particle dynamics and linear algebra require the full MPI model for high-performance parallel implementation.

Grand Challenge Implications of MapReduce and Clouds
MapReduce and clouds can be used for some of the applications that are most rapidly growing in importance. Their support seems essential if one is to support large-scale data-intensive applications. More generally, a more careful analysis of clouds versus traditional environments is needed to quantify the simplistic analysis given above. There is a clear algorithm challenge to design more loosely coupled algorithms that are compatible with the map-followed-by-reduce model of MapReduce or, more generally, with the structure of clouds. This could lead to generalizations of MapReduce which are still compatible with the cloud virtualization and fault tolerance features.

There are many software challenges, including MapReduce itself; its extensions (both in functionality and higher-level abstractions); and improved workflow systems supporting MapReduce and the linking of clients, clouds and MPI engines. We have noted research challenges in security, and there is also active work in the preparation, management and deployment of program images (appliances) to be loaded into virtual machines. The intrinsic conflict between virtualization and the issues around locality or affinity (between nodes in MPI or between computation and data) needs more research.

On the infrastructure side, we have already discussed the importance of high-quality networking between MPI and cloud systems. Another critical area is file systems, where clouds and MapReduce use new approaches that are not clearly compatible with traditional TeraGrid approaches. Support of novel databases such as BigTable across clouds and MPI clusters is probably important. Obviously, NSF and the computational science community need to decide on the balance between use of commercial clouds and "private" TeraGrid clouds mimicking Magellan and providing the large-scale production facilities for codes prototyped on FutureGrid.
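
As a concrete illustration of the map-followed-by-reduce model discussed above, the accelerator-data case (maps analyse independent blocks of events, the reduce merges their histograms) might be sketched as follows. This is a minimal single-machine sketch using only the Python standard library, not Hadoop or Dryad. The names map_task, reduce_task, run_with_retry and map_reduce are hypothetical, and the event values are invented placeholders. The retry helper reflects the fault tolerance argument: because tasks are side-effect free, a failed task can simply be rerun.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_task(events, bin_width=10):
    """Map: histogram one independent block of event values (no side effects)."""
    return Counter((int(e) // bin_width) * bin_width for e in events)


def reduce_task(left, right):
    """Reduce: merge two partial histograms by adding their bin counts."""
    return left + right


def run_with_retry(task, block, attempts=3):
    """Rerun a failed task; this is safe because tasks are side-effect free."""
    for attempt in range(attempts):
        try:
            return task(block)
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the last attempt


def map_reduce(blocks, workers=4):
    """Run all map tasks independently, then merge their results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda b: run_with_retry(map_task, b), blocks))
    return reduce(reduce_task, partials, Counter())


if __name__ == "__main__":
    # Invented placeholder data: three blocks of event values.
    blocks = [[3, 12, 18], [25, 7], [11, 29, 31]]
    print(dict(map_reduce(blocks)))
```

Note that no map task waits on any other, which is the contrast with the synchronized compute-communicate stages of MPI drawn in the Introduction.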
Metagenomics - A Grand Challenge Vignette
The study of microbial genomes is complicated by the fact that only a small number of species can be isolated successfully, and the current way forward is metagenomic studies of culture-independent, collective sets of genomes in their natural environments. This requires identification of as many as millions of genes and thousands of species from individual samples. New sequencing technology can provide the required data samples with a throughput of 1 trillion base pairs per day, and this rate will increase. A typical observation and data pipeline is shown in Figure 1, with sequencers producing DNA samples that are assembled and subject to further analysis, including BLAST-like comparison with existing datasets as well as clustering and visualization to identify new gene families. Figure 2 shows initial results from analysis of 30,000 sequences, with clusters identified and visualized using dimension reduction to map to three dimensions with multidimensional scaling (MDS). The initial parts of the pipeline fit the MapReduce or many-task cloud model, but the latter stages involve parallel linear algebra.
[Figure 1: Pipeline for analysis of metagenomics data. Diagram labels: Internet; Instruments; Read Alignment; FASTA File, N Sequences; Blocking; Form block Pairings; Sequence alignment; Dissimilarity Matrix, N(N-1)/2 values; Pairwise clustering; MDS; Visualization, Plotviz.]
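
The blocking, block-pairing and dissimilarity-matrix stages named in Figure 1 can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and hamming_fraction is a placeholder dissimilarity standing in for a real sequence alignment score. The point is that each block pair is an independent task producing its own share of the N(N-1)/2 values, so the stage fits the many-task cloud model.

```python
from itertools import combinations_with_replacement


def hamming_fraction(a, b):
    """Placeholder dissimilarity (assumption): fraction of differing positions
    between two equal-length strings. A real pipeline would align sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def block_pairings(n, block_size):
    """Enumerate block pairs (I <= J) that cover the upper triangle."""
    starts = range(0, n, block_size)
    return list(combinations_with_replacement(starts, 2))


def block_task(seqs, i0, j0, block_size, dist=hamming_fraction):
    """Independent task: all pairs i < j with i in block i0 and j in block j0."""
    n = len(seqs)
    out = {}
    for i in range(i0, min(i0 + block_size, n)):
        for j in range(max(j0, i + 1), min(j0 + block_size, n)):
            out[(i, j)] = dist(seqs[i], seqs[j])
    return out


def dissimilarity(seqs, block_size=2):
    """Merge the independent block results into one dict keyed by (i, j)."""
    merged = {}
    for i0, j0 in block_pairings(len(seqs), block_size):
        merged.update(block_task(seqs, i0, j0, block_size))
    return merged


if __name__ == "__main__":
    # Invented toy sequences, for illustration only.
    seqs = ["ACGT", "ACGA", "TCGA", "ACGT", "GGGT"]
    d = dissimilarity(seqs)
    print(len(d), "pairs computed")
```

Here the block tasks run in a simple loop; in a cloud setting each call to block_task would be dispatched as a separate map task, with the merge acting as the reduce.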
Figure 2: Results of 17 clusters for full sample using Sammon's version of MDS for visualization [20].

Figure 3: Time to process a single biology sequence file (458 reads) per core with different frameworks [18].

State-of-the-art MDS and clustering algorithms scale like O(N^2) for N sequences; the total runtime for MDS and clustering is about 2 hours each on a 768-core commodity cluster, obtaining a speedup of about 500 using a hybrid MPI-threading implementation on 24-core nodes. The initial steps can be run on clouds and include the calculation of a distance matrix of N(N-1)/2 independent elements. Million-sequence problems of this type will challenge the largest clouds and the largest TeraGrid resources. Figure 3 looks at a related sequence assembly problem and compares performance of MapReduce (Hadoop, DryadLINQ), with and without virtual machines, and the basic Amazon and Microsoft clouds. The execution times are similar (range is 30%), showing that this class of algorithm can be effectively run on many different infrastructures, and it makes sense to consider the intrinsic advantages of clouds described above. In recent work we have looked at hierarchical methods to reduce O(N^2) execution time to O(N log N) or O(N) and allow loosely coupled cloud implementation, with initial results on interpolation methods presented in [21].
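
To make the scale of these figures concrete, the following worked arithmetic uses only numbers quoted in the text (30,000 sequences, one million sequences, a speedup of about 500 on 768 cores). Treating efficiency as speedup divided by core count is the standard definition and is an addition here, not a figure from the text.

```latex
\begin{align*}
\text{pairs}(N) &= \frac{N(N-1)}{2} \\
\text{pairs}(3\times10^{4}) &= \frac{30000 \cdot 29999}{2} = 449{,}985{,}000 \approx 4.5\times10^{8} \\
\text{pairs}(10^{6}) &= \frac{10^{6}\,(10^{6}-1)}{2} = 499{,}999{,}500{,}000 \approx 5.0\times10^{11} \\
\frac{\text{pairs}(10^{6})}{\text{pairs}(3\times10^{4})} &\approx 1.1\times10^{3} \\
\text{parallel efficiency} &= \frac{\text{speedup}}{\text{cores}} \approx \frac{500}{768} \approx 0.65
\end{align*}
```

Going from 30,000 to a million sequences therefore multiplies the number of independent distance elements by roughly a thousand, which is why hierarchical methods that avoid the full O(N^2) cost are of interest.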
References
[1] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia, Above the Clouds: A Berkeley View of Cloud Computing, http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf
[2] Press Release: Gartner's 2009 Hype Cycle Special Report Evaluates Maturity of 1,650 Technologies, http://www.gartner.com/it/page.jsp?id=1124212
[3] Cloud Computing Forum & Workshop, NIST Information Technology Laboratory, Washington DC, May 20, 2010, http://www.nist.gov/itl/cloud.cfm
[4] Nimbus: Cloud Computing for Science, http://www.nimbusproject.org/
[5] OpenNebula: Open Source Toolkit for Cloud Computing, http://www.opennebula.org/
[6] Sector and Sphere: Data Intensive Cloud Computing Platform, http://sector.sourceforge.net/doc.html
[7] Eucalyptus: Open Source Cloud Software, http://open.eucalyptus.com/
[8] FutureGrid Grid Testbed, http://www.futuregrid.org
[9] Magellan Cloud for Science, http://magellan.alcf.anl.gov/, http://www.nersc.gov/nusers/systems/magellan/
[10] European Framework 7 project starting June 1, 2010: VENUS-C, Virtual multidisciplinary EnviroNments USing Cloud infrastructure.
[11] Recordings of Presentations, Cloud Futures 2010, Redmond WA, April 8-9, 2010, http://research.microsoft.com/en-us/events/cloudfutures2010/videos.aspx
[12] Lockheed Martin Cyber Security Alliance, April 2010, Cloud Computing Whitepaper, http://www.lockheedmartin.com/data/assets/isgs/documents/CloudComputingWhitePaper.pdf
[13] Edward Walker, Benchmarking Amazon EC2 for High-Performance Scientific Computing, USENIX ;login:, vol. 33(5), Oct 2008, http://www.usenix.org/publications/login/2008-10/openpdfs/walker.pdf
[14] Jaliya Ekanayake, Xiaohong Qiu, Thilina Gunarathne, Scott Beason, Geoffrey Fox, High Performance Parallel Computing with Clouds and Cloud Technologies, to appear as a book chapter in Cloud Computing and Software Services: Theory and Techniques, CRC Press (Taylor and Francis), ISBN-10: 1439803153, http://grids.ucs.indiana.edu/ptliupages/publications/cloud_handbook_final-with-diagrams.pdf
[15] Dean, J. and S. Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113.
[16] Open source MapReduce: Apache Hadoop, http://hadoop.apache.org/core/
[17] Jaliya Ekanayake, Thilina Gunarathne, Judy Qiu, Geoffrey Fox, Scott Beason, Jong Youl Choi, Yang Ruan, Seung-Hee Bae, Hui Li, Applicability of DryadLINQ to Scientific Applications, Technical Report, January 30, 2010, http://grids.ucs.indiana.edu/ptliupages/publications/DryadReport.pdf
[18] Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications, Proceedings of the Emerging Computational Methods for the Life Sciences Workshop of the ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, 2010.
[19] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox, Twister: A Runtime for Iterative MapReduce, Proceedings of the First International Workshop on MapReduce and its Applications of the ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, 2010.
[20] Geoffrey Fox, Xiaohong Qiu, Scott Beason, Jong Youl Choi, Mina Rho, Haixu Tang, Neil Devadasan, Gilbert Liu, Biomedical Case Studies in Data Intensive Computing, Keynote talk at The 1st International Conference on