Clouds and MapReduce for Scientific Applications
5 pages
English



Published on 24 June 2011

Introduction

Cloud computing [1] is at the peak of the Gartner technology hype curve [2], but there are good reasons to believe that as it matures it will not disappear into the trough of disillusionment but rather move onto the plateau of productivity, as have, for example, service-oriented architectures. Clouds are driven by large commercial markets: IDC estimates that clouds will represent 14% of IT expenditure in 2012, and there is rapidly growing interest from government and industry. There are several reasons why clouds should be important for large-scale scientific computing:

1) Clouds are the largest-scale computer centers constructed, and so they have the capacity to be important to large-scale science problems as well as those at small scale.
2) Clouds exploit the economies of this scale and so can be expected to be a cost-effective approach to computing. Their architecture explicitly addresses the important fault tolerance issue.
3) Clouds are commercially supported, and so one can expect reasonably robust software without the sustainability difficulties seen from the academic software systems critical to much current cyberinfrastructure.
4) There are 3 major vendors of clouds (Amazon, Google, Microsoft) and many other infrastructure and software cloud technology vendors, including Eucalyptus Systems, which spun off from UC Santa Barbara HPC research. This competition should ensure that clouds develop in a healthy, innovative fashion. Further, attention is already being given to cloud standards [3].
5) There are many cloud research efforts, conferences and other activities, with research cloud infrastructure efforts including Nimbus [4], OpenNebula [5], Sector/Sphere [6] and Eucalyptus [7].
6) There are a growing number of academic and science cloud systems supporting users through NSF programs for Google/IBM and Microsoft Azure systems. In NSF OCI, FutureGrid [8] will offer a cloud testbed, and Magellan [9] is a major DoE experimental cloud system. The EU Framework 7 project VENUS-C [10] is just starting.
7) Clouds offer "on-demand" and interactive computing that is more attractive than batch systems to many users.

Listening to some of the talks at the recent Cloud Futures workshop [11], one might imagine that all scientific computing could be performed on clouds. This is not true; rather, the situation is somewhere in the middle, with some important classes of scientific computing being suitable for clouds but others not. The problems with using clouds are well documented and include:

8) The centralized computing model for clouds runs counter to the concept of "bringing the computing to the data", and bringing the "data to a commercial cloud facility" may be slow and expensive.
9) There are many security, legal and privacy issues [12] that often mimic those of the Internet, which are especially problematic in areas such as health informatics and where proprietary information could be exposed.
10) The virtualized networking currently used in the virtual machines in today's commercial clouds and jitter from complex operating system functions increase synchronization/communication costs. This is especially serious in large-scale parallel computing and leads to significant overheads in many MPI applications [13, 14]. Indeed, the usual (and attractive) fault tolerance model for clouds runs counter to the tight synchronization needed in most MPI applications.

Some of these issues can be addressed with customized (private) clouds and enhanced bandwidth from TeraGrid to commercial cloud networks. For example, there could be growing interest in "HPC as a Service" as exemplified by Penguin Computing on Demand. However, it seems likely that clouds will not supplant traditional approaches for very large-scale parallel (MPI) jobs in the near future. It is natural to consider a hybrid model with jobs running on either classic HPC systems or clouds, or in fact both, as a given workflow (as in the example below) could well have individual jobs suitable for different parts of this hybrid system.

Commercial clouds support "massively parallel" applications, but only those that are loosely coupled and so insensitive to higher synchronization costs. Let us focus on "massively parallel" or "many task" cloud applications, as these most interestingly "compete" with possible TeraGrid implementations. In this case, the programming model MapReduce [15] describes problems suitable for clouds. This is offered on Amazon clouds and is expected soon on other commercial clouds, while it can be implemented on any cluster using the open source Hadoop [16] software for Linux or the Microsoft Dryad system [17] for Windows clusters. One can compare MPI, MapReduce (with or without virtual machines) and different native cloud implementations and find comparable (with a range of 30%) performance on applications suitable for these paradigms [18]. MapReduce and its extensions offer the most user-friendly environment.

One can describe the difference between MPI and MapReduce as follows. In MapReduce, multiple map processes are formed, typically by a domain (data) decomposition familiar from MPI; these run asynchronously, typically writing results to a file system that is consumed by a set of reduce tasks that merge parallel results in some fashion. This programming model implies straightforward and efficient fault tolerance by rerunning failed map or reduce tasks. MPI addresses a more complicated problem architecture with iterative compute-communicate stages with synchronization at the communication phase. This synchronization means, for example, that all processes wait if one is delayed or fails. This inefficiency is not present in MapReduce, where resources are released when individual map or reduce tasks complete. MPI of course supports general (built-in and user-defined) reductions, so MPI could be used for applications of the MapReduce style. However, the latter offers greater fault tolerance and a user-friendly higher-level environment, largely stemming from the coarse-grain functional programming model implemented as side-effect-free tasks. Oversimplifying, MPI supports multiple MapReduce stages but MapReduce just one. Correspondingly, clouds support applications that have the loose coupling supported by MapReduce, while classic HPC supports more tightly coupled applications. Research into extensions of MapReduce attempts to bridge these differences [19].

MapReduce covers many high-throughput computing applications including "parameter searches". Many data analysis applications, including information retrieval, fit the MapReduce paradigm. In LHC or similar accelerator data, maps consist of Monte Carlo generation or analysis of events, while reduction is construction of histograms by merging those from different maps. In the SAR data analysis of ice sheet observations, maps consist of independent Matlab invocations on different data samples. Life sciences have many natural candidates for MapReduce, including sequence assembly and the use of BLAST and similar programs. On the other hand, partial differential equation solvers, particle dynamics and linear algebra require the full MPI model for high-performance parallel implementation.

Grand Challenge Implications of MapReduce and Clouds

MapReduce and clouds can be used for some of the applications that are most rapidly growing in importance. Their support seems essential if one is to support large-scale data-intensive applications. More generally, a more careful analysis of clouds versus traditional environments is needed to quantify the simplistic analysis given above. There is a clear algorithm challenge to design more loosely coupled algorithms that are compatible with the map-followed-by-reduce model of MapReduce or, more generally, with the structure of clouds. This could lead to generalizations of MapReduce which are still compatible with the cloud virtualization and fault tolerance features.

There are many software challenges, including MapReduce itself; its extensions (both in functionality and higher-level abstractions); and improved workflow systems supporting MapReduce and the linking of clients, clouds and MPI engines. We have noted research challenges in security, and there is also active work in the preparation, management and deployment of program images (appliances) to be loaded into virtual machines. The intrinsic conflict between virtualization and the issues around locality or affinity (between nodes in MPI or between computation and data) needs more research.

On the infrastructure side, we have already discussed the importance of high-quality networking between MPI and cloud systems. Another critical area is file systems, where clouds and MapReduce use new approaches that are not clearly compatible with traditional TeraGrid approaches. Support of novel databases such as BigTable across clouds and MPI clusters is probably important. Obviously NSF and the computational science community need to decide on the balance between use of commercial clouds and "private" TeraGrid clouds mimicking Magellan and providing the large-scale production facilities for codes prototyped on FutureGrid.
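
The map-followed-by-reduce model, and the fault tolerance it gets from rerunning failed tasks, can be illustrated with a small sketch in the spirit of the accelerator-data case mentioned in this paper: each map generates events and fills a histogram, and the reduce merges the partial histograms. This is only an illustration under stated assumptions, not the Hadoop or Dryad API. The names map_task, reduce_task, run_with_retry and run_job, the bin count, and the use of a local thread pool in place of cluster nodes are all hypothetical.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

BINS = 10                # assumed number of histogram bins over [0, 1)
EVENTS_PER_MAP = 10_000  # assumed number of simulated events per map task


def map_task(seed):
    """One side-effect-free map: simulate events and histogram them.

    The generator is seeded per task, so rerunning a failed task
    reproduces exactly the same partial result.
    """
    rng = random.Random(seed)
    hist = Counter()
    for _ in range(EVENTS_PER_MAP):
        value = rng.random()  # stand-in for a simulated event quantity
        hist[min(int(value * BINS), BINS - 1)] += 1
    return hist


def reduce_task(partials):
    """Merge the partial histograms produced by the maps."""
    total = Counter()
    for hist in partials:
        total.update(hist)
    return total


def run_with_retry(task, arg, max_attempts=3):
    """Rerun a failed task; this is safe because tasks have no side effects."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg)
        except Exception:
            if attempt == max_attempts:
                raise


def run_job(seeds, map_fn=map_task, workers=4):
    """Run all maps asynchronously (threads stand in for nodes), then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda s: run_with_retry(map_fn, s), seeds))
    return reduce_task(partials)


if __name__ == "__main__":
    print(sorted(run_job(range(8)).items()))
```

Because each map depends only on its own input, a failed map is simply rerun without disturbing the others; an MPI code with a synchronization point in every iteration has no equally simple recovery path.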
Metagenomics: A Grand Challenge Vignette

The study of microbial genomes is complicated by the fact that only a small number of species can be isolated successfully, and the current way forward is metagenomic studies of culture-independent, collective sets of genomes in their natural environments. This requires identification of as many as millions of genes and thousands of species from individual samples. New sequencing technology can provide the required data samples with a throughput of 1 trillion base pairs per day, and this rate will increase. A typical observation and data pipeline is shown in Figure 1, with sequencers producing DNA samples that are assembled and subject to further analysis, including BLAST-like comparison with existing datasets as well as clustering and visualization to identify new gene families. Figure 2 shows initial results from analysis of 30,000 sequences, with clusters identified and visualized using dimension reduction to map to three dimensions with multidimensional scaling (MDS). The initial parts of the pipeline fit the MapReduce or many-task cloud model, but the latter stages involve parallel linear algebra.
Figure 1: Pipeline for analysis of metagenomics data. (Diagram labels: Internet; Instruments; Read Alignment; FASTA File, N Sequences; Form block Pairings; Sequence alignment; Blocking; Dissimilarity Matrix, N(N-1)/2 values; Pairwise clustering; MDS; Visualization, Plotviz.)
Figure 2: Results of 17 clusters for full sample using Sammon's version of MDS for visualization [20].

Figure 3: Time to process a single biology sequence file (458 reads) per core with different frameworks [18].

State-of-the-art MDS and clustering algorithms scale like O(N^2) for N sequences; the total runtime for MDS and clustering is about 2 hours each on a 768-core commodity cluster, obtaining a speedup of about 500 using a hybrid MPI-threading implementation on 24-core nodes. The initial steps can be run on clouds and include the calculation of a distance matrix of N(N-1)/2 independent elements. Million-sequence problems of this type will challenge the largest clouds and the largest TeraGrid resources. Figure 3 looks at a related sequence assembly problem and compares performance of MapReduce (Hadoop, DryadLINQ) with and without virtual machines and the basic Amazon and Microsoft clouds. The execution times are similar (range is 30%), showing that this class of algorithm can be effectively run on many different infrastructures, and it makes sense to consider the intrinsic advantages of clouds described above. In recent work we have looked at hierarchical methods to reduce O(N^2) execution time to O(N log N) or O(N) and allow loosely coupled cloud implementation, with initial results on interpolation methods presented in [21].
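
The distance-matrix step has N(N-1)/2 independent elements, which is why it maps naturally onto loosely coupled tasks. The sketch below shows one possible way to cut the upper triangle into blocks that could each be handed to a separate map task. It is an illustration only: the function names, the block size, and in particular the toy mismatch-fraction dissimilarity are assumptions standing in for a real sequence-alignment score, and the blocks are processed in a plain loop rather than on a cloud.

```python
from itertools import combinations_with_replacement


def n_pairs(n):
    """Number of independent matrix elements, N(N-1)/2."""
    return n * (n - 1) // 2


def dissimilarity(a, b):
    """Toy stand-in for an alignment-based score: fraction of mismatched positions."""
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / length


def block_pairs(n, block_size):
    """Yield (row_range, col_range) index blocks covering the upper triangle."""
    starts = range(0, n, block_size)
    for r, c in combinations_with_replacement(starts, 2):
        yield (r, min(r + block_size, n)), (c, min(c + block_size, n))


def map_block(seqs, rows, cols):
    """One independent task: every element with i < j inside a single block."""
    out = {}
    for i in range(*rows):
        for j in range(*cols):
            if i < j:
                out[(i, j)] = dissimilarity(seqs[i], seqs[j])
    return out


def distance_matrix(seqs, block_size=2):
    """Merge the per-block results (the 'reduce' step) into one sparse matrix."""
    merged = {}
    for rows, cols in block_pairs(len(seqs), block_size):
        merged.update(map_block(seqs, rows, cols))
    return merged


if __name__ == "__main__":
    sample = ["ACGT", "ACGA", "TTTT", "ACG", "ACGT"]
    print(len(distance_matrix(sample)), "elements for", len(sample), "sequences")
    print(n_pairs(10**6), "elements for a million sequences")
```

For N of one million, n_pairs gives roughly 5 x 10^11 elements, which gives a sense of why problems at that scale are expected to stress even the largest resources.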
References
[1] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia, Above the Clouds: A Berkeley View of Cloud Computing, http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS200928.pdf
[2] Press Release, Gartner's 2009 Hype Cycle Special Report Evaluates Maturity of 1,650 Technologies, http://www.gartner.com/it/page.jsp?id=1124212
[3] Cloud Computing Forum & Workshop, NIST Information Technology Laboratory, Washington DC, May 20 2010, http://www.nist.gov/itl/cloud.cfm
[4] Nimbus Cloud Computing for Science, http://www.nimbusproject.org/
[5] OpenNebula Open Source Toolkit for Cloud Computing, http://www.opennebula.org/
[6] Sector and Sphere Data Intensive Cloud Computing Platform, http://sector.sourceforge.net/doc.html
[7] Eucalyptus Open Source Cloud Software, http://open.eucalyptus.com/
[8] FutureGrid Grid Testbed, http://www.futuregrid.org
[9] Magellan Cloud for Science, http://magellan.alcf.anl.gov/, http://www.nersc.gov/nusers/systems/magellan/
[10] European Framework 7 project starting June 1 2010, VENUS-C: Virtual multidisciplinary EnviroNments USing Cloud infrastructure.
[11] Recordings of Presentations, Cloud Futures 2010, Redmond WA, April 8-9 2010, http://research.microsoft.com/enus/events/cloudfutures2010/videos.aspx
[12] Lockheed Martin Cyber Security Alliance, April 2010, Cloud Computing Whitepaper, http://www.lockheedmartin.com/data/assets/isgs/documents/CloudComputingWhitePaper.pdf
[13] Edward Walker, Benchmarking Amazon EC2 for High Performance Scientific Computing, USENIX ;login:, vol. 33(5), Oct 2008, http://www.usenix.org/publications/login/200810/openpdfs/walker.pdf
[14] Jaliya Ekanayake, Xiaohong Qiu, Thilina Gunarathne, Scott Beason, Geoffrey Fox, High Performance Parallel Computing with Clouds and Cloud Technologies, to appear as a book chapter in Cloud Computing and Software Services: Theory and Techniques, CRC Press (Taylor and Francis), ISBN-10: 1439803153. http://grids.ucs.indiana.edu/ptliupages/publications/cloud_handbook_finalwithdiagrams.pdf
[15] Dean, J. and S. Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113.
[16] Open source MapReduce, Apache Hadoop, http://hadoop.apache.org/core/
[17] Jaliya Ekanayake, Thilina Gunarathne, Judy Qiu, Geoffrey Fox, Scott Beason, Jong Youl Choi, Yang Ruan, Seung-Hee Bae, Hui Li, Applicability of DryadLINQ to Scientific Applications, Technical Report, January 30 2010, http://grids.ucs.indiana.edu/ptliupages/publications/DryadReport.pdf
[18] Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications, Proceedings of Emerging Computational Methods for the Life Sciences Workshop of ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, 2010.
[19] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox, Twister: A Runtime for Iterative MapReduce, Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, 2010.
[20] Geoffrey Fox, Xiaohong Qiu, Scott Beason, Jong Youl Choi, Mina Rho, Haixu Tang, Neil Devadasan, Gilbert Liu, Biomedical Case Studies in Data Intensive Computing, Keynote talk at The 1st International Conference on