Lin-Dyer-tutorial-MapReduce


Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer
University of Maryland, College Park
{jimmylin,redpony}@umd.edu

Overview
This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model [1], using the open-source Hadoop implementation. The focus will be on scalability and the tradeoffs associated with distributed processing of large datasets. Content will include general discussions about algorithm design, presentation of illustrative algorithms, case studies in HLT applications, as well as practical advice on writing Hadoop programs and running Hadoop clusters.

Amazon has generously agreed to provide each participant with $100 in Amazon Web Services (AWS) credits that can be used toward its Elastic Compute Cloud (EC2) "utility computing" service (sufficient for 1000 instance-hours). EC2 allows anyone to rapidly provision Hadoop clusters "on the fly" without upfront hardware investments, and provides a low-cost vehicle for exploring Hadoop.

Intended Audience
The tutorial is targeted at any NLP researcher interested in data-intensive processing and scalability issues in general. No background in parallel or distributed computing is necessary, but prior knowledge of HLT is assumed.

Course Objectives
• Acquire understanding of the MapReduce programming model and how it relates to alternative approaches to concurrent programming.
• Acquire understanding of how data-intensive HLT problems (e.g., text retrieval, iterative optimization problems, etc.) can be solved using MapReduce.
• Acquire understanding of the tradeoffs involved in designing MapReduce algorithms and awareness of associated engineering issues.

Tutorial Topics
The following lists topics that will be covered:
• MapReduce algorithm design
• Distributed counting applications (e.g., relative frequency estimation)
• Applications to text retrieval
• Applications to graph algorithms
• Applications to iterative optimization algorithms (e.g., EM)
• Practical Hadoop issues
• Limitations of MapReduce
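
To make the "distributed counting" topic concrete, here is a minimal single-machine sketch of relative frequency estimation written in the map, shuffle, reduce style. It is an illustration only, not the instructors' tutorial code, and it does not use the Hadoop API: the function names (`map_pairs`, `shuffle`, `reduce_relative_frequencies`, `relative_frequencies`) and the empty-string `MARGINAL` marker are assumptions made for this example. The quantity estimated is f(right | left) = count(left, right) / count(left, anything) over adjacent word pairs.

```python
from collections import defaultdict

# Hypothetical marker for the marginal count of a left word. The empty
# string can never be a token produced by str.split(), and it sorts before
# every non-empty string, so each marginal reaches the reducer first.
MARGINAL = ""


def map_pairs(doc):
    """Mapper: for each adjacent word pair emit a pair count and a marginal count."""
    words = doc.lower().split()
    for left, right in zip(words, words[1:]):
        yield (left, MARGINAL), 1
        yield (left, right), 1


def shuffle(pairs):
    """Stand-in for the framework's sort-and-group step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_relative_frequencies(grouped):
    """Reducer: divide each pair count by the marginal seen just before it."""
    marginal = 0
    result = {}
    for (left, right), counts in grouped:
        total = sum(counts)
        if right == MARGINAL:
            marginal = total  # count(left, anything)
        else:
            result[(left, right)] = total / marginal
    return result


def relative_frequencies(docs):
    """Run the three phases in sequence over an iterable of documents."""
    pairs = (kv for doc in docs for kv in map_pairs(doc))
    return reduce_relative_frequencies(shuffle(pairs))


if __name__ == "__main__":
    for pair, freq in relative_frequencies(["a b a c", "a b"]).items():
        print(pair, round(freq, 3))
```

The reducer never buffers all pairs for a left word; it depends only on the sort order delivering the marginal first. On a real cluster this would also require that all keys sharing a left word reach the same reducer, which this in-memory sketch gets for free.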

Instructor Bios
Jimmy Lin is an assistant professor in the iSchool at the University of Maryland, College Park. He joined the faculty in 2004 after completing his Ph.D. in Electrical Engineering and Computer Science at MIT. Dr. Lin's research interests lie at the intersection of natural language processing and information retrieval. He leads the University of Maryland's effort in the Google/IBM Academic Cloud Computing Initiative. Dr. Lin has taught two semester-long Hadoop courses [2] and has given numerous talks about MapReduce to a wide audience.

Chris Dyer is a Ph.D. student at the University of Maryland, College Park, in the Department of Linguistics. His current research interests include statistical machine translation, machine learning, and the relationship between artificial language processing systems and the human linguistic processing system. He has served on program committees for AMTA, ACL, COLING, EACL, EMNLP, NAACL, IWSLT, and the ACL Workshops on Machine Translation, and is one of the developers of the Moses open-source machine translation toolkit. He has practical experience solving NLP problems with both the Hadoop MapReduce framework and Google's MapReduce implementation, which was made possible by an internship with Google Research in 2008.

Acknowledgments
This work is supported by NSF under awards IIS-0705832 and IIS-0836560; the Intramural Research Program of the NIH, National Library of Medicine; DARPA/IPTO Contract No. HR0011-06-2-0001 under the GALE program. Any opinions, findings, conclusions, or recommendations expressed here are the instructors' and do not necessarily reflect those of the sponsors. We are grateful to Amazon for its support of tutorial participants.

References
[1] Dean, Jeffrey and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), p. 137-150, 2004, San Francisco, California.
[2] Jimmy Lin. Exploring Large-Data Issues in the Curriculum: A Case Study with MapReduce. Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL-08) at ACL 2008, p. 54-61, 2008, Columbus, Ohio.