
Machine Learning for Natural Language Processing



1. Machine Learning for Natural Language Processing Seminar: Information Extraction Wednesday, 6th November 2013 Robin Tibor Schirrmeister

2. Outline • Informal definition with example • Five machine learning algorithms • Motivation and main idea • Example • Training and classification • Usage and assumptions • Types of machine learning • Improvement of machine learning systems

3. Informal Definition • Machine Learning means a program gets better by automatically learning from data • Let's see what this means by looking at an example

4. Named Entity Recognition Example • How to recognize mentions of persons in a text? • Angela Merkel made a decision. • Tiger Woods played golf. • Tiger ran into the woods. • Could use handwritten rules, but it might take a lot of time to define all rules and their interactions… So how to learn rules and their interactions?

5. Person Name Features • Create features for your ML algorithm • For a machine to learn if a word is a name, you have to define attributes or features that might indicate a name • Word itself • Word is a known noun like hammer, boat, tiger • Word is capitalized or not • Part of speech (verb, noun etc.) • … • A human has to design these features!
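A minimal sketch of what such hand-designed features might look like in code; the feature names and the tiny KNOWN_NOUNS list are illustrative assumptions, not from the slides:

```python
# Hypothetical feature extractor; KNOWN_NOUNS stands in for a real lexicon.
KNOWN_NOUNS = {"hammer", "boat", "tiger"}

def word_features(word, pos_tag):
    """Turn a word into the attributes a learner can use."""
    return {
        "word": word.lower(),                      # the word itself
        "is_known_noun": word.lower() in KNOWN_NOUNS,
        "is_capitalized": word[0].isupper(),       # capitalized or not
        "pos": pos_tag,                            # part of speech
    }

print(word_features("Tiger", "NNP"))
# {'word': 'tiger', 'is_known_noun': True, 'is_capitalized': True, 'pos': 'NNP'}
```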

6. Person Name Features • The ML algorithm learns typical features • What values of these features are typical for a person name? • Michael, Ahmed, Ina • Word is not a known noun • Word is capitalized • Proper noun • The machine should learn these typical values automatically!

7. Person Name Training Set • Create a training set • Contains words that are person names and words that are not person names

8. Naive Bayes Training • For each feature value, count how frequently it occurs in person names and in other words. • This tells you how likely it is that a random person name is capitalized. How could a person name not be capitalized? Think of noisy text like tweets.

9. Naive Bayes Training • Count the total frequency of the classes • Count how frequent person names are in general: how many words are part of person names and how many are not • Do the whole training the same way for words that are not person names

10. Naive Bayes Classification Example • Let's try to classify Tiger in "Tiger Woods played golf." • Assume 5% of words are parts of person names

11. Naive Bayes Classification • Classify a word by multiplying the class prior with the feature value probabilities: score(class) = P(class) · P(f1 | class) · … · P(fn | class), where fi is the value of the i-th feature • Higher score wins!

12. Naive Bayes Overview Training • For all classes • For all features and all possible feature values • Compute P(feature = value | class), e.g. the chance that a word is capitalized if it's a person name • Compute the total class probability P(class) Classification • For all classes compute the class score as P(class) · ∏i P(fi | class) • A data point is classified by the class with the highest score
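A bare-bones version of this counting-based training and classification, reusing the dictionary features sketched at slide 5; the four training examples are made up, and a real implementation would smooth the counts so unseen feature values don't zero out a score:

```python
from collections import Counter, defaultdict

# Made-up toy training set: feature dicts with their class labels.
training_data = [
    ({"is_capitalized": True,  "is_known_noun": False}, "name"),
    ({"is_capitalized": True,  "is_known_noun": True},  "name"),   # "Tiger" in Tiger Woods
    ({"is_capitalized": False, "is_known_noun": True},  "other"),  # "tiger" the animal
    ({"is_capitalized": False, "is_known_noun": False}, "other"),
]

# Training: count classes and feature values per class.
class_counts = Counter(label for _, label in training_data)
feature_counts = defaultdict(Counter)            # (class, feature) -> value counts
for feats, label in training_data:
    for f, v in feats.items():
        feature_counts[(label, f)][v] += 1

def classify(feats):
    """Score = P(class) * product of P(feature value | class); highest wins."""
    best = None
    for c, n_c in class_counts.items():
        score = n_c / len(training_data)         # class prior P(class)
        for f, v in feats.items():
            score *= feature_counts[(c, f)][v] / n_c
        if best is None or score > best[1]:
            best = (c, score)
    return best

print(classify({"is_capitalized": True, "is_known_noun": True}))
# ('name', 0.25)
```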

13. Naive Bayes Assumptions • The probability of one feature is independent of another feature when we know the class • When we know a word is part of a person name, the probability of capitalization is independent of the probability that the word is a known noun • This is not completely true: if the word is tiger, we already know it is a known noun • That's why it's called Naive Bayes • Naive Bayes often classifies well even if the assumption is violated!

14. Evaluation • Evaluate performance on a test set • Correct classification rate on a test set of words that were not used in training • The correct classification rate is not necessarily the most informative measure… • If we classify every word as Other and only have 5% person-name words, we get a 95% classification rate! • More informative measures exist • Words correctly and incorrectly classified as person names (true positives, false positives) • and as others (true negatives, false negatives)

15. Evaluation Metric • The best performance measure is in part subjective • Recall: Maybe you want to capture all persons occurring in a text, even at the cost of some non-persons, e.g. if you want to capture all persons mentioned in connection with a crime • Precision: Only want to capture words that are definitely persons, e.g. if you want to build a reliable list of talked-about persons
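Both metrics fall out of the four counts from the previous slide; the numbers here are made up:

```python
# Made-up example counts for a person-name classifier.
tp, fp, fn = 40, 10, 20     # correctly found names / false alarms / missed names

precision = tp / (tp + fp)  # of the words we called names, how many were names?
recall    = tp / (tp + fn)  # of the actual names, how many did we find?

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.80, recall = 0.67
```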

16. Interpretation • Feature value probabilities show each feature's contribution to the classification • Comparing the trained values P(value | class1) and P(value | class2) tells you if this feature value is more likely for class1 or class2 • P(capitalized | name) > P(capitalized | other) means capitalized words are more likely to be parts of person names • You can look at each feature independently

17. Machine Learning System Overview • Data Acquisition (training set, test set): Important to get data similar to the data you will classify! • Data Representation as Features: Important to have the information in the features that allows you to classify • Machine Learning Algorithm Training: Important to use an algorithm whose assumptions fit the data well enough • Performance Evaluation: Important to know what measure of quality you are interested in

18. Logistic Regression Motivation • Correlated features might disturb our classification • Tiger is always a known noun • Both features (known noun, word tiger) indicate that it's not a name • Since Naive Bayes ignores that the word tiger already determines that it is a known noun, it will underestimate the chance of tiger being a name • Modelling the relation from combinations of feature values to the class more directly might help

19. Logistic Regression Idea • Idea: Learn the weights together, not separately • Make all features numerical • Instead of Part of Speech = verb, noun etc.: • One feature for verb which is 0 or 1 • One feature for noun which is 0 or 1, etc. • Then you can take sums of these feature values * weights and learn the weights • The sum should be very high for person names and very small for non-person names • The weights will indicate how strongly a feature value indicates a person name • Correlated features can get appropriate, not too high weights, because they are learned together!

20. Logistic Regression • Estimate the probability for a class • Use the sum of a linear function chained to a link function: P(name) = σ(w · x) with σ(z) = 1 / (1 + e^(-z)) • The link function rises sharply from 0 to 1 around the class boundary

21. Example • Let's look at "Tiger Woods played golf." • Assume we learned some weights • An output > 0.5 => looks more like a name • An output of 0.62 can be interpreted as a 62% probability that it's a name
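A sketch of this computation with assumed weights; the slide's actual learned weights are not reproduced here, and these made-up values are chosen only so the example lands near the 0.62 mentioned above:

```python
import math

# Assumed (made-up) weights for three binary features plus a bias term.
weights = {"bias": -1.0, "is_capitalized": 0.9,
           "is_known_noun": -0.4, "next_word_capitalized": 1.0}

def p_name(feats):
    """Weighted sum of feature values, squashed by the logistic link function."""
    z = weights["bias"] + sum(weights[f] * v for f, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

# "Tiger" in "Tiger Woods played golf.": z = -1.0 + 0.9 - 0.4 + 1.0 = 0.5
print(round(p_name({"is_capitalized": 1, "is_known_noun": 1,
                    "next_word_capitalized": 1}), 2))   # 0.62 -> more like a name
```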

22. Training • Solve an equation system iteratively • From our training examples we get a system of equations: for each training example x with known class y, the model output σ(w · x) should equal y • σ(w · x1) = 0 • σ(w · x2) = 1 • … • The best fit cannot be computed directly; it is solved by iterative procedures (not our topic here) • Just have to know that the weights are estimated together!

23. Interpretation • Higher weights mean the probability of yes (1) is increased by the corresponding feature • Weights have to be interpreted together (no conditional independence assumed) • Suppose we have the feature word preceded by Dr. and another feature word preceded by Prof. Dr. • But in all our texts there are only Prof. Dr., so both features will always have the same value! • Then any two weight assignments with the same sum (e.g. all weight on one feature, or split across both) lead to the same predictions • Also, weights are affected by how big and how small the feature value range is

24. Naive Bayes vs. Logistic Regression (Ng and Jordan, 2002) • Logistic Regression is better with more data, Naive Bayes is better with less data • Naive Bayes reaches its optimum faster • Logistic Regression has better optimal classification

25. Support Vector Machines • [Figure: points "Tiger as Animal" and "Tiger as Name" plotted against the axes "animal and plant words in sentence" and "sports references in sentence"] • A Support Vector Machine tries to separate true and false examples by a big boundary

26. Training • Soft margin for inseparable data • In practice, examples are usually not perfectly separable • The soft margin allows for wrong classifications • A parameter adjusts the tradeoff between: • Data points should be on the correct side of the boundary and outside of the margin • The margin should be big • Specialized optimization algorithms for SVMs exist
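A minimal soft-margin SVM sketch with scikit-learn (assumed available); C is the tradeoff parameter described above, and the four data points mirror the figure's two axes but are made up:

```python
from sklearn.svm import SVC

# Made-up points: [animal/plant words in sentence, sports references in sentence]
X = [[3, 0], [4, 1], [0, 3], [1, 4]]
y = [0, 0, 1, 1]              # 0 = "Tiger as Animal", 1 = "Tiger as Name"

# Small C -> prefer a big margin; large C -> prefer fewer training errors.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.predict([[0, 5]]))  # many sports references -> [1], likely a name
print(clf.support_vectors_)   # only these points define the boundary
```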

27. Usage • SVMs are used often and very successfully, very popular • Very robust and fast; only the support vectors are needed for the classification • => robust against missing data

28. Interpretation • [Figure: the same scatter plot as slide 25, with the separating hyperplane drawn in] • The hyperplane can tell you which features matter more for the classification

29. Decision Trees • [Figure: example tree splitting on the word itself (tiger, spears, michael), the number of sports references (< 2 / >= 2) and capitalization (yes / no), with leaves labeled Name or Other (e.g. 90% Name)] • Recursively separate the data by features that split it well into different classes

30. Decision Trees Training • Start with all training examples at the root node • Pick a feature to split the training examples into the next subtrees • Pick a feature so that the training examples in one subtree are mostly from one class • Recursively repeat the procedure on the subtrees • Finished when a subtree only contains examples from one class (convert it to a leaf, e.g. name) • Or most examples from one class (using some predefined threshold)

31. Decision Trees Usage • Useful especially if you assume some feature interactions • Also useful for some non-linear relationships of features to the classification • Word shorter than 3 characters: unlikely to be a name • Word between 3 and 10 characters: might be a name • Word longer than 10 characters: unlikely to be a name • Often many trees are used together as forests (ensemble methods) • The learning of a single tree is very clear to interpret • For forests, methods exist to determine feature importance
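A sketch with scikit-learn (assumed available): a single shallow tree, then a forest with its feature importances; the six examples and the three features (word length, sports references, capitalization) are made up:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up features: [word length, sports references in sentence, is capitalized]
X = [[5, 3, 1], [8, 0, 1], [5, 0, 0], [12, 1, 0], [4, 2, 1], [6, 0, 0]]
y = ["name", "name", "other", "other", "name", "other"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict([[5, 2, 1]]))        # e.g. ['name']

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)      # which features matter across the forest
```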

32. Conditional Random Fields Motivation • Compare • Tiger Woods played golf. • Tiger ran into the woods. • We still want to know for both occurrences of Tiger if it is a name. There is one helpful characteristic of these sentences we did not use. Can you guess what it is? … • Tiger and Woods could both be names, and two parts of a name standing together are more likely than one part of a name by itself.

33. Conditional Random Fields Idea • To classify a data point, use the surrounding classifications and data points • E.g. use the fact that names often stand together • (Sequential) input -> sequential output • We only use neighbouring classifications (linear-chain CRFs)

34. Conditional Random Fields Sketch • [Figure: a linear-chain CRF diagram with feature functions producing a score for each output value] • Use feature functions to determine the probability of a class sequence

35. Feature Functions • Feature functions for linear-chain CRFs can use • the complete sentence • the current position (word) in the sentence • the class of the output node before • the class of the current output node • They return a real value • Each feature function is multiplied by a weight that needs to be learned

36. Examples • Washington Post wrote about Mark Post. • City + Post is usually a newspaper, First Name + Post is more likely to be a name • Dr. Woods met his client. • A salutation (Mr./Dr. etc.) is usually followed by a name • A feature function does not have to use all inputs • E.g. a feature function can just look at: is the word capitalized, what is the part of speech of the next word, etc.

37. Usage • Define feature functions • Learn weights for the feature functions • Classify • Find the sequence that maximizes the sum of weights * feature functions • Can be done in polynomial time with dynamic programming • Used a lot for NLP tasks like named entity recognition and parts of speech in noisy text
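The dynamic-programming step is the Viterbi algorithm. Below is a bare-bones sketch in which a single hand-written score function stands in for the learned sum of weights * feature functions; the two classes, the scoring rules, and their values are made up for illustration:

```python
CLASSES = ["Name", "Other"]

def score(sentence, i, prev_class, cur_class):
    """Toy stand-in for the learned weighted feature functions."""
    s = 0.0
    if sentence[i][0].isupper() and cur_class == "Name":
        s += 1.0        # capitalized words look like names
    if not sentence[i][0].isupper() and cur_class == "Name":
        s -= 2.0        # lowercase words rarely are
    if prev_class == "Name" and cur_class == "Name":
        s += 0.5        # names often stand together
    return s

def viterbi(sentence):
    """best[i][c] = (best score of a sequence ending in class c at i, that sequence)."""
    best = [{c: (score(sentence, 0, None, c), [c]) for c in CLASSES}]
    for i in range(1, len(sentence)):
        best.append({})
        for c in CLASSES:
            p = max(CLASSES, key=lambda q: best[i - 1][q][0] + score(sentence, i, q, c))
            s, path = best[i - 1][p]
            best[i][c] = (s + score(sentence, i, p, c), path + [c])
    return max(best[-1].values())[1]

print(viterbi("Tiger Woods played golf .".split()))
# ['Name', 'Name', 'Other', 'Other', 'Other']
```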

38. Algorithm Characteristics • Assumptions on the data • Linear relationship to the output / non-linear • Interpretability of the learning • Meaning of feature weights etc. • Computational time and space • Type of input and output • Categorical, numerical • Single data points, sequences

39. Supervised/Unsupervised • Unsupervised algorithms for learning without known classes • The algorithms so far were supervised algorithms • We had preclassified training data • Sometimes we might need unsupervised algorithms • Right now, what kinds of topics are covered in news articles? • They often work by clustering • Similar data gets assigned the same class, e.g. texts with similar words may refer to the same news topic
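A small clustering sketch with scikit-learn (assumed available): group texts by word overlap without any labels; the snippets and the choice of two clusters are made up:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up news snippets; no class labels anywhere.
texts = [
    "Tiger Woods wins the golf tournament",
    "Golf championship ends in a playoff",
    "Election results announced by the government",
    "Parliament debates the new government budget",
]

X = TfidfVectorizer().fit_transform(texts)      # texts as weighted word features
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 1 1]: a sports cluster and a politics cluster
```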

40. Semi-Supervised • Create more training data automatically • A big amount of training data is important for good classification • Creating training data by hand is time-demanding • Your unsupervised algorithm already gives you data points with classes • Other simple rules can also give you training data, e.g. Dr. is almost always followed by a name • New data you classified with high confidence can also be used as training data
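A self-training sketch along these lines: retrain while moving confident predictions into the training set. The toy data, the choice of logistic regression, and the 0.6 confidence threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up labeled and unlabeled two-feature data points.
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_train = np.array([1, 1, 0, 0])
X_unlabeled = np.array([[0.95, 0.05], [0.05, 0.95], [0.5, 0.5]])

for _ in range(3):                            # a few self-training rounds
    if len(X_unlabeled) == 0:
        break
    model = LogisticRegression().fit(X_train, y_train)
    probs = model.predict_proba(X_unlabeled)
    confident = probs.max(axis=1) > 0.6       # keep only confident predictions
    if not confident.any():
        break
    new_labels = model.classes_[probs.argmax(axis=1)[confident]]
    X_train = np.vstack([X_train, X_unlabeled[confident]])
    y_train = np.concatenate([y_train, new_labels])
    X_unlabeled = X_unlabeled[~confident]

print(len(y_train), "training examples after self-training")
```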

41. Improvement of the ML System • It is important to know if and how you can improve your machine learning system • Maybe in your overall (NLP) system there are bigger sources of error than your ML system • Maybe from the current data it is impossible to learn more than your algorithm does • You can try to: • Get more data • Use different features; also maybe preprocess more • Use different algorithms; also different combinations

42. Machine Learning in NLP • Very widely used • Makes it easier to create systems that deal with new/noisy text, for example tweets or free text on medical records • It can be easier to specify features that may be important and learn the classification automatically than to write all rules by hand

43. Summary • Typical machine learning consists of data acquisition, feature design, algorithm training and performance evaluation • Many algorithms exist with different assumptions on the data • Important to know whether your assumptions match your data • Important to know what the goal of your overall system is

44. Helpful Resources • Wikipedia • Course explaining ML including logistic regression and SVMs • Another similar one, slides freely available • Lecture about Naive Bayes for document classification • Intuitive introduction to CRFs • Guide to choosing an ML classifier

45. References • Ng, Andrew Y., and Michael I. Jordan. "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes." Advances in Neural Information Processing Systems 14 (2002).
