Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources

Darja Fišer, Senja Pollak, Špela Vintar. University of Ljubljana, Dept. of Translation Studies. {darja.fiser, spela.vintar}@guest.arnes.si, senja.pollak@ff.uni-lj.si

Presentation Transcript


  1. Darja Fišer, Senja Pollak, Špela Vintar. University of Ljubljana, Dept. of Translation Studies. {darja.fiser, spela.vintar}@guest.arnes.si, senja.pollak@ff.uni-lj.si. Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources

  2. Aim
     • Extract definitions of specialised concepts from texts (journals, textbooks etc.).
     • Use Wikipedia to learn rules that help distinguish between proper definitions and non-definitions.
     • Extract candidate sentences from texts using 3 approaches:
       • patterns (A cell is the smallest living unit in an organism)
       • automatic term recognition
       • wordnet
     • Apply rules to select good definitions and discard non-definitions.
     LREC2010 Malta

  3. Learning rules from Wikipedia (figure labels: title, definition, non-definition)

  4. Learning rules from Wikipedia
     • Slovene Wikipedia (December 2009): 162,500 articles
     • only well-formed pages retained
     • morphosyntactic annotation and lemmatization with ToTaLe (Erjavec et al. 2005)
     • structural parsing: 19,964 instances
     • building a classification model in Weka (Witten and Frank 2005)
     • features: most frequent PoS and lemmata
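The feature selection step on this slide (most frequent lemmata) could look roughly like the sketch below. It is only an illustration: the function name `top_lemmas`, the cut-off `k` and the toy sentences are assumptions, not the authors' setup, and the real pipeline works on ToTaLe output rather than hand-written lists.

```python
from collections import Counter

def top_lemmas(sentences, k):
    """Return the k most frequent lemmas across all sentences.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(lemma for sent in sentences for lemma in sent)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [lemma for lemma, _ in ranked[:k]]

# Hypothetical lemmatized sentences (not taken from the Wikipedia data).
corpus = [
    ["celica", "biti", "enota", "organizem"],
    ["ekvator", "biti", "vzporednik"],
    ["diabetes", "biti", "bolezen"],
]
vocab = top_lemmas(corpus, 2)
print(vocab)
```

The same counting could be applied to PoS tags instead of lemmas to obtain the second feature group.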

  5. Learning rules - Results
     • best: J48 decision tree classifier
     • experimenting with full and merged PoS, absolute frequency (AF) and binary values
     • 10-fold cross-validation
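To make the two feature encodings and the evaluation scheme on this slide concrete, here is a minimal sketch in plain Python. The names `vectorize` and `k_fold_indices` are hypothetical, and the fold split is a simple unshuffled, unstratified one; it does not reproduce how Weka builds its folds or how J48 is trained.

```python
from collections import Counter

def vectorize(lemmas, vocab, binary=False):
    """Encode a sentence over a fixed feature vocabulary.

    binary=False gives absolute frequencies (AF); binary=True gives 0/1 presence.
    """
    counts = Counter(lemmas)
    if binary:
        return [1 if counts[f] > 0 else 0 for f in vocab]
    return [counts[f] for f in vocab]

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; fold i holds every k-th item starting at i."""
    for i in range(k):
        test = list(range(i, n, k))
        train = [j for j in range(n) if j % k != i]
        yield train, test

# Hypothetical vocabulary and sentence.
vocab = ["biti", "bolezen", "enota"]
sentence = ["biti", "ki", "biti", "bolezen"]
af_vector = vectorize(sentence, vocab)
binary_vector = vectorize(sentence, vocab, binary=True)
folds = list(k_fold_indices(25, k=10))
```

Each instance ends up in exactly one test fold, so every sentence is classified once by a model that did not see it during training.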

  6. Extracting definitions from texts: Resources
     • “unstructured texts”: subset of the FidaPlus corpus (http://www.fidaplus.net)
     • knowledge-rich: textbooks, popular science volumes (e.g. “All about mushrooms”)
     • various domains: astronomy, physics, geography, botany ...
     • sloWNet – Slovene wordnet (Fišer 2007, http://lojze.lugos.si/~darja/slownet.html)
     • Automatic term recognition system for Slovene (Vintar 2004, http://lojze.lugos.si/cgitest/extract.cgi)

  7. Extracting definitions from texts: 1. Using wordnet hyperonymy
     • The sentence is a definition candidate if:
       • the sentence starts with a sloWNet literal and contains at least one more literal from the same hyperonymy chain (i.e. its hyponym or its hypernym)
     <term id=ENG20-13313485-n>Diabetes</term> je <term id=ENG20-13268088-n>bolezen</term>, ki je posledica pomanjkanja inzulina, hormona, ki skrbi, da celice v telesu dobivajo glukozo (sladkor).
     [Diabetes is a disease resulting from insulin deficiency, the hormone providing glucose (sugar) for body cells.]
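The wordnet rule above can be sketched as follows. This is a simplified illustration under several assumptions: the hierarchy is a toy dictionary mapping a lemma to a single hypernym (sloWNet itself is organised by synset IDs), only single-word literals are handled, and "same hyperonymy chain" is read transitively. The names `ancestors`, `is_wordnet_candidate` and `hypernym_of` are hypothetical.

```python
def ancestors(lemma, hypernym_of):
    """Collect all hypernyms of a lemma by walking up the toy hierarchy."""
    seen = []
    while lemma in hypernym_of and hypernym_of[lemma] not in seen:
        lemma = hypernym_of[lemma]
        seen.append(lemma)
    return seen

def is_wordnet_candidate(lemmas, hypernym_of):
    """True if the sentence starts with a literal and contains another
    literal that is its hypernym or hyponym (directly or transitively)."""
    if not lemmas:
        return False
    literals = set(hypernym_of) | set(hypernym_of.values())
    first = lemmas[0]
    if first not in literals:
        return False
    up = set(ancestors(first, hypernym_of))
    for other in lemmas[1:]:
        if other == first or other not in literals:
            continue
        if other in up or first in ancestors(other, hypernym_of):
            return True
    return False

# Toy hierarchy (assumed for illustration, not extracted from sloWNet).
hypernym_of = {"diabetes": "bolezen", "bolezen": "stanje", "inzulin": "hormon"}
diabetes_sentence = ["diabetes", "biti", "bolezen", "ki", "biti",
                     "posledica", "pomanjkanje", "inzulin"]
print(is_wordnet_candidate(diabetes_sentence, hypernym_of))
```

In the Diabetes example the first lemma and "bolezen" sit on one chain, so the sentence is kept; "inzulin" belongs to a different chain and does not contribute.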

  8. Extracting definitions from texts: 2. Using automatic term recognition
     • The sentence is a definition candidate if:
       • the sentence contains at least two domain-specific terms and the first term is in the nominative case
     <term score="80.45">Ekvator</term> je najdaljši vzporednik, ki deli Zemljo na severno in <term score="43.21">južno poloblo</term>.
     [The Equator is the longest circle of latitude dividing the Earth into the Northern and the Southern Hemispheres.]
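A minimal sketch of the term-based rule, assuming the term recogniser and tagger have already produced a list of terms in sentence order. The `Term` tuple, the case labels and the function name `is_atr_candidate` are hypothetical; the scores are the ones shown in the example, and the case values are added here for illustration only.

```python
from typing import List, NamedTuple

class Term(NamedTuple):
    text: str
    score: float   # termhood score assigned by the term recogniser
    case: str      # grammatical case of the term's head (assumed annotation)

def is_atr_candidate(terms: List[Term], min_terms: int = 2) -> bool:
    """True if the sentence has at least min_terms recognised terms
    and the first of them is in the nominative case."""
    if len(terms) < min_terms:
        return False
    return terms[0].case == "nominative"

# Terms from the Equator example; case labels are illustrative.
equator_terms = [
    Term("Ekvator", 80.45, "nominative"),
    Term("južno poloblo", 43.21, "accusative"),
]
print(is_atr_candidate(equator_terms))
```

A score threshold could be added as a further filter, but the slide does not mention one, so the sketch leaves it out.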

  9. Extracting definitions from texts: 3. Using patterns
     • The sentence is a definition candidate if:
       • the sentence contains a defining morphosyntactic pattern (NP[nominative] is_a NP[nominative]).
     NP is_a NP: Celica je strukturna in funkcionalna enota vseh živih organizmov.
     [A cell is a structural and functional unit of all living organisms.]
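One way to express the NP[nominative] is_a NP[nominative] pattern is a regular expression over the tag sequence of a sentence. The sketch below assumes a heavily simplified, hypothetical tag set (`N-nom`, `A-nom`, `V-be`, `C`, ...) rather than the tagger's real morphosyntactic descriptions, and it only allows adjectives and conjunctions as modifiers inside the second NP.

```python
import re

# noun in nominative, a form of "to be", optional nominative adjectives
# or conjunctions, then another noun in nominative
IS_A_PATTERN = re.compile(r"(?:^| )N-nom V-be (?:(?:A-nom|C) )*N-nom(?= |$)")

def matches_is_a(tags):
    """True if the space-joined tag sequence contains the is_a pattern."""
    return IS_A_PATTERN.search(" ".join(tags)) is not None

# "Celica je strukturna in funkcionalna enota vseh živih organizmov."
celica_tags = ["N-nom", "V-be", "A-nom", "C", "A-nom", "N-nom",
               "A-gen", "A-gen", "N-gen"]
print(matches_is_a(celica_tags))
```

Matching on tags rather than word forms keeps the pattern independent of the vocabulary, which matters for a morphologically rich language such as Slovene.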

  10. Results
     • manual evaluation of all definition candidates
     • sloWNet: best precision, ATR: best recall
     • what is a definition?
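For reference, precision and recall of an extraction approach against manually judged definitions can be computed as in the sketch below. The sentence IDs and the two candidate sets are invented for illustration and do not reflect the reported results.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted items against gold items."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical sentence IDs judged to be definitions by a human evaluator.
gold = {1, 2, 3, 4}
narrow = {1, 2}               # a strict approach: few candidates, all correct
broad = {1, 2, 3, 5, 6, 7}    # a permissive approach: more hits, more noise

print(precision_recall(narrow, gold))
print(precision_recall(broad, gold))
```

The toy sets show the usual trade-off: the strict approach scores higher on precision, the permissive one on recall.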

  11. Classification accuracy
     For definitions only: (results table not included in the transcript)

  12. Which is the “best” definition?
     • The Equator is an imaginary line on the Earth's surface equidistant from the North Pole and South Pole that divides the Earth into a Northern Hemisphere and a Southern Hemisphere.
     • An equator is the intersection of a sphere's surface with the plane perpendicular to the sphere's axis of rotation and containing the sphere's center of mass.
     • The longest of the five main circles of latitude on Earth (the others being the Arctic and Antarctic Circles and the Tropics of Cancer and Capricorn) is called the Equator.

  13. Definitions depend on context... and may span over several sentences
     • Head lice are parasites that live in the hair and scalp of humans.
     • HEAD LICE, also called Pediculus Humanus Capitis are small blood-sucking, wingless insects found on the human scalp. They are approximately the size of a sesame seed and cannot jump or fly. They are six-legged creatures with claws, which help them cling to and crawl through human hair.
     • Head lice are an emerging social problem, not only in economically poor countries but also in practically all other societies.

  14. Conclusions & future work
     • Wikipedia can help us learn the properties of definitions.
     • Knowledge-rich texts are a good source of definitions.
     • A semantically-rich approach (using wordnet and ATR) yields many definitions and defining contexts.
     • Defining a definition is hard...
     • Encyclopaedic definitions differ from those found in running texts.
     • Future work:
       • use other features in learning,
       • use active learning,
       • redefine definitions and possibly re-evaluate definition candidates
