Supporting e-learning with automatic glossary extraction: Experiments with Portuguese

Rosa Del Gaudio, António Branco. RANLP, Borovets 2007.


Presentation Transcript


  1. Rosa Del Gaudio, António Branco. RANLP, Borovets 2007. Supporting e-learning with automatic glossary extraction: Experiments with Portuguese

  2. Presentation Plan: LT4eL project; ILIAS; Corpus; Tool; Grammars; Copula; Other Verbs; Punctuation; Results; Conclusion

  3. LT4eL: Improve retrieval and accessibility of learning objects (LO) in learning management systems. Employ language technology resources and tools for the semi-automatic generation of descriptive metadata. Develop new functionalities, such as a keyword extractor, a glossary candidate detector, and semantic search, tuned for the languages addressed in the project (Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian).

  4. ILIAS

  5. Objective: Build a glossary automatically to support the e-learning process. In practice, this means extracting definitions from unstructured text (scientific papers, encyclopedias, web pages). Benefits: better access to information for students; faster work for tutors.

  6. ILIAS: Glossary Candidate Detector

  7. The Corpus • 274,000 tokens • Tutorials • PhD theses • Scientific papers • 3 domains evenly represented • e-learning • Technology for non-experts • Calimera

  8. XML format
     <definingText continue="y" def="m147" def_type1="is_def" id="d5">
       <markedTerm dt="y" id="m147" kw="y">
         <tok base="intranet" class="word" ctag="PNM" id="t9032" sp="y">Intranet</tok>
       </markedTerm>
       <tok base="ser" class="word" ctag="V" id="t9033" msd="pi-3s" sp="y">é</tok>
       <tok base="uma" class="word" ctag="UM" id="t9034" msd="fs" sp="y">uma</tok>
       <tok base="rede" class="word" ctag="CN" id="t9035" msd="fs" sp="y">rede</tok>
       <tok base="desenvolver,desenvolvido" class="word" ctag="PPA" id="t9036" msd="fs" sp="y">desenvolvida</tok>
       <tok base="para" class="word" ctag="PREP" id="t9037" sp="y">para</tok>
       <tok base="processamento" class="word" ctag="CN" id="t9038" msd="ms" sp="y">processamento</tok>
       <tok base="de" class="word" ctag="PREP" id="t9039" sp="y">de</tok>
       <tok base="informação" class="word" ctag="CN" id="t9040" msd="fp" sp="y">informações</tok>
       <tok base="em" class="word" ctag="PREP" id="t9041" sp="y">em</tok>
       <tok base="uma" class="word" ctag="UM" id="t9042" msd="fs" sp="y">uma</tok>
       <tok base="empresa" class="word" ctag="CN" id="t9043" msd="fs" sp="y">empresa</tok>
       <tok base="ou" class="word" ctag="CJ" id="t9044" sp="y">ou</tok>
       <tok base="organização" class="word" ctag="CN" id="t9045" msd="fs">organização</tok>
       <tok class="punctuation" ctag="PNT" id="t9046" sp="y">.</tok>
     </definingText>
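As a rough illustration of how this annotation can be consumed, the sketch below reads a shortened copy of the sentence with Python's standard `xml.etree.ElementTree` and pulls out the definition type, the marked term, and the token annotations. The function name `read_definition` and the trimmed `SAMPLE` string are assumptions made for this example; they are not part of the LT4eL tooling.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotated sentence on the slide (some attributes
# and tokens are trimmed for brevity).
SAMPLE = """
<definingText def="m147" def_type1="is_def" id="d5">
  <markedTerm dt="y" id="m147" kw="y">
    <tok base="intranet" ctag="PNM" id="t9032">Intranet</tok>
  </markedTerm>
  <tok base="ser" ctag="V" id="t9033" msd="pi-3s">é</tok>
  <tok base="uma" ctag="UM" id="t9034" msd="fs">uma</tok>
  <tok base="rede" ctag="CN" id="t9035" msd="fs">rede</tok>
  <tok ctag="PNT" id="t9046">.</tok>
</definingText>
"""


def read_definition(xml_text):
    """Return (definition type, marked term, token annotations).

    Hypothetical helper: each token annotation is a (form, base, ctag) tuple.
    """
    root = ET.fromstring(xml_text.strip())
    # The defined term is the text of the tokens inside <markedTerm>.
    term = " ".join(t.text for t in root.findall("./markedTerm/tok"))
    # iter() walks all <tok> elements in document order, nested or not.
    tokens = [(t.text, t.get("base"), t.get("ctag")) for t in root.iter("tok")]
    return root.get("def_type1"), term, tokens


def_type, term, tokens = read_definition(SAMPLE)
print(def_type, term, len(tokens))
```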

  9. LxTransduce: Input: plain text or XML; regular expressions; substitution and markup; output: the same file with changes; matches trees using elements; quick, Unicode-friendly freeware; easy to integrate into other tools (Java)

  10. Rules in lxtransduce
      <rule name="Conj">
        <query match="tok[@ctag = 'CJ']"/>
      </rule>
      <rule name="Coor"> <!-- Conjunctions or comma -->
        <first>
          <query match="tok[. = ',']"/>
          <ref name="Conj" mult="+"/>
        </first>
      </rule>
      <rule name="PARopen">
        <query match="tok[.~'^\($']"/>
      </rule>
      <rule name="PARcl">
        <query match="tok[.~'^\)$']"/>
      </rule>
      <rule name="parenthetic">
        <seq>
          <ref name="PARopen"/>
          <repeat-until name="tok">
            <ref name="PARcl"/>
          </repeat-until>
          <ref name="PARcl"/>
        </seq>
      </rule>

  11. First development phase • Less than 50% of the corpus • Focus on the verb • Precision: correct automatic / all automatic • Recall: correct automatic / manually marked • F2 = 3 × (precision × recall) / (2 × precision + recall)

      Grammar  Precision  Recall  F2
      Gr 00    0.14       0.44    0.26
      Gr 01    0.31       0.20    0.22
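The F2 formula on this slide is easiest to read with explicit parentheses, 3 × P × R / (2 × P + R), which weights recall more heavily than precision. The sketch below is only an illustration of that reading; the function names and the count-based signature are assumptions, not code from the project.

```python
def f2_score(precision, recall):
    """F2 as read from the slide: 3 * P * R / (2 * P + R)."""
    denominator = 2 * precision + recall
    if denominator == 0:
        return 0.0  # no correct extractions at all
    return 3 * precision * recall / denominator


def evaluate(correct_auto, all_auto, manually_marked):
    """Precision, recall and F2 from raw counts (hypothetical helper).

    correct_auto:     automatically extracted definitions that are correct
    all_auto:         all automatically extracted definitions
    manually_marked:  definitions marked by hand in the corpus
    """
    precision = correct_auto / all_auto if all_auto else 0.0
    recall = correct_auto / manually_marked if manually_marked else 0.0
    return precision, recall, f2_score(precision, recall)


# Precision and recall reported for Gr 00 on the slide.
print(round(f2_score(0.14, 0.44), 2))
```

With the Gr 00 values (precision 0.14, recall 0.44) this reading gives about 0.26, in line with the table.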

  12. Second development phase • 75% of the corpus for development • 25% of the corpus for testing • Specific grammar/rules for each type

  13. Copula baseline grammar
      <rule name="euristic">
        <seq>
          <repeat-until name="tok">
            <ref name="SERdef" mult="+"/>
          </repeat-until>
          <ref name="SERdef" mult="+"/>
          <not>
            <ref name="PPA"/>
          </not>
          <ref name="tok" mult="*"/>
          <end/>
        </seq>
      </rule>
      Verb "to be", third person singular or plural, present indicative:
      <rule name="SERdef">
        <best>
          <ref name="Ser3"/>
          <ref name="PoderSer"/>
        </best>
      </rule>

  14. Copula base result • Sentence level results • Problem with precision

  15. Copula Grammar

  16. Rules for is_type
      <!-- To Be 3rd person pl and s -->
      <rule name="SERdef">
        <query match="tok[@ctag = 'V' and @base='ser' and (@msd[starts-with(.,'fi-3')] or @msd[starts-with(.,'pi-3')])]"/>
      </rule>
      ....
      <rule name="copula1">
        <seq>
          <ref name="SERdef"/>
          <best>
            <seq>
              <ref name="Art"/>
              <ref name="adj|adv|prep|" mult="*"/>
              <ref name="Noun" mult="+"/>
            </seq>
            ....
          </best>
          <ref name="tok" mult="*"/>
          <end/>
        </seq>
      </rule>
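To make the idea behind the is_type rule concrete, here is a minimal Python sketch of the same pattern over already tagged tokens: a third-person form of "ser", followed by an article, optional modifiers, and a noun. It is not lxtransduce and does not reproduce the full grammar. The tags `V`, `UM`, `CN`, `PREP` and `PPA` appear in the XML example on the earlier slide; `DA`, `ADJ` and `ADV` are assumed tag names, and the function names are hypothetical.

```python
ARTICLES = {"UM", "DA"}             # "UM" is on the slides; "DA" is an assumed tag
MODIFIERS = {"ADJ", "ADV", "PREP"}  # "PREP" is on the slides; the others are assumed
NOUNS = {"CN"}


def tok(form, base, ctag, msd=""):
    """Build one token record with the attributes used by the rules."""
    return {"form": form, "base": base, "ctag": ctag, "msd": msd}


def is_ser_3rd(token):
    """Verb 'ser' whose msd starts with 'pi-3' or 'fi-3', as in the query."""
    return (token["ctag"] == "V" and token["base"] == "ser"
            and token["msd"].startswith(("pi-3", "fi-3")))


def matches_copula(tokens):
    """True if some 'ser' form is followed by article, modifiers*, noun."""
    for i, token in enumerate(tokens):
        if not is_ser_3rd(token):
            continue
        j = i + 1
        if j >= len(tokens) or tokens[j]["ctag"] not in ARTICLES:
            continue  # e.g. 'ser' followed by a participle is rejected here
        j += 1
        while j < len(tokens) and tokens[j]["ctag"] in MODIFIERS:
            j += 1
        if j < len(tokens) and tokens[j]["ctag"] in NOUNS:
            return True
    return False


definition = [tok("Intranet", "intranet", "PNM"), tok("é", "ser", "V", "pi-3s"),
              tok("uma", "uma", "UM", "fs"), tok("rede", "rede", "CN", "fs")]
print(matches_copula(definition))
```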

  17. Comparing Results: Include the patterns that were previously excluded. Gather the syntactic patterns of non-definitions and compare them with the syntactic patterns of definitions.

  18. Other_Verbs grammar: Collect verbs in a lexicon. Three different categories: reflexive, active, passive. 22 different verbs.
      <lex word="chamar">
        <cat>ref</cat>
      </lex>
      <lex word="chamar,chamado">
        <cat>pas</cat>
      </lex>
      <rule name="Vpas">
        <seq>
          <ref name="tok"/>
          <not>
            <ref name="not"/>
          </not>
          <ref name="tok" mult="?"/>
          <query match="tok[mylex(@base) and (@ctag='PPA')]" constraint="mylex(@base)/cat='pas'"/>
        </seq>
      </rule>
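The lexicon constraint in the Vpas rule can be pictured as a dictionary lookup: a token counts as a passive definitional verb only if it is tagged as a past participle (PPA) and its base form is listed in the lexicon with category `pas`. The sketch below covers only that final check, not the rest of the rule. The two entries are the ones shown on the slide; the function name is an assumption.

```python
# Lexicon entries copied from the slide: base form -> category
# ("ref" = reflexive, "pas" = passive).
VERB_LEXICON = {
    "chamar": "ref",
    "chamar,chamado": "pas",
}


def is_passive_definition_verb(base, ctag, lexicon=VERB_LEXICON):
    """Mirror of the final query in Vpas: PPA tag plus a 'pas' lexicon entry."""
    return ctag == "PPA" and lexicon.get(base) == "pas"


print(is_passive_definition_verb("chamar,chamado", "PPA"))
print(is_passive_definition_verb("desenvolver,desenvolvido", "PPA"))
```

Participles whose base form is missing from the lexicon, such as "desenvolvida" in the earlier XML example, are rejected even though they carry the PPA tag.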

  19. Results for verb_type • Analyze each verb separately, as with is_type • Richer syntactic patterns

  20. Punctuation Grammar • Preliminary work • Definitions introduced by a colon (the most frequent case)
      <rule name="punct_def">
        <seq>
          <start/>
          <ref name="CompmylexSN" mult="+"/>
          <query match="tok[.~'^:$']"/>
          <ref name="tok" mult="+"/>
          <end/>
        </seq>
      </rule>
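A simplified way to picture the punctuation pattern: the sentence opens with a noun-phrase-like term, a colon follows, and everything after the colon is taken as the definition. The Python sketch below approximates the `CompmylexSN` reference, whose definition is not shown on the slides, with an assumed set of part-of-speech tags. The function name and the example sentence are invented for illustration.

```python
# Tags allowed inside the term before the colon. This set is an assumption
# standing in for the CompmylexSN rule, which the slides do not define.
NP_TAGS = {"CN", "PNM", "ADJ", "UM", "DA", "PREP"}


def split_colon_definition(tokens):
    """tokens: list of (form, ctag) pairs for one sentence.

    Return (term, definition) if the sentence looks like 'term : definition',
    otherwise None.
    """
    forms = [form for form, _ in tokens]
    if ":" not in forms:
        return None
    idx = forms.index(":")  # first colon in the sentence
    head, tail = tokens[:idx], tokens[idx + 1:]
    if not head or not tail:
        return None  # need at least one token on each side
    if any(ctag not in NP_TAGS for _, ctag in head):
        return None  # something other than a noun phrase precedes the colon
    return (" ".join(form for form, _ in head),
            " ".join(form for form, _ in tail))


# Made-up example sentence, not taken from the corpus.
sentence = [("Intranet", "PNM"), (":", "PNT"), ("rede", "CN"), ("interna", "ADJ")]
print(split_colon_definition(sentence))
```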

  21. All-in-one • Combination of the previous grammars • The definition type is not taken into account when calculating precision and recall

  22. Conclusions and Future Work: Overall results: recall 86%, precision 14%. Differences among domains: the style of a document influences the results. Future work: improve the rules for verb_type and punc_type; combine with other techniques such as machine learning (ML).
