A Preliminary Investigation into the Automatic EuroVoc Indexing of Greek Documents

Presentation Transcript


  1. A Preliminary Investigation into the Automatic EuroVoc Indexing of Greek Documents. Eleni Galiotou, Dept. of Informatics, Technological Educational Institute of Athens. PCI 2014, Athens, Oct. 2-4, 2014.

  2. Keyword Identification
     • Search and retrieval in modern information retrieval systems rely on:
       • Keywords
       • Key-phrases
     • Keywords (Lancaster 1998):
       • Keyword extraction
       • Keyword assignment
         • Keywords are identified in a controlled vocabulary list (thesaurus)
         • Descriptors do not necessarily appear explicitly in the text

  3. Thesauri
     • Natural language
       • WordNet (English): lexical semantic relations structured around an exhaustive list of synonym sets
       • EuroWordNet (European languages)
       • BalkaNet (Balkan languages)
     • Conceptual
       • Descriptors: abstract conceptual terms
       • EuroVoc
         • Multilingual, multidisciplinary thesaurus
         • Covers the activities of the EU (in particular those of the European Parliament)
         • Used by the European Parliament, the European Commission’s Publications Office and many other institutions

  4. EuroVoc
     • Terms in at least 27 languages (translated one-to-one)
       • 23 official languages of the EU (Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish)
       • Basque, Catalan, Russian, Serbian and other non-official translations
     • Over 6,700 classes organized hierarchically into eight levels
       • Relations: Broader Term, Narrower Term and Related Term (BT/NT/RT)
     • Fields covered: law, politics, finance, social issues, transport, environment, geography, science, organizations, etc.
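The structure described on this slide (one concept, one label per language, BT/NT/RT links) can be pictured with a minimal sketch. The concept identifiers `c1` to `c3`, the chosen labels and the links between them are illustrative assumptions, not entries copied from EuroVoc.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One thesaurus class: a language-independent id plus one label per language."""
    concept_id: str
    labels: dict                                   # language code -> label
    broader: list = field(default_factory=list)    # BT links (concept ids)
    narrower: list = field(default_factory=list)   # NT links (concept ids)
    related: list = field(default_factory=list)    # RT links (concept ids)


# Hypothetical mini-thesaurus; ids, labels and links are made up for illustration.
CONCEPTS = {
    "c1": Concept("c1", {"en": "road transport", "el": "οδικές μεταφορές"}, narrower=["c2"]),
    "c2": Concept("c2", {"en": "bus", "el": "λεωφορείο"}, broader=["c1"], related=["c3"]),
    "c3": Concept("c3", {"en": "bus station", "el": "σταθμός λεωφορείων"}, related=["c2"]),
}


def find_by_label(label, lang, concepts=CONCEPTS):
    """Return the concept whose label in `lang` equals `label`, or None."""
    for concept in concepts.values():
        if concept.labels.get(lang) == label:
            return concept
    return None


greek_hit = find_by_label("λεωφορείο", "el")
print(greek_hit.concept_id, greek_hit.labels["en"])
```

Because documents would be indexed by concept id rather than by label, a query typed in Greek and a query typed in English resolve to the same class.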

  5. Advantages - Drawbacks
     • The hierarchical nature of the thesaurus allows query expansion in document retrieval by subject field, without having to use other possible search terms.
     • Multilingual document collections can be searched monolingually, since there is a one-to-one translation for each descriptor.
     • Human indexing is very complex and therefore slow and expensive.
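Query expansion over the hierarchy can be sketched as a walk down the narrower-term links. The small `NARROWER` table below is a hypothetical hierarchy chosen for illustration; it is not taken from EuroVoc.

```python
# Hypothetical narrower-term (NT) links; not the actual EuroVoc hierarchy.
NARROWER = {
    "transport": ["land transport", "air transport"],
    "land transport": ["road transport", "rail transport"],
    "road transport": ["bus"],
}


def expand_query(term, narrower=NARROWER):
    """Return the term followed by all of its narrower terms, breadth-first."""
    expanded = []
    queue = [term]
    while queue:
        current = queue.pop(0)
        if current in expanded:
            continue  # guard against terms reachable by more than one path
        expanded.append(current)
        queue.extend(narrower.get(current, []))
    return expanded


print(expand_query("land transport"))
# ['land transport', 'road transport', 'rail transport', 'bus']
```

A search for the broader descriptor then also retrieves documents indexed only with one of its narrower descriptors.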

  6. Automatic Categorization using EuroVoc
     • An automatic multi-label categorization tool:
       • could be used as a support tool for human annotators, helping them to improve speed and consistency.
     • The language-independent output of such software:
       • could be used as input to different text mining applications, such as cross-lingual clustering and classification.
     • Multi-label classification:
       • improves the indexing process by using more than one label to categorize a single document.

  7. JEX: JRC EuroVoc Indexer
     • Multi-label categorization tool developed at the European Commission Joint Research Centre (JRC)
     • Freely available from http://langtech.jrc.ec.europa.eu/Eurovoc.html
     • Performs indexing with the EuroVoc descriptors (classes) following a machine-learning approach:
       • Using statistical methods, the system learns from manually indexed documents the associates of each descriptor (the words that are typical of a document belonging to a particular class).
       • When a new document undergoes the indexing procedure and the software finds associates for a EuroVoc class, it assigns the descriptor to the document in question.
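The idea of learning associates and then assigning descriptors can be illustrated with a toy sketch. This is not the JEX algorithm, whose statistics are not detailed on the slide; the log-ratio weighting, the `threshold` parameter, the function names and the training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def learn_associates(training_docs):
    """training_docs: list of (tokens, descriptors) pairs from manual indexing.

    A word becomes an associate of a descriptor when it is relatively more
    frequent in that descriptor's documents than in the whole collection.
    """
    overall = Counter()
    per_class = defaultdict(Counter)
    for tokens, descriptors in training_docs:
        overall.update(tokens)
        for descriptor in descriptors:
            per_class[descriptor].update(tokens)

    total = sum(overall.values())
    associates = {}
    for descriptor, counts in per_class.items():
        class_total = sum(counts.values())
        weights = {}
        for word, count in counts.items():
            ratio = (count / class_total) / (overall[word] / total)
            if ratio > 1.0:  # keep only words typical of this class
                weights[word] = math.log(ratio)
        associates[descriptor] = weights
    return associates


def assign(tokens, associates, threshold=0.5):
    """Return descriptors whose summed associate weights reach the threshold."""
    scores = {}
    for descriptor, weights in associates.items():
        score = sum(weights.get(token, 0.0) for token in set(tokens))
        if score >= threshold:
            scores[descriptor] = score
    return sorted(scores, key=scores.get, reverse=True)


# Made-up training data, in English for readability.
TRAINING = [
    (["bus", "stop", "city"], ["bus"]),
    (["bus", "station", "route"], ["bus"]),
    (["data", "file", "city"], ["data processing"]),
    (["data", "format", "file"], ["data processing"]),
]

ASSOCIATES = learn_associates(TRAINING)
print(assign(["bus", "stop", "timetable"], ASSOCIATES))  # ['bus']
```

In this toy data the word "city" occurs equally often in both classes, so it is typical of neither and is not kept as an associate.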

  8. The Data set
     • Geodata.gov.gr:
       • aims “to provide a focal point for the aggregation, search, provision and portrayal of open public geospatial information”;
       • in the road map to support enforcement of Law 3979/2011 for eGovernment:
         • as a best-practice example for the application of Information & Communication Technologies (ICT) in the public administration,
         • as an open data repository for the provision of geospatial information;
       • is the first attempt at the free distribution of open geospatial data to citizens or enterprises in Greece.
     • A challenging case for the use of EuroVoc descriptors for indexing:
       • would allow the creation of a common vocabulary;
       • could constitute a first step towards the interlinking of Greek geospatial data with similar documents.

  9. The nature of the data
     • Collection of texts containing open geospatial data descriptions (http://www.geodata.gov.gr/geodata)
     • Texts of various lengths, ranging from 22 words (approx. 150 characters) to 350 words (approx. 2,000 characters)
     • Four to five descriptors are already assigned to these documents manually, some of them in English.
     • The results of the automatic indexing with EuroVoc descriptors could therefore be compared with the initial characterization of the documents in question.

  10. Example document
     • Text registered under the label “Υποδομές και Επικοινωνίες” (“Infrastructures and Communications”)
     • Document title: “Σταθμοί και στάσεις των Αστικών Συγκοινωνιών της Αθήνας” (“Stations and stops of urban transport in Athens”)
     • Describes geospatial data on the stations and stops of buses, trolleybuses, tram, metro and suburban railway in Athens.

  11. Towards automatic indexing
     • JEX (JRC EuroVoc Indexer) ver. 1.0 for the Greek language (http://langtech.jrc.ec.europa.eu/Eurovoc.html)
     • 15 texts under the labels “Environment”, “Culture”, “Energy”, “Infrastructure and Communications” from the Greek geodata web site (http://geodata.gov.gr)
     • Initial indexing of the documents using JEX without changing any of its parameters

  12. Automatic descriptor assignment

  13. Manual vs. Automatic Indexing
     • Keywords manually assigned to the document:
       • “Transport networks”
       • “αστικές συγκοινωνίες” (“urban transport”)
       • “λεωφορεία” (“buses”)
       • “τρόλλευ” (“trolley”)
       • “τραμ” (“tram”)
     • Descriptors automatically assigned to the document (default number):
       • “bus”
       • “disclosure of information”
       • “data processing”
       • “national implementation of community law”
       • “merge control”
       • “competition”
     • Related terms:
       • “bus station”
       • “electronic document”
     • JEX field terms:
       • Transport (appears in the title of the document)
       • Communications (related to the document category)

  14. Descriptor “bus”
     • Associates:
       • “έκτακτες” (“non-regular”, nominative/accusative plural)
       • “στάσης” (“stop”, genitive singular)
       • “λεωφορείων” (“bus”, genitive plural)
       • “στάσεις” (“stop”, nominative/accusative plural)

  15. Evaluation (1)

  16. Evaluation (2)
     • The keywords produced by the automatic indexing process did not fully match the keywords that had already been manually assigned to the documents.
     • We cannot draw safe conclusions due to the small size of our corpus.
     • The results are somewhat expected, since the JEX software was used as initially trained, without taking the particular data into account.
     • JEX should be trained with geodata descriptors in order to assign keywords more accurately to documents containing geospatial information.
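The mismatch can be made concrete by comparing the two keyword sets listed on slide 13 for the Athens urban transport document. The exact-match comparison below is a simple sketch and an assumption; the slides do not state how the evaluation was actually computed. Greek keywords are represented by the English glosses given on the slide.

```python
def overlap_scores(manual, automatic):
    """Compare two keyword sets by exact, case-insensitive match.

    Returns (common keywords, precision, recall), where precision is the share
    of automatic keywords found in the manual set and recall is the share of
    manual keywords found in the automatic set.
    """
    manual_set = {keyword.lower() for keyword in manual}
    automatic_set = {keyword.lower() for keyword in automatic}
    common = manual_set & automatic_set
    precision = len(common) / len(automatic_set) if automatic_set else 0.0
    recall = len(common) / len(manual_set) if manual_set else 0.0
    return common, precision, recall


# Keyword sets as listed on slide 13 (English glosses of the Greek keywords).
MANUAL = ["Transport networks", "urban transport", "buses", "trolley", "tram"]
AUTOMATIC = [
    "bus",
    "disclosure of information",
    "data processing",
    "national implementation of community law",
    "merge control",
    "competition",
]

common, precision, recall = overlap_scores(MANUAL, AUTOMATIC)
print(common, precision, recall)  # set() 0.0 0.0
```

Note that “buses” and “bus” do not count as a match here: without some normalization of word forms, even closely related keywords are treated as different.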

  17. Evaluation (3)
     • A question remains open regarding the assignment of general terms that could characterize most documents in the collection.
       • The software correctly assigned such terms to certain documents, but the initial annotation had not taken these terms into account.
     • The existence of different inflected word-forms in the text:
       • implies that automatic indexing of texts written in a highly inflected language such as Greek would be greatly facilitated by the use of tools such as lemmatizers in a linguistic preprocessing phase.
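The effect of lemmatization can be sketched with the inflected forms listed as associates on slide 14. The lookup table below is a hypothetical stand-in for a real Greek lemmatizer and covers only those four forms.

```python
from collections import Counter

# Hypothetical lemma table standing in for a real lemmatizer; it covers only
# the inflected forms listed as associates on slide 14.
LEMMAS = {
    "στάσης": "στάση",          # "stop", genitive singular
    "στάσεις": "στάση",         # "stop", nominative/accusative plural
    "λεωφορείων": "λεωφορείο",  # "bus", genitive plural
    "έκτακτες": "έκτακτος",     # "non-regular", nominative/accusative plural
}


def lemmatize(tokens, lemmas=LEMMAS):
    """Map each token to its lemma; unknown tokens are lowercased and kept."""
    return [lemmas.get(token.lower(), token.lower()) for token in tokens]


tokens = ["στάσης", "στάσεις", "λεωφορείων"]
surface_counts = Counter(tokens)            # three distinct word forms
lemma_counts = Counter(lemmatize(tokens))   # two distinct lemmas

print(len(surface_counts), len(lemma_counts))  # 3 2
```

After lemmatization the two forms of “stop” contribute to a single feature, so evidence for a descriptor is no longer split across inflected variants.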

  18. Conclusions
     • A first attempt to automatically assign EuroVoc descriptors to Greek open government data.
     • The automatic indexing task was performed using the JEX multi-label categorization software on a small corpus of geospatial data descriptions available from the geodata.gov.gr web site.
     • General terms were more or less correctly assigned to the documents, but there were practically no words in common between the two sets of keywords.

  19. Future Work
     • Repeat the experiment after training the software with the appropriate keywords.
     • Examine different sets of stop-words that may result in better performance of the software.
     • Develop linguistic pre-processing tools, such as lemmatizers, that take into account linguistic knowledge about a highly inflected language such as Greek.

  20. Thank you!
