
Character Gazetteer for Named Entity Recognition with Linear Matching Complexity

Dlugolinsky S., Nguyen G., Laclavik M., Seleng M. Institute of Informatics, Slovak Academy of Sciences. giang.ui@savba.sk


Presentation Transcript


  1. Character Gazetteer for Named Entity Recognition with Linear Matching Complexity Dlugolinsky S., Nguyen G., Laclavik M., Seleng M. Institute of Informatics, Slovak Academy of Sciences giang.ui@savba.sk

  2. Content • Context: Big Data, Natural Language Processing (NLP), Named Entity Recognition (NER) • Gazetteers • Tree structures: design and realizations • NER with linear matching complexity • Evaluations • Future work

  3. Work context • NER is an important task for extracting information • Big Data is produced daily in • Social media: Twitter, Google+, Facebook, Instagram, etc. • Wikipedia, Wikia, newspapers, … • Other internal sources such as transactions, logs, emails, … • Knowledge and information are hidden in (un|semi-)structured data: text, images, audio, video • useful for • business or political sentiment analysis • public opinion assessment • emergency response, etc. • Text → NLP → Information

  4. Natural Language Processing (NLP) • Incoming text arrives continuously from websites, portals, social media, etc. • Need to recognize well-known NEs and their occurrences with references • NER is an important task for extracting information

  5. Gazetteers • Basic, independent and very effective NER technique for NE identification in text • Processing approaches • Token-based: split input text into a sequence of tokens (words) • Character-based: process input text character by character • NE recognition • Machine learning techniques • Finite-state machines (FSM)

  6. Related work • Ontotext Hash Gazetteer • Based on hash tables • Authors: “3x faster and 4x less memory than FSM equivalent” • Available only as part of GATE • Ontotext Stand-Alone Gazetteer • Stand-alone version of the Hash Gazetteer • No longer available • Ontotext Large Knowledge Base Gazetteer • Support for ontology-aware NLP • Available only as part of GATE • Other gazetteers are implemented as proprietary look-up code or as complex solutions

  7. Our requirements • Standalone • no 3rd party libraries needed • does not rely on external preprocessing, e.g. tokenization • Linear complexity lookup algorithm • fast and effective processing of input text as a stream, especially for Big Data • Editable data structure • add/remove NEs between lookups • Memory efficient data structure • “learn” tens of millions of entities • Robust • input texts of any size • any language

  8. Gazetteer tree data structures for HMT and CST realizations

  9. Named entity recognition: character-based with linear matching complexity
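
The slide's figure is not part of this transcript, so the following is only a minimal sketch of how character-based matching over a tree can work in a single pass. It is not the authors' implementation: the class `CharGazetteerSketch`, its methods `add` and `match`, and the `start:end:text` output format are hypothetical names chosen for illustration. The sketch keeps a list of partial matches and advances each of them by one character at a time. The work per input character is bounded by the length of the longest stored entity, so for a fixed gazetteer the total time grows linearly with the text length. Word boundaries are ignored here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical character gazetteer sketch; names are illustrative only. */
class CharGazetteerSketch {
    /** One tree node per character; children are keyed by the next character. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if the path from the root spells a known entity
    }

    /** A partial match: the node reached so far and the offset where it began. */
    private static final class Active {
        final Node node;
        final int start;
        Active(Node node, int start) { this.node = node; this.start = start; }
    }

    private final Node root = new Node();

    /** Stores one entity, creating one node per character where needed. */
    void add(String entity) {
        Node n = root;
        for (int i = 0; i < entity.length(); i++) {
            n = n.children.computeIfAbsent(entity.charAt(i), c -> new Node());
        }
        n.terminal = true;
    }

    /** Single left-to-right pass; returns "start:end:text" for every occurrence. */
    List<String> match(String text) {
        List<String> hits = new ArrayList<>();
        List<Active> active = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            active.add(new Active(root, i)); // a new match may begin at any offset
            List<Active> next = new ArrayList<>();
            for (Active a : active) {
                Node child = a.node.children.get(c);
                if (child == null) continue; // this partial match ends here
                if (child.terminal) {
                    hits.add(a.start + ":" + (i + 1) + ":" + text.substring(a.start, i + 1));
                }
                next.add(new Active(child, a.start)); // may still extend to a longer entity
            }
            active = next;
        }
        return hits;
    }
}
```

With the entities "New York", "New York City" and "York" stored, `match("In New York City")` reports all three, which covers the overlapping, prefix and postfix cases: "New York" is a prefix of "New York City", and "York" is a postfix of "New York".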

  10. HMT and CST realizations • HMT: Hash Map Tree (multi-way tree) • implemented with Java HashMap, constant-time performance O(1) on average for basic operations (get and put) • (-) consumes a lot of memory • (+) very fast • CST: Child-Sibling Tree • pure and simple Java structure for nodes • (+) memory efficient (only 25% of HMT) • (-) slower (approx. 10x vs. HMT for big data) • Handles overlapping, prefix and postfix NE cases
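
To make the trade-off concrete, here is a minimal sketch of a child-sibling node layout, assuming one node per character. It is not the authors' code: `ChildSiblingTreeSketch` and its methods are hypothetical. Each node holds only two references (first child, next sibling) instead of a `HashMap`, which is one plausible reason for the smaller footprint. Finding a child means walking the sibling chain, which is one plausible reason for the slower lookups.

```java
/** Hypothetical child-sibling tree (CST) sketch; not the authors' implementation. */
class ChildSiblingTreeSketch {
    static final class Node {
        final char ch;
        boolean terminal;  // true if the path from the root spells a stored entity
        Node firstChild;   // first node on the next level
        Node nextSibling;  // next alternative on the same level
        Node(char ch) { this.ch = ch; }
    }

    private final Node root = new Node('\0');

    /** Walks the sibling chain under parent; linear in the number of children. */
    private static Node findChild(Node parent, char c) {
        for (Node n = parent.firstChild; n != null; n = n.nextSibling) {
            if (n.ch == c) return n;
        }
        return null;
    }

    /** Returns the node reached by spelling out s, or null if the path is missing. */
    private Node find(String s) {
        Node cur = root;
        for (int i = 0; i < s.length() && cur != null; i++) {
            cur = findChild(cur, s.charAt(i));
        }
        return cur;
    }

    void add(String entity) {
        Node cur = root;
        for (int i = 0; i < entity.length(); i++) {
            char c = entity.charAt(i);
            Node child = findChild(cur, c);
            if (child == null) {
                child = new Node(c);
                child.nextSibling = cur.firstChild; // prepend to the sibling chain
                cur.firstChild = child;
            }
            cur = child;
        }
        cur.terminal = true;
    }

    boolean contains(String entity) {
        Node n = find(entity);
        return n != null && n.terminal;
    }

    /** Unmarks an entity so it no longer matches; nodes stay in place for brevity. */
    boolean remove(String entity) {
        Node n = find(entity);
        if (n == null || !n.terminal) return false;
        n.terminal = false;
        return true;
    }
}
```

The `add` and `remove` methods illustrate the "editable between lookups" requirement. A fuller version would also prune nodes that no longer lead to any entity.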

  11. Evaluation datasets • Gazetteer datasets: • Freebase organizations: 778,814 unique entities • Freebase locations: 1,256,552 unique entities • Freebase persons: 2,614,401 unique entities • Wikipedia titles and alternative names: 9,319,611 unique entities • Incoming data sets • 9,909 documents acquired from CoNLL-2003 datasets (Reuters’ text) with approximately 29MB of text

  12. Memory consumption

  13. Rating characters per node

  14. Matching time

  15. Simple output example

  16. Next steps • Improving the tree data structure in order to • Decrease memory requirements • Make traversal and matching more efficient • Possible direction is collapsing nodes: • PHT - Patricia Hash Map Trie • Completing the work • Integration into our projects and existing complex tools • Open source at http://ikt.ui.sav.sk/gazetteer
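
The slides do not describe how the PHT would be built, so the following is only a rough illustration of why collapsing nodes can reduce memory. It assumes the usual Patricia idea of merging chains of single-child, non-terminal nodes into one node. The class `CollapseSketch` and both counting methods are hypothetical. The sketch builds a plain one-character-per-node tree and counts how many nodes would remain after such merging.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: node count before and after collapsing single-child chains. */
class CollapseSketch {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if the path from the root spells a stored entity
    }

    final Node root = new Node();

    void add(String entity) {
        Node n = root;
        for (int i = 0; i < entity.length(); i++) {
            n = n.children.computeIfAbsent(entity.charAt(i), c -> new Node());
        }
        n.terminal = true;
    }

    /** Counts every node of the plain one-character-per-node tree. */
    static int countNodes(Node n) {
        int count = 1;
        for (Node child : n.children.values()) count += countNodes(child);
        return count;
    }

    /**
     * Counts the nodes that would survive collapsing: the root, terminal nodes,
     * and branching nodes. A non-terminal node with exactly one child would be
     * merged into its neighbour's label, so it is not counted.
     */
    static int countCollapsed(Node n, boolean isRoot) {
        int count = (isRoot || n.terminal || n.children.size() != 1) ? 1 : 0;
        for (Node child : n.children.values()) count += countCollapsed(child, false);
        return count;
    }
}
```

For the two entities "Bratislava" and "Brno", the plain tree has 13 nodes, while the collapsed form would keep 4: the root, the shared "Br" branch, and one node each for the remainders "atislava" and "no".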

  17. Thank you for your attention. Giang Nguyen, giang.ui@savba.sk. Cite: Stefan Dlugolinsky, Giang Nguyen, Michal Laclavik, Martin Seleng: "Character Gazetteer for Named Entity Recognition with Linear Matching Complexity", 3rd World Congress on Information and Communication Technologies, WICT'2013, pp. 364-368, IEEE Catalog Number: CFP1395H-ART, ISBN: 978-1-4799-3230-6
