
Tokeniser

Tokeniser. Francisco Miguel Pérez Romero, University of Sevilla. Roadmap: Introduction, Class Diagram, Libraries, Conclusions. Web Wrapping: Extractor, Information retrieval, Ontologiser, Verifier, FormFiller, Navigator.


Presentation Transcript


  1. Tokeniser Francisco Miguel Pérez Romero University of Sevilla

  2. Roadmap Introduction Class Diagram Libraries Conclusions

  3. Roadmap Introduction Class Diagram Libraries Conclusions

  4. Web Wrapping Extractor Information retrieval Ontologiser Verifier FormFiller Navigator Query

  5. Tokeniser • Tokenisation Rules • Configuration File • Web Page • Parser

  6. Tokeniser Usage • Web Page Classification • Information Extraction Learners • Information Extraction

  7. Example Config File Token List Tokeniser XML File Token List Web Page

  8. Concepts • Configuration File • Token • Tokenisation types

  9. Roadmap Introduction Class Diagram Libraries Conclusions

  10. Example • 3 TokenClasses: • Word • Space • Digit (figure labels: Space, Digit)
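The three token classes on this slide can be illustrated with a minimal sketch built on `java.util.regex`. The class name `TokeniserSketch`, the `Token` type and the three patterns are assumptions made for illustration; the slides do not show the actual tokenisation rules, which would be read from the configuration file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokeniserSketch {
    // A token pairs the name of the class that matched with the matched text.
    static final class Token {
        final String tokenClass;
        final String text;

        Token(String tokenClass, String text) {
            this.tokenClass = tokenClass;
            this.text = text;
        }

        @Override
        public String toString() {
            return tokenClass + "(" + text + ")";
        }
    }

    // Hypothetical rules, tried in insertion order. A real tokeniser would
    // load these from its configuration file instead of hard-coding them.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("Word", Pattern.compile("\\p{L}+"));
        RULES.put("Space", Pattern.compile("\\s+"));
        RULES.put("Digit", Pattern.compile("\\d"));
    }

    static List<Token> tokenise(String input) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() only succeeds if the match starts at 'pos'.
                if (m.lookingAt()) {
                    tokens.add(new Token(rule.getKey(), m.group()));
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
        }
        return tokens;
    }
}
```

Under these assumed rules, `TokeniserSketch.tokenise("ab 42")` yields one Word token, one Space token and two Digit tokens, because the Digit pattern matches a single digit at a time.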

  11. Class Diagram: Tokenisation

  12. Tokenisation Example

  13. Class Diagram: Tokeniser

  14. Roadmap Introduction Class Diagram Libraries Conclusions

  15. Comparison Features 1 • Comparison Features: • Javadoc documentation? • Support UNICODE UTF-8 • Support UNICODE UTF-16 • Named Groups • Indexable Groups > 9 • Negative Groups • Nested groups • Lazy quantifiers?

  16. Comparison Features 2 • Comparison Features: • Fuzzy matching? • Support POSIX? • Support Ignore Case? • Support New Line Option? • Use State Machine? • Support accent?
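To make a few of these comparison features concrete, here is a small sketch showing how `java.util.regex` (one of the libraries compared later) handles named groups, lazy quantifiers, case-insensitive matching and a multi-line option. The class and method names are hypothetical, and reading "New Line Option" as `Pattern.MULTILINE` is an assumption; the slides do not say which flag was meant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFeatureSketch {
    // Named groups: the year is retrieved by name rather than by index.
    static String yearOf(String input) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher(input);
        return m.find() ? m.group("year") : null;
    }

    // Lazy quantifier: ".+?" stops at the first '>' instead of the last one.
    static String firstTag(String html) {
        Matcher m = Pattern.compile("<.+?>").matcher(html);
        return m.find() ? m.group() : null;
    }

    // Ignore case: the whole input must match, regardless of letter case.
    static boolean isTokeniser(String input) {
        return Pattern.compile("tokeniser", Pattern.CASE_INSENSITIVE)
                .matcher(input)
                .matches();
    }

    // Multi-line option (assumed meaning of "New Line Option"):
    // '^' matches at the start of every line, not only at the start of input.
    static int countLeadingWords(String text) {
        Matcher m = Pattern.compile("^\\w+", Pattern.MULTILINE).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

Other libraries in the comparison may expose these features differently or not at all, which is what the feature tables record.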

  17. Libraries • Table 1

  18. Libraries • Table 2

  19. Libraries • Table 3

  20. Benchmark 1 • Regular Expression List • String List • Matching every expression against every string • Time in ms
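The benchmark procedure on this slide could look roughly like the following sketch for `java.util.regex`. The class name, the method signature and the choice to compile patterns before starting the clock are assumptions; the slides do not show the original harness or its input lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexBenchmarkSketch {
    // Matches every pattern against every input string, 'iterations' times.
    // Returns { elapsed milliseconds, total number of successful matches }.
    static long[] run(List<String> regexes, List<String> inputs, int iterations) {
        // Assumption: compilation is excluded from the timed section.
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }

        long matches = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            for (Pattern pattern : patterns) {
                for (String input : inputs) {
                    if (pattern.matcher(input).matches()) {
                        matches++;
                    }
                }
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return new long[] { elapsedMs, matches };
    }
}
```

Returning the match count alongside the time keeps the loop from being trivially optimised away and gives a sanity check that each library under test agrees on the results.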

  21. Benchmark 1: 10000 Iterations • org.apache: 7078 ms • com.stevesoft: 19782 ms • kmy.regex: 781 ms • java.util: 1266 ms • jregex.Pattern: 1000 ms • org.apache.oro: 2156 ms • dk.brics.automaton: 265 ms • com.karneim.util.collection: 407 ms

  22. Benchmark 1: 20000 Iterations • org.apache: 11796 ms • com.stevesoft: 26641 ms • kmy.regex: 906 ms • java.util: 1891 ms • jregex.Pattern: 1422 ms • org.apache.oro: 3375 ms • dk.brics.automaton: 312 ms • com.karneim.util.collection: 610 ms

  23. Benchmark 1: 50000 Iterations • org.apache: 28656 ms • com.stevesoft: 63297 ms • kmy.regex: 1781 ms • java.util: 4281 ms • jregex.Pattern: 3219 ms • org.apache.oro: 7641 ms • dk.brics.automaton: 531 ms • com.karneim.util.collection: 1312 ms

  24. Diagram

  25. Benchmark 2 • Source Code • Matching tags
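As a rough illustration of the second benchmark, the sketch below counts tag-like matches in a page's source code with `java.util.regex` and times the scan. The tag pattern `<[^>]+>` and all names are assumptions; the slides do not give the expression that was actually used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMatchSketch {
    // Assumed tag pattern: '<', one or more characters other than '>', then '>'.
    static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Counts non-overlapping tag matches in the given source code.
    static int countTags(String source) {
        Matcher m = TAG.matcher(source);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Times a single scan of the source, in milliseconds.
    static long timeMs(String source) {
        long start = System.nanoTime();
        countTags(source);
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

A single scan of one page is very short, so readings can easily round down to 0 ms; repeating the scan many times and averaging would likely give more stable numbers.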

  26. Benchmark 2: Amazon • org.apache: 218 ms • com.stevesoft: 63 ms • kmy.regex: 94 ms • java.util: 0 ms • jregex.Pattern: 93 ms • org.apache.oro: 32 ms • dk.brics.automaton: 0 ms • com.karneim.util.collection: 47 ms

  27. Benchmark 2: Marca • org.apache: 62 ms • com.stevesoft: 47 ms • kmy.regex: 93 ms • java.util: 0 ms • jregex.Pattern: 94 ms • org.apache.oro: 16 ms • dk.brics.automaton: 0 ms • com.karneim.util.collection: 62 ms

  28. Benchmark 2: Ebay • org.apache: 31 ms • com.stevesoft: 125 ms • kmy.regex: 266 ms • java.util: 0 ms • jregex.Pattern: 156 ms • org.apache.oro: 47 ms • dk.brics.automaton: 0 ms • com.karneim.util.collection: 172 ms

  29. Diagram

  30. To sum up… • dk.brics.automaton is the fastest • dk.brics and com.karneim fail with URLs • kmy.regex or java.util

  31. Roadmap Introduction Class Diagram Libraries Conclusions

  32. Conclusions • Tokenisation test • Searching information • A real project • Experience

  33. Thanks!
