
Search And Text Analysis

Search And Text Analysis. An Introduction to Java-based Open Source Tools and Techniques Grant Ingersoll October 15, 2008 Charlotte JUG. Overview. Background Taming Text Importance Foundations Language Basics Obtaining Text Tools for Search and Text Analysis Concepts Demos Resources.


Presentation Transcript


  1. Search And Text Analysis An Introduction to Java-based Open Source Tools and Techniques Grant Ingersoll October 15, 2008 Charlotte JUG

  2. Overview • Background • Taming Text • Importance • Foundations • Language Basics • Obtaining Text • Tools for Search and Text Analysis • Concepts • Demos • Resources

  3. Taming Text • http://www.manning.com/ingersoll • Grant Ingersoll • Lucene/Solr/Mahout committer • Tom Morton • OpenNLP author • Practical aspects of Text Analysis • Open Source libraries • No math :-) • In progress - Early Access Available • Code: • Solr as platform for enabling NLP • Open source

  4. Why is Search and Text Analysis Important?

  5. Quiz 1. Can you read this? 2. How about this? 3. How many emails did you get today? 4. How many websites/articles/books did you read today? 5. How many searches did you do? (Google, Y!, local, proprietary) 6. How much time did you spend doing all of these things? 7. How did you do it? Literally, what processes did you use?

  6. Importance: Text is Hard! • “Numbers don’t lie!” • Unless you’re a politician? • Text “lies” all the time! • Even people disagree on it • Computer Expectations: • Be as good as people at a task (which isn’t perfect)

  7. Importance: Info Overload • IDC estimates: • We generated 161 exabytes of digital info in 2006 • Even if a lot is non-textual, we deal with it by writing about it: • Tags, summaries, reports, closed captioning, etc. • Info Workers spend, in hours per week: • 14.5 hours reading and answering email • 13.3 creating documents • 9.6 searching for info • 9.5 analyzing info

  8. Importance: Intelligent Web • The Web is driven by text • The present and future of the web are based on intelligence, as in • Human: a.k.a. “The Masses” • Ratings, reviews, connections • Pros: dynamic, ad hoc or guided • Cons: sloppy, cheating, ad hoc • Artificial: • Pros: Can do a good/decent job on a lot of things, cheaper than manual • Cons: Can be really hard, not always as good as people • Examples: • Google, Y!, Amazon, Facebook, LinkedIn, Blogs, Wikipedia, many, many startups

  9. Importance: Text and You • You: • Personal Organization • Email, IM, Docs • Importance, Prioritization, Organization • Career • Always work on hard problems • In-demand skill

  10. Importance: Your Company • Make sense of disparate sources of info to gain competitive advantage • Reduce time/expense to understand large volumes of data • Enhance productivity • Mine connections/relationships

  11. Foundations

  12. Pieces of the Text Pie • Characters • Encoding, case, punctuation, accents, numbers • Tokens/Words • Segmentation, Parts of Speech, Stemming • Multi-word and Sentences • Phrases, parsing, sentence detection, co-reference resolution • Paragraphs • Summarization, meaning

  13. Pieces II • Document • Meaning • Reading Level • Multi-document/Corpus • Summaries, similar docs • You/Author • Beliefs, knowledge, culture, training

  14. Fields of Interest • Information Retrieval (IR) • Natural Language Processing (NLP) • Computational Linguistics • Math/Statistics • Artificial Intelligence • Biology

  15. Search and Text Analysis in the Real World • Focus on Search • Most robust, but far from perfect • Integrate others into Search platform

  16. Obtaining Text • It’s everywhere, but is it how we want it? • Crawl file/web • DBs • CMS • Many different file formats • Office, PDF, HTML, XML • Need to extract usable content

  17. Extracting Text • Many open source tools exist: • PDFBox, POI, TextMining, SAX, DOM, StAX, NekoHTML, HTMLParser, etc. • Use a framework instead of one-offs for each tool • Common API for all tools • Aperture: http://aperture.sourceforge.net/ • Crawlers, extractors, RDF • Tika: http://incubator.apache.org/tika/ • SAX-like plus metadata

  18. Text Applications • Find items that meet an information need • Identify important people, places, things • Fuzzy Strings • Categorization and Classification • Organize groups of documents • Answer questions • Much, much more: • Sentiment, Machine Translation, Summarization…

  19. Search

  20. Search Concepts • User inputs one or more keywords along with some operators and expects to get back a ranked list of documents relevant to the keywords • User sorts through the documents, reading/using those he thinks are most relevant • The user’s set of relevant docs does not always equal the search engine’s

  21. Making Content Searchable • Search engines generally: • Extract tokens from content • Optionally transform said tokens depending on needs • Stemming • Expand with synonyms (usually done at query time) • Remove tokens (stopwords) • Other Text Analysis • Add metadata • Store tokens and related metadata (position, etc.) in a data structure optimized for searching • Called an Inverted Index
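The steps on this slide (extract tokens, drop stopwords, store the rest in an inverted index) can be sketched in plain Java. This is only an illustration and not how Lucene or Solr implement it: the class name `InvertedIndexSketch`, the tiny stopword list, and the sample documents are all assumptions made up for the example, and there is no stemming, synonym expansion, or position metadata.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // Hypothetical stopword list, for illustration only.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Lowercase the text, split on non-alphanumerics, and drop stopwords.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOPWORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Record every surviving token of the document in the postings map.
    void add(int docId, String text) {
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up a single term; returns an empty set when the term is unknown.
    Set<Integer> search(String term) {
        TreeSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<>() : new TreeSet<>(hits);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "The quick brown fox");
        index.add(2, "A quick search of the index");
        System.out.println(index.search("quick")); // prints [1, 2]
        System.out.println(index.search("fox"));   // prints [1]
    }
}
```

The point of the structure is that a lookup goes from term to documents directly, instead of scanning every document for the term.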

  22. Libraries • Apache Lucene • Apache Solr • Sphinx • Minion • Xapian

  23. Apache Solr • Lucene-based Search server • HTTP-based, but many native clients • Lucene best practices • Replication/Distribution • Caching • Plug and Play extensions • http://lucene.apache.org/solr

  24. People, Places, Things http://news.yahoo.com/s/ap/20081013/ap_on_sp_fo_ne/fbn_cowboys_romo_10

  25. Named Entity Recognition • Identify people, places, things, numerical quantities • Approaches • Rule-based • Write rules to extract • Lists, gazetteers, others • Statistical • Annotate data and learn stats • Change domains, languages, etc.
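As a rough illustration of the rule-based approach, the sketch below combines a list lookup (a gazetteer) with one hand-written regular-expression rule for numbers. Everything in it is an assumption for the example: the class name `GazetteerNerSketch`, the two gazetteer entries, and the label names. It is not how OpenNLP or any other library listed in this talk works, and a real system would need tokenization, disambiguation, and far larger lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GazetteerNerSketch {
    // Hypothetical gazetteer: surface form -> entity type.
    private static final Map<String, String> GAZETTEER = new LinkedHashMap<>();
    static {
        GAZETTEER.put("John Wayne", "PERSON");
        GAZETTEER.put("Charlotte", "PLACE");
    }

    // Hand-written rule: an integer or decimal number is a numerical quantity.
    private static final Pattern NUMBER = Pattern.compile("\\b\\d+(\\.\\d+)?\\b");

    // Returns "TYPE:text" strings, gazetteer hits first, then numbers.
    static List<String> tag(String text) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, String> entry : GAZETTEER.entrySet()) {
            if (text.contains(entry.getKey())) {
                found.add(entry.getValue() + ":" + entry.getKey());
            }
        }
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            found.add("NUMBER:" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sentence, used only to exercise the rules.
        System.out.println(tag("John Wayne visited Charlotte 2 times"));
    }
}
```

The statistical approach on the slide replaces these hand-written lists and patterns with a model learned from annotated data, which is what makes it easier to move to new domains and languages.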

  26. Libraries • OpenNLP • MinorThird • Stanford NER • Mallet • LingPipe (dual license) • OpenCalais (dual)

  27. OpenNLP • Maximum Entropy library • Parser • Chunker • Sentence Detection • NER

  28. Fuzzy Strings • Spell checking • Record Matching • Address book merging • US Census • Document/Question Similarity • Log analysis • De-duplication

  29. Strings • Algorithms • Edit Distance (Levenshtein) • Jaro-Winkler • Many others • Libraries • Regular Expressions • SecondString • Lucene Spell Checker (contrib)
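For readers who have not seen edit distance before, here is a minimal Java sketch of the classic dynamic-programming Levenshtein algorithm, where each insertion, deletion, or substitution costs 1. The class and method names are assumptions for this example; the libraries named on the slide ship their own, more tuned implementations.

```java
class EditDistanceSketch {
    // Minimum number of single-character insertions, deletions, and
    // substitutions needed to turn a into b. Uses two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```

A spell checker can use a score like this to rank dictionary words by how close they are to a misspelled query term.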

  30. Organization: Classification http://www.dmoz.org

  31. C & C • Automatically label content based on one or more categories • Supervised • Unsupervised • Useful for: • Spam • Genre (sports, business, tech, etc.)

  32. Libraries • Mahout • Naïve Bayes • Genetic • OpenNLP • libSVM • Neural Network implementations

  33. Organization: Clustering

  34. Clustering • Group similar content into clusters for easy browsing • Types: • Search Results • Documents • Data • Approaches: • K-Means • Mean-shift • Hierarchical
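To make the K-Means bullet concrete, the following is a small Java sketch of the standard assign-then-recompute loop, restricted to one-dimensional points so it stays short. The class name `KMeansSketch`, the caller-supplied starting centroids, and the sample data are assumptions for illustration. Mahout's implementation works on high-dimensional vectors at much larger scale and is not shown here.

```java
import java.util.Arrays;

class KMeansSketch {
    // Runs k-means on 1-D points, starting from the given centroids.
    static double[] cluster(double[] points, double[] initial, int maxIterations) {
        double[] centroids = initial.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // Assignment step: each point joins its nearest centroid.
            for (double p : points) {
                int best = nearest(centroids, p);
                sum[best] += p;
                count[best]++;
            }
            // Update step: move each centroid to the mean of its points.
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                double updated = sum[c] / count[c];
                if (updated != centroids[c]) {
                    centroids[c] = updated;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
        }
        return centroids;
    }

    // Index of the centroid closest to p (ties go to the lower index).
    static int nearest(double[] centroids, double p) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        System.out.println(Arrays.toString(cluster(points, new double[] {1, 12}, 10)));
        // prints [2.0, 11.0]
    }
}
```

For text, each document would first be turned into a weighted term vector, and the distance would be computed between vectors rather than between single numbers.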

  35. Libraries • Carrot2 (search results) • Many different approaches • Mahout • k-Means • Mean Shift • Canopy • SOLR-769 • https://issues.apache.org/jira/browse/SOLR-769 • Various others

  36. Q & A http://www.answers.com/who%20is%20Bobby%20Orr%3F

  37. Question Answering • Find the answer to a question • Phrase, sentence, passage, document(s) • Combination of a lot of the previous parts • Difficult • Easier • Who is John Wayne? • Hard (impossible?): • What are the pros and cons of the bailout package?

  38. Libraries • QANDA • OpenEphyra • Taming Text • Future • Fact-based • Demo only • Some others

  39. Resources • http://lucene.apache.org • /solr • /java • /mahout • http://opennlp.sourceforge.net • http://project.carrot2.org/ • gsingers@a.o • a.o ==apache.org

  40. Demo • Download from • Unzip • cd apache-solr-1.3.0/example • java -jar start.jar • In another terminal, cd example/exampledocs • java -jar post.jar *.xml

  41. Vector Space Model (figure: document vector d1 and query vector q1 separated by angle Θ) • Goal: Identify documents that are similar to input query • Represent each word with a weight w • The words in the document and the query each define a Vector in an n-dimensional space • Common weighting scheme is called TF-IDF • TF = Term Frequency • IDF = Inverse Document Freq. • Intuition behind TF-IDF: • A term that frequently occurs in a few documents relative to the collection is more important than one that occurs in a lot of documents • $\mathrm{Sim}(q_1, d_1) = \cos\Theta$ • $d_j = \langle w_{1,j}, w_{2,j}, \ldots, w_{n,j} \rangle$ • $q = \langle w_{1,q}, w_{2,q}, \ldots, w_{n,q} \rangle$ • $w$ = weight assigned to term
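The vector space idea can be sketched in a few lines of Java: weight each term by TF-IDF, then score a query against a document by the cosine of the angle between the two weight vectors. This is a textbook-style sketch under stated assumptions, not Lucene's actual scoring formula: the class name `VectorSpaceSketch` is made up, TF is the raw count, and IDF is taken as ln(N / df), which is only one of several common variants.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class VectorSpaceSketch {
    // Builds a sparse TF-IDF vector: weight = raw term count * ln(numDocs / docFreq).
    // Terms with no recorded document frequency get weight 0.
    static Map<String, Double> tfIdf(List<String> tokens,
                                     Map<String, Integer> docFreq,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (String token : tokens) {
            weights.merge(token, 1.0, Double::sum); // raw term frequency
        }
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            int df = docFreq.getOrDefault(entry.getKey(), 0);
            double idf = df == 0 ? 0.0 : Math.log((double) numDocs / df);
            entry.setValue(entry.getValue() * idf);
        }
        return weights;
    }

    // Cosine of the angle between two sparse vectors; 0 if either is all zeros.
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        double qNorm = 0.0;
        double dNorm = 0.0;
        for (Map.Entry<String, Double> entry : q.entrySet()) {
            dot += entry.getValue() * d.getOrDefault(entry.getKey(), 0.0);
            qNorm += entry.getValue() * entry.getValue();
        }
        for (double w : d.values()) {
            dNorm += w * w;
        }
        if (qNorm == 0.0 || dNorm == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }

    public static void main(String[] args) {
        // Hypothetical three-document collection statistics.
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("solr", 1);
        docFreq.put("search", 3);
        docFreq.put("text", 2);
        Map<String, Double> doc = tfIdf(Arrays.asList("solr", "search", "text"), docFreq, 3);
        Map<String, Double> query = tfIdf(Arrays.asList("solr", "text"), docFreq, 3);
        System.out.println(cosine(query, doc));
    }
}
```

In this made-up collection "search" occurs in every document, so its IDF is ln(3/3) = 0 and it contributes nothing to the score, which matches the intuition on the slide that rare terms matter more than common ones.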
