Search and text analysis

Search and Text Analysis: An Introduction to Java-based Open Source Tools and Techniques. Grant Ingersoll, October 15, 2008, Charlotte JUG. Overview: Background (Taming Text, Importance); Foundations (Language Basics, Obtaining Text); Tools for Search and Text Analysis (Concepts, Demos); Resources.


Search And Text Analysis

An Introduction to Java-based Open Source Tools and Techniques

Grant Ingersoll

October 15, 2008

Charlotte JUG


Overview

  • Background

    • Taming Text

    • Importance

  • Foundations

    • Language Basics

    • Obtaining Text

  • Tools for Search and Text Analysis

    • Concepts

    • Demos

  • Resources


Taming Text

  • http://www.manning.com/ingersoll

  • Grant Ingersoll

    • Lucene/Solr/Mahout committer

  • Tom Morton

    • OpenNLP author

  • Practical aspects of Text Analysis

    • Open Source libraries

    • No math :-)

  • In progress - Early Access Available

  • Code:

    • Solr as platform for enabling NLP

    • Open source



Quiz

1. Can you read this?

2. How about this?

3. How many emails did you get today?

4. How many websites/articles/books did you read today?

5. How many searches did you do? (Google, Y!, local, proprietary)

6. How much time did you spend doing all of these things?

7. How did you do it? Literally, what processes did you use?


Importance: Text is Hard!

  • “Numbers don’t lie!”

    • Unless you’re a politician?

  • Text “lies” all the time!

    • Even people disagree on it

  • Computer Expectations:

    • Be as good as people at a task (which isn’t perfect)


    Importance: Info Overload

    • IDC estimates:

      • We generated 161 exabytes of digital info in 2006

      • Even if a lot is non-textual, we deal with it by writing about it:

        • Tags, summaries, reports, closed captioning, etc.

      • Info Workers spend, in hours per week:

        • 14.5 hours reading and answering email

        • 13.3 creating documents

        • 9.6 searching for info

        • 9.5 analyzing info


    Importance: Intelligent Web

    • The Web is driven by text

    • The present and future of the web are based on intelligence, of two kinds:

      • Human: a.k.a. “The Masses”

        • Ratings, reviews, connections

        • Pros: dynamic, ad hoc or guided

        • Cons: sloppy, cheating, ad hoc

      • Artificial:

        • Pros: Can do a good/decent job on a lot of things, cheaper than manual

        • Cons: Can be really hard, not always as good as people

    • Examples:

      • Google, Y!, Amazon, Facebook, LinkedIn, Blogs, Wikipedia, many, many startups


    Importance: Text and You

    • You:

      • Personal Organization

        • Email, IM, Docs

        • Importance, Prioritization, Organization

      • Career

        • Always work on hard problems

        • In-demand skill


    Importance: Your Company

    • Make sense of disparate sources of info to gain competitive advantage

    • Reduce time/expense to understand large volumes of data

    • Enhance productivity

    • Mine connections/relationships



    Pieces of the Text Pie

    • Characters

      • Encoding, case, punctuation, accents, numbers

    • Tokens/Words

      • Segmentation, Parts of Speech, Stemming

    • Multi-word and Sentences

      • Phrases, parsing, sentence detection, co-reference resolution

    • Paragraphs

      • Summarization, meaning


    Pieces II

    • Document

      • Meaning

      • Reading Level

    • Multi-document/Corpus

      • Summaries, similar docs

    • You/Author

      • Beliefs, knowledge, culture, training


    Fields of Interest

    • Information Retrieval (IR)

    • Natural Language Processing (NLP)

    • Computational Linguistics

    • Math/Statistics

    • Artificial Intelligence

    • Biology


    Search and Text Analysis in the Real World

    • Focus on Search

      • Most robust, but far from perfect

    • Integrate others into Search platform


    Obtaining Text

    • It’s everywhere, but is it how we want it?

      • Crawl file/web

      • DBs

      • CMS

    • Many different file formats

      • Office, PDF, HTML, XML

      • Need to extract usable content


    Extracting Text

    • Many open source tools exist:

      • PDFBox, POI, TextMining, SAX, DOM, StAX, nekoHTML, HTMLParser, etc.

    • Use a framework instead of one-offs for each tool

      • Common API for all tools

      • Aperture: http://aperture.sourceforge.net/

        • Crawlers, extractors, RDF

      • Tika: http://incubator.apache.org/tika/

        • SAX-like plus metadata
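
To make the extraction step concrete, here is a minimal sketch that pulls plain text out of XML with the SAX parser that ships in the JDK. It is not the Aperture or Tika API; the class name SaxTextExtractor and the sample markup are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: keeps character content and drops the markup.
class SaxTextExtractor {

    static String extractText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // text between tags
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                text.append(' '); // keep words of adjacent elements apart
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        // Collapse whitespace runs into single spaces.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Taming Text</title><body>Search is hard.</body></doc>";
        System.out.println(extractText(xml));
    }
}
```

A framework such as Tika wraps this kind of handler behind one API for many formats, which is the point of the slide; the sketch only covers the XML case.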


    Text Applications

    • Find items that meet an information need

    • Identify important people, places, things

    • Fuzzy Strings

    • Categorization and Classification

    • Organize groups of documents

    • Answer questions

    • Much, much more:

      • Sentiment, Machine Translation, Summarization…



    Search Concepts

    • User inputs one or more keywords along with some operators and expects to get back a ranked list of documents relevant to the keywords

    • User sorts through the documents, reading/using those he thinks are most relevant

      • The user’s idea of relevant docs does not always match the search engine’s


    Making Content Searchable

    • Search engines generally:

      • Extract Tokens from Content

      • Optionally transform said tokens depending on needs

        • Stemming

        • Expand with synonyms (usually done at query time)

        • Remove token (stopword)

        • Other Text Analysis

        • Add metadata

      • Store tokens and related metadata (position, etc.) in a data structure optimized for searching

        • Called an Inverted Index
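
The steps above can be sketched in plain Java: extract tokens, lowercase them, drop stopwords, and store each remaining term with the documents and positions where it occurs. This is a toy illustration, not how Lucene stores its index; the class name InvertedIndexSketch, the stopword list, and the tokenization rule are assumptions, and stemming and synonyms are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class InvertedIndexSketch {
    // Hypothetical stopword list, chosen only for the example.
    private static final Set<String> STOPWORDS = Set.of("a", "an", "and", "is", "of", "the");

    // term -> (document id -> token positions within that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    void add(int docId, String content) {
        int position = 0;
        for (String token : content.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            int current = position++;                 // stopwords still consume a position
            if (STOPWORDS.contains(token)) continue;  // but are not stored
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(current);
        }
    }

    // Ids of the documents that contain the term.
    Set<Integer> search(String term) {
        Map<Integer, List<Integer>> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Set.of() : hits.keySet();
    }

    // Positions of the term inside one document.
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? List.of() : hits.getOrDefault(docId, List.of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "The quick brown fox");
        index.add(2, "A quick search of the index");
        System.out.println(index.search("quick"));
        System.out.println(index.positions("fox", 1));
    }
}
```

Looking up a term is a single map access, regardless of how many documents were indexed, which is why the structure suits search.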


    Libraries

    • Apache Lucene

    • Apache Solr

    • Sphinx

    • Minion

    • Xapian


    Apache Solr

    • Lucene-based Search server

      • HTTP-based, but many native clients

      • Lucene best practices

      • Replication/Distribution

      • Caching

      • Plug and Play extensions

    • http://lucene.apache.org/solr


    People, Places, Things

    http://news.yahoo.com/s/ap/20081013/ap_on_sp_fo_ne/fbn_cowboys_romo_10


    Named Entity Recognition

    • Identify people, places, things, numerical quantities

    • Approaches

      • Rule-based

        • Write rules to extract

        • Lists, gazetteers, others

      • Statistical

        • Annotate data and learn stats

        • Change domains, languages, etc.
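
As a rough illustration of the rule-based approach, the sketch below combines a tiny gazetteer (a list lookup) with one hand-written rule for numerical quantities. The class name GazetteerNerSketch, the gazetteer entries, and the output format are assumptions; a statistical tagger such as the one in OpenNLP would learn its decisions from annotated data instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GazetteerNerSketch {
    // Hypothetical gazetteer: surface form -> entity type.
    private static final Map<String, String> GAZETTEER = new LinkedHashMap<>();
    static {
        GAZETTEER.put("Grant Ingersoll", "PERSON");
        GAZETTEER.put("Tom Morton", "PERSON");
        GAZETTEER.put("Charlotte", "PLACE");
    }

    // Hand-written rule: integers and simple decimals count as quantities.
    private static final Pattern NUMBER = Pattern.compile("\\b\\d+(\\.\\d+)?\\b");

    // Returns "TYPE:text" strings, gazetteer matches first, then numbers.
    static List<String> tag(String text) {
        List<String> entities = new ArrayList<>();
        for (Map.Entry<String, String> entry : GAZETTEER.entrySet()) {
            Pattern name = Pattern.compile("\\b" + Pattern.quote(entry.getKey()) + "\\b");
            Matcher matcher = name.matcher(text);
            while (matcher.find()) {
                entities.add(entry.getValue() + ":" + matcher.group());
            }
        }
        Matcher numbers = NUMBER.matcher(text);
        while (numbers.find()) {
            entities.add("NUMBER:" + numbers.group());
        }
        return entities;
    }

    public static void main(String[] args) {
        System.out.println(tag("Grant Ingersoll spoke in Charlotte to 40 people"));
    }
}
```

The weakness is visible in the code: every new domain or language needs new lists and new rules, which is the motivation for the statistical approach.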


    Libraries

    • OpenNLP

    • MinorThird

    • Stanford NER

    • Mallet

    • LingPipe (dual license)

    • OpenCalais (dual license)


    OpenNLP

    • Maximum Entropy library

    • Parser

    • Chunker

    • Sentence Detection

    • NER


    Fuzzy Strings

    • Spell checking

    • Record Matching

      • Address book merging

      • US Census

    • Document/Question Similarity

      • Log analysis

      • De-duplication


    Strings

    • Algorithms

      • Edit Distance (Levenshtein)

      • Jaro-Winkler

      • Many others

    • Libraries

      • Regular Expressions

      • SecondString

      • Lucene Spell Checker (contrib)
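
For reference, Levenshtein edit distance can be computed with a short dynamic program. The sketch below is a textbook version written for this transcript (the class name EditDistanceSketch is an assumption), not the code used by SecondString or the Lucene spell checker.

```java
class EditDistanceSketch {

    // Minimum number of single-character inserts, deletes, and substitutions
    // needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // empty prefix of a -> first j chars of b: j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // first i chars of a -> empty string: i deletes
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + substitution);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(levenshtein("lucene", "lucine"));  // prints 1
    }
}
```

A spell checker can use this by suggesting dictionary words within a small distance of the misspelled input.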


    Organization: Classification

    http://www.dmoz.org


    C & C

    • Automatically label content based on one or more categories

      • Supervised

      • Unsupervised

    • Useful for:

      • Spam

      • Genre (sports, business, tech, etc.)
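
Supervised labeling of this kind can be sketched with a small multinomial Naive Bayes classifier with add-one smoothing. This is an illustrative toy, not Mahout's implementation; the class name NaiveBayesSketch, the tokenization, and the training sentences are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\W+");
    }

    // Record one labeled example.
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, l -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Pick the label with the highest log prior plus summed log term likelihoods.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            int labelTokens = tokensPerLabel.getOrDefault(label, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                int count = counts.getOrDefault(token, 0);
                // Add-one smoothing so unseen terms do not zero out a label.
                score += Math.log((count + 1.0) / (labelTokens + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch classifier = new NaiveBayesSketch();
        classifier.train("spam", "win cash prize now");
        classifier.train("spam", "cheap cash offer");
        classifier.train("ham", "meeting agenda for monday");
        classifier.train("ham", "lucene search index notes");
        System.out.println(classifier.classify("cash prize offer"));
    }
}
```

The same code labels genres instead of spam if it is trained on examples tagged sports, business, tech, and so on.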


    Libraries

    • Mahout

      • Naïve Bayes

      • Genetic

    • OpenNLP

    • libSVM

    • Neural Network implementations



    Clustering

    • Group similar content into clusters for easy browsing

    • Types:

      • Search Results

      • Documents

      • Data

    • Approaches:

      • K-Means

      • Mean-shift

      • Hierarchical
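
Of the approaches listed, K-Means is the simplest to show in code: assign each point to its nearest centroid, recompute the centroids, and repeat. The sketch below works on small numeric vectors and seeds the centroids with the first k points, which is a simplifying assumption; the class name KMeansSketch is also made up, and this is not the Mahout or Carrot2 implementation.

```java
import java.util.Arrays;

class KMeansSketch {

    // Returns, for each point, the index of the cluster it ends up in.
    static int[] cluster(double[][] points, int k, int iterations) {
        int dims = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // naive seeding: first k points
        }
        int[] assignment = new int[points.length];
        for (int iteration = 0; iteration < iterations; iteration++) {
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double distance = 0;
                    for (int d = 0; d < dims; d++) {
                        double diff = points[p][d] - centroids[c][d];
                        distance += diff * diff;
                    }
                    if (distance < best) {
                        best = distance;
                        assignment[p] = c;
                    }
                }
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // leave an empty cluster where it is
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {0, 1}, {10, 11}, {1, 0}};
        System.out.println(Arrays.toString(cluster(points, 2, 10)));
    }
}
```

For documents, the points would be term-weight vectors rather than two-dimensional coordinates, but the loop is the same.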


    Libraries

    • Carrot2 (search results)

      • Many different approaches

    • Mahout

      • k-Means

      • Mean Shift

      • Canopy

    • SOLR-769

      • https://issues.apache.org/jira/browse/SOLR-769

    • Various others


    Q & A

    http://www.answers.com/who%20is%20Bobby%20Orr%3F


    Question Answering

    • Find the answer to a question

      • Phrase, sentence, passage, document(s)

    • Combination of a lot of the previous parts

    • Difficult

      • Easier

        • Who is John Wayne?

      • Hard (impossible?):

        • What are the pros and cons of the bailout package?


    Libraries

    • QANDA

    • OpenEphyra

    • Taming Text

      • Future

      • Fact-based

      • Demo only

    • Some others


    Resources

    • http://lucene.apache.org

      • /solr

      • /java

      • /mahout

    • http://opennlp.sourceforge.net

    • http://project.carrot2.org/

    • [email protected]

      • a.o == apache.org


    Demo

    • Download from

    • Unzip

    • cd apache-solr-1.3.0/example

    • java -jar start.jar

    • In another terminal, cd example/exampledocs

    • java -jar post.jar *.xml


    Vector Space Model

    [Figure: document vector d1 and query vector q1, separated by angle Θ]

    • Goal: Identify documents that are similar to input query

    • Represent each word with a weight w

    • The words in the document and the query each define a Vector in an n-dimensional space

    • Common weighting scheme is called TF-IDF

      • TF = Term Frequency

      • IDF = Inverse Document Freq.

    • Intuition behind TF-IDF:

      • A term that frequently occurs in a few documents relative to the collection is more important than one that occurs in a lot of documents

    • Sim(q1, d1) = cos Θ

    $d_j = \langle w_{1,j}, w_{2,j}, \ldots, w_{n,j} \rangle$

    $q = \langle w_{1,q}, w_{2,q}, \ldots, w_{n,q} \rangle$

    where $w$ is the weight assigned to a term
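
The model above can be sketched in a few lines of Java: weight each term by TF-IDF, treat the weights as a sparse vector, and score a document against a query by the cosine of the angle between them. The class name VectorSpaceSketch, the smoothed IDF formula, and the sample corpus are assumptions; Lucene's actual scoring differs in its details.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class VectorSpaceSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    // Sparse TF-IDF vector for one text, with IDF taken from the corpus.
    static Map<String, Double> tfIdf(String text, List<String> corpus) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : tokenize(text)) {
            weights.merge(term, 1.0, Double::sum); // raw term frequency
        }
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            int docFrequency = 0;
            for (String doc : corpus) {
                if (tokenize(doc).contains(entry.getKey())) docFrequency++;
            }
            // Smoothed IDF (an assumption): rarer terms get larger weights.
            double idf = Math.log((1.0 + corpus.size()) / (1.0 + docFrequency)) + 1.0;
            entry.setValue(entry.getValue() * idf);
        }
        return weights;
    }

    // Sim(q, d) = cos Θ = (q · d) / (|q| |d|)
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0.0);
            normA += entry.getValue() * entry.getValue();
        }
        for (double weight : b.values()) {
            normB += weight * weight;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
                "apache lucene search library",
                "apache solr search server",
                "k means clustering");
        Map<String, Double> query = tfIdf("solr search", corpus);
        for (String doc : corpus) {
            System.out.println(doc + " -> " + cosine(query, tfIdf(doc, corpus)));
        }
    }
}
```

In this corpus "solr" appears in fewer documents than "search", so it carries more weight, and the Solr document scores highest for the query.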

