Search And Text Analysis

An Introduction to Java-based Open Source Tools and Techniques

Grant Ingersoll

October 15, 2008

Charlotte JUG

Overview
  • Background
    • Taming Text
    • Importance
  • Foundations
    • Language Basics
    • Obtaining Text
  • Tools for Search and Text Analysis
    • Concepts
    • Demos
  • Resources
Taming Text
  • http://www.manning.com/ingersoll
  • Grant Ingersoll
    • Lucene/Solr/Mahout committer
  • Tom Morton
    • OpenNLP author
  • Practical aspects of Text Analysis
    • Open Source libraries
    • No math :-)
  • In progress - Early Access Available
  • Code:
    • Solr as platform for enabling NLP
    • Open source
Quiz

1. Can you read this?

2. How about this?

3. How many emails did you get today?

4. How many websites/articles/books did you read today?

5. How many searches did you do? (Google, Y!, local, proprietary)

6. How much time did you spend doing all of these things?

7. How did you do it? Literally, what processes did you use?

Importance: Text is Hard!
  • “Numbers don’t lie!”
      • Unless you’re a politician?
  • Text “lies” all the time!
    • Even people disagree on it
  • Computer Expectations:
    • Be as good as people at a task (and people aren’t perfect)
Importance: Info Overload
  • IDC estimates:
    • We generated 161 exabytes of digital info in 2006
    • Even if a lot is non-textual, we deal with it by writing about it:
      • Tags, summaries, reports, closed captioning, etc.
    • Info Workers spend, in hours per week:
        • 14.5 hours reading and answering email
        • 13.3 creating documents
        • 9.6 searching for info
        • 9.5 analyzing info
Importance: Intelligent Web
  • The Web is driven by text
  • The present and future of the web are based on intelligence, as in
    • Human: a.k.a. “The Masses”
      • Ratings, reviews, connections
      • Pros: dynamic, ad hoc or guided
      • Cons: sloppy, cheating, ad hoc
    • Artificial:
      • Pros: Can do a good/decent job on a lot of things, cheaper than manual
      • Cons: Can be really hard, not always as good as people
  • Examples:
    • Google, Y!, Amazon, Facebook, LinkedIn, Blogs, Wikipedia, many, many startups
Importance: Text and You
  • You:
    • Personal Organization
      • Email, IM, Docs
      • Importance, Prioritization, Organization
    • Career
      • Always work on hard problems
      • In-demand skill
Importance: Your Company
  • Make sense of disparate sources of info to gain competitive advantage
  • Reduce time/expense to understand large volumes of data
  • Enhance productivity
  • Mine connections/relationships
Pieces of the Text Pie
  • Characters
    • Encoding, case, punctuation, accents, numbers
  • Tokens/Words
    • Segmentation, Parts of Speech, Stemming
  • Multi-word and Sentences
    • Phrases, parsing, sentence detection, co-reference resolution
  • Paragraphs
    • Summarization, meaning
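The character, token and sentence levels above can be illustrated with a minimal Python sketch that uses only the standard library. The function names are hypothetical, and the sentence splitter is deliberately naive (it will break on abbreviations such as "Dr."); the libraries covered later in the talk handle these cases far better.

```python
import re
import unicodedata


def normalize(text):
    """Case-fold the text and strip accents (combining marks)."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split normalized text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", normalize(text))


def split_sentences(text):
    """Rough sentence detection: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

For example, `tokenize("Crème brûlée, 2 servings!")` yields the tokens `creme`, `brulee`, `2` and `servings`.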
Pieces II
  • Document
    • Meaning
    • Reading Level
  • Multi-document/Corpus
    • Summaries, similar docs
  • You/Author
    • Beliefs, knowledge, culture, training
Fields of Interest
  • Information Retrieval (IR)
  • Natural Language Processing (NLP)
  • Computational Linguistics
  • Math/Statistics
  • Artificial Intelligence
  • Biology
Search and Text Analysis in the Real World
  • Focus on Search
    • Most robust, but far from perfect
  • Integrate others into Search platform
Obtaining Text
  • It’s everywhere, but is it how we want it?
    • Crawl file/web
    • DBs
    • CMS
  • Many different file formats
    • Office, PDF, HTML, XML
    • Need to extract usable content
Extracting Text
  • Many open source tools exist:
    • PDFBox, POI, TextMining, SAX, DOM, StAX, NekoHTML, HTMLParser, etc.
  • Use a framework instead of one-offs for each tool
    • Common API for all tools
    • Aperture: http://aperture.sourceforge.net/
      • Crawlers, extractors, RDF
    • Tika: http://incubator.apache.org/tika/
      • SAX-like plus metadata
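To make "extract usable content" concrete, here is a small Python sketch for the HTML case, using the standard library's event-driven `html.parser` (similar in spirit to the SAX-like style mentioned above). It is not how Tika or Aperture work internally; the class and function names are hypothetical, and the title stands in for "metadata".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and the <title>, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._stack = []  # open tags we track: title, script, style

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP or tag == "title":
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._stack and self._stack[-1] in self.SKIP:
            return  # ignore JavaScript and CSS
        if self._stack and self._stack[-1] == "title":
            self.title += text
        else:
            self.chunks.append(text)


def extract(html):
    """Return the page title (metadata) and the visible text (content)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "text": " ".join(parser.chunks)}
```

A framework adds value by exposing this same "content plus metadata" shape for every format, so the calling code does not change when the input is a PDF or an Office document.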
Text Applications
  • Find items that meet an information need
  • Identify important people, places, things
  • Fuzzy Strings
  • Categorization and Classification
  • Organize groups of documents
  • Answer questions
  • Much, much more:
    • Sentiment, Machine Translation, Summarization…
Search Concepts
  • User inputs one or more keywords along with some operators and expects to get back a ranked list of documents relevant to the keywords
  • User sorts through the documents, reading/using those he thinks are most relevant
    • The user’s idea of relevant docs does not always match the search engine’s
Making Content Searchable
  • Search engines generally:
    • Extract Tokens from Content
    • Optionally transform said tokens depending on needs
      • Stemming
      • Expand with synonyms (usually done at query time)
      • Remove tokens (stopwords)
      • Other Text Analysis
      • Add metadata
    • Store tokens and related metadata (position, etc.) in a data structure optimized for searching
      • Called an Inverted Index
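The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores its index: the stopword list is a made-up sample, the "stemmer" just strips a plural "s" (real engines use something like Porter stemming), and all names are hypothetical.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}  # illustrative sample only


def stem(token):
    """Toy stemmer: strips a trailing plural 's'. Not a real stemming algorithm."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Yield (position, term) pairs after tokenizing, stopword removal and stemming."""
    for position, token in enumerate(re.findall(r"\w+", text.lower())):
        if token not in STOPWORDS:
            yield position, stem(token)


def build_index(docs):
    """Build the inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in analyze(text):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return ids of docs that contain every query term (boolean AND, unranked)."""
    postings = [set(index.get(term, {})) for _, term in analyze(query)]
    return set.intersection(*postings) if postings else set()
```

Note that the query goes through the same `analyze` step as the documents; otherwise a search for "tokens" would never match the indexed term "token".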
Libraries
  • Apache Lucene
  • Apache Solr
  • Sphinx
  • Minion
  • Xapian
Apache Solr
  • Lucene-based Search server
    • HTTP-based, but many native clients
    • Lucene best practices
    • Replication/Distribution
    • Caching
    • Plug and Play extensions
  • http://lucene.apache.org/solr
People, Places, Things

http://news.yahoo.com/s/ap/20081013/ap_on_sp_fo_ne/fbn_cowboys_romo_10

Named Entity Recognition
  • Identify people, places, things, numerical quantities
  • Approaches
    • Rule-based
      • Write rules to extract
      • Lists, gazetteers, others
    • Statistical
      • Annotate data and learn stats
      • Change domains, languages, etc.
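As a rough illustration of the rule-based approach, the Python sketch below combines a tiny gazetteer with one hand-written pattern for a numerical quantity. The lists, labels and function name are all made up for the example; a statistical recognizer such as the ones in the libraries that follow would instead learn from annotated text.

```python
import re

# Hypothetical gazetteer; real systems load much larger lists.
GAZETTEER = {
    "PERSON": ["Tony Romo", "Bobby Orr"],
    "ORGANIZATION": ["Dallas Cowboys"],
    "LOCATION": ["Charlotte", "Dallas"],
}

# One hand-written rule for a numerical quantity (dollar amounts).
MONEY = re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?")


def find_entities(text):
    """Return (start, end, label, surface) tuples in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                found.append((m.start(), m.end(), label, m.group()))
    for m in MONEY.finditer(text):
        found.append((m.start(), m.end(), "MONEY", m.group()))
    # Sort by start offset, longer match first, and drop overlapping matches
    # so that "Dallas Cowboys" wins over "Dallas".
    found.sort(key=lambda e: (e[0], -(e[1] - e[0])))
    kept, last_end = [], -1
    for entity in found:
        if entity[0] >= last_end:
            kept.append(entity)
            last_end = entity[1]
    return kept
```

The weakness is visible immediately: any name missing from the lists is missed, which is why changing domains or languages is easier with the statistical approach.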
Libraries
  • OpenNLP
  • MinorThird
  • Stanford NER
  • Mallet
  • LingPipe (dual license)
  • OpenCalais (dual)
OpenNLP
  • Maximum Entropy library
  • Parser
  • Chunker
  • Sentence Detection
  • NER
Fuzzy Strings
  • Spell checking
  • Record Matching
    • Address book merging
    • US Census
  • Document/Question Similarity
    • Log analysis
    • De-duplication
Strings
  • Algorithms
    • Edit Distance (Levenshtein)
    • Jaro-Winkler
    • Many others
  • Libraries
    • Regular Expressions
    • SecondString
    • Lucene Spell Checker (contrib)
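Edit distance is short enough to show in full. The Python sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and uses it for a naive spell-check suggestion. The `suggest` helper and its threshold are assumptions for illustration; it is not how the Lucene spell checker works.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance edits, closest first."""
    scored = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for distance, w in scored if distance <= max_distance]
```

Comparing a word against every dictionary entry is fine for a demo but too slow for a large vocabulary, which is one reason to reach for a library.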
Organization: Classification

http://www.dmoz.org

Categorization & Classification (C & C)
  • Automatically label content based on one or more categories
    • Supervised
    • Unsupervised
  • Useful for:
    • Spam
    • Genre (sports, business, tech, etc.)
Libraries
  • Mahout
    • Naïve Bayes
    • Genetic
  • OpenNLP
  • libSVM
  • Neural Network implementations
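To show what a supervised classifier such as Naïve Bayes does with labeled examples, here is a compact Python sketch of a multinomial Naïve Bayes with add-one smoothing. It is a teaching sketch with hypothetical names, not Mahout's implementation, and the training sentences in the usage note are invented.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training docs
        self.term_counts = defaultdict(Counter)  # label -> term -> occurrences
        self.vocabulary = set()

    @staticmethod
    def tokens(text):
        return re.findall(r"\w+", text.lower())

    def train(self, text, label):
        self.doc_counts[label] += 1
        for term in self.tokens(text):
            self.term_counts[label][term] += 1
            self.vocabulary.add(term)

    def classify(self, text):
        """Return the label with the highest log-probability, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for term in self.tokens(text):
                score += math.log((counts[term] + 1) / denominator)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage: call `train("win cash now", "spam")` and `train("meeting agenda for monday", "ham")` a few times with labeled text, then `classify("win cheap cash")` returns the more probable label.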
Clustering
  • Group similar content into clusters for easy browsing
  • Types:
    • Search Results
    • Documents
    • Data
  • Approaches:
    • K-Means
    • Mean-shift
    • Hierarchical
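K-Means is the simplest of these approaches to sketch. The Python below alternates the two classic steps (assign each point to its nearest centroid, then move each centroid to the mean of its members). As an assumption made for repeatability, it seeds the centroids with the first k points; real implementations usually choose random or smarter seeds. In a text setting each point would be a document's term-weight vector.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Cluster points into k groups; returns (assignments, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: first k points as seeds
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids
```

The number of clusters k has to be chosen up front, which is one of the practical differences from approaches such as mean-shift or hierarchical clustering.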
Libraries
  • Carrot2 (search results)
    • Many different approaches
  • Mahout
    • k-Means
    • Mean Shift
    • Canopy
  • SOLR-769
    • https://issues.apache.org/jira/browse/SOLR-769
  • Various others
Q & A

http://www.answers.com/who%20is%20Bobby%20Orr%3F

Question Answering
  • Find the answer to a question
    • Phrase, sentence, passage, document(s)
  • Combination of a lot of the previous parts
  • Difficult
    • Easier:
      • Who is John Wayne?
    • Hard (impossible?):
      • What are the pros and cons of the bailout package?
Libraries
  • QANDA
  • OpenEphyra
  • Taming Text
    • Future
    • Fact-based
    • Demo only
  • Some others
Resources
  • http://lucene.apache.org
    • /solr
    • /java
    • /mahout
  • http://opennlp.sourceforge.net
  • http://project.carrot2.org/
  • gsingers@a.o
    • a.o == apache.org
Demo
  • Download from
  • Unzip
  • cd apache-solr-1.3.0/example
  • java -jar start.jar
  • In another terminal, cd apache-solr-1.3.0/example/exampledocs
  • java -jar post.jar *.xml
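Once the example documents are posted, the index can be queried over HTTP. The Python sketch below assumes the example server's usual defaults (port 8983 and the `/solr/select` handler) and asks for a JSON response with `wt=json`; adjust the URL if your setup differs. The function names are hypothetical, and `search` only works while the server from the steps above is running.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed default address of the Solr example server.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_query_url(query, rows=10, base=SOLR_SELECT):
    """Build a select URL that asks Solr for a JSON response."""
    return base + "?" + urlencode({"q": query, "rows": rows, "wt": "json"})


def search(query, rows=10):
    """Run the query against a running Solr instance and return the matching docs."""
    with urlopen(build_query_url(query, rows)) as response:
        return json.load(response)["response"]["docs"]
```

The same URL can be pasted into a browser, which is a quick way to confirm that the documents were indexed.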
Vector Space Model

[Figure: document vector d1 and query vector q1 with the angle Θ between them]
  • Goal: Identify documents that are similar to input query
  • Represent each word with a weight w
  • The words in the document and the query each define a Vector in an n-dimensional space
  • Common weighting scheme is called TF-IDF
    • TF = Term Frequency
    • IDF = Inverse Document Freq.
  • Intuition behind TF-IDF:
    • A term that occurs often in a document, but in only a few documents across the collection, is more important than one that occurs in a lot of documents
  • Sim(q1, d1) = cos Θ

  • $d_j = \langle w_{1,j}, w_{2,j}, \ldots, w_{n,j} \rangle$
  • $q = \langle w_{1,q}, w_{2,q}, \ldots, w_{n,q} \rangle$
  • $w$ = weight assigned to term
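The model can be tried out with a short Python sketch: weight each term by TF-IDF, then rank documents by the cosine of the angle between the query vector and each document vector. The IDF formula used here, log(N / df), is one common variant among several and is an assumption, as are the function names; Lucene's actual scoring formula has more factors.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def idf_table(docs):
    """idf(t) = log(N / df(t)), where df is the number of docs containing t."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(n / count) for term, count in df.items()}


def vectorize(text, idf):
    """Sparse vector {term: tf * idf}; terms unseen in the collection are dropped."""
    tf = Counter(tokenize(text))
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return (score, doc) pairs sorted by decreasing similarity to the query."""
    idf = idf_table(docs)
    q = vectorize(query, idf)
    scored = [(cosine(q, vectorize(doc, idf)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Terms that appear in every document (such as "is" in a small English collection) get an IDF of zero, so they contribute nothing to the similarity, which matches the intuition stated above.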