Text-Mining: analysis of text data

Dunja Mladenić

J.Stefan Institute, Ljubljana, Slovenia and Carnegie Mellon University, USA

http://www-ai.ijs.si/DunjaMladenic/ http://www.cs.cmu.edu/~dunja/

Web user profiling
  • imagine the user browsing the Web, most of the time by clicking hyperlinks
  • goal: provide help by highlighting the clicked hyperlinks (we assume that the user is clicking on interesting hyperlinks)
    • induce a profile for each user separately
    • the profile can be used to predict clicking on hyperlinks (in our case), to collect interesting Web-pages, to compare different users and share knowledge between them (collaborative agents)
Structure of the personal browsing assistant - Personal WebWatcher

[Diagram labels: The Web, User profile, modified page, Personal WebWatcher]

Personal WebWatcher in action (1996)

Highlight interesting hyperlinks

Data Pyramid (levels, top to bottom)
  • Knowledge plus experience
  • Information plus rules
  • Data plus context
What is Data Mining?
  • Data mining (knowledge discovery in databases - KDD, business intelligence):
    • finding interesting (non-trivial, hidden, previously unknown and potentially useful) regularities in large datasets
      • “Say something interesting about the data.”
      • “Describe this data.”
Data Mining: Potential usage
  • Market analysis
  • Risk analysis
  • Fraud detection
  • Text Mining
  • Web Mining
  • ...
Why text analysis?
  • The amount of text data on electronic media is growing daily
    • e-mail, business documents, the Web, organized databases of documents,...
  • There is a lot of information contained in the text
  • Available methods and approaches make it possible to solve interesting and non-trivial problems
Problem description (I)
  • Text information filtering
  • Help with browsing the Web
  • Generation and analysis of user profiles
  • Automatic document categorization and keyword assignment to documents
  • Document clustering
  • Document visualization
  • Document authorship detection
  • Document copying identification
  • Language identification in text
Document categorization

[Figure labels: labeled documents, Document Classifier, unlabeled document, document category]

Automatic document categorization
  • Problem: we are given a set of content categories filled with documents.
  • Goal: automatically insert a new document, i.e., assign one or more relevant categories to it.
  • Content categories can be structured (e.g., Yahoo, Medline) or unstructured (e.g., Reuters)
  • The problem is similar to assigning keywords to documents

Document to categorize: CFP for CoNLL-2000

Some predicted

Our approach to document categorization
  • Data is obtained from an existing collection of manually categorized documents in which the content categories are structured
  • Using Text Mining methods, we constructed a model that captures the manual work of the editors
  • The model is used to automatically assign content categories and the corresponding keywords to new, previously unseen documents
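
The slides do not say which classifier the model uses. As a minimal sketch only, the following assumes a multinomial naive Bayes classifier over whitespace-separated words; the function names `train_nb` and `rank_categories` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: iterable of (category, text) pairs from a manually labeled collection."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    for category, text in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts

def rank_categories(text, word_counts, doc_counts):
    """Return categories ordered from most to least probable for a new document."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)  # log prior
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing is an assumption of this sketch.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[category] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Returning a ranked list rather than a single label matches the goal of assigning one or more relevant categories: a caller can keep the top few.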

System architecture (figure labels):
  • labeled documents (from Yahoo! hierarchy)
  • Feature construction: vectors of n-grams
  • Subproblem definition
  • Feature selection
  • Classifier construction
  • Document Classifier: unlabeled document in, document category (label) out
Summary of experiments and results
  • learning from categorization hierarchy: considering only promising categories during the classification (5%-15% of categories)
  • extended document representation: new features for sequences of two words
  • feature subset selection: Odds ratio using 50-100 best features (0.2%-5%)
More can be found at our project page
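
Two of the points above, word-pair features and Odds ratio feature selection, can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the add-one smoothing, the document-level probability estimates, and the helper names `features`, `odds_ratio_scores` and `top_features` are all hypothetical.

```python
import math

def features(text):
    """Single words plus sequences of two adjacent words."""
    words = text.lower().split()
    return set(words) | {" ".join(pair) for pair in zip(words, words[1:])}

def odds_ratio_scores(pos_docs, neg_docs):
    """Log odds ratio of each feature for the positive class versus the negative class."""
    pos_sets = [features(d) for d in pos_docs]
    neg_sets = [features(d) for d in neg_docs]
    vocab = set().union(*pos_sets, *neg_sets)
    scores = {}
    for f in vocab:
        # Smoothed probability that a document of each class contains the feature.
        p_pos = (sum(f in s for s in pos_sets) + 1) / (len(pos_sets) + 2)
        p_neg = (sum(f in s for s in neg_sets) + 1) / (len(neg_sets) + 2)
        scores[f] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores

def top_features(pos_docs, neg_docs, k):
    """Keep only the k best-scoring features."""
    scores = odds_ratio_scores(pos_docs, neg_docs)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Features that occur mostly in the positive class get large positive scores, so keeping the top k favours features characteristic of the target category.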




Document authorship detection
  • Problem: based on a database of documents and authors, assign the most probable author to a new document
  • Solution is based on the fact that each author uses a characteristic frequency distribution over words and phrases
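
One way to act on this observation is to compare relative word frequencies. The sketch below assumes whitespace tokenization and cosine similarity as the comparison measure; neither is stated in the slides, and phrases are left out for brevity.

```python
import math
from collections import Counter

def profile(text):
    """Relative frequency of each word in the text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(p, q):
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def most_probable_author(text, docs_by_author):
    """docs_by_author: author name -> list of documents known to be by that author."""
    target = profile(text)
    profiles = {a: profile(" ".join(docs)) for a, docs in docs_by_author.items()}
    return max(profiles, key=lambda a: cosine(target, profiles[a]))
```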
Document copying identification
  • Problem: predict the probability that a given document was copied (partially or completely) from one or more other documents in our database
  • The algorithm uses complex indexing methods on parts of documents (of different lengths) and compares them against the given document
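
The slides give no details of the indexing. A much simpler stand-in, assuming word shingles of two fixed lengths as the "parts of documents", might look like this; the result is an overlap score per source document, not a calibrated probability.

```python
from collections import defaultdict

def shingles(text, k):
    """All runs of k consecutive words in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def build_index(database, lengths=(3, 5)):
    """database: doc_id -> text. Maps every shingle to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in database.items():
        for k in lengths:
            for s in shingles(text, k):
                index[s].add(doc_id)
    return index

def copied_fraction(text, index, lengths=(3, 5)):
    """Per source document, the fraction of the given document's shingles found in it."""
    parts = set().union(*(shingles(text, k) for k in lengths))
    hits = defaultdict(int)
    for s in parts:
        for doc_id in index.get(s, ()):
            hits[doc_id] += 1
    return {doc_id: n / len(parts) for doc_id, n in hits.items()}
```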
Natural language identification
  • Text data analysis systems commonly use some natural-language-dependent methods
  • This creates a need to identify the natural language a document is written in
  • Problem: for a given text, identify the natural language it is written in, selecting among the predefined languages
Algorithm for natural language identification
  • Basic algorithms are simple: for each language, build a characteristic frequency table of pairs and triples of letters, which can then be used to identify a document's language (TextCat, a publicly available system, covers 60 languages)
  • The problem is with short documents: in this case we can use mechanisms for detecting language-dependent stop-words (stop-words are frequent in all languages)
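
A minimal sketch of the frequency-table idea follows. It compares letter pair and triple counts by cosine similarity, which is an assumption made for brevity and not a description of how TextCat scores languages; the training texts a caller supplies are likewise placeholders.

```python
import math
from collections import Counter

def ngram_table(text, sizes=(2, 3)):
    """Frequency table of letter pairs and triples (whitespace normalized)."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for n in sizes for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(c * b[g] for g, c in a.items())  # Counter returns 0 for missing n-grams
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def identify_language(text, tables):
    """tables: language name -> ngram_table built from sample text in that language."""
    doc = ngram_table(text)
    return max(tables, key=lambda lang: cosine(doc, tables[lang]))
```

With very short inputs the n-gram table has few entries, which is where the stop-word check mentioned above would take over.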
Problem description (II)
  • Topic identification and tracking in time series of documents
  • Document indexing based on content and not only keywords
  • Content segmentation of text
  • Document summarization
  • Link analysis
  • Information extraction
Topic identification and tracking in time series of documents
  • Problem: given a time-ordered sequence of documents (news), we want to:
    • identify each document that introduces a new topic
    • among newly arriving documents, identify those about existing topics and connect them into a topic sequence
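
Both sub-tasks can be sketched with a single pass over the sequence. The version below assumes bag-of-words vectors, cosine similarity and a fixed similarity threshold; these choices, and the threshold value, are illustrative and not taken from the slides.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(c * b[w] for w, c in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def track_topics(documents, threshold=0.3):
    """Process documents in arrival order; return topics as lists of document indices."""
    topics = []     # topic sequences
    centroids = []  # accumulated word counts per topic
    for i, text in enumerate(documents):
        vec = Counter(text.lower().split())
        sims = [cosine(vec, c) for c in centroids]
        if sims and max(sims) >= threshold:
            best = sims.index(max(sims))
            topics[best].append(i)        # continues an existing topic
            centroids[best].update(vec)
        else:
            topics.append([i])            # this document introduces a new topic
            centroids.append(Counter(vec))
    return topics
```

The first index in each returned list is the document that introduced the topic.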
Text segmentation based on content
  • Problem: divide text that has no given structure (table of contents, paragraphs, etc.) into segments with similar content
  • Example applications:
    • topic tracking in news (spoken news)
    • identification of topics in large, unstructured text databases
Algorithm for text segmentation
  • Algorithm:
    • Divide text into sentences
    • Represent each sentence with words and phrases it contains
    • Calculate similarity between the pairs of sentences
    • Find a segmentation (sequence of delimiters) that maximizes the similarity between sentences inside the same segment and minimizes the similarity between segments
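
The four steps can be sketched as below. The last step is simplified: instead of searching for an optimal segmentation, this version assumes the number of segments is known and cuts at the weakest similarities between neighbouring sentences. Words only are used, not phrases, and the sentence splitter is a naive regular expression.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    dot = sum(c * b[w] for w, c in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def segment(text, num_segments):
    # Step 1: divide text into sentences.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Step 2: represent each sentence by the words it contains.
    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    # Step 3: similarity across each gap between neighbouring sentences.
    gaps = [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    # Step 4 (greedy simplification): place delimiters at the weakest gaps.
    weakest = sorted(range(len(gaps)), key=gaps.__getitem__)[:num_segments - 1]
    segments, start = [], 0
    for cut in sorted(weakest):
        segments.append(sentences[start:cut + 1])
        start = cut + 1
    segments.append(sentences[start:])
    return segments
```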
Text Summarization
  • Task: given a text document, create a summary reflecting the document’s contents
  • Three main phases:
    • Analyzing the source text
    • Determining its important points
    • Synthesizing an appropriate output
  • Most methods adopt a linear weighting model in which each text unit (sentence) is assessed by:
    • Weight(U) = LocationInText(U) + CuePhrase(U) + Statistics(U) + AdditionalPresence(U)
  • The output consists of the top-weighted text units (sentences)
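
The slides name the four terms of the weighting but not how each is computed. In the sketch below every scorer is an assumption: location rewards the first sentence, the cue-phrase list is hypothetical, statistics is a normalized term frequency, and additional presence is overlap with the title.

```python
import re
from collections import Counter

CUE_PHRASES = ("in conclusion", "in summary", "we propose")  # hypothetical list

def summarize(title, text, num_sentences=2):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    tf = Counter(w for words in tokenized for w in words)
    title_words = set(re.findall(r"\w+", title.lower()))

    def weight(i):
        words = tokenized[i]
        location = 1.0 if i == 0 else 0.0
        cue = 1.0 if any(p in sentences[i].lower() for p in CUE_PHRASES) else 0.0
        statistics = sum(tf[w] for w in words) / (len(words) * max(tf.values())) if words else 0.0
        presence = len(title_words & set(words)) / len(title_words) if title_words else 0.0
        # Weight(U) = LocationInText(U) + CuePhrase(U) + Statistics(U) + AdditionalPresence(U)
        return location + cue + statistics + presence

    top = sorted(range(len(sentences)), key=weight, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]  # keep original sentence order
```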
Information extraction
  • Collect a set of Home pages from the Web and build a “soft” database of people (name, address, coworkers, research areas and publications, biography...)
  • Collect electronic seminar announcements and extract location (room number), start and end time, name of the speaker
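
For the seminar example, a rule-based sketch using regular expressions is shown below. The announcement layout it expects (a `Speaker:` line, the word "Room" before the room number, times written as `h:mm`) is an assumption made for illustration; real announcements vary widely, and the slides do not say how the extraction was done.

```python
import re

def extract_seminar(announcement):
    """Extract location, start and end time, and speaker from one announcement."""
    room = re.search(r"\b[Rr]oom\s+(\w+)", announcement)
    times = re.findall(r"\b(\d{1,2}:\d{2}\s*(?:[AaPp][Mm])?)", announcement)
    speaker = re.search(r"^Speaker:\s*(.+)$", announcement, re.MULTILINE)
    return {
        "location": room.group(1) if room else None,
        "start_time": times[0].strip() if times else None,
        "end_time": times[1].strip() if len(times) > 1 else None,
        "speaker": speaker.group(1).strip() if speaker else None,
    }
```

Fields that cannot be found are returned as `None`, so a caller can tell a missing value from an empty one.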
Where are we now?
  • Growing interest and need for handling large collections of text
  • The area has been active in Slovenia for over 5 years, with strong international connections
    • joint R&D projects with Microsoft Research and with European and American research institutions; cooperation with Boeing
  • Organization of international events focused on Text Mining (ICML-99, KDD-2000, ICDM-2001)
Instead of conclusions...
  • Text Mining enables solving some problems that are often not expected to be addressed by computers:
    • document authorship detection, identification of related content or finding “interesting” people, document segmentation and organization, automatic collection of officer names for the selected sector companies, finding experts in some area, who is involved with whom (discovering social networks), ...
To find more information check:
  • Research papers at <http://www.researchindex.com>
  • KDD-2000 Text Mining Workshop <http://www.cs.cmu.edu/~dunja/WshKDD2000.html>
  • ECAI-2000 ML for Information Extraction <http://www.dcs.shef.ac.uk/~fabio/ecai-workshop.html>
  • PRICAI-2000 Text and Web Mining Workshop <http://textmining.krdl.org.sg/cfp.html>
  • IJCAI-2001 Adaptive Text Extraction and Mining Workshop <http://www.smi.ucd.ie/ATEM2001/>, Text Learning: Beyond Supervision <http://www.cs.cmu.edu/~mccallum/textbeyond/>
  • ICDM-2001 Text Mining Workshop <http://www-ai.ijs.si/DunjaMladenic/TextDM01/>
  • ECML/PKDD-2001 Text Mining tutorial <http://www-ai.ijs.si/DunjaMladenic/TextDM01/Tutorial.ps>
Link Analysis
  • Mechanisms for detecting which vertices in the graph (pages on the web) are more important on the basis of link structure:
    • HITS algorithm (Hubs & Authorities) (Kleinberg 1998)
    • PageRank (Page 1999) weighting (used by Google to better rank good pages)
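
A compact sketch of the HITS iteration on a directed graph given as a list of (source, target) links. The fixed iteration count and the Euclidean normalization are choices made for this illustration.

```python
import math

def hits(edges, iterations=50):
    """edges: list of (source, target) links. Returns (hub_scores, authority_scores)."""
    nodes = {n for edge in edges for n in edge}
    hubs = dict.fromkeys(nodes, 1.0)
    auths = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority score: sum of hub scores of the nodes linking in.
        auths = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            auths[dst] += hubs[src]
        # Hub score: sum of authority scores of the nodes linked to.
        new_hubs = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_hubs[src] += auths[dst]
        hubs = new_hubs
        # Normalize both score vectors so the values stay bounded.
        for scores in (hubs, auths):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hubs, auths
```

Nodes with high authority scores are those pointed to by good hubs, which is the sense of "more important" used here.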
Link analysis on Amazon data
  • We downloaded product pages from Amazon.com web site:
    • …products are connected with cross-sell relation (“customers who bought this product also bought following products…”)
    • 130,000 books and 32,000 music CDs connected into a graph
  • Question: which products (books or CDs) are the most important?
  • We used the HITS algorithm to calculate the weights
    • Harry Potter & Beatles won the test.
Popular books
  • Harry Potter and the Goblet of Fire (Book 4): J K Rowling, Mary Grandpre
  • The Beatles Anthology: The Beatles, Paul McCartney, George Harrison, Ringo Starr, Lennon, John Lennon
  • Prodigal Summer: Barbara Kingsolver
  • Harry Potter and the Sorcerer's Stone (Book 1): J K Rowling
  • The Mark : The Beast Rules the World (Left Behind #8): Tim LaHaye, Jerry B Jenkins
  • Harry Potter and the Chamber of Secrets (Book 2): J K Rowling
  • Harry Potter and the Prisoner of Azkaban (Book 3): J K Rowling, Mary Grandpre
  • The Sibley Guide to Birds (Audubon Society Nature Guides Ser.): David Allen Sibley
  • ....
Popular CDs
  • The Beatles
  • A Day Without Rain: Enya
  • Lovers Rock: Sade
  • All That You Can't Leave Behind: U2
  • Riding With The King: Eric Clapton, BB King
  • Black and Blue: Backstreet Boys
  • Sailing To Philadelphia: Mark Knopfler
  • You're The One: Paul Simon
  • Kid A: Radiohead
  • Music: Madonna
  • Red Dirt Girl: Emmylou Harris
  • Renee Fleming
  • ...