
7. Text Mining


Dhanamma




Presentation Transcript


  1. Advanced Analytical Methods: 4. Text Mining / D S Jagli

  2. Contents to be Covered • Text Analysis • Text Analysis Steps • A Text Analysis Example • Collecting Raw Text and Representing Text • TF and TFIDF • Categorizing Documents by Topics • Determining Sentiments • Time Series Analytics: Overview and the ARIMA Model

  3. Introduction • Text analysis, sometimes called text analytics, refers to the representation, processing, and modeling of textual data to derive useful insights. • An important component of text analysis is text mining, the process of discovering relationships and interesting patterns in large text collections. • Text analysis often deals with textual data that is far more complex than structured data. • A corpus (plural: corpora) is a large collection of texts used for various purposes in Natural Language Processing (NLP).

  4. Introduction • Text mining turns text data into high-quality information or actionable knowledge. – It minimizes human effort in consuming text data. – It supplies knowledge for optimal decision making. • Text mining is related to text retrieval, which is an essential component of any text mining system. – Text retrieval can be a preprocessor for text mining. – Text retrieval is needed for knowledge provenance.

  5. Motivation: Harnessing Big Text Data • Text data is ubiquitous and growing rapidly: blogs, the Internet, news, emails, literature, Twitter. • Harnessing this text data feeds many applications and yields knowledge.

  6. Text vs. Non-Text Data: Humans as Subjective “Sensors”

  7. Challenges with text data • Text data presents some challenges for people. Scientific literature and social media such as Twitter are representative examples. • It is very hard for anyone to digest all of this text quickly; it is impossible, for example, for scientists to read all of the published literature, or for anyone to read all of the tweets. • So there is a need for tools to help people digest text data more efficiently.

  8. General problem of Data Mining

  9. Landscape of Text Mining and Analytics

  10. Techniques for text data • The main techniques for harnessing big text data are: • Text Retrieval • Text Mining

  11. Basic Concepts in NLP

  12. NLP • Natural language is designed to make human communication efficient. As a result: – We omit a lot of common-sense knowledge, which we assume the hearer/reader possesses. – We keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve. • This makes EVERY step in NLP hard. – Ambiguity is a killer! – Common-sense reasoning is a prerequisite.


  14. 1. Text Analysis Steps

  15. Text Analysis Steps • A text analysis problem usually consists of three important steps: • Parsing • Search and retrieval • Text mining • Other subtasks include discourse analysis and segmentation.

  16. Parsing • Parsing is the process that takes unstructured text and imposes a structure on it for further analysis. • Example inputs: a plain text file, a weblog, an Extensible Markup Language (XML) file, a Hypertext Markup Language (HTML) file, or a Word document. • Parsing deconstructs the provided text and extracts it in a more structured way for the subsequent steps, turning unstructured text into structured text.
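As a concrete illustration of the parsing step, here is a minimal sketch that uses Python's standard html.parser module to pull the text content out of a small HTML snippet; the TextExtractor class and the snippet are illustrative assumptions, not part of the course material.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML page, discarding the markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

extractor = TextExtractor()
extractor.feed("<html><body><h1>bPhone review</h1>"
               "<p>Great battery life.</p></body></html>")
print(extractor.chunks)   # ['bPhone review', 'Great battery life.']
```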

  17. Search and retrieval • Search and retrieval is the identification of the documents in a corpus that contain search items (key terms). • Example: specific words, phrases, topics, or entities such as people or organizations. • Search and retrieval originated in the field of library science and is now used extensively by web search engines.

  18. Text mining • Text mining uses the terms and indexes produced by the prior two steps to discover meaningful insights relating to domains or problems of interest. • With the proper representation of the text, many clustering and classification techniques can be adapted to text mining. • Examples: k-means clustering, sentiment analysis, spam filtering. • Text mining may utilize methods and techniques from various fields, such as statistical analysis, information retrieval, data mining, and natural language processing.

  19. Part-of-Speech (POS) Tagging, Lemmatization, and Stemming • The goal of POS tagging is to build a model whose input is a sentence and whose output is a tag sequence. • Example: “he saw a fox” is tagged PRP VBD DT NN. • In the Penn Treebank POS tag set, the four words are mapped to pronoun (personal), verb (past tense), determiner, and noun (singular), respectively.
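A minimal sketch of POS tagging with NLTK, reproducing the slide's example; the resource names passed to nltk.download can vary slightly between NLTK versions.

```python
import nltk

# Models for tokenization and the Penn Treebank-style tagger
# (resource names may differ slightly across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("he saw a fox")
print(nltk.pos_tag(tokens))
# Expected: [('he', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('fox', 'NN')]
```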

  20. Lemmatization • Both lemmatization and stemming are techniques to reduce the number of dimensions and reduce variant forms to the base form, so as to more accurately measure the number of times each word appears. • With the use of a given dictionary, lemmatization finds the correct dictionary base form of a word. • Example: “obesity causes many problems” becomes “obesity cause many problem”. • Unlike lemmatization, stemming does not need a dictionary.
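The slide's example can be reproduced with a dictionary-based lemmatizer; the sketch below uses NLTK's WordNetLemmatizer, which is one possible implementation rather than the one the slides assume.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # dictionary used by the lemmatizer

lemmatizer = WordNetLemmatizer()
words = "obesity causes many problems".split()
print(" ".join(lemmatizer.lemmatize(w) for w in words))
# Expected: "obesity cause many problem"
```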

  21. Stemming • Stemming refers to a crude process of stripping affixes based on a set of heuristics to reduce variant forms. After the process, words are stripped down to stems. • A stem is not necessarily an actual word defined in the natural language, but it is sufficient to differentiate itself from the stems of other words. • A well-known rule-based stemming algorithm is Porter’s stemming algorithm. It defines a set of production rules to iteratively transform words into their stems. • Example: “obesity causes many problems” becomes “obes caus mani problem”.
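A short sketch of the same example using NLTK's implementation of Porter's stemming algorithm; the exact output can differ slightly between Porter stemmer implementations.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = "obesity causes many problems".split()
print(" ".join(stemmer.stem(w) for w in words))
# Expected: "obes caus mani problem"
```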

  22. 2. A Text Analysis Example

  23. Text Analysis Example • Consider the fictitious company ACME, maker of two products: bPhone and bEbook. • ACME is in strong competition with other companies that manufacture and sell similar products. To succeed, ACME needs to produce excellent phones and eBook readers and increase sales. • One of the ways the company does this is to monitor what is being said about ACME products in social media. In other words, what is the buzz on its products? ACME wants to search all that is said about its products on social media sites, such as Twitter and Facebook, and popular review sites, such as Amazon and Consumer Reports.

  24. Text Analysis Example • Are people mentioning its products? • What is being said? Are the products seen as good or bad? • If people think an ACME product is bad, why? For example, are they complaining about the battery life of the bPhone or the response time of their bEbook?

  25. Text Analysis Process

  26. Text Analysis Process Steps • Collect raw text • Represent text • Compute the usefulness of each word in the reviews • Categorize documents by topics • Determine sentiments of the reviews • Review the results and gain greater insights

  27. Collect raw text • The data science team investigates the problem, understands the necessary data sources, and formulates initial hypotheses (Data Analytics Lifecycle). • Data must be collected before anything else can happen. • The data science team starts by actively monitoring various websites for user-generated content. • The user-generated content being collected could be related articles from news portals and blogs, comments on ACME’s products from online shops or review sites, or social media posts that contain the keywords bphone or bebook.

  28. Collect raw text • Regardless of the data source, the team will be dealing with semi-structured data such as HTML web pages, Really Simple Syndication (RSS) feeds, XML, or JavaScript Object Notation (JSON) files. • Enough structure needs to be imposed to find the part of the raw text that the team really cares about. • Many news portals and blogs provide data feeds in an open standard format, such as RSS or XML. • Regular expressions can find words and strings that match particular patterns in the text effectively and efficiently.
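As a minimal sketch of the regular-expression idea, the snippet below keeps only the posts that mention an ACME product; the posts and the pattern are hypothetical.

```python
import re

# Hypothetical raw posts pulled from a feed.
posts = [
    "Just got the bPhone5x, coverage everywhere!",
    "My old NBook still works fine.",
    "The bEbook screen is hard to read.",
]

# Match any token starting with "bphone" or "bebook", case-insensitively.
product_pattern = re.compile(r"\b(bphone\w*|bebook\w*)\b", re.IGNORECASE)

relevant = [p for p in posts if product_pattern.search(p)]
print(relevant)   # keeps only the bPhone5x and bEbook posts
```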

  29. Example Regular Expressions • If one chooses not to build a data collector from scratch, companies such as GNIP [9] and DataSift offer data collection or data reselling services. • Depending on how the fetched raw data will be used, the data science team needs to be careful not to violate the rights of the owner of the information.

  30. Represent Text • In this data representation step, raw text is first transformed with text normalization techniques: • Tokenization • Case folding • Tokenization, or tokenizing, is the task of separating words from the body of text. • Raw text is converted into a collection of tokens after tokenization, where each token is generally a word. • A common approach is tokenizing on spaces, as the examples on the next slide show.

  31. Represent Text: Tokenization • A common approach is tokenizing on spaces. • Example: “Text analysis sometimes called text analytics” becomes {Text, analysis, sometimes, called, text, analytics}. • Another way is to tokenize the text based on punctuation marks and spaces. • Example: “Data Science and Big Data Analytics,” has become well accepted across academia and the industry. becomes {“, Data, Science, and, Big, Data, Analytics, ,, ”, has, become, well, accepted, across, academia, and, the, industry, .}
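A small sketch contrasting the two tokenization strategies above, using plain Python; real systems typically rely on a dedicated tokenizer rather than these one-liners.

```python
import re

text = ('"Data Science and Big Data Analytics," has become well accepted '
        'across academia and the industry.')

# Tokenizing on spaces keeps punctuation attached to the neighboring words.
space_tokens = text.split()

# Tokenizing on punctuation marks and spaces emits punctuation as separate tokens.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

print(space_tokens)
print(punct_tokens)
```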

  32. Tokenization • Tokenizing based on punctuation marks might not be well suited to certain scenarios. • Example: “we'll” becomes “we” and “ll”; “can't” becomes “can” and “t”. • Tokenization is a much more difficult task than one may expect. • Example: state-of-the-art, Wi-Fi. • It is safe to say that there is no single tokenizer that will work in every scenario. • In reality, it is common to pair a standard tokenization technique with a lookup table to address the contractions and terms that should not be tokenized.

  33. Represent Text: Case Folding • Case folding reduces all letters to lowercase (or the opposite, if applicable). • Example: “Text analysis sometimes called text analytics” becomes “text analysis sometimes called text analytics”. • One needs to be cautious applying case folding to tasks such as information extraction, sentiment analysis, and machine translation. • Example: “General Motors” becomes “general” and “motors”; “WHO” (World Health Organization) becomes “who”. • If case folding must be present, one way to reduce such problems is to create a lookup table of words not to be case folded.
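A minimal sketch of case folding with a lookup table of exceptions; the exception list here is hypothetical.

```python
# Words that should keep their original case (hypothetical exception list).
DO_NOT_FOLD = {"WHO", "ACME"}

def case_fold(token):
    return token if token in DO_NOT_FOLD else token.lower()

tokens = ["Text", "analysis", "by", "ACME", "and", "WHO"]
print([case_fold(t) for t in tokens])
# ['text', 'analysis', 'by', 'ACME', 'and', 'WHO']
```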

  34. Represent Text: Bag-of-Words • After normalizing the text by tokenization and case folding, it needs to be represented in a more structured way. • A simple yet widely used approach to represent text is called bag-of-words. • Bag-of-words represents a document as a set of terms, ignoring word order, context, inferences, and discourse, and additionally assumes every term in the document is independent. • With bag-of-words, many texts with different meanings are combined into one form. Example: “a dog bites a man” and “a man bites a dog”.
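The dog/man example can be checked directly: because bag-of-words keeps only term counts, both sentences collapse to the same representation.

```python
from collections import Counter

bag1 = Counter("a dog bites a man".split())
bag2 = Counter("a man bites a dog".split())
print(bag1 == bag2)   # True: word order is discarded, so the bags are identical
```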

  35. Bag-of-Words • Using single words as identifiers with the bag-of-words representation, the term frequency (TF) of each word can be calculated. • Term frequency represents the weight of each term in a document, and it is proportional to the number of occurrences of the term in that document. • Morphological features may need to be included. • Example: root words, affixes, part-of-speech tags, named entities, or intonation (variations of spoken pitch).

  36. Topic modeling • Sometimes creating features is a text analysis task in itself. • Topic modeling provides a way to quickly analyze large volumes of raw text and identify the latent topics. • Topic modeling may not require the documents to be labeled or annotated; it can discover topics directly from an analysis of the raw text. • It is important not only to create a representation of a document but also to create a representation of a corpus.

  37. Corpus • A corpus is a collection of documents. • A corpus could be so large that it includes all the documents in one or more languages, or it could be smaller. • For a web search engine, the entire World Wide Web is the relevant corpus. • The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. • It includes text from around 500 sources, and the sources have been categorized into 15 genres, such as news, editorial, fiction, and so on.

  38. Categories of the Brown Corpus

  39. Term Frequency–Inverse Document Frequency (TFIDF) • TFIDF is a measure widely used in information retrieval and text analysis. • Instead of using a traditional corpus as a knowledge base, TFIDF works directly on top of the fetched documents and treats these documents as the “corpus.” • TFIDF is robust and efficient on dynamic content, because document changes require only an update of the frequency counts.

  40. TFIDF • Given a term t and a document d = {t1, t2, t3, …, tn} containing n terms, the simplest form of the term frequency of t in d can be defined as the number of times t appears in d. • Example: consider a bag-of-words vector space of 10 words: i, love, acme, my, bebook, bphone, fantastic, slow, terrible, and terrific.
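A minimal sketch of a raw term frequency vector over the 10-word vocabulary above; the review text itself is hypothetical.

```python
vocabulary = ["i", "love", "acme", "my", "bebook", "bphone",
              "fantastic", "slow", "terrible", "terrific"]

# Hypothetical tokenized, case-folded review.
review = "i love love my bphone my bphone is fantastic".split()

tf_vector = [review.count(term) for term in vocabulary]
print(dict(zip(vocabulary, tf_vector)))
# {'i': 1, 'love': 2, 'acme': 0, 'my': 2, 'bebook': 0, 'bphone': 2,
#  'fantastic': 1, 'slow': 0, 'terrible': 0, 'terrific': 0}
```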

  41. Sample Term Frequency Vector

  42. A Sample Term Frequency Vector

  43. A Sample Term Frequency Vector • The term frequency function can be logarithmically scaled. • Similarly, the logarithm can be applied to word frequencies whose distribution also contains a long tail. • Because longer documents contain more terms, they tend to have higher term frequency values and more distinct terms. • These factors can conspire to raise the term frequency values of longer documents.
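One common log-scaled variant, consistent with the description above (the course may use a slightly different form), where f(t, d) is the number of occurrences of t in d:

```latex
\mathrm{TF}_{\text{scaled}}(t, d) = \log\bigl(1 + f(t, d)\bigr)
```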

  44. Term Frequency Normalization • The term frequency can be normalized. • A term frequency vector can become very high dimensional because of the bag-of-words representation. • The high dimensionality makes it difficult to store and parse the text and contributes to performance issues related to text analysis. • Removing stop words such as a, the, to, for, and of helps reduce the dimensionality.

  45. IDF • Besides stop words, words that are more general in meaning tend to appear more often, thus having higher term frequencies. • An additional variable is needed to reduce the effect of the term frequency as the term appears in more documents. • Consider the terms with the highest corpus-wide term frequencies (TF), the highest document frequencies (DF), and the highest inverse document frequencies (IDF).

  46. Example

  47. Example • The inverse document frequency of a term t is obtained by dividing N (the number of documents in the corpus) by the document frequency of the term and then taking the logarithm of that quotient.
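In symbols, with N the number of documents in the corpus and DF(t) the number of documents containing the term t:

```latex
\mathrm{IDF}(t) = \log \frac{N}{\mathrm{DF}(t)}
```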

  48. TFIDF • The TFIDF of a term t in a document d is defined as the term frequency of t in d multiplied by the inverse document frequency of t in the corpus.
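A minimal sketch of this definition over a tiny hypothetical corpus of tokenized reviews (real corpora are far larger, and many libraries add smoothing to the IDF term):

```python
import math

corpus = [
    "i love my bphone".split(),
    "my bebook is slow".split(),
    "the bphone is fantastic".split(),
]
N = len(corpus)

def tfidf(term, doc):
    tf = doc.count(term)                          # term frequency in the document
    df = sum(1 for d in corpus if term in d)      # document frequency in the corpus
    return tf * math.log(N / df) if df else 0.0   # TF x IDF

print(tfidf("bphone", corpus[0]))   # in 2 of 3 documents -> modest weight (~0.405)
print(tfidf("bebook", corpus[0]))   # absent from this document -> 0.0
```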

  49. Categorize Documents by Topics • With the reviews collected and represented, the data science team at ACME wants to categorize the reviews by topics. • Example reviews: • “The bPhone5x has coverage everywhere. It’s much less flaky than my old bPhone4G.” • “While I love ACME’s bPhone series, I’ve been quite disappointed by the bEbook. The text is illegible, and it makes even my old NBook look blazingly fast.”

  50. Topic modeling • Topic modeling provides tools to automatically organize, search, understand, and summarize information. • Topic models are statistical models that examine words from a set of documents and determine the themes over the text. • The process of topic modeling can be simplified to the following: • 1. Uncover the hidden topical patterns within a corpus. • 2. Annotate documents according to these topics. • 3. Use the annotations to organize, search, and summarize texts.
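A minimal sketch of this three-step workflow using Latent Dirichlet Allocation (LDA) from scikit-learn; the slides do not prescribe a specific library or algorithm, and the four reviews here are hypothetical and far too few to yield meaningful topics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "the bphone battery life is terrible",
    "bphone coverage is fantastic everywhere",
    "the bebook screen text is illegible",
    "my bebook is slow compared to the nbook",
]

# 1. Build a bag-of-words document-term matrix.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(reviews)

# 2. Uncover hidden topical patterns with LDA.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# 3. Annotate: show the top words per discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {idx}: {top}")
```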
