
CSC 9010: Text Mining Applications



Presentation Transcript


  1. CSC 9010: Text Mining Applications Dr. Paula Matuszek Paula_A_Matuszek@glaxosmithkline.com (610) 270-6851

  2. So What Next? • Evaluating systems • Systems available • Some good resources

  3. Evaluating Text Mining Systems • There are dozens of text mining tools and systems available • commercial • open source • research • How do you decide which to use?

  4. Determine Information Need • First step: what are you trying to find out? • Locate a specific piece of information? • Locate and capture a large amount of specific information? • Locate a specific document? • Get the gist of one or more documents? • Organize documents into groups? • Find out something about the overall domain which is reflected in a set of documents? • ???

  5. Determine Environment • What operating system? • What document formats? • ASCII or something richer? • What level of software maturity? • COTS, with support available, maybe already tuned for your specific problem • Open source or other fairly stable • Research tool • What is the cost justification?

  6. Thinking About Information Needs • How specific is your need? • How much do you know already? • How big a corpus? How well-defined? • One-time question or continuing? • Incremental or episodic?

  7. Information Extraction Tools Extract specific information, probably from a large number of documents. • What's the typical precision and recall? • KB info: • What entities are already defined? • How easy is it to add enumerated lists? • How easy is it to add patterns? • What document formats does it accept? • Performance?
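
One way to check a tool's "typical precision and recall" on your own data is to compare its output against a small hand-annotated answer key. The sketch below is illustrative only: the function name `precision_recall` and the sample entity sets are assumptions made for this example, not part of any tool named in the slides.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted items against a gold answer key."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: what fraction of the extracted items are correct?
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what fraction of the correct items were extracted?
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical data: entities a tool returned vs. a hand-built answer key.
extracted = {"GATE", "AeroText", "Sheffield", "Monday"}
gold = {"GATE", "AeroText", "Sheffield", "Lockheed Martin", "IBM"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```

Here three of the four extracted items are correct (precision 0.75) and three of the five gold items were found (recall 0.60).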

  8. Document Retrieval Need a specific document or some information • For spidering: • Coverage, including kinds of documents • Performance, which affects refresh speed • Flexibility/configuration of spiders • Special needs? (focused crawling) • For retrieval: • Relevance ranking • Performance • Richness of query engine • Precision and recall • Query broadening and narrowing • For both: ease of use

  9. Document Categorization You need to sort your documents • Does the system perform in real time? • How many categories total can it handle? • How many categories/document? Flat or hierarchical? • Categories defined automatically or by hand? • Automatically: • Assumes significant vocabulary differences among different groups. • Requires training examples • By hand assumes: • Time to do it! • Readily identifiable characteristics to distinguish groups

  10. Document Clustering What is going on in this domain? • What features of the documents are used to cluster? Linguistic? Semantic? TF*IDF? • What methods are used for clustering? (How do we define "similar"?) • Any capability for incorporating domain knowledge? • Performance • Incremental? Or do you have to start over again to add new documents?
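
To make the TF*IDF and "similar" questions concrete, here is a minimal sketch of one common choice: TF*IDF term weights compared with cosine similarity. It assumes whitespace tokenization, an unsmoothed `log(N/df)` IDF, and three made-up documents. Real clustering tools may weight and compare documents quite differently.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical mini-corpus: two biology documents and one legal document.
docs = [
    "protein gene expression cell".split(),
    "gene protein sequence cell".split(),
    "patent claim filing court".split(),
]
vecs = tfidf_vectors(docs)
sim_bio = cosine(vecs[0], vecs[1])    # the two biology documents share terms
sim_cross = cosine(vecs[0], vecs[2])  # no shared terms, so similarity is 0
print(sim_bio > sim_cross)
```

A clustering method would then group documents whose pairwise similarity is high. Note that the IDF weights depend on the whole corpus, which is one reason adding new documents can force a tool to start over.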

  11. Document Summarization What do I have? • Sentence extraction or capture and generate? • How much can it be shortened? • How many documents at once? • Sentence extraction methods are heavily dependent on the method used to identify "important" words.
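
The dependence on how "important" words are identified can be seen in a small sentence-extraction sketch. The version below assumes the simplest possible definition: a word is important if it is frequent in the text and not on a stopword list. The stopword list, the `summarize` function, and the sample text are all illustrative assumptions. Changing the importance measure changes which sentences are picked.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a much fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on", "are", "was"}

def tokenize(sentence):
    """Lowercase a sentence and keep its non-stopword tokens."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Average corpus frequency of the sentence's non-stopword tokens.
        words = tokenize(sentence)
        return sum(word_freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

text = ("Text mining extracts information from text. "
        "Mining tools vary in cost. "
        "The weather was nice.")
summary = summarize(text)
print(summary)  # Text mining extracts information from text.
```

The first sentence wins because "text" and "mining" each occur twice in the sample, which raises its average word score above the other two sentences.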

  12. Grab Bag of Systems Available: Entity or Information Extraction • AeroText: Lockheed Martin • GATE: U of Sheffield • Sophia: CELI • iMiner: IBM • ClearTag: ClearForest • Thing Finder: Inxight • LexiQuest: SPSS • Faustus/TextPRO: SRI

  13. Categorization/Clustering • Semio: Entrieva • Oracle Text: Oracle • Inxight Categorizer: Inxight • Verity K2: Verity • Autonomy • ClearForest • LexiMine: SPSS • iMiner, Lotus Discovery Server: IBM

  14. Summarizing • All over the place! • Every search engine • Mac OS X 10.2 and later • Many others

  15. What's Happening • Some specific domains are very hot or interesting or intriguing • Expertise finder • Patent retrieval, visualization • Reputation Minder • Biological text mining • Semantic web • In fact, anything web-related • ??

  16. What's Happening • Some technologies are also gaining speed: • Taxonomy identification/extraction • Question answering • Automatic markup: for the semantic web, for instance • Integrated domain-based and statistical approaches • Machine learning of KBs

  17. Some Useful Resources: Links • Portal text mining links, kept reasonably up to date: • filebox.vt.edu/users/wfan/text_mining.html • www.cs.utexas.edu/users/pebronia/text-mining • A really excellent overview paper, still useful although 2001: • www.mitre.org/work/tech_papers/tech_papers_01/maybury_unstructured/maybury_unstructured.pdf • Best site to start with for software, conferences, etc: • www.kdnuggets.com/index.html

  18. Useful Resources: Conferences • AAAI and IJCAI: Basic NL research; some good workshops and tutorials on text mining. Some of everything. • KDD: Text mining often included as a form of data mining, especially more statistical approaches. KDD Cup sometimes text based. • SIGIR: Lots of information retrieval • ACL: Lots of linguistic-based info, especially things like entity recognition and tagging. • Data mining conferences: often include a text mining component. ICDM, for example. • Domain-specific conferences: often include a text mining component too.

  19. So Where Now? • You now all have a good background in the techniques and applications of text mining, and some ideas of how it's been applied. • Where do you think it will be in 10 years, and what will we be doing with it?
