
Domain Adaptation in Natural Language Processing

Jing Jiang, Department of Computer Science, University of Illinois at Urbana-Champaign.


Presentation Transcript


  1. Domain Adaptation in Natural Language Processing • Jing Jiang, Department of Computer Science, University of Illinois at Urbana-Champaign

  2. Textual Data in the Information Age • Contains much useful information • E.g. >85% of corporate data is stored as text • Hard to handle • Large amount: e.g. by 2002, 2.5 billion documents on the surface Web, +7.3 million / day • Diversity: emails, news, digital libraries, Web logs, etc. • Unstructured: vs. relational databases • How to manage textual data?

  3. Information retrieval: to rank documents based on relevance to keyword queries • Not always satisfactory • More sophisticated services desired

  4. Automatic Text Summarization

  5. Question Answering

  6. Information Extraction

  7. Beyond Information Retrieval • Automatic text summarization • Question answering • Information extraction • Sentiment analysis • Machine translation • Etc. • All rely on Natural Language Processing (NLP) techniques to deeply understand and analyze text

  8. Typical NLP Tasks • Example sentence: “Larry Page was Google’s founding CEO” • Part-of-speech tagging: Larry/noun Page/noun was/verb Google/noun ’s/possessive-end founding/adjective CEO/noun • Chunking: [NP: Larry Page] [V: was] [NP: Google ’s founding CEO] • Named entity recognition: [person: Larry Page] was [organization: Google] ’s founding CEO • Relation extraction: Founder(Larry Page, Google) • Word sense disambiguation: “Larry Page” vs. “Page 81” • State-of-the-art solution: supervised machine learning

  9. Supervised Learning for NLP • Task: part-of-speech tagging on news articles • Representative corpus (WSJ articles) → human annotation → POS-tagged WSJ articles, e.g. Larry/NNP Page/NNP was/VBD Google/NNP ’s/POS founding/ADJ CEO/NN • Training with a standard supervised learning algorithm → trained POS tagger

  10. In Reality… • Task: part-of-speech tagging on biomedical articles, e.g. We/PRP analyzed/VBD the/DT mutations/NNS of/IN the/DT H-ras/NN genes/NNS • Representative corpus (MEDLINE articles) → human annotation → POS-tagged MEDLINE articles: crossed out, because human annotation is expensive • Instead: POS-tagged WSJ articles → training with a standard supervised learning algorithm → trained POS tagger

  11. Many Other Examples • Named entity recognition: news articles → personal blogs; organism A → organism B • Spam filtering: public email collection → personal inboxes • Sentiment analysis of product reviews (positive vs. negative): movies → books; cell phones → digital cameras • What is the problem with this non-standard setting with domain difference?

  12. Domain Difference Performance Degradation • Ideal setting: POS tagger trained on MEDLINE, tested on MEDLINE: ~96% • Realistic setting: POS tagger trained on WSJ, tested on MEDLINE: ~86%

  13. Another Example • Ideal setting: gene name recognizer: 54.1% • Realistic setting: gene name recognizer: 28.1%

  14. Domain Adaptation • Source domain: labeled data • Target domain: labeled and unlabeled data • Goal: to design learning algorithms that are aware of domain difference and exploit all available data to adapt to the target domain (domain adaptive learning algorithm)

  15. With Domain Adaptation Techniques… • Standard learning: gene name recognizer trained on Fly + Mouse, tested on Yeast: 63.3% • Domain adaptive learning: gene name recognizer trained on Fly + Mouse, tested on Yeast: 75.9%

  16. Roadmap • What is domain adaptation in NLP? • Our work • Overview • Instance weighting • Feature selection • Summary and future work

  17. Overview Source Domain Target Domain

  18. Ideal Goal Source Domain Target Domain

  19. Standard Supervised Learning Source Domain Target Domain

  20. Standard Semi-Supervised Learning Source Domain Target Domain

  21. Idea 1: Generalization Source Domain Target Domain

  22. Idea 2: Adaptation Source Domain Target Domain

  23. Source Domain Target Domain How to formally formulate the ideas?

  24. Instance Weighting • Instance space (each point represents an observed instance) • Source Domain, Target Domain • Goal: to find appropriate weights for different instances

  25. Feature Selection • Feature space (each point represents a useful feature) • Source Domain, Target Domain • Goal: to separate generalizable features from domain-specific features

  26. Roadmap • What is domain adaptation in NLP? • Our work • Overview • Instance weighting • Feature selection • Summary and future work

  27. Observation source domain target domain

  28. Observation source domain target domain

  29. Analysis of Domain Difference • x: observed instance; y: class label (to be predicted) • p(x, y) = p(x) p(y | x) • Labeling difference: ps(y | x) ≠ pt(y | x) → labeling adaptation • Instance difference: ps(x) ≠ pt(x) → instance adaptation

  30. Labeling Adaptation • Source domain vs. target domain • Where pt(y | x) ≠ ps(y | x): remove/demote instances

  31. Labeling Adaptation • Source domain vs. target domain • Where pt(y | x) ≠ ps(y | x): remove/demote instances

  32. Instance Adaptation (pt(x) < ps(x)) • Source domain vs. target domain • Where pt(x) < ps(x): remove/demote instances

  33. Instance Adaptation (pt(x) < ps(x)) • Source domain vs. target domain • Where pt(x) < ps(x): remove/demote instances

  34. Instance Adaptation (pt(x) > ps(x)) • Source domain vs. target domain • Where pt(x) > ps(x): promote instances

  35. Instance Adaptation (pt(x) > ps(x)) • Source domain vs. target domain • Where pt(x) > ps(x): promote instances

  36. Instance Adaptation (pt(x) > ps(x)) • Source domain vs. target domain • Where pt(x) > ps(x): target domain instances are useful
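The demote/promote idea on the instance adaptation slides can be illustrated with a small sketch. It assumes a toy discrete instance space where pt(x) and ps(x) can be estimated by smoothed counting; the function name `density_ratio_weights` and the add-one smoothing are illustrative assumptions, not something the slides specify, and counting like this would not scale to high-dimensional data.

```python
from collections import Counter

def density_ratio_weights(source_xs, target_xs, smoothing=1.0):
    """Estimate pt(x) / ps(x) for every source instance by smoothed counting.

    Hypothetical helper: a weight below 1 demotes an instance that is rarer
    in the target domain, a weight above 1 promotes one that is more common.
    """
    vocab = set(source_xs) | set(target_xs)
    src_counts = Counter(source_xs)
    tgt_counts = Counter(target_xs)
    # Additive smoothing keeps both estimates strictly positive.
    src_total = len(source_xs) + smoothing * len(vocab)
    tgt_total = len(target_xs) + smoothing * len(vocab)
    weights = []
    for x in source_xs:
        p_s = (src_counts[x] + smoothing) / src_total
        p_t = (tgt_counts[x] + smoothing) / tgt_total
        weights.append(p_t / p_s)
    return weights

# Toy data: "stock" is typical of the source, "gene" of the target.
source = ["the", "the", "the", "stock", "stock", "gene"]
target = ["the", "the", "gene", "gene", "gene", "mutation"]
weights = density_ratio_weights(source, target)
for x, w in zip(source, weights):
    print(f"{x:6s} weight={w:.2f}")
```

In this toy run the source instance "gene" ends up with a weight above 1 and "stock" with the smallest weight, which mirrors the promote and demote cases on the slides.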

  37. Empirical Risk Minimization with Three Sets of Instances • Instance sets: Ds, Dt,l, Dt,u • Loss function → expected loss → optimal classification model • Use empirical loss to replace expected loss
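The formulas on this slide did not survive the transcript. As a hedged reconstruction in standard notation (the symbols $\theta$, $L$, and $N$ are assumptions, not taken from the slide), empirical risk minimization replaces the expected loss under the true distribution with an average over the $N$ observed instances:

```latex
\theta^{*} = \arg\min_{\theta} \sum_{(x,y)} p(x,y)\, L(x,y,\theta)
\qquad\longrightarrow\qquad
\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L(x_i, y_i, \theta)
```

In domain adaptation the distribution of interest is the target one, $p_t(x,y)$, while the observed instances come from the three sets Ds, Dt,l, and Dt,u.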

  38. Using Ds • X ∈ Ds • Instance difference (hard for high-dimensional data) • Labeling difference (need labeled target data)
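Why both differences matter when learning from Ds can be seen from a standard importance-weighting identity. The sketch below assumes a discrete instance space and that $p_s(x,y) > 0$ wherever $p_t(x,y) > 0$; the names $\alpha_i$, $\beta_i$, and $N_s$ are introduced here for illustration, and $p_s$, $p_t$ correspond to ps and pt on the slides.

```latex
\sum_{(x,y)} p_t(x,y)\, L(x,y,\theta)
  = \sum_{(x,y)} p_s(x,y)\,
    \frac{p_t(x)}{p_s(x)}\,
    \frac{p_t(y \mid x)}{p_s(y \mid x)}\,
    L(x,y,\theta)
  \approx \frac{1}{N_s} \sum_{i=1}^{N_s} \alpha_i\, \beta_i\, L(x_i, y_i, \theta),
\qquad
\alpha_i = \frac{p_t(y_i \mid x_i)}{p_s(y_i \mid x_i)},
\quad
\beta_i = \frac{p_t(x_i)}{p_s(x_i)}
```

Here $\beta_i$ corrects for the instance difference and $\alpha_i$ for the labeling difference. Estimating $\beta_i$ requires density estimates, which is hard for high-dimensional data, and estimating $\alpha_i$ requires labeled target data, matching the two caveats on the slide.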

  39. Using Dt,l • X ∈ Dt,l • Small sample size: estimation not accurate

  40. Using Dt,u • X ∈ Dt,u • Use predicted labels (bootstrapping)

  41. Combined Framework • A flexible setup covering both standard methods and new domain adaptive methods
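The formula for the combined framework is not preserved in the transcript. The following is only a sketch of what such a flexible setup could look like: a weighted sum of average log-likelihoods over Ds, Dt,l, and Dt,u. The function `combined_objective`, the trade-off parameters `lam_s`, `lam_tl`, `lam_tu`, and the `prob_fn` model interface are all hypothetical names chosen for illustration.

```python
import math

def average(values):
    """Mean of a list, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def combined_objective(prob_fn, source, target_labeled, target_unlabeled,
                       lam_s=1.0, lam_tl=0.0, lam_tu=0.0):
    """Weighted sum of average log-likelihoods over the three instance sets.

    prob_fn(x) -> {label: probability}   (hypothetical model interface)
    source           : list of (x, y, weight), weight = per-instance weight
    target_labeled   : list of (x, y)
    target_unlabeled : list of (x, {label: soft weight}) using predicted labels
    """
    s_term = average([w * math.log(prob_fn(x)[y]) for x, y, w in source])
    tl_term = average([math.log(prob_fn(x)[y]) for x, y in target_labeled])
    tu_term = average([
        sum(g * math.log(prob_fn(x)[y]) for y, g in soft.items())
        for x, soft in target_unlabeled
    ])
    return lam_s * s_term + lam_tl * tl_term + lam_tu * tu_term

def toy_model(x):
    """Stand-in classifier: fixed probabilities depending on the sign of x."""
    p_pos = 0.8 if x > 0 else 0.3
    return {"pos": p_pos, "neg": 1.0 - p_pos}

source = [(1.0, "pos", 1.0), (-1.0, "neg", 1.0)]
target_labeled = [(2.0, "pos")]
target_unlabeled = [(-2.0, {"neg": 1.0})]

# Defaults (unit source weights, no target terms): standard supervised learning.
standard = combined_objective(toy_model, source, target_labeled, target_unlabeled)
# Non-zero target terms: one possible domain adaptive setting.
adaptive = combined_objective(toy_model, source, target_labeled, target_unlabeled,
                              lam_s=0.5, lam_tl=1.0, lam_tu=0.5)
print(standard, adaptive)
```

With the default parameters the objective reduces to ordinary supervised learning on the source data, which is one way a single setup can cover both standard methods and domain adaptive ones.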

  42. Experiments • NLP tasks • POS tagging: WSJ (Penn TreeBank) → Oncology (biomedical) text (Penn BioIE) • NE type classification: newswire → conversational telephone speech (CTS) and web-log (WL) (ACE 2005) • Spam filtering: public email collection → personal inboxes (u01, u02, u03) (ECML/PKDD 2006) • Three heuristics to partially explore the parameter settings

  43. Instance Pruning: removing “misleading” instances from Ds • Results shown for POS, NE Type, and Spam • Useful in most cases; failed in some cases • When is it guaranteed to work? (future work)
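The slide does not say how “misleading” instances are identified. One plausible heuristic, sketched here purely as an assumption, is to build a classifier from the small labeled target set Dt,l and drop source instances whose labels it contradicts. The 1-nearest-neighbor classifier and the function names below are illustrative stand-ins, not the method used in the experiments.

```python
def nearest_neighbor_label(x, labeled):
    """Label of the labeled point closest to x (squared Euclidean distance)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(labeled, key=lambda pair: dist(x, pair[0]))[1]

def prune_source(source, target_labeled):
    """Keep only source instances whose label agrees with the target-trained model."""
    return [(x, y) for x, y in source
            if nearest_neighbor_label(x, target_labeled) == y]

# Toy data: feature vectors are 2-d tuples, labels are strings.
target_labeled = [((0.0, 0.0), "neg"), ((1.0, 1.0), "pos")]
source = [
    ((0.1, 0.2), "neg"),
    ((0.9, 0.8), "pos"),
    ((0.95, 1.0), "neg"),  # labeled differently from what the target data suggests
]
kept = prune_source(source, target_labeled)
print(kept)
```

Because Dt,l is small, such a classifier can itself be unreliable, which is one possible reason pruning helps in most cases but not all.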

  44. Dt,l with Larger Weights • Results shown for POS, NE Type, and Spam • Dt,l is very useful • Promoting Dt,l is more useful

  45. Bootstrapping with Larger Weights: until Ds and Dt,u are balanced • Results shown for POS, NE Type, and Spam • Promoting target instances is useful, even with predicted labels
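A single round of bootstrapping with balanced weights might be sketched as follows. The weighted nearest-centroid classifier on one-dimensional features and the helper names are assumptions made to keep the example self-contained; the only idea taken from the slide is that pseudo-labeled target instances are weighted up until their total weight matches that of Ds.

```python
def train_centroids(examples):
    """examples: list of (x, y, weight) with float x; returns {label: weighted mean}."""
    sums, totals = {}, {}
    for x, y, w in examples:
        sums[y] = sums.get(y, 0.0) + w * x
        totals[y] = totals.get(y, 0.0) + w
    return {y: sums[y] / totals[y] for y in sums}

def predict(centroids, x):
    """Label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def bootstrap_once(source, target_unlabeled):
    """Train on Ds, pseudo-label Dt,u, then retrain with balanced total weights."""
    base = train_centroids([(x, y, 1.0) for x, y in source])
    pseudo = [(x, predict(base, x)) for x in target_unlabeled]
    # Per-instance weight that makes the total weight of Dt,u equal that of Ds.
    w = len(source) / len(pseudo)
    combined = [(x, y, 1.0) for x, y in source] + [(x, y, w) for x, y in pseudo]
    return train_centroids(combined), w

source = [(0.0, "neg"), (1.0, "neg"), (4.0, "pos"), (5.0, "pos")]
target_unlabeled = [2.0, 6.0]
model, target_weight = bootstrap_once(source, target_unlabeled)
print(model, target_weight)
```

In this toy run the centroids shift toward the target instances, showing how promoted pseudo-labeled data can pull the model toward the target domain even though the labels are only predicted.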

  46. Roadmap • What is domain adaptation in NLP? • Our work • Overview • Instance weighting • Feature selection • Summary and future work

  47. Observation 1: Domain-specific features • wingless, daughterless, eyeless, apexless, …

  48. Observation 1: Domain-specific features • wingless, daughterless, eyeless, apexless, … • Describing phenotype in fly gene nomenclature • Feature “-less” useful for this organism • CD38, PABPC5, … • Feature still useful for other organisms? No!

  49. Observation 2: Generalizable features • “…decapentaplegic and wingless are expressed in analogous patterns in each…” • “…that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues.”

  50. Observation 2: Generalizable features • “…decapentaplegic and wingless are expressed in analogous patterns in each…” • “…that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues.” • Feature: “X be expressed”
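The contrast between the two observations can be made concrete with a toy scoring scheme. This is a sketch under stated assumptions, not the feature selection method of the talk: each feature is scored per domain by the fraction of its occurrences that fall on gene names, and a feature counts as generalizable only if that score is high in every labeled domain. The feature strings and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict

def feature_scores(instances):
    """instances: list of (set_of_features, is_gene_name).

    Returns, per feature, the fraction of its occurrences in positive instances.
    """
    pos = defaultdict(int)
    total = defaultdict(int)
    for feats, is_gene in instances:
        for f in feats:
            total[f] += 1
            pos[f] += int(is_gene)
    return {f: pos[f] / total[f] for f in total}

def generalizable_features(domains, threshold=0.5):
    """Features whose score exceeds the threshold in every domain."""
    per_domain = [feature_scores(d) for d in domains]
    all_feats = set().union(*per_domain)
    # A feature missing from a domain scores 0.0 there, so it cannot qualify.
    worst = {f: min(s.get(f, 0.0) for s in per_domain) for f in all_feats}
    return sorted(f for f, s in worst.items() if s > threshold)

fly = [
    ({"suffix=-less", "ctx=be_expressed"}, True),   # e.g. "wingless ... expressed"
    ({"suffix=-less"}, True),                       # e.g. "eyeless"
    ({"ctx=the"}, False),
]
mouse = [
    ({"has_digit", "ctx=be_expressed"}, True),      # e.g. "CD38 is expressed"
    ({"suffix=-less"}, False),                      # e.g. an ordinary "-less" word
    ({"ctx=the"}, False),
]
selected = generalizable_features([fly, mouse])
print(selected)
```

On this toy data the suffix feature scores well only in the fly domain, while the “X be expressed” context feature scores well in both, so only the latter is selected.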
