
Domain Adaptation for Biomedical Information Extraction

Jing Jiang, BeeSpace Seminar, Oct 17, 2007


Presentation Transcript


  1. Domain Adaptation for Biomedical Information Extraction • Jing Jiang • BeeSpace Seminar • Oct 17, 2007

  2. Outline • Why do we need domain adaptation? • Solutions: intelligent learning methods, knowledge bases, expert supervision • Connections with BeeSpace V4

  3. Why do we need domain adaptation? • Many biomedical information extraction problems (entity recognition, relation extraction, sentence categorization) are solved by supervised machine learning methods such as support vector machines (SVMs). • Supervised machine learning assumes that the training data and the test data are drawn from the same distribution.

  4. Why do we need domain adaptation? • Existing labeled training data is often limited to certain domains. • GENIA corpus → human, blood cells, transcription factors • PennBioIE → genetic variation in malignancy, cytochrome P450 inhibition • Training data for sentence categorization in the gene summarizer → fly • Even when the training data is diverse (containing multiple domains), it is still useful to customize the classifier for the particular target domain being worked on.

  5. Why do we need domain adaptation?

  6. Solutions to domain adaptation • Intelligent learning methods: instance weighting, feature selection • Knowledge bases • Expert supervision • [Slide labels: thesis research; future work; discussion]

  7. Domain adaptive learning methods • Two-stage approach • Two frameworks: instance weighting, feature selection • Use of unlabeled data

  8. Intuition • [Diagram: Source Domain, Target Domain]

  9. Goal • [Diagram: Source Domain, Target Domain]

  10. Start from the source domain • [Diagram: Source Domain, Target Domain]

  11. Focus on the common part • [Diagram: Source Domain, Target Domain]

  12. Pick up some part from the target domain • [Diagram: Source Domain, Target Domain]

  13. Formal formulation? • How to formally formulate these ideas? • [Diagram: Source Domain, Target Domain]

  14. Instance weighting • Idea: assign different weights to different instances in the objective function • [Diagram: Source Domain and Target Domain in instance space; each point represents an example]

  15. Instance weighting: Observation • [Diagram: source domain, target domain]

  16. Instance weighting: Observation • [Diagram: source domain, target domain]

  17. Instance weighting: Analysis of domain difference • p(x, y) = p(x) p(y | x) • Labeling difference: p_s(y | x) ≠ p_t(y | x), addressed by labeling adaptation • Instance difference: p_s(x) ≠ p_t(x), addressed by instance adaptation
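The factorization on slide 17 can be tied to instance weighting through the standard importance-weighting identity. The block below is a sketch, not a formula taken from the talk: L(x, y; θ) is a hypothetical loss function, and the identity assumes p_s(x, y) > 0 wherever p_t(x, y) > 0.

```latex
\begin{align*}
p(x, y) &= p(x)\, p(y \mid x) \\
\mathbb{E}_{(x, y) \sim p_t}\big[ L(x, y; \theta) \big]
  &= \mathbb{E}_{(x, y) \sim p_s}\!\left[
       \frac{p_t(x)}{p_s(x)} \cdot \frac{p_t(y \mid x)}{p_s(y \mid x)} \,
       L(x, y; \theta) \right]
\end{align*}
```

Read this way, the first ratio corrects for instance difference and the second for labeling difference, so each source example can in principle be reweighted to stand in for a target example.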

  18. Instance weighting: Three sets of instances • D_s, D_{t,l}, D_{t,u} • X  D_s + D_{t,l} + D_{t,u}?

  19. Instance weighting: Framework • Three data sets: labeled source data, labeled target data, unlabeled target data • A flexible setup covering both standard methods and new domain adaptive methods
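The framework equation itself did not survive in the transcript, so the following is only a minimal sketch of what an instance-weighted objective over the three data sets could look like. It assumes a logistic-regression loss; the names `lam_s`, `lam_tl`, `lam_tu`, the per-instance weights, and the use of pseudo-labels for the unlabeled target data are all illustrative assumptions, not details from the talk.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_log_loss(theta, examples):
    """Weighted average logistic loss.

    examples: list of (features, label, weight) with label in {0, 1}.
    An instance with weight 0 has no influence on the result.
    """
    total, weight_sum = 0.0, 0.0
    for x, y, w in examples:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def combined_objective(theta, source, target_labeled, target_pseudo,
                       lam_s=1.0, lam_tl=1.0, lam_tu=1.0):
    """Hypothetical combination of the three sets of instances.

    source         -> labeled source data
    target_labeled -> labeled target data
    target_pseudo  -> unlabeled target data with pseudo-labels (assumption)
    """
    return (lam_s * weighted_log_loss(theta, source)
            + lam_tl * weighted_log_loss(theta, target_labeled)
            + lam_tu * weighted_log_loss(theta, target_pseudo))
```

Setting `lam_tl` and `lam_tu` to 0 with uniform instance weights reduces this to ordinary supervised training on the source data, which is one sense in which such a setup can cover standard methods as well as adaptive ones.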

  20. Feature selection • Idea: identify features that behave similarly across domains • [Diagram: Source Domain and Target Domain in feature space; each point represents a feature]

  21. Feature selection: Observation • Domain-specific features: wingless, daughterless, eyeless, apexless, … (fly genes) • “suffix -less” is weighted high in the model trained from fly data • Useful for other organisms? In general, NO! • May cause generalizable features to be downweighted

  22. Feature selection: Observation • Generalizable features: generalize well in all domains • Fly: “…decapentaplegic and wingless are expressed in analogous patterns in each…” • Mouse: “…that CD38 is expressed by both neurons and glial cells…”, “…that PABPC5 is expressed in fetal brain and in a range of adult tissues.”

  23. Feature selection: Observation • Generalizable features: generalize well in all domains • Fly: “…decapentaplegic and wingless are expressed in analogous patterns in each…” • Mouse: “…that CD38 is expressed by both neurons and glial cells…”, “…that PABPC5 is expressed in fetal brain and in a range of adult tissues.” • “w_{i+2} = expressed” is generalizable
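To make the two kinds of features on slides 21 to 23 concrete, here is a small sketch of a token-level feature extractor that emits both a suffix feature and context-word features such as w[i+2]. The function name, feature-string format, and the `window` and `suffix_len` defaults are assumptions for illustration; the talk does not specify a feature template.

```python
def token_features(tokens, i, window=2, suffix_len=4):
    """Return string features for the token at position i.

    Emits a suffix feature (domain-specific in the fly example) and
    context-word features like "w[i+2]=expressed" (a candidate
    generalizable feature).
    """
    feats = []
    word = tokens[i]
    # Only emit a suffix when the word is longer than the suffix itself.
    if len(word) > suffix_len:
        feats.append("suffix=-" + word[-suffix_len:].lower())
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            feats.append("w[i%+d]=%s" % (offset, tokens[j].lower()))
    return feats
```

Applied to the slide's sentences, the token "wingless" yields both `suffix=-less` and `w[i+2]=expressed`, while "CD38" yields `w[i+2]=expressed` but no `-less` suffix feature.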

  24. Feature selection: Intuition for identification of generalizable features • [Diagram: features ranked 1 to 8, … within each source domain (fly, mouse, D3, …, DK); “-less” and “expressed” fall at different ranks in different domains, and the final combined list places “expressed” above “-less”]
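One simple way to act on this intuition is to rank features separately in each source domain and keep those that rank well everywhere. The sketch below scores each feature by its worst rank across domains. This scoring rule is an assumption chosen for illustration and is not necessarily the criterion used in the talk; the per-domain weights are assumed to come from models trained on each domain separately.

```python
def rank_features(weights):
    """Map each feature to its rank (1 = largest absolute weight)."""
    ordered = sorted(weights, key=lambda f: (-abs(weights[f]), f))
    return {f: r for r, f in enumerate(ordered, start=1)}


def generalizable_features(domain_weights, top_k):
    """Pick features that rank high in every source domain.

    domain_weights: {domain_name: {feature: weight}}
    A feature's score is its worst (largest) rank over all domains;
    a feature missing from a domain is ranked just below that
    domain's last feature.
    """
    ranks = [rank_features(w) for w in domain_weights.values()]
    all_feats = set().union(*[set(r) for r in ranks])
    worst = {f: max(r.get(f, len(r) + 1) for r in ranks) for f in all_feats}
    return sorted(worst, key=lambda f: (worst[f], f))[:top_k]
```

Under this rule a feature like the "-less" suffix, which ranks near the top only in fly, is pushed down, while a feature that is moderately strong in every domain rises to the top.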

  25. Feature selection: Framework • Matrix A is for feature selection

  26. Feature selection results on gene/protein recognition

  27. New directions to explore • Knowledge bases • Expert supervision

  28. Knowledge bases – entity recognition • Well-documented nomenclatures • Fly, Mouse, Rat • Help filter out false positives? • Help select features? • Dictionaries of entities • “Dictionary features” • Automatic summarization of nomenclatures? • Automatic identification of good features?
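As an illustration of what "dictionary features" might look like in practice, the sketch below tags each token according to whether it falls inside a longest match against a dictionary of entity names. The function name, the B-DICT/I-DICT/O tag scheme, and the `max_len` limit are assumptions; the talk does not define how such features would be computed.

```python
def dictionary_features(tokens, dictionary, max_len=3):
    """Tag tokens covered by a dictionary entry.

    dictionary: set of lowercase entity names, words separated by spaces.
    Returns one tag per token: "B-DICT" for the first token of a match,
    "I-DICT" for the rest of the match, "O" otherwise. At each position
    the longest match of up to max_len tokens wins.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(lowered[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            tags[i] = "B-DICT"
            for j in range(i + 1, i + matched):
                tags[j] = "I-DICT"
            i += matched
        else:
            i += 1
    return tags
```

The resulting tags can be fed to a learned recognizer as extra features rather than used as final decisions, which leaves room for the classifier to reject dictionary matches that are false positives in context.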

  29. Knowledge bases – sentence categorization in gene summarizer • For fly, the training sentences are automatically extracted from FlyBase. For other organisms, do we have similar resources?

  30. Expert supervision – entity recognition • Computer system selects ambiguous examples for human experts to judge. • Computer system asks human experts other questions. • Similar organisms? • Typical surface features? (e.g., cis-regulatory elements, “-RE”) • Computer system summarizes possible features from pseudo-labeled data and asks human experts for confirmation.
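The first bullet, selecting ambiguous examples for experts to judge, resembles uncertainty sampling from active learning. A minimal sketch follows, assuming a binary classifier that outputs a probability per example; the function name and the `budget` parameter are illustrative, and the talk does not commit to this particular selection rule.

```python
def select_ambiguous(probabilities, budget):
    """Pick the examples the classifier is least sure about.

    probabilities: {example_id: predicted P(positive)}
    Returns up to `budget` ids, ordered from most to least ambiguous,
    where ambiguity is distance of the prediction from 0.5.
    """
    ranked = sorted(probabilities,
                    key=lambda ex: (abs(probabilities[ex] - 0.5), ex))
    return ranked[:budget]
```

The expert's judgments on the selected examples can then be added to the labeled target data before retraining.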

  31. Connections to BeeSpace V4 • A major challenge in BeeSpace V4 is extraction of new types of entities and relations. • Exploiting knowledge bases and expert supervision is especially important. • For new types, no labeled data is available even from other domains. Use of bootstrapping methods should be explored.

  32. New entity types • Recognition of many new types will be dictionary based: organism, anatomy, biological process, etc. • Recognition of some new types will need some NER techniques: chemical, regulatory element

  33. New relation types • Bootstrapping (?) • Seed patterns from knowledge bases or human experts • Human inspection of newly discovered patterns?
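The bootstrapping loop hinted at here can be sketched as alternating between patterns and entity pairs, with a human check on each newly discovered pattern. Everything below is a simplified assumption for illustration: sentences are pre-reduced to (entity1, middle_text, entity2) triples, a pattern is just the exact text between two entities, and `approve` stands in for human inspection.

```python
def bootstrap(sentences, seed_patterns, approve=lambda pattern: True, rounds=2):
    """Grow a set of relation patterns from seeds.

    sentences: list of (entity1, middle_text, entity2) triples.
    seed_patterns: middle texts supplied by a knowledge base or an expert.
    approve: callback deciding whether a new pattern is kept
             (a stand-in for human inspection).
    Returns (patterns, pairs).
    """
    patterns = set(seed_patterns)
    pairs = set()
    for _ in range(rounds):
        # Step 1: use current patterns to extract entity pairs.
        pairs |= {(e1, e2) for e1, mid, e2 in sentences if mid in patterns}
        # Step 2: collect every context in which a known pair appears.
        candidates = {mid for e1, mid, e2 in sentences if (e1, e2) in pairs}
        # Step 3: keep only the new patterns that pass inspection.
        new = {p for p in candidates - patterns if approve(p)}
        if not new:
            break
        patterns |= new
    # Final pass so pairs reflect the last accepted patterns.
    pairs |= {(e1, e2) for e1, mid, e2 in sentences if mid in patterns}
    return patterns, pairs
```

Passing an `approve` callback that rejects everything leaves the seed patterns unchanged, which shows how the inspection step limits drift toward unrelated patterns.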

  34. The end
