Lecture 19 Lexical networks

Slides modified from Dragomir R. Radev.

  1. Lecture 19 Lexical networks Slides modified from Dragomir R. Radev

  2. Social data
     • Blog postings
     • News stories
     • Speeches in Congress
     • Query logs
     • Movie and book reviews
     • Scientific papers
     • Financial reports
     • Encyclopedia entries
     • Email
     • Chat room discussions
     • Social networking sites
     What do all of these have in common?

  3. Natural language processing
     • Part of speech tagging
     • Prepositional phrase attachment
     • Parsing
     • Word sense disambiguation
     • Document indexing
     • Text summarization
     • Machine translation
     • Question answering
     • Information retrieval
     • Social network extraction
     • Topic modeling

  4. Talk outline
     • Lexical networks
     • Semantic networks
     • Lexical centrality
     • Latent networks
     • Conclusion

  5. Lexical networks

  6. Lexical networks
     • A special case of networks in which nodes are words or documents and edges link semantically related nodes
     • Other examples:
       • Words used in dictionary definitions
       • Names of people mentioned in the same story
       • Words that translate to the same word
     • A semantic network consists of a set of nodes that are connected by labeled arcs.
       • The nodes represent concepts.
       • The arcs represent relations between concepts.

  7. Semantic network

  8. Free word associations
     Source: M. Steyvers, J. B. Tenenbaum (2005). The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science, 29(1).

  9. Dependency network
     [Figure: dependency network over the words bought, Meredith, yesterday, apples, green]

  10. Dependency network

  11. Semantic Networks

  12. So again… A Semantic Network is…
      • A semantic (or associative) network is a simple representation scheme that uses a graph of labeled nodes and labeled, directed arcs to encode knowledge.
        • Labeled nodes: objects/classes/concepts
        • Labeled links: relations/associations between nodes
        • Labels define the semantics of nodes and links
      • Usually used to represent static, taxonomic concept dictionaries

  13. Nodes and Arcs
      • Nodes denote objects/classes.
      • Arcs define binary relationships between objects.
      [Figure: network over the nodes John, Sue, Max, 5, and 34 with arcs labeled mother, father, wife, husband, and age]
      mother(john,sue)
      age(john,5)
      wife(sue,max)
      age(sue,34)
      ...
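The relation(node, value) facts on this slide can be stored directly as labeled arcs. The following is a minimal sketch, not the lecture's implementation: the class name `SemanticNetwork` and its methods are hypothetical, and it assumes each node has at most one outgoing arc per label.

```python
from collections import defaultdict

# Facts copied from the slide, written as (relation, node, value).
FACTS = [
    ("mother", "john", "sue"),
    ("age", "john", 5),
    ("wife", "sue", "max"),
    ("age", "sue", 34),
]


class SemanticNetwork:
    """Hypothetical helper: nodes joined by labeled, directed arcs."""

    def __init__(self, facts=()):
        # node -> {arc label -> target node or value}
        self._arcs = defaultdict(dict)
        for relation, node, value in facts:
            self.add(relation, node, value)

    def add(self, relation, node, value):
        # Assumption: one arc per (node, label); a later fact overwrites.
        self._arcs[node][relation] = value

    def get(self, relation, node):
        """Follow the arc labeled `relation` from `node`, or return None."""
        return self._arcs.get(node, {}).get(relation)

    def arcs_from(self, node):
        """Return a copy of all labeled arcs leaving `node`."""
        return dict(self._arcs.get(node, {}))


net = SemanticNetwork(FACTS)
# Chain two arcs: the age of John's mother.
mothers_age = net.get("age", net.get("mother", "john"))
```

Chaining `get` calls is the code analogue of following a path of arcs through the graph.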

  14. Common Semantic Relations
      • There is no standard set of relations for semantic networks, but the following relations are very common:
      • INSTANCE: X is an INSTANCE of Y if X is a specific example of the general concept Y.
        • Example: Elvis is an INSTANCE of Human
      • ISA: X ISA Y if X is a subset of the more general concept Y.
        • Example: sparrow ISA bird
      • HASPART: X HASPART Y if the concept Y is a part of the concept X.
        • Or this can be any other property
        • Example: sparrow HASPART tail

  15. ISA hierarchy
      [Figure: nodes Animal, Bird, Wings, Robin, Rusty, and Red connected by isa and hasPart arcs]
      • The ISA (is a) or AKO (a kind of) relation is often used to link a class and its superclass,
      • and sometimes an instance and its class.
      • Some links (e.g. has-part) are inherited along ISA paths.
      • The semantics of a semantic net can be relatively informal or very formal
        • often defined at the implementation level
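Inheritance along ISA paths can be sketched as a walk up the hierarchy that collects has-part links on the way. The arcs below are an assumption reconstructed from the node and arc labels of the figure (Bird isa Animal, Bird hasPart Wings, Robin isa Bird, Rusty and Red isa Robin); the function names are hypothetical.

```python
# Assumed arcs, reconstructed from the figure labels (not given as text).
ISA = {"Bird": "Animal", "Robin": "Bird", "Rusty": "Robin", "Red": "Robin"}
HAS_PART = {"Bird": ["Wings"]}


def ancestors(node):
    """Follow isa arcs upward and return the chain of superclasses."""
    chain = []
    while node in ISA:
        node = ISA[node]
        chain.append(node)
    return chain


def inherited_parts(node):
    """Collect has-part values from the node and everything it ISA."""
    parts = []
    for concept in [node] + ancestors(node):
        parts.extend(HAS_PART.get(concept, []))
    return parts


rusty_chain = ancestors("Rusty")
red_parts = inherited_parts("Red")
```

Under these assumed arcs, Red has no has-part arc of its own and picks up Wings only through Robin and Bird, while Animal, which sits above Bird, inherits nothing.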

  16. Inference by association
      [Figure: nodes Machine, Animal, Wings, Bird, Airplane, Fly, Robin, Boeing 747, Rusty, Red, Air Force One, George, and Bob connected by isa, has-part, can-do, owner, and passenger arcs]
      • Red (a robin) is related to Air Force One by association: directed paths originating from these two nodes join at the nodes Wings and Fly.
      • Bob and George are not related: no paths originating from them join in this network.
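The association test on this slide amounts to asking whether the sets of nodes reachable from two start nodes intersect. A minimal sketch follows; the arc list is an assumption reconstructed from the figure labels, arc labels are dropped for brevity, and the function names are hypothetical.

```python
# Assumed directed arcs (labels omitted), reconstructed from the figure.
ARCS = {
    "Red": ["Robin"],
    "Rusty": ["Robin"],
    "Robin": ["Bird"],
    "Bird": ["Animal", "Wings", "Fly"],
    "Air Force One": ["Boeing 747", "George", "Bob"],
    "Boeing 747": ["Airplane"],
    "Airplane": ["Machine", "Wings", "Fly"],
}


def reachable(start):
    """Return every node that a directed path from `start` can reach."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for target in ARCS.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def meeting_points(a, b):
    """Nodes where directed paths from `a` and from `b` join."""
    return reachable(a) & reachable(b)


red_and_plane = meeting_points("Red", "Air Force One")
bob_and_george = meeting_points("Bob", "George")
```

With these arcs, Bob and George have no outgoing arcs at all, so their reachable sets are empty and cannot intersect.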

  17. Frames – A Semantic Network with properties
      • A frame represents an entity as a set of slots (attributes) and associated values.
        • Frames look and act like objects in C++
        • A more robust/compact version of a semantic network
      • Each slot may have constraints that describe the legal values the slot can take.
      • A frame can represent a specific entity or a general concept.
      • Frames are implicitly associated with one another because the value of a slot can be another frame.

  18. Semantic Networks
      • Rules are appropriate for some types of knowledge,
        • but do not easily map to others.
      • Semantic nets can easily represent inheritance and exceptions,
        • but are not well suited for representing negation, disjunction, preferences, conditionals, and cause/effect relationships.
      • Frames allow arbitrary functions (demons) and typed inheritance.
        • Implementation is a bit more cumbersome.
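The frame idea (slots, per-slot constraints, and slot values that are themselves frames) can be sketched with a small class. This is an illustration under assumptions: the `Frame` class, its `parent` link, and the age constraint are all hypothetical, and the names Sue and John are borrowed from the earlier nodes-and-arcs example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Frame:
    """Hypothetical frame: named slots plus constraints on legal values."""

    name: str
    parent: Optional["Frame"] = None
    slots: Dict[str, Any] = field(default_factory=dict)
    constraints: Dict[str, Callable[[Any], bool]] = field(default_factory=dict)

    def _lookup(self, table_name, key):
        # Search this frame first, then its parents (inheritance).
        frame = self
        while frame is not None:
            table = getattr(frame, table_name)
            if key in table:
                return table[key]
            frame = frame.parent
        return None

    def set(self, slot, value):
        check = self._lookup("constraints", slot)
        if check is not None and not check(value):
            raise ValueError(f"illegal value for slot {slot!r}: {value!r}")
        self.slots[slot] = value

    def get(self, slot):
        return self._lookup("slots", slot)


# A general concept with a constraint on its "age" slot.
person = Frame("Person", constraints={"age": lambda v: isinstance(v, int) and v >= 0})

# Specific entities inherit the constraint from the general frame.
sue = Frame("Sue", parent=person)
sue.set("age", 34)
john = Frame("John", parent=person)
john.set("age", 5)
john.set("mother", sue)  # a slot whose value is another frame
```

Because the `mother` slot holds a frame, `john.get("mother").get("age")` follows the implicit association between the two frames.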

  19. Lexical Centrality

  20. LexRank – Centrality in Text Graphs
      Vertices: units of text (sentences or documents)
      Edges: pairwise similarity between text units

  21. LexRank – Centrality in Text Graphs
      Intuition:
      • The LexRank score is propagated through edges.
      • Central vertices are those that are similar to other central vertices.

  22. LexRank – Centrality in Text Graphs
      Recurrence relation
      [Figure: example graph around a vertex s with edge weights 0.3, 0.1, 0.9, 0.3, 0.5, 0.8, 0.2, 0.4, 0.2]
      A solution can be guaranteed by allowing a “jump” probability d/N.
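A minimal sketch of the score propagation, assuming the recurrence has the form p(u) = d/N + (1 - d) Σ_v [w(v,u) / Σ_z w(v,z)] p(v), which is consistent with the d/N jump term above. The function name, the default d = 0.15, and the toy similarity matrix are assumptions for illustration; the sketch also assumes every row of the matrix has a non-zero sum.

```python
def lexrank(weights, d=0.15, tol=1e-10, max_iter=1000):
    """Power iteration for the assumed LexRank recurrence.

    weights[v][u] is the similarity between text units v and u.
    """
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        updated = []
        for u in range(n):
            # Each neighbor v passes on its score in proportion to w(v, u).
            incoming = sum(
                weights[v][u] / row_sums[v] * scores[v] for v in range(n)
            )
            updated.append(d / n + (1 - d) * incoming)
        change = max(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if change < tol:
            break
    return scores


# Toy symmetric similarity matrix for three sentences (made-up values).
similarity = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
scores = lexrank(similarity)
most_central = scores.index(max(scores))
```

The jump term keeps the underlying random walk irreducible and aperiodic, which is why a unique solution exists even when the similarity graph is disconnected.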

  23. http://tangra.si.umich.edu/clair/lexrank/

  24. NLP and network analysis

  25. • Part of speech tagging [Biemann 2006]
          ... , sagte der Sprecher bei der Sitzung .
          ... , rief der Vorsitzende in der Sitzung .
          ... , warf in die Tasche aus der Ecke .
          C1: sagte, warf, rief
          C2: Sprecher, Vorsitzende, Tasche
          C3: in
          C4: der, die
      • Word sense disambiguation [Mihalcea et al 2004]
      • Document indexing [Mihalcea et al 2004]
      • Subjectivity analysis [Pang and Lee 2004]
      • Semantic class induction [Widdows and Dorow 2002]
      • Passage retrieval [Otterbacher, Erkan, Radev 2005]
        [Figure: query Q linked to passages by relevance; passages linked to each other by inter-similarity]

  26. MavenRank – Centrality in Speech Graphs
      Vertices: speech transcripts from a given topic
      Edges: tf-idf cosine similarity (with threshold)
      Hypothesis: key speakers will have speeches with high centrality.
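Building such a graph can be sketched in a few lines: weight terms by tf-idf, compare transcripts by cosine similarity, and keep only the edges above a threshold. The whitespace tokenizer, the raw-count tf with log(N/df) idf, the threshold value, and the toy transcripts are all assumptions for illustration, not the MavenRank settings.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_graph(docs, threshold):
    """Return {(i, j): similarity} for document pairs at or above threshold."""
    vectors = tfidf_vectors(docs)
    edges = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                edges[(i, j)] = sim
    return edges


# Made-up transcripts; only the first two share vocabulary.
speeches = [
    "tax reform bill vote",
    "tax reform budget debate",
    "health care insurance coverage",
]
edges = similarity_graph(speeches, threshold=0.05)
```

The threshold controls graph density: a low value keeps weak links between speeches, while a high value leaves only near-duplicates connected.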

  27. MavenRank: Example
      Speech scores: 1: 0.13, 2: 0.13, 3: 0.10, 4: 0.19, 5: 0.10, 6: 0.14, 7: 0.08, 8: 0.13
      Speaker scores (mean speech score): 1: 0.12, 2: 0.15, 3: 0.12
      [Figure: graph of speeches 1 to 8 grouped into Speaker 1, Speaker 2, and Speaker 3 speeches]

  28. GIN: Gene Interaction Network
      Motivation:
      • Biomedical literature is growing rapidly. Manually curated databases cover a small portion of the available information
      • Most protein interaction information is uncovered in biomedical articles
      Approach: text mining and network analysis for
      • Automatic extraction of molecule interactions
      • Automatic article summarization
      • Interaction and citation networks
      • Inferring gene-disease associations

  29. Feature Extraction from Dependency Trees
      “The results demonstrated that KaiC interacts rhythmically with KaiA, KaiB, and SasA.”
      Path1: KaiC – nsubj – interacts – obj – SasA
      Path2: KaiC – nsubj – interacts – obj – SasA – conj_and – KaiA
      Path3: KaiC – nsubj – interacts – obj – SasA – conj_and – KaiB
      Path4: SasA – conj_and – KaiA
      Path5: SasA – conj_and – KaiB
      Path6: KaiA – prep_with – SasA – conj_and – KaiB
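Paths like these can be read off a dependency tree with a breadth-first search between two protein mentions. The sketch below is an assumption-laden illustration: the edge list covers only the arcs implied by Path1 through Path5, and the function names are hypothetical.

```python
from collections import deque

# (head, label, dependent) arcs implied by Path1 to Path5 on the slide.
EDGES = [
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "obj", "SasA"),
    ("SasA", "conj_and", "KaiA"),
    ("SasA", "conj_and", "KaiB"),
]


def build_adjacency(edges):
    """Treat the tree as undirected so paths can go up and down."""
    adjacency = {}
    for head, label, dependent in edges:
        adjacency.setdefault(head, []).append((label, dependent))
        adjacency.setdefault(dependent, []).append((label, head))
    return adjacency


def dependency_path(adjacency, source, target):
    """Return [node, label, node, ...] for the shortest path, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]  # paths alternate node, label, node, ...
        if node == target:
            return path
        for label, neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [label, neighbor])
    return None


adjacency = build_adjacency(EDGES)
path_kaic_kaia = dependency_path(adjacency, "KaiC", "KaiA")
```

Joining the returned list with a separator gives a string feature of the same shape as the paths listed on the slide.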

  30. Inferring Genes Related to Prostate Cancer
      • Hypothesis:
        • Genes that interact with many genes known to be related to prostate cancer are likely to be related to prostate cancer
      • Approach:
        • Automatically extract from the literature the interaction network of genes known to be related to prostate cancer (seed genes)
        • Infer new genes related to prostate cancer from the network topology
        • Use eigenvector centrality to rank gene-prostate cancer associations
      • Hypothesis restatement:
        • Genes central in the constructed network are most probably related to prostate cancer.

  31. Approach
      • Corpus:
        • PMCOA (PubMed Central Open Access) – full text articles
        • Articles in PMCOA split into sentences, and sentences tagged with GeniaTagger
      • Compile seed list of genes known to be related to prostate cancer
        • 20 genes compiled from the OMIM (Online Mendelian Inheritance in Man) database
        • Extend seed gene list with synonyms from the HGNC (HUGO Gene Nomenclature Committee) database
      • Use the automatic interaction extraction pipeline to extract the interaction network of the seed genes and their neighbors (genes interacting with the seed genes).
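The ranking step can be sketched with a plain power iteration for eigenvector centrality. Everything concrete here is hypothetical: the node names, the toy interaction network, and the iteration count are made up for illustration and are not taken from the extracted gene network.

```python
import math


def eigenvector_centrality(adjacency, iterations=200):
    """Power iteration on an undirected graph given as {node: set of neighbors}."""
    scores = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        # Adding the node's own score iterates on A + I, which has the same
        # eigenvectors as A but does not oscillate on bipartite graphs.
        updated = {
            node: scores[node] + sum(scores[nbr] for nbr in adjacency[node])
            for node in adjacency
        }
        norm = math.sqrt(sum(value * value for value in updated.values()))
        scores = {node: value / norm for node, value in updated.items()}
    return scores


# Hypothetical toy network: three seed genes and two candidate genes.
network = {
    "seed_1": {"seed_2", "candidate_1", "candidate_2"},
    "seed_2": {"seed_1", "candidate_1"},
    "seed_3": {"candidate_1"},
    "candidate_1": {"seed_1", "seed_2", "seed_3"},
    "candidate_2": {"seed_1"},
}
scores = eigenvector_centrality(network)
candidates = ["candidate_1", "candidate_2"]
ranking = sorted(candidates, key=scores.get, reverse=True)
```

In this toy network the candidate that interacts with all three seeds outranks the one that interacts with a single seed, which is the behavior the hypothesis relies on.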

  32. Seed Genes • 20 genes that are reported in OMIM to be related to prostate cancer

  33. Interactions of the seed genes (gene names normalized to their HGNC symbols)

  34. Sample Extracted Interaction Sentences
      • A study by Jin et al. [20] indicated that the association of Tax with hsMAD1, a mitotic spindle checkpoint (MSC) protein, led to the translocation of both MAD1 and MAD2 to the cytoplasm.
      • PTEN is transcriptionally regulated by transcription factors such as p53, Egr-1, NFκB and SMADs, while protein levels and activity are modulated by phosphorylation, oxidation, subcellular localisation, phospholipid binding and protein stability [29].
      • Interestingly, one of these, HPC1, is linked to RNASEL [10,11].
      • In response to DNA damage, the cell-cycle checkpoint kinase CHEK2 can be activated by ATM kinase to phosphorylate p53 and BRCA1, which are involved in cell-cycle control, apoptosis, and DNA repair [1,2].
      • The interactions of RAD51 with TP53, RPA and the BRC repeats of BRCA2 are relatively well understood (see Discussion).
      • The interaction of BRCA2 with HsRad51 is significantly more different to both RadA and RecA (Figure 2c).
      • Max interactor protein, MXI1 (gene L07648) competes for MAX thus negatively regulates MYC function and may play a role in insulin resistance.
      • Mad2 binds to Cdc20, an activator of the anaphase-promoting complex (APC), to inhibit APC activity and arrest cells in metaphase in response to checkpoint activation.

  35. Inferred Genes (evaluation of top-20 scoring genes)
      • 6 are seed genes; 14 genes are inferred to be related to prostate cancer
        (checked against the GeneGo Pathway database; if no evidence there, against the PubMed literature)
        • 9 genes: marked as related to prostate cancer by the GeneGo Pathway database
        • 1 gene: evidence found in PubMed that the gene is related to prostate cancer
        • 4 genes: no evidence found

  36. Other networks • Diabetes Type I • Diabetes Type II • Bipolar Disorder

  37. Properties of lexical networks

  38. Dependency network

  39. Random network

  40. Analyzing networks
      • Properties of networks
        • Clustering coefficient
          • Watts/Strogatz cc = #triangles/#triples
        • Power law coefficient α
        • Diameter (longest shortest path)
        • Average shortest path (ASP)
      • Properties of nodes
        • Centrality: degree, closeness, betweenness, eigenvector
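The network-level measures listed here can be computed on a small graph in a few lines. This is a minimal sketch under assumptions: the graph is undirected and connected, the clustering coefficient is taken as the fraction of connected triples (two edges sharing a center node) that are closed by a third edge, so each triangle is counted once per corner, and the toy graph and function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(adjacency):
    """Fraction of connected triples whose two end nodes are also linked."""
    closed = 0
    total = 0
    for center, neighbors in adjacency.items():
        for a, b in combinations(sorted(neighbors), 2):
            total += 1
            if b in adjacency[a]:
                closed += 1
    return closed / total if total else 0.0


def shortest_path_lengths(adjacency, source):
    """Breadth-first search distances from `source` to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def diameter_and_asp(adjacency):
    """Longest and average shortest path over all ordered node pairs."""
    lengths = [
        dist
        for source in adjacency
        for target, dist in shortest_path_lengths(adjacency, source).items()
        if target != source
    ]
    return max(lengths), sum(lengths) / len(lengths)


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cc = clustering_coefficient(graph)
diameter, asp = diameter_and_asp(graph)
```

Other normalizations of the clustering coefficient exist, such as averaging a per-node coefficient, and they can give different values on the same graph.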

  41. Types of networks
      • Regular networks
        • Uniform degree distribution
      • Random networks
        • Memoryless
        • Poisson degree distribution
        • Characteristic value
        • Low clustering coefficient
        • Large ASP
      • Small world networks
        • High transitivity
        • Presence of hubs (memory)
        • High clustering coefficient (e.g., 1000 times higher than random)
        • Small ASP
        • Power law degree distribution (typical value of α between 2 and 3)

  42. Comparing the dependency graph to a random (Poisson) graph

  43. Properties of lexical networks
      [Figure: thesaurus network over the words universe, letter, character, nature, world, actor]
      • Entries in a thesaurus [Motter et al. 2002]
        • c/c0 = 260 (n=30,000)
      • Co-occurrence networks [Dorogovtsev and Mendes 2001, Sole and Ferrer i Cancho 2001]
        • c/c0 = 1,000 (n=400,000)
      • Mental lexicon [Vitevitch 2005]
        • c/c0 = 278 (n=19,340)
