Cognitive Computation Group Resources for Semantic Similarity

http://cogcomp.cs.illinois.edu



Textual Inference

  • Given a task like Question Answering…

    • …you have a large set of documents, e.g. all articles from the New York Times for 2011 and 2012

    • …and a set of questions, e.g. “Who participated in the gubernatorial debates in January 2012?”

    • …you must return excerpts of the documents that answer the questions.

  • What are the challenges?



QA Example

Consider the following example question, and a sample document excerpt that might answer it:

Q. Where is the headquarters of the parent company of Solahart Services?

A. Aztec Solar, Inc. recently acquired Solahart Services of Stockton California. Aztec Solar, a Sacramento based residential and commercial solar company, is excited about acquiring Solahart's regional and national solar customers.



QA Example

Given the QA pair on the previous page, a human reader might make the following inference steps:

Aztec Solar, Inc. recently acquired Solahart Services → Aztec Solar, Inc. is the parent company of Solahart Services.

“Aztec Solar, Inc.” looks like a company name

Aztec Solar, a Sacramento-based residential and commercial solar company → Aztec Solar is based in Sacramento

“Aztec Solar” == “Aztec Solar, Inc.”



QA Example

  • An automated system may use a matching process like this:

  • Rewrite the question:

    The headquarters of the parent company of Solahart Services is in <LOCATION>

  • Match question entities and tokens

    LOCATION → Sacramento; company → Aztec Solar, Inc.

  • Apply structure-mapping rules

    <LOCATION>-based <COMPANY> → <COMPANY> headquarters in <LOCATION>

  • This example can be easily perturbed to be more difficult (to thwart a shallow system)



Outline

  • Introduction: Textual Inference example

  • Semantic Textual Similarity task

  • LLM: a baseline system

  • Comparators

    • Overview

    • Instances: WNSim, NESim

  • Annotators

    • POS, Chunk, NER, Coreference, SRL

  • Curator

  • Edison

    • Data structures

    • Calling Curator

    • Feature Extraction



Textual Inference: Semantic Similarity

Grand NLP challenge: work at level of meaning of text

  • Do these two sentences mean the same thing?

    1. John said he is considered a witness but not a suspect.

    2. "He is not a suspect anymore," John said.

  • If they are different, how different are they?

  • …rate similarity on a scale 0…5: 0 == different topic; 5 == paraphrase

  • http://www.cs.york.ac.uk/semeval-2012/task6/data/uploads/datasets/train-readme.txt



Examples from STS training corpus

Nationally, the federal Centers for Disease Control and Prevention recorded 4,156 cases of West Nile, including 284 deaths.

There were 293 human cases of West Nile in Indiana in 2002, including 11 deaths statewide.

Score: 1.667

Chavez said investigators feel confident they've got "at least one of the fires resolved in that regard."

Albuquerque Mayor Martin Chavez said investigators felt confident that with the arrests they had "at least one of the fires resolved."

Score: 3.800



Candidate Baseline: Lexical Level Matching (LLM)



Words Matter

John Smith bought three cakes and two oranges

John bought two oranges

John Smith bought three cakes and two oranges

John bought three oranges

Approximate similarity of meaning via lexical overlap – how many words in common

But this isn’t exactly fool-proof…
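As a concrete illustration of pure lexical overlap (a minimal sketch, not CCG code; the set-based counting and variable names are ours), the second pair above gets a perfect overlap score even though its meaning differs:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

Set<String> text = new HashSet<String>( Arrays.asList(
    "john", "smith", "bought", "three", "cakes", "and", "two", "oranges" ) );
Set<String> hypothesis = new HashSet<String>( Arrays.asList(
    "john", "bought", "three", "oranges" ) );

Set<String> common = new HashSet<String>( hypothesis );
common.retainAll( text );

// All 4 hypothesis words appear in the text, so overlap = 1.0,
// even though "three oranges" is not supported by "two oranges".
double overlap = (double) common.size() / hypothesis.size();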



LLM Scoring

Designed for Textual Entailment (inherently asymmetric)

Proportion of matched Hypothesis tokens, normalized by length of shorter text

Let T be the Text, containing tokens indexed by j

Let H be the Hypothesis, with tokens indexed by i

Let S(word1, word2) be a lexical similarity function that returns a value in the range [0,1]
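The equation on the original slide did not survive the transcript; a plausible reconstruction consistent with the definitions above (each Hypothesis token is credited with its best match in the Text, and the sum is normalized by the length of the shorter side) is:

LLM(T, H) = \frac{1}{\min(|T|,\,|H|)} \sum_{i} \max_{j} S(h_i, t_j)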



LLM code

http://cogcomp.cs.illinois.edu/page/software_view/LLM

import edu.illinois.cs.cogcomp.mrcs.comparators.LlmComparator;

String source = "Of the three kings referred to by their last names, Atawanaba was the oldest.";
String target = "Three kings were named in the lawsuit.";

LlmComparator llm = new LlmComparator( config );   // config: the comparator configuration, as in the original slide
double result = llm.compareStrings( source, target );



Can we do better?

  • Depends on the application… a more advanced task may require more sophisticated patterns to separate classes

  • Sparsity of features

    • Many words/sequences of words may not occur very often

    • This means a learned classifier may not generalize well

    • More abstract representation can help

  • Ambiguity of words – e.g. “terminal”, “moving”

    • Additional information may help

  • Meaning encoded in structure – e.g.

    “Matthew Smith, the Maverick’s last hope…”

  • NLP annotation tools generally abstract over underlying words so that features generalize better



Comparators



So you want to compare some text….

  • How similar are two words? Two strings? Two paragraphs?

    • Depends on what they are

    • String edit distance is usually a weak measure

    • … think about coreference resolution…

  • Solution: specialized metrics



WNSim

  • Generates a table mapping terms linked in the WordNet ontology

    • Synonymy, Hypernymy, Meronymy

  • Score reflects distance (up to 3 edges, undirected – e.g. via lowest common subsumer)

  • Score is symmetric



Using WNSim

  • Install and run the WNSim code

    • http://cogcomp.cs.illinois.edu/page/software_view/WNSim

    • Sets up an XML-RPC server

    • Expects an XML-RPC ‘struct’ data structure (analogous to a Dictionary)

      STRUCT { FIRST_STRING: aString; SECOND_STRING: anotherString }

    • Returns another xmlrpc data structure:

      STRUCT { SCORE: aDouble; REASON: aString }

  • USE: call and cache (reduce network latency overhead)

  • NOTE: LLM code has Java client…



WNSim via Metric interface

String metricHost = "…";
int metricPort = …;

XmlRpcMetricClient client = new XmlRpcMetricClient( "WNSim", metricHost, metricPort );

MetricResponse response = client.compareStrings( source_, target_ );
double score = response.score;
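Following the "call and cache" advice from the previous slide, a minimal memoizing wrapper might look like this (a sketch only: the key format and the HashMap are ours; client, source_ and target_ come from the snippet above):

import java.util.HashMap;
import java.util.Map;

Map<String, Double> wnsimCache = new HashMap<String, Double>();

String key = source_ + "||" + target_;   // assumes "||" never occurs in the inputs
Double cached = wnsimCache.get( key );
double score;

if ( cached != null ) {
    score = cached.doubleValue();
} else {
    MetricResponse response = client.compareStrings( source_, target_ );
    score = response.score;
    wnsimCache.put( key, score );
}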



NESim

  • Set of entity-type-specific measures

    • Acronyms, Prefix/Title rules, distance metric

  • Score reflects similarity based on type information

  • Score is asymmetric



Using NESim

  • NESim package from CCG web site

    • http://cogcomp.cs.illinois.edu/page/software_view/NESim

  • NESim can use context to help determine similarity

    • Specify token offsets of NE string to indicate context (optional)

    • Specify Type as one of PER, LOC, ORG (optional)

      [<Type>#]<original string>[#<start offset>#<end offset>]

    • Note: offsets are inclusive, token-based, and zero-indexed (see the example after this list)

  • Uses specialized resources depending on the type (if specified)

    • Rules/gazetteers for People’s names

    • Acronyms for Organizations
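For instance, a typed, context-annotated input string might look like the following (the sentence, the type, and the offsets are invented for illustration; the offsets are intended to pick out the entity tokens "Solahart Services" within the surrounding context):

ORG#Aztec Solar recently acquired Solahart Services of Stockton#4#5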



Using NESim (cont’d)

  • Returns a score in [0, 1]

    • Threshold of 0.8 or higher is advised

    • Weakly similar names are generally not semantically close

  • Put jar on classpath, call programmatically

    • Loads large lists, so instantiate once only

      import edu.illinois.cs.cogcomp.entityComparison.core.EntityComparison;

      EntityComparison entityComparator = new EntityComparison();
      entityComparator.compare( aName, anotherName );
      double currentScore = entityComparator.getScore();

  • Problem: identifying NE boundaries and types



Annotators



Available from CCG

Tokenization/Sentence Splitting

Part Of Speech

Chunking

Named Entity Recognition

Coreference

Semantic Role Labeling



Tokenization and Sentence Segmentation

  • Given a document, find the sentence and token boundaries

    The police chased Mr. Smith of Pink Forest, Fla. all the way to Bethesda, where he lived. Smith had escaped after a shoot-out at his workplace, Machinery Inc.

  • Why?

    • Word counts may be important features

    • Words may themselves be the object you want to classify

    • “lived.” and “lived” should give the same information

    • Different analyses need to align if you want to leverage multiple annotators from different sources/tasks



Tokenization and Sentence Segmentation ctd.

  • Believe it or not, this is an open problem

  • No agreed standard for token-level segmentation

    • e.g. “American-led” vs. “American - led”?

    • e.g. “$ 32 M” vs “$32 M” and “$32M”?

  • Different tasks may use different standards

  • No wildly successful sentence segmenter exists (see the excerpts in news aggregators for some nice errors)

  • Noisier text (e.g. online consumer reviews) → poorer performance (for reasons like inconsistent capitalization)

  • The LBJ distribution includes the Illinois tokenizer and sentence segmenter



Part of Speech (POS)

  • Allows simple abstraction for pattern detection

  • Disambiguate a target, e.g. “make (a cake)” vs. “make (of car)”

  • Specify more abstract patterns, e.g. Noun Phrase: ( DT JJ* NN ); see the sketch after this list

  • Specify context in abstract way

    • e.g. “DT boy VBX” for “actions boys do”

    • This expression will catch “a boy cried”, “some boy ran”, …
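A small sketch of what matching patterns over tags can look like in practice (an illustration using plain Java regular expressions over a tag string, not CCG code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// POS tags for "the quick brown fox jumped", space-separated
String tags = "DT JJ JJ NN VBD";

// Noun-phrase pattern from the slide: a determiner, any number of adjectives, a noun
Pattern nounPhrase = Pattern.compile( "DT( JJ)* NN" );
Matcher m = nounPhrase.matcher( tags );

while ( m.find() ) {
    System.out.println( "NP tag span: " + m.group() );   // prints "DT JJ JJ NN"
}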



Chunking

  • Identifies phrase-level constituents in sentences

    [NP Boris]  [ADVP regretfully]  [VP told]  [NP his wife]  [SBAR that]  [NP their child]  [VP could not attend] [NP night school]  [PP without]  [NP permission] .

  • Useful for filtering: identify e.g. only noun phrases, or only verb phrases

    • Groups modifiers with heads

    • Useful for e.g. Mention Detection

  • Used as source of features, e.g. distance (abstracts away determiners, adjectives, for example), sequence,…

    • More efficient to compute than full syntactic parse

    • Applications in e.g. Information Extraction – getting (simple) information about concepts of interest from text documents



Named Entity Recognition

  • Identifies and classifies strings of characters representing proper nouns

    [PER Neil A. Armstrong] , the 38-year-old civilian commander, radioed to earth and the mission control room here: "[LOC Houston] , [ORG Tranquility Base] here; the Eagle has landed."

  • Useful for filtering documents

    • “I need to find news articles about organizations in which Bill Gates might be involved…”

  • Disambiguate tokens: “Chicago” (team) vs. “Chicago” (city)

  • Source of abstract features

    • E.g. “Verbs that appear with entities that are Organizations”

    • E.g. “Documents that have a high proportion of Organizations”



Coreference

  • Identify all phrases that refer to each entity of interest – i.e., group mentions of concepts

    [Neil A. Armstrong] , [the 38-year-old civilian commander], radioed to [earth]. [He] said the famous words, "[the Eagle] has landed."

  • The Named Entity recognizer only gets us part-way…

  • …if we ask, “what actions did Neil Armstrong perform?”, we will miss many instances (e.g. “He said…”)

  • Coreference resolver abstracts over different ways of referring to the same person

    • Useful in feature extraction, information extraction



Semantic Role Labeler

SRL reveals relations and arguments in the sentence (where relations are expressed as verbs)

It cannot, however, abstract over variability in how the relations are expressed – e.g. kill vs. murder vs. slay…



Curator



Big NLP

  • We introduced a lot of tools, some of them quite sophisticated

  • The more complex, the bigger the memory requirement

    • NER: 1GB; Coref: 1GB; SRL: 4GB …

  • If you use tools from different sources, they may be…

    • In different languages

    • Using different data structures

  • If you run a lot of experiments on a single corpus, it would be nice to cache the results

    • …and for your colleagues, nice if they can access that cache.

  • Curator is our solution to these problems.



Curator

  • Supports distributed NLP resources

    • Central point of contact

    • Single set of interfaces

    • Code generation in many languages (using Thrift)

  • Programmatic interface

    • Defines set of common data structures used for interaction

  • Caches processed data

  • Enables highly configurable NLP pipeline

    Overhead:

  • Annotation is all at the level of character offsets: normalization/mapping to the token level is required

  • Need to wrap tools to provide requisite data structures



Curator

(Architecture diagram: client code calls the Curator, which dispatches requests to annotators such as NER, SRL, and the POS tagger/Chunker, and stores their results in a cache.)


Using Curator for Flexible NLP Pipeline

  • http://cogcomp.cs.illinois.edu/curator/demo/index_beta.html

  • For this class only: dedicated curator instance

    • Temporary instance with host, port accessible to class members

  • http://cogcomp.cs.illinois.edu/trac/wiki/CuratorDataStructures

  • Recommended: access using Edison library (next)



Edison: An NLP Library

  • http://cogcomp.cs.illinois.edu/software/edison/

  • Convenient interface to Curator

    • Converts to token-level indexing (often more convenient)

  • Supports feature extraction over trees

    • Apply to syntactic parse/dependency, and to SRL/NOM

    • E.g. see http://cogcomp.cs.illinois.edu/software/edison/FeaturesExample.html for examples of dependency path features
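As a rough sketch of the token-level view access Edison provides (assuming a TextAnnotation ta already obtained via the Curator; the package path and method names here follow the Edison 0.x documentation from memory and should be checked against your version):

import edu.illinois.cs.cogcomp.edison.sentences.Constituent;
import edu.illinois.cs.cogcomp.edison.sentences.TextAnnotation;
import edu.illinois.cs.cogcomp.edison.sentences.View;
import edu.illinois.cs.cogcomp.edison.sentences.ViewNames;

// ta: a TextAnnotation previously populated with a POS view
View posView = ta.getView( ViewNames.POS );

for ( Constituent c : posView.getConstituents() ) {
    // each Constituent covers a token span (token-level indices) and carries a label, here a POS tag
    System.out.println( c.getStartSpan() + "-" + c.getEndSpan() + "\t" + c.getLabel() );
}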



Serializing TextAnnotations

public void serializeAnnotations( List< TextAnnotation > annotations_, String outputFile_ ) throws Exception {
    try {
        ObjectOutputStream objOut = new ObjectOutputStream( new FileOutputStream( outputFile_ ) );
        objOut.writeObject( new Integer( annotations_.size() ) );
        for ( TextAnnotation ta : annotations_ ) {
            System.err.println( "serializing TA for text '" + ta.getText() + "'..." );
            objOut.writeObject( ta );
        }
        objOut.close();
    } catch ( IOException e ) {
        // … (error handling elided on the original slide)
    }
    return;
}
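A matching deserialization method (not in the original slides; it simply reverses the steps above using standard Java object streams, with imports omitted in the same style as the slide):

public List< TextAnnotation > deserializeAnnotations( String inputFile_ ) throws Exception {
    List< TextAnnotation > annotations = new ArrayList< TextAnnotation >();
    ObjectInputStream objIn = new ObjectInputStream( new FileInputStream( inputFile_ ) );
    int numAnnotations = ( (Integer) objIn.readObject() ).intValue();
    for ( int i = 0; i < numAnnotations; ++i ) {
        annotations.add( (TextAnnotation) objIn.readObject() );
    }
    objIn.close();
    return annotations;
}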



K-best Views in Curator

  • The Charniak and Stanford parsers can be run in K-best mode

  • These will be added to Curator with k=50

    • This will be quite disk-hungry

    • These components will probably *not* be cached

  • Curator uses a MultiParser interface for k-best parsers

    • Generates a parse view in Record

    • The parse view is a List of Forests: the k-th Forest contains the k-th best parse for all sentences in the record

  • Edison does NOT yet directly support getting k-best parses from Curator, BUT…



K-best views in Edison

Edison supports k-best views

List<View> topKParses  = ...; // A list of top-k parses, say from Charniak

ta.addView(ViewNames.PARSE_CHARNIAK, topKParses);

List<View> parses = ta.getTopKViews(ViewNames.PARSE_CHARNIAK);

int tokenId = 17; // some token



Edison k-best example cont’d

Constituent c = new Constituent( "", "", ta, tokenId, tokenId + 1 );
int treeId = 0;

for ( View parseTree : parses ) {
    for ( Constituent parseConstituent : parseTree.where( Queries.containsConstituent( c ) ) ) {
        // do something with parseConstituent belonging to tree "treeId"
    }
    treeId++;
}



A Final Word



LLM and Semantic Similarity

LLM was designed for Textual Entailment, and is asymmetric by design

This task is a little different – it asks us to assess the level of semantic equivalence of two sentences S1 and S2

We still want to normalize (we don't want all short sentence pairs to score lower than all long ones), but consider evaluating both (S1, S2) and (S2, S1)
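One simple way to make the score symmetric (a sketch only; whether to average or take the maximum is our assumption, not something the slides prescribe; llm, s1 and s2 are assumed to be set up as in the earlier LLM example):

double forward  = llm.compareStrings( s1, s2 );
double backward = llm.compareStrings( s2, s1 );
double symmetricScore = 0.5 * ( forward + backward );   // or Math.max( forward, backward )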



Fin

