
An Introduction to Edison

Vivek Srikumar, 17th April 2012


Presentation Transcript


  1. An Introduction to Edison Vivek Srikumar 17th April 2012

  2. Curator gives us easy access to several layers of annotation over text. What can we do with these?

  3. Outline • What is Edison? • Installing Edison • Using Edison • Creating Edison objects • Accessing the Curator • Adding and using views

  4. What is Edison? • A uniform representation of diverse NLP annotations • A library of NLP data structures • A Java client to the Curator

  5. NLP Annotations
      Sentence: John Smith bought the car.
      Part-of-speech: NNP John, NNP Smith, VBD bought, DT the, NN car, . .
      Shallow parse: [NP John Smith] [VP bought] [NP the car]
      Parse tree: [figure: tree over "John Smith bought the car" with node labels S, NP, VP, NNP, VBD, DT, NN]
      Semantic roles: Predicate buy, A0 John Smith, A1 the car
      Named entities: PER John Smith
      And many others…

  6. A uniform representation • Main ideas • All the annotations over text are graphs • Nodes: Labeled spans of text • Spans indexed by tokens in the text • Edges: Relations between the nodes • Edison terminology • TextAnnotation: A container of tokens and views • View: A graph that denotes a specific annotation • Constituent: A labeled span of text (nodes) • Relation: A labeled directed edge between Constituents
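The "annotations are graphs" idea on this slide can be sketched in a few lines of plain Java. The classes below (`Node`, `Edge`, `Layer`, `Doc`) are hypothetical stand-ins invented for illustration; they play the roles of Constituent, Relation, View, and TextAnnotation but are not Edison's actual classes or signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-ins only: these are NOT Edison's classes or signatures.
class GraphModelSketch {

    /** A node: a labeled span of tokens, from 'start' up to (not including) 'end'. */
    static final class Node {
        final String label;
        final int start;
        final int end;

        Node(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    /** An edge: a labeled, directed relation between two nodes. */
    static final class Edge {
        final String name;
        final Node source;
        final Node target;

        Edge(String name, Node source, Node target) {
            this.name = name;
            this.source = source;
            this.target = target;
        }
    }

    /** One annotation layer: a named graph over the tokens. */
    static final class Layer {
        final String name;
        final List<Node> nodes = new ArrayList<>();
        final List<Edge> edges = new ArrayList<>();

        Layer(String name) {
            this.name = name;
        }
    }

    /** The container: tokens plus annotation layers indexed by name. */
    static final class Doc {
        final List<String> tokens;
        final Map<String, Layer> layers = new LinkedHashMap<>();

        Doc(List<String> tokens) {
            this.tokens = tokens;
        }

        /** The text covered by a node, rebuilt from the token list. */
        String textOf(Node n) {
            return String.join(" ", tokens.subList(n.start, n.end));
        }
    }

    static Doc example() {
        Doc doc = new Doc(Arrays.asList("John", "Smith", "bought", "the", "car", "."));

        // A layer with nodes only (no edges), like a part-of-speech annotation.
        Layer pos = new Layer("POS");
        String[] tags = {"NNP", "NNP", "VBD", "DT", "NN", "."};
        for (int i = 0; i < tags.length; i++) {
            pos.nodes.add(new Node(tags[i], i, i + 1));
        }
        doc.layers.put(pos.name, pos);

        // A layer with edges, like a fragment of a parse tree.
        Layer parse = new Layer("PARSE");
        Node np = new Node("NP", 0, 2);
        Node john = new Node("NNP", 0, 1);
        Node smith = new Node("NNP", 1, 2);
        parse.nodes.addAll(Arrays.asList(np, john, smith));
        parse.edges.add(new Edge("ParentOf", np, john));
        parse.edges.add(new Edge("ParentOf", np, smith));
        doc.layers.put(parse.name, parse);
        return doc;
    }

    public static void main(String[] args) {
        Doc doc = example();
        for (Layer layer : doc.layers.values()) {
            System.out.println(layer.name + ": " + layer.nodes.size() + " nodes, "
                    + layer.edges.size() + " edges");
        }
        for (Edge e : doc.layers.get("PARSE").edges) {
            System.out.println(e.source.label + " -" + e.name + "-> " + e.target.label
                    + " (" + doc.textOf(e.target) + ")");
        }
    }
}
```

Because every layer indexes its spans by token position in the same container, different annotations of the same text can be compared directly.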

  7. A uniform representation
      TextAnnotation
        Raw text: John Smith bought the car.
        Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
        Views:
          Name: SENTENCE         Constituents: {…}   Relations: {…}
          Name: POS              Constituents: {…}   Relations: {…}
          Name: PARSE_CHARNIAK   Constituents: {…}   Relations: {…}
          and other views…

  8. Getting started with Edison • Download the jar from http://cogcomp.cs.illinois.edu/page/software_view/Edison • Click the download link and follow instructions • Add the edison jar and its dependencies to your class path • Dependencies • Cogcomp core utilities • Apache commons libraries • Thrift (to communicate with the Curator) • Porter stemmer • LBJ Library • Java WordNet interface • Javadoc available under “User Guide”

  9. Edison using Maven
      • Add the following repository definition to your pom.xml file:

          <repositories>
            <repository>
              <id>CogcompSoftware</id>
              <name>CogcompSoftware</name>
              <url>http://cogcomp.cs.illinois.edu/m2repo/</url>
            </repository>
          </repositories>

      • Add Edison as a dependency:

          <dependency>
            <groupId>edu.illinois.cs.cogcomp</groupId>
            <artifactId>edison</artifactId>
            <version>0.2.9</version>
            <type>jar</type>
            <scope>compile</scope>
          </dependency>

  10. So far… • What is Edison? • Installing Edison • Creating a TextAnnotation • Adding views from the Curator • Using views • …?? • Profit!

  11. A uniform representation
      TextAnnotation
        Raw text: John Smith bought the car.
        Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
        Views:
          Name: SENTENCE         Constituents: {…}   Relations: {…}
          Name: POS              Constituents: {…}   Relations: {…}
          Name: PARSE_CHARNIAK   Constituents: {…}   Relations: {…}
          and other views…

  12. Three ways to create TextAnnotations • When you don’t know the tokenization • Use this for raw text, if you don’t want to use the Curator • When you know the tokenization • Use this for pre-tokenized text • Using the Curator • Use this for raw text • If your text is pre-tokenized, you can still use the Curator for adding views

  13. Creating TextAnnotations (1) • When to use this approach • If you don’t know the tokenization (i.e. the words) • If you want to use the LBJ tokenizer and sentence splitter • Note: Every TextAnnotation has a textId and a corpusId; these could be used in the future for book-keeping

  14. Creating TextAnnotations (1)

      // Needs java.util.List in addition to the Edison classes.
      String corpus = "2001_ODYSSEY";
      String textId = "001";
      String text1 = "Good afternoon, gentlemen. I am a HAL-9000 computer.";

      TextAnnotation ta1 = new TextAnnotation(corpus, textId, text1);

      System.out.println(ta1.getText());
      System.out.println(ta1.getTokenizedText());

      // Print the sentences. The `Sentence` class has the same
      // methods as a `TextAnnotation`.
      List<Sentence> sentences = ta1.sentences();
      System.out.println(sentences.size() + " sentences found.");

      for (int i = 0; i < sentences.size(); i++) {
          Sentence sentence = sentences.get(i);
          System.out.println(sentence);
      }

  15. Creating TextAnnotations (2) • When to use this approach • When you know the tokenization • That is, when some external source specifies the tokens of the text • After creating it, it can be used as before

  16. Creating TextAnnotations (2)

      // Needs java.util.Arrays and java.util.List in addition to the Edison classes.
      String corpus = "2001_ODYSSEY";
      String textId = "002";

      List<String> tokenizedSentences = Arrays.asList(
              "Good afternoon , gentlemen .",
              "I am a HAL-9000 computer .");

      TextAnnotation ta2 = new TextAnnotation(corpus, textId, tokenizedSentences);

      System.out.println(ta2.getText());
      System.out.println(ta2.getTokenizedText());

      // Print the sentences. The `Sentence` class has the same
      // methods as a `TextAnnotation`.
      List<Sentence> sentences = ta2.sentences();
      System.out.println(sentences.size() + " sentences found.");

      for (int i = 0; i < sentences.size(); i++) {
          Sentence sentence = sentences.get(i);
          System.out.println(sentence);
      }
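To make "knowing the tokenization" concrete, here is a minimal sketch of how a list of pre-tokenized sentences could map onto global token indices and sentence boundaries. It assumes that each string is one sentence and that tokens are separated by whitespace, as the example strings on the slide suggest. `PreTokenizedSketch` and its methods are hypothetical helpers written for this illustration, not part of Edison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Edison: shows one way pre-tokenized
// sentences can be mapped to global token indices.
class PreTokenizedSketch {

    /** Splits each sentence on whitespace and concatenates the tokens. */
    static List<String> tokens(List<String> tokenizedSentences) {
        List<String> all = new ArrayList<>();
        for (String sentence : tokenizedSentences) {
            all.addAll(Arrays.asList(sentence.trim().split("\\s+")));
        }
        return all;
    }

    /** Returns {start, end} token offsets for each sentence; 'end' is exclusive. */
    static List<int[]> sentenceSpans(List<String> tokenizedSentences) {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        for (String sentence : tokenizedSentences) {
            int length = sentence.trim().split("\\s+").length;
            spans.add(new int[] {start, start + length});
            start += length;
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Good afternoon , gentlemen .",
                "I am a HAL-9000 computer .");

        List<String> tokens = tokens(sentences);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(i + "\t" + tokens.get(i));
        }
        for (int[] span : sentenceSpans(sentences)) {
            System.out.println("sentence covers tokens " + span[0] + " to " + span[1]);
        }
    }
}
```

The sketch does not handle empty sentence strings; a real implementation would need to decide how to treat them.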

  17. Connecting to the Curator (1)
      If you don’t know anything about your text, the Curator can tokenize your text for you.

      String text = "Good afternoon, gentlemen. I am a HAL-9000 "
              + "computer. I was born in Urbana, Il. in 1992";

      String corpus = "2001_ODYSSEY";
      String textId = "001";

      // Create a Curator client. We need to specify a host and a port
      // where the Curator server is running.
      String curatorHost = "my-curator-server.cs.uiuc.edu";
      int curatorPort = 9090;

      CuratorClient client = new CuratorClient(curatorHost, curatorPort);

      // Should the Curator's cache be forcibly updated?
      boolean forceUpdate = false;

      // Create a TextAnnotation: get the text annotation object from the
      // Curator, which splits the sentences and tokenizes the text.
      TextAnnotation ta = client.getTextAnnotation(corpus, textId, text, forceUpdate);

  18. Connecting to the Curator (2)
      If you know the tokenization and want all the Curator’s annotators to respect this tokenization:

      String corpus = "2001_ODYSSEY";
      String textId = "002";

      // Create your TextAnnotation as before.
      List<String> tokenizedSentences = Arrays.asList(
              "Good afternoon , gentlemen .",
              "I am a HAL-9000 computer .");

      TextAnnotation ta2 = new TextAnnotation(corpus, textId, tokenizedSentences);

      // We need to specify a host and a port where the Curator server is
      // running.
      String curatorHost = "my-curator-server.cs.uiuc.edu";
      int curatorPort = 9090;

      // The third argument (true) tells the Curator to respect the
      // tokenization.
      CuratorClient client = new CuratorClient(curatorHost, curatorPort, true);

      Note: A CuratorClient in this mode cannot create TextAnnotations. Doing so will trigger an exception!

  19. So far… • What is Edison? • Installing Edison • Creating a TextAnnotation • Adding views from the Curator • Using views • …?? • Profit!

  20. Views • Views are graphs, Constituents are nodes and Relations are edges • Every TextAnnotation can be seen as a container for views, indexed by their name • View is a Java class that represents any graph over constituents • Specializations of the View class to deal with specific types • TokenLabelView, SpanLabelView, TreeView, PredicateArgumentView, CoreferenceView • You can create your own views or specializations too!

  21. Example: Part-of-speech
      Sentence: John Smith bought the car.
      Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
      Part-of-speech: NNP John, NNP Smith, VBD bought, DT the, NN car, . .
      Constituents: 0-1 NNP, 1-2 NNP, 2-3 VBD, 3-4 DT, 4-5 NN, 5-6 .
      Each constituent is associated with a span. The convention is to denote a span using the first token and the (last + 1)th one.
      No Relations!
      This specialization of the View class is called a TokenLabelView, where each constituent assigns a label to a token and there are no relations. Use for part-of-speech, stem/lemma, etc.

  22. Adding part-of-speech from the Curator

      // Suppose we have a CuratorClient called 'client' and a TextAnnotation
      // called 'ta'.

      // Should the Curator forcibly update the part-of-speech annotation?
      boolean forceUpdate = false;

      // Curator call: add the part-of-speech view from the Curator.
      client.addPOSView(ta, forceUpdate);

      // Get the part-of-speech view from the TextAnnotation. This view will
      // be filed under the name 'ViewNames.POS'. Also, we know that
      // this view will be a TokenLabelView.
      TokenLabelView posView = (TokenLabelView) ta.getView(ViewNames.POS);

      // Iterate through the text and get the POS label for each token.
      for (int tokenId = 0; tokenId < ta.size(); tokenId++) {
          String token = ta.getToken(tokenId);
          // getLabel(tokenId) is available for TokenLabelViews.
          String posLabel = posView.getLabel(tokenId);
          System.out.println(token + "\t" + posLabel);
      }

  23. Example: Shallow parse
      Sentence: John Smith bought the car.
      Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
      Shallow parse: [NP John Smith] [VP bought] [NP the car]
      Constituents: 0-2 NP, 2-3 VP, 3-5 NP
      Each constituent is associated with a span. The convention is to denote a span using the first token and the (last + 1)th one.
      No Relations!
      This specialization of the View class is called a SpanLabelView, where each constituent assigns a label to a span of text and there are no relations. Use for named entities, shallow parse, Wikifier, etc.
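The (first token, last + 1) span convention and a "contained in a span" query can be sketched without Edison as follows. `Span`, `chunks`, and `containedIn` are hypothetical names chosen for this illustration; they only mimic the kind of lookup a span-labeled view offers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a span-labeled view; not Edison's SpanLabelView.
class SpanLabelSketch {

    /** A labeled span covering tokens start, start + 1, ..., end - 1. */
    static final class Span {
        final String label;
        final int start;
        final int end;

        Span(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }

        /** Number of tokens covered; follows directly from the convention. */
        int length() {
            return end - start;
        }
    }

    /** Shallow parse of "John Smith bought the car ." as labeled spans. */
    static List<Span> chunks() {
        return Arrays.asList(
                new Span("NP", 0, 2),   // John Smith
                new Span("VP", 2, 3),   // bought
                new Span("NP", 3, 5));  // the car
    }

    /** Returns the spans that lie entirely inside the query span (start, end). */
    static List<Span> containedIn(List<Span> spans, int start, int end) {
        List<Span> result = new ArrayList<>();
        for (Span s : spans) {
            if (s.start >= start && s.end <= end) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("John", "Smith", "bought", "the", "car", ".");
        for (Span s : chunks()) {
            String text = String.join(" ", tokens.subList(s.start, s.end));
            System.out.println(s.start + "-" + s.end + " " + s.label + ": " + text);
        }
        System.out.println(containedIn(chunks(), 0, 2).size() + " chunk(s) inside (0, 2)");
    }
}
```

With an exclusive end index, adjacent spans share a boundary number (0-2, 2-3, 3-5) without sharing a token, and the length of a span is simply end minus start.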

  24. Adding shallow parse from the Curator

      // Suppose we have a CuratorClient called 'client' and a TextAnnotation
      // called 'ta'.

      // Should the Curator forcibly update the shallow parse annotation?
      boolean forceUpdate = false;

      // Curator call: add the shallow parse/chunk view from the Curator.
      client.addChunkView(ta, forceUpdate);

      // Get the shallow parse view from the TextAnnotation. This view will
      // be filed under the name 'ViewNames.SHALLOW_PARSE'. Also, we know that
      // this view will be a SpanLabelView.
      SpanLabelView chunkView = (SpanLabelView) ta.getView(ViewNames.SHALLOW_PARSE);

      // Get all constituents whose span is contained in the span (0, 2).
      // getSpanLabels is available for SpanLabelViews.
      List<Constituent> constituents = chunkView.getSpanLabels(0, 2);

      // Iterate over them and print their labels.
      for (Constituent c : constituents) {
          String label = c.getLabel();
          System.out.println(label);
      }

  25. Other SpanLabel views in the Curator • Shallow parse • ViewNames.SHALLOW_PARSE • Use ‘client.addChunkView(ta, forceUpdate)’ • Named entities • ViewNames.NER • Use ‘client.addNamedEntityView(ta, forceUpdate)’ • Wikifier • ViewNames.WIKIFIER • Use ‘client.addWikifierView(ta, forceUpdate)’ Note: For these function calls to work, the corresponding annotator should exist in your instance of the Curator. Otherwise, an exception will be triggered.

  26. Example: Parse view
      Sentence: John Smith bought the car.
      Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
      Parse tree: [figure: tree over "John Smith bought the car" with node labels S, NP, VP, NNP, VBD, DT, NN]
      Constituents: 0-1 NNP, 0-5 S, 0-2 NP, 3-5 VP (rest of the tree not shown)
      Relations: ParentOf edges between the constituents
      This specialization of the View class is called a TreeView, where the graph represents a tree. Use for full parse and dependency trees.
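As a rough illustration of a tree stored as constituents plus ParentOf relations, the sketch below rebuilds a bracketed parse string from such edges. It uses a different toy sentence ("Mary visited Urbana") with hand-written spans, and `TreeViewSketch`, `Node`, and `addParentOf` are hypothetical names for this example only, not Edison's TreeView API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a tree kept as nodes plus ParentOf edges;
// this is not Edison's TreeView.
class TreeViewSketch {

    /** A labeled span of tokens; 'end' is one past the last token. */
    static final class Node {
        final String label;
        final int start;
        final int end;

        Node(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    // Each (parent, child) pair plays the role of one ParentOf relation.
    private final Map<Node, List<Node>> children = new LinkedHashMap<>();

    void addParentOf(Node parent, Node child) {
        children.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
    }

    /** Renders the subtree rooted at 'node' in bracketed form. */
    String bracket(Node node, List<String> tokens) {
        List<Node> kids = children.get(node);
        if (kids == null) {
            // A node without outgoing ParentOf edges is a leaf: print its tokens.
            return "(" + node.label + " "
                    + String.join(" ", tokens.subList(node.start, node.end)) + ")";
        }
        StringBuilder sb = new StringBuilder("(" + node.label);
        for (Node kid : kids) {
            sb.append(' ').append(bracket(kid, tokens));
        }
        return sb.append(')').toString();
    }

    static String example() {
        List<String> tokens = Arrays.asList("Mary", "visited", "Urbana");
        Node s = new Node("S", 0, 3);
        Node subject = new Node("NP", 0, 1);
        Node mary = new Node("NNP", 0, 1);
        Node vp = new Node("VP", 1, 3);
        Node verb = new Node("VBD", 1, 2);
        Node object = new Node("NP", 2, 3);
        Node urbana = new Node("NNP", 2, 3);

        TreeViewSketch tree = new TreeViewSketch();
        tree.addParentOf(s, subject);
        tree.addParentOf(s, vp);
        tree.addParentOf(subject, mary);
        tree.addParentOf(vp, verb);
        tree.addParentOf(vp, object);
        tree.addParentOf(object, urbana);
        return tree.bracket(s, tokens);
    }

    public static void main(String[] args) {
        // Prints: (S (NP (NNP Mary)) (VP (VBD visited) (NP (NNP Urbana))))
        System.out.println(example());
    }
}
```

Keeping the tree as ordinary nodes and edges means the generic constituent and relation accessors still apply, while a tree-specific class can add conveniences on top.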

  27. Adding Charniak parse from the Curator

      // Suppose we have a CuratorClient called 'client' and a TextAnnotation
      // called 'ta'.

      // Should the Curator forcibly update the parse annotation?
      boolean forceUpdate = false;

      // Curator call: add the Charniak parse view from the Curator.
      client.addCharniakParse(ta, forceUpdate);

      // Get the Charniak parse view from the TextAnnotation. This view will
      // be filed under the name 'ViewNames.PARSE_CHARNIAK'. Also, we know
      // that this view will be a TreeView.
      TreeView parseView = (TreeView) ta.getView(ViewNames.PARSE_CHARNIAK);

      // Do interesting things with the view.

      // Get all parse nodes.
      List<Constituent> treeNodes = parseView.getConstituents();

      // Get the tree structure for the first sentence (i.e. sentence #0).
      Tree<String> parseTree = parseView.getTree(0);

      // Get the path between parse tree nodes (a common feature).
      String parsePath = PathFeatureHelper.getFullParsePathString(
              treeNodes.get(0), treeNodes.get(1), 400);

  28. Tree views from the curator • Charniak parser • ViewNames.PARSE_CHARNIAK • client.addCharniakParse(ta, forceUpdate) • Easy-first dependency parser • ViewNames.DEPENDENCY • client.addEasyFirstDependencyView(ta, forceUpdate) • Stanford parser • ViewNames.PARSE_STANFORD • client.addStanfordParse(ta, forceUpdate) • Stanford dependency parser • ViewNames.DEPENDENCY_STANFORD • client.addStanfordDependencyView(ta, forceUpdate)

  29. Other Curator calls • Verb semantic roles • View name: ViewNames.SRL • client.addSRLView(ta, forceUpdate) • Adds a view of type PredicateArgumentView, which is a subclass of the View class • Nominal semantic roles • View name: ViewNames.NOM • client.addNOMView(ta, forceUpdate) • Adds a view of type PredicateArgumentView • Coreference • View name: ViewNames.COREF • client.addCorefView(ta, forceUpdate) • Adds a view of type CoreferenceView, which is a subclass of the View class

  30. So far… • What is Edison? • Installing Edison • Creating a TextAnnotation • Adding views from the Curator • Using views • …?? • Profit!

  31. Using views • All views provide access to • Constituents: getConstituents, getConstituentsCoveringToken, getConstituentsCoveringSpan • Relations: getRelations • Allows us to manipulate several different views • E.g., get the parse tree nodes that contain each named entity constituent whose label is “PER”:

      for (Constituent c : namedEntityView.getConstituents()) {
          if (c.getLabel().equals("PER")) {
              List<Constituent> parseConstituents = parseView.getConstituentsCovering(c);
              // do something with these
          }
      }
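What "covering" means across two views can be sketched in plain Java. The code below assumes that a span covers another when it starts at or before it and ends at or after it; the slide does not spell out Edison's exact semantics, so treat this as an assumption. `Span`, `covers`, and `covering` are hypothetical names, and the toy sentence ("Mary visited Urbana") is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a cross-view "covering" query; not Edison's API.
class CoveringSketch {

    /** A labeled span of tokens; 'end' is one past the last token. */
    static final class Span {
        final String label;
        final int start;
        final int end;

        Span(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }

        /** Assumed meaning of "covers": this span fully contains the other. */
        boolean covers(Span other) {
            return start <= other.start && other.end <= end;
        }

        @Override
        public String toString() {
            return label + "(" + start + "-" + end + ")";
        }
    }

    /** Returns the candidates (e.g. parse nodes) that cover the target span. */
    static List<Span> covering(List<Span> candidates, Span target) {
        List<Span> result = new ArrayList<>();
        for (Span candidate : candidates) {
            if (candidate.covers(target)) {
                result.add(candidate);
            }
        }
        return result;
    }

    // Two toy "views" over the tokens {0:Mary, 1:visited, 2:Urbana}.
    static List<Span> namedEntities() {
        return Arrays.asList(new Span("PER", 0, 1), new Span("LOC", 2, 3));
    }

    static List<Span> parseNodes() {
        return Arrays.asList(
                new Span("S", 0, 3),
                new Span("NP", 0, 1),
                new Span("VP", 1, 3),
                new Span("NP", 2, 3));
    }

    public static void main(String[] args) {
        for (Span entity : namedEntities()) {
            if (entity.label.equals("PER")) {
                System.out.println(entity + " is covered by " + covering(parseNodes(), entity));
            }
        }
    }
}
```

The query works across views only because both sets of spans are indexed by the same token positions.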

  32. Using constituents and relations • Each constituent belongs to a view • Constituents provide the following methods: • getLabel(): gets the label of the constituent • getSpan(): gets the span of the constituent • getIncomingRelations(): gets the list of Relations that are incident to this constituent in this view • getOutgoingRelations(): gets the list of Relations whose source is this constituent in this view • Relations provide the following accessors: • getRelationName(), getSource(), getTarget()

  33. Other useful functionality • Supports • Top-K views • Custom views, for your application • Provides helper functions for common tasks • Look at the functions in classes in the package edu.illinois.cs.cogcomp.edison.features.helpers • Provides interface to WordNet • WordNetManager • Collins’ head-finding rules • Several feature extraction utilities • Look at the classes in edu.illinois.cs.cogcomp.edison.features

  34. So far… • What is Edison? • Installing Edison • Creating a TextAnnotation • Adding views from the Curator • Using views • …?? • Profit!

  35. Links • Edison download http://cogcomp.cs.illinois.edu/page/software_view/Edison • Example code http://cogcomp.cs.illinois.edu/software/edison/ • API documentation http://cogcomp.cs.illinois.edu/software/edison/apidocs
