
GATE Overview and Demo

GATE Overview and Demo. University of Washington CLMA Treehouse Presentation October 8, 2010 Prescott Klassen. Overview. Summary of GATE information and documentation found at gate.ac.uk GATE Developer features, components, and plug-ins IDE Demo Embedded GATE


Presentation Transcript


  1. GATE Overview and Demo University of Washington CLMA Treehouse Presentation October 8, 2010 Prescott Klassen

  2. Overview • Summary of GATE information and documentation found at gate.ac.uk • GATE Developer features, components, and plug-ins • IDE Demo • Embedded GATE • Using GATE with Condor on Patas • GATE code samples

  3. Background • Sheffield Natural Language Processing Group at the University of Sheffield • Released 1996 – re-written and re-released 2002 • Latest Release GATE 5.2.1 (May 6, 2010) – Windows, Linux, Solaris, and Mac OS • Beta Release GATE 6.0 (Beta 1 – August 21, 2010) • 100% Java Reference Implementation • Compatible with IBM Unstructured Information Management Architecture (UIMA) • Open Source (GNU Library General Public License) • XML Corpus Encoding Standard (XCES) format, used by the American National Corpus

  4. What is GATE? • An architecture describing how language processing systems are made up of components. • A framework (class library) written in Java and tested on Linux, Windows and Solaris. • A graphical development environment built on the framework (IDE for NLP)

  5. GATE Products • GATE Developer: IDE for language processing components, bundled with ANNIE (A Nearly-New Information Extraction system) and plug-ins • GATE Teamware: Web app for collaborative semantic annotation projects, incorporating a workflow engine and a backend service infrastructure • GATE Embedded: Object library optimized for inclusion in applications • GATE Services: Hosted services for cloud application development • GATE Wiki: Wiki/CMS • GATE Cloud: Cloud computing solution for hosted large-scale text processing

  6. GATE Components • Language Resources (LRs)—documents, corpora and ontologies • Processing Resources (PRs)—parsers, stemmers, co-reference resolvers, ML components, etc. • Visual Resources (VRs)—IDE components that provide a visual interface (GUI) to GATE components and plug-ins

  7. Language Resources • Documents, corpora, and ontologies • Can persist in Java Serial Store or Lucene Serial Data Store • Document = content + annotations + features • “Stand-off” Markup • Annotations as Directed Acyclic Graphs (start Node, end Node, ID, type, Feature Map, pointers into the source document as character offsets) • Input Formats: Plain Text, HTML, SGML, XML, RTF, Email, PDF, Microsoft Word • Ontology support (Sesame2, OWLIM3)
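
The "document = content + annotations + features" idea on this slide can be illustrated with a small stand-alone sketch. This is a toy model in plain Java, not GATE's own classes: the names `StandOffDemo`, `Doc`, `Span`, and `within` are hypothetical and chosen only for illustration. It shows that the text is never modified and that each annotation carries an ID, a type, a feature map, and character offsets into the content.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-off annotation model (hypothetical names, not the GATE API). */
final class StandOffDemo {

    /** An annotation points into the text by character offsets. */
    static final class Span {
        final int id;
        final String type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset
        final Map<String, Object> features = new HashMap<String, Object>();

        Span(int id, String type, int start, int end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** A document is unmodified content plus a list of annotations. */
    static final class Doc {
        final String content;
        final List<Span> annotations = new ArrayList<Span>();

        Doc(String content) {
            this.content = content;
        }

        Span annotate(String type, int start, int end) {
            Span s = new Span(annotations.size(), type, start, end);
            annotations.add(s);
            return s;
        }

        /** Text covered by an annotation, recovered from its offsets. */
        String textOf(Span s) {
            return content.substring(s.start, s.end);
        }

        /** Annotations of the given type lying inside [start, end). */
        List<Span> within(String type, int start, int end) {
            List<Span> out = new ArrayList<Span>();
            for (Span s : annotations) {
                if (s.type.equals(type) && s.start >= start && s.end <= end) {
                    out.add(s);
                }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc("Prescott Klassen spoke in Seattle.");
        Span person = doc.annotate("Person", 0, 16);
        person.features.put("source", "manual");
        doc.annotate("Token", 0, 8);
        doc.annotate("Token", 9, 16);
        doc.annotate("Location", 26, 33);

        System.out.println(doc.textOf(person)); // prints "Prescott Klassen"
        System.out.println(doc.within("Token", person.start, person.end).size()); // prints 2
    }
}
```

Because the markup lives beside the text rather than inside it, overlapping or nested annotations (here, two `Token` spans inside one `Person` span) need no special handling.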

  8. Processing Resources • ANNIE (a Nearly-New Information Extraction System) • Document Reset • Tokeniser • Gazetteer • Sentence Splitter • RegEx Sentence Splitter • Part of Speech Tagger • Semantic Tagger • Orthographic Coreference (OrthoMatcher) • Pronominal Coreference
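
The components listed on this slide run one after another, each adding annotations that later components can build on. The following is a minimal sketch of that idea in plain Java, under the assumption of a much simplified tokeniser and gazetteer. The names `PipelineDemo`, `Step`, `Tokeniser`, and `Gazetteer` are hypothetical stand-ins, not the ANNIE classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy serial pipeline (hypothetical names, not the ANNIE implementation). */
final class PipelineDemo {

    /** A typed span over the text. */
    static final class Ann {
        final String type;
        final int start;
        final int end;

        Ann(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    static final class Doc {
        final String text;
        final List<Ann> anns = new ArrayList<Ann>();

        Doc(String text) {
            this.text = text;
        }
    }

    /** Stand-in for a processing resource: reads and extends the annotations. */
    interface Step {
        void run(Doc doc);
    }

    /** Adds one Token annotation per run of word characters. */
    static final class Tokeniser implements Step {
        public void run(Doc doc) {
            Matcher m = Pattern.compile("\\w+").matcher(doc.text);
            while (m.find()) {
                doc.anns.add(new Ann("Token", m.start(), m.end()));
            }
        }
    }

    /** Adds a Lookup annotation over every Token whose text is in the list. */
    static final class Gazetteer implements Step {
        private final Set<String> entries;

        Gazetteer(Set<String> entries) {
            this.entries = entries;
        }

        public void run(Doc doc) {
            List<Ann> lookups = new ArrayList<Ann>();
            for (Ann a : doc.anns) {
                if (a.type.equals("Token")
                        && entries.contains(doc.text.substring(a.start, a.end))) {
                    lookups.add(new Ann("Lookup", a.start, a.end));
                }
            }
            doc.anns.addAll(lookups);
        }
    }

    /** Runs the steps in order over a fresh document. */
    static Doc runAll(String text, List<Step> steps) {
        Doc doc = new Doc(text);
        for (Step s : steps) {
            s.run(doc);
        }
        return doc;
    }

    static int count(Doc doc, String type) {
        int n = 0;
        for (Ann a : doc.anns) {
            if (a.type.equals(type)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        Set<String> places = new HashSet<String>(Arrays.asList("Sheffield"));
        List<Step> steps = Arrays.<Step>asList(new Tokeniser(), new Gazetteer(places));
        Doc doc = runAll("GATE was built in Sheffield", steps);
        System.out.println(count(doc, "Token"));  // prints 5
        System.out.println(count(doc, "Lookup")); // prints 1
    }
}
```

Order matters in this sketch: the gazetteer step finds nothing unless the tokeniser has already run, which is why a pipeline lists its steps in a fixed sequence.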

  9. Processing Resources • JAPE (Java Annotation Pattern Engine): • Regular expressions over annotations • Finite state transduction over annotations based on regular expressions • Not against strings but against annotation graphs • Non-deterministic • ANNIC: ANNotations-In-Context • full-featured annotation indexing and retrieval system • Searchable Serial DataStore • Based on Lucene
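
To make "regular expressions over annotations" concrete, here is a rough sketch in plain Java. It is not JAPE and does not use the GATE API: it assumes a flat sequence of annotation types, encodes each type as one letter, and lets `java.util.regex` match over that encoding instead of over the document's characters. The class name `AnnotationPatternDemo`, the type names, and the letter codes are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy pattern matching over annotation types (hypothetical, not JAPE). */
final class AnnotationPatternDemo {

    /**
     * Encodes each annotation type as one letter so a standard regex can run
     * over the sequence: T = Title, F = FirstName, L = LastName, o = other.
     */
    static String encode(List<String> types) {
        StringBuilder sb = new StringBuilder();
        for (String t : types) {
            if (t.equals("Title")) {
                sb.append('T');
            } else if (t.equals("FirstName")) {
                sb.append('F');
            } else if (t.equals("LastName")) {
                sb.append('L');
            } else {
                sb.append('o');
            }
        }
        return sb.toString();
    }

    /** Returns {startIndex, endIndexExclusive} pairs of matching annotation runs. */
    static List<int[]> findPersons(List<String> types) {
        // Optional Title, then one or more FirstName, then one LastName.
        Pattern person = Pattern.compile("T?F+L");
        Matcher m = person.matcher(encode(types));
        List<int[]> spans = new ArrayList<int[]>();
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList(
            "Token", "Title", "FirstName", "LastName", "Token", "FirstName", "LastName");
        for (int[] span : findPersons(types)) {
            System.out.println("Person over annotations " + span[0] + ".." + (span[1] - 1));
        }
        // prints "Person over annotations 1..3" then "Person over annotations 5..6"
    }
}
```

A real grammar engine would also test annotation features and work on a graph of possibly overlapping annotations; the flat sequence here is a deliberate simplification.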

  10. Processing Resources • The Annotation Diff Tool • enables two sets of annotations in one or two documents to be compared • figures are generated for precision, recall, F-measure • Corpus Benchmark Tool • Apply evaluation across an entire corpus • Balanced Distance Metric (BDM) Ontology Tool
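
The precision, recall, and F-measure figures mentioned on this slide follow the standard definitions, sketched below in plain Java. This is not the Annotation Diff Tool's code: it assumes strict matching only (an annotation counts as correct when its type and offsets are identical in both sets), and the class name `AnnotationScoreDemo` and the `type:start-end` string encoding are hypothetical conveniences.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Strict precision / recall / F1 over two annotation sets (illustrative only). */
final class AnnotationScoreDemo {

    /**
     * @param key      reference annotations, each encoded as "type:start-end"
     * @param response annotations produced by the system under evaluation
     * @return {precision, recall, f1}
     */
    static double[] score(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<String>(response);
        correct.retainAll(key); // annotations present in both sets

        double precision = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double recall = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f1 = (precision + recall == 0.0)
            ? 0.0
            : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<String>(Arrays.asList(
            "Person:0-16", "Location:26-33", "Date:40-47", "Person:50-58"));
        Set<String> response = new HashSet<String>(Arrays.asList(
            "Person:0-16", "Location:26-33", "Person:60-66"));

        double[] s = score(key, response);
        // 2 of 3 responses are correct; 2 of 4 key annotations are found.
        System.out.println("precision=" + s[0] + " recall=" + s[1] + " f1=" + s[2]);
    }
}
```

With these inputs precision is 2/3, recall is 1/2, and F1 is 4/7.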

  11. Processing Resources (PlugIns) • OntoGazetteer • HashGazetteer • Gazetteer List Collector • Large KB Gazetteer • Ontology-Aware JAPE Transducer • Batch Learning PR (LibSVM, PAUM algorithm, Weka interface) • Machine Learning PR (Maxent, Weka and SVM Light)

  12. Resources on the Web site gate.ac.uk • User Guide • Movie Tutorials • Developer’s Guide/API docs • NLP Application Programmer’s Guide • Research Papers • GATE project descriptions • Demos • Plug-in Info • Commercial/Academic partnerships • Etc…

  13. IDE Demo

  14. What is GATE Embedded? • Everything in GATE IDE without the GUI • A Java framework for many different types of NLP solutions • A complex assortment of core functionality and plug-ins • Extensible and Composable • GATE can be included as a component in other Java Frameworks and vice-versa

  15. Example Application with a GATE Embedded Component

  16. Running GATE (“Hello World”)

      import java.io.File;

      import gate.*;
      import gate.creole.*;

      public class Main {
        public static void main(String[] args) throws Exception {
          // Replace both placeholders with local paths before compiling.
          Gate.setGateHome(new File(<Path to GATE>));
          Gate.setPluginsHome(new File(<Path to Plugins>));
          Gate.init(); // start GATE
        }
      }

  17. Registering Directories

      // Register each plug-in directory so its resources become available.
      Gate.getCreoleRegister().registerDirectories(
          new File(Gate.getPluginsHome(), "ANNIE").toURL());
      Gate.getCreoleRegister().registerDirectories(
          new File(Gate.getPluginsHome(), "Information_Retrieval").toURL());
      Gate.getCreoleRegister().registerDirectories(
          new File(Gate.getPluginsHome(), "Stemmer_Snowball").toURL());

  18. Creating Processing Resources

      // Controller that runs its processing resources in order over a corpus.
      SerialAnalyserController annieController =
          (SerialAnalyserController) Factory.createResource(
              "gate.creole.SerialAnalyserController",
              Factory.newFeatureMap(), Factory.newFeatureMap(), "ANNIE");

      // Default (empty) parameters shared by most of the resources below.
      FeatureMap params = Factory.newFeatureMap();
      annieController.add((ProcessingResource) Factory.createResource(
          "gate.creole.annotdelete.AnnotationDeletePR", params));
      annieController.add((ProcessingResource) Factory.createResource(
          "gate.creole.tokeniser.DefaultTokeniser", params));
      annieController.add((ProcessingResource) Factory.createResource(
          "stemmer.SnowballStemmer", params));
      annieController.add((ProcessingResource) Factory.createResource(
          "gate.creole.gazetteer.DefaultGazetteer", params));
      annieController.add((ProcessingResource) Factory.createResource(
          "gate.creole.splitter.RegexSentenceSplitter", params));
      annieController.add((ProcessingResource) Factory.createResource(
          "gate.creole.POSTagger", params));
      annieController.add((ProcessingResource) Factory.createResource(
          "gate.creole.ANNIETransducer", params));
      annieController.add((ProcessingResource) Factory.createResource(
          "gate.creole.orthomatcher.OrthoMatcher", params));

      // The pronominal coreferencer gets its own parameters.
      FeatureMap coRefParams = Factory.newFeatureMap();
      coRefParams.put("resolveIt", "true");
      annieController.add((ProcessingResource) Factory.createResource(
          "gate.creole.coref.Coreferencer", coRefParams));

  19. Creating Language Resources

      Corpus corpus = Factory.newCorpus("DUC Queries");

      // ConfigMgr is the example application's own configuration class.
      @SuppressWarnings("static-access")
      File topicsFile = new File(ConfigMgr.getTopicFilePath() + "topics.xml");
      gate.Document topicDoc = Factory.newDocument(topicsFile.toURL());
      corpus.add(topicDoc);

      // Run the pipeline over the corpus.
      annieController.setCorpus(corpus);
      annieController.execute();

  20. Iteration and Cleanup

      AnnotationSet defaultAnnotations = topicDoc.getAnnotations();
      AnnotationSet originalMarkup = topicDoc.getAnnotations("Original markups");
      AnnotationSet topicAnnotationSet = originalMarkup.get("TOPIC");

      // Query, Utilities, ConfigMgr, config and globalQueryHash belong to the
      // example application, not to GATE.
      for (Annotation topicAnnotation : topicAnnotationSet) {
        ArrayList<Query> topicQueryArrayList = new ArrayList<Query>();
        if (ConfigMgr.isQueryBreakdown()) {
          topicQueryArrayList = Utilities.buildTopicMultiQuery(
              topicAnnotation, originalMarkup, defaultAnnotations, config);
        } else {
          topicQueryArrayList = Utilities.buildTopicQuery(
              topicAnnotation, originalMarkup, defaultAnnotations, config);
        }
        String topicKey = null;
        topicKey = topicQueryArrayList.get(0).getDucTopicName();
        globalQueryHash.put(topicKey, topicQueryArrayList);
      }

      // Release the resources once processing is finished.
      topicDoc.cleanup();
      Factory.deleteResource(topicDoc);
      corpus.cleanup();
      Factory.deleteResource(corpus);

  21. Iterating through Annotations

      public static AnnotationSet getChildAnnotationSet(
          String childAnnotationSetName,
          Annotation annotation,
          AnnotationSet parentAnnotationSet) throws NullPointerException {
        AnnotationSet childAnnotationSet = null;
        // traverse nested Annotation Set for named annotation
        // using parent offsets to delimit range
        try {
          childAnnotationSet = parentAnnotationSet.get(
              childAnnotationSetName,
              annotation.getStartNode().getOffset(),
              annotation.getEndNode().getOffset());
          if (childAnnotationSet == null) {
            throw new NullPointerException();
          }
        } catch (Exception e) {
          System.err.println(e.getMessage());
        }
        return childAnnotationSet;
      }

  22. Example Script for Compiling on Patas

      #!/bin/bash
      GATE=/NLP_TOOLS/tool_sets/gate/gate-5.1

      # Build the classpath: current directory, gate.jar, then every library jar.
      CP=.:$GATE/bin/gate.jar
      for jar in \
          activation.jar ant-contrib-1.0b2.jar ant.jar ant-junit.jar \
          ant-launcher.jar jdom.jar antlr.jar ant-nodeps.jar ant-trax.jar \
          Bib2HTML.jar commons-discovery-0.2.jar commons-fileupload-1.0.jar \
          commons-lang-2.4.jar commons-logging.jar concurrent.jar gate-asm.jar \
          gate-compiler-jdt.jar gateHmm.jar \
          geronimo-ws-metadata_2.0_spec-1.1.1.jar GnuGetOpt.jar icu4j.jar \
          jakarta-oro-2.0.5.jar javacc.jar jaxb-api-2.0.jar jaxen-1.1.jar \
          jaxws-api-2.0.jar junit.jar jwnl.jar log4j-1.2.14.jar lubm.jar \
          lucene-core-2.2.0.jar mail.jar nekohtml-1.9.8+2039483.jar \
          ontotext.jar orajdbc3.jar PDFBox-0.7.2.jar pg73jdbc3.jar \
          poi-2.5.1-final-20040804.jar spring-beans-2.0.8.jar \
          spring-core-2.0.8.jar stax-api-1.0.1.jar tm-extractors-0.4.jar \
          wstx-lgpl-3.2.3.jar xercesImpl.jar xml-apis.jar xmlunit-1.2.jar \
          xpp3-1.1.3.3_min.jar xstream-1.2.jar
      do
        CP=$CP:$GATE/lib/$jar
      done
      CP=$CP:edu.mit.jwi_2.1.5.jar

      javac -classpath "$CP" ling573extractive/*.java

  23. GATE Condor Script

      universe = java
      executable = ling573extractive/Main.class
      arguments = ling573extractive.Main
      output = ling573extractive.output
      error = ling573extractive.error
      jar_files = /NLP_TOOLS/tool_sets/gate/gate-5.1/bin/gate.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jdom.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-lang-2.4.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-asm.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-compiler-jdt.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/log4j-1.2.14.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lucene-core-2.2.0.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/nekohtml-1.9.8+2039483.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ontotext.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/PDFBox-0.7.2.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/orajdbc3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/wstx-lgpl-3.2.3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xercesImpl.jar,edu.mit.jwi_2.1.5.jar
      java_vm_args = -Xmn100M -Xms500M -Xmx500M
      +RequiresWholeMachine = True
      Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )
      queue

  24. Discussion
