
Exercise I



Presentation Transcript


1. Exercise I
• Objective: Implement an ML component based on SVM to identify the following concepts in company profiles: company name; address; fax; phone; web site; industry type; creation date; industry sector; main products; market locations; number of employees; stock exchange listings.

2. Exercise I
• Materials: we are working with the material in the directory hands-on-resources/ml/entity-learning
• Training documents: a set of 5 company profiles annotated with the target concepts (corpus/annotated). Each document contains an annotation Mention with a feature class representing the target concept (human annotated). The documents also contain annotations produced by ANNIE, plus an annotation called Entity that wraps up named entities of type Person, Organization, Location, Date, Address. All annotations are in the default annotation set.
• Test documents (without target concepts and without annotations): a set of company profiles from the same source as the training data (corpus/testing)
• SVM configuration file learn-company.xml (experiments/company-profile-learning)
• Open the configuration file in a text editor to see how the target concept and the linguistic annotations are encoded. Remember that the target concept is encoded using the <CLASS/> sub-element in the <ATTRIBUTE> element (in this case we are trying to learn a Mention and its ‘class’).

3. Exercise I – PART I
• Run an experiment with the training documents to check the performance of the learning component on annotated data – we will use the GATE GUI for this exercise
• Load the Batch Learning plug-in using the plug-in manager (it has the name ‘learning’ in the list of plug-ins)
• Create a corpus (ANNOTATED)
• Populate it with the training documents (corpus/annotated), using the encoding UTF-8 (you may want to look at one of the documents to see the annotations; the target annotation is Mention)
• Create a Batch Learning PR using the provided configuration file (experiments/company-profile-learning/learn-company.xml) – it should appear in the list of processing resources
• Create a corpus pipeline and add the Batch Learning PR to it
• Set the parameter “learningMode” of the Batch Learning PR to “evaluation”
• Run the corpus pipeline over the ANNOTATED corpus (by setting the corpus parameter)
• When finished, evaluation information will be dumped on the GATE console
• Examine the GATE console to see the evaluation results (the same workflow can also be scripted with GATE Embedded; see the sketch below)
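For readers who prefer scripting to the GUI, here is a minimal GATE Embedded sketch of the same evaluation run. It assumes the Batch Learning PR is the class gate.learning.LearningAPIMain with an init parameter configFileURL and a runtime parameter learningMode (the names shown in the GUI), that the plug-in lives in the plugins/Learning directory of the GATE installation, and that paths are relative to the hands-on material; check these against your own setup.

```java
import java.io.File;
import gate.*;
import gate.creole.SerialAnalyserController;

public class EvaluateCompanyLearning {
  public static void main(String[] args) throws Exception {
    Gate.init();  // assumes GATE_HOME / site config are already set up

    // Load the 'learning' plug-in (directory name is an assumption; adjust to your install)
    Gate.getCreoleRegister().registerDirectories(
        new File(Gate.getPluginsHome(), "Learning").toURI().toURL());

    // Create the ANNOTATED corpus and populate it with the training documents (UTF-8)
    File base = new File("hands-on-resources/ml/entity-learning");
    Corpus annotated = Factory.newCorpus("ANNOTATED");
    annotated.populate(new File(base, "corpus/annotated").toURI().toURL(), null, "UTF-8", false);

    // Create the Batch Learning PR from the provided configuration file
    FeatureMap params = Factory.newFeatureMap();
    params.put("configFileURL",
        new File(base, "experiments/company-profile-learning/learn-company.xml").toURI().toURL());
    ProcessingResource batchLearning =
        (ProcessingResource) Factory.createResource("gate.learning.LearningAPIMain", params);

    // Run in evaluation mode; the values mirror the GUI options, but depending on the GATE
    // version this parameter may need the gate.learning.RunMode enum rather than a String
    batchLearning.setParameterValue("learningMode", "EVALUATION");

    // Wrap the PR in a corpus pipeline and run it over ANNOTATED;
    // the evaluation figures are printed to the console when it finishes
    SerialAnalyserController pipeline = (SerialAnalyserController)
        Factory.createResource("gate.creole.SerialAnalyserController");
    pipeline.add(batchLearning);
    pipeline.setCorpus(annotated);
    pipeline.execute();
  }
}
```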

4. Exercise I – PART I
• In this exercise we have seen how to evaluate the learning component over annotated documents. Note that we have provided very few documents for training.
• According to the configuration file and the number of documents in the corpus, the ML pipeline will execute 2 runs; each run will use 3 documents for training and 2 documents for testing. In each test document the automatically produced Mention annotations will be compared to the true Mention annotations (gold standard) to compute precision, recall, and F-measure values. The evaluation results are an average over the two runs.

5. Exercise I – PART II
• Run an experiment to TRAIN the machine learning component
• Create a corpus and populate it with the training data (or use ANNOTATED from the previous steps)
• Create a Batch Learning PR using the provided configuration file (or use the same PR as before)
• Create a corpus pipeline containing the Batch Learning PR (or use the one from before)
• In the corpus pipeline, set the “learningMode” of the Batch Learning PR component to “training”
• Set the corpus in the corpus pipeline to the ANNOTATED corpus
• Run the corpus pipeline
• Now you have trained the ML component to recognise Mentions (a scripted version of this training step is sketched below)
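Continuing the hedged GATE Embedded sketch from Part I, training amounts to switching the run mode and re-running the same pipeline. Here pipeline, batchLearning and annotated are the objects built in the earlier sketch, and the same caveat applies about the learningMode value possibly needing the RunMode enum on some GATE versions.

```java
import gate.*;
import gate.creole.SerialAnalyserController;

public class TrainModel {
  /** Train the Batch Learning PR on the ANNOTATED corpus (objects from the Part I sketch). */
  public static void train(SerialAnalyserController pipeline,
                           ProcessingResource batchLearning,
                           Corpus annotated) throws Exception {
    batchLearning.setParameterValue("learningMode", "TRAINING");
    pipeline.setCorpus(annotated);
    pipeline.execute();
    // The learned model is written alongside the configuration file
    // (a 'savedFiles' directory in the default setup)
  }
}
```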

6. Exercise I – PART III
• Run an experiment to apply the trained model to unseen documents
• We will use the trained model produced in the previous exercise
• Create a corpus (TEST) and populate it with the test documents (use UTF-8 encoding)
• NOTE: the documents are not annotated, so you need to produce the annotations! The steps below produce them.
• Load the ANNIE system (with defaults)
• Create an ANNIE NE Transducer (call it ENTITY-GRAMMAR) using the grammar file grammars/create_entity.jape
• Add the ENTITY-GRAMMAR as the last component of ANNIE
• Run ANNIE (+ the new grammar) over the TEST corpus
• Verify that the documents contain the ANNIE annotations + the Entity annotation (a GATE Embedded sketch of these annotation steps follows below)
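The annotation steps can also be scripted. The fragment below is only a sketch: it assumes ANNIE's default saved application lives at plugins/ANNIE/ANNIE_with_defaults.gapp and that the ANNIE NE Transducer class is gate.creole.ANNIETransducer with a grammarURL parameter, as in the GATE releases these exercises target.

```java
import java.io.File;
import gate.*;
import gate.creole.SerialAnalyserController;
import gate.util.persistence.PersistenceManager;

public class AnnotateTestCorpus {
  public static void main(String[] args) throws Exception {
    Gate.init();

    // Create the TEST corpus from the unannotated company profiles (UTF-8)
    File base = new File("hands-on-resources/ml/entity-learning");
    Corpus test = Factory.newCorpus("TEST");
    test.populate(new File(base, "corpus/testing").toURI().toURL(), null, "UTF-8", false);

    // Load ANNIE with default settings from its saved application state
    SerialAnalyserController annie = (SerialAnalyserController) PersistenceManager.loadObjectFromFile(
        new File(new File(Gate.getPluginsHome(), "ANNIE"), "ANNIE_with_defaults.gapp"));

    // Create the ENTITY-GRAMMAR transducer from create_entity.jape and append it to ANNIE
    FeatureMap japeParams = Factory.newFeatureMap();
    japeParams.put("grammarURL", new File(base, "grammars/create_entity.jape").toURI().toURL());
    ProcessingResource entityGrammar =
        (ProcessingResource) Factory.createResource("gate.creole.ANNIETransducer", japeParams);
    annie.add(entityGrammar);

    // Run ANNIE (+ the new grammar) over TEST; afterwards each document should carry
    // the usual ANNIE annotations plus the wrapping Entity annotation
    annie.setCorpus(test);
    annie.execute();
  }
}
```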

7. Exercise I – PART III
• Take the corpus pipeline created in the previous exercise and change the learning mode parameter of the Batch Learning PR to “application”
• The input annotation set should be empty (default) because the ANNIE annotations are there, and the output annotation set can be any set (including the default)
• Apply (run) the corpus pipeline to the TEST corpus (by setting the corpus)
• Examine the result of the annotation process (see if Mention annotations have been produced)
• Mention annotations should contain a feature class (one of the concepts listed in the first slide) and a feature ‘prob’, which is a probability produced by the ML component
• Now you have applied a trained model to a set of unseen documents (a sketch of the equivalent GATE Embedded calls follows below)
• With parts I, II, and III you have used the evaluation, training, and application modes of the Batch Learning PR
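Continuing the GATE Embedded sketch (same caveats as before about class and parameter names), the application step plus a quick inspection of the resulting Mention annotations might look like this, where pipeline and batchLearning are the objects built for Part I and test is the ANNIE-annotated TEST corpus:

```java
import gate.*;
import gate.creole.SerialAnalyserController;

public class ApplyModel {
  /** Apply the trained Batch Learning PR to the TEST corpus and list the Mentions it adds. */
  public static void applyAndInspect(SerialAnalyserController pipeline,
                                     ProcessingResource batchLearning,
                                     Corpus test) throws Exception {
    // Switch to application mode; the input annotation set stays empty (default),
    // and here the output also goes to the default set
    batchLearning.setParameterValue("learningMode", "APPLICATION");
    pipeline.setCorpus(test);
    pipeline.execute();

    // Each Mention should carry a 'class' feature (the concept) and a 'prob' feature
    for (Document doc : test) {
      for (Annotation mention : doc.getAnnotations().get("Mention")) {
        System.out.println(doc.getName() + ": " + Utils.stringFor(doc, mention)
            + "  class=" + mention.getFeatures().get("class")
            + "  prob=" + mention.getFeatures().get("prob"));
      }
    }
  }
}
```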

8. Exercise I – PART IV
• Run your own experiment: copy the configuration file to another directory and edit it. You may comment out some of the features used, change the windows used, or change the type of ML. Chapter 11 of the GATE guide contains enough information on the options you can adjust.

9. Exercise II
• Objective: Implement an ML component based on SVM to “learn” ANNIE, i.e. to learn to identify the following concepts or named entities: Location, Address, Date, Person, Organization
• Materials (under the directory hands-on-resources/ml/entity-learning)
• We will need the GATE GUI and the learning plug-in loaded using the plug-in manager (see the previous exercise)
• We will use the testing documents provided in Exercise I
• Before starting, it is better to close all documents and resources from the previous exercise
• The configuration file is learn-nes.xml in experiments/learning-nes; it is very similar to the one used previously, but check the target annotation to be learned (Entity and its type)

10. Exercise II – PART I
• Annotate the documents
• Create a corpus (CORPUS) and populate it with the test documents (use UTF-8 encoding)
• NOTE: the documents are not annotated, so you need to produce the annotations! The steps below produce them.
• Load the ANNIE system (with defaults)
• Create an ANNIE NE Transducer (call it ENTITY-GRAMMAR) using the grammar file grammars/create_entity.jape
• Add the ENTITY-GRAMMAR as the last component of ANNIE
• Run ANNIE (+ the new grammar) over the CORPUS corpus
• Verify that the documents contain the ANNIE annotations + the Entity annotation

11. Exercise II – PART I
• Evaluate an SVM to identify ANNIE’s named entities
• Create a Batch Learning PR using the provided configuration file (experiments/learning-nes/learn-nes.xml)
• Create a corpus pipeline and add the Batch Learning PR to it
• Set the parameter “learningMode” of the Batch Learning PR to “evaluation”
• Run the corpus pipeline over the CORPUS corpus (by setting the corpus parameter)
• When finished, evaluation information will be dumped on the GATE console
• Examine the GATE console to see the evaluation results
• NOTE: For the sake of this exercise we have used the annotations produced by ANNIE as the gold standard and learned a named entity recognition system based on those annotations. Note, however, that training should normally be based on human annotations.

12. Exercise II – PART II
• Train an SVM to learn named entities and apply it to unseen documents
• We will use the documents you annotated (automatically!) in PART I (corpus CORPUS)
• Using the corpus editor, remove from CORPUS the first 5 documents in the list (profile_A, profile_AA, profile_AB, profile_AC, profile_AD)
• Create a corpus called TESTING
• Add to TESTING (using the corpus editor) the documents profile_A, profile_AA, profile_AB, profile_AC, profile_AD – they should be the last 5 in the list!
• Now we have one corpus for training (CORPUS) and one corpus for testing (TESTING); a sketch of doing this split programmatically follows below
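If you would rather script the split than use the corpus editor, a rough GATE Embedded sketch is shown below. It assumes the loaded document names start with the file names listed above (GATE usually appends a suffix such as ".html_0001A" to document names), which is an assumption you should verify against your corpus.

```java
import java.util.*;
import gate.*;

public class SplitCorpus {
  /** Move the five held-out profiles from the CORPUS corpus into a new TESTING corpus. */
  public static Corpus makeTestingCorpus(Corpus corpus) throws Exception {
    Set<String> heldOut = new HashSet<>(Arrays.asList(
        "profile_A", "profile_AA", "profile_AB", "profile_AC", "profile_AD"));
    List<Document> moved = new ArrayList<>();
    for (Document doc : corpus) {
      // Document names typically look like 'profile_A.html_0001A'; keep the part before the first dot
      String baseName = doc.getName().split("\\.")[0];
      if (heldOut.contains(baseName)) moved.add(doc);
    }
    Corpus testing = Factory.newCorpus("TESTING");
    testing.addAll(moved);     // TESTING now holds the five held-out documents
    corpus.removeAll(moved);   // CORPUS keeps the rest for training
    return testing;
  }
}
```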

13. Exercise II – PART II
• We will use the learning corpus pipeline we evaluated in PART I of this exercise
• In the learning corpus pipeline, set the parameter “learningMode” of the Batch Learning PR to “training”
• Run the learning corpus pipeline over the CORPUS corpus (by setting the corpus parameter)
• Now we have a trained model to recognise Entity and its type
• In the learning corpus pipeline, set the parameter “learningMode” of the Batch Learning PR to “application”
• Also set the output annotation set outputASName to “Output” (to hold the annotations produced by the system)
• Run the learning corpus pipeline over the TESTING corpus (by setting the corpus parameter)
• After execution, check the annotations produced on any of the testing documents (Output annotation set)

14. Exercise II – PART III
• On any of the automatically annotated documents from TESTING you may want to use the Annotation Diff tool to verify how the learner performed in each document, comparing the Entity annotations in the default annotation set with the Entity annotations in the Output annotation set (a programmatic sketch follows below)
• Run your own experiment, varying any of the parameters of the configuration file, modifying or adding new features, etc.
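Instead of (or in addition to) the Annotation Diff dialog, roughly the same comparison can be done with gate.util.AnnotationDiffer. The sketch below assumes the ANNIE-produced Entity annotations sit in the default set (used as the key) and the learner's output sits in the Output set; it also compares the 'type' feature, so an Entity only counts as correct if its type matches, not just its span.

```java
import gate.*;
import gate.util.AnnotationDiffer;

public class DiffEntities {
  /** Compare ANNIE's Entity annotations (default set) with the learner's (Output set). */
  public static void diff(Document doc) {
    AnnotationDiffer differ = new AnnotationDiffer();
    // Treat the 'type' feature as significant, so Entity types must agree as well
    differ.setSignificantFeaturesSet(java.util.Collections.singleton("type"));
    differ.calculateDiff(doc.getAnnotations().get("Entity"),          // key (gold standard here)
                         doc.getAnnotations("Output").get("Entity")); // response (learner output)
    System.out.println(doc.getName()
        + "  P=" + differ.getPrecisionStrict()
        + "  R=" + differ.getRecallStrict()
        + "  F1=" + differ.getFMeasureStrict(1.0));
  }
}
```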
