
LingPipe

LingPipe (http://www.alias-i.com/lingpipe/) does a variety of tasks: tokenization, part-of-speech tagging, named entity detection, clustering, and identification of significant phrases. Other tasks include topic classification, database text mining, spell checking, sentiment analysis, and Chinese word segmentation.

damien

Presentation Transcript


  1. LingPipe http://www.alias-i.com/lingpipe/

  2. Does a variety of tasks
     • Tokenization
     • Part of Speech Tagging
     • Named Entity Detection
     • Clustering
     • Identifies Significant Phrases
     • Other
       • Topic Classification
       • Database Text Mining
       • Spell Checker
       • Sentiment Analysis
       • Chinese Word Segmentation

  3. Other Niceties
     • It's free
     • Plenty of documentation
     • Tutorials for every subtask
     • Highly configurable
     • Source code
       • Very complex, but well written
       • Good comments
       • Gives examples of how to edit the code
     • Can be trained in several languages

  4. Tokenization
     • Divides text into sentences and words using fairly sophisticated methods.
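To make the idea of sentence and word splitting concrete, here is a minimal sketch. It does not use LingPipe at all: it relies on the JDK's `java.text.BreakIterator`, and the class name `TokenizationSketch` and its methods are hypothetical, chosen only for illustration. LingPipe's own tokenizers may behave differently, for example around abbreviations such as "Mr.".

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: uses the JDK's BreakIterator, not LingPipe's tokenizers.
class TokenizationSketch {

    // Walk the boundaries reported by the iterator and collect the
    // non-blank segments between consecutive boundaries.
    private static List<String> segments(BreakIterator it, String text) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    // Split text into sentences using the JDK's sentence boundary rules.
    static List<String> sentences(String text) {
        return segments(BreakIterator.getSentenceInstance(Locale.US), text);
    }

    // Split a sentence into word and punctuation tokens, dropping whitespace.
    static List<String> words(String sentence) {
        return segments(BreakIterator.getWordInstance(Locale.US), sentence);
    }
}
```

Calling `TokenizationSketch.sentences(...)` and then `TokenizationSketch.words(...)` on each sentence gives the two-level split the slide describes. A rule-based splitter like this is only a baseline for comparison.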

  5. Part of Speech Tagging
     • You can output the N-best results.
     • You can output a confidence score for each word.
     • You can also retrain the Part of Speech Tagger.
     • You can also edit how it runs.

  6. Named Entity Detection
     • The default detection distinguishes between three types of entities.
       • People (distinguishes male and female)
       • Place
       • Organization
     • It can be trained to recognize any type of entity.
       • You can get corpora online.
       • You can annotate your own corpora using WordFreak, which also comes with LingPipe.

  7. Sample Input/Output
     Input:
       <DOCUMENT><P>This is Mr. Bob Smith. Bob lives in Redmond. He works for Microsoft.</P></DOCUMENT>
     Output:
       <DOCUMENT><P>
         <sent>This is Mr. <ENAMEX id="13" type="PERSON">Bob Smith.</ENAMEX> </sent>
         <sent><ENAMEX id="13" type="PERSON">Bob</ENAMEX> lives in <ENAMEX id="14" type="LOCATION">Redmond</ENAMEX> . </sent>
         <sent><ENAMEX id="13" type="MALE_PRONOUN">He</ENAMEX> works for <ENAMEX id="15" type="ORGANIZATION">Microsoft</ENAMEX> . </sent>
       </P></DOCUMENT>
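In the sample output above, mentions that share an `id` value ("Bob Smith.", "Bob", and "He" all carry `id="13"`) refer to the same entity. The sketch below shows one way a consumer might group mentions by that id. It is a minimal illustration using a regular expression from the JDK, not a LingPipe API, and it assumes the exact attribute order shown in the sample. The class name `EnamexSketch` and method `mentionsById` are hypothetical. A real consumer would more likely use an XML parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: groups ENAMEX mentions by their id attribute.
class EnamexSketch {

    // Assumes the attribute order id, then type, as in the sample output.
    private static final Pattern ENAMEX =
        Pattern.compile("<ENAMEX id=\"(\\d+)\" type=\"([A-Z_]+)\">(.*?)</ENAMEX>");

    // Returns a map from entity id to the mention strings tagged with it,
    // in the order the ids first appear in the text.
    static Map<String, List<String>> mentionsById(String xml) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        Matcher m = ENAMEX.matcher(xml);
        while (m.find()) {
            String id = m.group(1);
            String mention = m.group(3);
            groups.computeIfAbsent(id, key -> new ArrayList<>()).add(mention);
        }
        return groups;
    }
}
```

Run on the sample output, the entry for id 13 would collect the name mentions and the pronoun together, which is the coreference chain for that person.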

  8. Dictionary
     • To increase the accuracy of LingPipe, you can import a dictionary.
     • A dictionary forces certain strings to be recognized as certain entity types.
     • Common dictionaries include:
       • Gazetteer
       • List of people's names
       • Company names
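The dictionary idea can be sketched as a longest-match lookup over token sequences. This is a simplified illustration, not LingPipe's dictionary API: the class `DictionarySketch`, the method `tag`, and the `maxLen` parameter are all hypothetical names, and the dictionary is just a plain map from phrase to entity type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative only: forces dictionary phrases to their listed entity type.
class DictionarySketch {

    // Scans the tokens left to right. At each position it tries the longest
    // phrase first (up to maxLen tokens), so "Bob Smith" wins over "Bob".
    // Returns matches formatted as "phrase/TYPE".
    static List<String> tag(String[] tokens, Map<String, String> dictionary, int maxLen) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            for (int len = Math.min(maxLen, tokens.length - i); len >= 1; len--) {
                String phrase = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                String type = dictionary.get(phrase);
                if (type != null) {
                    found.add(phrase + "/" + type);
                    matched = len;
                    break;
                }
            }
            // Skip past the match, or advance one token if nothing matched.
            i += Math.max(matched, 1);
        }
        return found;
    }
}
```

With a dictionary containing "Bob Smith" as PERSON and "Redmond" as LOCATION, the tokens of "Bob Smith lives in Redmond" would yield one PERSON match and one LOCATION match. Exact string lookup like this is case-sensitive and ignores context, which is one reason a dictionary is usually combined with a trained model instead of replacing it.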

  9. Coreference
     • It identifies different references to the same entity, such as Bob Smith and Bob.
     • It does not identify entities across documents.
     • It links pronouns to their antecedents.
     • It does not do other anaphora resolution, like “Jane was the woman who pulled the trigger.”

  10. Clustering
      • Single-link clustering
        • Chops off the longest link
      • Clustering with proximity bounds
        • Merges based on proximity
      • Extraction of K clusters
        • You can specify how many clusters you want.
      • Complete-link clustering
        • A variant of single-link that takes the whole cluster into account
      • Within-cluster point scatter
        • You don't need to specify the number of clusters.
        • It detects the best breaking point.
        • This is the method used to do NER across documents.
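The difference between single-link and complete-link clustering is easiest to see in code. The sketch below is a minimal bottom-up (agglomerative) version on one-dimensional points that stops when K clusters remain. It is not LingPipe's clusterer API: the class `LinkClusteringSketch` and its methods are hypothetical names. Single-link measures two clusters by their closest pair of members, while complete-link uses their farthest pair, so it looks at the whole cluster.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: naive agglomerative clustering on 1-D points.
class LinkClusteringSketch {

    // Distance between two clusters: the minimum pairwise distance for
    // single-link, the maximum pairwise distance for complete-link.
    static double linkage(List<Double> a, List<Double> b, boolean completeLink) {
        double best = completeLink ? 0.0 : Double.POSITIVE_INFINITY;
        for (double x : a) {
            for (double y : b) {
                double d = Math.abs(x - y);
                best = completeLink ? Math.max(best, d) : Math.min(best, d);
            }
        }
        return best;
    }

    // Start with one cluster per point and repeatedly merge the two
    // closest clusters until only k remain.
    static List<List<Double>> cluster(double[] points, int k, boolean completeLink) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) {
            List<Double> single = new ArrayList<>();
            single.add(p);
            clusters.add(single);
        }
        while (clusters.size() > Math.max(k, 1)) {
            int bestI = 0;
            int bestJ = 1;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(clusters.get(i), clusters.get(j), completeLink);
                    if (d < bestDist) {
                        bestDist = d;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            // bestJ > bestI, so removing bestJ leaves bestI's index valid.
            List<Double> merged = clusters.remove(bestJ);
            clusters.get(bestI).addAll(merged);
        }
        return clusters;
    }
}
```

On the points 0.0, 1.9, 4.0, 7.0 with K = 2, single-link chains 4.0 onto the first cluster because it is close to 1.9, while complete-link pairs 4.0 with 7.0 because the merged cluster's farthest members must stay close. Stopping a bottom-up single-link merge at K clusters should give the same grouping as cutting the longest links top-down, which is how the slide describes it.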

  11. Significant Phrases
      • Finds phrases whose words appear together more often than chance would predict
      • Seems to be mostly named entities
        • Puget Sound, George Bush
      • Helps tell the genre of an article
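One common way to score "together more often than chance" is pointwise mutual information (PMI), which compares how often a word pair occurs with how often it would occur if the two words were independent. The sketch below uses PMI purely as an illustration. The slides do not say which statistic LingPipe uses, so this is a swapped-in technique, and the class `PhraseSketch` and method `pmi` are hypothetical names.

```java
// Illustrative only: scores a two-word phrase with pointwise mutual information.
class PhraseSketch {

    // PMI = log2( P(first second) / (P(first) * P(second)) ), estimated from
    // raw counts in the token array. Returns negative infinity if the pair
    // never occurs as adjacent tokens.
    static double pmi(String[] tokens, String first, String second) {
        int n = tokens.length;
        int countFirst = 0;
        int countSecond = 0;
        int countPair = 0;
        for (int i = 0; i < n; i++) {
            if (tokens[i].equals(first)) {
                countFirst++;
                if (i + 1 < n && tokens[i + 1].equals(second)) {
                    countPair++;
                }
            }
            if (tokens[i].equals(second)) {
                countSecond++;
            }
        }
        if (countPair == 0) {
            return Double.NEGATIVE_INFINITY;
        }
        double pPair = (double) countPair / (n - 1);   // n - 1 adjacent pairs
        double pFirst = (double) countFirst / n;
        double pSecond = (double) countSecond / n;
        return Math.log(pPair / (pFirst * pSecond)) / Math.log(2);
    }
}
```

A pair like "Puget Sound", where each word almost always appears with the other, scores higher than a pair like "the ferry", where "the" also appears in many other contexts. PMI is known to overrate rare pairs on small samples, so scores from a short text are only indicative.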

  12. Questions?
