“Cheap” Tricks for NLP: An “Invited” Talk

Craig Martell, Associate Professor, Naval Postgraduate School; Director, NLP Lab

Overview • We’ve been doing work on microtext since before it was “microtext”. • About NPS • NPS Chat Corpus (v1 and v2?) • Overview • Goal (Jane Lin, NSA) • Age Detection Task (MAJ Jenny Tam, USA) • POS and Dialogue Act tagging: using Treebank to bootstrap (Lt. Col. Eric Forsyth, USAF) • But do we really even need to POS-tag? (CAPT James Hitt, USN) • Getting by “on the cheap” • Authorship detection in Twitter (LT Sarah Boutwell, USN) • Good scientific goals for the community (??)

NPS NLP Lab • NPS is both a university and part of the DoD • As a university, we work on the same types of sponsored research as civilian universities • DARPA*, IARPA, MURIs, NSF, etc. • Standard competitive process • Standard academia/industry expectations for results • Same tenure and promotion process • As a part of the DoD, we do work more directly for sponsors: • DoD, DARPA*, NRO, NSA, etc. • Depending on the type of money, results need to be more operationally applicable • We have had some cool results using “cheap” tricks that could point to more “normal” academic research

Some Recent and Current Work • IARPA SCIL • Persuasion Detection • Sub-group Detection • In forums, chat, etc. (“microtext”) • With UMD, UCSC, and Temple • DoD, etc. • Topic Detection in IRC Chat (Adams 2008) • Authorship “signal boosting” with large author sets • Any boost is remarkably useful to analysts • Project away topic signal from documents for cleaner authorship signal (topic does most of the work) • L1 detection from English-L2 documents. • “On-phone” NLP (above and more) • Accuracy vs Computational Power

The NPS Chat Corpus, V1 • Gathered 495,000 posts in age-based rooms • In accordance with the terms of service of the chat service • To abide by the Privacy Act, we hand-anonymized 10,000 posts and tagged them for dialogue act and part of speech • Go to Web Page

The NPS Chat Corpus, V2? • We have also gathered data to aid in doing conversational thread extraction: • Essentially, we want to cluster posts according to what conversation they’re in • Not necessarily mutually exclusive clusters • We gathered data similar to that gathered by Elsner and Charniak at Brown • They gathered IRC data from Linux tech help • We added iPhone and Python tech help, and Physics Q&A • It has all been hand “clustered” for conversations • Working with UCSC CS (Lyn Walker) and Linguistics (Pranav Anand) to augment the annotation to include dialogue act and “attachment” instead of clusters.

First Use: Age Detection • Second Youth Internet Safety Survey (2005) (YISS-2) • Decrease in youths receiving solicitations • Number of dangerous sexual overtures/aggressive solicitations has not declined • In 35% of the aggressive episodes, youths did not think the solicitations were serious enough to tell anyone • Only 7% of the aggressive solicitations were reported to law enforcement, ISP, or other authority • Need for an automated system that can recognize adults conversing with teens to alert parents of possible inappropriate conversations

Tam – Chat Classification • NPS Chat Corpus (Talk City Chat Data) • Teens, 20s, 30s, 40s, 50+ • Perverted Justice (IM chat logs) • Pseudo Victims (adults posing as minors) • Convicted criminals (solicitation of minors) • Binary Classification • teens vs. adults • teens vs. specific age group • teens vs. pseudo victims (similarity between actual teens and adults pretending to be teens) • criminals vs. teens (looking for minors soliciting minors) • criminals vs. pseudo victims • Classification Tool • Linear Support Vector Machine with different slack variables • Result: 80-90% success at distinguishing teens from adults. The most important case is distinguishing teens from 20s: >90% • Current state of the art in the field!
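For reference, the textbook soft-margin formulation of a linear SVM is given below. The symbols are not from the slides: x_i is a feature vector for a chat sample, y_i ∈ {−1, +1} is its class (for example teen vs. adult), ξ_i are the slack variables, and C is the penalty on slack. Reading "different slack variables" as trying different values of C is an assumption.

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0,\quad i=1,\dots,n .
```

A larger C penalizes margin violations more heavily; a smaller C tolerates more of them in exchange for a wider margin.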

Forsyth – Dialogue Act/POS Tagging • An experiment in cross-domain NLP • We wanted to POS tag chat (Lt Col Eric Forsyth, USAF) • Lex. Bigrams → Bigrams → Lex. Unigrams → Unigrams → MLE from training data • WSJ train, chat test: 57.4% accuracy • Not surprising. Chat is not like WSJ • Treebank train, chat test: 65.8% accuracy • Includes ATIS, Switchboard; chat is somewhat speech-like • Bootstrapped/hand-corrected POS tags for 10,000 posts • Chat train, chat test: 73.7% • But, add 10,000 chat posts to Treebank: 87.1% • Using HMM tagger trained on combo: 90.8% • Using these part-of-speech tags as part of the input, we can dialogue act tag at 83.2% accuracy.
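The backoff chain on this slide can be illustrated with a much-reduced sketch. This is not the tagger used in the experiments: it has only three levels (previous tag plus word, then word alone, then the most frequent tag overall), and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train_backoff(tagged_sentences):
    """Collect counts for a 3-level backoff: (prev tag, word) -> word -> default."""
    bigram = defaultdict(Counter)   # (previous tag, word) -> tag counts
    unigram = defaultdict(Counter)  # word -> tag counts
    tags = Counter()                # overall tag counts
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            bigram[(prev, word)][tag] += 1
            unigram[word][tag] += 1
            tags[tag] += 1
            prev = tag
    return bigram, unigram, tags.most_common(1)[0][0]

def tag_backoff(tokens, model):
    """Tag left to right, falling back to a less specific table when needed."""
    bigram, unigram, default = model
    prev, out = "<s>", []
    for word in tokens:
        if (prev, word) in bigram:
            tag = bigram[(prev, word)].most_common(1)[0][0]
        elif word in unigram:
            tag = unigram[word].most_common(1)[0][0]
        else:
            tag = default
        out.append((word, tag))
        prev = tag
    return out

# Toy training data (hypothetical, Penn Treebank style tags).
toy = [
    [("i", "PRP"), ("can", "MD"), ("go", "VB")],
    [("the", "DT"), ("can", "NN"), ("fell", "VBD")],
    [("a", "DT"), ("can", "NN"), ("broke", "VBD")],
    [("chat", "NN"), ("data", "NN")],
]
model = train_backoff(toy)
in_context = tag_backoff(["i", "can", "fell"], model)
out_of_context = tag_backoff(["you", "can", "lol"], model)
```

In the first call "can" follows a pronoun seen in training and is tagged MD; in the second the context is unseen, so the tagger backs off to the word's most frequent tag (NN) and to the default tag for unknown words.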

Hitt – Dialogue Act/POS Tagging • But does POS tagging matter for dialogue act tagging (our actual goal in chat, SMS, etc.)? • Sure, but it doesn’t have to be that good • Instead of using chat at all, we (CAPT James Hitt, USN) simply generated the MLE tag for each word string (no WSD) from pre-existing resources (Treebank and Brown combined). • Just using these “cheap” parts of speech we get: 83.23%
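A minimal sketch of the "cheap" tagger idea: each word string gets its single most frequent tag from an existing tagged corpus, with no use of context. The toy corpus, the lowercasing, and the fallback to the overall most frequent tag for unseen words are assumptions for illustration, not details from the slides.

```python
from collections import Counter, defaultdict

def train_mle_tagger(tagged_sentences):
    """Map each word string to its most frequent tag in the training data."""
    counts = defaultdict(Counter)  # word -> tag counts
    tag_totals = Counter()         # overall tag counts
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            tag_totals[tag] += 1
    lexicon = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}
    default_tag = tag_totals.most_common(1)[0][0]
    return lexicon, default_tag

def tag(tokens, lexicon, default_tag):
    """Look each token up in the lexicon; unseen tokens get the default tag."""
    return [(tok, lexicon.get(tok.lower(), default_tag)) for tok in tokens]

# Toy stand-in for a resource such as Treebank plus Brown (hypothetical data).
toy_corpus = [
    [("the", "DT"), ("run", "NN"), ("was", "VBD"), ("long", "JJ")],
    [("they", "PRP"), ("run", "VBP"), ("daily", "RB")],
    [("we", "PRP"), ("run", "VBP"), ("the", "DT"), ("lab", "NN")],
    [("lab", "NN"), ("chat", "NN"), ("data", "NN")],
]
lexicon, default_tag = train_mle_tagger(toy_corpus)
tagged = tag(["the", "run", "lol"], lexicon, default_tag)
```

Here "run" is always tagged VBP because that is its majority tag in the toy data, regardless of where it appears in the post.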

L1 Language Identification • Using International Corpus of Learner English • For each author L2 = English (except for native speaker control group) • Texts in English • Task: Guess L1 • Using character 3-grams, we (LT Charles Ahn, USN) got: 81.3%
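Character 3-gram features can be extracted in a few lines. The sketch below only shows the feature extraction; the slides do not say how text was normalized, so the lowercasing and whitespace collapsing here are assumptions, and `char_ngrams` is a hypothetical helper name.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams after simple normalization."""
    text = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

profile = char_ngrams("The  the")
```

Each document becomes a bag of such counts, which can then be fed to any standard classifier.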

L1 Language Identification

CPOS and L1 Identification • Interestingly, CPOS n-grams work very well here too: • Cells contain average counts of documents over 26 trials • ML: Multi-class Logistic Regression

Boutwell – Authorship Detection in Twitter • Hot off the presses • Built a “social network” from the Twitter garden hose • Use it to simulate SMS messages within the group • If my phone is stolen, can it tell that it isn’t me writing SMS? • So, what do we need to do authorship detection over “SMS”? • Doesn’t seem to be a lot of authorship signal in SMS • Well, not in one, but in 23 there is • With a stream of 23 messages, we got 90% accuracy over 10 authors. • Authors are consistent in how they deal with the constraints? • More error/success analysis needed
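One simple way to turn single messages into 23-message streams is to pool consecutive messages from the same author before extracting features. The slides do not say how the streams were formed, so the non-overlapping windows and the helper name `message_windows` are assumptions.

```python
def message_windows(messages, size=23):
    """Join consecutive messages into non-overlapping windows of `size`.

    Leftover messages that do not fill a whole window are dropped.
    """
    return [
        " ".join(messages[i:i + size])
        for i in range(0, len(messages) - size + 1, size)
    ]

# Hypothetical stream of 50 short messages from one author.
stream = [f"msg{i}" for i in range(50)]
windows = message_windows(stream)
```

Each window is then treated as one document for authorship classification.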

Research to be explored • Can we build a better scientific understanding of different domains of text and develop a theory of what will be useful from pre-existing domains? What will be needed from the new domain? • How much can we actually do with as little as possible? • Do we need to parse? • Should we expand (e.g., ur), or generate new grammars? • I argue we build new models sooner rather than later • How do we get parallel corpora? • How do we get best practices for Mechanical Turk?
