Mason Chua Classifying news articles by whether they are about diseases and outbreaks
Global Viral Forecasting
• Predicting disease outbreaks early.
• Internet articles are a data source.
• My task: classify articles into three classes: unrelated, outbreak, just_disease
Example datum
Title: Australia burns, scores evacuated
Text: <!--PSTYLE=WL Web Lead--><p>Firefighters battled two wildfires in Australia, water bombing them from above as they tried to stop their spread. One on the outskirts of Perth destroyed at least 40 homes and left a firefighter injured. The two blazes have razed 1 600 hectares of forested land to the north and southeast of Perth since Saturday. Several people were treated for smoke inhalation.</p>
URL: http://www.iol.co.za/australia-burns-scores-evacuated-1.1022575
Annotation: unrelated
Annotation trust: 80%
(and some more)
Method: MaxEnt (1)
Expected it to be easy to distinguish the unrelated class from outbreak and just_disease. It was, with word-presence features:
• title_has_stem_x
• article_has_stem_x
Used the Porter stemmer (NLTK) to conflate paradigm elements: 'doctors' and 'doctor' become the same feature.
Stripped HTML.
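The feature extraction described on this slide might look roughly like the sketch below. It is an illustration under assumptions, not the code used in the project: the function names are hypothetical, HTML is stripped with a simple regular expression, and `crude_stem` is a deliberately simple stand-in that only removes a plural 's' (it is not the Porter algorithm).

```python
import re

def strip_html(raw):
    """Remove HTML comments and tags, leaving plain text."""
    no_comments = re.sub(r"<!--.*?-->", " ", raw, flags=re.DOTALL)
    return re.sub(r"<[^>]+>", " ", no_comments)

def crude_stem(word):
    """Stand-in stemmer (NOT Porter): strips a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def stem_presence_features(title, article_html, stem=crude_stem):
    """Return the set of binary stem-presence features for one article."""
    features = set()
    sources = (("title", title), ("article", strip_html(article_html)))
    for prefix, text in sources:
        for token in re.findall(r"[a-z]+", text.lower()):
            features.add(f"{prefix}_has_stem_{stem(token)}")
    return features

# Hypothetical input, invented for illustration.
feats = stem_presence_features(
    "Doctors warn of outbreak",
    "<!--PSTYLE=WL Web Lead--><p>A doctor confirmed two cases.</p>",
)
print(sorted(feats))
```

To use the Porter stemmer instead of the stand-in, pass `stem=nltk.stem.PorterStemmer().stem`, assuming NLTK is installed.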
With word-presence features
• Just with the stem-presence features:
  • title_has_stem_x
  • article_has_stem_x
• Next: need to distinguish just_disease and outbreak better
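For readers unfamiliar with MaxEnt: over binary presence features it is equivalent to multinomial logistic regression. The sketch below trains one with plain stochastic gradient ascent and no regularization. The toy examples, feature names, and hyperparameters (`epochs`, `lr`) are invented for illustration; the slides do not say which MaxEnt implementation or settings were used.

```python
import math
from collections import defaultdict

LABELS = ["unrelated", "outbreak", "just_disease"]

def predict_proba(weights, feats, labels=LABELS):
    """Softmax over the summed weights of the features that fire."""
    scores = {y: sum(weights.get((y, f), 0.0) for f in feats) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def train_maxent(examples, labels=LABELS, epochs=200, lr=0.5):
    """examples: list of (feature_set, gold_label) pairs."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in examples:
            probs = predict_proba(weights, feats, labels)
            for label in labels:
                # Log-likelihood gradient: observed minus expected count.
                grad = (1.0 if label == gold else 0.0) - probs[label]
                for f in feats:
                    weights[(label, f)] += lr * grad
    return weights

# Toy training set, invented for illustration only.
toy = [
    ({"title_has_stem_fire", "article_has_stem_firefight"}, "unrelated"),
    ({"title_has_stem_outbreak", "article_has_stem_case"}, "outbreak"),
    ({"title_has_stem_vaccin", "article_has_stem_studi"}, "just_disease"),
]
weights = train_maxent(toy)
probs = predict_proba(weights, {"title_has_stem_outbreak"})
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

On real data a regularized, batch-trained implementation would be a more sensible choice; this version only shows the shape of the model.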
just_disease versus outbreak
Expected just_disease and outbreak to consistently have different...
• named entities:
  Ex: has_ORGANIZATION_parliament
  Ex: mentions_an_organization
• websites of origin:
  Ex: subdomain_nature.com
NER hurt accuracy
subdomain_X didn't help
Best performance is still just the generic presence-of-stems features
From last slide: class F1 scores were 70%, 53%, 99%.
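A website-of-origin feature such as `subdomain_nature.com` could be derived from an article's URL along the lines below. This is a sketch under assumptions: the function name `origin_feature` is hypothetical, and dropping a leading `www.` is a guess made to match the slide's example, not something the slides specify.

```python
from urllib.parse import urlparse

def origin_feature(url):
    """Return a 'subdomain_<host>' feature name, or None if no host is found."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]  # assumption: 'www.' carries no signal
    return f"subdomain_{host}" if host else None

feature = origin_feature(
    "http://www.iol.co.za/australia-burns-scores-evacuated-1.1022575"
)
print(feature)  # subdomain_iol.co.za
```

Features like this fire once per source site, so they are sparse whenever most sites contribute only a few articles to the data.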