
Classifying news articles by whether they are about diseases and outbreaks

Mason Chua. Global Viral Forecasting aims to predict disease outbreaks early, and Internet articles are one of its data sources. The task: classify articles as unrelated, outbreak, or just_disease.


Presentation Transcript


  1. Classifying news articles by whether they are about diseases and outbreaks (Mason Chua)

  2. Global Viral Forecasting: predicting disease outbreaks early. Internet articles are a data source. My task: classify articles into three classes: unrelated, outbreak, and just_disease.

  3. Example datum
     Title: Australia burns, scores evacuated
     Text: <!--PSTYLE=WL Web Lead--><p>Firefighters battled two wildfires in Australia, water bombing them from above as they tried to stop their spread. One on the outskirts of Perth destroyed at least 40 homes and left a firefighter injured. The two blazes have razed 1 600 hectares of forested land to the north and southeast of Perth since Saturday. Several people were treated for smoke inhalation.</p>
     URL: http://www.iol.co.za/australia-burns-scores-evacuated-1.1022575
     Annotation: unrelated
     Annotation trust: 80%
     (plus some more fields)

  4. Method: MaxEnt (1). I expected the unrelated class to be easy to distinguish from outbreak and just_disease. It was, using word-presence features: title_has_stem_x and article_has_stem_x. I used the Porter stemmer (NLTK) to conflate paradigm elements: 'doctors' and 'doctor' become the same feature. HTML was stripped before feature extraction.
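The stem-presence features described on this slide might be computed roughly as follows. This is a minimal sketch, not the author's code: the function names `strip_html` and `stem_presence_features` and the letters-only tokenizer are assumptions, and when NLTK is not installed the sketch falls back to a crude plural-stripping stand-in rather than a real Porter stemmer.

```python
import re
from html.parser import HTMLParser

try:
    # The slides name NLTK's Porter stemmer; use it when available.
    from nltk.stem import PorterStemmer
    stem = PorterStemmer().stem
except ImportError:
    def stem(word):
        # Crude stand-in (NOT Porter): lowercase and drop a trailing "s".
        word = word.lower()
        return word[:-1] if word.endswith("s") and len(word) > 3 else word


class _TextExtractor(HTMLParser):
    """Collects text nodes; tags and comments are ignored."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def stem_presence_features(title, text):
    """Map an article to binary title_has_stem_x / article_has_stem_x features."""
    features = {}
    fields = (("title_has_stem_", title), ("article_has_stem_", strip_html(text)))
    for prefix, field in fields:
        # Assumption: tokens are maximal runs of ASCII letters.
        for token in re.findall(r"[A-Za-z]+", field):
            features[prefix + stem(token)] = True
    return features
```

With either stemmer, a title containing 'Doctors' and an article body containing 'doctor' produce `title_has_stem_doctor` and `article_has_stem_doctor`, which is the conflation the slide describes.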

  5. With word-presence features: using only the stem-presence features (title_has_stem_x and article_has_stem_x). Next: need to distinguish just_disease and outbreak better.
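For readers unfamiliar with MaxEnt, the classifier is equivalent to multinomial logistic regression over the binary features. The sketch below trains one with plain stochastic gradient ascent on the log-likelihood. It is only an illustration under stated assumptions: the slides do not say which MaxEnt implementation, optimizer, or regularization was used, and the names `train_maxent`, `predict_proba`, and `classify` are hypothetical.

```python
import math
from collections import defaultdict


def predict_proba(weights, features):
    """Softmax over per-class sums of the weights of the active features."""
    scores = {
        label: sum(w.get(f, 0.0) for f in features)
        for label, w in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def train_maxent(examples, labels, epochs=200, learning_rate=0.5):
    """Fit per-class feature weights by stochastic gradient ascent.

    examples: list of collections of active (binary) feature names
    labels:   list of class names, one per example
    Assumption: no regularization and a fixed learning rate.
    """
    weights = {label: defaultdict(float) for label in set(labels)}
    for _ in range(epochs):
        for features, gold in zip(examples, labels):
            probs = predict_proba(weights, features)
            for label in weights:
                # Gradient of the log-likelihood for an active binary feature:
                # (1 if this is the gold class else 0) - P(class | features)
                error = (1.0 if label == gold else 0.0) - probs[label]
                for f in features:
                    weights[label][f] += learning_rate * error
    return weights


def classify(weights, features):
    """Return the most probable class for a feature collection."""
    probs = predict_proba(weights, features)
    return max(probs, key=probs.get)
```

In practice the feature collections would come from the stem-presence extraction, and a library implementation with regularization would likely be preferable to this hand-rolled loop.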

  6. just_disease versus outbreak. I expected just_disease and outbreak articles to differ consistently in their named entities (e.g. has_ORGANIZATION_parliament, mentions_an_organization) and in their websites of origin (e.g. subdomain_nature.com). However, the NER features hurt accuracy and the subdomain_X features didn't help. The best performance still comes from the generic presence-of-stems features alone. From the last slide: class F1 scores were 70%, 53%, 99%.
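As an illustration of two ideas on this slide, the sketch below derives a subdomain_X feature from an article URL and computes per-class F1 from gold and predicted labels. Both functions are assumptions for illustration: the slides do not say how the host was normalized (stripping a leading "www." is a guess based on the example subdomain_nature.com), and `subdomain_feature` and `per_class_f1` are hypothetical names.

```python
from urllib.parse import urlparse


def subdomain_feature(url):
    """Build a subdomain_X feature name from the URL's host."""
    host = urlparse(url).hostname or ""
    # Assumption: a leading "www." is dropped before building the feature.
    if host.startswith("www."):
        host = host[len("www."):]
    return "subdomain_" + host


def per_class_f1(gold, predicted):
    """Return {class: F1}, where F1 = 2PR / (P + R) for each class."""
    scores = {}
    pairs = list(zip(gold, predicted))
    for label in set(gold) | set(predicted):
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores
```

Reporting F1 per class rather than overall accuracy makes it visible when one class, such as unrelated, is easy while the others remain hard to separate.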
