Over9K


Alex Meng

Chunshi Jin

Elliott Conant

Jonathan Fung

Agenda
  • What is Over9K?
  • Architecture
  • Crawler
  • Postprocessor
  • Extractor
  • Web Service
  • Summary
What is Over9K about?
  • Original goal: a system that predicts a stock’s future volatility based on news and other information gathered from the Internet.
  • Current goal: a system that crawls news sites for articles, identifies which companies are affected, and extracts events from the articles. All information is stored in a database that is accessed through our web service.
Crawler
  • Web crawler: Nutch
  • Domains we crawl:
    • www.cnbc.com
    • www.reuters.com
    • www.marketwatch.com
    • … (6 total)
  • Nutch’s Successes
  • Nutch’s Failures
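
Restricting a crawl to a fixed set of domains can be sketched as a simple host allow-list. This is only an illustration of the logic in Python, not the project's configuration: in Nutch itself such a restriction is normally expressed in its URL filter configuration. The function name `is_allowed` is hypothetical, and only the three domains named on the slide are listed because the other three are not given.

```python
from urllib.parse import urlparse

# Domains named on the slide. The deck says six were crawled in total,
# but the other three are not listed, so they are left out here.
ALLOWED_DOMAINS = {"cnbc.com", "reuters.com", "marketwatch.com"}


def is_allowed(url: str) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def filter_links(urls):
    """Keep only the links that stay inside the allowed domains."""
    return [u for u in urls if is_allowed(u)]
```

Matching on the dotted suffix rather than a bare substring keeps look-alike hosts such as `notreuters.com` out of the crawl.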
Postprocessor
  • Components:
    • NBClassifier
      • Classifies articles using Naive Bayes
    • DateParser
      • Parses dates using regular expressions
    • PageGetter
      • Retrieves training data from RSS feeds
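
A minimal sketch of the two text-processing components, assuming a multinomial Naive Bayes model with add-one smoothing and a single date pattern of the form "March 5, 2010". The slides do not say which labels, features, or date formats the project used, so the class internals, the `parse_date` helper, and the regular expression below are illustrative assumptions, not the authors' code.

```python
import math
import re
from collections import Counter, defaultdict
from datetime import date

MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Assumed format: month name (full or abbreviated), day, four-digit year.
DATE_RE = re.compile(r"\b([A-Za-z]{3})[a-z]*\.?\s+(\d{1,2}),?\s+(\d{4})\b")


def parse_date(text):
    """Return the first valid date found in text, or None."""
    for m in DATE_RE.finditer(text):
        month = MONTHS.get(m.group(1).lower())
        if month is None:
            continue
        try:
            return date(int(m.group(3)), month, int(m.group(2)))
        except ValueError:  # e.g. "Feb 31, 2010"
            continue
    return None


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


class NBClassifier:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest log-posterior, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for w in words:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Working in log space avoids numeric underflow on long articles, and the add-one smoothing keeps a single unseen word from zeroing out a class.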
Information Extraction (IE)
  • Tried several systems for IE
    • GATE
    • OpenCalais
    • CRF++
Comparison of IE tools
  • OpenCalais:
    • Web service. Easy to use.
    • Not extensible. No machine learning process.
    • Has usage quotas
  • GATE:
    • ANNIE (A Nearly-New Information Extraction system):
      • Tokenizer, Sentence Splitter, POS Tagger, Gazetteer, named-entity (NE) recognizer
    • JAPE: GATE’s rule engine.
    • Extensible with JAPE. Easy to use thanks to its regex-like syntax. Behavior is almost deterministic.
    • High precision for defined patterns, but low recall for sentences that match no defined pattern.
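
The precision/recall trade-off of hand-written rules can be illustrated with a plain Python regular expression. This is not JAPE syntax and not one of the project's rules: the acquisition pattern, the `COMPANY` sub-pattern, and the function name are all hypothetical.

```python
import re

# Hypothetical rule: "<Company> acquires|buys|to acquire <Company>".
# A "company" is approximated as a run of capitalized words.
COMPANY = r"[A-Z][A-Za-z&]*(?:\s[A-Z][A-Za-z&]*)*"
ACQUISITION_RULE = re.compile(
    rf"(?P<buyer>{COMPANY})\s+(?:acquires|buys|to acquire)\s+(?P<target>{COMPANY})"
)


def extract_acquisitions(sentence):
    """Return (buyer, target) pairs for every match of the rule."""
    return [(m.group("buyer"), m.group("target"))
            for m in ACQUISITION_RULE.finditer(sentence)]
```

A sentence such as "Oracle acquires Sun Microsystems" matches and yields a clean pair, while the passive paraphrase "Sun Microsystems was purchased by Oracle" yields nothing until someone writes another rule for it.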
Comparison of IE tools (cont.)
  • CRF++
    • Needs tools to preprocess content:
      • HTML to text
      • POS tagging / NE recognition (Stanford NLP library)
      • Extract other features when necessary
      • Convert the file to the train/test format required by CRF++
    • A template file defines the dependencies between features and labels.
    • Needs a large training set.
    • Labeling the training set is laborious.
    • Fairly good precision/recall. “Intelligence” may emerge.
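
The conversion step can be sketched as follows. CRF++ reads one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences. The column choice (token, POS tag, NE tag), the `B-EVENT`/`O` label scheme, and the template contents are assumptions for illustration; the slides do not say which features or labels were used.

```python
def to_crfpp(sentences):
    """Serialize sentences into CRF++ column format.

    sentences: list of sentences, each a list of (token, pos, ne, label)
    tuples. Tokens must not contain whitespace, because whitespace
    separates the columns.
    """
    lines = []
    for sentence in sentences:
        for token, pos, ne, label in sentence:
            lines.append("\t".join((token, pos, ne, label)))
        lines.append("")  # a blank line marks the end of a sentence
    return "\n".join(lines) + "\n"


# Example template: %x[row, col] picks a column relative to the current token.
# Column 0 = token, 1 = POS tag, 2 = NE tag (matching to_crfpp above).
TEMPLATE = """\
# Unigram features
U00:%x[0,0]
U01:%x[-1,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[0,2]
# Bigram of adjacent output labels
B
"""

example = [[
    ("Oracle", "NNP", "ORGANIZATION", "O"),
    ("acquires", "VBZ", "O", "B-EVENT"),
    ("Sun", "NNP", "ORGANIZATION", "O"),
]]
training_data = to_crfpp(example)
```

With the data and template written to files, training and tagging would typically run as `crf_learn template train.data model` followed by `crf_test -m model test.data`.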
Web Service
  • Technologies used:
    • YUI Toolkit
    • PHP
    • Apache
    • CSS
    • JavaScript
  • Layout description
Lessons and Thoughts
  • A realistic goal is critical.
  • The right tools are important.
  • Communication is key.
  • Future improvements
    • Controlled crawling
    • Improve feature extraction quality: POS tagger, NE recognition, etc.
    • Develop a model to predict volatility

Q&A

Thanks!