
Over9K

Alex Meng

Chunshi Jin

Elliott Conant

Jonathan Fung


Agenda

  • What is Over9K about

  • Architecture

  • Crawler

  • IE/Classifier

  • Web Interface

  • Summary


What is Over9K about?

  • Original goal: a system that predicts a stock’s future volatility from news and other information gathered from the Internet.

  • Overambitious, because of our ignorance.

  • What we built: a system that extracts information and events that may affect stock volatility. Users can search and browse the results.


Events to Extract

  • Reorganization

  • Bankruptcy

  • Product release

  • Earnings report


Architecture

[Architecture diagram. Components: Internet, Crawler, IE/Classifier, MySQL, Web Interface]
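
One plausible reading of these components is a pipeline: crawled pages pass through the IE/classifier stage, the extracted events are stored in the database, and the web interface queries it. The sketch below illustrates that reading only. The data-flow direction, the `Event` record, the keyword-based `classify` stand-in, the example URLs, and the use of in-memory SQLite in place of MySQL are all assumptions made for illustration, not details from the slides.

```python
import sqlite3
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Event:
    url: str
    event_type: str
    text: str


# Hypothetical keyword cues, keyed by the four event types from the
# "Events to Extract" slide. A stand-in for the real IE/classifier stage.
KEYWORDS = {
    "reorganization": ("reorganiz", "restructur"),
    "bankruptcy": ("bankrupt", "chapter 11"),
    "product release": ("launch", "release"),
    "earnings report": ("earnings", "quarterly results"),
}


def classify(text: str) -> Optional[str]:
    """Return the first event type whose cue appears in the text, else None."""
    lowered = text.lower()
    for event_type, cues in KEYWORDS.items():
        if any(cue in lowered for cue in cues):
            return event_type
    return None


def extract_events(pages: Iterable[Tuple[str, str]]) -> List[Event]:
    """Turn crawled (url, text) pairs into Event records, dropping non-events."""
    events = []
    for url, text in pages:
        event_type = classify(text)
        if event_type is not None:
            events.append(Event(url, event_type, text))
    return events


def store(conn: sqlite3.Connection, events: List[Event]) -> None:
    """Persist events (SQLite here; the slides name MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (url TEXT, event_type TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [(e.url, e.event_type, e.text) for e in events],
    )
    conn.commit()


def search(conn: sqlite3.Connection, event_type: str) -> List[str]:
    """What a web interface might run: list URLs for one event type."""
    rows = conn.execute(
        "SELECT url FROM events WHERE event_type = ? ORDER BY url", (event_type,)
    )
    return [row[0] for row in rows]
```

Calling `store(conn, extract_events(pages))` on a list of `(url, text)` pairs and then `search(conn, "bankruptcy")` walks one batch through all three stages.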


Crawler

  • Based on Apache Nutch

  • Crawled web sites:


IE/Classifier

  • Tried several systems for IE

    • GATE

    • OpenCalais

    • CRF++

  • Classifier

    • Mallet


Comparison of IE tools

  • OpenCalais:

    • Web service. Easy to use. No training step required.

    • Not extensible

    • Fairly good precision/recall

  • GATE:

    • ANNIE (A Nearly-New Information Extraction system):

      • Tokenizer, Sentence Splitter, POS Tagger, Gazetteer, NE (named-entity) recognition

    • JAPE: GATE’s rule engine.

    • Extensible with JAPE. Easy to use thanks to its regex-like syntax. Deterministic behavior.

    • High precision/recall for defined patterns, low for undefined patterns.
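
The trade-off in the last bullet can be illustrated without GATE. The sketch below is not JAPE: it uses Python's `re` module as a stand-in for a hand-written rule, and the company names, sentences, and gold labels are invented for illustration. The rule fires only on the phrasing it was written for, so it produces no false positives here but misses a bankruptcy sentence worded differently.

```python
import re

# Hypothetical hand-written rule: "<Company> files/filed for bankruptcy".
BANKRUPTCY_RULE = re.compile(r"\b([A-Z]\w+) (?:files|filed) for bankruptcy\b")


def rule_matches(sentence: str) -> bool:
    """Deterministic: the same sentence always gives the same answer."""
    return BANKRUPTCY_RULE.search(sentence) is not None


def precision_recall(predicted, gold):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up examples: (sentence, is it really a bankruptcy event?)
EXAMPLES = [
    ("Acme filed for bankruptcy on Monday.", True),
    ("Globex files for bankruptcy protection.", True),
    ("Initech went bankrupt last year.", True),  # phrasing the rule does not cover
    ("Hooli released a new phone.", False),
]

predicted = [rule_matches(sentence) for sentence, _ in EXAMPLES]
gold = [label for _, label in EXAMPLES]
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=1.00 recall=0.67
```

Adding a second rule for "went bankrupt" would raise recall on this toy set, which is the sense in which a rule-based system is extensible but only as good as the patterns someone has written.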


Comparison of IE tools (cont.)

  • CRF++

    • Needs tools to preprocess content:

      • HTML to text

      • POS tagging/NE (Stanford NLP library)

      • Extract other features when necessary

      • Convert files to the train/test format that CRF++ requires

    • A template file defines the dependencies between features and labels.

    • Labeling the training set is laborious.

    • Fairly good precision/recall. “Intelligence” may emerge.

    • Needs a large training set.
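
The conversion step and the template file can be sketched as follows. CRF++ expects one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences; template lines such as `U01:%x[0,0]` pick a column at a row offset relative to the current token. Everything else here is assumed for illustration: the helper name `to_crfpp`, the `B-EVENT`/`I-EVENT`/`O` tag set, and the example tokens, whose POS tags are hard-coded rather than produced by the Stanford tools.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, label)


def to_crfpp(sentences: List[List[Token]]) -> str:
    """Serialize labeled sentences into CRF++ training format.

    One token per line, tab-separated columns, label last,
    and an empty line between sentences.
    """
    blocks = []
    for sentence in sentences:
        lines = ["\t".join(token) for token in sentence]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# A small feature template. %x[row, col]: row is relative to the
# current token, col indexes the input columns (0 = word, 1 = POS).
TEMPLATE = """\
# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[-1,1]/%x[0,1]

# Bigram
B
"""

# Invented, hand-labeled example sentences.
SENTENCES = [
    [
        ("Acme", "NNP", "O"),
        ("files", "VBZ", "B-EVENT"),
        ("for", "IN", "I-EVENT"),
        ("bankruptcy", "NN", "I-EVENT"),
    ],
    [("Shares", "NNS", "O"), ("fell", "VBD", "O")],
]

train_data = to_crfpp(SENTENCES)
```

Writing `train_data` and `TEMPLATE` to files would give the two inputs that `crf_learn template train.data model` takes; the file names in that command are placeholders. Every token needs a label, which is why building the training set by hand is laborious.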



Lessons and Thoughts

  • A realistic goal is critical.

  • The right tools are important.

  • Future improvements

    • Controlled crawling

    • Improve feature-extraction quality: POS tagger, NE, etc.



Q&A

Thanks!

