
Over9K

Alex Meng

Chunshi Jin

Elliott Conant

Jonathan Fung

Agenda
  • What is Over9K about
  • Architecture
  • Crawler
  • IE/Classifier
  • Web Interface
  • Summary
What is Over9K about?
  • Original goal: a system to predict a stock's future volatility based on news and other information gathered from the Internet.
  • Overly ambitious, because of our ignorance.
  • What we have done: extract information/events that may affect the volatility of stocks. Users can search and browse them.
Events to Extract
  • Reorganization
  • Bankruptcy
  • Product release
  • Earnings report
Architecture
  • Web Interface
  • MySQL
  • IE/Classifier
  • Internet
  • Crawler
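
The architecture components could be wired together roughly as follows. This is a minimal sketch, not the team's actual code: it assumes the data flows Internet → Crawler → IE/Classifier → database → Web Interface, uses Python's built-in sqlite3 as a stand-in for MySQL, and replaces the real IE/classifier with naive keyword matching. Every function name, table name, keyword list, and sample document here is hypothetical.

```python
import sqlite3

# Hypothetical keyword lists, one per event type the project extracts.
EVENT_KEYWORDS = {
    "reorganization": ["reorganization", "restructuring"],
    "bankruptcy": ["bankruptcy", "chapter 11"],
    "product release": ["launches", "releases", "unveils"],
    "earnings report": ["earnings", "quarterly results"],
}

def crawl():
    """Stand-in for the crawler: yields (url, text) pairs (made-up samples)."""
    yield ("http://example.com/a", "Acme files for bankruptcy protection.")
    yield ("http://example.com/b", "Acme unveils a new phone.")
    yield ("http://example.com/c", "Weather was nice today.")

def classify(text):
    """Toy stand-in for the IE/classifier: first matching event type, or None."""
    lowered = text.lower()
    for event_type, keywords in EVENT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return event_type
    return None

def run_pipeline(conn):
    """Crawl, classify, and store every document that mentions an event."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (url TEXT, event_type TEXT, snippet TEXT)"
    )
    for url, text in crawl():
        event_type = classify(text)
        if event_type is not None:
            conn.execute(
                "INSERT INTO events VALUES (?, ?, ?)", (url, event_type, text)
            )
    conn.commit()

def search(conn, event_type):
    """The kind of query a web interface might issue against the store."""
    rows = conn.execute(
        "SELECT url, snippet FROM events WHERE event_type = ? ORDER BY url",
        (event_type,),
    )
    return rows.fetchall()
```

Calling `run_pipeline(sqlite3.connect(":memory:"))` and then `search(conn, "bankruptcy")` on the same connection returns only the stored documents tagged with that event type; documents that match no keyword are never written to the table.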

Crawler
  • Based on Nutch
  • Crawled web sites:
IE/Classifier
  • Tried several systems for IE
    • GATE
    • OpenCalais
    • CRF++
  • Classifier
    • Mallet
Comparison of IE tools
  • OpenCalais:
    • Web service. Easy to use. No machine learning process.
    • Not extensible
    • Fairly good precision/recall
  • GATE:
    • ANNIE (A Nearly-New Information Extraction system):
      • Tokenizer, Sentence Splitter, POS Tagger, Gazetteer, NE
    • JAPE: GATE's rule engine.
    • Extensible with JAPE. Easy to use thanks to its regex-like syntax. Deterministic behavior.
    • High precision/recall for defined patterns, low for undefined patterns.
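
The trade-off of rule-based extraction noted for GATE can be illustrated with a small analogy. The sketch below is not JAPE and does not use GATE; it is a plain Python regular expression standing in for a hand-written rule, and the rule, function name, and sample sentences are all hypothetical.

```python
import re

# Hypothetical hand-written rule: "<Company> files/filed for bankruptcy",
# where <Company> is one or more capitalized words.
BANKRUPTCY_RULE = re.compile(
    r"(?P<company>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*) (?:files|filed) for bankruptcy"
)

def extract_bankruptcies(text):
    """Return the company names matched by the rule, in order of appearance."""
    return [match.group("company") for match in BANKRUPTCY_RULE.finditer(text)]
```

A sentence that fits the pattern, such as "Acme Corp filed for bankruptcy on Monday.", is matched every time (deterministic, precise), while a paraphrase such as "Acme Corp went bankrupt." yields nothing until someone writes another rule for it.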
Comparison of IE tools (cont.)
  • CRF++
    • Needs tools to preprocess the content:
      • HTML to text
      • POS tagging/NE (Stanford NLP library)
      • Extract other features when necessary
      • Convert the file to the train/test format required by CRF++
    • A template file defines the dependencies between features and labels.
    • Labeling the training set is laborious.
    • Fairly good precision/recall. “Intelligence” may emerge.
    • Needs a large training set.
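
The last preprocessing step, converting tagged text to the CRF++ train/test format, can be sketched as follows. CRF++ expects one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences; its template file uses `%x[row,col]` macros to refer to columns relative to the current token. The function name, the POS tags, the `B-ORG`/`O` labels, and the particular template lines are illustrative assumptions, not the team's actual features.

```python
# Illustrative CRF++ feature template: unigram features over the token
# (column 0) of the previous, current, and next rows, plus the current
# POS tag (column 1); "B" adds label-bigram features.
TEMPLATE = """\
# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
# Bigram
B
"""

def to_crfpp(sentences):
    """Serialize sentences into CRF++ column format.

    sentences: list of sentences, each a list of (token, pos_tag, label).
    Tokens must not contain whitespace, since columns are tab-separated.
    """
    lines = []
    for sentence in sentences:
        for token, pos_tag, label in sentence:
            lines.append(f"{token}\t{pos_tag}\t{label}")
        lines.append("")  # a blank line marks the end of a sentence
    return "\n".join(lines) + "\n"
```

For example, `to_crfpp([[("Acme", "NNP", "B-ORG"), ("files", "VBZ", "O")]])` produces two tab-separated rows followed by a blank line. In a real pipeline the POS tags would come from a tagger such as the Stanford NLP library mentioned above, and the labels from manual annotation.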
Lessons and Thoughts
  • A realistic goal is critical.
  • The right tools are important.
  • Future improvements
    • Controlled crawling
    • Improve feature extraction quality: POS tagger, NE, etc.

Q&A

Thanks!
