Over9K Alex Meng Chunshi Jin Elliott Conant Jonathan Fung
Agenda • What is Over9K about • Architecture • Crawler • IE/Classifier • Web Interface • Summary
What is Over9K about? • Original goal: a system to predict a stock’s future volatility from news and other information gathered from the Internet. • Overambitious, because of our own ignorance. • What we have done: extract information and events that may affect the volatility of stocks. Users can search and browse them.
Events to Extract • Reorganization • Bankruptcy • Product release • Earnings report
Architecture Web Interface MySQL IE/Classifier Internet Crawler
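One way the components on this slide might fit together, as a minimal sketch only. It reads the diagram as: the IE/Classifier writes extracted events to the database, and the web interface queries them. The table name `events`, its columns, and the helper functions are hypothetical, and `sqlite3` stands in for MySQL so the example is self-contained.

```python
import sqlite3


def init_store(conn):
    """Create a hypothetical table for extracted events (schema is assumed)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "id INTEGER PRIMARY KEY, company TEXT, event_type TEXT, "
        "url TEXT, snippet TEXT)"
    )


def store_event(conn, company, event_type, url, snippet):
    """Called by the IE/Classifier stage once an event has been extracted."""
    conn.execute(
        "INSERT INTO events (company, event_type, url, snippet) "
        "VALUES (?, ?, ?, ?)",
        (company, event_type, url, snippet),
    )
    conn.commit()


def search_events(conn, company):
    """Called by the web interface to list stored events for one company."""
    cur = conn.execute(
        "SELECT event_type, url, snippet FROM events "
        "WHERE company = ? ORDER BY id",
        (company,),
    )
    return cur.fetchall()


# In-memory database as a stand-in for the MySQL instance on the slide.
conn = sqlite3.connect(":memory:")
init_store(conn)
store_event(conn, "Acme", "Bankruptcy", "http://example.com/a",
            "Acme filed for bankruptcy.")
```

Calling `search_events(conn, "Acme")` returns the stored row, which is the kind of lookup the search and browse pages would need.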
Crawler • Based on Apache Nutch • Crawled web sites: • …
IE/Classifier • Tried several systems for IE • GATE • OpenCalais • CRF++ • Classifier • Mallet
Comparison of IE tools • OpenCalais: • Web service. Easy to use. No machine-learning training step needed. • Not extensible • Fairly good precision/recall • GATE: • ANNIE (A Nearly-New Information Extraction system): • Tokenizer, Sentence Splitter, POS Tagger, Gazetteer, NE • JAPE: GATE’s rule engine. • Extensible with JAPE. Easy to use thanks to its regex-like syntax. Deterministic behavior. • High precision/recall for defined patterns, low for undefined patterns.
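The trade-off noted for rule-based extraction (high precision/recall on defined patterns, low on undefined ones) can be illustrated with a rough Python analogue. This is not JAPE syntax and not the project's actual rules. The trigger phrases in `GAZETTEER` and the function `match_events` are hypothetical.

```python
import re

# Hypothetical gazetteer of trigger phrases per event type (illustrative only).
GAZETTEER = {
    "Bankruptcy": ["files for bankruptcy", "filed for bankruptcy", "chapter 11"],
    "Product release": ["launches", "unveils"],
}

# One case-insensitive pattern per event type, built from its trigger phrases.
PATTERNS = {
    event: re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    for event, phrases in GAZETTEER.items()
}


def match_events(sentence):
    """Return every event type whose pattern occurs in the sentence."""
    return [event for event, pattern in PATTERNS.items()
            if pattern.search(sentence)]
```

A sentence such as "Acme filed for bankruptcy on Monday." matches deterministically, while a paraphrase like "Acme went bust." matches nothing, which is the low recall on undefined patterns mentioned above.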
Comparison of IE tools (cont.) • CRF++ • Needs tools to preprocess content: • HTML to text • POS tagging/NE (Stanford NLP library) • Extract other features when necessary • Convert files to the train/test format CRF++ requires • Template file defines the dependencies between features and labels. • Labeling the training set is laborious • Fairly good precision/recall. “Intelligence” may emerge. • Needs a large training set.
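The conversion step can be sketched as follows. CRF++ reads one token per line with whitespace-separated columns, takes the last column as the label, and treats a blank line as a sentence boundary. The BIO-style labels, the example sentence, and the function `to_crfpp` are assumptions for illustration; the slides do not give the actual label set or feature columns.

```python
from typing import List, Tuple

# One token = (word, POS tag, label). The label scheme here is hypothetical.
Token = Tuple[str, str, str]


def to_crfpp(sentences: List[List[Token]]) -> str:
    """Serialize tagged sentences into CRF++ column format.

    Each token is written on its own line with tab-separated columns,
    and every sentence is followed by a blank line.
    """
    parts = []
    for sentence in sentences:
        lines = ["\t".join(token) for token in sentence]
        parts.append("\n".join(lines) + "\n\n")
    return "".join(parts)


# A small CRF++ template: unigram features over the word (column 0) and
# POS tag (column 1) in a one-token window, plus a bigram over output labels.
TEMPLATE = """\
# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[0,0]/%x[0,1]

# Bigram
B
"""

example = [[
    ("Acme", "NNP", "O"),
    ("files", "VBZ", "B-BANKRUPTCY"),
    ("for", "IN", "I-BANKRUPTCY"),
    ("bankruptcy", "NN", "I-BANKRUPTCY"),
    (".", ".", "O"),
]]

train_text = to_crfpp(example)
```

In the template, `%x[row,col]` refers to a column value at a row offset relative to the current token, so `U03:%x[0,1]` uses the current token's POS tag as a feature.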
Lessons and Thoughts • A realistic goal is critical. • The right tools are important. • Future Improvement • Controlled crawling • Improve feature extraction quality: POS tagger/NE, etc.
Q&A Thanks!