CMU TEAM-A in TDT 2004 Topic Tracking

Yiming Yang

School of Computer Science

Carnegie Mellon University

CMU Team-1 in TDT 2004 Workshop


CMU Team A

  • Jaime Carbonell (PI)

  • Yiming Yang (Co-PI)

  • Ralf Brown

  • Jian Zhang

  • Nianli Ma

  • Shinjae Yoo

  • Bryan Kisiel, Monica Rogati, Yi Chang

Participated Tasks in TDT 2004

  • Topic Tracking (Nianli Ma et al.)

  • Supervised Adaptive Tracking (Yiming Yang et al.)

  • New Event Detection (Jian Zhang et al.)

  • Link Detection (Ralf Brown)

  • Hierarchical Topic Detection – not participated



Topic Tracking with Supervised Adaptation

(“Adaptive Filtering” in TREC)

[Timeline diagram: labeled training documents (past) for Topics 1-3, with on-topic and off-topic examples, followed by unlabeled test documents; the current document's label is supplied by relevance feedback.]



Topic Tracking with Pseudo-Relevance

(“Topic Tracking” in TDT)

[Timeline diagram: the same setting, but the current document's on-topic status is unknown; the tracker must rely on pseudo-relevance feedback (PRF) rather than true labels.]

Adaptive Rocchio with PRF

  • Conventional version

  • Improved version
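The slide's two Rocchio variants were shown as formulas that are not preserved in this transcript. As a rough sketch of the general idea only, assuming cosine similarity over document vectors, where `threshold` and `prf_weight` are illustrative values rather than the CMU settings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def track_with_prf(centroid, stream, threshold=0.3, prf_weight=0.2):
    """Score each document against the topic centroid; documents above
    the threshold are declared on-topic and folded back into the centroid
    with a reduced weight (pseudo-relevance feedback).
    threshold and prf_weight are illustrative, not the CMU settings."""
    decisions, c = [], list(centroid)
    for doc in stream:
        on_topic = cosine(c, doc) >= threshold
        decisions.append(on_topic)
        if on_topic:
            # Rocchio-style update: add the pseudo-relevant document,
            # down-weighted because its label is only a guess.
            c = [ci + prf_weight * di for ci, di in zip(c, doc)]
    return decisions, c
```

The "weighted PRF" variant on the next slide down-weights pseudo-relevant documents relative to truly labeled ones, in the spirit of the `prf_weight` factor above.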

Rocchio in Tracking on TDT 2003 Data

Weighted PRF reduced Ctrk by 12%.

Ctrk: the cost of tracking, a weighted combination of the miss rate and the false-alarm rate (the lower, the better)
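This cost can be computed directly from the decision and label streams. The sketch below assumes the commonly cited TDT cost parameters (Cmiss = 1, Cfa = 0.1, Ptarget = 0.02); the presentation itself does not state them:

```python
def tracking_cost(decisions, labels, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """TDT-style weighted detection cost:
    Ctrk = c_miss * P(miss) * p_target + c_fa * P(fa) * (1 - p_target).
    The constants are the commonly cited TDT defaults, assumed here."""
    misses = sum(1 for d, y in zip(decisions, labels) if y and not d)
    fas = sum(1 for d, y in zip(decisions, labels) if not y and d)
    n_on = sum(1 for y in labels if y)
    n_off = len(labels) - n_on
    p_miss = misses / n_on if n_on else 0.0
    p_fa = fas / n_off if n_off else 0.0
    return c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
```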

Primary Tracking Results in TDT 2004

DET Curves of Methods on TDT 2004 Data

[DET curve plot comparing the methods, with an annotation marking Charles' target.]

Supervised Adaptive Tracking

  • “Adaptive filtering” in TREC (since 1997)

    • Rocchio with threshold calibration strategies (Yang et al., CIKM 2003)

  • Probabilistic models assuming Gaussian/exponential score distributions (Arampatzis et al., TREC 2001)

    • Combined use of Rocchio and Logistic regression (Yi Zhang, SIGIR 2004)

  • A new task in TDT 2004

    • Topics are narrower and typically shorter-lived than TREC topics

Our Experiments

  • 4 methods

    • Rocchio with a fixed threshold (Roc.fix)

    • Rocchio with an adaptive threshold using Margin-based Local Regression (Roc.MLR)

    • Nearest Neighbor (Ralf’s variant) with a fixed threshold (kNN.fix)

    • Logistic regression (LR) regularized by a complexity penalty

  • 3 corpora

    • TDT5 corpus, as the evaluation set in TDT 2004

    • TDT4 corpus, as a validation set for parameter tuning

    • TREC11 (2002) corpus, as a reference set for robustness analysis

  • 2 optimization criteria

    • Ctrk: TDT standard, equivalent to setting the penalty ratio for miss vs. false alarm to approximately 1270:1

    • T11SU: TREC standard, equivalent to the penalty ratio of 2:1
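For a probabilistic tracker, a penalty ratio translates directly into a Bayes-optimal decision threshold. This sketch (our illustration, not from the slides) shows how far apart the two operating points are:

```python
def optimal_threshold(miss_penalty, fa_penalty=1.0):
    """Bayes-optimal acceptance threshold on Pr(topic | doc):
    declare on-topic when the expected cost of a miss,
    miss_penalty * p, exceeds the expected cost of a false alarm,
    fa_penalty * (1 - p). Solving gives p >= fa / (miss + fa)."""
    return fa_penalty / (miss_penalty + fa_penalty)

# T11SU-style 2:1 ratio -> threshold 1/3;
# Ctrk-style ~1270:1 ratio -> threshold ~1/1271, i.e. accept almost anything
# with non-negligible probability of being on-topic.
t11su_thresh = optimal_threshold(2)
ctrk_thresh = optimal_threshold(1270)
```

The huge gap between the two thresholds is one way to see why systems tuned for Ctrk and for T11SU behave so differently.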

Outline of Our Methods

  • Roc.fix and kNN.fix

    • Non-probabilistic model, generating ad hoc scores for documents with respect to each topic

    • Fixed global threshold, tuned on a retrospective corpus

  • Roc.MLR

    • Non-probabilistic model, ad hoc scores

    • Threshold locally optimized using incomplete relevance judgments for a sliding window of documents

  • LR

    • Probabilistic modeling of Pr(topic | x)

    • Fixed global threshold that optimizes the utility

Regularized Logistic Regression

  • The objective is to find the regression coefficients that minimize the regularized training loss

  • This is equivalent to Maximum A Posteriori (MAP) estimation with a prior distribution on the coefficients

  • It predicts the probability of a topic given the data
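The slide's formulas are not preserved in this transcript. A minimal sketch of the stated idea, penalized log-loss minimized by gradient descent, which is equivalent to MAP estimation under a zero-mean Gaussian prior on the coefficients (hyperparameters here are illustrative, not the CMU settings):

```python
import math

def train_lr(X, y, lam=0.1, lr=0.5, epochs=200):
    """Regularized logistic regression via gradient descent.
    Minimizes sum_i log(1 + exp(-y_i * w.x_i)) + lam * ||w||^2,
    with labels y_i in {-1, +1}."""
    dim = len(X[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [2 * lam * wj for wj in w]          # gradient of the penalty
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            coef = -yi / (1.0 + math.exp(margin))  # gradient of the log-loss
            for j in range(dim):
                grad[j] += coef * xi[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def prob_on_topic(w, x):
    """Pr(topic | x) under the fitted model."""
    return 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
```

The probabilistic output is what lets LR use a fixed utility-optimizing threshold, as noted on the previous slide.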

Roc.fix on TDT3 Corpus

Relevance feedback on 1.6% of documents yielded a 25% min-cost reduction.

Legend: Base = no RF or PRF; PRF = weighted PRF; MLR = partial RF; FRF = complete RF

Effect of SA vs. PRF: on TDT5 Corpus

With Rocchio.fix: SA reduced Ctrk by 54% compared to PRF;

With Nearest Neighbors: SA reduced Ctrk by 48%.

SA Tracking Results on TDT5 Corpus

[Results table, reporting Ctrk (the lower, the better) and T11SU (the higher, the better) per team.]

For each team, the best score (with respect to Ctrk or T11SU) among its submitted runs is shown.

Relative Performance of Our Methods

TREC Utility (T11SU): Penalty of miss vs. f/a = 2:1

TDT Cost (Ctrk): Penalty of miss vs. f/a ~= 1270:1

Main Observations

  • Encouraging results: a small amount of relevance feedback (on 1-2% of documents) yielded significant performance improvements

  • Puzzling point: Rocchio without any threshold calibration works surprisingly well on both Ctrk and T11SU, which is inconsistent with our observations on TREC data. Why?

  • Scaling issue: a significant challenge for learning algorithms such as LR and MLR in the TDT domain

Temporal Nature of Topics/Events

[Timeline plots contrasting temporal profiles at three granularities: a TDT event (the November APEC meeting), a broadcast-news topic (kidnappings), and a TREC topic (elections).]

Topics for Future Research

  • Keep up with new algorithms/theories

  • Exploit domain knowledge, e.g., predefined topics (and super topics) in a hierarchical setting

  • Investigate topic-conditioned event tracking with predictive features (including Named Entities)

  • Develop algorithms to detect and exploit temporal trends

  • TDT in cross-lingual settings

References

  • Y. Yang and B. Kisiel. Margin-based Local Regression for Adaptive Filtering. ACM CIKM 2003 (Conference on Information and Knowledge Management).

  • J. Zhang and Y. Yang. Robustness of regularized linear classification methods in text categorization. ACM SIGIR 2003, pp. 190-197.

  • J. Zhang, R. Jin, Y. Yang and A. Hauptmann. Modified logistic regression: an approximation to SVM and its application in large-scale text categorization. ICML 2003 (International Conference on Machine Learning), pp. 888-897.

  • N. Ma, Y. Yang & M. Rogati. Cross-Language Event Tracking. Asia Information Retrieval Symposium (AIRS), 2004.
