Collaborative Classifier Agents: Studying the Impact of Learning in Distributed Information Retrieval Weimao Ke, Javed Mostafa, and Yueyu Fu. Submitted to SIGIR 2006. Weimao Ke wke@indiana.edu School of Library and Information Science Indiana University Bloomington
Layout • Introduction • Classification and Knowledge Distribution • Learning in a Distributed Environment • Experiment Design and Setup • Experimental Results • Future Work
Introduction • Distributed nature of knowledge • Collaboration is important • e.g., the WWW • Distributed Information Retrieval • vs. traditional/centralized IR • e.g., intra-system retrieval fusion, cross-system communication, decentralized P2P networks, and distributed information storage and retrieval algorithms • Our focus • Modeling distributed agent collaboration for text/information classification Motivation: Why distributed? Why not centralized?
Knowledge Distribution • In classification, knowledge = class vectors • Vector Space Model (VSM) • Traditional centralized approach: • All class vectors in one place • Global knowledge • Our distributed approach: • A subset of class vectors per agent • Local/partial/limited distributed knowledge
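The centralized-vs-distributed contrast above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class names, toy term weights, and the cosine-similarity classifier are assumptions chosen to show how global knowledge (all class vectors in one place) differs from local knowledge (each agent holding only a subset).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Centralized: all class vectors in one place (global knowledge).
# Class names and weights here are illustrative, not from the paper.
global_classes = {"grain": [1.0, 0.2, 0.0], "trade": [0.1, 0.9, 0.3]}

# Distributed: each agent holds only a subset (local/partial knowledge).
agent_a_classes = {"grain": global_classes["grain"]}
agent_b_classes = {"trade": global_classes["trade"]}

doc = [0.2, 0.8, 0.4]  # a document's term-weight vector
best = max(global_classes, key=lambda c: cosine(doc, global_classes[c]))
```

With global knowledge the document is simply assigned to its most similar class; with only `agent_a_classes`, the best match may be a poor one, which is what makes collaboration necessary.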
Distributed Information Classification [Figure: agents 1–4, each with a collaboration range, receive streamed documents from an Admin Agent and exchange collaboration requests and responses. Legend: # Agent (#: collaboration range); Collaboration request; Collaboration response] Presentation: Agent topology? Does everyone know each other? Motivation: Why not centralize in the Admin Agent? Motivation: Why the need for this streaming model?
Learn to Collaborate [Flowchart: an agent receives a document and compares it to every local class. Is the max similarity score >= threshold? Yes: label the doc (classified). No: learning… ask for help… but WHO should I ask?]
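The decision in the flowchart above can be sketched as a small helper. This is an illustrative sketch, not the MACCI code: the function name `classify_locally` and its signature are assumptions; the similarity function is passed in so any measure (e.g., cosine) can be used.

```python
def classify_locally(doc_vec, local_classes, threshold, similarity):
    """Compare the document to every local class vector.

    Return (label, score) if the best local match clears the
    threshold; otherwise None, signalling 'ask a neighbor for help'.
    """
    if not local_classes:
        return None
    label, score = max(
        ((c, similarity(doc_vec, v)) for c, v in local_classes.items()),
        key=lambda pair: pair[1],
    )
    return (label, score) if score >= threshold else None
```

A `None` result is exactly the branch where learning kicks in: the agent must predict which neighbor to ask.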
Pursuit Learning • Pursuit Learning – a reinforcement learning technique • Action probability vector P = [p1 .. pN] • N: # actions = # neighbor agents • Exploration rate r • The rate of picking a helping agent at random • To "explore" without using learned knowledge • To predict a helping agent when local classification fails • Draw a random number rand • If rand < r, randomly choose an agent for help • Otherwise, predict based on vector P • To learn when another agent has helped • Reward that agent • Update vector P Presentation: Redundant description in PL and NCL algorithms?
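The prediction and update steps above can be sketched as follows. This is a minimal sketch, assuming the standard pursuit-learning update (shift probability mass toward the rewarded action); the class name, the learning rate `lam`, and its value are assumptions, not constants from the paper.

```python
import random

class PursuitLearner:
    """Sketch of pursuit learning over N neighbor agents."""

    def __init__(self, n_neighbors, explore_rate=0.1, lam=0.05):
        self.p = [1.0 / n_neighbors] * n_neighbors  # action probability vector P
        self.r = explore_rate                       # exploration rate r
        self.lam = lam                              # learning rate (assumed)

    def choose(self):
        """Predict a helping agent: explore with prob. r, else sample from P."""
        if random.random() < self.r:
            return random.randrange(len(self.p))
        x, acc = random.random(), 0.0
        for i, pi in enumerate(self.p):
            acc += pi
            if x <= acc:
                return i
        return len(self.p) - 1

    def reward(self, i):
        """Neighbor i helped: move probability mass toward action i."""
        self.p = [pj * (1 - self.lam) for pj in self.p]
        self.p[i] += self.lam
```

Note that the update never looks at document content; the learner only tracks which neighbors have helped, which is why PL is cheap but not content sensitive.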
Nearest Centroid Learning • Nearest Centroid Learning – content sensitive • Neighbor centroid vector C = [c1 .. cN] • N: # actions = # neighbor agents • Each element ci is the centroid of the documents the ith neighbor agent has helped with • Exploration rate r • To predict when the agent fails to classify a doc • Draw a random number rand • If rand < r, randomly choose an agent for help • Otherwise, find the centroid in C nearest to the current document and ask the corresponding agent for help • To learn when another agent has helped • Update the corresponding centroid by including the document • Nearest Centroid Learning – content sensitive • Pursuit Learning – not content sensitive
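A matching sketch for the centroid-based variant, under the same caveats: class and method names are assumptions, and cosine similarity is assumed as the nearness measure (consistent with the VSM setting). Each neighbor's centroid is the running mean of the documents it has successfully helped with.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class NearestCentroidLearner:
    """Sketch: one centroid per neighbor, built from documents
    that neighbor has helped classify."""

    def __init__(self, n_neighbors, dim, explore_rate=0.1):
        self.centroids = [[0.0] * dim for _ in range(n_neighbors)]
        self.counts = [0] * n_neighbors
        self.r = explore_rate

    def choose(self, doc_vec):
        """Explore with prob. r, else ask the neighbor whose centroid
        is nearest to the current document."""
        if random.random() < self.r or not any(self.counts):
            return random.randrange(len(self.centroids))
        scored = [(i, cosine(doc_vec, c))
                  for i, c in enumerate(self.centroids) if self.counts[i]]
        return max(scored, key=lambda t: t[1])[0]

    def reward(self, i, doc_vec):
        """Neighbor i helped: fold the document into its centroid (running mean)."""
        n = self.counts[i]
        self.centroids[i] = [(c * n + d) / (n + 1)
                             for c, d in zip(self.centroids[i], doc_vec)]
        self.counts[i] = n + 1
```

Unlike pursuit learning, `choose` must compute a similarity against every centroid per document, which is the source of NCL's higher cost noted in the summary.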
Experiment Design • Reuters Corpus Volume 1 (RCV1) • Training set: 6,394 documents • Test set: 2,500 documents • Feature selection: 4,084 unique terms • Evaluation measures • Precision P = a / (a + b) • Recall R = a / (a + c) • F1 = 2 * P * R / (P + R) Impact: small test collection size, scalability, etc.; larger, more recent collections suggested.
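The three measures above translate directly into code. A minimal sketch, with the usual contingency-table reading of the slide's symbols: `a` = true positives, `b` = false positives, `c` = false negatives; the zero-division guards are an added convenience.

```python
def prf1(a, b, c):
    """Precision, recall, and F1 from contingency counts:
    a = true positives, b = false positives, c = false negatives."""
    p = a / (a + b) if a + b else 0.0   # Precision = a / (a + b)
    r = a / (a + c) if a + c else 0.0   # Recall    = a / (a + c)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, 8 correct labels with 2 false positives and 2 misses gives P = R = F1 = 0.8.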
Hardware and Software Setup • MACCI • Multi-Agent Collaboration for Classification of Information • Cougaar agent architecture • Weka machine learning framework • Hardware • Dual Intel Xeon 2.8 GHz CPUs • 3.5 GB RAM (2 GB reserved) • Software • Red Hat Linux AS 4 • Java Runtime Environment 1.5.0
Results – Effectiveness Baselines • Presentation/Argument: • Use both micro & macro F scores; • ROC curves suggested.
Results – Effectiveness of Learning [Figure: effectiveness of PL versus random collaboration, with an optimal zone marked]
Results – Learning Progression and Latency • Cumulative effectiveness over time [Figure: curves for Pursuit Learning and Nearest Centroid Learning]
Results – Learning Progression and Latency • In-session classification effectiveness over time
Results – Classification Efficiency [Figure: classification efficiency compared against the baseline]
Summary • Classification effectiveness decreases dramatically as knowledge becomes increasingly distributed. • Pursuit Learning • Efficient – does not analyze content • Effective, although not content sensitive • "The Pursuit Learning approach did not depend on document content. By acquiring knowledge through reinforcements based on collaborations, this algorithm was able to construct paths for documents to find relevant classifiers effectively and efficiently." • Nearest Centroid Learning • Inefficient – has to analyze content • Effective • Learning did not converge in some experiments • A larger test set for future study • Exploration rate (r): a function instead of a constant
Thank you • Questions… • Comments… A copy of the submitted paper is available at: http://tara.slis.indiana.edu/macci/docs/sigir-agent-ke.pdf