Collaborative Classifier Agents: Studying the Impact of Learning in Distributed Information Retrieval Weimao Ke, Javed Mostafa, and Yueyu Fu. Submitted to SIGIR 2006. Weimao Ke wke@indiana.edu School of Library and Information Science Indiana University Bloomington
Layout • Introduction • Classification and Knowledge Distribution • Learning in a Distributed Environment • Experiment Design and Setup • Experimental Results • Future Work
Introduction • Distributed nature of knowledge • Collaboration is important • e.g., the WWW • Distributed Information Retrieval • vs. traditional/centralized IR • e.g., intra-system retrieval fusion, cross-system communication, decentralized P2P networks, and distributed information storage and retrieval algorithms • Our focus • Modeling distributed agent collaboration for text/information classification Motivation: Why distributed? Why not centralized?
Knowledge Distribution • In classification, knowledge = class vectors • Vector Space Model (VSM) • Traditional centralized approach: • All class vectors in one place • Global knowledge • Our distributed approach: • A subset of class vectors per agent • Local/partial/limited distributed knowledge
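The centralized-vs-distributed contrast above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class names, toy term weights, and the cosine-similarity classifier are assumptions chosen to show how global knowledge (all class vectors in one place) differs from local knowledge (each agent holding only a subset).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Centralized: all class vectors in one place (global knowledge).
# Class names and weights here are illustrative, not from the paper.
global_classes = {"grain": [1.0, 0.2, 0.0], "trade": [0.1, 0.9, 0.3]}

# Distributed: each agent holds only a subset (local/partial knowledge).
agent_a_classes = {"grain": global_classes["grain"]}
agent_b_classes = {"trade": global_classes["trade"]}

doc = [0.2, 0.8, 0.4]  # a document's term-weight vector
best = max(global_classes, key=lambda c: cosine(doc, global_classes[c]))
```

With global knowledge the document is simply assigned to its most similar class; with only `agent_a_classes`, the best match may be a poor one, which is what makes collaboration necessary.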
Distributed Information Classification [Figure: agents 1–4, each with a collaboration range, receive streamed documents from an Admin Agent and exchange collaboration requests and responses. Legend: # Agent (#: collaboration range); Collaboration request; Collaboration response] Presentation: Agent topology? Does everyone know each other? Motivation: Why not centralize in the Admin Agent? Motivation: Why the need for this streaming model?
Learn to Collaborate [Flowchart: an agent receives a document and compares it to every local class. Is the max similarity score >= threshold? Yes: label the doc (classified). No: learning… ask for help… but WHO should I ask?]
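The decision in the flowchart above can be sketched as a small helper. This is an illustrative sketch, not the MACCI code: the function name `classify_locally` and its signature are assumptions; the similarity function is passed in so any measure (e.g., cosine) can be used.

```python
def classify_locally(doc_vec, local_classes, threshold, similarity):
    """Compare the document to every local class vector.

    Return (label, score) if the best local match clears the
    threshold; otherwise None, signalling 'ask a neighbor for help'.
    """
    if not local_classes:
        return None
    label, score = max(
        ((c, similarity(doc_vec, v)) for c, v in local_classes.items()),
        key=lambda pair: pair[1],
    )
    return (label, score) if score >= threshold else None
```

A `None` result is exactly the branch where learning kicks in: the agent must predict which neighbor to ask.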
Pursuit Learning • Pursuit Learning – a reinforcement learning technique • Action probability vector P = [p1 .. pN] • N: # actions = # neighbor agents • Exploration rate r • The rate of picking a helping agent at random • To "explore" without using learned knowledge • To predict a helping agent when local classification fails • Draw a random number rand • If rand < r, randomly choose an agent for help • Otherwise, predict based on vector P • To learn when another agent has helped • Reward that agent • Update vector P Presentation: Redundant description in PL and NCL algorithms?
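The prediction and update steps above can be sketched as follows. This is a minimal sketch, assuming the standard pursuit-learning update (shift probability mass toward the rewarded action); the class name, the learning rate `lam`, and its value are assumptions, not constants from the paper.

```python
import random

class PursuitLearner:
    """Sketch of pursuit learning over N neighbor agents."""

    def __init__(self, n_neighbors, explore_rate=0.1, lam=0.05):
        self.p = [1.0 / n_neighbors] * n_neighbors  # action probability vector P
        self.r = explore_rate                       # exploration rate r
        self.lam = lam                              # learning rate (assumed)

    def choose(self):
        """Predict a helping agent: explore with prob. r, else sample from P."""
        if random.random() < self.r:
            return random.randrange(len(self.p))
        x, acc = random.random(), 0.0
        for i, pi in enumerate(self.p):
            acc += pi
            if x <= acc:
                return i
        return len(self.p) - 1

    def reward(self, i):
        """Neighbor i helped: move probability mass toward action i."""
        self.p = [pj * (1 - self.lam) for pj in self.p]
        self.p[i] += self.lam
```

Note that the update never looks at document content; the learner only tracks which neighbors have helped, which is why PL is cheap but not content sensitive.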
Nearest Centroid Learning • Nearest Centroid Learning – content sensitive • Neighbor centroid vector C = [c1 .. cN] • N: # actions = # neighbor agents • Each element ci is the centroid of the documents the ith neighbor agent has helped with • Exploration rate r • To predict when the agent fails to classify a doc • Draw a random number rand • If rand < r, randomly choose an agent for help • Otherwise, find the centroid in C nearest to the current document and ask the corresponding agent for help • To learn when another agent has helped • Update the corresponding centroid by including the document • Nearest Centroid Learning – content sensitive • Pursuit Learning – not content sensitive
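A matching sketch for the centroid-based variant, under the same caveats: class and method names are assumptions, and cosine similarity is assumed as the nearness measure (consistent with the VSM setting). Each neighbor's centroid is the running mean of the documents it has successfully helped with.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class NearestCentroidLearner:
    """Sketch: one centroid per neighbor, built from documents
    that neighbor has helped classify."""

    def __init__(self, n_neighbors, dim, explore_rate=0.1):
        self.centroids = [[0.0] * dim for _ in range(n_neighbors)]
        self.counts = [0] * n_neighbors
        self.r = explore_rate

    def choose(self, doc_vec):
        """Explore with prob. r, else ask the neighbor whose centroid
        is nearest to the current document."""
        if random.random() < self.r or not any(self.counts):
            return random.randrange(len(self.centroids))
        scored = [(i, cosine(doc_vec, c))
                  for i, c in enumerate(self.centroids) if self.counts[i]]
        return max(scored, key=lambda t: t[1])[0]

    def reward(self, i, doc_vec):
        """Neighbor i helped: fold the document into its centroid (running mean)."""
        n = self.counts[i]
        self.centroids[i] = [(c * n + d) / (n + 1)
                             for c, d in zip(self.centroids[i], doc_vec)]
        self.counts[i] = n + 1
```

Unlike pursuit learning, `choose` must compute a similarity against every centroid per document, which is the source of NCL's higher cost noted in the summary.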
Experiment Design • Reuters Corpus Volume 1 (RCV1) • Training set: 6,394 documents • Test set: 2,500 documents • Feature selection: 4,084 unique terms • Evaluation measures • Precision P = a / (a + b) • Recall R = a / (a + c) • F1 = 2 * P * R / (P + R) Impact: small test collection size, scalability, etc.; larger, more recent collections suggested.
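The three measures above translate directly into code. A minimal sketch, with the usual contingency-table reading of the slide's symbols: `a` = true positives, `b` = false positives, `c` = false negatives; the zero-division guards are an added convenience.

```python
def prf1(a, b, c):
    """Precision, recall, and F1 from contingency counts:
    a = true positives, b = false positives, c = false negatives."""
    p = a / (a + b) if a + b else 0.0   # Precision = a / (a + b)
    r = a / (a + c) if a + c else 0.0   # Recall    = a / (a + c)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, 8 correct labels with 2 false positives and 2 misses gives P = R = F1 = 0.8.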
Hardware and Software Setup • MACCI • Multi-Agent Collaboration for Classification of Information • Cougaar agent architecture • Weka machine learning framework • Hardware • Dual Intel Xeon 2.8 GHz CPUs • 3.5 GB RAM (2 GB reserved) • Software • Red Hat Linux AS 4 • Java Runtime Environment 1.5.0
Results – Effectiveness Baselines • Presentation/Argument: • Use both micro & macro F scores; • ROC curves suggested.
Results – Effectiveness of Learning [Figure: effectiveness of PL versus random collaboration, with an optimal zone marked]
Results – Learning Progression and Latency • Cumulative effectiveness over time [Figure: curves for Pursuit Learning and Nearest Centroid Learning]
Results – Learning Progression and Latency • In-session classification effectiveness over time
Results – Classification Efficiency [Figure: classification efficiency compared against the baseline]
Summary • Classification effectiveness decreases dramatically as knowledge becomes increasingly distributed. • Pursuit Learning • Efficient – does not analyze content • Effective, although not content sensitive • "The Pursuit Learning approach did not depend on document content. By acquiring knowledge through reinforcements based on collaborations, this algorithm was able to construct paths for documents to find relevant classifiers effectively and efficiently." • Nearest Centroid Learning • Inefficient – has to analyze content • Effective • Learning did not converge in some experiments • A larger test set for future study • Exploration rate (r): a function instead of a constant
Thank you • Questions… • Comments… A copy of the submitted paper is available at: http://tara.slis.indiana.edu/macci/docs/sigir-agent-ke.pdf