
Filtering Semi-Structured Documents Based on Faceted Feedback

Lanbo Zhang, Yi Zhang, Qianli Xing

Information Retrieval and Knowledge Management (IRKM) Lab

University of California, Santa Cruz


Personalized Information Filtering

  • Identify user-desired documents from a document stream

  • Two families of filtering approaches

    • Collaborative Filtering (CF)

    • Content-Based Filtering (CBF)

  • Applications: news feeds, email spam filters, etc.

[Diagram: streams of emails, news, and blogs enter the filtering system, which outputs the passed documents.]


Semi-Structured Documents

  • Increasingly prevalent over the Internet

    • Emails, news, movies, tweets, etc.

  • Plenty of metadata available


Definitions

  • Facet: a metadata field

    • Date, Topic, Location, Director, Genre, etc.

  • Facet-Value Pair (FVP): a metadata field assigned with a particular value

    • Topic: Royal wedding

    • Date: 04-29-2011

    • Location: London, UK
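As a minimal sketch of these definitions, a semi-structured document could be held as free text plus a set of facet-value pairs. The names `FVP`, `Document`, `fvps`, and `has` are illustrative assumptions and do not come from the slides; the sample text is made up.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, NamedTuple


class FVP(NamedTuple):
    """A facet-value pair: a metadata field assigned a particular value."""
    facet: str
    value: str


@dataclass(frozen=True)
class Document:
    """Hypothetical container: body text plus metadata stored as FVPs."""
    text: str
    fvps: FrozenSet[FVP] = field(default_factory=frozenset)

    def has(self, fvp: FVP) -> bool:
        # True when the document carries this exact facet-value pair.
        return fvp in self.fvps


# The three example FVPs from the slide, attached to a made-up document.
doc = Document(
    text="Coverage of the royal wedding ceremony.",
    fvps=frozenset({
        FVP("Topic", "Royal wedding"),
        FVP("Date", "04-29-2011"),
        FVP("Location", "London, UK"),
    }),
)
```

Treating each FVP as a hashable value makes it usable directly as a binary feature (present or absent) in the filtering model.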

Lanbo Zhang, Yi Zhang, Qianli Xing. IRKM Lab at University of California, Santa Cruz


Motivation

  • Existing filtering approaches learn user interests based on users’ relevance judgments of documents

  • Users may have prior knowledge on which facet-value pairs are relevant

    • English-only readers

      • “Language: English”

    • Social network analysts

      • “Company: Facebook”



Can we exploit users’ prior knowledge on facet-value pairs for filtering?



A New User Interaction Mechanism: Faceted Feedback

[Diagram: the filtering system shows the user FVP candidates (Lang: …, Topic: …, Date: …); the user returns the relevant FVPs (Topic: …, Lang: …).]


Research Questions

  • Question 1

    • How to select facet-value pair candidates?

  • Question 2

    • How to learn user profiles based on faceted feedback?



Q1: Possible Methods

  • Feature selection methods for text classification

    • E.g., Mutual Information, Chi-Square measure, etc.

      • Usually a large number of labeled documents available

  • Query expansion methods for retrieval

    • E.g., TFIDF score on pseudo relevant documents

      • No labeled documents available



FVP Selection: Our Approach

  • In a filtering task

    • A large number of unlabeled documents

    • Possibly a small number of labeled documents

  • We rank facet-value pairs by a score (equation not preserved in this transcript) computed over two document sets

    • Pseudo relevant (positively classified) documents

    • User-labeled relevant documents

  • Intuition: features that occur frequently among relevant documents but rarely in the whole corpus are very likely to be relevant
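The intuition above can be sketched with a TF-IDF-style score: the count of an FVP among the relevant (pseudo relevant plus user-labeled) documents, multiplied by the log inverse of its document frequency in the corpus. This exact scoring formula is an assumption chosen for illustration, since the slide's own ranking equation is not available here; `rank_fvps` and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Set, Tuple

FVP = Tuple[str, str]  # (facet, value)


def rank_fvps(relevant_docs: Iterable[Set[FVP]],
              corpus: List[Set[FVP]],
              top_k: int = 10) -> List[Tuple[FVP, float]]:
    """Rank FVPs by an assumed TF-IDF-style score (not the slide's formula).

    relevant_docs: FVP sets of pseudo relevant and user-labeled relevant docs.
    corpus: FVP sets of all documents, labeled or not.
    """
    n_docs = len(corpus)
    corpus_df = Counter(fvp for doc in corpus for fvp in doc)
    rel_df = Counter(fvp for doc in relevant_docs for fvp in doc)
    scores = {
        # Frequent among relevant docs, rare in the corpus -> high score.
        fvp: count * math.log(n_docs / corpus_df[fvp])
        for fvp, count in rel_df.items()
        if corpus_df[fvp] > 0
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Made-up corpus: "Lang: English" is everywhere, "Company: Facebook" is rare.
corpus = [
    {("Lang", "English"), ("Company", "Facebook")},
    {("Lang", "English"), ("Company", "Facebook")},
    {("Lang", "English"), ("Company", "Acme")},
    {("Lang", "English")},
]
ranked = rank_fvps(relevant_docs=corpus[:2], corpus=corpus)
```

In this toy run the ubiquitous "Lang: English" pair scores zero, while "Company: Facebook" ranks first because it is concentrated in the relevant documents.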





Content-Based Filtering (CBF)

  • Treated as a binary text classification task

  • User profile: a feature vector that represents a user’s information needs (interests/preferences)

  • Given the user profile θ, a document is classified as relevant or not by an equation (not preserved in this transcript) annotated with two terms: the document label and the document vector

  • The core of CBF is learning the user profile!
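A minimal sketch of such a decision rule, assuming a logistic model over sparse feature vectors. The logistic form, the 0.5 threshold, and the names `relevance_probability` and `is_relevant` are assumptions for illustration; the slides only state that the profile θ and the document vector determine the label.

```python
import math
from typing import Dict

Vector = Dict[str, float]  # sparse vector: feature name -> weight


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relevance_probability(theta: Vector, x: Vector) -> float:
    """P(relevant | x, theta) under the assumed logistic model."""
    score = sum(w * theta.get(name, 0.0) for name, w in x.items())
    return sigmoid(score)


def is_relevant(theta: Vector, x: Vector, threshold: float = 0.5) -> bool:
    # Deliver the document when the predicted probability reaches the threshold.
    return relevance_probability(theta, x) >= threshold


# Hypothetical profile: likes Facebook stories, dislikes French-language docs.
theta = {"Company: Facebook": 2.0, "Language: French": -3.0}
```

With this profile, a document carrying only "Company: Facebook" is delivered, and one carrying only "Language: French" is filtered out.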



Q2: Possible Methods

  • Simple methods

    • Boolean strategy (AND, OR)

    • Feature selection

    • Pseudo relevant document

  • Sophisticated methods

    • Bayesian logistic regression with an adjusted prior (Dayanik et al. 06)

    • Generalized Expectation Criteria (Druck et al. 08)



Our Approach

  • The assumption

    • A feature is selected by a user since it has a high correlation with the document label (R/NR)

  • Generalized Constraint Model (GCM)



Correlation Decomposition

  • Sufficiency

    • The probability of a document being relevant given that the feature has occurred: P(R+|f=1)

    • P(R+|f=1)=1 : sufficient features

      • E.g., “Company: Facebook” for social network analysts

  • Necessity

    • The probability of the feature having occurred given that a document is relevant: P(f=1|R+)

    • P(f=1|R+)=1 : necessary features

      • E.g., “Language: English” for English-only readers
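When document labels are known, both quantities reduce to ratios of document counts. In the notation below, D_f (the documents containing feature f) and D_R (the relevant documents) are symbols introduced here for illustration:

```latex
\[
P(R^{+} \mid f = 1) = \frac{|D_f \cap D_R|}{|D_f|},
\qquad
P(f = 1 \mid R^{+}) = \frac{|D_f \cap D_R|}{|D_R|}
\]
```

Both ratios share a numerator; sufficiency normalizes by how often the feature occurs, and necessity by how many documents are relevant.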



Examples: Highly-Correlated Features

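[Venn diagram over the whole corpus: the region f1=1 lies inside R+, the region f2=1 contains R+, and the region f3=1 largely overlaps R+.]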

1) f1 is a sufficient feature since P(R+|f1=1)=1

2) f2 is a necessary feature since P(f2=1|R+)=1

3) f3 is neither necessary nor sufficient, but both its sufficiency and necessity are high (>0.5)
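The three cases can be checked on a toy corpus by treating each feature and the relevant class as sets of document ids. The ids and set sizes below are made up for illustration, and `sufficiency_necessity` is a hypothetical helper.

```python
from typing import Set, Tuple


def sufficiency_necessity(feature_docs: Set[int],
                          relevant_docs: Set[int]) -> Tuple[float, float]:
    """Return (P(R+|f=1), P(f=1|R+)) as empirical ratios over document ids."""
    both = len(feature_docs & relevant_docs)
    sufficiency = both / len(feature_docs) if feature_docs else 0.0
    necessity = both / len(relevant_docs) if relevant_docs else 0.0
    return sufficiency, necessity


# Toy corpus of document ids 0..9; ids 2..5 are relevant.
relevant = {2, 3, 4, 5}
f1 = {2, 3}               # lies inside R+      -> sufficient
f2 = {1, 2, 3, 4, 5, 6}   # contains R+         -> necessary
f3 = {3, 4, 5, 6}         # overlaps R+ heavily -> neither, but both high

s1, n1 = sufficiency_necessity(f1, relevant)
s2, n2 = sufficiency_necessity(f2, relevant)
s3, n3 = sufficiency_necessity(f3, relevant)
```

Here f1 has sufficiency 1, f2 has necessity 1, and f3 scores 0.75 on both measures.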



Estimating Sufficiency

(Equation not preserved in this transcript. Its annotated terms:)

  • Document label

  • User profile vector

  • The feature

  • Estimation of the label of document di

  • The set of documents covered by feature f
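One plausible reading of these annotated terms is that sufficiency is estimated by averaging the model's predicted relevance over the documents covered by the feature. The sketch below follows that reading under an assumed logistic model; it is a guess at the shape of the estimator, not a transcription of the slide, and `estimated_sufficiency` is a hypothetical name.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # sparse vector: feature name -> weight


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(theta: Vector, x: Vector) -> float:
    """Estimated probability that document x is relevant (assumed logistic)."""
    return sigmoid(sum(w * theta.get(name, 0.0) for name, w in x.items()))


def estimated_sufficiency(theta: Vector, feature: str,
                          docs: List[Vector]) -> float:
    """Mean predicted relevance over the documents covered by `feature`."""
    covered = [x for x in docs if x.get(feature, 0.0) != 0.0]
    if not covered:
        return 0.0  # Convention chosen here: no coverage, no evidence.
    return sum(predict(theta, x) for x in covered) / len(covered)
```

Because the estimate depends on θ only through the predictions, it can be computed on unlabeled documents, which is what makes it usable in a cold-start setting.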



Estimating Necessity

  • Bayes’ Theorem! (equation not preserved in this transcript)

  • Annotated terms: feature sufficiency, prior distribution
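Written out with the sufficiency and necessity defined earlier, Bayes' theorem gives the relation below. Which of the two priors, P(f=1) or P(R+), the slide's "prior distribution" annotation points to cannot be recovered from the transcript.

```latex
\[
P(f = 1 \mid R^{+}) = \frac{P(R^{+} \mid f = 1)\, P(f = 1)}{P(R^{+})}
\]
```

The necessity of a feature can therefore be obtained from its estimated sufficiency once the feature's occurrence rate and the overall relevance rate are known or estimated.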



Reference Distributions

  • Our assumption

    • User selects a feature since it has a high sufficiency and/or a high necessity

  • Reference distributions: two Bernoulli distributions

    • The sufficiency/necessity of a user-selected feature should be close to the corresponding reference distribution

    • KL divergence serves as the similarity measure
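For two Bernoulli distributions the KL divergence has a closed form, sketched below. The direction of the divergence (reference first, estimate second) and the reference value 0.9 used in the example are assumptions; the slides do not state either.

```python
import math


def bernoulli_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


# Assumed reference sufficiency of 0.9: the penalty shrinks as the
# estimated sufficiency of a user-selected feature approaches it.
reference = 0.9
far = bernoulli_kl(reference, 0.2)
near = bernoulli_kl(reference, 0.6)
```

The divergence is zero only when the estimate matches the reference, so minimizing it pulls the model toward profiles under which the selected features look sufficient or necessary.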



User Profile Learning

  • A unified loss function combines the two types of feedback (equation not preserved in this transcript); its annotated terms:

    • User-labeled documents

    • Sufficient features

    • Necessary features

    • Ts, Tn: reference distributions
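A rough sketch of how such a loss could be assembled from the three annotated terms: log loss on user-labeled documents, plus KL penalties tying the estimated sufficiency and necessity of user-selected features to the reference values. The logistic model, the KL direction, the single `weight` parameter, the default references of 0.9, and the use of an unlabeled document pool are all assumptions; the slide's actual loss function is not available here.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # sparse vector: feature name -> weight
EPS = 1e-12


def predict(theta: Vector, x: Vector) -> float:
    """Assumed logistic model: P(relevant | x, theta)."""
    z = sum(w * theta.get(name, 0.0) for name, w in x.items())
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bernoulli_kl(p: float, q: float) -> float:
    p = min(max(p, EPS), 1.0 - EPS)
    q = min(max(q, EPS), 1.0 - EPS)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def unified_loss(theta: Vector,
                 labeled: List[Tuple[Vector, int]],  # (document, label in {0, 1})
                 pool: List[Vector],                 # unlabeled documents
                 sufficient: List[str],
                 necessary: List[str],
                 t_s: float = 0.9, t_n: float = 0.9,
                 weight: float = 1.0) -> float:
    loss = 0.0
    # Term 1: log loss on user-labeled documents.
    for x, y in labeled:
        p = min(max(predict(theta, x), EPS), 1.0 - EPS)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    preds = [predict(theta, x) for x in pool]
    total = sum(preds)
    # Term 2: sufficiency = mean predicted relevance over covered documents.
    for f in sufficient:
        covered = [p for x, p in zip(pool, preds) if x.get(f, 0.0) != 0.0]
        if covered:
            loss += weight * bernoulli_kl(t_s, sum(covered) / len(covered))
    # Term 3: necessity = share of predicted relevance mass carrying f,
    # which is what Bayes' theorem yields when priors come from the pool.
    for f in necessary:
        mass = sum(p for x, p in zip(pool, preds) if x.get(f, 0.0) != 0.0)
        if total > 0:
            loss += weight * bernoulli_kl(t_n, mass / total)
    return loss
```

Minimizing this quantity over θ with any gradient-based or derivative-free optimizer would yield a profile that respects both kinds of feedback; the optimizer choice is left open here.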



User Interaction Mechanisms


Two mechanisms

  • Mechanism 1: ask users to select features they think are relevant

  • Mechanism 2: ask users to select separately the features they think are sufficient and the features they think are necessary



Outline

  • Introduction

  • Faceted Feedback

    • Facet-Value Pair Candidate Selection

    • Learning from Faceted Feedback

  • Experiments

    • Settings

    • Results

  • Summary


Data Sets

  • Use two data sets from TREC filtering track

    • TREC 2000: OHSUMED (348,566 medical articles) + 63 topics (information needs)

      • Metadata field: MeSH (Medical Subject Headings)

    • TREC 2002: RCV1 (~800,000 news articles) + 50 topics defined by human assessors

      • Metadata fields: Topic, Industry, Region

  • Split each topic set into two equal-size subsets

    • One for parameter tuning, the other for testing



Faceted Feedback Collection

  • Recruit subjects on Mechanical Turk

    • Five subjects per topic

    • Average performance across subjects is reported

  • For each topic, we show subjects

    • The topic description (information need)

    • A group of facet-value pair candidates



Evaluation Metrics

  • Precision (macro)

  • Recall (macro)

  • T11U = 2 * Nrd – Nnd

    • Nrd: the number of relevant docs delivered

    • Nnd: the number of non-relevant docs delivered

  • T11SU = (max(T11U / MaxU, MinNU) – MinNU) / (1 – MinNU)

    • MinNU = -0.5

    • MaxU: the maximum possible utility (T11U)
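The utility metrics can be sketched as follows, assuming the TREC 2002 filtering-track definition of the scaled utility, T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU), and taking MaxU as the T11U obtained when every relevant document and no non-relevant document is delivered. The function and parameter names are illustrative.

```python
def t11u(n_rel_delivered: int, n_nonrel_delivered: int) -> int:
    """Linear utility: +2 per relevant doc delivered, -1 per non-relevant."""
    return 2 * n_rel_delivered - n_nonrel_delivered


def t11su(n_rel_delivered: int, n_nonrel_delivered: int,
          n_rel_total: int, min_nu: float = -0.5) -> float:
    """Scaled utility in [0, 1]; normalized utility is floored at min_nu."""
    max_u = 2 * n_rel_total  # Utility of a perfect filter for this topic.
    if max_u <= 0:
        raise ValueError("topic must have at least one relevant document")
    normalized = t11u(n_rel_delivered, n_nonrel_delivered) / max_u
    return (max(normalized, min_nu) - min_nu) / (1.0 - min_nu)


# Made-up topic with 20 relevant documents in the stream.
partial = t11su(10, 5, 20)    # some hits, some false alarms
silent = t11su(0, 0, 20)      # system delivers nothing
flooded = t11su(0, 100, 20)   # only false alarms; hits the floor
```

A system that delivers nothing scores 1/3 under this scaling, so a useful filter has to beat that baseline rather than zero.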




Results 1: With vs. Without Faceted Feedback (FF)

[Chart not preserved; x-axis: number of relevant documents initially known]

Faceted feedback improves filtering performance, especially when fewer relevant documents are initially known.



Results 2: Different Learning Algorithms

[Results table not preserved; it compares existing approaches with our approach. Abbreviations:]

  • BOOL(A), BOOL(O): Boolean strategy (AND, OR)

  • FS: feature selection based on FF

  • Pseudo-D/Q: pseudo relevant document/query

  • Prior: logistic regression with Bayesian prior

  • GEC: generalized expectation criteria



Summary

  • Faceted feedback is useful for filtering, especially in cold-start scenarios

  • The Generalized Constraint Model (GCM) is a robust user profile learning algorithm

  • In future work, we will evaluate our methods on data sets where faceted features are more important

    • Movies, music, products, etc.



Questions?

Filtering Semi-Structured Documents Based on Faceted Feedback

Lanbo Zhang, Yi Zhang, Qianli Xing

Information Retrieval and Knowledge Management (IRKM) Lab

University of California, Santa Cruz

lanbo@soe.ucsc.edu

yiz@soe.ucsc.edu

xingqianli@gmail.com

