The Boolean Model
This presentation is the property of its rightful owner.
Sponsored Links
1 / 32

The Boolean Model PowerPoint PPT Presentation


  • 105 Views
  • Uploaded on
  • Presentation posted in: General

The Boolean Model. Simple model based on set theory Queries specified as boolean expressions precise semantics neat formalism q = k a  (k b  k c ) Terms are either present or absent. Thus, w ij  {0,1} Consider q = k a  (k b  k c )

Download Presentation

The Boolean Model

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


The boolean model

The Boolean Model

  • Simple model based on set theory

  • Queries specified as boolean expressions

    • precise semantics

    • neat formalism

    • q = ka (kb  kc)

  • Terms are either present or absent. Thus, wij {0,1}

  • Consider

    • q = ka (kb  kc)

    • vec(qdnf) = (1,1,1)  (1,1,0)  (1,0,0)

    • vec(qcc) = (1,1,0) is a conjunctive component

  • Each query can be transformed in DNF form


The boolean model

The Boolean Model

Ka

Kb

  • q = ka (kb  kc)

  • sim(q,dj) = 1, if document satisfies the boolean query

  • 0 otherwise

  • - no in-between, only 0 or 1

(1,1,0)

(1,0,0)

(1,1,1)

Kc


Exercise

Exercise

D1 = “computer information retrieval”

D2 = “computer retrieval”

D3 = “information”

D4 = “computer information”

Q1 = “information  retrieval”

Q2 = “information ¬computer”


Exercise1

Exercise

กำหนด Index term ของแต่ละเอกสาร

D1 = {love, need, person, possess, understand}

D2 = {heart, listen, love, practice, suffer}

D3 = {compassion, love, mind, person, practice}

D4 = {death, health, languor, life, suffer}

D5 = {energy, love, nourish, practice, teach}

Q = {love ^ suffer}


The boolean model

Drawbacks of the Boolean Model

  • Retrieval based on binary decision criteria with no notion of partial matching

  • No ranking of the documents is provided (absence of a grading scale)

  • Information need has to be translated into a Boolean expression which most users find awkward

  • The Boolean queries formulated by the users are most often too simplistic

  • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query


Drawbacks of the boolean model

Drawbacks of the Boolean Model

  • The Boolean model imposes a binary criterion for deciding relevance

  • The question of how to extend the Boolean model to accomodate partial matching and a ranking has attracted considerable attention in the past

  • Two extensions of boolean model:

    • Fuzzy Set Model

    • Extended Boolean Model


Set theoretic models

Set Theoretic Models


Ir models

Algebraic

Set Theoretic

Generalized Vector

Lat. Semantic Index

Neural Networks

Structured Models

Fuzzy

Extended Boolean

Non-Overlapping Lists

Proximal Nodes

Classic Models

Probabilistic

Boolean

Vector

Probabilistic

Inference Network

Belief Network

Browsing

Flat

Structure Guided

Hypertext

IR Models

U

s

e

r

T

a

s

k

Retrieval:

Adhoc

Filtering

Browsing


Set theoretic models1

Set Theoretic Models

  • The Boolean model imposes a binary criterion for deciding relevance

  • The question of how to extend the Boolean model to accomodate partial matching and a ranking has attracted considerable attention in the past

  • Two set theoretic models for this:

    • Fuzzy Set Model

    • Extended Boolean Model


The boolean model

Fuzzy Set Model

  • Queries and docs represented by sets of index terms: matching is approximate from the start

  • This vagueness can be modeled using a fuzzy framework, as follows:

    • with each term is associated a fuzzy set

    • each doc has a degree of membership in this fuzzy set

  • This interpretation provides the foundation for many models for IR based on fuzzy theory

  • In here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)


The boolean model

Fuzzy Set Theory

  • Framework for representing classes whose boundaries are not well defined

  • Key idea is to introduce the notion of a degree of membership associated with the elements of a set

  • This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership

  • Thus, membership is now a gradual notion, contrary to the crispy notion enforced by classic Boolean logic


The boolean model

Fuzzy Set Theory

  • Model

    • A query term: a fuzzy set

    • A document: degree of membership in this test

    • Membership function

      • Associate membership function with the elements of the class

      • 0: no membership in the test

      • 1: full membership

      • 0 ~1: marginal elements of the test

documents


Fuzzy set theory

Fuzzy Set Theory

for query term

a class

  • A fuzzy subsetA of a universe of discourse U is characterized by a membership function µA: U[0,1] which associates with each element u of U a number µA(u) in the interval [0,1]

    • complement:

    • union:

    • intersection:

document collection


Examples

Examples

  • Assume U={d1, d2, d3, d4, d5, d6}

  • Let A and B be {d1, d2, d3} and {d2, d3, d4}, respectively.

  • Assume A={d1:0.8, d2:0.7, d3:0.6, d4:0, d5:0, d6:0} and B={d1:0, d2:0.6, d3:0.8, d4:0.9, d5:0, d6:0}

  • ={d1:0.2, d2:0.3, d3:0.4, d4:1, d5:1, d6:1}

  • =

    {d1:0.8, d2:0.7, d3:0.8, d4:0.9, d5:0, d6:0}

  • ={d1:0, d2:0.6, d3:0.6, d4:0, d5:0, d6:0}


Fuzzy information retrieval

Fuzzy Information Retrieval

  • basic idea

    • Expand the set of index terms in the query with related terms (from the thesaurus) such that additional relevant documents can be retrieved

    • A thesaurus can be constructed by defining a term-term correlation matrix c whose rows and columns are associated to the index terms in the document collection

keyword connection matrix


Fuzzy information retrieval continued

Fuzzy Information Retrieval(Continued)

  • normalized correlation factor ci,l between two terms ki and kl (0~1)

  • In the fuzzy set associated to each index term ki, a document dj has a degree of membership µi,j

ni is # of documents containing term ki

where

nl is # of documents containing term kl

ni,l is # of documents containing ki and kl


Fuzzy information retrieval continued1

Fuzzy Information Retrieval(Continued)

  • physical meaning

    • A document dj belongs to the fuzzy set associated to the term ki if its own terms are related to ki, i.e., i,j=1.

    • If there is at least one index term kl of dj which is strongly related to the index ki, then i,j1.ki is a good fuzzy index

    • When all index terms of dj are only loosely related to ki, i,j0.ki is not a good fuzzy index


Example

Example

q = (ka (kb  kc))= (ka kb  kc)  (ka kb   kc) (ka kb kc)= cc1+cc2+cc3

Da: the fuzzy set of documents

associated to the index ka

cc2

cc3

Da

djDa has a degree of membership

a,j > a predefined threshold K

cc1

Db

Da: the fuzzy set of documents

associated to the index ka

(the negation of index term ka)

Dc


Example1

Example

Query q=ka (kb   kc)

disjunctive normal form qdnf=(1,1,1)  (1,1,0)  (1,0,0)

(1) the degree of membership in a disjunctive fuzzy set is computed

using an algebraic sum (instead of max function) more smoothly

(2) the degree of membership in a conjunctive fuzzy set is computed

using an algebraic product (instead of min function)

Recall


Fuzzy set model

Fuzzy Set Model

  • Q: “gold silver truck”D1:“Shipment of gold damaged in a fire”D2:“Delivery of silver arrived in a silver truck”D3:“Shipment of gold arrived in a truck”

  • IDF (Select Keywords)

    • a = in = of = 0 = log 3/3arrived = gold = shipment = truck = 0.176 = log 3/2damaged = delivery = fire = silver = 0.477 = log 3/1

  • 8 Keywords (Dimensions) are selected

    • arrived(1), damaged(2), delivery(3), fire(4), gold(5), silver(6), shipment(7), truck(8)


Fuzzy set model1

Fuzzy Set Model


Fuzzy set model2

Fuzzy Set Model


Fuzzy set model3

Fuzzy Set Model

  • Sim(q,d): Alternative 1

    Sim(q,d3) > Sim(q,d2) > Sim(q,d1)

  • Sim(q,d): Alternative 2

    Sim(q,d3) > Sim(q,d2) > Sim(q,d1)


Extended boolean model

Extended Boolean Model

  • Disadvantages of “Boolean Model” :

  • No term weight is used

  • Counterexample: query q=Kx AND Ky.

    Documents containing just one term, e,g, Kx is considered as irrelevant as another document containing none of these terms.

  • The size of the output might be too large or too small


Extended boolean model1

Extended Boolean Model

  • The Extended Boolean model was introduced in 1983 by Salton, Fox, and Wu

  • The idea is to make use of term weight as vector space model.

  • Strategy: Combine Boolean query with vector space model.

  • Why not just use Vector Space Model?

  • Advantages: It is easy for user to provide query.


Extended boolean model2

Extended Boolean Model

  • Each document is represented by a vector (similar to vector space model.)

  • Remember the formula.

  • Query is in terms of Boolean formula.

  • How to rank the documents?


Fig extended boolean logic considering the space composed of two terms k x and k y only

Fig. Extended Boolean logicconsidering the space composed of two terms kx and ky only.

ky

ky

kx

kx


Extended boolean model3

Extended Boolean Model

  • For query q=Kx or Ky, (0,0) is the point we try to avoid. Thus, we can use

    to rank the documents

  • The bigger the better.


Extended boolean model4

Extended Boolean Model

  • For query q=Kx and Ky, (1,1) is the most desirable point.

  • We use

    to rank the documents.

  • The bigger the better.


Extend the idea to m terms

Extend the idea to m terms

  • qor=k1 p k2 p … p Km

  • qand=k1 p k2 p … p km


Properties

Properties

  • The p norm as defined above enjoys a couple of interesting properties as follows. First, when p=1 it can be verified that

  • Second, when p= it can be verified that

  • Sim(qor,dj)=max(xi)

  • Sim(qand,dj)=min(xi)


Example2

Example

  • For instance, consider the query q=(k1 k2)  k3. The similarity sim(q,dj) between a document dj and this query is then computed as

  • Any boolean can be expressed as a numeral formula.


  • Login