search by strategy n.
Skip this Video
Loading SlideShow in 5 Seconds..
Search by strategy PowerPoint Presentation
Download Presentation
Search by strategy

Loading in 2 Seconds...

play fullscreen
1 / 28

Search by strategy - PowerPoint PPT Presentation

  • Uploaded on

Search by strategy. Arjen P. de Vries CWI, Spinque, Delft University of Technology. Interactive Information Access. Feedback: Interaction improves information representation Faceted Browsing: Interaction can let user take over where machine would fail Search by Strategy:

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Search by strategy' - chet

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
search by strategy

Search by strategy

Arjen P. de

CWI, Spinque, Delft University of Technology

interactive information access
Interactive Information Access
  • Feedback:
    • Interaction improves information representation
  • Faceted Browsing:
    • Interaction can let user take over where machine would fail
  • Search by Strategy:
    • Interaction can let user take over where system designer would fail
two extreme views on search
Two extreme views on ‘search’

What is DB?

From business applications

Deductive reasoning

Precise and efficient query processing

Users with technical skills (SQL, XQuery, etc) and precise information needs


Books where category=‘CS’

What is IR?

From digital libraries, patent collections, etc

Inductive reasoning

Best-effort processing

Users with low technical skills and imprecise information needs


Books about CS

Note: SemWeb more DB than IR!!!

complex search tasks or not so simple search tasks if you like
Complex search tasks(or, “not-so-simple” search tasks, if you like)
  • I want to buy a house in Amsterdam and I want it with ‘sfeer’ but still in good shape
  • I can afford about €350K. I need 3 bedrooms, the size should be about 80m2. It should have a balcony or a backyard
  • Thecloserto the station and an AH, thebetter. BUT… I do not want to live in Amsterdam-Noord, unless there is a quick bus connection to the ferry
  • I may be willing to drop some of these constraints, but I’m notsurewhich
search ir db
Search = IR + DB
  • Complex search tasks mix exact (DB) and ranked (IR) searches, on structured (DB) and unstructured (IR) data
  • Current technical solutions support either/or
    • SQL or Excel for structured data (DB)
    • Google or Lucene for unstructured data (IR)
  • Combining results requires significant effort
    • copy & paste result sets between interfaces, “human (probabilistic) joins”
search ir on top of db
Search = IR on-top-of DB ?
  • IR on-top-of DB: let exact and ranked operations both be processed by the same engine, so they can be mixed freely
  • IR responsible for ranking models, using DB as a data-access layer; no physical details necessary
  • DB responsible for reliable, dynamically optimised, data access; no logical details necessary
ir on top of db
IR on-top-of DB???
  • Traditional, general-purpose DB technology cannot compete with custom IR search tools
    • Working assumption: using column stores should solve the efficiency problem
ir on top of db part i
IR on-top-of DB – part I

let $c := doc("papers.xml")//DOC[author = "John Smith"]

let $q := "//text[about(.//Abstract, IR DB integration)]"

for $res in tijah-query($c, $q)

return $res/title/text()

I am an IRandDB expert



Hiemstra, Rode, Van Os, Flokstra, OSIR 2006PF/Tijah: text search in an XML database system

ir on top of db1
IR on-top-of DB

Rölleke, Tsikrika, KazaiA general matrix frameworkfor modelling Information Retrieval

“IR concepts to matrix spaces”


Data Model Mismatch!!!To implement IR taskson top of DB is a tough job!


Cornacchia, De Vries, Van Ballegooij, KerstenSRAM (Sparse Relational Array Mapping)

“matrix spaces to relational tables”



Bla bla

Bla bla bla, bla bla. Bla, bla bla

Bla bla


Bla bla

Bla bla bla, bla bla. Bla.


ir on top of db part ii
IR on-top-of DB – part II

let $c := doc("papers.xml")//DOC[author = "John Smith"]

let $q := "//text[about(.//Abstract, IR DB integration)]"

for $res in tijah-query($c, $q)

return $res/title/text()

I am an IR expert

(needn’t be a DB expert as well)




Cornacchia, De Vries, ECIR 2007A Parametrised Search System

sram for ir

# Language Modelling

langmod(Q,d,λ) = sum( [ lm_t(d,t,λ) * Q(t) | t ] )

lm_t(d,t,λ) = log( λ* DTf(d,t) + (1-λ) * Tf(t) )

# Okapi BM25

bm25(Q,d,k1,b) = sum( [ w_t(d,t,k1,b) * Q(t) | t ] )

w_t(d,t,k1,b) = Tidf(t) * (k1+1) * DTf(d,t)

/ ( DTf(d,t) + k1 * ((1-b) + b * Dnl(d)/avgDl) )

Retrieval_Models.ram (excerpt)

DTnl = mxMult(mxTrnsp(LT), LD) # doc-term freq.

Dnl = mvMult(mxTrnsp(LD), L) # doc lenght

Tnl = mvMult(mxTrnsp(LT), L) # term freq.

DTf = [ DTnl(d,t)/Dnl(d) | d,t ] # doc-term norm. freq.

Tf = [ Tnl(t)/nLocs | t ] # term norm. freq.

Indexing.ram (excerpt)

mxTrnsp(A) = [ A(j,i) | i,j ]

mxMult(A,B) = [ sum([ A(i,k) * B(k,j) | k ]) | i,j ]

mvMult(A,V) = [ sum([ A(i,j) * V(j) | j ]) | i ]

Linear_Algebra.ram (excerpt)

parameterised search system pss
Parameterised Search System (PSS)

Cannot we ‘remove’ this IR engineer from the loop, like DBMS software removes the data engineer from the loop?

Cornacchia, De Vries, ECIR 2007A Parametrised Search System

search by strategy1
Search by Strategy
  • Visually construct search strategies by connecting building blocks
search by strategy2
Search by Strategy
  • Visually construct search strategies by connecting building blocks
  • Each block describes either data or actions upon that data
    • A search strategy may include multiple data sources
    • Data sources are internally represented as quadruples, triples extended with an additional probability value
    • Actions are actually scripts expressed in Fuhr and Roelleke’s PRA (TOIS 1997)

1. Which universities/colleges hold patents?

2. Who are the inventors named in those patents?

Real-life patent search example:

Which researchers associated to universities and colleges should our Human Resources manager know to hire the right people on time?

3. Which inventors are active in the area of our company?

how strategies help
How Strategies Help
  • Strategies improve communication between search intermediary and user
    • Encapsulate domain expert knowledge
    • Abstract representation of search expert knowledge
    • Analyze information seeking process at any stage
  • Strategies facilitate knowledge management
    • Store / share / publish / refine
  • Strategies mix exact (DB) and ranked (IR) searches
    • Avoid the need for “human (probabilistic) joins”
  • PRA translates into SQL (!)
  • Current system setup using CWI’s MonetDB column-store
  • Strategies are dynamically transformed into a REST API
  • Hand over control to the user (or, most likely, the search intermediary)
    • Patent information specialists
    • Digital forensics detectives
    • Librarians / archivists
    • Real estate agents
    • Travel agency
open issues
Open Issues
  • Assist the user make the best out of their increased level of control
    • Integrate usage data live system to help improve or adapt strategies
  • Handle “even larger” scale data
    • Patent demo fine on ~17GB semi-structured data (i.e., Fairview Research’s Green Energy collection), without specific optimizations, even with fairly large strategies
  • Close the loop!
current situation
Current Situation
  • index ;
  • repeat {
  • specify ;
  • retrieve
  • } until 



desirable situation
Desirable Situation
  • repeat {
  • index ;
  • specify ;
  • retrieve
  • } until 

IR + DB + IE