
COMS E6125 Web-enHanced Information Management (WHIM)

Prof. Gail Kaiser

Spring 2008

Kaiser: COMS E6125



Today’s Topics

  • Web Search – partially adapted from Alexandros Biliris (adjunct here)

  • Semantic Web – partially adapted from York Sure (University of Karlsruhe)


Information Retrieval as a Field

  • An “old” field that addresses issues related to

    • Classification and categorization of documents

    • Systems and languages for searching for words

    • User interfaces and visualization of results

  • Field was previously seen as of narrow interest – mainly, library search

  • The advent of the Web brings IR to the forefront

    • The Web became a huge “library” and everybody has free access to it (with no special training on “search”)

    • No central editorial board


IR: A World of Words

  • Typical IR model: The dataset consists of documents, each of which is a bag (multiset) of words (terms)

  • IR functionality: map words to documents

    • Search for documents that contain

    • a given word

    • word1 AND word2

    • word1 AND word2 AND NOT word3

    • etc.



IR: A World of Words

  • Detail 1: Stop Words

    • Certain words are considered irrelevant and not placed in the bag, e.g., “and”, “the”, …

  • Detail 2: “Stemming” and other content analysis

    • Using language-specific rules, convert words to their basic form, e.g., “surfing”, “surfed” --> “surf”

    • Deal with synonyms, misspellings, abbreviations
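
The bag-of-words model with stop words and stemming can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the stop-word set and the `crude_stem` suffix rules are assumptions made up for the example, and a real system would use a full stemmer with many more language-specific rules.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"and", "the", "a", "of", "to", "in"}

def crude_stem(word):
    """Strip a few English suffixes; real stemmers apply many more rules."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Return a multiset (Counter) of stemmed terms, with stop words dropped."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

bag = bag_of_words("Surfing the web: she surfed and surfed")
```

Here "surfing" and both occurrences of "surfed" collapse into the single term "surf", while "the" and "and" never enter the bag.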



Rankings

  • Finding documents that are the most relevant to a user’s query is quite imprecise

  • A ranking is an ordering of the documents retrieved that (hopefully) reflects their relevance to the user query


IR vs. DBMS

  IR                          DBMS
  Imprecise semantics         Precise semantics
  Keyword search              SQL
  Text, unstructured data     Structured data
  No transactions             Transactional semantics
  Partial results (top k)     Generate full answer
  Relevance is built-in       Relevance is built on top



Inverted indexes

  • Permit fast search for individual terms

  • For each term, you get a list consisting of:

    • document ID

    • frequency of term in doc (optional)

    • position of term in doc (optional)


Inverted indexes

  • These lists can be used to solve Boolean queries:

    • country -> d1, d2

    • manor -> d2

    • country AND manor -> d2

  • Also used for statistical ranking algorithms
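
The Boolean queries above reduce to set operations on the postings lists. A minimal Python sketch, using the two terms from the slide (the `docs_with` helper is a name chosen for this example):

```python
# Postings from the slide: term -> set of document IDs.
index = {"country": {"d1", "d2"}, "manor": {"d2"}}

def docs_with(term):
    """Return the set of documents containing a term (empty if unknown)."""
    return index.get(term, set())

both = docs_with("country") & docs_with("manor")          # country AND manor
only_country = docs_with("country") - docs_with("manor")  # country AND NOT manor
```

AND becomes set intersection and AND NOT becomes set difference, which is why a per-term list of document IDs is enough to answer such queries.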


    How Inverted Files Are Created

    • Periodically rebuilt, static otherwise

    • Documents are parsed to extract tokens

    Doc 1: Now is the time for all good men to come to the aid of their country

    Doc 2: It was a dark and stormy night in the country manor. The time was past midnight



    How Inverted Files are Created

    • After all documents have been parsed the inverted file is sorted alphabetically



    How Inverted Files are Created

    • Multiple term entries for a single document are merged

    • Within-document term frequency information is compiled



    How Inverted Files are Created

    • Finally, the file can be split into

      • A Dictionary or Lexicon file

        and

      • A Postings file


    How Inverted Files are Created

    [Figure: the Dictionary/Lexicon file shown alongside the Postings file]
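
The four steps (parse, sort, merge, split) can be sketched in Python on the two example documents. This is one possible layout, assumed for illustration: the dictionary stores, per term, the number of documents and an offset into a flat postings list. Stop words are deliberately not removed here so that the merge step has something to do.

```python
import re
from collections import Counter

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

# Step 1: parse each document into (term, doc_id) pairs.
pairs = [(tok, doc_id)
         for doc_id, text in docs.items()
         for tok in re.findall(r"[a-z]+", text.lower())]

# Step 2: sort alphabetically by term (ties broken by document ID).
pairs.sort()

# Step 3: merge repeated (term, doc_id) entries into within-document frequencies.
freqs = Counter(pairs)

# Step 4: split into a postings list and a dictionary that points into it.
postings = []    # flat list of (doc_id, term_frequency)
dictionary = {}  # term -> (number of documents, offset into postings)
for (term, doc_id), tf in sorted(freqs.items()):
    if term not in dictionary:
        dictionary[term] = (0, len(postings))
    n_docs, offset = dictionary[term]
    dictionary[term] = (n_docs + 1, offset)
    postings.append((doc_id, tf))

def lookup(term):
    """Return the postings (doc_id, frequency) for a term."""
    n_docs, offset = dictionary.get(term, (0, 0))
    return postings[offset:offset + n_docs]
```

For instance, `lookup("country")` yields one posting per document, and `lookup("manor")` yields a single posting for Doc 2.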



    Search Engine Characteristics

    • Unedited data – anyone can enter content

      • Quality issues, spam

    • Varied information types

      • Phone book, brochures, catalogs, dissertations, news reports, weather, all in one place!


    Search Engine Characteristics

    • Different kinds of users

      • LexisNexis: Paid professional searchers

      • Online catalogs: Scholars searching scholarly literature

      • Web: Every type of person with every type of goal

    • Scale

      • Hundreds of millions of searches/day

      • Billions of static documents

      • Tens of millions of Web servers


    Directories vs. Search Engines

    • Directories

      • Hand-selected sites

      • Search over the contents of the descriptions of the pages

      • Organized in advance into categories

    • Search Engines

      • All pages in all sites

      • Search over the contents of the pages themselves

      • Organized in response to a query by relevance rankings or other scores


    Inverted Indexes for Web Search Engines

    • Inverted indexes are still used, even though the web is so huge.

    • Some systems partition the indexes across different machines; each machine handles different parts of the data.

    • Other systems duplicate the data across many machines; queries are distributed among the machines.

    • Most do a combination of these.


    Ranking Strategies

    • Details proprietary and changing

    • Combining subsets of:

      • IR-style relevance: Based on term frequencies, proximities, position (e.g., in title), font, etc.

      • Popularity information - Frequently visited pages

      • Link analysis information - Which sites are linked to by other sites

      • A variant of vector space ranking to combine these

        • Make a vector of weights for each feature

        • Multiply this by the counts for each feature
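
Combining features with a weight vector amounts to a dot product. The sketch below is only an illustration: real engines keep their features and weights proprietary, so the feature names, weights and counts here are all hypothetical.

```python
# Hypothetical feature names and weights, chosen only for illustration.
weights = {"term_frequency": 1.0, "in_title": 3.0, "inbound_links": 0.5}

def score(features):
    """Dot product of the weight vector and a page's feature counts."""
    return sum(weights[name] * features.get(name, 0) for name in weights)

pages = {
    "page_a": {"term_frequency": 4, "in_title": 1, "inbound_links": 2},
    "page_b": {"term_frequency": 6, "in_title": 0, "inbound_links": 0},
}

# Rank pages from highest to lowest combined score.
ranking = sorted(pages, key=lambda p: score(pages[p]), reverse=True)
```

With these made-up weights, `page_a` outranks `page_b` even though it mentions the term less often, because the title match and inbound links add to its score.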



    Link Analysis for Ranking Pages

    • Assumption: If the pages pointing to this page are good, then this is also a good page

    • Draws upon earlier research in sociology and bibliometrics.

      • Kleinberg’s model includes “authorities” (highly referenced pages) and “hubs” (pages containing good reference lists).

      • Google model is a version with no hubs


    Intuition

    [Figure: two pages labeled A (authority) and H (hub)]

    • Authority comes from in-edges.

    • Being a good hub comes from out-edges.

    • Better authority comes from in-edges from good hubs.

    • Being a better hub comes from out-edges to good authorities.
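
The mutual reinforcement between hubs and authorities can be sketched as an iterative computation in the style of Kleinberg's model. This is a simplified illustration on a hypothetical toy graph; the iteration count and the normalization are choices made for the example.

```python
import math

def hits(links, iterations=20):
    """links maps each page to the pages it links to. Returns (authority, hub) scores."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of pages linking in.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # Hub: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

# Hypothetical graph: two list pages both point at "paper"; only one points at "blog".
links = {"list1": ["paper", "blog"], "list2": ["paper"]}
auth, hub = hits(links)
```

In this graph "paper" ends up the stronger authority because both list pages point to it, and "list1" ends up the stronger hub because it points to more authorities.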


    Web Crawlers

    • How do the web search engines get all of the items they index?

    • Main idea:

      • Start with known sites

      • Record information for these sites

      • Follow the links from each site

      • Record information found at new sites

      • Repeat


    Web Crawling Algorithm

    • Put a set of known sites on a queue

    • Repeat the following until the queue is empty:

      • Take the first page off of the queue

      • If this page has not yet been processed:

        • Record the information found on this page

        • Add each link on the current page to the queue

        • Record that this page has been processed
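
The algorithm above maps directly onto a queue and a record of processed pages. The sketch below runs against a hypothetical in-memory "web" (the `WEB` dictionary and its URLs are made up) so that it needs no network access; a real crawler would replace `fetch` with an HTTP download and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("home", ["b.example", "c.example"]),
    "b.example": ("about", ["a.example"]),
    "c.example": ("news", ["b.example", "d.example"]),
    "d.example": ("contact", []),
}

def crawl(seeds, fetch):
    queue = deque(seeds)       # put the known sites on a queue
    processed = {}             # URL -> recorded information
    while queue:               # repeat until the queue is empty
        url = queue.popleft()  # take the first page off the queue
        if url in processed:   # skip pages that were already processed
            continue
        text, links = fetch(url)
        processed[url] = text  # record the information and mark as processed
        queue.extend(links)    # add each link on the page to the queue
    return processed

result = crawl(["a.example"], WEB.__getitem__)
```

Because the queue is first-in, first-out, pages are visited in breadth-first order starting from the seed.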


    Robot Exclusion Protocol

    • Polite crawlers first attempt to download the file robots.txt

    • Created by the Web master to indicate which part of the site is off-limits to crawlers

      User-agent: *

      Disallow: /
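
A polite crawler can check such rules with Python's standard `urllib.robotparser` module. In this sketch the rules are parsed from strings rather than downloaded; the bot name `ExampleBot`, the example.com URLs and the second, partial policy are assumptions added for illustration.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text instead of fetching it
    return parser.can_fetch(user_agent, url)

keep_out = "User-agent: *\nDisallow: /"         # the policy shown on the slide
partial = "User-agent: *\nDisallow: /private/"  # hypothetical narrower policy

blocked = allowed(keep_out, "ExampleBot", "http://www.example.com/index.html")
public_ok = allowed(partial, "ExampleBot", "http://www.example.com/index.html")
private_ok = allowed(partial, "ExampleBot", "http://www.example.com/private/notes.html")
```

The slide's policy bars every crawler from the whole site, whereas the narrower policy only fences off one directory.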


    Robot Exclusion Protocol

    • robots META tag

      <HTML>

      <HEAD>

      <META NAME="robots" CONTENT="noindex,nofollow">

      ...

      </HEAD>

      ...

      </HTML>


    Web Crawling Issues/Challenges

    • Politeness: robots “keep out” signs

    • Freshness - Figure out which pages change often, and re-crawl these often

    • Quantity (> 6B docs on > 60M Web servers)

    • Quality - Duplicates, virtual hosts, etc.

      • Convert page contents with a hash function

      • Compare new pages to the hash table
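
Duplicate detection by hashing can be sketched as follows. SHA-256 is an assumed choice of hash function and the URLs are hypothetical; note that an exact content hash only catches byte-identical copies, not near-duplicates.

```python
import hashlib

seen = {}  # content digest -> first URL where that content was found

def is_duplicate(url, content):
    """Hash the page contents and compare against the table of known digests."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen:
        return True
    seen[digest] = url
    return False

first = is_duplicate("http://www.example.com/", "<html>Hello</html>")
mirror = is_duplicate("http://mirror.example.com/", "<html>Hello</html>")
other = is_duplicate("http://www.example.com/about", "<html>About</html>")
```

The mirror is flagged because its contents hash to a digest already in the table, even though its URL differs.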


    Web Crawling Issues/Challenges

    • Lots of other problems

      • Server unavailable; incorrect html; missing links; attempts to “fool” search engine by giving crawler a version of the page with lots of spurious terms added ...

    • Web crawling is difficult to do robustly!


    The Deep (Hidden) Web

    • Pages that do not actually exist as such: they are created dynamically as a result of a request/query to a specific application that most likely uses a DBMS

    • Content in the deep Web is massive

    • For a Web page to be discovered by a crawler, it must be static and linked


    Perspective on Crawlers/Engines

    • Web content is getting more

      • Volatile

        • Frequent updates in content and/or location

        • New Web sites appear and existing ones disappear on a daily basis

      • Dynamic

        • Content produced by database-driven applications

    • These are the same challenges faced by

      • Caching proxies

      • Content distribution networks



    Today’s Topics

    • Web Search – partially adapted from Alexandros Biliris (adjunct here)

    • Semantic Web – partially adapted from York Sure (University of Karlsruhe)


    Simplicity is Good

    • The World Wide Web contains huge amounts of information created by many different organizations, communities and individuals for many different reasons

    • Web users can easily access this information by specifying URI (Uniform Resource Identifier) addresses or using a search engine, and following links to find other related resources

    • This simplicity is a key aspect that made the Web so popular


    Simplicity is Bad

    • The simplicity of the current Web has a price

    • It is very easy to get lost, or discover irrelevant or unrelated information

    • For instance, if we search for courses taught by a person named “Gail Kaiser”, we might find all kinds of other information

    • http://www.google.com/search?q=courses+taught+by+gail+kaiser&sourceid=navclient-ff&ie=UTF-8&rlz=1B3GGGL_enUS253US253

    • The problem is that the search engine does not know what “courses” or “taught” means


    Machine-Accessible Meaning (What It’s Like to Be a Machine)

    [Figure: a page whose parts are labeled name, education, CV, work and private]



    So what does this mean?

    • What’s a “CV”?

    • What’s a “name”?

    • Etc.

    • Need semantics


    Semantic Web

    The Semantic Web is not a separate web but an extension of the current web, in which information is given well-defined meaning, better enabling computers and people to work in co-operation. [Berners-Lee et al., 2001]


    Semantic Web Layers (T. Berners-Lee)



    Start with XML, not HTML

    HTML:

    <H1>WHIM</H1><UL>

    <LI>Instructor: Gail Kaiser<LI>Students: George Bush</UL>

    XML:

    <course><title>WHIM</title><instructor>Gail Kaiser</instructor><students>George Bush</students></course>
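
The practical difference is that a program can address the XML content by element name. A small sketch with Python's standard `xml.etree.ElementTree`, using the XML from the slide:

```python
import xml.etree.ElementTree as ET

xml_doc = ("<course><title>WHIM</title>"
           "<instructor>Gail Kaiser</instructor>"
           "<students>George Bush</students></course>")

course = ET.fromstring(xml_doc)

# Fields are looked up by tag name rather than by their position in a bullet list.
title = course.findtext("title")
instructor = course.findtext("instructor")
```

The HTML version would require guessing that the text after "Instructor:" in the first list item is the instructor's name. The program still has no idea what "instructor" means; it only knows where to find it.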



    Why Not Use XML Tags For Semantics?

    <title> … </title>

    • But what does “title” mean?

    • If we ask google, we get (on the 1st page)

      • title element of an html document

      • a prefix or suffix added to a person's name

      • a company that sells boxing gear

      • a gym club for boxers


    XML Limitations for Semantic Markup

    • XML makes no commitment on:

      (1) Domain-specific vocabulary

      (2) Modeling primitives

    • Requires pre-arranged agreement on (1) & (2)


    XML Limitations for Semantic Markup

    • Only feasible for closed collaboration

      • agents in a small & stable community

      • pages on a small & stable intranet

    • Not suited for sharing Web resources


    XML Machine-Accessible Meaning

    [Figure: a page whose parts are wrapped in the XML tags <name>, <education>, <CV>, <work> and <private>]



    Semantic Web Layers



    RDF for Semantic Annotation

    [Figure: an RDF graph over the nodes http://www.aifb.uni-karlsruhe.de/WBS/ysu, York, 6086592, W3C and http://www.w3.org/RDF, with edges labeled site-owner, tel and explains]

    <rdf:Description rdf:about="#York">

    <tel>6086592</tel>

    </rdf:Description>

    • RDF (Resource Description Framework) provides metadata about Web resources

    • Triples with Subject (or Resource) / Predicate (or Property) / Object (or Value)

    • XML syntax

    • Chained triples form a graph
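
Triples and their chaining can be modeled with plain tuples. In this sketch the node and predicate names come from the slide, but the edge directions are assumed for illustration, and the `objects` and `follow` helpers are names invented for the example.

```python
PAGE = "http://www.aifb.uni-karlsruhe.de/WBS/ysu"

# (subject, predicate, object); the edge directions are an assumption.
triples = [
    (PAGE, "site-owner", "#York"),
    ("#York", "tel", "6086592"),
    (PAGE, "explains", "http://www.w3.org/RDF"),
]

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def follow(start, *predicates):
    """Walk a chain of predicates through the graph, starting from one node."""
    nodes = [start]
    for predicate in predicates:
        nodes = [o for n in nodes for o in objects(n, predicate)]
    return nodes

owner_phone = follow(PAGE, "site-owner", "tel")
```

Because the object of one triple is the subject of another, a query can walk from the page to its owner and on to the owner's telephone number.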



    RDF Schema

    • Defines vocabulary for RDF

    • Organizes this vocabulary in a typed hierarchy

      • Class, subClassOf, type

      • Property, subPropertyOf

      • domain, range

    [Figure: classes PhDStud and Professor, each a subClassOf Person; the property hasSuperVisor with domain and range edges; instances York and Rudi, each with a type edge, linked by hasSuperVisor]



    RDF Schema Syntax in XML

    <rdf:Description ID="MotorVehicle">

    <rdf:type resource="http://www.w3.org/...#Class"/>

    <rdfs:subClassOf rdf:resource="http://www.w3.org/...#Resource"/>

    </rdf:Description>

    <rdf:Description ID="Truck">

    <rdf:type resource="http://www.w3.org/...#Class"/>

    <rdfs:subClassOf rdf:resource="#MotorVehicle"/>

    </rdf:Description>

    <rdf:Description ID="registeredTo">

    <rdf:type resource="http://www.w3.org/...#Property"/>

    <rdfs:domain rdf:resource="#MotorVehicle"/>

    <rdfs:range rdf:resource="#Person"/>

    </rdf:Description>

    <rdf:Description ID="ownedBy">

    <rdf:type resource="http://www.w3.org/...#Property"/>

    <rdfs:subPropertyOf rdf:resource="#registeredTo"/>

    </rdf:Description>
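
What a program gains from such a schema can be hinted at with a small Python sketch. It copies the two hierarchy facts from the XML above into dictionaries and follows them upward; the helper names are invented for the example, and the sketch assumes the hierarchies contain no cycles.

```python
# Facts taken from the schema above, written as plain dictionaries.
sub_class_of = {"Truck": "MotorVehicle"}
sub_property_of = {"ownedBy": "registeredTo"}

def ancestors(name, parent_of):
    """Follow parent links upward until a name has no recorded parent."""
    found = []
    while name in parent_of:
        name = parent_of[name]
        found.append(name)
    return found

def is_a(cls, candidate):
    """True if cls is candidate or a (transitive) subclass of it."""
    return cls == candidate or candidate in ancestors(cls, sub_class_of)

truck_is_vehicle = is_a("Truck", "MotorVehicle")
implied_properties = ancestors("ownedBy", sub_property_of)
```

Under the usual RDF Schema reading, every Truck is also a MotorVehicle, and anything stated with ownedBy also holds for registeredTo.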



    Higher-order Statements

    • One can make RDF statements about other RDF statements

    • Example: “Cinderella believes that the web contains one billion documents”

    • Allow us to express beliefs (and other modalities)

    • Important for trust models, digital signatures, etc.

    • Constitute metadata about metadata

    • Represented by modeling RDF in RDF itself


    Reification

    [Figure: a dotted box encloses the statement linking http://www.w3.org/TR/REC-rdf-syntax to “Eric Miller” by dc:Creator; a second dc:Creator edge leads to “Library of Congress”]

    • Reification allows a computer to process an abstraction as if it were any other datum

    • RDF is not really second-order

    • But it does provide a built-in predicate vocabulary for reification

    • The dotted box corresponds to the following statements

      • { x, rdf:predicate, “dc:creator” }

      • { x, rdf:subject, “http://www.w3.org/TR/REC-rdf-syntax” }

      • { x, rdf:object, “Eric Miller” }

      • { x, rdf:type, “rdf:Statement” }
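
The four reification triples can be generated mechanically. In this Python sketch the `reify` helper is a name invented for the example, `rdf:Statement` is the class name used by the RDF vocabulary, and the final statement about `x` is an assumed reading of the figure.

```python
def reify(statement_id, subject, predicate, obj):
    """Describe one triple with four triples about a statement node."""
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]

graph = reify("x", "http://www.w3.org/TR/REC-rdf-syntax", "dc:creator", "Eric Miller")

# A statement about the statement itself (assumed reading of the figure).
graph.append(("x", "dc:creator", "Library of Congress"))

about_x = [(p, o) for s, p, o in graph if s == "x"]
```

Once the original statement has a node of its own, further triples can use that node as their subject, which is how metadata about metadata is expressed.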



    Reification

    • Any statement can be an object (graphs can be nested)

    [Figure: NYT claims the statement that pers05 is Author-of ISBN...]

    <rdf:Description rdf:about="#NYT">

    <claims>

    <rdf:Description rdf:about="#pers05">

    <authorOf>ISBN...</authorOf>

    </rdf:Description>

    </claims>

    </rdf:Description>



    Conclusions about RDF

    • Next step up from plain XML

      • modeling primitives

      • possible to define vocabulary

    • However:

      • no precisely described meaning

      • no inference model


    Where do we get the precisely defined meaning?

    • Two databases may use different identifiers for the same concept, such as zip code

    • A program that wants to compare or combine information across the two databases has to know that these two terms mean the same thing

    • The program must have a way to discover such common meanings for whatever databases it encounters

    • A solution to this problem is provided by collections of information called ontologies


    Semantic Web Layers



    What is an Ontology?

    • In philosophy, an ontology is a theory about the nature of existence, of what types of things exist; ontology as a discipline studies such theories

    • Semantic Web researchers (and various other communities) have co-opted the term for their own jargon

    • For semantic web researchers, an ontology is a document or file that formally defines the relations among terms

    • The most typical kind of ontology for the Web has a taxonomy and a set of inference rules


    What is a Taxonomy?

    Taxonomy = Segmentation, classification and ordering of elements into a classification system according to their relationships between each other

    [Figure: a taxonomy tree with the nodes Object, Person, Topic, Document, Student, Researcher, Semantics, Doctoral Student, PhD Student, F-Logic and Ontology]


    Taxonomies

    • A taxonomy defines classes of objects and relations among them

    • For example, an address may be defined as a type of location, and city codes may be defined to apply only to locations

    • If city codes must be of type city and cities generally have Web sites, we can discuss the Web site associated with a city code even if no database links a city code directly to a Web site


    An Ontology also provides a form of Thesaurus

    • Terminology for specific domain

    • Graph with primitives, 2 fixed relationships (similar, synonym)

    [Figure: the nodes Object, Person, Topic, Document, Student, Researcher, Semantics, Doctoral Student, PhD Student, F-Logic and Ontology, with edges labeled synonym and similar]


    An Ontology also provides a Topic Map

    • Topics (nodes), relationships and occurrences (to documents)

    • Useful for navigation and visualisation

    [Figure: the nodes Object, Person, Topic, Document, Student, Researcher, Semantics, Doctoral Student, PhD Student, F-Logic and Ontology, plus the labels Tel and Affiliation, with edges labeled knows, described_in, writes, similar and synonym]


    The Taxonomy is Augmented by Inference Rules

    • W3C Ontologies: OWL = Web Ontology Language

    [Figure: the taxonomy (Object, Person, Topic, Document, Student, Researcher, Semantics, Doctoral Student, PhD Student, F-Logic, Ontology) with edges labeled is_a, subTopicOf, similar, knows, writes, described_in, is_about and instance_of; the labels Tel and Affiliation; the instance Swapneel Sheth; and a Rules panel relating the variables P, D and T through writes, is_about, described_in, similar and knows]



    Inference Rules

    • An ontology may express the rule “If a city code is associated with a state code, and an address uses that city code, then that address has the associated state code”

    • A program could then deduce, for instance, that a Columbia University address, being in New York City, must be in New York State, which is in the U.S., and therefore should be formatted to U.S. standards
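
The deduction described above is a chain of lookups. A minimal Python sketch, in which all three fact tables and their codes are hypothetical placeholders rather than real postal data:

```python
# Hypothetical facts; the codes are placeholders, not real postal data.
city_of_address = {"Columbia University": "NYC"}
state_of_city = {"NYC": "NY"}
country_of_state = {"NY": "US"}

def infer_location(address):
    """Apply the rule: an address inherits the state associated with its city code."""
    city = city_of_address[address]
    state = state_of_city[city]        # city code -> associated state code
    country = country_of_state[state]  # state code -> associated country
    return city, state, country

location = infer_location("Columbia University")
```

No table links the address directly to a state or a country; both facts are derived by applying the rules to what is stored.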


    Inference Rules

    • The computer doesn't truly “understand” any of this information

    • But it can now manipulate the terms much more effectively in ways that are useful and meaningful to the human user


    Solution to Terminology Problems

    • The meaning of terms or XML tags used on a Web page can be defined by pointers from the page to an ontology

    • The same problems as before now arise if I point to an ontology that defines addresses as containing a zip code and you point to one that uses postal code

    • This can be resolved if ontologies (or other Web services) provide equivalence relations: one or both of our ontologies may contain the information that my zip code is equivalent to your postal code
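
An equivalence relation lets a program rewrite both vocabularies into one before comparing records. In this Python sketch the prefixed term names (`mine:zipCode`, `yours:postalCode`), the helper names and the sample value are all hypothetical.

```python
# Hypothetical equivalence declarations between terms of two ontologies.
equivalent = [("mine:zipCode", "yours:postalCode")]

def canonical(term):
    """Map every term in an equivalence pair to the pair's first member."""
    for a, b in equivalent:
        if term in (a, b):
            return a
    return term

def normalize(record):
    """Rewrite a record's keys into canonical terms."""
    return {canonical(key): value for key, value in record.items()}

my_record = {"mine:zipCode": "10027"}
your_record = {"yours:postalCode": "10027"}

same = normalize(my_record) == normalize(your_record)
```

Without the equivalence declaration the two records share no key and could not be compared at all.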


    Using Ontologies

    • Ontologies can be used in a simple fashion to improve the accuracy of Web searches

    • The search program can look for only those pages that refer to a precise concept instead of all the ones using ambiguous keywords

    • More advanced applications could use ontologies to relate the information on a page to the associated knowledge structures and inference rules


    Example

    • Suppose you wish to find the Ms. Cook you met at a trade conference last year

    • You don't remember her first name, but you remember that she worked for one of your clients and that her brother was a student at your alma mater


    Example

    • An intelligent search program can sift through all the pages of people whose name is “Cook”

    • Sidestep all the pages relating to cooks, cooking, the Cook Islands and so forth

    • Find the people named Cook who mention working for a company that's on your client list

    • And follow links to Web pages of their relatives to track down if any are in school at the right place


    Another Example

    • When you answer your phone, other sound is automatically turned down

    • Instead of having to program each specific appliance, you could program such a function once and for all to cover every local device that advertises having a volume control — the TV, the DVD player, the media players on the laptop, …


    Semantic Web Layers

    [Figure: the layer diagram with layers annotated as Standard, RFC or Work in Progress]


    Semantic Web Layers

    • The top layers, Logic, Proof and Trust, are “under development”

    • The Logic layer will enable the writing of rules

    • The Proof layer will execute the rules

    • The Trust layer together with the Digital Signature layer will provide mechanisms for applications to determine whether to trust the given proof or not


    Summary

    • Semantic Web concepts may, someday, dramatically improve Web search


    Upcoming: Revised Project Proposal

    • Due Monday March 31st

    • No more than four (4) pages

    • Post in Revised Project Proposals folder on CourseWorks


    Revised Project Proposal: New or Extended System

    • Explain what your system will “do”

    • Describe “value” to prospective user community

    • Sketch the top-level architecture, including hosts, processes and major subsystems

    • Diagram and briefly explain the communications flows, including protocols to be used and typical messaging sequences - including "error" cases

    • Clarify any components that you are not implementing yourself (include URL)


    Revised Project Proposal: Comparison/Evaluation

    • Clearly indicate which system(s) you will be evaluating (and how you will obtain)

    • Explain what you plan to measure and how you will measure it (either quantitative or qualitative)

    • Define what criteria you will use – and why are these significant or important

    • Sketch the top-level architecture of those systems as they will operate during your experiments

    • Discuss the design of your test application(s) and/or benchmark(s)


    Revised Project Proposals

    • Briefly discuss what you expect to be able to show in a <15 minute demo

    • Schedule with your TA between April 22nd and May 6th

    • TAs assigned per group – not necessarily same TA as for paper

    • Final reports due Friday May 9th


    Upcoming: Student Presentations

    • Topic can be paper, project, or something else relevant to class

    • If project, coordinate with any other team members (e.g., schedule back-to-back)

    • No more than 10 minutes

    • During class time April 1st, 8th, 15th, 22nd or 29th for on-campus students

    • Contact instructor by email to schedule if not already ([email protected])

    • Last year’s slides available at http://bank.cs.columbia.edu/classes/cs6125-s07/presentations

    • You need to provide the instructor with your slides, preferably at least 24 hours in advance, but in any case no later than 24 hours afterwards (needed by CVN students)



    Reminders

    • Revised project proposal due March 31st

    • Schedule your presentation with the instructor - only 22nd or 29th still available for on-campus students (and any “local” CVN students who can come to class)

    • CVN students should email instructor to arrange teleconference asap
