Stop Word and Related Problems in Web Interface Integration

Eduard C. Dragut (speaker), Fang Fang, Clement Yu, Prasad Sistla
University of Illinois at Chicago

Weiyi Meng
SUNY at Binghamton

VLDB 2009, Lyon, France


Objectives

  • Address the problem of automatically identifying the set of stop words in a given application domain.

    • “Stop words is the name given to words which are filtered out prior to, or after, processing of natural language data (text)”, wikipedia.org, answers.com

    • Hans Peter Luhn is credited with coining the phrase.

  • Establish semantic relationships between multi-word phrases beyond those in electronic dictionaries (e.g., WordNet)

    • We focus on synonymy and hyponymy/hypernymy relationships

  • Analyze the impact of words such as "and" and "or" when establishing semantic relationships

    • E.g., is "drop-off date and time" a hyponym of "date and time"?


A Motivating Scenario for Integration

  • Looking for the cheapest ticket

    • Chicago – Paris, August 20th – August 29th

    • Sources: united.com, BritishAirline.com, AirFrance.com

  • A user looking for the “best” price for a ticket:

    • Has to explore multiple sources

    • It is tedious, frustrating and time-consuming


The Goal

  • Provide a unified way to query multiple sources in the same domain

[Figure: the user formulates the query on a unified query interface that sits in front of the Web sources AirFrance.com, Lufthansa.com, united.com, delta.com and nwa.com]


Overview of Integrating Web Interfaces

[Figure: pipeline for integrating Web query interfaces]

  • (Deep) Web

  • Extract query interfaces [He05, Zhang04, Dragut09]

    • Various formats, e.g. ASCII files

  • Cluster query interfaces (Auto, Car Rental, Books, Airfare) [Barbosa07, He04, Peng04]

  • Match query interfaces [B.He03, Dhamankar04, Doan02, Madhavan05, Wu04, Wu06]

  • Integration of interfaces [H.He03, Dragut06]


Motivation for Stop Words

  • Automating the process of identifying the set of stop words

  • Establishing semantic relationships between labels

    • Stop words express important semantic information, and their removal may lead to erroneous logical inferences

    • Stop word removal may leave some labels empty

      • Issue: No semantic relationships can be established using empty labels


Motivation for Stop Words, cont'd

  • Stop words are domain dependent, i.e., a stop word in one domain may not be a stop word in another domain.

    • The word "where" is a stop word in the Credit Card domain, but not in the Airline domain


Motivation for Semantic Enrichment Words

  • The labels of attributes may contain the words AND, OR and the characters “/”, “&”

  • Questions:

    • What are their semantics?

    • Where are they used, in the labels of fields or in the labels of sections?

    • How should they be handled when semantic relationships are established?

      • Is “Pick-up Date & Time” a hyponym of “Dates & Times”?

      • Is “Pick-up Date” a hyponym of “Pick-up Date & Time”?


Motivation for Semantic Relationships

  • Goal:

    • Provide a systematic way to distinguish between synonymy and hyponymy relationships

  • Usage:

    • Schema matching

    • Naming the attributes of an integrated query interface [Dragut 06], as part of Web interface integration

      • The main motivation.

    • Integration of hierarchies

      • Two synonym concepts from distinct hierarchies are collapsed into one concept in the integrated hierarchy


The Stop Word Problem - Solution

  • The Problem:

    • Given a set of query interfaces in the same application domain (e.g., real estate), determine those words within the labels of the query interfaces that are stop words

  • The input:

    • A set of query interfaces in the same domain

      • E.g. Airline domain: Delta, AA, NWA, Orbitz, Travelocity

      • Each query interface is represented hierarchically [Wu04]


The Stop Words Problem - Solution

  • The main heuristic observation:

    • The set of stop words from an Information Integration perspective is a subset of the set of stop words from an Information Retrieval perspective

      • E.g., the word "last" is a stop word from an IR perspective, but it is not a stop word in the label "Last Name".

  • The strategy

    • Take an arbitrary general purpose dictionary of stop words and find its largest subset satisfying constraints specific to the information integration problem.

    • General dictionary of stop words obtained through a Google search

      • E.g., dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words


The Stop Words Problem - Solution

  • The constraints

    • Removing a word that is not actually a stop word can lead to one of the following situations:

      • Empty label - A non-empty label becomes empty after the removal. It cannot be used to derive any knowledge.

      • Homonymy - Two sibling nodes in a hierarchy have synonym labels.

      • Hyponymy - Two sibling nodes in a hierarchy have hyponym labels.

  • Example:


The Stop Words Problem - Solution

  • The Stop Word Problem is intractable: it is NP-complete.

  • Worse, regardless of the subset of constraints chosen the problem remains “equally” hard.

  • Common practice

    • Come up with an approximation algorithm

      • Not covered.

    • The proposed algorithm produces a maximal set of stop words with respect to the stop word constraints.

      • The algorithm performance will be discussed in the experimental part.
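The idea of a maximal stop word set can be illustrated with a small greedy sketch. This is not the authors' algorithm, which the slides do not spell out. It assumes each interface is given as groups of sibling labels (lists of lowercase words), checks only two simplified constraints (no label becomes empty, and no two sibling labels that were unrelated become equal or one a subset of the other), and uses hypothetical toy data in `GROUPS` and `CANDIDATES`.

```python
from itertools import combinations

# Hypothetical sibling groups taken from query interface hierarchies.
GROUPS = [
    [["from"], ["to"]],
    [["first", "name"], ["last", "name"]],
    [["number", "of", "adults"], ["number", "of", "children"]],
    [["type", "of", "card"]],
]
# Hypothetical excerpt of a general-purpose (IR) stop word list.
CANDIDATES = ["from", "to", "of", "first", "last", "the"]


def violates(groups, stop):
    """True if removing `stop` breaks one of the simplified constraints."""
    for siblings in groups:
        originals = [frozenset(label) for label in siblings]
        reduced = [label - stop for label in originals]
        if any(not label for label in reduced):
            return True  # empty-label constraint
        for i, j in combinations(range(len(siblings)), 2):
            was_related = originals[i] <= originals[j] or originals[j] <= originals[i]
            now_related = reduced[i] <= reduced[j] or reduced[j] <= reduced[i]
            if now_related and not was_related:
                return True  # siblings collapsed into equal or nested labels
    return False


def maximal_stop_words(groups, candidates):
    """Greedily keep every candidate whose removal violates no constraint."""
    stop = set()
    for word in candidates:
        if not violates(groups, stop | {word}):
            stop.add(word)
    return stop


print(sorted(maximal_stop_words(GROUPS, CANDIDATES)))  # ['of', 'the']
```

Because a violation caused by a set of removed words persists for every larger set, a single greedy pass yields a set to which no rejected candidate can be added later. That is the sense of "maximal" used in this sketch; it need not be the largest possible set.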


Semantic Relationships Among Labels

  • The goal is to devise a methodology for establishing synonymy and hyponymy relationships between multi-word phrases.

  • Why is the problem of establishing semantic relationships between labels (names) difficult in practice?

    • Is it because, in a given application domain, a content word occurs with multiple senses with respect to an (electronic) dictionary (e.g., WordNet [Fellbaum98])?

      • E.g. Select an area vs. Minimum floor area

    • Is it because of the context of usage of words?

      • E.g. Home address vs. Business address

    • Is it because of the occurrence of the semantic enrichment words?

      • E.g., Pick-up date and time vs. Pick-up date

      • E.g., Date and time vs. Pick-up date and time


The Sense of a Word in a Domain

Domain      | Word    | Labels
Credit Card | Address | Home address, Company address, Email address
Credit Card | Type    | 3rd party credit card type, Major credit card type
Real estate | Type    | Property type, Parcel type, Type of use
Real estate | Area    | Select an area, Minimum floor area

  • To better see the number of meanings of content words

    • Create inverted lists of labels for each domain used in our experiments

      • 9 domains were used. There are 735 distinct words and 2,319 labels.

    • Manually check the number of meanings of each word.

  • Finding: Only one word (i.e., the word “area” in the Real estate domain) out of 735 words has multiple senses in the same application domain.

  • Assumption:

    • Each word has a unique sense in a given domain.
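The inverted lists mentioned above can be sketched as follows. This is a minimal illustration, assuming labels are plain strings grouped by domain; the function name `inverted_lists` and the sample data (taken from the table on this slide) are for illustration only.

```python
from collections import defaultdict


def inverted_lists(labels_by_domain):
    """Map each domain to a dictionary word -> set of labels containing the word."""
    index = defaultdict(lambda: defaultdict(set))
    for domain, labels in labels_by_domain.items():
        for label in labels:
            for word in label.lower().split():
                index[domain][word].add(label)
    return index


labels = {
    "Real estate": ["Property type", "Parcel type", "Type of use",
                    "Select an area", "Minimum floor area"],
    "Credit Card": ["Home address", "Company address", "Email address"],
}
index = inverted_lists(labels)
# All labels in which "area" occurs, ready for a manual check of its senses.
print(sorted(index["Real estate"]["area"]))
```

Listing the labels per word makes it easy to inspect by hand whether a word such as "area" is used with more than one sense within a domain.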


Dictionary Senses versus Context of Use

  • An example:

    • Consider the noun Address in the following labels:

      • Home Address, Company Address, Relative’s Address, Email Address

      • Address has the same meaning in all of them, according to Wordnet:

        • “the place where a person or organization can be found or communicated with”

      • The dictionary sense alone will wrongly suggest that Home Address is a hyponym of Address

  • (Electronic) Dictionaries are limited

    • The context of a label also needs to be taken into consideration

    • The context of a label of an internal node is the set of its descendant leaves


Defining Semantic Relationships

  • Normalization [e.g., He03 et al., Madhavan01 et al., Rahm01 et al.]

    • E.g., Adults (18-64) becomes adult

  • A label is seen as a set of normalized content words

    • E.g., {area, study} corresponds to Area of Study

    • E.g., {field, work} corresponds to Field of Work

  • Informally, a label A is a synonym of a label B if their sets of content words are "equal" (i.e., corresponding words may be synonyms rather than identical)

    • Area of Study is a synonym of Field of Work

      • Area is a synonym of Field (by WordNet)

      • Study is a synonym of Work (by WordNet)
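A minimal sketch of turning a label into a set of normalized content words is given below. The slides do not specify the normalization steps, so the rules here (lowercasing, dropping parenthesized qualifiers, removing stop words, and a deliberately naive plural stripping) are assumptions made for illustration.

```python
import re


def normalize(label, stop_words=frozenset()):
    """Return the set of normalized content words of a label."""
    # Drop parenthesized qualifiers such as "(18-64)" and lowercase the rest.
    text = re.sub(r"\([^)]*\)", " ", label.lower())
    content = set()
    for word in re.findall(r"[a-z]+", text):
        if word in stop_words:
            continue
        # Naive singularization; a real system would use a proper stemmer.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        content.add(word)
    return content


print(normalize("Adults (18-64)"))                 # {'adult'}
print(sorted(normalize("Area of Study", {"of"})))  # ['area', 'study']
```

With labels represented this way, comparing two labels reduces to comparing two small sets of words.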


Defining Semantic Relationships

    • Informally, a label A is a hypernym of a label B if the set of content words of A is a "subset" of that of B, meaning that the words of A can be mapped into those of B using equality, synonymy, or hypernymy relationships.

      • The intuition is that additional words usually restrict the meaning of a phrase

    • Example:

      • Financial Information is a hypernym of Household Financial Information

      • Employment Information is a hypernym of Job Information

        • Employment is a hypernym of Job (by WordNet)


Computing Semantic Relationships

    • Between two sets A and B, with A and B having n and m elements (n ≤ m), respectively, there can be a factorial number of mappings.

      • A brute force enumeration algorithm takes exponential time.

    • Solution sketch:

      • Convert the problem to bipartite matching problems

        • The vertices of the graph correspond to the content words of the labels.

        • An edge corresponds to two words of the two labels being either equal, synonyms or hyponyms.

        • The trick to distinguish a synonymy relationship from a hyponymy one is:

          • To assign a weight of 1 to edges denoting equality or synonymy relationships and a weight of 2 to edges denoting hyponymy relationships.

        • When |A| = |B| (|A| = number of content words of A) , a synonymy relationship corresponds to a maximum weighted bipartite matching whose weight is equal to |A|.

        • When |A| = |B| a hyponymy relationship corresponds to a maximum weighted bipartite matching whose weight is larger than |A|.

        • When |A| < |B| a hyponymy relationship corresponds to a maximum bipartite matching whose weight is equal to |A|.
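The weighting scheme above can be sketched with a maximum weight assignment. The code below is an illustrative sketch, not the authors' implementation. It assumes a toy word-level dictionary (`SYNONYMS`, `HYPERNYMS`) in place of WordNet, uses a textbook Hungarian algorithm for the matching, and, for |A| < |B|, treats any matching that covers all words of A as evidence of a hypernymy/hyponymy relationship.

```python
# Toy word-level relations standing in for WordNet lookups (hypothetical data).
SYNONYMS = {frozenset(("area", "field")), frozenset(("study", "work"))}
HYPERNYMS = {("employment", "job")}  # (hypernym, hyponym) pairs
MISSING = 1000  # cost of a non-existent edge, larger than any real cost


def edge_weight(a, b):
    """1 for equality or synonymy, 2 if a is a hypernym of b, 0 if unrelated."""
    if a == b or frozenset((a, b)) in SYNONYMS:
        return 1
    return 2 if (a, b) in HYPERNYMS else 0


def assign(cost):
    """Hungarian algorithm: min-cost assignment of each row to a distinct column."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0] * (n + 1), [0] * (m + 1)
    p, way = [0] * (m + 1), [0] * (m + 1)  # p[j]: row matched to column j (1-based)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the augmenting path back to the new row
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j]]


def relationship(a_words, b_words):
    """Classify label A against label B as 'synonym', 'hypernym' or 'none'."""
    a, b = sorted(a_words), sorted(b_words)
    if not a or len(a) > len(b):
        return "none"
    weights = [[edge_weight(x, y) for y in b] for x in a]
    cost = [[-w if w else MISSING for w in row] for row in weights]
    matched = [weights[i][j] for i, j in assign(cost)]
    if 0 in matched:
        return "none"  # some word of A cannot be mapped into B
    if len(a) == len(b) and sum(matched) == len(a):
        return "synonym"  # maximum weight equals |A|: only weight-1 edges
    return "hypernym"  # A is a hypernym of B


print(relationship({"area", "study"}, {"field", "work"}))                 # synonym
print(relationship({"employment", "information"}, {"job", "information"}))  # hypernym
```

Negating the weights turns the maximum weight matching into a minimum cost assignment, and the large `MISSING` cost ensures that a matching over real edges is always preferred when one exists.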


Computing Semantic Relationships

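    • Examples:

      • Synonymy as a perfect matching: Area of Study vs. Field of Work (Area matched with Field, Study matched with Work)

      • Hyponymy as a maximum weighted bipartite matching: Employment Information vs. Job Information (Employment matched with Job by a hyponym edge, Information matched with Information)

      • Hyponymy as a maximum bipartite matching: Financial Information vs. Household Financial Information (Financial matched with Financial, Information matched with Information, Household left unmatched)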


Semantic Enrichment Words, briefly

    • In the presence of semantic enrichment words (i.e., "and" and "or"), the intuition that additional words restrict the meaning of a phrase is no longer true

    • Examples:

      • Pick-up date is a hyponym of Pick-up date and time

      • City or airport code is a hyponym of City, point of interest or airport code

    • Some observations:

      • AND appears frequently (91.3%) among the labels of the internal nodes

      • OR appears frequently (96%) among the labels of the (fields) leaf nodes


Experiments

    • Goals:

      • Evaluate the approximation algorithm for computing the dictionary of stop words.

      • Assess the ability of the proposed methods to establish semantic relationships.

      • Determine the impact of stop words on determining semantic relationships.


Experiments

Domain      | # interfaces | Avg. # fields per interface | Avg. # internal nodes per interface | Avg. depth of interfaces
Airfare     | 20           | 10.7                        | 5.1                                 | 3.6
Automobile  | 20           | 5.1                         | 1.7                                 | 2.4
Book        | 20           | 5.4                         | 1.3                                 | 2.3
Job         | 20           | 4.6                         | 1.1                                 | 2.1
Real Estate | 20           | 6.5                         | 2.4                                 | 2.7
Car Rentals | 20           | 10.4                        | 2.4                                 | 2.5
Hotels      | 30           | 7.6                         | 2.4                                 | 2.3
Credit Card | 20           | 50.15                       | 20.25                               | 3.6
Alliances   | 50           | 15.3                        | 8.32                                | 3.58

    • Setup

      • 9 real world domains from the web

      • Parts of the data set were also used in Wu06 et al., Madhavan05 et al., and Dragut06 et al.


Experiments: Gold Standard Stop Words

    • How was the gold standard created?

      • Following the intuition:

        • A word is not a stop word if there is a label whose meaning changes so “drastically” after the removal of the word from the label that the new label does not resemble in any way the original meaning of the label.

      • Examples:

        • The word "yourself" in the Credit Card domain is not a stop word because of labels such as "Please tell us about yourself"

        • The word "who" in the Airline domain is not a stop word because of labels such as "Who is going in this trip?"


Experiments: Evaluating Stop Words

    • From left to right: Precision, Recall, F-score


Experiments: Discussion on Stop Words

Domain      | Found non-stop words                             | Missed non-stop words
Airfare     | first, last, from, to, when, and, or             | where, who
Alliances   | from, to, on, yourself, no, for, there, and, or  | where, when, who, by
Auto        | first, last, from, to, within, or                | -
Book        | first, last, before, or                          | after
Car Rental  | to, and, or                                      | from, last
Credit Card | first, last, per, and, or                        | yourself
Real Estate | to, from, or                                     | -

    • Examples of non-stop words commonly regarded as stop words

    • Why do we miss some of them?


Experiments: Semantic Relationships

    • The gold standard

      • Manually created for each of the 9 domains.

      • Contains 7,544 relationships: 4,103 (54.4%) are synonymy relationships and 3,441 (45.6%) are hypernymy/hyponymy relationships.


Experiments: The Naïve Algorithm

    • It uses only the dictionary senses of individual words

    • Why is the accuracy so poor and ranging over such a large interval (from 39% to 97.3%)?

      • It compares labels without taking into consideration their contexts.

      • It blindly establishes semantic relationships between labels that share some words.


Experiments: The Improved Algorithm

    • It combines the context of labels and semantic enrichment words.

    • F-score ranges from 82.1% to 99.3%, with the mean at 92.6% and a standard deviation of 5.9%.

    • The naive algorithm has a mean F-score of 74.9% and a standard deviation of 18.5%.

    • It improves the average precision to 95%, the average recall to 90.4% and the average F-score to 92.6%.
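As a quick sanity check of the reported averages, the sketch below assumes the F-score is the usual balanced F1, i.e. the harmonic mean of precision and recall; the slides do not state the formula explicitly.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average precision and recall reported for the improved algorithm.
print(round(f_score(0.95, 0.904), 3))  # 0.926
```

The harmonic mean of the average precision and recall happens to agree with the reported average F-score of 92.6%. In general, an average of per-domain F-scores need not equal the F-score of the averaged precision and recall.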


Experiments: Where Do the Problems Lie?

    • Words and phrases that are commonly perceived as synonyms but are not recorded in electronic dictionaries such as WordNet.

      • E.g., drop-off and return are synonyms in the Car Rental domain but not according to WordNet

    • Many labels are complex sentences

      • E.g. “So, what do you do for a living?”, “How flexible are you?”.


Experiments: What Else Did We Try?

Domain      | Label                            | Relationship | Label
Airfare     | Outbound                         | Syn          | Origin date
Airfare     | How flexible are you?            | Hyp          | Search one day before and after
Car Rental  | End                              | Syn          | Drop-off date
Car Rental  | Pick-up                          | Syn          | Start
Credit Card | 2nd card holder                  | Syn          | Additional authorized user
Credit Card | So, what do you do for a living? | Syn          | Employment Information
Real Estate | Size                             | Hyp          | Square feet

    • Other linguistic techniques were attempted

      • Normalized Google Distance (NGD) [Cilibrasi and Vitanyi 2007]

      • The kernel function for measuring the semantic similarity between pairs of short text snippets [Sahami and Heilman 2006]
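For reference, the Normalized Google Distance of Cilibrasi and Vitanyi can be sketched as below. The hit counts in the usage lines are hypothetical placeholders, not results from the experiments, and the slides do not describe how the measure was actually applied to labels.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts.

    fx, fy: number of pages containing x and y respectively
    fxy:    number of pages containing both x and y
    n:      total number of indexed pages (normalizing constant)
    """
    if fxy == 0:
        return float("inf")  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical counts: the more often two terms co-occur, the smaller the distance.
close = ngd(1000, 2000, 500, 10**9)
far = ngd(1000, 2000, 10, 10**9)
print(close < far)
```

A distance near 0 indicates terms that almost always appear together, while larger values indicate weaker association.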


Experiments: Stop Words & Semantic Relationships

    • We ran the improved algorithm for computing semantic relationships with each of the following four sets of stop words:

      • S1 is the set of stop words produced by our algorithm;

      • S2 is the gold standard of stop words;

      • S3 is the empty set;

      • S4 is a domain independent stop word set used by a typical IR system;

        • We used dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words

    • The outcome:

      • F-score of using S1 is on average 17.6% better than that using S3.

        • The largest difference is 43%.

      • F-score of using S1 is on average 8% better than that using S4.

        • The largest difference is 33%.

      • F-score using S1 is on average 0.03% better than that using S2.

        • This is another way of validating our improved algorithm.


Related Work

    • Synonym and near-synonym relationships between short phrases have been recently studied [Bollegala et al. 2007, Sahami and Heilman 2006]

    • There is a great deal of work on representing the meaning of words (not phrases) in various areas of research: linguistics, computer science, cognitive psychology, etc.

      • Manually created semantic networks: WordNet [Fellbaum 1998] and Cyc [Lenat et al. 1990]

      • Generic methods to measure word similarity or word association

        • Using word frequencies in text corpora [Berland and Charniak 1990, Caraballo 1999, Hearst 1992, Jiang and Conrath 1998, Lin 1998]

        • Using Web search engine counts (hits) to identify lexico-syntactic patterns [Bollegala et al. 2007, Cilibrasi and Vitanyi 2007, Cimiano and Staab 2004]


Related Work, cont'd

    • Schema Matching

      • Surveys [Rahm and Bernstein 2001, Shvaiko and Euzenat 2005]

      • Query interface matching [He and Chang 2003, He et al. 2004, Wang et al. 2004, Wu et al. 2004, 2006]

      • A number of dictionary-based semantic matching techniques for relational/XML schema and ontology alignment [Beneventano et al. 2001, Giunchiglia et al. 2005, Kotis and Vouros 2004]


End

    • Please visit the project web site

      • http://www.cs.uic.edu/~edragut/QIProject.html

    Thank you for your time and patience!

