

Collectively Representing Semi-Structured Data from the Web

Bhavana Dalvi, William W. Cohen and Jamie Callan

Language Technologies Institute

Carnegie Mellon University

Paper ID : 02

This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

Motivation
  • Entities on the Web can appear in multiple datasets, e.g. HTML tables, text documents, etc.
  • Traditional systems represent each entity as a sparse vector of the IDs of the documents in which it occurs.
  • We propose a low-dimensional representation for such entities.
  • This helps to perform different tasks efficiently with a small number of primitive operations:
    • Semi-Supervised Learning (SSL)
    • Set Expansion (SE)
    • Automatic Set Instance Acquisition (ASIA)
Entities in HTML tables

Entity-Column Bi-partite Graph

[Figure: bipartite graph linking table-column nodes (TC-1, TC-2, TC-3, TC-4) to entity nodes (USA, India, Hockey, Cricket, Tennis).]
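
A minimal sketch of how such an entity-column bipartite graph could be stored, assuming the HTML tables have already been parsed into lists of columns. The function name `build_entity_column_graph`, the column-id format and the toy tables are illustrative assumptions, not part of the original system.

```python
from collections import defaultdict


def build_entity_column_graph(tables):
    """Build both sides of an entity / table-column bipartite graph.

    `tables` maps a table id to a list of columns; each column is a list
    of cell strings. Returns (entity -> column ids, column id -> entities).
    """
    entity_to_columns = defaultdict(set)
    column_to_entities = defaultdict(set)
    for table_id, columns in tables.items():
        for col_index, cells in enumerate(columns):
            # Hypothetical id scheme: one node per (table, column) pair.
            column_id = f"{table_id}:col{col_index}"
            for cell in cells:
                entity = cell.strip()
                if entity:  # skip empty cells
                    entity_to_columns[entity].add(column_id)
                    column_to_entities[column_id].add(entity)
    return dict(entity_to_columns), dict(column_to_entities)


# Toy input (assumed data, loosely modelled on the slide's example entities).
tables = {
    "t1": [["USA", "India"], ["Hockey", "Cricket"]],
    "t2": [["India", "USA"], ["Cricket", "Tennis"]],
}
entity_to_columns, column_to_entities = build_entity_column_graph(tables)
```

Keeping both directions of the adjacency makes it cheap to go from an entity to its columns and from a column to the entities that co-occur in it.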

Entities in unstructured text

“Such as” Bi-partite Graph

Example sentences:
  • Countries such as India are developing rapidly in terms of infrastructure.
  • Outdoor sports include Tennis and Cricket.

[Figure: bipartite graph linking “such as” concept nodes (Country, Location, Sports) to entity nodes (USA, India, Hockey, Cricket, Tennis).]
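
One way the “such as” edges could be harvested from text is a simple pattern match, sketched below. This is an assumption about the extraction step, not the authors' extractor: the regular expressions, the function name `extract_such_as_edges` and the second example sentence are hypothetical. The sketch only handles the literal phrase "such as" followed by capitalised single-word entities, and it does not normalise the concept (e.g. "countries" is not mapped to "Country").

```python
import re
from collections import defaultdict

# Separator between listed entities: ", " / ", and " / " and ".
SEP = r"(?:\s*,\s*(?:and\s+)?|\s+and\s+)"
# Simplifying assumption: an entity is one capitalised word.
ENTITY = r"[A-Z]\w*"
SUCH_AS = re.compile(rf"(\w+)\s+such as\s+({ENTITY}(?:{SEP}{ENTITY})*)")
SEPARATOR = re.compile(SEP)


def extract_such_as_edges(sentences):
    """Return concept -> set of entities found via 'X such as A, B and C'."""
    concept_to_entities = defaultdict(set)
    for sentence in sentences:
        for concept, entity_list in SUCH_AS.findall(sentence):
            for entity in SEPARATOR.split(entity_list):
                concept_to_entities[concept.lower()].add(entity)
    return dict(concept_to_entities)


sentences = [
    "Countries such as India are developing rapidly in terms of infrastructure.",
    "Sports such as Tennis, Hockey and Cricket are played outdoors.",  # assumed example
]
edges = extract_such_as_edges(sentences)
```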

Resultant Tri-partite Graph

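[Figure: the “Such as” bipartite graph and the Entity-Column bipartite graph joined on their shared entity nodes. “Such as” concept nodes: Country, Location, Sports. Entity nodes: USA, India, Hockey, Cricket, Tennis. Table-column nodes: TC-1, TC-2, TC-3, TC-4.]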

Encoding the graph

“Entity-Column” Bi-partite Graph

Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)

[Figure: Entity-Column bipartite graph with table-column nodes (TC-1, TC-2, TC-3, TC-4) and entity nodes (USA, India, Hockey, Cricket, Tennis).]

Entities with similar X1/X2 values should be ontologically similar; the values summarize tabular co-occurrence.

Encoding the graph

“Such as” Bi-partite Graph

Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)

[Figure: “Such as” bipartite graph with concept nodes (Country, Location, Sports) and entity nodes (USA, India, Hockey, Cricket, Tennis).]

Entities with similar Y1/Y2 values should be ontologically similar; the values summarize “such as” pattern co-occurrence.

Low-dimensional PIC3 embedding

  • n * t entity-tableColumn bipartite graph → PIC → n * m PIC embedding (m << t)
  • n * s entity-suchas bipartite graph → PIC → n * m PIC embedding (m << s)
  • Concatenate the two embeddings → n * 2m PIC3 embedding
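
The pipeline above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it runs a fixed number of power-iteration steps instead of an early-stopping test, it obtains the m dimensions from m random start vectors (the slides do not say how the m dimensions are produced), and the two toy incidence matrices are invented. The core update is the row-normalised power iteration v ← D⁻¹A v followed by L1 normalisation, with A = F Fᵀ applied as F (Fᵀ v) so the n * n affinity matrix is never built.

```python
import random


def pic_embedding(F, m=2, n_iter=10, seed=0):
    """Embed the n rows of a non-negative n x t matrix F into m dimensions."""
    n, t = len(F), len(F[0])

    def apply_affinity(v):
        # A v with A = F F^T, computed as F (F^T v) without forming A.
        u = [sum(F[i][j] * v[i] for i in range(n)) for j in range(t)]
        return [sum(F[i][j] * u[j] for j in range(t)) for i in range(n)]

    # Row sums of A; isolated entities get degree 1 to avoid division by zero.
    degree = [d or 1.0 for d in apply_affinity([1.0] * n)]
    rng = random.Random(seed)
    columns = []
    for _ in range(m):  # assumption: one random start vector per dimension
        v = [rng.random() for _ in range(n)]
        for _ in range(n_iter):  # fixed, small number of steps (truncated iteration)
            v = [x / d for x, d in zip(apply_affinity(v), degree)]
            total = sum(abs(x) for x in v) or 1.0
            v = [x / total for x in v]  # L1 normalisation
        columns.append(v)
    return [[col[i] for col in columns] for i in range(n)]


def pic3_embedding(F_table, F_suchas, m=2, n_iter=10):
    """Concatenate the two n x m embeddings into one n x 2m embedding."""
    table_part = pic_embedding(F_table, m, n_iter, seed=0)
    suchas_part = pic_embedding(F_suchas, m, n_iter, seed=1)
    return [a + b for a, b in zip(table_part, suchas_part)]


# Toy data (assumed edges). Rows must be in the same entity order in both matrices.
entities = ["USA", "India", "Hockey", "Cricket", "Tennis"]
F_table = [  # columns: TC-1, TC-2, TC-3, TC-4
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
F_suchas = [  # columns: Country, Location, Sports
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
]
E = pic3_embedding(F_table, F_suchas, m=2)
```

In this toy input USA and India have identical neighbourhoods in both graphs, so they receive identical coordinates, which is the behaviour the slides describe as "ontologically similar" entities having similar values.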

Using PIC3 Representation
  • Semi-Supervised Learning: Given a few seed examples for each class, predict class labels for the unlabeled data points.
  • Set Expansion: Given a set of seed entities, find more entities similar to the seeds.
  • Automatic Set Instance Acquisition (ASIA): Given a concept name, automatically find instances of that concept.
Quantitative Evaluation: Datasets

Link to dataset: http://rtw.ml.cmu.edu/wk/WebSets/wsdm_2012_online

SSL using PIC3

Input: A few seed examples for each class label

Output: Class labels for the unlabeled data points

PIC clusters similar entities together → better SVM classifier on unlabeled data (use of background data)

SSL Task - I

# dimensions: 2504 → 10

SSL Task - II

# dimensions: 2574 → 10

Set Expansion using PIC3

Input : Few seed entities

e.g. Football, Hockey, Tennis

Output : More entities of same type as seeds

e.g. Baseball, Badminton, Cricket, Golf, …

K-NN operation is extremely efficient using KD-trees.
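
A minimal sketch of K-NN set expansion over a KD-tree, in plain Python. Several things here are assumptions rather than details from the slides: the query point is taken to be the centroid of the seed embeddings, the 2-D coordinates are invented, and the helper names `build_kdtree`, `knn` and `expand_set` are hypothetical.

```python
import heapq


def build_kdtree(points, depth=0):
    """points: list of (vector, label). Returns a nested dict tree or None."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2  # median point becomes the splitting node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def knn(node, query, k, heap=None):
    """Collect the k nearest points as a max-heap of (-squared_distance, label)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    vector, label = node["point"]
    dist = sum((a - b) ** 2 for a, b in zip(vector, query))
    if len(heap) < k:
        heapq.heappush(heap, (-dist, label))
    elif dist < -heap[0][0]:
        heapq.heapreplace(heap, (-dist, label))
    diff = query[node["axis"]] - vector[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    knn(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than the k-th best so far.
    if len(heap) < k or diff ** 2 < -heap[0][0]:
        knn(far, query, k, heap)
    return heap


def expand_set(embedding, seeds, k=3):
    """Return the k non-seed entities nearest to the centroid of the seeds."""
    tree = build_kdtree([(vec, name) for name, vec in embedding.items()])
    dims = len(next(iter(embedding.values())))
    centroid = [sum(embedding[s][d] for s in seeds) / len(seeds) for d in range(dims)]
    # Ask for extra neighbours because the seeds themselves will be among them.
    heap = knn(tree, centroid, k + len(seeds))
    ranked = [label for _, label in sorted(heap, reverse=True)]  # nearest first
    return [label for label in ranked if label not in seeds][:k]


# Hypothetical 2-D embedding values, for illustration only.
embedding = {
    "Football": (0.90, 0.10),
    "Hockey": (0.85, 0.15),
    "Tennis": (0.80, 0.12),
    "Cricket": (0.88, 0.11),
    "Golf": (0.82, 0.18),
    "India": (0.10, 0.90),
    "USA": (0.12, 0.85),
}
expanded = expand_set(embedding, ["Football", "Hockey", "Tennis"], k=2)
```

Because the PIC3 vectors are low-dimensional, a KD-tree can prune most of the search space per query; in very high dimensions that pruning would become much less effective.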

Query Times
  • PIC3 preprocessing: 0.02 sec
  • # SE queries = 881
  • Precision-recall curve: K-NN+PIC3 consistently beats K-NN-Baseline. The Modified Adsorption method is better on 2/5 query classes, at the expense of a larger query time.

Modified Adsorption: a graph-based label propagation algorithm

Automatic Set Instance Acquisition (ASIA) using PIC3

Input : Class label

e.g. Country

Output : Entities belonging to the given class label

e.g. India, China, USA, Canada, Japan, …

The Set Expansion algorithm described previously is used as a subroutine here.
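
A rough sketch of how a class label could be turned into a set-expansion call. The slides do not say how the initial seeds are chosen, so this sketch assumes they are the entities linked to the class label in the “such as” graph; that choice, the function names, the brute-force nearest-neighbour search (used here instead of a KD-tree for brevity) and all data values are hypothetical.

```python
def expand_set(embedding, seeds, k):
    """Brute-force stand-in for set expansion: k non-seeds nearest the seed centroid."""
    dims = len(next(iter(embedding.values())))
    centroid = [sum(embedding[s][d] for s in seeds) / len(seeds) for d in range(dims)]

    def sq_dist(name):
        return sum((x - c) ** 2 for x, c in zip(embedding[name], centroid))

    candidates = [name for name in embedding if name not in seeds]
    return sorted(candidates, key=sq_dist)[:k]


def acquire_instances(class_label, suchas_edges, embedding, k=3):
    """Seeds come from the 'such as' graph (assumption), then get expanded."""
    seeds = sorted(e for e in suchas_edges.get(class_label, ()) if e in embedding)
    if not seeds:
        return []  # nothing known about this class label
    return seeds + expand_set(embedding, seeds, k)


# Hypothetical inputs, for illustration only.
suchas_edges = {"country": {"India", "USA"}}
embedding = {
    "India": (0.10, 0.90),
    "USA": (0.12, 0.85),
    "China": (0.11, 0.88),
    "Japan": (0.15, 0.86),
    "Tennis": (0.80, 0.12),
    "Cricket": (0.88, 0.11),
}
instances = acquire_instances("country", suchas_edges, embedding, k=2)
```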

Query Times
  • PIC3 preprocessing: 0.02 sec
  • # ASIA queries = 25
  • Precision-recall curve: K-NN+PIC3 consistently beats K-NN-Baseline. The Modified Adsorption method is better on 2/4 query classes, at the expense of a much larger query time.
Conclusions & Future Work
  • Presented a novel low-dimensional PIC3 representation for entities on the Web using Power Iteration Clustering (PIC).
  • Simple primitive operations on PIC3 perform the following tasks:
    • Semi-Supervised Learning
    • Set Expansion
    • Automatic Set Instance Acquisition
  • Future work: use the PIC3 representation for
    • Named entity disambiguation
    • Unsupervised class-instance pair acquisition
Thank You!

Please visit our poster ID : 02
