Declarative Analysis of Noisy Information Networks


Walaa Eldin Moustafa

Galileo Namata

Amol Deshpande

Lise Getoor

University of Maryland

Motivations/Contributions

Framework

Declarative Language

Implementation

Results

Related and Future Work

Social Networks

Social Media

- Users/objects are modeled as nodes, relationships as edges
- The observed networks are noisy and incomplete.
- Some users may have more than one account
- Communication may contain a lot of spam

- Missing attributes, missing links, and multiple references to the same entity
- Need to extract the underlying information network.

- Attribute Prediction
- To predict values of missing attributes

- Link Prediction
- To predict missing links

- Entity Resolution
- To predict if two references refer to the same entity

- These prediction tasks can use:
- Local node information
- Relational information surrounding the node

Task: Predict topic of the paper

A Statistical Model for Multilingual Entity Detection and Tracking

Language Model Based Arabic Word Segmentation

Automatic Rule Refinement for Information Extraction

Why Not?

Tracing Lineage Beyond Relational Operators

An Annotation Management System for Relational Databases

Join Optimization of Information Extraction Output: Quality Matters!

Legend: DB, NL, ?

- Use links between nodes (collective attribute prediction) [Sen et al., AI Magazine 2008]

- Goal: Predict new links
- Using local similarity
- Using relational similarity [Liben-Nowell et al., CIKM 2003]

Graham Cormode

Flip Korn

Lukasz Golab

Divesh Srivastava

Avishek Saha

Vladislav Shkapenyuk

Theodore Johnson

Nick Koudas

- Goal: to deduce that two references refer to the same entity
- Can be based on node attributes (local)
- e.g. string similarity between titles or author names

- Local information alone may not be enough
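A local comparison of the kind described above can be sketched with Python's standard-library `difflib.SequenceMatcher`. The function name `name_similarity` is an assumption made for illustration; the slides do not say which string similarity measure is used.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Return a similarity ratio in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two references with the same surface form look identical to a purely local
# measure, whether or not they denote the same person.
print(name_similarity("Jian Li", "Jian Li"))
print(name_similarity("Jian Li", "Samir Khuller"))
```

Because identical strings always score the maximum, a measure like this cannot separate two different people who share a name, which is why relational evidence is brought in.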

Jian Li

Jian Li

Petre Stoica

Prabhu Babu

Amol Deshpande

Barna Saha

William Roberts

Samir Khuller

Use links between the nodes (collective entity resolution) [Bhattacharya et al., TKDD 2007]

Jian Li

Jian Li

- Each task helps others get better predictions.
- How to combine the tasks?
- One after the other (pipelined), or interleaved?

- GAIA:
- A Java library for applying multiple joint attribute prediction (AP), link prediction (LP), and entity resolution (ER) learning and inference tasks [Namata et al., MLG 2009; Namata et al., KDUD 2009]
- Inference can be pipelined or interleaved.

- Motivation: To support declarative network inference
- Desiderata:
- User declaratively specifies the prediction features
- Local features
- Relational features

- Declaratively specify tasks
- Attribute prediction, Link prediction, Entity resolution

- Specify arbitrary interleaving or pipelining
- Support for complex prediction functions

- Handle all that efficiently

Framework

Specify the domain

- For attribute prediction, the domain is a subset of the graph nodes.
- For link prediction and entity resolution, the domain is a subset of pairs of nodes.

Compute features

- Local: word frequency, income, etc.
- Relational: degree, clustering coefficient, number of neighbors with each attribute value, common neighbors between pairs of nodes, etc.

Make predictions, and compute confidence in the predictions

- Attribute prediction: the missing attribute
- Link prediction: add link or not?
- Entity resolution: merge two nodes or not?

Choose which predictions to apply

After predictions are made, the graph changes:

- Attribute prediction changes local attributes.
- Link prediction changes the graph links.
- Entity resolution changes both local attributes and graph links.
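One possible reading of the four framework steps, for attribute prediction only, is the loop sketched below. Everything in it is an assumption made for illustration: the function name `run_attribute_prediction`, the majority-vote predictor, the vote-share confidence, and the fixed threshold are stand-ins, not the system's actual prediction functions.

```python
from collections import Counter

def run_attribute_prediction(edges, labels, threshold=0.6, max_rounds=10):
    """Iteratively fill in missing node labels from neighbor labels."""
    labels = dict(labels)  # copy so the caller's dict is untouched
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    for _ in range(max_rounds):
        # 1. Specify the domain: nodes whose label is still missing.
        domain = [n for n in neighbors if labels.get(n) is None]

        candidates = []
        for node in domain:
            # 2. Compute features: number of neighbors with each label value.
            counts = Counter(labels[m] for m in neighbors[node]
                             if labels.get(m) is not None)
            if not counts:
                continue
            # 3. Predict (majority vote) and compute confidence (vote share).
            label, votes = counts.most_common(1)[0]
            confidence = votes / len(neighbors[node])
            candidates.append((node, label, confidence))

        # 4. Choose which predictions to apply.
        chosen = [(n, lab) for n, lab, conf in candidates if conf >= threshold]
        if not chosen:
            break
        for node, label in chosen:
            labels[node] = label  # the graph changes before the next round
    return labels

result = run_attribute_prediction(
    edges=[("a", "c"), ("b", "c"), ("c", "d")],
    labels={"a": "DB", "b": "DB"},
)
print(result)
```

In this toy run, node `d` has no labeled neighbor in the first round and can only be labeled after `c` has been committed, which is the kind of dependency that makes recomputing features between rounds necessary.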

Declarative Language

- Use Datalog to express:
- Domains
- Local and relational features

- Extend Datalog with operational semantics (vs. fix-point semantics) to express:
- Predictions (in the form of updates)
- Iteration

- Domains are used to restrict the space of computation for the prediction elements.
- Space for this feature is |V|²:
Similarity(X, Y, S) :- Node(X, Att=V1), Node(Y, Att=V2), S=f(V1, V2)

- Using this domain the space becomes |E|:
DOMAIN D(X,Y) :- Edge(X, Y)

- Other DOMAIN predicates:
- Equality
- Locality sensitive hashing
- String similarity joins
- Traverse edges
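The effect of restricting a pairwise feature to an edge domain can be illustrated with a small sketch. The `similarity` function is a hypothetical stand-in for f(V1, V2), and the node and attribute values are made up; unordered pairs are used, so the unrestricted count is |V|(|V|-1)/2 rather than exactly |V|².

```python
from itertools import combinations

def similarity(a, b):
    """Hypothetical f(V1, V2): 1.0 for equal attribute values, else 0.0."""
    return 1.0 if a == b else 0.0

def similarity_all_pairs(attrs):
    """No DOMAIN: score every unordered pair of nodes (quadratic in |V|)."""
    return {(x, y): similarity(attrs[x], attrs[y])
            for x, y in combinations(sorted(attrs), 2)}

def similarity_on_edges(attrs, edges):
    """DOMAIN D(X,Y) :- Edge(X,Y): score only connected pairs (linear in |E|)."""
    return {(x, y): similarity(attrs[x], attrs[y]) for x, y in edges}

attrs = {"n1": "DB", "n2": "DB", "n3": "NL", "n4": "NL", "n5": "DB"}
edges = [("n1", "n2"), ("n2", "n3"), ("n3", "n4")]

full = similarity_all_pairs(attrs)
restricted = similarity_on_edges(attrs, edges)
print(len(full), len(restricted))  # 10 candidate pairs vs. 3
```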

- Features of prediction elements are combined in a single predicate to create the feature vector:
DOMAIN D(X, Y) :- …
{
P1(X, Y, F1) :- …
…
Pn(X, Y, Fn) :- …
Features(X, Y, F1, …, Fn) :- P1(X, Y, F1), …, Pn(X, Y, Fn)
}

DEFINE Merge(X, Y)
{
INSERT Edge(X, Z) :- Edge(Y, Z)
DELETE Edge(Y, Z)
UPDATE Node(X, A=ANew) :- Node(X,A=AX), Node(Y,A=AY), ANew=(AX+AY)/2
UPDATE Node(X, B=BNew) :- Node(X,B=BX), Node(Y,B=BY), BNew=max(BX,BY)
DELETE Node(Y)
}

Merge(X, Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) > 0.95
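The effect of the Merge(X, Y) update can be sketched on an in-memory graph. The data layout (a dict of node attributes and a set of directed edge tuples) is an assumption. So are two behaviors the Datalog definition does not spell out: redirecting edges that point into Y, and dropping any self-loop the merge would create.

```python
def merge(nodes, edges, x, y):
    """Merge node y into node x, following the Merge(X, Y) definition.

    nodes: dict mapping node id -> {"A": number, "B": number}
    edges: set of (source, target) tuples, modified in place
    """
    # INSERT Edge(X, Z) :- Edge(Y, Z)  followed by  DELETE Edge(Y, Z).
    # Assumption: edges pointing into y are redirected to x as well.
    for src, dst in list(edges):
        if y in (src, dst):
            edges.discard((src, dst))
            new_edge = (x if src == y else src, x if dst == y else dst)
            if new_edge[0] != new_edge[1]:  # assumption: no new self-loops
                edges.add(new_edge)
    # UPDATE Node(X, A=ANew): average of the two A values.
    nodes[x]["A"] = (nodes[x]["A"] + nodes[y]["A"]) / 2
    # UPDATE Node(X, B=BNew): maximum of the two B values.
    nodes[x]["B"] = max(nodes[x]["B"], nodes[y]["B"])
    # DELETE Node(Y)
    del nodes[y]

nodes = {"x": {"A": 2, "B": 5}, "y": {"A": 4, "B": 9}, "z": {"A": 0, "B": 0}}
edges = {("y", "z"), ("x", "y"), ("z", "y")}
merge(nodes, edges, "x", "y")
print(nodes["x"], sorted(edges))
```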

- The prediction and confidence functions are user-defined functions
- Can be based on logistic regression, a Bayes classifier, or any other classification algorithm
- The confidence is the class membership value
- In logistic regression, the confidence can be the value of the logistic function
- In a Bayes classifier, the confidence can be the posterior probability value
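As a minimal sketch of a user-defined prediction and confidence function, the logistic-regression case might look like the following. The weights, bias, and feature values are made up, and reporting the probability of the predicted class as the confidence is one interpretation of "class membership value", not something the slides confirm.

```python
import math

def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_er(features, weights, bias):
    """Return (prediction, confidence) for one candidate pair of references."""
    p_match = logistic(bias + sum(w * f for w, f in zip(weights, features)))
    prediction = p_match >= 0.5
    # Confidence is the estimated probability of whichever class was predicted.
    confidence = p_match if prediction else 1.0 - p_match
    return prediction, confidence

weights, bias = [2.0, 1.5], -1.0           # hypothetical trained parameters
for features in ([2.0, 1.0], [1.0, 1.0], [0.0, 0.0]):
    prediction, confidence = predict_er(features, weights, bias)
    # A rule such as "confidence-ER(...) > 0.95" would only fire on the first pair.
    print(features, prediction, round(confidence, 3), confidence > 0.95)
```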

- Iteration is supported by the ITERATE construct.
- It takes the number of iterations as a parameter, or * to iterate until no more predictions are made.
ITERATE (*)
{
Merge(X,Y) :- Features(X, Y, F1,…,Fn),
predict-ER(F1,…,Fn) = true,
confidence-ER(F1,…,Fn) IN TOP 10%
}
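The control flow of ITERATE with an IN TOP 10% filter can be approximated as below. The helper names (`iterate`, `candidates`, `apply_merge`) and the confidence values are hypothetical, and rounding the 10% cut up so that at least one prediction is applied per round is an assumption about a detail the slides leave open.

```python
import math

def iterate(candidates_fn, apply_fn, top_fraction=0.10, max_iterations=None):
    """Repeatedly apply the most confident fraction of the current predictions.

    max_iterations=None mirrors ITERATE(*): stop when no predictions remain.
    Returns the number of rounds executed.
    """
    rounds = 0
    while max_iterations is None or rounds < max_iterations:
        scored = candidates_fn()  # list of (confidence, item) pairs
        if not scored:
            break
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, math.ceil(len(scored) * top_fraction))
        for _, item in scored[:keep]:
            apply_fn(item)
        rounds += 1
    return rounds

# Hypothetical candidate merges with made-up confidence values.
pending = {f"pair{i}": i / 20 for i in range(1, 21)}
applied = []

def candidates():
    return [(conf, pair) for pair, conf in pending.items()]

def apply_merge(pair):
    applied.append(pair)
    del pending[pair]

rounds = iterate(candidates, apply_merge, max_iterations=3)  # like ITERATE(3)
print(rounds, applied)
```

In a real run the candidate list would be recomputed from the updated graph each round; here the confidences are static so that only the selection logic is on display.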

Implementation

- Prototype based on Java Berkeley DB
- Implemented a query parser, plan generator, query evaluation engine
- Incremental maintenance:
- Aggregate/non-aggregate incremental maintenance
- DOMAIN maintenance

- Predicates in the program correspond to materialized tables (key/value maps).
- Every set of changes made by AP, LP, or ER is logged into two change tables, ΔNodes and ΔEdges.
- Insertions: | Record | +1 |
- Deletions: | Record | -1 |
- Updates: a deletion followed by an insertion

- Aggregate maintenance is performed by aggregating the change table and then refreshing the old table.
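The aggregate maintenance step described above can be sketched for a simple out-degree count. The representation of ΔEdges as a list of `((source, target), ±1)` rows and the function names are assumptions; the prototype itself stores tables in Berkeley DB rather than in Python dictionaries.

```python
from collections import Counter

def degree_table(edges):
    """Full recomputation of Degree(X, Count<Y>) :- Edge(X, Y)."""
    return Counter(src for src, _ in edges)

def refresh_degree(degree, delta_edges):
    """Incrementally maintain the degree table from a change table.

    delta_edges: list of ((source, target), +1 or -1) rows; an update is
    logged as a deletion followed by an insertion.
    """
    # Aggregate the change table first ...
    delta = Counter()
    for (src, _dst), sign in delta_edges:
        delta[src] += sign
    # ... then refresh the old materialized table.
    for node, change in delta.items():
        degree[node] += change
        if degree[node] == 0:
            del degree[node]
    return degree

old_edges = [("a", "b"), ("a", "c"), ("b", "c")]
degree = degree_table(old_edges)

delta_edges = [
    (("b", "d"), +1),                    # insertion
    (("a", "c"), -1),                    # deletion
    (("b", "c"), -1), (("b", "e"), +1),  # update: delete, then insert
]
refresh_degree(degree, delta_edges)

new_edges = [("a", "b"), ("b", "d"), ("b", "e")]
print(degree == degree_table(new_edges))
```

The refresh touches only the keys that appear in the change table, which is where the saving over recomputation comes from when each round changes a small part of the graph.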

Results

- Synthetic graphs, generated using the forest fire and preferential attachment generation models.
- Three tasks:
- Attribute Prediction, Link Prediction and Entity Resolution

- Two approaches:
- Recomputing features after every iteration
- Incremental maintenance

- Varied parameters:
- Graph size
- Graph density
- Confidence threshold (update size)

- Varied the graph size from 20K nodes and 200K edges to 100K nodes and 1M edges

- Compared the evaluation of 4 features: degree, clustering coefficient, common neighbors and Jaccard.
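For reference, the four features named above can be computed on an undirected graph as in the sketch below. These are the standard textbook definitions; the function names and the toy graph are illustrative, and this is not the system's Datalog implementation.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree(adj, x):
    return len(adj[x])

def clustering_coefficient(adj, x):
    """Fraction of pairs of x's neighbors that are themselves connected."""
    nbrs = adj[x]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))

def common_neighbors(adj, x, y):
    return len(adj[x] & adj[y])

def jaccard(adj, x, y):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

adj = build_adjacency([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
print(degree(adj, "c"), clustering_coefficient(adj, "c"),
      common_neighbors(adj, "a", "b"), jaccard(adj, "a", "d"))
```

Degree and clustering coefficient are per-node features, while common neighbors and Jaccard are per-pair features, so the latter two are the ones whose cost depends on how the DOMAIN restricts candidate pairs.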

- Real-world PubMed graph
- Set of publications from the medical domain, their abstracts, and citations

- 50,634 publications, 115,323 citation edges
- Task: Attribute prediction
- Predict if the paper is categorized as Cognition, Learning, Perception or Thinking

- Choose top 10% predictions after each iteration, for 10 iterations
- Incremental: 28 minutes. Recompute: 42 minutes

DOMAIN Uncommitted(X) :- Node(X, Committed='no')
{
ThinkingNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Thinking')
PerceptionNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Perception')
CognitionNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Cognition')
LearningNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Learning')
Features-AP(X,A,B,C,D,Abstract) :- ThinkingNeighbors(X,A), PerceptionNeighbors(X,B), CognitionNeighbors(X,C), LearningNeighbors(X,D), Node(X,Abstract,_,_)
}

ITERATE(10)
{
UPDATE Node(X,_,P,'yes') :- Features-AP(X,A,B,C,D,Text), P = predict-AP(X,A,B,C,D,Text), confidence-AP(X,A,B,C,D,Text) IN TOP 10%
}

Related and Future Work

- Dedupalog [Arasu et al., ICDE 2009]:
- Datalog-based entity resolution
- User defines hard and soft rules for deduplication
- System satisfies hard rules and minimizes violations of soft rules when deduplicating references

- Swoosh [Benjelloun et al., VLDBJ 2008]:
- Generic entity resolution
- Match function for pairs of nodes (based on a set of features)
- Merge function determines which pairs should be merged

- Conclusions:
- We built a declarative system to specify graph inference operations
- We implemented the system on top of Berkeley DB and implemented incremental maintenance techniques

- Future work:
- Direct computation of top-k predictions
- Multi-query evaluation (especially on graphs)
- Employing a graph DB engine (e.g. Neo4j)
- Support recursive queries and recursive view maintenance

- [Sen et al., AI Magazine 2008]
- Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, Tina Eliassi-Rad: Collective Classification in Network Data. AI Magazine 29(3): 93-106 (2008)

- [Liben-Nowell et al., CIKM 2003]
- David Liben-Nowell, Jon M. Kleinberg: The link prediction problem for social networks. CIKM 2003.

- [Bhattacharya et al., TKDD 2007]
- I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM TKDD, 1:1–36, 2007.

- [Namata et al., MLG 2009]
- G. Namata and L. Getoor: A Pipeline Approach to Graph Identification. MLG Workshop, 2009.

- [Namata et al., KDUD 2009]
- G. Namata and L. Getoor: Identifying Graphs From Noisy and Incomplete Data. SIGKDD Workshop on Knowledge Discovery from Uncertain Data, 2009.

- [Arasu et al., ICDE 2009]
- A. Arasu, C. Re, and D. Suciu. Large-scale deduplication with constraints using Dedupalog. In ICDE, 2009.

- [Benjelloun et al., VLDBJ 2008]
- O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: a generic approach to entity resolution. The VLDB Journal, 2008.