I2.2 Large-Scale Information Network Processing Mid-Year Report


Charu Aggarwal (IBM)

Christos Faloutsos (CMU)

Ambuj Singh (UCSB)

Xifeng Yan (UCSB)

Indexing, Partitioning, and Distributed Processing on Time-Varying Networks

Objectives

Develop a novel graph index model and an advanced theory of distributed graph computing to facilitate processing of (military) linked data, which is a bottleneck for many research tasks in network science

Key Technical Innovations:

Dynamic graph indexing models and structures

Scalable graph processing

Graph partition overlapping and re-balancing theory

Primary Members

Xifeng Yan (UCSB), Ambuj Singh (UCSB), Charu Aggarwal (IBM), Christos Faloutsos (CMU)

Collaborative Members

Z. Wen (IBM/SCNARC), J. Bao (RPI/IRC), J. Han (UIUC/INARC), M. Srivatsa (IBM/INARC), V. Kawadia (BBN/IRC), S. Desai (Army)

- Deliverables:
- Q1: Data collection and cleaning for graph indexing and distributed graph computing
- Q2: Design graph indices with time-varying concerns
- Q3: Design and test distributed graph computing strategies
- Q4: Hypothesis validation and research paper submission
- Impact:
- Provide fast, scalable, and linked information access to soldiers and commanders

- Key Technical Innovations
- Dynamic graph indexing models and structures to resolve graph queries in time-varying information networks
- Query cost models for distributed graph processing
- Graph partition overlapping and re-balancing theory to (1) improve locality of data for parallel computing, and (2) accommodate dynamic network data updates and query workload changes
- Self-evolving distributed graph processing environment to adjust graph partitions dynamically

Large-Scale Information Network Processing: Invent scalable information network infrastructure

Facilitate processing of (military) linked data, which is a bottleneck for many research tasks in network science

Advance our understanding of scalability challenges, not only for information networks but also for other genres of complex networks

The models and the proposed experimental systems provide fundamental analysis of

How indexing of dynamic network data affects query performance,

How graph partitioning schemes affect distributed query processing,

How the models and laws of real networks affect the design of graph indexing and partitioning strategy

Advance State-of-the-Art Network Science

Subtask 1: Graph Index and Search (UCSB, IBM)

Fast access to and processing of time-varying information networks is key for tasks such as intelligence services and query processing. Simply put, we cannot afford to access networks node by node!

Subtask 2: Graph over MapReduce (CMU)

To process the overwhelming amount of data on the Web and in social networks, email, and telecommunications, and to distill important information from it (people’s opinions about extremists, potential radical groups, influential nodes), we need powerful graph processing methods.

Needed by any large-scale network data processing including information, social and communication networks

Subtask 3: Graph Partitioning/Distributed Graph Processing (UCSB, CMU)

Military information is often distributed across many devices. Distributed graph processing runs graph algorithms without gathering all the data on a single machine.

Military Relevance

- Indexing Methods for Large Scale Static and Dynamic Networks
- Methods for Indexing Massive Disk-Resident Graphs (Aggarwal (IBM), Zhao (UIUC), and Han (UIUC))
- Methods for Indexing Dynamic Network Streams (Aggarwal (IBM), Khan (UCSB), Yan (UCSB))

- Dynamic structural index for label-based queries (Aggarwal (IBM) and Li (UCSB)): SDM 2011 accepted.
- Analysis of significant substructures in time-varying networks (Singh (UCSB) et al.)
- Find the highest-scoring substructures by combining structure and time

- Problem definition (labeled proximity search)
- Label-based graph proximity search seeks the top-k vertex subsets with the smallest diameters for a given query of distinct labels. Each subset must cover all the labels specified in the query.
- Example queries: Q = (a, b, c); Q = (“reconnaissance”, “biometric matching”, “failure modeling”)
- Figure: two candidate vertex subsets, with diameters d = 3 and d = 2
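The problem definition above can be made concrete with a brute-force baseline. The sketch below is not the density-index method from the cited work; it simply enumerates one vertex per query label and ranks the resulting subsets. It assumes that the diameter of a subset is the largest hop distance between any two of its vertices, and the function names and the toy graph are hypothetical.

```python
from collections import deque
from itertools import product

def bfs_distances(adj, source):
    """Hop distance from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def top_k_proximity(adj, labels, query, k):
    """Brute force: pick one vertex per query label, rank subsets by diameter."""
    dist = {v: bfs_distances(adj, v) for v in adj}
    candidates = [[v for v in adj if lab in labels[v]] for lab in query]
    best = {}
    for combo in product(*candidates):
        subset = frozenset(combo)
        pairs = [(a, b) for a in subset for b in subset]
        if any(b not in dist[a] for a, b in pairs):
            continue  # some pair is disconnected, so no finite diameter
        best[subset] = max(dist[a][b] for a, b in pairs)
    ranked = sorted(best.items(), key=lambda item: (item[1], sorted(item[0])))
    return [(sorted(subset), diameter) for subset, diameter in ranked[:k]]

# Hypothetical toy graph: a path 1-2-3-4-5 with one label per vertex.
ADJ = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
LABELS = {1: {"a"}, 2: {"b"}, 3: {"c"}, 4: {"a"}, 5: {"b"}}
RESULT = top_k_proximity(ADJ, LABELS, ("a", "b", "c"), k=2)
```

The enumeration grows with the product of the per-label candidate counts, which is the cost an index is meant to avoid on large graphs.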

- Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal

- Can we do better? Which one is more promising?
- Figure: u’s density distribution vs. v’s density distribution
- 10 to 300 times faster

Graph Search: a Model-Based Approach

- Ideas
- Use an information propagation model to propagate labels in information networks
- Convert vertices to vectors
- Align sets of vectors

Figure panels: (a) LinkedIn, (b) Facebook

- Query speed: 0.1 sec on WebGraph (10M vertices, 213M edges)

- Information Propagation Model

A. Khan et al., Neighborhood Based Fast Graph Search in Large Networks, SIGMOD’11

- Align two networks

Search Algorithm

- Step 1: Match a node u of target graph G with some node v of query graph Q, if L(v) ⊆ L(u) and cost(u,v) is less than a predefined cost threshold ε.
- Step 2: Discard the labels of the unmatched nodes in the target graph.
- Step 3: Propagate the labels only among the matched nodes from the previous step. Repeat steps 1 and 2 until no node can be discarded further.
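The three steps above can be sketched as an iterative pruning loop. This is a minimal illustration, not the authors' implementation. It assumes a propagation model in which a labeled node at hop distance d (up to h hops) contributes alpha**d to that label's strength, and a cost that sums how much label strength a query node expects but a target node lacks. The names `neighborhood_vector`, `cost`, `prune`, the parameter values, and the toy graphs are all hypothetical.

```python
from collections import deque

def neighborhood_vector(adj, labels, node, active, h=2, alpha=0.5):
    """Label strengths seen from `node`: each node in `active` at hop
    distance d (1 <= d <= h) adds alpha**d to each of its labels."""
    vec, dist, queue = {}, {node: 0}, deque([node])
    while queue:
        x = queue.popleft()
        if dist[x] == h:
            continue
        for y in adj[x]:
            if y in dist:
                continue
            dist[y] = dist[x] + 1
            queue.append(y)
            if y in active:  # nodes whose labels were discarded add nothing
                for lab in labels[y]:
                    vec[lab] = vec.get(lab, 0.0) + alpha ** dist[y]
    return vec

def cost(vec_u, vec_v):
    """Assumed cost: label strength that v expects and u does not have."""
    return sum(max(0.0, s - vec_u.get(lab, 0.0)) for lab, s in vec_v.items())

def prune(g_adj, g_labels, q_adj, q_labels, eps):
    """Return {target node: [query nodes it can match]} after pruning."""
    q_vecs = {v: neighborhood_vector(q_adj, q_labels, v, set(q_adj)) for v in q_adj}
    active = set(g_adj)  # target nodes whose labels are still in play
    while True:
        # Step 3 (and the initial propagation): propagate among active nodes.
        g_vecs = {u: neighborhood_vector(g_adj, g_labels, u, active) for u in active}
        # Step 1: label containment plus cost below the threshold.
        matches = {}
        for u in active:
            hits = [v for v in q_adj
                    if q_labels[v] <= g_labels[u] and cost(g_vecs[u], q_vecs[v]) < eps]
            if hits:
                matches[u] = hits
        if len(matches) == len(active):
            return matches  # nothing more can be discarded
        active = set(matches)  # Step 2: drop labels of unmatched nodes

# Hypothetical example: Q is the path a-b-c; G holds an exact copy (u1-u2-u3)
# and a star (u5-u4-u6) that only looks promising in the first round.
Q_ADJ = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2"]}
Q_LABELS = {"v1": {"a"}, "v2": {"b"}, "v3": {"c"}}
G_ADJ = {"u1": ["u2"], "u2": ["u1", "u3"], "u3": ["u2"],
         "u4": ["u5", "u6"], "u5": ["u4"], "u6": ["u4"]}
G_LABELS = {"u1": {"a"}, "u2": {"b"}, "u3": {"c"},
            "u4": {"a"}, "u5": {"b"}, "u6": {"c"}}
MATCHES = prune(G_ADJ, G_LABELS, Q_ADJ, Q_LABELS, eps=0.1)
```

In this example u4 survives the first round because its neighbors carry the right labels, and is discarded in the second round once those neighbors have lost theirs, which is why the steps are repeated until the matched set stops shrinking.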

Figure: query graph Q (nodes v1 to v4) and target graph G (nodes u1 to u6), with a mapping f between them.

Dynamic Update in Index vs. Re-indexing (DBLP)

- Indexing is performed for h=2 hops.

Investigate graph properties and graph algorithms using MapReduce

Spectral Analysis of Billion-Scale Graphs

Patterns on the Connected Components of Terabyte-Scale Graphs

Study the limitation of the MapReduce architecture on processing network-centric data

Using the discovered patterns of terabyte-scale real-life graphs.

Subtask 2: Graph Over MapReduce

- Billion-Scale Eigen-solver
- Computes top-k eigenvalues and eigenvectors
- Finds anomalies in large graphs
- Many applications: SVD, triangle counting, spectral clustering, …
- A careful implementation of Lanczos on Hadoop can give excellent accuracy as well as scalability
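For readers unfamiliar with Lanczos, the following single-machine sketch shows the basic iteration. It is not HEigen and does not use Hadoop. It assumes a symmetric matrix stored as one dictionary per row, uses full re-orthogonalization for numerical stability, and extracts eigenvalues of the resulting tridiagonal matrix by Sturm-sequence bisection. All function names and the start vector are illustrative choices.

```python
import math

def matvec(rows, x):
    """Sparse product A*x; rows[i] maps column index -> value."""
    return [sum(val * x[j] for j, val in row.items()) for row in rows]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def lanczos(rows, m):
    """Run up to m Lanczos steps; return diagonal and off-diagonal of T."""
    n = len(rows)
    start = [1.0 / (i + 1) for i in range(n)]  # assumed generic start vector
    norm = math.sqrt(dot(start, start))
    basis = [[s / norm for s in start]]
    alphas, betas = [], []
    while True:
        w = matvec(rows, basis[-1])
        alphas.append(dot(w, basis[-1]))
        for v in basis:  # full re-orthogonalization against all earlier vectors
            c = dot(w, v)
            w = [wi - c * vi for wi, vi in zip(w, v)]
        beta = math.sqrt(dot(w, w))
        if len(alphas) == m or beta < 1e-10:
            return alphas, betas
        betas.append(beta)
        basis.append([wi / beta for wi in w])

def count_below(alphas, betas, x):
    """Sturm count: number of eigenvalues of T that are smaller than x."""
    count, d = 0, 1.0
    for i, a in enumerate(alphas):
        d = a - x - (betas[i - 1] ** 2 / d if i > 0 else 0.0)
        if d == 0.0:
            d = 1e-30
        if d < 0.0:
            count += 1
    return count

def top_k_eigenvalues(rows, k, m):
    """Approximate the k largest eigenvalues from m Lanczos steps."""
    alphas, betas = lanczos(rows, m)
    size = len(alphas)
    bound = max(abs(a) for a in alphas) + 2.0 * max(betas, default=0.0)
    result = []
    for j in range(size, max(size - k, 0), -1):  # j-th smallest, largest first
        lo, hi = -bound, bound
        for _ in range(100):  # bisection on the Sturm count
            mid = (lo + hi) / 2.0
            if count_below(alphas, betas, mid) >= j:
                hi = mid
            else:
                lo = mid
        result.append((lo + hi) / 2.0)
    return result

# Adjacency matrix of a 4-node path graph (hypothetical toy input).
PATH4 = [{1: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0, 3: 1.0}, {2: 1.0}]
TOP2 = top_k_eigenvalues(PATH4, k=2, m=4)
```

On large matrices m is chosen much smaller than the matrix dimension, and the dominant cost is the repeated matrix-vector product, which is the part that maps naturally onto a distributed framework.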

- Contribution:
- HEigen: a billion-scale eigensolver which can handle 1000x larger matrices than previous methods
- Application of the eigensolver on the Twitter graph helps us spot abnormal users (adult advertisers)

U Kang, et al. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11

- A large graph is composed of many connected components
- Q1: static patterns?
- Q2: evolution patterns?
- Q3: model?

- Figure axes: Count, Size
- Metric: Graph Fractal Dimension of a graph G = log |E| / log |V|
- Data: YahooWeb graph, |V| = 1.4 billion, |E| = 6.7 billion, 120 GBytes
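The metric is a one-line computation. As a small worked example, the sketch below evaluates it on the whole-graph YahooWeb counts quoted above; applying it to the whole graph rather than to individual connected components is an illustrative choice, and the function name is hypothetical.

```python
import math

def graph_fractal_dimension(num_vertices, num_edges):
    """log |E| / log |V|; needs at least 2 vertices and 1 edge."""
    if num_vertices < 2 or num_edges < 1:
        raise ValueError("need |V| >= 2 and |E| >= 1")
    return math.log(num_edges) / math.log(num_vertices)

# YahooWeb counts from the slide: |V| = 1.4 billion, |E| = 6.7 billion.
YAHOO_GFD = graph_fractal_dimension(1.4e9, 6.7e9)
```

For these counts the value comes out at roughly 1.07. The ratio does not depend on the base of the logarithm.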

U Kang, et al. Patterns on the Connected Components of Terabyte-Scale Graphs. ICDM 2010

- Are typical techniques efficient for graph queries?
- Graph partitioning and distribution techniques (e.g., Pregel). Limitations:
- Unavailable to the public
- Unbalanced workload due to skewed (non-uniformly distributed) graph queries
- Communication overhead due to inter-machine (cross-partition) communication

- Sedge: distributed graph processing
- Model-based Graph Partitioning Techniques
- First-of-its-kind distributed graph computing platform for information, social, and communication networks

- Shengqi Yang et al., Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011

- Dynamic Workload: Replicate Partitions
- Replicate partitions that are intensively accessed by many queries
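One simple way to decide which partitions to replicate is a greedy rule driven by the query log. The sketch below is an assumption for illustration, not the policy used in Sedge: each spare machine goes to the partition with the highest number of accesses per existing copy. The function name and the access log are hypothetical.

```python
from collections import Counter

def plan_replicas(accesses, spare_machines):
    """accesses: one partition id per query access. Returns copies per partition."""
    load = Counter(accesses)
    copies = {p: 1 for p in sorted(load)}  # every partition starts with one copy
    if not copies:
        return copies
    for _ in range(spare_machines):
        # Give the next machine to the partition with the most load per copy.
        hottest = max(copies, key=lambda p: load[p] / copies[p])
        copies[hottest] += 1
    return copies

# Hypothetical log: P1 is accessed 6 times, P2 4 times, P3 once.
ACCESS_LOG = ["P1"] * 6 + ["P2"] * 4 + ["P3"]
PLAN = plan_replicas(ACCESS_LOG, spare_machines=2)
```

With two spare machines, the first goes to P1 and the second to P2, since after the first replica P1 carries 3 accesses per copy against 4 for P2.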

- Complementary Partitions
- Generate partition sets that are complementary to each other

- Dynamic Workload: New Partitions
- Generate new partitions for regions that are intensively accessed by many cross-partition queries

- Optimal Solution:
- Where:
- NP-hard

- Before each iteration, increase the weight of the edges in each region according to the region's priority

…

- Iteratively repartition the graph

…
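The re-weighting loop can be illustrated on a toy graph. The sketch below assumes an additive rule (each edge gains `step * priority` per iteration), which is only one possible reading of "according to priority", and it replaces a real graph partitioner with a stand-in that cuts a path graph at its lightest edge. All names and values are hypothetical.

```python
def boost_weights(weights, regions, step=1.0):
    """Return new weights with each region's edges raised by step * priority."""
    boosted = dict(weights)
    for edges, priority in regions:
        for edge in edges:
            boosted[edge] += step * priority
    return boosted

def lightest_cut(weights):
    """Toy stand-in for a partitioner: split a path graph at its lightest edge."""
    return min(sorted(weights), key=lambda edge: weights[edge])

# Path a-b-c-d with unit weights; two regions with assumed priorities 1 and 2.
WEIGHTS = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}
REGIONS = [({("a", "b")}, 1), ({("b", "c")}, 2)]

current = WEIGHTS
CUTS = []
for _ in range(2):  # re-weight, then repartition, on every iteration
    current = boost_weights(current, REGIONS)
    CUTS.append(lightest_cut(current))
```

Before any boosting the stand-in would cut edge (a, b). After boosting, the prioritized edges become expensive to cut and the cut moves to (c, d), which keeps the high-priority regions inside a single partition.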

- 10,000 random queries. The number of partitions is increased by adding more machines.

Figure: # of machines vs. throughput improvement ratio

- Collaborations within I2
- Monthly meeting
- Strong connection between I2.1 and I2.2: one problem, two sides (information network processing on DTNs and on clusters)
- (I2.1) Work with Arun Iyengar and Mudhakar Srivatsa (IBM), who have done much work on DTN and storage. Shengqi Yang will intern at IBM this summer.

- Collaborations with researchers in other networks
- (S1.1) Work with Zhen Wen (IBM) on the social network application of graph density indexing. U Kang was a summer intern at IBM.
- (E1.1, R2.3) Work with Jie Bao (RPI) on RDF queries using neighborhood-based graph search.
- (T2.3) Work with Vikas Kawadia (BBN) on using graph query processing for distributed trust computing. Ziyu Guan is collaborating with Vikas.
- Graph search has connection with (T2.4) M. Goldberg’s work on trust structure.
- Work with Sachi Desai (Army) on graph query language/system.

- Continue research on large-scale information network processing (more specific)
- Graph indexing on multiple time-varying graph snapshots
- Compression-based, model-based information network processing
- Edge layout on the Hadoop file system for better compression and better performance
- Complementary graph partitioning theories

- Other research topics planned
- Models and methods for building complex graph queries
- Models and methods for routing complex graph queries to data sources (for both I2.1 and I2.2)
- Tensor analysis on Hadoop

A. Khan, N. Li, Z. Guan, X. Yan, S. Chakraborty, and S. Tao, Neighborhood Based Fast Graph Search in Large Networks, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011.

Nicholas D. Larusso and Ambuj K. Singh, Synopses for Probabilistic Data over Large Domains, EDBT'11

C. C. Aggarwal, N. Li, On Dynamic Node-Classification in Content-based Networks, SIAM International Conference on Data Mining (SDM) 2011

U Kang, Mary McGlohon, Leman Akoglu, and Christos Faloutsos. Patterns on the Connected Components of Terabyte-Scale Graphs. IEEE International Conference on Data Mining (ICDM) 2010, Sydney, Australia.

U Kang, Brendan Meeder, Christos Faloutsos, Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11

U Kang, Duen Horng Chau, and Christos Faloutsos. Mining Large Graphs: Algorithms, Inference, and Discoveries. IEEE International Conference on Data Engineering (ICDE) 2011, Hannover, Germany.

Shengqi Yang, Bo Zong, Arijit Khan, Ben Zhao, Xifeng Yan, Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011

Nan Li, Arijit Khan, Xifeng Yan, and Zhen Wen, Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal

Petko Bogdanov, Misael Mongiovì, Ambuj Singh, Mining Heavy-Edges Subnetworks in Time, to be submitted to VLDB Journal

C. C. Aggarwal, P. Zhao, J. Han. On Shortest-Path Indexing of Massive Disk Resident Graphs, Research Report, to be submitted to VLDB Journal

C. C. Aggarwal, A. Khan, X. Yan. A Probabilistic Index for Massive and Dynamic Graph Streams, Research Report, to be submitted to VLDB Journal

Stage 1: How to distribute graphs (we are here)

Stage 2: How to construct queries

Stage 3: How to execute/route queries

Make Information Networks Accessible to Soldiers and Commanders

Questions?