I2.2 Large-Scale Information Network Processing: Mid-Year Report
Charu Aggarwal (IBM), Christos Faloutsos (CMU), Ambuj Singh (UCSB), Xifeng Yan (UCSB)
Task Setting: Indexing, Partitioning, and Distributed Processing on Time-Varying Networks
INARC I2.2 Mid-Year Report
A novel graph index model and an advanced theory of distributed graph computing, to facilitate processing of (military) linked data, which has become a bottleneck for many research tasks in network science
Key Technical Innovations:
Dynamic graph indexing models and structures
Scalable graph processing
Graph partition overlapping and re-balancing theory
Xifeng Yan (UCSB), Ambuj Singh (UCSB), Charu Aggarwal (IBM), Christos Faloutsos (CMU)
Z. Wen IBM/SCNARC, J. Bao RPI/IRC, J. Han UIUC/INARC, M. Srivatsa IBM/INARC, V. Kawadia BBN/IRC, S. Desai Army
Large-Scale Information Network Processing: Invent scalable information network infrastructure
Facilitate processing of (military) linked data, which has become a bottleneck for many research tasks in network science
Advance our understanding of scalability challenges, not only for information networks but also for other genres of complex networks
The models and the proposed experimental systems provide fundamental analysis of
How indexing of dynamic network data affects query performance,
How graph partitioning schemes affect distributed query processing,
How the models and laws of real networks affect the design of graph indexing and partitioning strategy
Advance State-of-the-Art Network Science
Subtask 1: Graph Index and Search (UCSB, IBM)
Fast access to and processing of time-varying information networks is key for tasks such as intelligence services and query processing. Simply put, we cannot afford to access a network node by node!
Subtask 2: Graph over MapReduce (CMU)
To process the overwhelming amount of data from the Web, social networks, email, and telecommunications, and to distill important information from it (for example, people's opinions about extremists, potential radical groups, or influential nodes), we need powerful graph processing methods.
Needed by any large-scale network data processing including information, social and communication networks
Subtask 3: Graph Partitioning/Distributed Graph Processing (UCSB, CMU)
Military information is often distributed across many devices. Distributed graph processing runs graph algorithms without gathering all the data on a single machine.
Find the top-k vertex subsets with the smallest diameters for a given query of distinct labels. Each subset must cover all the labels specified in the query.
Q=(a, b, c)
Q=(“reconnaissance”, “biometric matching”, “failure modeling”)
Which one is more promising?
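The label-cover query defined above can be illustrated with a brute-force sketch. This is not the neighborhood-based index from the cited work; it only makes the problem statement concrete on a toy graph. The function names, the adjacency-list representation, and the example graph are all assumptions made for illustration, and the diameter of a subset is taken to be the largest shortest-path distance (in hops) between any two of its vertices.

```python
from collections import deque
from itertools import combinations, product


def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def top_k_label_cover(adj, labels, query, k):
    """Brute force: try every way of picking one vertex per query label.

    adj    : dict vertex -> list of neighbours (undirected graph)
    labels : dict vertex -> set of labels carried by that vertex
    query  : tuple of distinct labels that each subset must cover
    Returns up to k (diameter, sorted_vertices) pairs, smallest diameter first.
    """
    candidates = [[v for v in adj if lab in labels.get(v, set())] for lab in query]
    dist = {v: bfs_distances(adj, v) for group in candidates for v in group}

    best = {}  # frozenset of vertices -> diameter
    for combo in product(*candidates):
        subset = frozenset(combo)
        if subset in best:
            continue
        pairs = list(combinations(subset, 2))
        if any(b not in dist[a] for a, b in pairs):
            continue  # some pair is disconnected, so no finite diameter
        best[subset] = max((dist[a][b] for a, b in pairs), default=0)

    ranked = sorted((d, sorted(s)) for s, d in best.items())
    return ranked[:k]


# Hypothetical toy graph: a triangle 1-2-3 joined to 4-5 through vertex 6.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 6], 4: [5, 6], 5: [4], 6: [3, 4]}
labels = {1: {"a"}, 2: {"b"}, 3: {"c"}, 4: {"a"}, 5: {"b", "c"}}
results = top_k_label_cover(adj, labels, ("a", "b", "c"), k=2)
print(results)
```

The enumeration grows exponentially with the number of query labels, which is why a search over a large network needs an index or a pruning model rather than this exhaustive scan.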
Graph Search: a Model-Based Approach
A. Khan et al., Neighborhood Based Fast Graph Search in Large Networks, SIGMOD’11
Dynamic Update in Index vs. Re-indexing (DBLP)
Investigate graph properties and graph algorithms using MapReduce
Spectral Analysis of Billion-Scale Graphs
Patterns on the Connected Components of Terabyte-Scale Graphs
Study the limitations of the MapReduce architecture for processing network-centric data
Use the patterns discovered in terabyte-scale real-life graphs.
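As a rough illustration of what running a graph algorithm in MapReduce style involves, the sketch below computes connected components by repeatedly propagating the smallest vertex id along edges. It is a single-process simulation under stated assumptions: the `map_phase` and `reduce_phase` names are hypothetical, no Hadoop API is used, and it is not presented as the implementation from the cited papers.

```python
from collections import defaultdict


def map_phase(edges, component):
    """Map: each edge sends the current label of one endpoint to the other."""
    for u, v in edges:
        yield v, component[u]
        yield u, component[v]


def reduce_phase(messages, component):
    """Reduce: each vertex keeps the smallest label it has seen so far."""
    grouped = defaultdict(list)
    for node, label in messages:
        grouped[node].append(label)
    return {node: min([component[node]] + grouped[node]) for node in component}


def connected_components(nodes, edges):
    """Iterate map/reduce rounds until no label changes."""
    component = {n: n for n in nodes}  # every vertex starts in its own component
    rounds = 0
    while True:
        updated = reduce_phase(map_phase(edges, component), component)
        rounds += 1
        if updated == component:
            return component, rounds
        component = updated


nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]
component, rounds = connected_components(nodes, edges)
print(component, rounds)
```

Each round corresponds to one full pass over the edge list, so the number of rounds grows with the graph's diameter. That per-round cost is one of the places where the MapReduce model can become a limitation for network-centric data.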
Subtask 2: Graph Over MapReduce
U Kang, et al. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11
Graph Fractal Dimension(G): log |E| / log |V|
|V| = 1.4 billion
|E| = 6.7 billion
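Plugging the vertex and edge counts above into the stated formula gives a quick numeric check. The function name is an assumption; the formula is the one given on the slide.

```python
import math


def graph_fractal_dimension(num_vertices, num_edges):
    """log |E| / log |V|, as defined above (the log base cancels out)."""
    return math.log(num_edges) / math.log(num_vertices)


dimension = graph_fractal_dimension(1.4e9, 6.7e9)
print(f"{dimension:.3f}")
```

For these counts the ratio comes out slightly above 1, at roughly 1.07.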
U Kang, et al. Patterns on the Connected Components of Terabyte-Scale Graphs. ICDM 2010
- Replicate partitions that are intensively accessed by many queries
# of Machines vs. Throughput Improvement Ratio
(3) Edge layout on the Hadoop file system for better compression and better performance
(4) Complementary graph partitioning theories.
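The idea of replicating partitions that many queries access can be sketched as a simple greedy allocation. This is a hypothetical illustration, not the scheme evaluated in the report: the function name, the access-count input, and the rule "give the next replica to the partition with the highest load per copy" are all assumptions.

```python
import heapq


def plan_replicas(access_counts, extra_replicas):
    """Greedily hand out extra replicas to the most heavily loaded partitions.

    access_counts  : dict partition id -> number of query accesses observed
    extra_replicas : how many additional copies can be stored in total
    Returns dict partition id -> total number of copies (at least 1 each).
    """
    copies = {p: 1 for p in access_counts}
    # Max-heap on load per copy, implemented with negated keys.
    heap = [(-count, p) for p, count in access_counts.items()]
    heapq.heapify(heap)
    for _ in range(extra_replicas):
        if not heap:
            break
        _, p = heapq.heappop(heap)
        copies[p] += 1
        heapq.heappush(heap, (-access_counts[p] / copies[p], p))
    return copies


# Hypothetical workload: partition P0 is accessed far more often than the others.
plan = plan_replicas({"P0": 900, "P1": 100, "P2": 400}, extra_replicas=3)
print(plan)
```

With this workload the hot partition P0 receives two of the three extra copies and P2 receives one, which evens out the expected load per copy.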
A. Khan, N. Li, Z. Guan, X. Yan, S. Chakraborty, and S. Tao, Neighborhood Based Fast Graph Search in Large Networks, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011.
Nicholas D Larusso and Ambuj K. Singh, "Synopses for Probabilistic Data over Large Domains", EDBT'11
C. C. Aggarwal, N. Li, On Dynamic Node-Classification in Content-based Networks, SIAM International Conference on Data Mining (SDM) 2011
U Kang, Mary McGlohon, Leman Akoglu, and Christos Faloutsos. Patterns on the Connected Components of Terabyte-Scale Graphs. IEEE International Conference on Data Mining (ICDM) 2010, Sydney, Australia.
U Kang, Brendan Meeder, Christos Faloutsos, Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11
U Kang, Duen Horng Chau, and Christos Faloutsos. Mining Large Graphs: Algorithms, Inference, and Discoveries. IEEE International Conference on Data Engineering (ICDE) 2011, Hannover, Germany.
Shengqi Yang, Bo Zong, Arijit Khan, Ben Zhao, Xifeng Yan, Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011
Nan Li, Arijit Khan, Xifeng Yan, and Zhen Wen, Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal
Petko Bogdanov, Misael Mongiovi, Ambuj Singh, Mining Heavy-Edges Subnetworks in Time, to be submitted to VLDB Journal
C. C. Aggarwal, P. Zhao, J. Han. On Shortest-Path Indexing of Massive Disk Resident Graphs, Research Report, to be submitted to VLDB Journal
C. C. Aggarwal, A. Khan, X. Yan. A Probabilistic Index for Massive and Dynamic Graph Streams, Research Report, to be submitted to VLDB Journal
Stage 1: How to distribute graphs (we are here)
Stage 2: How to construct queries
Stage 3: How to execute/route queries
Make Information Networks Accessible to Soldiers and Commanders