QoI Mining of Noisy, Volatile, Uncertain, and Incomplete Heterogeneous Information Networks

INARC I3.1 Mid-Year Report
I3.1: QoI Mining of Noisy, Volatile, Uncertain, and Incomplete Heterogeneous Information Networks
Jiawei Han (Task Lead), Christos Faloutsos (CMU), Xifeng Yan (UCSB)
University of Illinois at Urbana-Champaign
NS-CTA: INARC

I3.1: QoI Mining of Noisy, Volatile, Uncertain, and Incomplete Heterogeneous Information Networks
• Key Objectives:
  • Develop robust, high-quality mining methods for noisy and inaccurate heterogeneous information networks
  • Design substantially enhanced data mining methods to uncover hidden patterns and knowledge
• Deliverables:
  • Q1: Methodology design for (i) two-stage mining and (ii) noise-aware mining in heterogeneous information networks
  • Q2: Algorithm development for the two approaches
  • Q3: Algorithm testing and refinement for the two approaches
  • Q4: System prototype demo of the two approaches
• Impact:
  • Enable tools to uncover hidden patterns and knowledge from information networks despite noise and uncertainty in the networks
• Key Technical Innovations:
  • Noise-aware mining model for incomplete and noisy network data that incorporates the uncertainty of node attributes and network structure to discover hidden relationships
(Figure: “Dirty” Information Network; Cleaned/Inferred Adversarial Network)

Overall Task Organization
• Subtask 1: Two-Stage Mining: First clean and fuse the data by information network analysis, then mine the cleansed data in information networks
• Subtask 2: Noise-Aware Mining: Directly mine the networked data with the consideration that a certain portion of the data may be noisy, incomplete, or unreliable
• Subtask 3: Exploring QoI Mining Applications: Explore QoI network mining methodology in various applications

Subtask 1: Two-Stage Mining
Two-Stage Mining: First clean and fuse the data by information network analysis, then mine the cleansed data in information networks
• Role and relationship discovery: Uncovering hierarchical relationships among linked objects (KDD’11 sub)
• Data Cleaning by Trust Analysis: Cluster-Based Trustworthiness Analysis (WWW’11 poster, KDD’11 sub)
• Network Denoising and Sampling by Active Learning (ICML’11 sub)
• Clustering Heterogeneous Information Networks with Incomplete Attributes (KDD’11 sub)
• Assessing and Ranking Structural Correlations in Graphs (SIGMOD'11)
• Differentially Private Data Cubes: Optimizing Noise Source and Consistency (SIGMOD'11)

Uncovering Hierarchical Relationships among Linked Objects
• Relationship types: parent-child, manager-subordinate, organizational, initiator-follower
• DAG underlying tree
• Data: nodes, links, labeled trees
• Jointly learn the importance of features and rules (challenge: joint learning)
• Infer the tree structures of unlabeled data (challenge: model and feature design)
• Develop a general model and summarize typical features with uncertain importance
  • Local feature (singleton potential)
  • Dependency rule (pairwise potential)
• Test on two tasks
  • Uncover family tree structure
  • Uncover online discussion structure
(Figure: Examples of features and rules)
(UIUC + CUNY) Chi Wang, Jiawei Han, Xiang Li, Qi Li, Wen-Pin Lin, Adam Lee, Hao Li, Heng Ji, "Uncovering Hierarchical Relationships among Linked Objects: A Probabilistic Modeling Approach", KDD'11 (sub)

Cluster-Based Trustworthiness Analysis
• Iterate between trust analysis and clustering of objects to obtain an accurate trust ranking of providers and confidence ranking of facts
• Smoothing with the global scores
• Several algorithms
  • Basic TruthFinder Algorithm (Basic TF)
  • Basic Cluster-based Fact Finder (BCFF)
  • Advanced Cluster-based Fact Finder (ACFF)
(UIUC) Manish Gupta, Jiawei Han, and Yizhou Sun, "Cluster-Based Analysis of Information Trustworthiness", KDD'11 (sub) (an earlier version appeared as a WWW’11 poster)
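The mutual reinforcement between provider trust and fact confidence can be sketched as follows. This is a minimal illustration of a TruthFinder-style iteration only, assuming the simplified update rules stated in the comments. It has no clustering step, no smoothing with global scores, and no handling of conflicting facts, so it is not the BCFF or ACFF algorithm from the slide. The function name, the `claims` layout, and the default values are hypothetical.

```python
def truthfinder_basic(claims, iterations=20, initial_trust=0.8):
    """Simplified TruthFinder-style iteration (illustrative assumptions).

    claims: dict mapping provider -> non-empty set of facts it asserts.
    Assumed update rules:
      confidence(f) = 1 - product over providers p of f of (1 - trust(p))
      trust(p)      = average confidence of the facts p asserts
    """
    trust = {p: initial_trust for p in claims}

    # Invert the mapping: fact -> list of providers asserting it.
    providers_of = {}
    for provider, facts in claims.items():
        for fact in facts:
            providers_of.setdefault(fact, []).append(provider)

    confidence = {}
    for _ in range(iterations):
        # A fact is wrong only if every provider asserting it is wrong.
        for fact, providers in providers_of.items():
            all_wrong = 1.0
            for provider in providers:
                all_wrong *= 1.0 - trust[provider]
            confidence[fact] = 1.0 - all_wrong
        # A provider is as trustworthy as its facts are, on average.
        for provider, facts in claims.items():
            trust[provider] = sum(confidence[f] for f in facts) / len(facts)
    return trust, confidence
```

With toy input such as `{"A": {"x=1", "y=2"}, "B": {"x=1", "y=2"}, "C": {"x=9"}}`, facts backed by two providers end up with higher confidence than the fact backed by one, and the providers asserting them end up with higher trust.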

Network Denoising and Sampling by Active Learning
• Problem: Which nodes in a network are the most important from a learning point of view? That is, which nodes, once labeled, yield the best-performing classifier?
  • These nodes are more important and less likely to contain noise
• Methodology: Select the data to label so as to minimize the variance of an unbiased classifier → minimize the expected error
• Significance: Theoretically minimizes the expected error of a given classifier
(Figure: Classification accuracy vs. the number of labeled nodes used in the co-author network, with our algorithm highlighted)
(UIUC) Ming Ji, Xiaofei He, and Jiawei Han, "A Variance Minimization Criterion to Active Learning on Graphs", ICML’11 (submission)

Correlation Metric in Information Networks
• Question: Is the distribution of events (blue nodes) influenced by the network links or not? If it is, to what degree?
• A novel metric, Decayed Hitting Time, is proposed to assess and rank structural correlations in graphs; it aggregates the proximity among nodes sharing the same event
• SIGMOD reviewer: “Interesting problem that I haven’t seen before”
• Structural Correlation: the first of its kind defined for networks
• Able to test whether the distribution of events is related to (or influenced by) the underlying network structure and, if so, by how much
• Our sampling algorithm is 10-20 times faster than the iterative multiplication algorithm
(UCSB) Z. Guan, J. Wu, Z. Yun, A. Singh, X. Yan, Assessing and Ranking Structural Correlations in Graphs, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011

Differentially Private Data Cubes: Optimizing Noise Source and Consistency
• Motivation
  • Concern: Disclosure of sensitive information on data publication
  • Explore differential privacy (DP) to provide provable privacy guarantees for individuals in multi-dimensional data space (data cube)
• Approach: Add noise to query answers
  • Choose an initial subset of cuboids to compute directly from the fact table, injecting DP noise as usual
  • Compute the remaining cuboids directly from the initial set
  • An efficient procedure, with running time polynomial in the number of cuboids, selects the initial set of cuboids such that the maximal noise in all published cuboids is within a factor of the optimal
• Result: Enforce consistency in the published cuboids while simultaneously improving their utility (i.e., reducing error)
(UIUC) Bolin Ding, Marianne Winslett, Jiawei Han, and Zhenhui Li, “Differentially Private Data Cube: Optimizing Noise Source and Consistency”, SIGMOD'11, Athens, Greece, June 2011 (accepted)
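The idea of noising an initial cuboid and deriving the others from it can be sketched as follows. This is a minimal illustration, assuming count queries, a single initial cuboid (the finest one), the standard Laplace mechanism with sensitivity 1 (neighboring tables differ by adding or removing one row), and the whole privacy budget `epsilon` spent on that one cuboid. It does not implement the slide's procedure for selecting the initial set. All function and parameter names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_base_cuboid(rows, domains, epsilon, rng):
    """Noisy count for every cell of the finest cuboid.

    rows:    list of tuples, one attribute value per dimension.
    domains: list of value lists, one per dimension; every cell of the
             cross product is noised, including cells with zero count.
    """
    counts = {cell: 0 for cell in product(*domains)}
    for row in rows:
        counts[row] += 1
    scale = 1.0 / epsilon  # assumed sensitivity 1 for count queries
    return {cell: c + laplace_noise(scale, rng) for cell, c in counts.items()}


def roll_up(cuboid, keep):
    """Derive a coarser cuboid by summing out the dimensions not in `keep`."""
    coarse = defaultdict(float)
    for cell, value in cuboid.items():
        coarse[tuple(cell[i] for i in keep)] += value
    return dict(coarse)
```

Because every coarser cuboid is a sum over the same noisy cells, the published cuboids agree with each other, for example `roll_up(base, (0,))` and `roll_up(base, (1,))` have the same grand total. The cost is that noise accumulates as more cells are summed, which is the trade-off the choice of the initial set addresses.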

Subtask 2: Noise-Aware Mining
Noise-aware mining: Directly mine the networked data with the consideration that a certain portion of the data may be noisy, incomplete, or unreliable
• RankClass: Ranking-Based Classification of Information Networks (KDD’11 sub)
• Apolo: Making Sense of Large Network Data: Combining Rich User Interaction & Machine Learning (CHI’11)
• Event Detection in Time Series of Mobile Communication Graphs (Army Science Conference’10)
• PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Info. Networks (VLDB’11 sub)
• Towards Iceberg Analysis in Graph OLAP (in preparation, VLDB J.)
• Graph Cube: On Warehousing and OLAP Multidimensional Networks (SIGMOD’11)

RankClass: Ranking-Based Classification of Information Networks
• Output the classification results plus a ranked list of objects within each class
• For each class, objects ranked low are more likely to contain noise
• Methodology: Iteratively use the current ranking results to extract the sub-network corresponding to each class, on which the within-class ranking algorithm is run
• Iteratively use the current ranking results to remove noise from other classes
(Figure: Original network; clean sub-networks for class 1, class 2, and class 3)
(UIUC) Ming Ji and Jiawei Han, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11 (sub)

Apolo: Making Sense of Large Network Data: Combining Rich User Interaction & Machine Learning
• Provides a mixed-initiative approach (ML + HCI) to help users interactively explore large graphs
• Users start with a small sub-graph, then iteratively expand it:
  • User specifies exemplars
  • Belief Propagation (BP) finds other relevant nodes
• User study showed Apolo outperformed Google Scholar in making sense of citation network data
(Figure: Exemplars are marked; the rest of the nodes are considered relevant (by BP), with relevance indicated by color saturation. Note that BP supports multiple groups.)
(CMU) Chau et al., Apolo: Making Sense of Large Network Data: … CHI’11

Event Detection in Time Series of Mobile Communication Graphs
• Problem: Given a graph that changes over time, perform:
  1) “change detection”: find time points at which many nodes change their behavior significantly
  2) “attribution”: find the top nodes that contribute the most to the change in behavior
• Main idea:
  • Extract features for nodes
  • Derive the typical behavior (“eigen-behavior”) of nodes
  • Compare eigen-behaviors over time
• Results:
  • Detect important events and festivals in our data
  • Spot nodes that change behavior over time
(Figure: N × N graphs over time T; for a feature F (e.g., inweight), a T × N matrix with sliding window W; change metric: angle θ between the eigen-behavior at t and the past pattern of eigen-behaviors)
(CMU) Akoglu et al., Event Detection … Army Science Conference’10
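The comparison of eigen-behaviors by angle can be sketched as follows. This is a minimal illustration under simplifying assumptions: each window is a W × N list of non-negative values of one node feature, the eigen-behavior is taken to be the principal eigenvector of the window's N × N Gram matrix (found by power iteration), and two windows are compared directly rather than against an aggregated past pattern. The function names are hypothetical, and the exact matrix used in the original work may differ.

```python
import math


def principal_eigenvector(matrix, iterations=200):
    """Power iteration on a square, symmetric, non-negative matrix."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all-zero matrix: keep the current vector
            break
        v = [x / norm for x in w]
    return v


def eigen_behavior(window):
    """window: W rows (time steps), each a list of N node feature values."""
    n = len(window[0])
    gram = [[sum(row[i] * row[j] for row in window) for j in range(n)]
            for i in range(n)]
    return principal_eigenvector(gram)


def angle(u, v):
    """Angle in radians between two unit vectors, ignoring sign."""
    dot = abs(sum(a * b for a, b in zip(u, v)))
    return math.acos(min(1.0, dot))
```

A large angle between the current eigen-behavior and an earlier one flags a candidate change point. The nodes whose entries differ most between the two vectors are natural candidates for attribution.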

Noise-Aware Mining: Graph Iceberg
• Graph Iceberg: A novel graph iceberg mining framework to find anomaly regions in large heterogeneous information networks
• Graph Iceberg: the first of its kind in network science
• gIceberg identifies promising vertices to avoid costly candidate region enumeration
• Scalable: gIceberg is 10-50 times faster than the existing algorithm
• Able to find abnormal concentrations of events in information networks, intensive attacks in intrusion networks, and special communities in social networks
• Example regions (figure):
  • R1 has a high concentration of black vertices but low connectivity
  • R2, in contrast, has few black vertices but is well connected
  • R3 is an anomaly region with a high density of black vertices and high connectivity
Xifeng Yan, et al., Towards Iceberg Analysis in Graph OLAP, in preparation for VLDB Journal, 2011

PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Info. Networks
• Problem: Study similarity between objects of the same type in heterogeneous information networks
• Solution
  • Define a meta path-based similarity framework
  • Propose a new measure called PathSim, which is able to detect peer objects for the given meta path
  • Propose a co-clustering-based efficient online search algorithm to support top-k search
• Results (figure)
(UIUC + UCSB) Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, Tianyi Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks", submitted to VLDB'11
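The PathSim measure can be sketched as follows. This is a minimal illustration assuming a symmetric meta path such as author-venue-author. Each object is a vector of path counts to the middle type (e.g., papers per venue), so the number of meta-path instances between x and y is the dot product of their vectors, and PathSim(x, y) = 2 · paths(x, y) / (paths(x, x) + paths(y, y)). The top-k search here is a brute-force scan, not the co-clustering-based algorithm from the slide. The function names and toy data are hypothetical.

```python
def path_count(u, v):
    """Number of meta-path instances between two objects, given their
    count vectors over the middle type of a symmetric meta path."""
    return sum(a * b for a, b in zip(u, v))


def pathsim(u, v):
    """PathSim score in [0, 1]; 1 means identical visibility and overlap."""
    denom = path_count(u, u) + path_count(v, v)
    return 2.0 * path_count(u, v) / denom if denom else 0.0


def top_k(query, vectors, k):
    """Brute-force top-k most similar objects to `query` (excluding itself)."""
    scores = [(name, pathsim(vectors[query], vec))
              for name, vec in vectors.items() if name != query]
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:k]


# Hypothetical toy data: papers per venue for each author.
authors = {
    "Mike": (2, 1, 0, 0),
    "Jim": (50, 20, 0, 0),
    "Bob": (2, 1, 0, 0),
    "Ann": (0, 0, 1, 1),
}
```

The normalization by self-path counts is what favors peers. In the toy data, Jim shares many more paths with Mike than Bob does, but Bob, whose publication profile matches Mike's, receives the higher score.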

Graph Cube: On Warehousing and OLAP Multidimensional Networks
• Multidimensional networks:
  • Topological graph structure comprising entities and relationships
  • Multidimensional attributes associated with entities
• Graph cube: Extend decision support facilities on large multidimensional networks
  • A multidimensional network is summarized into a set of semantically meaningful and structure-enriched aggregate networks at coarser levels of granularity within different multidimensional spaces
• Different query models and OLAP solutions
  • Cuboid queries
  • Crossboid queries straddling multiple cuboids
Peixiang Zhao (UIUC), Xiaolei Li (Microsoft), Dong Xin (Google), and Jiawei Han (UIUC), “Graph Cube: On Warehousing and OLAP Multidimensional Networks”, SIGMOD'11 (accepted)
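Summarizing a multidimensional network into an aggregate network can be sketched as follows. This is a minimal illustration of a single cuboid query, assuming an undirected graph, COUNT as the aggregate for both nodes and edges, and node attributes stored in plain dictionaries. The function name and data layout are hypothetical and are not taken from the Graph Cube system.

```python
from collections import Counter


def cuboid(nodes, edges, dims):
    """Aggregate network for the chosen dimensions.

    nodes: dict node_id -> dict of attribute name -> value.
    edges: iterable of (u, v) pairs, treated as undirected.
    dims:  tuple of attribute names to group by.
    Returns (node_counts, edge_counts): how many original nodes fall in
    each group, and how many original edges connect each pair of groups.
    """
    def group(node_id):
        return tuple(nodes[node_id][d] for d in dims)

    node_counts = Counter(group(n) for n in nodes)
    edge_counts = Counter()
    for u, v in edges:
        # Sort the endpoints so (A, B) and (B, A) count as the same edge.
        a, b = sorted((group(u), group(v)))
        edge_counts[(a, b)] += 1
    return node_counts, edge_counts
```

Calling `cuboid(nodes, edges, ("gender",))` and `cuboid(nodes, edges, ("gender", "city"))` on the same network gives two views at different granularities. A crossboid query would instead relate groups drawn from two different cuboids.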

Subtask 3: QoI Mining Applications
Exploring QoI Mining Applications
• Consider that the network is noisy, incomplete, unreliable, …
• Explore network mining methodology in various applications
  • Polonium: Tera-Scale Graph Mining and Inference for Malware Detection (SDM’11)
  • ValuePick: Towards Dual-goal, Value-Oriented Recommendations (ICDM’10 Workshop on Emerging Applications)
  • Reciprocity in Human Communication Networks (KDD’11 sub)
  • Noise-Aware Mining: Collective Classification of Information Networks for Web Search (SIGIR’11 sub)
  • Patent Value Estimation and Maintenance Recommendation with Patent Information Network Model (KDD’11 sub)

Polonium: Tera-Scale Graph Mining and Inference for Malware Detection
• 60+ terabytes of data anonymously contributed by participants of the worldwide Norton Community Watch program (Symantec)
  • 50+ million machines
  • 900+ million executable files
• A file-in-machine bipartite graph (0.2 TB+)
  • 1 billion nodes (machines and files)
  • 37 billion edges
• Contributions: Malware detection (‘bad’ files), scalability
• Polonium is a new and effective reputation-based malware detection technology adapting the Belief Propagation algorithm: 87% TPR at 1% FPR
(Figure: bipartite graph of machines and binaries, with files marked as malware or good)
(CMU) Chau et al., Polonium: Tera-Scale Graph Mining and Inference for Malware Detection, SDM’11

ValuePick: Towards Dual-goal, Value-Oriented Recommendations
• Problem: Given a graph with node attributes (“value”) and a query node q, find other nodes to recommend to q that are (1) close by (high proximity) as well as (2) “value”-able
• Main idea: Carefully change (perturb) the order of nodes by proximity such that the total expected value is maximized
  • Q: How to perturb? How to pick the best k nodes?
  • A: Formulation as an optimization problem
• Makes dual-goal recommendations by integrating “value”
(Figure: query node and candidate nodes ranked by proximity and “value”; darker color: higher proximity; value: e.g., centrality)
(CMU) Akoglu et al., ValuePick: … ICDM Workshop on Optimization Based Methods for Emerging Data Mining Problems’10

Reciprocity in Human Communication Networks
• Motivation: Reciprocity is often treated as a global, unweighted quantity
• Problem: How reciprocal are human relations?
  • Given n_ST (# calls from S (Silent) to T (Talkative)) and, similarly, n_TS, quantify the degree of reciprocity between S and T
  • Does reciprocity depend on the topological features of T and S, e.g., degree similarity?
• Problem: If T calls S n_TS times, what can we say about how many times S calls T?
• Approach: Model Prob(n_ST, n_TS) with 3PL
• 3PL spots anomalous mutual interactions (low data likelihood points)
(Figure: Pareto, Yule, and 3PL fits compared with the real data over n_ST and n_TS; higher likelihood indicated)
(CMU) Akoglu et al., Reciprocity… Submitted to KDD’11

Noise-Aware Mining: Collective Classification of Information Networks for Web Search
• Extension of our work: Ming Ji, et al., “Graph Regularized Transductive Classification on Heterogeneous Information Networks”, ECMLPKDD 2010
• Collective classification: learning from both the network structure and the numerical features of nodes
  • Links and numerical features complement each other, so combining them provides results that are more robust against noise
  • Fully exploit all the information available in the network
  • Can predict the labels of new data that are not seen in the training phase, as long as they have features
• Methodology: Unify the feature information into a feature graph
(Figure: The unified network structure)
(UIUC) Ming Ji, Jun Yan, Siyu Gu, and Jiawei Han, "Learning Search Tasks in Queries and Web Pages via Graph Regularization", SIGIR’11 (submission)

Patent Value Estimation and Maintenance Recommendation with Patent Information Network Model
• A granted U.S. patent can be held for up to 20 years; however, large maintenance fees must be paid to keep it valid
• For large companies and organizations, deciding which patents to maintain is difficult because too many patents need to be investigated
• Model the patents as a heterogeneous, time-evolving information network, and propose new patent quality features and a network-based optimization model to rank the patents
• Experiments on a U.S. patent database with millions of patents show high accuracy of our approach
(UIUC + IBM) Xin Jin, Scott Spangler, Ying Chen, and Jiawei Han, "Patent Value Estimation and Maintenance Recommendation with Patent Information Network Model", KDD'11 (sub)

Advancing the State-of-the-Art of Network Science
• Discovering hidden relationships in noisy, incomplete, dynamic, and heterogeneous information networks
• Focused on large heterogeneous information networks
  • Collections of information objects in diverse forms and from diverse resources
• Developed state-of-the-art algorithmic tools
  • Supporting data cleaning, information trust analysis, network modeling, and integrated information structure discovery
  • Utilizing in-depth data analysis and statistical modeling approaches over the content and the structure of the network
  • Making use of both explicit network structure and hidden information structure
• Advanced our understanding of how to:
  • Perform two-stage mining and noise-aware mining on heterogeneous information networks when data is noisy, volatile, uncertain, and incomplete
  • Explore various kinds of large-scale, new applications

Military Relevance
• Subtask 1: Two-Stage Mining
  • Military networks are inherently noisy, incomplete, and unreliable, and they draw on multiple sources, some of which are not trustworthy
  • Two-stage mining provides a systematic way to derive trustworthy information from multi-sourced, inconsistent networks
• Subtask 2: Noise-Aware Mining
  • Noise-aware mining aims to mine successfully in the presence of various kinds of noisy data
  • Military networks are likely to need such robust mining methodologies badly
• Subtask 3: QoI Mining Applications
  • Many diverse applications are explored under this QoI mining framework
  • Such explorations will help us understand how diverse military applications may explore different genres of networks effectively

Collaborations within NSCTA
(Diagram: I3.1 QoI Mining of Information Networks at the center, linked to I1.1 Context-Aware Data Fusion; I1.2 QoI Sensor Data Collection & Fusion; I2.2 Large-Scale Info. Network Processing; I3.2 Modeling and Mining of Text-Rich Information Networks; T1.2 Large-Scale Info. Network Processing; T2.4 Network Behavior Based on Trust; E2.2 Tactic Mobility Models; E2.3 Co-Evolution of Composite Networks; S1.1+T1.5; IRC Data & Experiments. Collaborators named on the links: Tarek, Lei, Huang, Leung, Adali, Pirolli, Charu, La Porta, Chawla, Heng, Yan, Roth, Zen, Tong.)
• Weekly/monthly meetings or teleconferences among collaborators, joint research papers, proposals, etc.

Next Six Months and Path Ahead to 2012
• Continue research on QoI mining of information networks
  • Research on three frontiers: (1) integrated classification and clustering in network mining, (2) building a theory of link/relationship analysis in heterogeneous networks, and (3) exploring military applications
• Exploration and consolidation of cross-center collaborations
  • Work with Nitesh Chawla and Bolek on evaluation of mining methods for clustering and classification of heterogeneous networks
  • Work with Tom La Porta on mining communication/information networks
• Next year's research, planned if funded
  • Effective theory and methods for mining heterogeneous networks involving social and communication networks
  • Network fusion: Integration and modeling across multiple heterogeneous networks of multiple genres
  • Data fusion: Exploring information enhancement across multiple networks of multiple genres
  • Application of role discovery, network classification, anomaly detection, and network fusion methods in military applications

Research Papers (Accepted/Published) (2011)
• (UIUC + Microsoft + Google) Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han, “Graph Cube: On Warehousing and OLAP Multidimensional Networks”, Proc. of 2011 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'11), Athens, Greece, June 2011.
• (UCSB) Z. Guan, J. Wu, Z. Yun, A. Singh, X. Yan, “Assessing and Ranking Structural Correlations in Graphs”, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011.
• (CMU) U Kang, Duen Horng Chau, and Christos Faloutsos, “Mining Large Graphs: Algorithms, Inference, and Discoveries”, IEEE Int. Conf. on Data Engineering (ICDE) 2011, Hannover, Germany.
• (SMU + UCSB + UIUC) Qiang Qu, Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu, and Hongyan Li, “Efficient Topological OLAP on Information Networks”, Proc. of 2011 Int. Conf. on Database Systems for Advanced Applications (DASFAA'11), Hong Kong, Apr. 2011.
• (UIUC + IBM) Jing Gao, Wei Fan, Deepak S. Turaga, Olivier Verscheure, Xiaoqiao Meng, Lu Su, Jiawei Han, “Consensus Extraction from Heterogeneous Detectors to Improve Performance over Network Traffic Anomaly Detection”, Proc. of 2011 IEEE INFOCOM Mini-Conf. (INFOCOM-Mini'11), Shanghai, China, Apr. 2011.
• (CMU) Duen Horng Chau, Carey Nachenberg, Jeffrey Wilhelm, Adam Wright, Christos Faloutsos, “Polonium: Tera-Scale Graph Mining and Inference for Malware Detection”, SIAM Int. Conf. on Data Mining (SDM) 2011.
• (CMU) Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos, “Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning”, ACM Conf. on Human Factors in Computing Systems (CHI 2011).
• (UIUC) Bolin Ding, Marianne Winslett, Jiawei Han, and Zhenhui Li, “Differentially Private Data Cube: Optimizing Noise Source and Consistency”, Proc. of 2011 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'11), Athens, Greece, June 2011.
• (UIUC) Manish Gupta, Yizhou Sun, and Jiawei Han, “Trust Analysis with Clustering”, Proc. of 2011 Int. World Wide Web Conf. (WWW'11), Hyderabad, India, March 2011.

Research Papers (Published) (Sept.-Dec. 2010)
• (CMU) Leman Akoglu and Christos Faloutsos, “Event Detection in Time Series of Mobile Communication Graphs”, 27th Army Science Conference, Orlando, Florida, Dec. 2010.
• (CMU) Leman Akoglu and Christos Faloutsos, “ValuePick: Towards a Value-Oriented Dual-Goal Recommender System”, ICDM Workshop on Optimization Based Methods for Emerging Data Mining Problems, Sydney, Australia, Dec. 2010.
• (CMU) Pedro Olmo Vaz de Melo, Leman Akoglu, Christos Faloutsos, Antonio Loureiro, “Surprising Patterns for the Call Duration Distribution of Mobile Phone Users”, ECML PKDD, Barcelona, Spain, Sep. 2010.
• (Kodak + UIUC) Jie Yu, Xin Jin, Jiawei Han, Jiebo Luo, “Collection-based Sparse Label Propagation and Its Application on Social Group Suggestion from Photos”, ACM Transactions on Intelligent Systems and Technology (TIST), 2(2), 2011.
• (UIUC) Xin Jin, Sangkyum Kim, Jiawei Han, Liangliang Cao, and Zhijun Yin, “A General Framework for Efficient Clustering of Large Datasets based on Activity Detection”, Statistical Analysis and Data Mining, accepted Sept. 2010.
• (UIUC) Heli Sun, Jianbin Huang, Jiawei Han, Hongbo Deng, Peixiang Zhao, and Boqin Feng, “gSkeleton-Clu: Density-based Network Clustering via Structure-Connected Tree Division or Agglomeration”, Proc. of 2010 Int. Conf. on Data Mining (ICDM'10), Sydney, Australia, Dec. 2010.
• (UTD + UIUC) Mohammad Masud, Qing Chen, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han, and Bhavani Thuraisingham, “Addressing Concept-Evolution in Concept-Drifting Data Streams”, Proc. of 2010 Int. Conf. on Data Mining (ICDM'10), Sydney, Australia, Dec. 2010.
• (UIUC) Jianbin Huang, Heli Sun, Jiawei Han, Hongbo Deng, Yizhou Sun, and Yaguang Liu, “SHRINK: A Structural Clustering Algorithm for Detecting Hierarchical Communities in Networks”, Proc. 2010 ACM Int. Conf. on Information and Knowledge Management (CIKM'10), Toronto, Canada, Oct. 2010.
• (UIUC) Xin Jin, Jiawei Han, Liangliang Cao, Jiebo Luo, Bolin Ding, Cindy Xide Lin, “Visual Cube and On-Line Analytical Processing of Images”, Proc. 2010 ACM Int. Conf. on Information and Knowledge Management (CIKM'10), Toronto, Canada, Oct. 2010.
• (UIUC) Ming Ji, Yizhou Sun, Marina Danilevsky, Jiawei Han, and Jing Gao, “Graph Regularized Transductive Classification on Heterogeneous Information Networks”, Proc. 2010 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD'10), Barcelona, Spain, Sept. 2010.
• (UIUC) Hyung Sul Kim, Sangkyum Kim, Tim Weninger, Jiawei Han, and Tarek Abdelzaher, “NDPMine: Efficiently Mining Discriminative Numerical Features for Pattern-Based Classification”, Proc. 2010 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD'10), Barcelona, Spain, Sept. 2010.
• (UTD + UIUC) Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham, “Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space”, Proc. 2010 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD'10), Barcelona, Spain, Sept. 2010.
• (UIUC + New York State Museum) Zhenhui Li, Bolin Ding, Jiawei Han, and Roland Kays, “Swarm: Mining Relaxed Temporal Moving Object Clusters”, Proc. 2010 Int. Conf. on Very Large Data Bases (VLDB'10), Singapore, Sept. 2010.
• (UIUC + New York State Museum) Peixiang Zhao and Jiawei Han, “On Graph Query Optimization in Large Networks”, Proc. 2010 Int. Conf. on Very Large Data Bases (VLDB'10), Singapore, Sept. 2010.

Research Papers (Submitted, 2011)
• (UIUC + CUNY) Chi Wang, Jiawei Han, Xiang Li, Qi Li, Wen-Pin Lin, Adam Lee, Hao Li, Heng Ji, "Uncovering Hierarchical Relationships among Linked Objects: A Probabilistic Modeling Approach", KDD'11 (sub)
• (UIUC + IBM) Jing Gao, Wei Fan, Deepak Turaga, Srinivasan Parthasarathy, and Jiawei Han, "A Spectral Framework for Detecting Inconsistency across Multi-Source Object Relationships", KDD'11 (sub)
• (UIUC) Manish Gupta, Jiawei Han, and Yizhou Sun, "Cluster-Based Analysis of Information Trustworthiness", KDD'11 (sub) (also for Trust as collab.)
• (UIUC) Ming Ji and Jiawei Han, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11 (sub)
• (UIUC + IBM) Xin Jin, Scott Spangler, Ying Chen, and Jiawei Han, "Patent Value Estimation and Maintenance Recommendation with Patent Information Network Model", KDD'11 (sub)
• (UIUC + Tsinghua U) Yizhou Sun, Jie Tang, Jiawei Han, Cheng Chen, Manish Gupta, "Studying Co-Evolution of Multi-Typed Objects in Dynamic Heterogeneous Information Networks", KDD'11 (sub)
• (UIUC + IBM) Yizhou Sun, Charu Aggarwal, Jiawei Han, "A Framework for Clustering Heterogeneous Information Networks with Incomplete Attributes", KDD'11 (sub)
• (UIUC + UCSB + UIC) Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, Tianyi Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks", VLDB’11 (sub)
• (UCSB + UIUC) Xifeng Yan, et al., "Towards Iceberg Analysis in Graph OLAP", in preparation for VLDB Journal
• (UIUC) Ming Ji, Xiaofei He, and Jiawei Han, "A Variance Minimization Criterion to Active Learning on Graphs", submitted to Int. Conf. on Machine Learning (ICML'11), June 2011
• (UIUC) Ming Ji, Jun Yan, Siyu Gu, and Jiawei Han, "Learning Search Tasks in Queries and Web Pages via Graph Regularization", submitted to Int. ACM SIGIR Conf. (SIGIR'11), July 2011
• (CMU) Leman Akoglu, Pedro Olmo Vaz de Melo, Christos Faloutsos, "Reciprocity in Human Communication Networks", KDD'11 (sub)

Other Technical Contributions
• (Book: UIC + UIUC + CMU) Philip S. Yu, Jiawei Han, and Christos Faloutsos (Editors), LINK MINING: MODELS, ALGORITHMS AND APPLICATIONS, Springer, 2010.
• (UCSB) Xifeng Yan, invited talk, “Graph Pattern Mining and System,” Microsoft Research Asia, Beijing, Nov. 2010.
• (UIUC) INARC Ph.D. student Chi Wang (CS, UIUC) has received a 2011 Microsoft Research Ph.D. Fellowship, a highly competitive award: only 12 Ph.D. Fellowship awardees were selected in total across all research fields in the U.S. in 2011. Chi Wang is supervised by Jiawei Han at INARC.
• (UIUC) Jing Gao, who was partially supported by the INARC program, has received an IBM Ph.D. Fellowship for 2011-2012. She first received the IBM Ph.D. Fellowship for the 2010-2011 academic year, so this is her second-year award. Jing Gao is supervised by Jiawei Han at INARC.
• (UIUC) Jiawei Han has received the Daniel C. Drucker Eminent Faculty Award at UIUC.
• Jiawei Han, “Towards Integrated Mining of Multiple Social and Information Networks” (keynote speech), The 2011 Int. Conf. on Advances in Social Network Analysis and Mining (ASONAM’11), July 2011.
• Jiawei Han, “Exploring the Power of Heterogeneous Information Networks in Data Mining” (keynote speech), The 2011 Int. SIAM Data Mining Conf. (SDM’11), April 2011.
• Jiawei Han, “Construction and Analysis of Web-Based Computer Science Information Networks” (keynote speech), The 2011 Int. Conf. on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC'11), June 2011.
• Latifur Khan, Wei Fan, Jiawei Han, Jing Gao, Mohammad Mehedy Masud, “Data Stream Mining: Challenges and Techniques” (tutorial), The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011), May 2011.
• Jiawei Han, “Web Structure Mining and Information Network Analysis: An Integrated Approach”, invited speech at the Third International Workshop on Network Theory: Web Science Meets Network Science, March 2011.
