## Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine

**Entity-Based Data Mining from Spatio-Temporal Events and Text Sources**

Presentation at KD-D Program Review, Nov 18-19 2003
Padhraic Smyth, Sharad Mehrotra
Information and Computer Science, University of California, Irvine
{smyth, sharad}@ics.uci.edu
www.datalab.uci.edu

**Project Participants**

• Principal Investigators
  • Padhraic Smyth: Data mining
  • Sharad Mehrotra: Databases
• Collaborators
  • Mark Steyvers: Text and Author Modeling
• Postdoctoral Researchers
  • Michal Rosen-Zvi, Dmitri Kalashnikov
• Staff Programmer
  • Amnon Meyers: Information Extraction
• Students
  • PhD: Joshua O'Madadhain, Scott White, Yiming Ma, Dawit Seid
  • Undergraduates: Yan-Biao Boey, Momo Alhazzazi
• Acknowledgements
  • Steve Lawrence for CiteSeer data

**Problem of Interest**

• Intelligence Analysis today
  • Massive volumes/streams of data
    • Text (newswire, reports, etc.)
    • Web data
    • Transactions/events
• Central problems
  • Need flexible tools to support an analyst’s exploration of the data
  • Automatically focus an analyst’s attention on interesting parts of the data space
  • Need new theories/methods/tools…

**Entities and Events**

• Entities = Individuals, groups, communities, organizations, etc.
• Events = Contacts, collaborations, meetings, products, etc.
• Working hypothesis
  • A large component of intelligence work is centered on entities and events
    • Extracting entity information from text streams and transaction data
    • Predicting entity behavior
    • Detecting groups of related entities
• Our broad goal
  • Develop next-generation data management, exploration, and analysis tools for entity-event data

[Figure: collaboration graph. Nodes = Entities = Biotech-Related Organizations; Edges = Events = Collaborations]

[Figure: red indicates nodes selected by the data analyst as important]

[Figure: the algorithm determines that the blue nodes are important relative to the red nodes (Oxford and Cambridge)]

**Research Issues**

• Information extraction
• Data management tools
• Visualization techniques
• Interactive ad hoc querying and mining
• Statistical modeling of graph data
• Query languages for graphs
• Scalability to large graphs
• …

**Focus of Our Research**

[Diagram components: Text Sources; Information Extraction; Entity-Event Databases; Statistical Modeling and Data Mining; Visualization; Query Languages; User Modeling]

**Major Themes in Our Work**

• Focus on data in the form of graphs
  • Nodes = entities, edges = events
  • Nodes and edges have attributes (e.g., temporal)
  • Year 1: entities = computer science researchers
  • Year 1: limited spatio-temporal aspects
• Integration and coupling of
  • Statistical modeling and data mining
  • Visualization
  • Query languages and data management
• Scalability
  • Methods should scale to millions of nodes and edges
• User Interaction
  • Conditional “query-driven” analysis and mining
  • Contrast with offline global modeling

**Accomplishments**

• Infrastructure and Data Sets
  • Created testbed data sets, e.g., 100k entities, 400k events
  • Developed suite of text information extraction tools
  • Developed and released a general public-domain Java API for graph data analysis and visualization
• Statistical Modeling and Data Mining
  • Developed new statistical technique for modeling entities based on authored text
  • Developed new class of scalable algorithms for interactive graph-based data mining

**Accomplishments**

• Graph-based Querying
  • Developed framework for general graph-based query language
  • New accurate and efficient algorithms for interactive similarity queries and query refinement on graphs
• Software Tools
  • Netsight: Java-based graph visualization and analysis tool
  • Browser tool for exploring author-topic models
  • Interactive query refinement system
  • Prototype system for graph-based query language for interacting with heterogeneous graph data

**Publications in Year 1**

• Data Mining on Graphs
  • S. White and P. Smyth, Algorithms for Discovering Relative Importance in Graphs, Proceedings of the Ninth International ACM SIGKDD Conference, August 2003. Extended version submitted to JICRD, June 2003.
  • J. O'Madadhain, D. Fisher, S. White, and Y. Boey, The JUNG (Java Universal Network/Graph) Framework, UCI-ICS Tech Report 03-17, October 2003; invited presentation, Stanford Workshop on Statistical Inference, Computing and Visualization for Graphs, August 2003.
  • P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, June 2003.
• Statistical Author-Topic Models
  • T. Griffiths and M. Steyvers (in press), Finding Scientific Topics, Proceedings of the National Academy of Sciences.
  • M. Steyvers, M. Rosen-Zvi, T. Griffiths, and P. Smyth, Author Attribution with LDA, NIPS Workshop on Syntax, Semantics, and Statistics, December 2003.
• Data Management and Graph Querying
  • Y. Ma, S. Mehrotra, and D. Seid, A Framework for Refining Similarity Queries Using Learning Techniques, UCI-ICS Tech Report 03-19, Nov. 2003. Extended version submitted to EDBT 2004.
  • Y. Ma, D. Seid, and S. Mehrotra, Interactive Filtering of Data Streams by Refining Similarity Queries, UCI-ICS Tech Report 03-07, June 2003.
  • D. Seid, M. Ortega-Binderberger, Z. Chen, and S. Mehrotra, Evaluating Top-k Selection and Preference Queries on Multiple Indexed Attributes. Submitted to EDBT 2004.
  • D. Seid and S. Mehrotra, Complex Analytical Queries on Graphs and Hierarchies (in preparation).
  • L. Jin, C. Li, and S. Mehrotra, Efficient Record Linkage in Large Data Sets, 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), 26-28 March 2003, Kyoto, Japan.

**Author Database Schema**

Note: “individual-centric” not “document-centric”

**From graphs to Markov chains**

[Figure: weighted graph on nodes A, B, C, D shown next to the corresponding Markov chain with transition probabilities on its edges]

• Importance = recursive function of nodes pointing at you
• Markov approach…
  • Notion of a “token” circulating around in Markov fashion
  • Important actors see the token more often
  • Importance = stationary probability of each node
  • PageRank: surfer randomly following links on the Web

**Relative importance of node V to A**

Trade off [distance from A, structural importance of V]

**Algorithms for Relative Importance**

(S. White and P. Smyth, ACM KDD 2003; also JICRD, submitted)

• PageRank with Priors (PRankP)
  • Random walks that start from A and return to A periodically
  • Relative importance = stationary probability
  • Iterative algorithm (e.g., Haveliwala, 2002)
• HITS with priors
  • Formulate HITS as Markov chain, same idea…
• K-Step Markov
  • Use the transient probability distribution starting from A
  • Faster than stationary probability methods
• Weighted Paths
  • Heuristic approximation to K-step Markov: even faster
• All algorithms scale linearly in number of edges
  • Different constant factors

**Computation Times for Ranking Algorithms (in seconds)**

PRankP and HITS converged in 20-30 iterations

**http://jung.sourceforge.net**

JUNG: Java Universal Network/Graph Framework
16,000 page visits and 800 downloads since August

**Authors**

[Diagram layers: Authors, Words]

Can we model authors, given documents? (More generally, build statistical profiles of entities given sparse observed data.)

**Authors**

[Diagram layers: Authors, Hidden Topics, Words]

Model = Author-Topic distributions + Topic-Word distributions
Parameters learned via Bayesian learning

**Hidden Topics**

[Diagram layers: Hidden Topics, Words]

“Topic Model”:
• A document can be generated from multiple topics
• Hofmann (SIGIR ’99); Blei, Jordan, Ng (JMLR, 2003)

**Authors**

[Diagram layers: Authors, Hidden Topics, Words]

Model = Author-Topic distributions + Topic-Word distributions
NOTE: documents can be composed of multiple topics

**Topic Models from CiteSeer**

• WORDS: probabilistic, Bayesian, carlo, monte, distribution, inference, conditional, prior, mixture, Markov, posterior, belief…
  AUTHORS: N_Friedman, D_Heckerman, Z_Ghahramani, D_Koller, M_Jordan, R_Neal, A_Raftery, T_Lukasiewicz, J_Halpern…
• WORDS: retrieval, text, document, information, content, indexing, relevance, collection, query, IR, feedback…
  AUTHORS: D. Oard, W_Croft, K_Jones, P_Schauble, E_Voorhees, A_Singhal, D_Hawking, J_Allan, A_Smeaton, M_Hearst…

**Topic Models from CiteSeer**

• WORDS: Web, user, world, wide, pages, www, site, internet, hypertext, hypermedia, content, links, page, navigation…
  AUTHORS: S. Lawrence, B. Mobasher, M. Levene, D. Florescu, O. Etzioni, R_Studer, W. Hall, R. Fielding, J. Pitkow, M. Crovella…
• WORDS: data, mining, attributes, discovery, association, large, knowledge, databases, dataset, interesting, frequent, discover, sets…
  AUTHORS: J. Han, R. Rastogi, M. Zaki, R. Ng, B. Liu, H. Mannila, S. Brin, H Liu, L. Holder, H. Toivonen…

**Author-Topic Models from CiteSeer**

• Author = A McCallum:
  • Topic 1: classification, training, generalization, decision, data…
  • Topic 2: learning, machine, examples, reinforcement, inductive…
  • Topic 3: retrieval, text, document, information, content…
• Author = H Garcia-Molina:
  • Topic 1: query, index, data, join, processing, aggregate…
  • Topic 2: transaction, concurrency, copy, permission, distributed…
  • Topic 3: source, separation, paper, heterogeneous, merging…
• Author = P Cohen:
  • Topic 1: agent, multi, coordination, autonomous, intelligent…
  • Topic 2: planning, action, goal, world, execution, situation…
  • Topic 3: human, interaction, people, cognitive, social, natural…

**Author-Topic Browser**

• Interesting scalability issues
  • CiteSeer model exceeds 1 Gbyte
  • Real-time query answering demands Gibbs sampling (not well suited to SQL!)
• Solution
  • Coupling of Gibbs sampling and relational DB (it works!)

[Diagram components: Java Query GUI; SQL Interface; Bayesian Sampling; MySQL DB; Original Text + Statistical Model]
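
The relative-importance slides describe PageRank with Priors as a random walk that periodically returns to a chosen node A, and K-Step Markov as the transient distribution of a walk started from A. The sketch below is one way to read those two bullets, not the authors' implementation or the JUNG API. The four-node graph, its edge weights, the return probability `beta`, and the choice to average the K-step distributions are all assumptions made for illustration.

```python
def transition_matrix(weights):
    """Row-normalize edge weights into Markov transition probabilities."""
    matrix = []
    for row in weights:
        total = sum(row)
        if total == 0:
            raise ValueError("this sketch assumes every node has an outgoing edge")
        matrix.append([w / total for w in row])
    return matrix


def step(dist, P):
    """Advance the chain one step: new[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def pagerank_with_priors(P, prior, beta=0.3, tol=1e-12, max_iter=1000):
    """Stationary distribution of a walk that jumps back to `prior`
    with probability beta at every step (beta is an assumed value)."""
    pi = list(prior)
    for _ in range(max_iter):
        walked = step(pi, P)
        new_pi = [(1 - beta) * w + beta * p for w, p in zip(walked, prior)]
        if sum(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi


def k_step_markov(P, prior, k=3):
    """Average of the walk's distributions after 1..k steps from `prior`."""
    dist = list(prior)
    total = [0.0] * len(prior)
    for _ in range(k):
        dist = step(dist, P)
        total = [t + d for t, d in zip(total, dist)]
    return [t / k for t in total]


# Hypothetical weighted graph; these weights are NOT the ones on the slide.
NODES = ["A", "B", "C", "D"]
WEIGHTS = [
    [0, 4, 0, 0],  # A -> B
    [0, 0, 3, 2],  # B -> C, B -> D
    [1, 0, 0, 0],  # C -> A
    [2, 2, 0, 0],  # D -> A, D -> B
]
P = transition_matrix(WEIGHTS)
PRIOR_A = [1.0, 0.0, 0.0, 0.0]  # all prior mass on node A

rank = pagerank_with_priors(P, PRIOR_A)
kstep = k_step_markov(P, PRIOR_A, k=3)
for name, r, s in zip(NODES, rank, kstep):
    print(f"{name}: PRankP={r:.3f}  K-step={s:.3f}")
```

Each iteration touches every edge once, which is consistent with the slide's remark that the algorithms scale linearly in the number of edges; a sparse edge list would be needed in place of the dense matrix to get that behavior on large graphs.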
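
The author-topic slides state that the model consists of author-topic distributions plus topic-word distributions, and that a document can mix several topics. A minimal sketch of the generative direction of such a model follows. It assumes that each word's author is drawn uniformly from the document's author list, which the slides do not spell out, and it uses made-up author names and probabilities; the vocabulary words are borrowed from the CiteSeer topic slides only as labels. Learning the distributions (the Gibbs sampling mentioned on the browser slide) is not shown.

```python
import random


def generate_document(doc_authors, author_topic, topic_word, vocab, n_words, rng):
    """Sample (author, topic, word) triples for one document.

    Assumed process per word: pick one of the document's authors uniformly,
    pick a topic from that author's topic distribution, then pick a word
    from that topic's word distribution.
    """
    topic_ids = list(range(len(topic_word)))
    tokens = []
    for _ in range(n_words):
        author = rng.choice(doc_authors)
        topic = rng.choices(topic_ids, weights=author_topic[author])[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        tokens.append((author, topic, word))
    return tokens


# Hypothetical toy parameters, chosen only for illustration.
VOCAB = ["bayesian", "inference", "retrieval", "query"]
TOPIC_WORD = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0: probabilistic-modeling words
    [0.0, 0.0, 0.6, 0.4],  # topic 1: information-retrieval words
]
AUTHOR_TOPIC = {
    "author_x": [0.9, 0.1],  # writes mostly about topic 0
    "author_y": [0.2, 0.8],  # writes mostly about topic 1
}

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
doc = generate_document(["author_x", "author_y"], AUTHOR_TOPIC, TOPIC_WORD, VOCAB, 12, rng)
for author, topic, word in doc:
    print(author, topic, word)
```

Because the topic is chosen per word rather than per document, a co-authored document naturally blends the topics of its authors, which matches the slide's note that documents can be composed of multiple topics.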