Hsinchun Chen, Ph.D. Director, Artificial Intelligence Lab

“Business Intelligence Mining in Web 2.0: Data, Text and Web Mining for Finance, Accounting and Marketing Applications”. Hsinchun Chen, Ph.D. Director, Artificial Intelligence Lab Director, NSF COPLINK and Dark Web Research Centers University of Arizona


Presentation Transcript


  1. “Business Intelligence Mining in Web 2.0: Data, Text and Web Mining for Finance, Accounting and Marketing Applications” Hsinchun Chen, Ph.D. Director, Artificial Intelligence Lab Director, NSF COPLINK and Dark Web Research Centers University of Arizona Acknowledgements: NSF, LOC, ITIC/KDD, DHS, DOJ

  2. My Background • NCTU → SUNY Buffalo → NYU → U Arizona (MIS #4) • MS, MIS, Design Science, AI, Search Engine, Digital Library, Medical Informatics, Intelligence & Security Informatics, Business Intelligence • AI Lab, 25+ researchers; $25M funding ($1.5M/year), 180 top SCI papers (20+ papers/year); DL (#1), MIS (#8); Scientific Advisor: NLC, NLM, Academia Sinica; Chair, ICADL, IEEE ISI • AE in ten top SCI journals, IEEE and AAAS Fellow • DL/SE; GeneScene & BioPortal; COPLINK & Dark Web (NYT, USA Today, Associated Press, etc.); Knowledge Computing Corporation ($100M) • Business Intelligence Mining???

  3. The Peta Age: The End of Theory

  4. Outline • Web 2.0 + Data Mining, Text mining, Web mining • Intelligence and Security Informatics • Case Studies, Examples, and Lessons Learned: Business Intelligence Data, Text and Web mining • Opportunities and Future Directions: Finance, Accounting, and Marketing Applications

  5. Web 2.0, Data Mining, Text Mining, and Web Mining

  6. Web 2.0, by O’Reilly • http://www.oreilly.com, “What is Web 2.0? Design Patterns and Business Models for the Next Generation of Software,” by Tim O’Reilly, 9/30/2005 (O’Reilly Media Web 2.0 Conference, 2004) • Examples of Web 2.0: Google AdSense, Flickr, Napster, Wikipedia, blogging, search engine optimization, web services, participation, tagging (folksonomy), syndication, etc.

  7. Web 2.0, by O’Reilly • Strategic positioning: “The Web as Platform” • User positioning: “You control your own data” • Core competencies: • Services, not packaged software • Architecture of participation • Cost-effective scalability • Remixable data sources and data transformations • Software above the level of a single device • Harnessing collective intelligence

  8. Web 2.0 Lessons • The value of the software is proportional to the scale and dynamism of the data it helps to manage. • Leverage customer self-service and algorithmic data management to reach out to the entire web, to the edges and not just the center, to the long tail and not just the head. • The service automatically gets better the more people use it. • Blogging and the wisdom of the crowds. • Network effects from user participation are the key to market dominance in the Web 2.0 era. • We, the media. • Data is the next Intel Inside.

  9. Web 2.0 Lessons (cont’d) • Operations must become a core competency. • The perpetual beta. • Support lightweight programming models that allow for loosely coupled systems. (SOAP, REST, AJAX, etc.) • Think syndication, not coordination. • Innovation in assembly. The Mashups. • Design for “hackability” and remixability. • Some rights reserved.

  10. Web 2.0, Wikipedia • “Web 2.0 is a trend in the use of the WWW technology and web design that aims to facilitate creativity, information sharing, and collaboration among users.” • “Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as platform, and an attempt to understand the rules for success on that new platform.”

  11. Web 2.0 Characteristics • Rich user experience • User participation • Dynamic content • Metadata • Web standards and scalability • Openness • Freedom • Collective intelligence

  12. Web 2.0 Features/Technologies • Technological infrastructure: server software, content syndication, messaging protocols, browsers with plug-ins and extensions, various client applications. • Cascading Style Sheets (CSS) to separate presentation from content • Folksonomy (collective tagging) • Microformats extending pages with semantics • REST, XML, JSON based APIs • Rich Internet application techniques based on AJAX • RSS or Atom feeds for syndication and notification of data • Mashups of content from different sources • Weblog publishing, and wikis

  13. Web 2.0 Criticism • “Web 2.0 as a piece of jargon,” by Tim Berners-Lee • “A second bubble” • “Bubble 2.0” • “A mere augmentation of current cultural information exchanges that are bound by existing political and societal structures.”

  14. Web Programming with Amazon, Google, and eBay APIs

  15. What are Web Services? • Web Services: a new way to reuse and integrate third-party software or legacy systems • Works no matter where the software is, what platform it resides on, or which language it was written in • Based on XML and Internet protocols (HTTP, SMTP…) • Benefits: ease of integration; faster application development

  16. Web Services Architecture • Simple Object Access Protocol (SOAP) • Web Services Description Language (WSDL) • Universal Description, Discovery and Integration (UDDI)
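
To make the SOAP layer of this architecture concrete, the following is a minimal sketch of how a SOAP 1.1 request envelope could be assembled with Python's standard library. The service namespace `http://example.com/quotes`, the operation `GetQuote`, and the parameter `symbol` are hypothetical names chosen for illustration; a real service would publish its own names in its WSDL.

```python
import xml.etree.ElementTree as ET

# Namespace of the SOAP 1.1 envelope.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"


def build_soap_request(service_ns, operation, params):
    """Return a SOAP 1.1 envelope (as text) that calls `operation`.

    `service_ns`, `operation`, and the keys of `params` are assumptions
    here; in practice they come from the service's WSDL description.
    """
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(call, f"{{{service_ns}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical service and operation, for illustration only.
request_xml = build_soap_request(
    "http://example.com/quotes", "GetQuote", {"symbol": "ABC"}
)
```

The resulting text would normally be sent as the body of an HTTP POST to the service endpoint; that step is omitted here because it depends on the specific service.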

  17. New Breeds of Web Services • Representational State Transfer (REST) • Uses the HTTP GET method to invoke remote services (no XML request envelope) • The response of the remote service can be in XML or any textual format • Benefits: easy to develop; easy to debug (with a standard browser); leverages existing web application infrastructure
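
A REST-style call of the kind described on this slide can be sketched as follows, assuming a hypothetical search endpoint `http://api.example.com/search` that returns XML. The request is just a URL with query parameters, which is why it can be pasted into a browser for debugging. The sample response is invented for illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_rest_url(base_url, params):
    """Encode the parameters into the query string of an HTTP GET URL."""
    return f"{base_url}?{urlencode(params)}"


def parse_titles(xml_text):
    """Extract the <title> of every <item> in an XML response."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.iter("item")]


# Hypothetical endpoint and parameters.
url = build_rest_url("http://api.example.com/search", {"q": "web mining", "page": 1})

# In a live setting the response would be fetched with
# urllib.request.urlopen(url).read(); a canned response is used here instead.
sample_response = (
    "<results>"
    "<item><title>Web Mining Basics</title></item>"
    "<item><title>Text Mining Overview</title></item>"
    "</results>"
)
titles = parse_titles(sample_response)
```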

  18. Server Responses in REST • Really Simple Syndication (RSS) and Atom: XML-based standards; designed for news-oriented websites to “push” content to readers; excellent for monitoring new content from websites • JavaScript Object Notation (JSON): lightweight data-interchange format; human readable and writable, and also machine friendly; wide support from most languages (Java, C, C#, PHP, Ruby, Python…)
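
The two response formats can be compared side by side. The sketch below parses the same two entries once from a small RSS 2.0 document and once from JSON, using only the Python standard library. Both samples are made up for illustration, and the JSON layout (an `items` list of `title`/`link` objects) is an assumption, since JSON responses have no fixed feed schema.

```python
import json
import xml.etree.ElementTree as ET

RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

# Hypothetical JSON layout carrying the same information.
JSON_SAMPLE = """{"items": [
  {"title": "First post", "link": "http://example.com/1"},
  {"title": "Second post", "link": "http://example.com/2"}
]}"""


def parse_rss(xml_text):
    """Return the title and link of each <item> in an RSS 2.0 channel."""
    channel = ET.fromstring(xml_text).find("channel")
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in channel.findall("item")
    ]


def parse_json(text):
    """Return the entries of the assumed `items` list."""
    return json.loads(text)["items"]


rss_entries = parse_rss(RSS_SAMPLE)
json_entries = parse_json(JSON_SAMPLE)
```

Both calls yield the same list of dictionaries, which is what makes it straightforward to monitor several sites through one code path regardless of the format each site serves.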

  19. Rich Interactivity Web - AJAX • AJAX: Asynchronous JavaScript + XML • AJAX incorporates: standards-based presentation using XHTML and CSS; dynamic display and interaction using the Document Object Model; data interchange and manipulation using XML and XSLT; asynchronous data retrieval using XMLHttpRequest; and JavaScript binding everything together. • Examples: http://www.gmail.com, http://www.kiko.com • More info: http://www.adaptivepath.com/publications/essays/archives/000385.php

  20. AJAX Application Model

  21. Amazon Web Services (AWS) • Amazon E-Commerce Service: search catalog, retrieve product information, images and customer reviews; retrieve wish list, wedding registry…; search seller and offer • Alexa Services: retrieve information such as site rank, traffic rank, thumbnail, and related sites, among others, given a target URL • Amazon Historical Pricing: programmatic access to over three years of actual sales data • Amazon Simple Queue and Storage Service: a distributed resource manager to store web services results • Amazon Elastic Compute Cloud: sells computing capacity by the amount you use

  22. Google Web APIs • Google has a long list of APIs: http://code.google.com/apis/ • Google Search: AJAX Search API; SOAP Search API (deprecated); custom search engine with Google Co-op • Google Map API • Google Data API (GData): Blogger, Google Base, Calendar, Gmail, Spreadsheets, and a lot more • Google Talk: XMPP for communication and IM • Google Translation (http://www.oreillynet.com/pub/h/4807) • Many more undocumented/unlisted APIs to be discovered in Google Blog

  23. eBay API • Buyers: get the current list of eBay categories; view information about items listed on eBay; display eBay listings on other sites; leave feedback about other users at the conclusion of a commerce transaction • Sellers: submit items for listing on eBay; get high bidder information for items you are selling; retrieve lists of items a particular user is currently selling through eBay; retrieve lists of items a particular user has bid on

  24. Other Services/API Providers • Yahoo! (http://developer.yahoo.com/): Search (web, news, video, audio, image…); Flickr, del.icio.us, MyWeb, Answers API • Windows Live (http://msdn2.microsoft.com/en-us/live/default.aspx): Search (SOAP, REST); Spaces (blog), Virtual Earth, Live ID • Wikipedia: downloadable database (http://en.wikipedia.org/wiki/Wikipedia:Technical_FAQ#Is_it_possible_to_download_the_contents_of_Wikipedia.3F) • Many more at Programmableweb.com: http://www.programmableweb.com/apis

  25. Services by Category • Search: Google, MSN, Yahoo • E-Commerce: Amazon, eBay, Google Checkout, TechBargain, DealSea, FatWallet • Mapping: Google, Yahoo!, Microsoft • Community: Blogger, MySpace, MyWeb, del.icio.us, StumbleUpon • Photo/Video: YouTube, Google Video, Flickr • Identity/Authentication: Microsoft, Google, Yahoo • News: various news feed websites including Reuters, Yahoo! and many more.

  26. Mashup: A Novel Form of Web Reuse • “A mashup is a website or application that combines content from more than one source into an integrated experience.” – Wikipedia • API X + API Y = mashup Z • Business model: Advertisement
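
The "API X + API Y = mashup Z" idea can be sketched as a join between two data sources. In the example below, the listings stand in for the output of a hypothetical e-commerce API and the coordinates for the output of a hypothetical mapping API; all field names and values are illustrative assumptions, not data from any real service.

```python
def mashup(listings, coordinates):
    """Attach map coordinates (API Y) to product listings (API X).

    Listings whose city has no known coordinates are skipped, since
    they could not be placed on a map.
    """
    merged = []
    for listing in listings:
        coords = coordinates.get(listing["city"])
        if coords is None:
            continue
        merged.append({**listing, "lat": coords[0], "lon": coords[1]})
    return merged


# Hypothetical responses from the two source APIs.
listings = [
    {"item": "used textbook", "price": 20.0, "city": "Tucson"},
    {"item": "bicycle", "price": 150.0, "city": "Phoenix"},
    {"item": "lamp", "price": 12.0, "city": "Unknown"},
]
coordinates = {"Tucson": (32.22, -110.97), "Phoenix": (33.45, -112.07)}

map_ready = mashup(listings, coordinates)
```

The combined records are what a mashup site would hand to a map widget; neither source alone contains both the offer and its location.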

  27. Web Mining: Machine Learning for Web Applications Hsinchun Chen and Michael Chau ARIST, 38, 2004

  28. What is Web Mining? • The term Web Mining was coined by Etzioni (1996) to denote the use of Data Mining techniques to automatically discover Web documents and services, extract information from Web resources, and uncover general patterns on the Web. • In this article, we have adopted a broad definition that considers Web mining to be “the discovery and analysis of useful information from the World Wide Web” (Cooley et al., 1997). • Also, web mining research overlaps substantially with other areas, including data mining, text mining, information retrieval, and web retrieval.

  29. Machine Learning Paradigms • Machine learning algorithms can be classified as • Supervised learning: Training examples contain input/output pair patterns. Learn how to predict the output values of new examples. • Unsupervised learning: Training examples contain only the input patterns and no explicit target output. The learning algorithm needs to generalize from the input patterns to discover the output values. • We have identified the following five major Machine Learning paradigms: • Probabilistic models • Symbolic learning and rule induction • Neural networks • Analytic learning and fuzzy logic • Evolution-based models • Hybrid approaches
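
The supervised/unsupervised distinction on this slide can be illustrated with two deliberately tiny learners over one-dimensional data. This is a sketch for intuition only: a nearest-centroid classifier stands in for supervised learning and a two-cluster k-means for unsupervised learning; neither is claimed to be a method used in the talk, and the data and labels are invented.

```python
def train_centroids(examples):
    """Supervised: learn the mean input value for each output label.

    `examples` is a list of (value, label) pairs.
    """
    sums = {}
    for x, label in examples:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two groups (k-means, k=2)."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for x in values:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        # Keep the old center if a group happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Supervised: inputs come with target labels.
labeled = [(1.0, "short"), (2.0, "short"), (9.0, "long"), (10.0, "long")]
model = train_centroids(labeled)

# Unsupervised: the same inputs without labels.
clusters = two_means([x for x, _ in labeled])
```

With labels, the learner can name the class of a new value; without them, it can only report that the values fall into two groups.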

  30. Machine Learning for Information Retrieval: Pre-Web • Learning techniques had been applied in Information Retrieval (IR) applications long before the recent advances of the Web. • In this section, we will briefly survey some of the research in this area, covering the use of Machine Learning in • Information extraction • Relevance feedback • Information filtering • Text classification and text clustering

  31. Web Mining • Web Mining research can be classified into three categories: • Web content mining refers to the discovery of useful information from Web contents, including text, images, audio, video, etc. • Web structure mining studies the model underlying the link structures of the Web. • It has been used for search engine result ranking and other Web applications (e.g., Brin & Page, 1998; Kleinberg, 1998). • Web usage mining focuses on using data mining techniques to analyze search logs to find interesting patterns. • One of the main applications of Web usage mining is its use to learn user profiles (e.g., Armstrong et al., 1995; Wasfi et al., 1999).
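
Web structure mining for result ranking can be illustrated with a simplified PageRank-style power iteration, in the spirit of the link analysis cited above (Brin & Page, 1998). This is a textbook sketch, not the algorithm of any production search engine; the damping factor of 0.85, the iteration count, and the three-page graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by link structure using power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph page `c` ends up ranked above page `b`, because `c` is linked from both other pages while `b` is linked only from `a`.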

  32. Intelligence and Security Informatics: COPLINK and Dark Web

  33. Intelligence and Security Informatics (ISI): “Development of advanced information technologies, systems, algorithms, and databases for national security related applications, through an integrated technological, organizational, and policy-based approach” (Chen et al., 2003a) • Data, text, and web mining • From COPLINK to Dark Web • H. Chen, computer scientist, artificial intelligence, U. of Arizona (2006)

  34. A knowledge discovery research framework for ISI

  35. ISI Research: KDD Techniques • Information Sharing and Collaboration • Crime Association Mining • Crime Classification and Clustering • Intelligence Text Mining • Crime Spatial and Temporal Mining • Criminal Network Analysis

  36. COPLINK • 1996-, DOJ, NIJ, NSF, ITIC, DHS • Connect • Detect • Agent • STV (Spatio-Temporal Visualization) • CAN (Criminal Activity Network) • BorderSafe (Mutual Information) • AI Lab → Knowledge Computing Corporation • Tucson, Phoenix → AZ → 1600 agencies, 20 states

  37. The New York Times, November 2, 2002 • ABC News, April 15, 2003 • Newsweek Magazine, March 3, 2003

  38. Dark Web • 2002-, ITIC, NSF, LOC • Discussions: FBI, DOD/Dept of Army, NSA, DHS • Connection: • Web site spidering • Forum spidering • Video spidering • Analysis and Visualization: • Link and content analysis (web sites) • Web metrics analysis (web sites sophistication) • Authorship analysis (forums; CyberGate) • Sentiment analysis (forums; CyberGate) • Video coding and analysis (videos; MCT)
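
The web site spidering step listed on this slide is, at its core, a crawl that follows hyperlinks outward from seed pages. The sketch below shows a generic breadth-first crawler built on Python's standard library; it is not the Dark Web project's collection tool. The `fetch` callable and the in-memory `PAGES` site are hypothetical stand-ins so the example needs no network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first; `fetch(url)` returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved
        visited.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site held in memory instead of on the network.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="b.html">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b.html": "",
}
visited = crawl("http://example.com/", PAGES.get)
```

A forum or video spider would follow the same pattern but would need site-specific handling (logins, pagination, media downloads) that is outside the scope of this sketch.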

  39. The Dark Web project in the Press • Project Seeks to Track Terror Web Posts, 11/11/2007 • Researchers say tool could trace online posts to terrorists, 11/11/2007 • Mathematicians Work to Help Track Terrorist Activity, 9/14/2007 • Team from the University of Arizona identifies and tracks terrorists on the Web, 9/10/2007

  40. COPLINK Connect • Consolidating & sharing information promotes problem solving and collaboration • Records Management Systems (RMS) • Gang Database • Mugshots Database

  41. COPLINK Detect • Consolidated information enables targeted problem solving via powerful investigative criminal association analysis
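
One simple way to think about association analysis over consolidated records is to count how often two entities appear in the same record. The sketch below does exactly that; it is an illustrative co-occurrence count, not COPLINK Detect's actual algorithm, and the record contents and the `type:name` entity labels are invented for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(records):
    """Count how many records mention each pair of entities together."""
    counts = Counter()
    for entities in records:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Hypothetical incident records, each listing the entities involved.
records = [
    ["person:A", "vehicle:X"],
    ["person:A", "person:B", "vehicle:X"],
    ["person:B", "vehicle:Y"],
]
associations = cooccurrence(records)
strongest = associations.most_common(1)[0]
```

Pairs with high counts are candidates for a closer look; a fuller analysis would also weight by how common each entity is overall, which this sketch leaves out.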

  42. COPLINK Detect 2.0/2.5

  43. Association Retrieval and Visualization

  44. Spatio-temporal Analysis and Visualization

  45. Border Crossing • An aerial photograph of a typical U.S. port of entry (southern border). • Vehicle lanes are backed up with dozens of vehicles during peak times. • Criminal vehicles operate in groups. • If one is caught others turn back into Mexico. • They may join the lines one at a time or use turn-out points.

  46. Shape indicates object type: circles are people, rectangles are vehicles • Color denotes activity history • Larger size indicates higher levels of activity • Border crossing plates are outlined in red • Gang related • Violent crimes • Narcotics crimes • Violent & Narcotics • A Vehicle to Watch (via SNA)?

  47. Dark Web Collection Where/how to find them?

  48. Web Site Example: Links to Multimedia and Manuals • Berg beheading, other videos of Zarqawi • Azzam speeches • Complete 65-page manual of a 50 caliber rifle in PDF • Link to “The General of Islam” radio station • Source: http://www.al-ghazawat.110mb.com/, French and Arabic Web Site

  49. Web Site Example: Links to Web Sites and Forums • Links to Several Iraqi Jihadist Web Sites and Forums • Source: http://almaaber.jeeran.com/, Arabic Web Site
