Web Data Mining (Mining di dati web)

Slide 1:Web Data Mining
Academic year 2006/2007
Slide 2:The Course
Code: nw451
Abbreviation: MDW
Credits: 6
Schedule: Wednesday and Friday, 16:00-18:00, room B
Office hours: by appointment, requested by e-mail, at ISTI, Area Ricerca CNR, località San Cataldo, Pisa, entrance 19
Slide 3:Instructors
Raffaele Perego, raffaele.perego@isti.cnr.it, tel. 0503152993
Claudio Lucchese, claudio.lucchese@isti.cnr.it, tel. 0503152967
Fabrizio Silvestri, fabrizio.silvestri@isti.cnr.it, tel. 0503153011
Diego Puppin, diego.puppin@isti.cnr.it, tel. 0503153011
Antonio Panciatici, antonio.panciatici@isti.cnr.it, tel. 0503152967
Slide 4:Course objectives
The World Wide Web (WWW) has changed the way information is conceived, made available, and managed.
Discovering previously unknown, non-trivial, and relevant information on the Web is increasingly important and difficult.
Web mining has therefore become fundamental for optimizing strategic tools such as e-commerce sites, search engines, and directories.
The course aims to provide tools and knowledge in this area.
Slide 5:Course contents
Introduction: Data Mining, Knowledge Discovery, and the Web
Search engines: crawling, indexing, querying
Web Content Mining: similarity, clustering, text classification
Web Structure Mining: social networks, ranking, etc.
Web Usage Mining: recommender systems, etc.
Advanced topics (?!)
Slide 6:Course material
Textbook: Mining the Web: Discovering Knowledge from Hypertext Data. S. Chakrabarti. Morgan Kaufmann, 2003.
Recommended books:
Managing Gigabytes. I.H. Witten, A. Moffat, and T.C. Bell. Morgan Kaufmann, 1999.
Modern Information Retrieval. R. Baeza-Yates and B. Ribeiro-Neto. Addison Wesley, 1999.
Lecture slides and papers: published at http://malvasia.isti.cnr.it/~raffaele/webmining
Slide 7:Course material
Thanks to:
Chakrabarti and Ramakrishnan, for the slides accompanying the textbook, downloadable at http://www.cse.iitb.ac.in/~soumen/mining-the-web/
Fosca Giannotti and Dino Pedreschi, for the introductory slides borrowed from the TDM course
KDNUGGETS (http://www.kdnuggets.com)
Ferragina, Attardi, Garcia Molina, etc.
The Internet :-)
Slide 8:Exam
Prerequisites (recommended): AA270, TDM, Tecniche di Data Mining (Data Mining Techniques), first semester.
Exam format: passing the exam requires the successful completion of a project (individual or group?) and an oral discussion of the course contents (seminar on a paper of choice?).
Slide 9:Introduction
Data Mining and Knowledge Discovery
Hypertexts and a brief history of the Web
Web Mining
Slide 10:What is DM?
Slide 11:What is DM?
Slide 12:Motivations for DM
Data explosion problem: automated data collection tools, mature database technology, and the Internet lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
"We are drowning in information, but starving for knowledge!" (John Naisbitt)
Data mining: extraction of interesting knowledge (rules, regularities, patterns, constraints) from large amounts of data.
Slide 13:Motivations for DM
Abundance of business and industry data
Competitive focus: knowledge management
Inexpensive, powerful computing engines
Strong theoretical/mathematical foundations: machine learning and logic, statistics, database management systems, etc.
Slide 14:Sources of Data (e.g.)
Business transactions: widespread use of bar codes => storage of millions of transactions daily (e.g., Walmart: 2000 stores => 20M transactions per day; credit card records!). Most important problem: effective use of the data in a reasonable time frame for competitive decision-making. E-commerce data.
Scientific data: data generated through a multitude of experiments and observations. Examples: geological data, satellite imaging data, NASA earth observations, CERN HEP. The rate of data collection far exceeds the speed at which we can analyze the data.
Financial data: company information, economic data (GNP, price indexes, etc.), stock markets.
Slide 15:Sources of Data (e.g.)
Personal / statistical data: government census, medical histories, customer profiles, demographic data, data and statistics about sports and athletes.
World Wide Web and online repositories: billions of Web documents, images, video, etc.; emails, news, messages; link structure of the hypertext from millions of Web sites; Web usage data (from server/proxy logs, network traffic, and user registrations); online databases and digital libraries.
Slide 16:Classes of DM applications
Database analysis and decision support
Market analysis: target marketing, customer relationship management, market basket analysis
Risk analysis: forecasting, customer retention, quality control, competitive analysis
Fraud detection
Text mining: e.g., mining opinions from email and documents
Slide 17:Classes of DM applications
THE WEB!!
Searching: Google, Ask Jeeves, Yahoo, etc.
Social network analysis
Web advertising
E.g., IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Watch for the PRIVACY pitfall!
Many others...
Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat.
Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
Slide 18:What is KDD? A process!
The selection and processing of data for the identification of novel, accurate, and useful patterns, and the modeling of real-world phenomena.
Data mining is a major component of the KDD process: automated discovery of patterns and development of predictive and explanatory models.
Slide 19:The KDD process
Slide 20:The KDD Process in Practice
KDD steps can be merged or combined:
Data Selection + Data Transformation = Data Consolidation
Data Cleaning + Data Integration = Data Preprocessing
KDD is an iterative process: art + engineering rather than science.
[Figure labels: Identify Problem or Opportunity; Measure Effect of Action; Act on Knowledge; Knowledge; Results; Strategy; Problem]
Slide 21:The virtuous cycle
Slide 22:The steps of the KDD process
Learning the application domain: relevant prior knowledge and goals of the application
Data consolidation: creating a target data set
Selection and preprocessing
Data cleaning (may take 60% of the effort!)
Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation
Choosing data mining methods: e.g., classification, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Interpretation and evaluation: analysis of results, visualization, transformation, removing redundant patterns, ...
Use of discovered knowledge
Slide 23:Roles in the KDD process
Slide 24:Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g., A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation detection: finding changes
Estimation: predicting a continuous value
Link analysis: finding relationships
...
Slide 25:Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances.
Many approaches: statistics, decision trees, neural networks, ...
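As a rough illustration of learning from pre-labeled instances, the sketch below uses a 1-nearest-neighbor rule. This is just one simple approach chosen for brevity; the slide does not prescribe it. The feature values and class names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor_predict(train, query):
    """Return the label of the pre-labeled instance closest to `query`."""
    _features, label = min(train, key=lambda pair: dist(pair[0], query))
    return label


# Hypothetical pre-labeled instances: (feature vector, class label)
labeled = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((185.0, 90.0), "large"),
    ((190.0, 95.0), "large"),
]

# Predict the class of two unseen instances
prediction_a = nearest_neighbor_predict(labeled, (152.0, 52.0))
prediction_b = nearest_neighbor_predict(labeled, (188.0, 92.0))
print(prediction_a, prediction_b)
```

The "method" learned here is simply the stored training set plus a distance function; decision trees or neural networks would replace it with an explicit model.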
Slide 26:Clustering
Find "natural" groupings of instances given unlabeled data
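One common way to look for such groupings is k-means, sketched below. This is an illustrative choice, not a method named on the slide; the points and the initial centroids are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled 2-D data with two visually obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

No labels are used anywhere: the two groups emerge only from the distances between instances.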
Slide 27:Association Rules & Frequent Itemsets
Transactions: [table not preserved in the transcript]
Frequent itemsets: {Milk, Bread} (4), {Bread, Cereal} (3), {Milk, Bread, Cereal} (2), ...
Rules: Milk => Bread (66%)
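The counts and the rule percentage can be read as support counts and rule confidence. The sketch below assumes a hypothetical set of nine transactions, constructed only so that it reproduces the numbers quoted on the slide; it is not the slide's original table.

```python
# Hypothetical transactions, chosen to match the slide's counts
transactions = [
    {"Milk", "Bread", "Eggs"},
    {"Bread", "Sugar"},
    {"Bread", "Cereal"},
    {"Milk", "Bread", "Sugar"},
    {"Milk", "Cereal"},
    {"Bread", "Sugar"},
    {"Milk", "Cereal"},
    {"Milk", "Bread", "Cereal", "Eggs"},
    {"Milk", "Bread", "Cereal"},
]


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def confidence(antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    return support_count(set(antecedent) | set(consequent)) / support_count(antecedent)


print(support_count({"Milk", "Bread"}))            # support count of {Milk, Bread}
print(support_count({"Bread", "Cereal"}))          # support count of {Bread, Cereal}
print(support_count({"Milk", "Bread", "Cereal"}))  # support count of the 3-itemset
print(confidence({"Milk"}, {"Bread"}))             # confidence of Milk => Bread
```

With this data, Milk appears in 6 transactions and {Milk, Bread} in 4, so the confidence of Milk => Bread is 4/6, which the slide shows truncated to 66%.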
Slide 28:Visualization & Data Mining
Visualizing the data to facilitate human discovery
Presenting the discovered results in a visually "nice" way
Slide 29:Summarization
Describe features of the selected group
Use natural language and graphics
Usually in combination with deviation detection or other methods
Example: "Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ..."
Slide 30:Data Mining Central Quest
Find true patterns and avoid overfitting
Slide 31:Overfitting
Finding seemingly significant but really random patterns due to searching too many possibilities
Violation of Occam's razor: the explanation of any phenomenon should make as few assumptions as possible
Lex parsimoniae: "entia non sunt multiplicanda praeter necessitatem"
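A small simulation can make the "searching too many possibilities" point concrete. In the sketch below, both the labels and the candidate "patterns" are pure coin flips, so any apparent predictive power is random by construction. All sizes and the seed are arbitrary assumptions.

```python
import random

random.seed(0)  # arbitrary seed, only for repeatability
n_train, n_test, n_patterns = 20, 1000, 1000


def random_bits(n):
    return [random.randint(0, 1) for _ in range(n)]


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Random labels: there is nothing real to learn
train_labels = random_bits(n_train)
test_labels = random_bits(n_test)

# Each candidate pattern is a random column of predictions (train part, held-out part)
patterns = [(random_bits(n_train), random_bits(n_test)) for _ in range(n_patterns)]

# Pick the pattern that looks best on the small training sample
best_train, best_test = max(patterns, key=lambda p: accuracy(p[0], train_labels))
train_acc = accuracy(best_train, train_labels)
test_acc = accuracy(best_test, test_labels)
print(f"best of {n_patterns} random patterns: train={train_acc:.2f}, held-out={test_acc:.2f}")
```

The selected pattern fits the 20 training labels far better than chance only because a thousand candidates were tried; on held-out data it should fall back to roughly coin-flip accuracy.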
Slide 32:Hypertexts and the Web
Slide 33:World Wide Web
Hypertext documents: text, links
Web: billions of documents, authored by millions of diverse people, edited by no one in particular, distributed over millions of computers, connected by a variety of media
Slide 34:History of Hypertext
Citation
Hyperlinking
Branching, non-linear discourse, nested commentary
Ramayana: one of the great epic poems of India; attributed to the sage Valmiki, it recounts the life and exploits of Lord Rama.
Mahabharata: an epic poem that recounts the struggle between the Kauravas and Pandavas over the disputed kingdom of Bharata, the ancient name for India.
Talmud: compilation of Jewish oral teachings, assembled in written form in the early centuries of the Christian era.
Dictionary, encyclopedia: self-contained networks of textual nodes joined by referential links
Slide 35:Hypertext systems
Memex, 1945 [Vannevar Bush, US President Roosevelt's science advisor]
Stands for "memory extension"
Aim: to create and help follow hyperlinks across documents
A photoelectrical-mechanical storage and computing device that could store vast amounts of information, in which a user could create links between related text and illustrations. This trail could then be stored and used for future reference. Bush believed that this associative method of information gathering was not only practical in its own right, but also closer to the way the mind orders information.
Hypertext, term coined by Ted Nelson in a 1965 paper to the ACM 20th national conference: [...] By 'hypertext' I mean nonsequential writing - text that branches and allows choice to the reader, best read at an interactive screen.
The first hypertext-based system was developed in 1967 by a team of researchers led by Dr. Andries van Dam at Brown University. The research was funded by IBM, and the first hypertext implementation, the Hypertext Editing System, ran on an IBM/360 mainframe. IBM later sold the system to the Houston Manned Spacecraft Center, which reportedly used it for the Apollo space program documentation.
Xanadu hypertext, by Ted Nelson, 1981: In the Xanadu scheme, a universal document database (docuverse) would allow addressing of any substring of any document from any other document. "This requires an even stronger addressing scheme than the Universal Resource Locators used in the World-Wide Web." [De Bra] Additionally, Xanadu would permanently keep every version of every document, thereby eliminating the possibility of a broken link. Xanadu would only maintain the current version of the document in its entirety.
Slide 39:World-wide Web
Initiated at CERN in 1989 by Tim Berners-Lee, now W3C director:
"W3 was originally developed to allow information sharing within internationally dispersed teams, and the dissemination of information by support groups. Originally aimed at the High Energy Physics community, it has spread to other areas and attracted much interest in user support, resource discovery and collaborative work areas. It is currently the most advanced information system deployed on the Internet, and embraces within its data model most information in previous networked information systems."
Slide 40:World-wide Web
GUIs: Berners-Lee (WorldWideWeb, 1990); Erwise and Viola (1992); Midas (1993); Mosaic (1993), a hypertext GUI for the X Window System
HTML: markup language for rendering hypertext
HTTP: Hypertext Transfer Protocol, for sending HTML and other data over the Internet
CERN HTTPD: server of hypertext documents
The early days of the Web: CERN HTTP traffic grows by a factor of 1000 between 1991 and 1994 (image courtesy W3C)
The early days of the Web: The number of servers grows from a few hundred to a million between 1991 and 1997 (image courtesy Nielsen)
Slide 43:1994: the landmark year
Foundation of the "Mosaic Communications Corporation" (later Netscape)
First World-Wide Web conference
MIT and CERN agreed to set up the World-wide Web Consortium (W3C).
Slide 44:The Web
A populist, participatory medium: number of writers ≈ number of readers; enables near-zero-cost dissemination of information
Abundance and authority crisis:
liberal and informal culture of content generation and dissemination
very little uniform civil code
redundancy and non-standard form and content
millions of qualifying pages for most broad queries (example: java or kayaking)
no per se authoritative information about the reliability of a site
Slide 45:Problems due to Uniform accessibility
Little support for adapting to the background of specific users
Commercial interests routinely influence the operation of Web search
Users pay for connection costs, not for content
Profit depends on ads, sales, etc.
"Search Engine Optimization"!!
Slide 46:What is Web Mining?
Discovering interesting and useful information from Web content, structure, and usage
Examples:
Web search: e.g., Google, Yahoo, MSN, Ask, ...
Specialized search: e.g., Froogle (comparison shopping), job ads (Flipdog)
eCommerce: recommendations (e.g., Netflix, Amazon); improving conversion rate (next best product to offer)
Advertising: e.g., Google AdSense
Fraud detection: click fraud detection, ...
Improving Web site design and performance
Reproduced from Ullman & Rajaraman with permission
Slide 47:How does it differ from "classical" Data Mining?
The web is not a relation: textual information and linkage structure
Usage data is huge and growing rapidly: Google's usage logs are bigger than their web crawl; data generated per day is comparable to the largest conventional data warehouses
Content and structure data rich in features and patterns: spontaneous formation and evolution of topic-induced graph clusters and hyperlink-induced communities
Ability to react in real time to usage patterns: no human in the loop
Slide 48:How big is the Web ?
Number of pages: technically, infinite, because of dynamically generated content; lots of duplication (30-40%)
Best estimate of "unique" static HTML pages comes from search engine claims: Google = 8 billion, Yahoo = 20 billion; lots of marketing hype
Slide 49:96,854,877 web sites (Sept 2006)
Source: http://news.netcraft.com/archives/web_server_survey.html
[Figure: Total Sites Across All Domains, August 1995 - September 2006]
Slide 50:The web as a graph
Pages = nodes, hyperlinks = edges
Ignore content
Directed graph
High linkage: 8-10 links/page on average
Power-law degree distribution
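A minimal sketch of this graph view, assuming a tiny hypothetical set of pages and hyperlinks: the content of each page is ignored, and only the directed edges are used to compute in-degrees, out-degrees, and the in-degree distribution.

```python
from collections import Counter

# Hypothetical hyperlinks as directed edges (source page, destination page)
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c"), ("e", "c")]

pages = sorted({page for edge in links for page in edge})
out_degree = Counter(src for src, _dst in links)
in_degree = Counter(dst for _src, dst in links)

# Average number of links per page
average_out_degree = len(links) / len(pages)

# Degree distribution: how many pages have in-degree 0, 1, 2, ...
in_degree_histogram = Counter(in_degree[page] for page in pages)

print(dict(in_degree))
print(average_out_degree)
print(sorted(in_degree_histogram.items()))
```

On a real crawl, the same histogram plotted on log-log axes is where a power-law degree distribution would show up as an approximately straight line; a six-edge toy graph is far too small to show that.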
Slide 51:Power-law degree distribution
Source: Broder et al., 2000
Slide 52:Power-laws abounding
In-degrees
Out-degrees
Number of pages per site
Number of visitors
Term distribution in pages
Query distribution in query logs
Let's take a closer look at structure: Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls
Not a "small world"
Slide 53:Bow-tie Structure
Source: Broder et al., 2000
Slide 54:Searching the Web
Content consumers
Slide 55:Ads vs. search results
Slide 56:Ads vs. search results
Search advertising is the revenue model: multi-billion-dollar industry; advertisers pay for clicks on their ads
Interesting problems:
How to pick the top 10 results for a search from 2,230,000 matching pages?
What ads to show for a search?
If I'm an advertiser, which search terms should I bid on, and how much should I bid?
Slide 57:Sidebar: What's in a name?
Geico sued Google, contending that it owned the trademark "Geico"; thus, ads for the keyword geico couldn't be sold to others
Court ruling: search engines can sell keywords including trademarks
No court ruling yet: whether the ad itself can use the trademarked word(s)
Slide 58:Extracting Structured Data
http://www.simplyhired.com
Slide 59:Extracting structured data
http://www.fatlens.com
Slide 60:The Long Tail (yet another power-law)
Source: Chris Anderson (2004)
Slide 61:The Long Tail
Shelf space is a scarce commodity for traditional retailers (also: TV networks, movie theaters, ...)
The web enables near-zero-cost dissemination of information about products
More choices necessitate better filters: recommendation engines (e.g., Amazon)
Slide 62:Major Web Mining topics
Crawling the web
Web graph analysis
Structured data extraction
Classification and vertical search
Collaborative filtering
Web advertising and optimization
Mining web logs
Systems issues
Slide 63:Web search basics
Slide 64:Search engine components
Spider (a.k.a. crawler/robot): builds the corpus. Collects web pages recursively: for each known URL, fetch the page, parse it, and extract new URLs; repeat. Additional pages come from direct submissions and other sources.
The indexer: creates inverted indexes. Various policies with respect to which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.
Query processor: serves query results. Front end: query reformulation, word stemming, capitalization, optimization of Booleans, etc. Back end: finds matching documents and ranks them.
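The three components can be sketched end to end on a toy scale. Everything below is an illustrative assumption: the "web" is a hypothetical in-memory dictionary standing in for real HTTP fetching and HTML parsing, the indexing policy is plain lowercasing with no stemming or phrase support, and the back end returns Boolean AND matches sorted by URL rather than ranked.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory web: URL -> (page text, outgoing links)
WEB = {
    "http://example.org/": ("Web mining course home",
                            ["http://example.org/crawl", "http://example.org/index"]),
    "http://example.org/crawl": ("Crawling the web graph", ["http://example.org/"]),
    "http://example.org/index": ("Inverted index for web search",
                                 ["http://example.org/crawl", "http://example.org/missing"]),
}


def crawl(seed):
    """Spider: for each known URL, fetch the page, store it, and extract new URLs."""
    corpus, frontier, seen = {}, deque([seed]), {seed}
    while frontier:
        url = frontier.popleft()
        if url not in WEB:  # simulated failed fetch
            continue
        text, links = WEB[url]
        corpus[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return corpus


def tokenize(text):
    """Indexing policy assumed here: lowercase alphanumeric terms, no stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus):
    """Indexer: inverted index mapping each term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in corpus.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Query processor: Boolean AND over the query terms, unranked."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


corpus = crawl("http://example.org/")
index = build_index(corpus)
print(search(index, "Web search"))
print(search(index, "kayaking"))
```

A real back end would also rank the matching documents; that step is deliberately left out of this sketch.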
