
Subject Name: Data Warehousing and Data Mining

This article provides an introduction to web mining and discusses web content mining, text mining, and mining of spatial and temporal databases. It covers various techniques and applications of data mining in the context of the web.

brattonn

Presentation Transcript


  1. Subject Name: Data Warehousing and Data Mining • Subject Code: 10MCA542 & IS 74 • Prepared By: V. Srikanth & Harkiran Preet • Department: MCA & IS • Date: 5/11/2014

  2. UNIT 8 WEB MINING • Introduction • Web Content Mining • Text Mining: Unstructured Text • Text Clustering and its applications • Mining Spatial and Temporal databases

  3. INTRODUCTION • The web has become an enormously important tool for communicating ideas, conducting business, and entertainment. • From its beginnings in the early 1990s, the web was estimated to have grown to almost 13 billion pages by 2011, with millions of people from all over the world accessing them every day. • The web has become the number one source of information for internet users. • Web mining is the application of data mining techniques to find interesting and potentially useful knowledge from web data. The mining process is normally expected to use the hyperlink structure of the web, the web log data, or both. • Web mining may be divided into 1. Web content mining 2. Web structure mining 3. Web usage mining.

  4. WEB CONTENT MINING • Web content mining deals with discovering useful information or knowledge from web page contents. It goes well beyond using keywords in a search engine. In contrast to web usage mining and web structure mining, web content mining focuses on the web page rather than the links. Web content is a very rich information resource consisting of many types of information, for example unstructured free text, images, audio, video, and metadata, as well as hyperlinks. Here portals are employed to find what a user might be looking for. • Web structure mining deals with discovering and modeling the link structure of the web. Work has been carried out to model the web based on the topology of the hyperlinks. • Web usage mining deals with understanding user behavior in interacting with the web or with a web site.

  5. TEXT MINING: UNSTRUCTURED TEXT • “The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data”. • An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge. • What is “previously unknown” information? • Strict definition: information that not even the writer knows, e.g., discovering a new method for hair growth that is described as a side effect of a different procedure. • Lenient definition: rediscovering the information that the author encoded in the text, e.g., automatically extracting a product’s name from a web page.

  6. TEXT MINING METHODS • Information Retrieval: indexing and retrieval of textual documents • Information Extraction: extraction of partial knowledge from the text • Web Mining: indexing and retrieval of textual documents and extraction of partial knowledge using the web • Clustering: generating collections of similar text documents

  7. TEXT MINING PROCESS

  8. TEXT MINING PROCESS • Text preprocessing: syntactic/semantic text analysis • Feature generation: bag of words • Feature selection: simple counting, statistics • Text/data mining: classification (supervised learning), clustering (unsupervised learning) • Analyzing results
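
The feature generation and feature selection steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenters' method: the stop-word list, the sample documents, and the helper names (`tokenize`, `bag_of_words`, `select_features`) are all assumptions made for the example, and "simple counting" is interpreted here as keeping words that occur in at least two documents.

```python
import re
from collections import Counter

# Hypothetical stop-word list, far smaller than any real one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Feature generation: map each non-stop word to its count."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)

def select_features(bags, min_docs=2):
    """Feature selection by simple counting: keep the words that
    occur in at least `min_docs` documents."""
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    return sorted(w for w, n in doc_freq.items() if n >= min_docs)

# Made-up sample documents.
docs = [
    "Web mining is the application of data mining to web data.",
    "Text mining extracts knowledge from text data.",
    "Spatial databases store objects in space.",
]
bags = [bag_of_words(d) for d in docs]
vocab = select_features(bags)
# One count vector per document, restricted to the selected vocabulary.
vectors = [[bag[w] for w in vocab] for bag in bags]
```

The resulting `vectors` are the kind of fixed-length numeric records that a classifier or clustering algorithm in the text/data mining step would consume.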

  9. SYNTACTIC/SEMANTIC TEXT ANALYSIS • Part-of-speech (POS) tagging • Find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun) • ~98% accurate • Word sense disambiguation • Context based or proximity based • Very accurate • Parsing • Generates a parse tree (graph) for each sentence • Each sentence is a standalone graph

  10. TEXT MINING: CLASSIFICATION DEFINITION • Given: a collection of labeled records (training set) • Each record contains a set of features (attributes) and the true class (label) • Find: a model for the class as a function of the values of the features • Goal: previously unseen records should be assigned a class as accurately as possible • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
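
The train/test workflow defined on this slide might look like the sketch below. The slide does not name a particular classifier, so a small multinomial naive Bayes with add-one smoothing is used purely as a stand-in; the toy records and the function names (`train`, `predict`, `accuracy`) are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train(records):
    """Build a naive Bayes model from (words, label) training records."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in records:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(model, words):
    """Assign the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, records):
    """Fraction of records whose predicted class equals the true class."""
    hits = sum(predict(model, words) == label for words, label in records)
    return hits / len(records)

# Made-up labeled records: (features, true class).
labeled = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
    ("buy pills now".split(), "spam"),
    ("monday project agenda".split(), "ham"),
]
# Divide the data: the training set builds the model, the test set validates it.
training_set, test_set = labeled[:4], labeled[4:]
model = train(training_set)
test_accuracy = accuracy(model, test_set)
```

Keeping the test records out of `train` is the point of the split: accuracy measured on records the model has already seen would overstate how well it handles previously unseen ones.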

  11. TEXT CLUSTERING: APPLICATIONS • Marketing: discover distinct groups of potential buyers according to a text-based user profile • e.g., Amazon • Industry: identify groups of competitors' web pages • e.g., competing products and their prices • Job seeking: identify parameters in searching for jobs • e.g., www.flipdog.com

  12. MINING SPATIAL AND TEMPORAL DATABASES • Geometric, geographic or spatial data: space-related data • Examples: geographic space (2-D abstraction of the earth's surface), VLSI design, a model of the human brain, 3-D space representing the arrangement of the chains of a protein molecule. • Spatial database systems vs. image database systems. • Image database system: handles digital raster images (e.g., satellite sensing, computed tomography); may also contain techniques for object analysis and extraction from images and some spatial database functionality. • Spatial (geometric, geographic) database system: handles objects in space that have identity and well-defined extents, locations, and relationships.

  13. GIS (Geographic Information System) • Analysis and visualization of geographic data • Common analysis functions of GIS • Search (thematic search, search by region) • Location analysis (buffer, corridor, overlay) • Terrain analysis (slope/aspect, drainage network) • Flow analysis (connectivity, shortest path) • Distribution (nearest neighbor, proximity, change detection) • Spatial analysis/statistics (pattern, centrality, similarity, topology) • Measurements (distance, perimeter, shape, adjacency, direction)

  14. SPATIAL DBMS • An SDBMS is a software system that • supports spatial data models, spatial ADTs, and a query language supporting them • supports spatial indexing, efficient spatial operations, and query optimization • can work with an underlying DBMS • Examples • Oracle Spatial Data Cartridge • ESRI Spatial Data Engine

  15. MODELLING SPATIAL OBJECTS • What needs to be represented? • Two important alternative views • Single objects: distinct entities arranged in space, each of which has its own geometric description, e.g., modeling cities, forests, rivers • Spatially related collections of objects: describe space itself (about every point in space), e.g., modeling land use, partition of a country into districts

  16. TEMPORAL DATABASES • Time-series database • Consists of sequences of values or events changing with time • Data is recorded at regular intervals • Characteristic time-series components: trend, cycle, seasonal, irregular • Applications • Financial: stock price, inflation • Industry: power consumption • Scientific: experiment results • Meteorological: precipitation

  17. TEMPORAL DATABASES • Categories of Time-Series Movements • Long-term or trend movements (trend curve): the general direction in which a time series is moving over a long interval of time • Cyclic movements or cycle variations: long-term oscillations about a trend line or curve • e.g., business cycles; may or may not be periodic • Seasonal movements or seasonal variations • i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years • Irregular or random movements • Time series analysis: decomposition of a time series into these four basic movements • Additive model: TS = T + C + S + I • Multiplicative model: TS = T × C × S × I
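
The additive model can be illustrated with a rough decomposition sketch. Several simplifications are assumed here and are not from the slides: the trend is estimated with a centered moving average over an odd `period`, the cyclic component C is not separated (it stays folded into the trend estimate), and the function names and sample series are made up.

```python
def moving_average(series, period):
    """Centered moving average for an odd `period`; None at the edges."""
    half = period // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        window = series[i - half : i + half + 1]
        trend[i] = sum(window) / period
    return trend

def decompose_additive(series, period):
    """Split a series so that series[i] = trend[i] + seasonal[i] + irregular[i]
    wherever the trend is defined. Assumes the series is long enough for
    every position within the period to have at least one trend value."""
    trend = moving_average(series, period)
    # Average the detrended values at each position within the period.
    sums = [0.0] * period
    counts = [0] * period
    for i, (y, t) in enumerate(zip(series, trend)):
        if t is not None:
            sums[i % period] += y - t
            counts[i % period] += 1
    means = [s / c for s, c in zip(sums, counts)]
    offset = sum(means) / period  # center the seasonal part around zero
    seasonal = [means[i % period] - offset for i in range(len(series))]
    irregular = [
        None if t is None else y - t - s
        for y, t, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, irregular

# Made-up series: linear trend 0..8 plus a repeating seasonal pattern (2, -1, -1).
series = [2, 0, 1, 5, 3, 4, 8, 6, 7]
trend, seasonal, irregular = decompose_additive(series, period=3)
```

For a multiplicative series, one common approach is to take logarithms first, which turns TS = T × C × S × I into a sum that the same additive procedure can handle.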

  18. TEMPORAL DATABASES • Time-sequence query language • Should be able to specify sophisticated queries like “Find all of the sequences that are similar to some sequence in class A, but not similar to any sequence in class B” • Should be able to support various kinds of queries: range queries, all-pair queries, and nearest neighbor queries • Shape definition language • Allows users to define and query the overall shape of time sequences • Uses a human-readable series of sequence transitions or macros • Ignores the specific details • e.g., the pattern up, Up, UP can be used to describe increasing degrees of rising slopes • Macros: spike, valley, etc.
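
The idea behind the up, Up, UP pattern can be sketched as below. This is not the syntax of any actual shape definition language; the thresholds `small` and `large`, the symbol names for falling slopes, and the `find_spikes` macro are hypothetical choices made only to show how a series can be reduced to a sequence of transitions that ignores specific values.

```python
def slope_symbol(delta, small=1.0, large=5.0):
    """Map the change between two consecutive values to a symbol.
    The thresholds `small` and `large` are illustrative assumptions."""
    if delta == 0:
        return "flat"
    word = "up" if delta > 0 else "down"
    magnitude = abs(delta)
    if magnitude > large:
        return word.upper()       # steep slope: UP / DOWN
    if magnitude > small:
        return word.capitalize()  # moderate slope: Up / Down
    return word                   # gentle slope: up / down

def encode(series):
    """Describe a series as its sequence of transitions."""
    return [slope_symbol(b - a) for a, b in zip(series, series[1:])]

def find_spikes(symbols):
    """Hypothetical 'spike' macro: any rise immediately followed by any fall.
    Returns the index of the rising transition of each spike."""
    return [
        i for i in range(len(symbols) - 1)
        if symbols[i].lower() == "up" and symbols[i + 1].lower() == "down"
    ]

# Made-up series with increasingly steep rises, then a sharp drop.
series = [10, 10.5, 13, 20, 12, 12]
symbols = encode(series)      # ['up', 'Up', 'UP', 'DOWN', 'flat']
spikes = find_spikes(symbols) # [2]
```

Because queries run against the symbol sequence rather than the raw values, two series with very different magnitudes can still match the same shape.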

  19. SIMILAR TIME SERIES ANALYSIS

  20. MINING SPATIO-TEMPORAL DATA • Spatiotemporal data • Data has spatial extensions and changes with time • e.g., forest fires, moving objects, hurricanes and earthquakes • Automatic anomaly detection in massive moving objects • Moving objects are ubiquitous: GPS, radar, etc. • e.g., maritime vessel surveillance • Problem: automatic anomaly detection
