advanced topics in data mining web mining l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Advanced Topics in Data Mining: Web Mining PowerPoint Presentation
Download Presentation
Advanced Topics in Data Mining: Web Mining

Loading in 2 Seconds...

play fullscreen
1 / 66

Advanced Topics in Data Mining: Web Mining - PowerPoint PPT Presentation


  • 360 Views
  • Uploaded on

Advanced Topics in Data Mining: Web Mining. Web Mining. Web Mining. Applications are ported to the Web at rapid pace On-line services, such as America Online (AOL), and CompuServe (merged to AOL), are anxious to know user access patterns; not just “search” in the Web How Amazon does it?

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Advanced Topics in Data Mining: Web Mining' - daniel_millan


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
web mining
Web Mining
  • Applications are ported to the Web at rapid pace
  • On-line services, such as America Online (AOL), and CompuServe (merged to AOL), are anxious to know user access patterns; not just “search” in the Web
  • How Amazon does it?
  • Understanding Web user behavior is important
    • It can improve Web page organization
    • It can increase Web server performance
    • It can exploit Web advertising
    • It can increase business opportunity
amazon web page
Amazon Web Page

Association Rules

more information desired
More Information Desired
  • Collect statistical information (page hits) only, which is insufficient since:
    • The hit frequency of a page depends not only on its content but also on its location
    • The number of users accessing a page is not available
    • Information on what pages accessed together is not available
  • Data mining in the Web (Web Mining)
    • Web Access Pattern Collection
    • Web User Pattern Mining
web access pattern collection
Web Access Pattern Collection
  • Server-Based Data Collection
    • Who are visiting a given Web site and what are they doing
  • Agent-Based Data Collection
    • What are the Web sites a particular user has visited?
server based data collection
Server-Based Data Collection
  • Examine the logs collected by HTTPd
    • Access Log (IP, Time, Access Data), Referred Log (AB), Error Log, …
    • We can combining some of them for our use if necessary
  • Problems
    • The use of proxy servers
    • The effect of caching
access log
Access Log

IP/Domain Name

Time

Access Data

referred log
Referred Log

不考慮Caching的問題

server based data collection11
Server-Based Data Collection
  • Have to be done in accordance with technology advances
    • The use of Active Server Pages (Session ID available)
      • The use of proxy servers
      • The effect of caching
    • HTTPd 1.1
  • Limitation
    • Can only capture the user behavior when they are within this site
agent based data collection
Agent-Based Data Collection
  • Understanding individual Web behavior needs client-based data collection
  • Results are useful
    • Better Personalized Service
    • Improved Web Page Organization
    • Better Pricing Policies
  • Methods
    • Applets can only read/write files in their source servers
      • a big security constraint
    • Using Active Components (ActiveX Control) and PlugIns
      • APCS (Access Pattern Collection Server)
agent based data collection18
Agent-Based Data Collection
  • Very difficult to do for non-registered users in the current Web environment
    • We have to be conducted with users’ consent
  • Very dependent upon available Web technologies
web user pattern mining
Web User Pattern Mining
  • Web user pattern mining is to discover user access patterns in Web servers
  • Pattern discovery and analysis tools
    • Some existing Web tools provide mechanisms for reporting user activity in the servers
    • Web Trends (http://www.webtrends.com.tw/)
    • Open Market (http://www.openmarket.com/)
    • Net.Genesis (http://www.netgen.com/)
path traversal patterns mining
Path Traversal Patterns Mining
  • Mining path traversal patterns in a distributed information providing environment (WWW) where documents or objects are linked together (via hyperlinks) to facilitate interactive access
  • Solution procedure consists of three steps:
    • Convert the original sequence of log data into a set of maximal forward references (MF)
      • Filter out the effect of some backward references
        • Mainly made for ease of traveling and concentrate on mining meaningful user access sequences
        • Some objects are visited because of their locations rather than their content
    • Determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained
    • Determine the maximal reference sequences from large reference sequences (Trivial)
step1 mf references
Step1: MF References
  • Suppose the traversal log contains the following traversal path for a user:
    • A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V

When backward references

occur, a forward reference

path terminate.

The set of maximal forward references is {ABCD, ABEGH, ABEGW, AOU, AOV}

slide24

Step1:Database Reduction

Database Reduction

step2 find frequent reference sequences
Step2: Find Frequent Reference Sequences
  • Two algorithms for finding Frequent Traversal Patterns (Frequent Reference Sequences, Frequent Consecutive Subsequences)
    • Full-Scan (FS) Algorithm
      • FS utilizes key ideas of the DHP algorithm
    • Selective-Scan (SS) Algorithm
      • SS reduces the number of database scans
slide26

Full-Scan (FS) Algorithm

Generate L1 & Hash Table

Scan

DB-1

slide27

Generate L1 & Hash Table

Scan

DB-1

h(x,y) = [ ( order of x ) * 23 + ( order of y ) ] mod 17

wap mine algorithm
WAP-Mine Algorithm
  • The key consideration is how to facilitate the tedious support counting and candidate generating operations in the mining procedure
  • Given WebAccess Sequencedatabase WAS and a support threshold , mine the complete set of -patterns of WAS

WAS

wap mine algorithm36
WAP-Mine Algorithm

(1)Scan WAS once,find all frequent-1 events

(2)Scan WAS again,construct a WAP-tree

(3)Recursively mine the WAP-tree using conditional search

Access patterns

wap t ree c onstruction
WAP-Tree Construction
  • Using frequent events to register all count information for further mining
mining w eb a ccess p atterns from wap t ree
Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on c

Generate Web Access Patterns: ac, bc

mining w eb a ccess p atterns from wap t ree40
Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on ac

Generate Web Access Patterns: aac, bac

mining w eb a ccess p atterns from wap t ree41
Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on bac

Generate Web Access Patterns: abac

mining w eb a ccess p atterns from wap t ree42
Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on abac

No Web Access Patterns are Generated

mining for web transactions
Mining for Web Transactions
  • To capture Web customer buying behavior
    • It is not just market basket transaction for the set of items bought by a customer in a single purchase (Association Rules)
    • It is not just Web user travel patterns (Path Traversal Patterns)
    • It is an extension from path traversal patterns
  • Exploring the relationship between traveling and buying
mining for web transactions44
Mining for Web Transactions

Web Transaction

Algorithm WR (Web-transaction-Record)

Web Transaction Records <Path: a Set of Purchases>

Algorithm WTM, MTSPJ,MTSPC

Frequent Transaction Patterns

Web Transaction Association Rules

mining for web transactions45
Mining for Web Transactions
  • Web-transaction-Record (WR) Algorithm
    • Extract meaningful Web transaction records from the given Web transaction
  • WTM (Web Transaction Mining)Algorithm
    • Mining Web Transaction Patterns
  • MTS (Maximal Transaction Segment) Algorithms are the improvement versions of WTM
wtm algorithm
WTM Algorithm
  • It joins the purchased itemsets for generating candidate transaction patterns
  • WTM employs a two-level hash tree, called Web transaction tree, to store candidate transaction patterns
    • WTM hashes not only each item but also each purchase in the path
wtm algorithm49

DATABASE

Web Transaction

WT_ID

Path

Purchase

100

ABCE

B{i1}, C{i2}, E{i4}

ABFGH

B{i1}, H{i6}

ASJL

S{i7}, L{i9}

200

ABCE

B{i1}, C{i2}, E{i4}

ASJLQ

S{i7}, Q{i10}

300

ABCE

B{i1}, E{i4}

ABFG

B{i1}, G{i5}

ASJL

S{i7}, J{i8}, L{i9}

400

ABD

D{i3}

ABFG

G{i5}

ASJLQ

S{i7}, J{i8}, Q{i10}

WTM Algorithm
wtm algorithm51
WTM Algorithm

Support Count >= 2

C1

T1

Path

Purchase

Sup.

AB

B{i1}

3

ABC

C{i2}

2

ABD

D{i3}

1

ABCE

E{i4}

3

ABFG

G{i5}

2

ABFGH

H{i6}

1

AS

S{i7}

4

ASJ

J{i8}

2

ASJL

L{i9}

2

ASJLQ

Q{i10}

2

wtm algorithm52

T2

C2

Path

Purchase

Sup.

ABC

B{i1} C{i2}

2

Path

Purchase

Sup.

ABCE

B{i1} E{i4}

3

ABC

B{i1} C{i2}

2

ABCE

C{i2} E{i4}

2

ABCE

B{i1} E{i4}

3

ASJ

S{i7} J{i8}

2

ASJL

S{i7} L{i9}

2

AS

B{i1} S{i7}

0

ASJLQ

S{i7} Q{i10}

2

ASJ

B{i1} J{i8}

0

ASJLQ

J{i8} Q{i10}

1

ASJLQ

L{i9} Q{i10}

0

WTM Algorithm

Support Count >= 2

共28個

wtm algorithm53
WTM Algorithm

Support Count >= 2

C3

Path

Purchase

Sup.

ABCE

B{i1} C{i2} E{i4}

2

T3

Path

Purchase

Sup.

ABCE

B{i1} C{i2} E{i4}

2

wtm disadvantages
WTM Disadvantages
  • WTM may generate a lot of unqualified candidate transaction patterns without utilizing the paths of frequent transaction patterns
  • This will degrade the performance
mts pj algorithm
MTSPJAlgorithm
  • Algorithm MTSPJ uses maximal transaction segment that contains frequent transaction patterns and the maximal path, to solve the unqualified candidate transaction pattern problem
  • MTSPJ generalizescandidate transaction patterns only when the leaf node of the Web transaction tree is reached
mts pj algorithm56

DATABASE

A

Web Transaction

WT_ID

Path

Purchase

B

S

100

ABCE

B{i1}, C{i2}, E{i4}

ABFGH

B{i1}, H{i6}

C

F

D

J

ASJL

S{i7}, L{i9}

G

L

200

ABCE

B{i1}, C{i2}, E{i4}

E

ASJLQ

S{i7}, Q{i10}

H

Q

300

ABCE

B{i1}, E{i4}

ABFG

B{i1}, G{i5}

ASJL

S{i7}, J{i8}, L{i9}

400

ABD

D{i3}

ABFG

G{i5}

ASJLQ

S{i7}, J{i8}, Q{i10}

MTSPJAlgorithm
mts pj algorithm57

A

B

S

J

C

F

L

E

G

Q

MTSPJAlgorithm

Support Count >= 2

C1

T1

Path

Purchase

Sup.

AB

B{i1}

3

ABC

C{i2}

2

ABCD

D{i3}

1

ABCE

E{i4}

3

ABFG

G{i5}

2

ABFGH

H{i6}

1

AS

S{i7}

4

ASJ

J{i8}

2

ASJL

L{i9}

2

ASJLQ

Q{i10}

2

mts pj algorithm58

A

B

S

J

C

F

L

E

G

Q

Maximal Transaction Segment

Maximal Transaction Segment

ABCE

B{i1} C{i2} E{i4}

ABFG

B{i1} G{i5}

C2

Path

Purchase

Sup.

C2

Maximal Transaction Segment

ABC

B{i1} C{i2}

Path

Purchase

Sup.

ASJLQ

S{i7} J{i8} L{i9} Q{i10}

ABCE

B{i1} E{i4}

ABFG

B{i1} G{i5}

ABCE

C{i2} E{i4}

MTSPJAlgorithm

C2

Path

Purchase

Sup.

ASJ

S{i7} J{i8}

2

2

1

2

1

0

ASJL

S{i7} L{i9}

ASJL

J{i8} L{i9}

ASJLQ

S{i7} Q{i10}

ASJLQ

J{i8} Q{i10}

ASJLQ

L{i9} Q{i10}

2

3

2

1

mts pj algorithm60

A

B

S

J

C

L

E

Q

Maximal Transaction Segment

ABCE

B{i1} C{i2} E{i4}

C3

Path

Purchase

Sup.

ABCE

2

B{i1} C{i2} E{i4}

MTSPJAlgorithm
mts pc algorithm

A

B

S

J

C

F

L

E

G

Q

MTSPCAlgorithm

MTSPC utilizes the LC (Large Count) to Filter Candidates

Support Count >= 2

C1

T1

Path

Purchase

Sup.

AB

B{i1}

3

ABC

C{i2}

2

ABCD

D{i3}

1

ABCE

E{i4}

3

ABFG

G{i5}

2

ABFGH

H{i6}

1

AS

S{i7}

4

ASJ

J{i8}

2

ASJL

L{i9}

2

ASJLQ

Q{i10}

2

mts pc algorithm62

A

K=1

Maximal Transaction Segment

B

S

Maximal Path

Item

LC

ABCE

B{i1}

1

J

C

F

C{i2}

1

E{i4}

1

L

E

G

Q

C2

Path

Purchase

Sup.

ABC

B{i1} C{i2}

2

C2

ABCE

B{i1} E{i4}

3

Path

Purchase

Sup.

ABCE

C{i2} E{i4}

2

ABFG

B{i1} G{i5}

1

MTSPCAlgorithm

|I| = 4 > 1

|I| = 3 > 1 (K-1)

C2

Path

Purchase

Sup.

ASJ

S{i7} J{i8}

2

ASJL

S{i7} L{i9}

2

ASJL

J{i8} L{i9}

1

|I| = 2 > 1

ASJLQ

S{i7} Q{i10}

2

ASJLQ

J{i8} Q{i10}

1

ASJLQ

L{i9} Q{i10}

0

mts pc algorithm64

A

K=2

B

S

Maximal Transaction Segment

Maximal Path

Item

LC

J

ABCE

B{i1}

2

C

C{i2}

2

L

E{i4}

2

E

Q

C3

Path

Purchase

ABCE

B{i1} C{i2} E{i4}

MTSPCAlgorithm

T2

|I| = 3 > 2

|I| = 1 < 2

No Generations

mining for web transactions65
Mining for Web Transactions
  • <ABCE : B{1}, E{4}> = 2
  • <AB : B{1}> = 3
  • We can derive <ABCE : B{1} => E{4}>
    • support_count(<ABCE : B{1} => E{4}>) = 2
    • confidence(<ABCE : B{1} => E{4}>) =
summary
Summary
  • Data mining in the Web is an area of growing importance
    • In particular, the emerging of EC
    • More and more applications will benefit from the knowledge from data mining
  • Web Mining = Web Data Collection + Traditional Data Mining?
  • Important Issues
    • Incremental Web Mining