Advanced topics in data mining web mining
Download
1 / 66

Advanced Topics in Data Mining: - PowerPoint PPT Presentation


  • 344 Views
  • Updated On :

Advanced Topics in Data Mining: Web Mining. Web Mining. Web Mining. Applications are ported to the Web at rapid pace On-line services, such as America Online (AOL), and CompuServe (merged to AOL), are anxious to know user access patterns; not just “search” in the Web How Amazon does it?

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Advanced Topics in Data Mining:' - daniel_millan


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript


Web mining l.jpg
Web Mining

  • Applications are ported to the Web at rapid pace

  • On-line services, such as America Online (AOL), and CompuServe (merged to AOL), are anxious to know user access patterns; not just “search” in the Web

  • How Amazon does it?

  • Understanding Web user behavior is important

    • It can improve Web page organization

    • It can increase Web server performance

    • It can exploit Web advertising

    • It can increase business opportunity


Amazon web page l.jpg
Amazon Web Page

Association Rules


More information desired l.jpg
More Information Desired

  • Collect statistical information (page hits) only, which is insufficient since:

    • The hit frequency of a page depends not only on its content but also on its location

    • The number of users accessing a page is not available

    • Information on what pages accessed together is not available

  • Data mining in the Web (Web Mining)

    • Web Access Pattern Collection

    • Web User Pattern Mining


Web access pattern collection l.jpg
Web Access Pattern Collection

  • Server-Based Data Collection

    • Who are visiting a given Web site and what are they doing

  • Agent-Based Data Collection

    • What are the Web sites a particular user has visited?


Server based data collection l.jpg
Server-Based Data Collection

  • Examine the logs collected by HTTPd

    • Access Log (IP, Time, Access Data), Referred Log (AB), Error Log, …

    • We can combining some of them for our use if necessary

  • Problems

    • The use of proxy servers

    • The effect of caching



Access log l.jpg
Access Log

IP/Domain Name

Time

Access Data


Referred log l.jpg
Referred Log

不考慮Caching的問題


Server based data collection11 l.jpg
Server-Based Data Collection

  • Have to be done in accordance with technology advances

    • The use of Active Server Pages (Session ID available)

      • The use of proxy servers

      • The effect of caching

    • HTTPd 1.1

  • Limitation

    • Can only capture the user behavior when they are within this site


Agent based data collection l.jpg
Agent-Based Data Collection

  • Understanding individual Web behavior needs client-based data collection

  • Results are useful

    • Better Personalized Service

    • Improved Web Page Organization

    • Better Pricing Policies

  • Methods

    • Applets can only read/write files in their source servers

      • a big security constraint

    • Using Active Components (ActiveX Control) and PlugIns

      • APCS (Access Pattern Collection Server)







Agent based data collection18 l.jpg
Agent-Based Data Collection

  • Very difficult to do for non-registered users in the current Web environment

    • We have to be conducted with users’ consent

  • Very dependent upon available Web technologies


Web user pattern mining l.jpg
Web User Pattern Mining

  • Web user pattern mining is to discover user access patterns in Web servers

  • Pattern discovery and analysis tools

    • Some existing Web tools provide mechanisms for reporting user activity in the servers

    • Web Trends (http://www.webtrends.com.tw/)

    • Open Market (http://www.openmarket.com/)

    • Net.Genesis (http://www.netgen.com/)


Path traversal patterns mining l.jpg
Path Traversal Patterns Mining

  • Mining path traversal patterns in a distributed information providing environment (WWW) where documents or objects are linked together (via hyperlinks) to facilitate interactive access

  • Solution procedure consists of three steps:

    • Convert the original sequence of log data into a set of maximal forward references (MF)

      • Filter out the effect of some backward references

        • Mainly made for ease of traveling and concentrate on mining meaningful user access sequences

        • Some objects are visited because of their locations rather than their content

    • Determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained

    • Determine the maximal reference sequences from large reference sequences (Trivial)


Step1 mf references l.jpg
Step1: MF References

  • Suppose the traversal log contains the following traversal path for a user:

    • A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V

When backward references

occur, a forward reference

path terminate.

The set of maximal forward references is {ABCD, ABEGH, ABEGW, AOU, AOV}



Slide23 l.jpg

Step1:Arrange Database

Encoding


Slide24 l.jpg

Step1:Database Reduction

Database Reduction


Step2 find frequent reference sequences l.jpg
Step2: Find Frequent Reference Sequences

  • Two algorithms for finding Frequent Traversal Patterns (Frequent Reference Sequences, Frequent Consecutive Subsequences)

    • Full-Scan (FS) Algorithm

      • FS utilizes key ideas of the DHP algorithm

    • Selective-Scan (SS) Algorithm

      • SS reduces the number of database scans


Slide26 l.jpg

Full-Scan (FS) Algorithm

Generate L1 & Hash Table

Scan

DB-1


Slide27 l.jpg

Generate L1 & Hash Table

Scan

DB-1

h(x,y) = [ ( order of x ) * 23 + ( order of y ) ] mod 17








Slide34 l.jpg

Step 3: Generate FrequentTraversal Patterns

Maximal Reference Sequences


Wap mine algorithm l.jpg
WAP-Mine Algorithm

  • The key consideration is how to facilitate the tedious support counting and candidate generating operations in the mining procedure

  • Given WebAccess Sequencedatabase WAS and a support threshold , mine the complete set of -patterns of WAS

WAS


Wap mine algorithm36 l.jpg
WAP-Mine Algorithm

(1)Scan WAS once,find all frequent-1 events

(2)Scan WAS again,construct a WAP-tree

(3)Recursively mine the WAP-tree using conditional search

Access patterns



Wap t ree c onstruction l.jpg
WAP-Tree Construction

  • Using frequent events to register all count information for further mining


Mining w eb a ccess p atterns from wap t ree l.jpg
Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on c

Generate Web Access Patterns: ac, bc


Mining w eb a ccess p atterns from wap t ree40 l.jpg
Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on ac

Generate Web Access Patterns: aac, bac


Mining w eb a ccess p atterns from wap t ree41 l.jpg
Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on bac

Generate Web Access Patterns: abac


Mining w eb a ccess p atterns from wap t ree42 l.jpg
Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on abac

No Web Access Patterns are Generated


Mining for web transactions l.jpg
Mining for Web Transactions

  • To capture Web customer buying behavior

    • It is not just market basket transaction for the set of items bought by a customer in a single purchase (Association Rules)

    • It is not just Web user travel patterns (Path Traversal Patterns)

    • It is an extension from path traversal patterns

  • Exploring the relationship between traveling and buying


Mining for web transactions44 l.jpg
Mining for Web Transactions

Web Transaction

Algorithm WR (Web-transaction-Record)

Web Transaction Records <Path: a Set of Purchases>

Algorithm WTM, MTSPJ,MTSPC

Frequent Transaction Patterns

Web Transaction Association Rules


Mining for web transactions45 l.jpg
Mining for Web Transactions

  • Web-transaction-Record (WR) Algorithm

    • Extract meaningful Web transaction records from the given Web transaction

  • WTM (Web Transaction Mining)Algorithm

    • Mining Web Transaction Patterns

  • MTS (Maximal Transaction Segment) Algorithms are the improvement versions of WTM




Wtm algorithm l.jpg
WTM Algorithm

  • It joins the purchased itemsets for generating candidate transaction patterns

  • WTM employs a two-level hash tree, called Web transaction tree, to store candidate transaction patterns

    • WTM hashes not only each item but also each purchase in the path


Wtm algorithm49 l.jpg

DATABASE

Web Transaction

WT_ID

Path

Purchase

100

ABCE

B{i1}, C{i2}, E{i4}

ABFGH

B{i1}, H{i6}

ASJL

S{i7}, L{i9}

200

ABCE

B{i1}, C{i2}, E{i4}

ASJLQ

S{i7}, Q{i10}

300

ABCE

B{i1}, E{i4}

ABFG

B{i1}, G{i5}

ASJL

S{i7}, J{i8}, L{i9}

400

ABD

D{i3}

ABFG

G{i5}

ASJLQ

S{i7}, J{i8}, Q{i10}

WTM Algorithm



Wtm algorithm51 l.jpg
WTM Algorithm

Support Count >= 2

C1

T1

Path

Purchase

Sup.

AB

B{i1}

3

ABC

C{i2}

2

ABD

D{i3}

1

ABCE

E{i4}

3

ABFG

G{i5}

2

ABFGH

H{i6}

1

AS

S{i7}

4

ASJ

J{i8}

2

ASJL

L{i9}

2

ASJLQ

Q{i10}

2


Wtm algorithm52 l.jpg

T2

C2

Path

Purchase

Sup.

ABC

B{i1} C{i2}

2

Path

Purchase

Sup.

ABCE

B{i1} E{i4}

3

ABC

B{i1} C{i2}

2

ABCE

C{i2} E{i4}

2

ABCE

B{i1} E{i4}

3

ASJ

S{i7} J{i8}

2

ASJL

S{i7} L{i9}

2

AS

B{i1} S{i7}

0

ASJLQ

S{i7} Q{i10}

2

ASJ

B{i1} J{i8}

0

ASJLQ

J{i8} Q{i10}

1

ASJLQ

L{i9} Q{i10}

0

WTM Algorithm

Support Count >= 2

共28個


Wtm algorithm53 l.jpg
WTM Algorithm

Support Count >= 2

C3

Path

Purchase

Sup.

ABCE

B{i1} C{i2} E{i4}

2

T3

Path

Purchase

Sup.

ABCE

B{i1} C{i2} E{i4}

2


Wtm disadvantages l.jpg
WTM Disadvantages

  • WTM may generate a lot of unqualified candidate transaction patterns without utilizing the paths of frequent transaction patterns

  • This will degrade the performance


Mts pj algorithm l.jpg
MTSPJAlgorithm

  • Algorithm MTSPJ uses maximal transaction segment that contains frequent transaction patterns and the maximal path, to solve the unqualified candidate transaction pattern problem

  • MTSPJ generalizescandidate transaction patterns only when the leaf node of the Web transaction tree is reached


Mts pj algorithm56 l.jpg

DATABASE

A

Web Transaction

WT_ID

Path

Purchase

B

S

100

ABCE

B{i1}, C{i2}, E{i4}

ABFGH

B{i1}, H{i6}

C

F

D

J

ASJL

S{i7}, L{i9}

G

L

200

ABCE

B{i1}, C{i2}, E{i4}

E

ASJLQ

S{i7}, Q{i10}

H

Q

300

ABCE

B{i1}, E{i4}

ABFG

B{i1}, G{i5}

ASJL

S{i7}, J{i8}, L{i9}

400

ABD

D{i3}

ABFG

G{i5}

ASJLQ

S{i7}, J{i8}, Q{i10}

MTSPJAlgorithm


Mts pj algorithm57 l.jpg

A

B

S

J

C

F

L

E

G

Q

MTSPJAlgorithm

Support Count >= 2

C1

T1

Path

Purchase

Sup.

AB

B{i1}

3

ABC

C{i2}

2

ABCD

D{i3}

1

ABCE

E{i4}

3

ABFG

G{i5}

2

ABFGH

H{i6}

1

AS

S{i7}

4

ASJ

J{i8}

2

ASJL

L{i9}

2

ASJLQ

Q{i10}

2


Mts pj algorithm58 l.jpg

A

B

S

J

C

F

L

E

G

Q

Maximal Transaction Segment

Maximal Transaction Segment

ABCE

B{i1} C{i2} E{i4}

ABFG

B{i1} G{i5}

C2

Path

Purchase

Sup.

C2

Maximal Transaction Segment

ABC

B{i1} C{i2}

Path

Purchase

Sup.

ASJLQ

S{i7} J{i8} L{i9} Q{i10}

ABCE

B{i1} E{i4}

ABFG

B{i1} G{i5}

ABCE

C{i2} E{i4}

MTSPJAlgorithm

C2

Path

Purchase

Sup.

ASJ

S{i7} J{i8}

2

2

1

2

1

0

ASJL

S{i7} L{i9}

ASJL

J{i8} L{i9}

ASJLQ

S{i7} Q{i10}

ASJLQ

J{i8} Q{i10}

ASJLQ

L{i9} Q{i10}

2

3

2

1


Mts pj algorithm59 l.jpg
MTSPJAlgorithm

C2

T2


Mts pj algorithm60 l.jpg

A

B

S

J

C

L

E

Q

Maximal Transaction Segment

ABCE

B{i1} C{i2} E{i4}

C3

Path

Purchase

Sup.

ABCE

2

B{i1} C{i2} E{i4}

MTSPJAlgorithm


Mts pc algorithm l.jpg

A

B

S

J

C

F

L

E

G

Q

MTSPCAlgorithm

MTSPC utilizes the LC (Large Count) to Filter Candidates

Support Count >= 2

C1

T1

Path

Purchase

Sup.

AB

B{i1}

3

ABC

C{i2}

2

ABCD

D{i3}

1

ABCE

E{i4}

3

ABFG

G{i5}

2

ABFGH

H{i6}

1

AS

S{i7}

4

ASJ

J{i8}

2

ASJL

L{i9}

2

ASJLQ

Q{i10}

2


Mts pc algorithm62 l.jpg

A

K=1

Maximal Transaction Segment

B

S

Maximal Path

Item

LC

ABCE

B{i1}

1

J

C

F

C{i2}

1

E{i4}

1

L

E

G

Q

C2

Path

Purchase

Sup.

ABC

B{i1} C{i2}

2

C2

ABCE

B{i1} E{i4}

3

Path

Purchase

Sup.

ABCE

C{i2} E{i4}

2

ABFG

B{i1} G{i5}

1

MTSPCAlgorithm

|I| = 4 > 1

|I| = 3 > 1 (K-1)

C2

Path

Purchase

Sup.

ASJ

S{i7} J{i8}

2

ASJL

S{i7} L{i9}

2

ASJL

J{i8} L{i9}

1

|I| = 2 > 1

ASJLQ

S{i7} Q{i10}

2

ASJLQ

J{i8} Q{i10}

1

ASJLQ

L{i9} Q{i10}

0


Mts pc algorithm63 l.jpg
MTSPCAlgorithm

C2

T2


Mts pc algorithm64 l.jpg

A

K=2

B

S

Maximal Transaction Segment

Maximal Path

Item

LC

J

ABCE

B{i1}

2

C

C{i2}

2

L

E{i4}

2

E

Q

C3

Path

Purchase

ABCE

B{i1} C{i2} E{i4}

MTSPCAlgorithm

T2

|I| = 3 > 2

|I| = 1 < 2

No Generations


Mining for web transactions65 l.jpg
Mining for Web Transactions

  • <ABCE : B{1}, E{4}> = 2

  • <AB : B{1}> = 3

  • We can derive <ABCE : B{1} => E{4}>

    • support_count(<ABCE : B{1} => E{4}>) = 2

    • confidence(<ABCE : B{1} => E{4}>) =


Summary l.jpg
Summary

  • Data mining in the Web is an area of growing importance

    • In particular, the emerging of EC

    • More and more applications will benefit from the knowledge from data mining

  • Web Mining = Web Data Collection + Traditional Data Mining?

  • Important Issues

    • Incremental Web Mining


ad