Apriori Algorithm and the World Wide Web

Roger G. Doss

CIS 734


The purpose of this presentation is to introduce an application of the Apriori algorithm to perform association rule data mining on data gathered from the World Wide Web.

Specifically, a system will be designed that gathers information (web pages) from a user-specified site, and performs association rule data mining on that data.


We have already seen the Apriori algorithm applied to textual data in class.

Given an implementation that can work with textual data...


What we want to do is use Apriori in the following manner.

Given an input of:

(url, N links, support, confidence, keywords)

* obtain the url
* traverse all adjacent links up to N
* format the data
* compute support and confidence levels for each word in a user-supplied keyword set.
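
The input tuple above could be carried through the system as a small record. The following is only a minimal sketch, not the presentation's implementation: the names `MiningRequest` and `is_valid` are hypothetical, and treating support and confidence as fractions between 0 and 1 is an assumption.

```cpp
#include <set>
#include <string>

// Hypothetical record for the input tuple
// (url, N links, support, confidence, keywords).
struct MiningRequest {
    std::string url;                 // starting page
    int max_links;                   // N: how many links to traverse
    double min_support;              // assumed to be a fraction in [0, 1]
    double min_confidence;           // assumed to be a fraction in [0, 1]
    std::set<std::string> keywords;  // user-supplied keyword set
};

// Rejects requests that the later phases could not act on.
bool is_valid(const MiningRequest& r) {
    return !r.url.empty() && r.max_links >= 0 &&
           r.min_support >= 0.0 && r.min_support <= 1.0 &&
           r.min_confidence >= 0.0 && r.min_confidence <= 1.0 &&
           !r.keywords.empty();
}
```

Validating the request in Phase 0 keeps the acquisition and mining phases free of input checks.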


We can envision this system as four components, one per phase:

Phase 0: User input.
Phase 1: Data Acquisition.
Phase 2: Running Apriori on the data.
Phase 3: User output.


Data Acquisition:

Traverse Web Page(URL, N)
    |
while N web pages not visited
    |
Obtain web page via HTTP
    |
Parse information (look for keywords, adjacent links)
    |
Store keywords in a file
Store adjacent links to visit


Running Apriori on the Data:

If we treat the initial web page and each adjacent web page as a transaction, then each occurrence of a keyword is an element in that transaction.

At this point, the Apriori algorithm can be run on the data, producing a set of Association Rules based on desired Confidence and Support levels.
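
To make the two measures concrete, here is a minimal sketch of how support and confidence could be computed once each page has been reduced to a transaction. It is not the full Apriori algorithm (there is no candidate generation or pruning), and it is not taken from the presentation: the names `Itemset`, `support` and `confidence` are illustrative, and storing each page as a set of keywords assumes that repeated occurrences of a keyword on one page count once.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// One transaction per web page: the set of keywords found on it.
typedef std::set<std::string> Itemset;

// Fraction of transactions that contain every item in `items`.
double support(const std::vector<Itemset>& pages, const Itemset& items) {
    if (pages.empty()) return 0.0;
    int hits = 0;
    for (const Itemset& page : pages) {
        // std::includes requires sorted ranges; std::set iterates in order.
        if (std::includes(page.begin(), page.end(), items.begin(), items.end()))
            ++hits;
    }
    return static_cast<double>(hits) / pages.size();
}

// Confidence of the rule lhs -> rhs: support(lhs U rhs) / support(lhs).
double confidence(const std::vector<Itemset>& pages,
                  const Itemset& lhs, const Itemset& rhs) {
    double base = support(pages, lhs);
    if (base == 0.0) return 0.0;   // rule is undefined; report 0 here
    Itemset both = lhs;
    both.insert(rhs.begin(), rhs.end());
    return support(pages, both) / base;
}
```

A rule such as "data -> mining" would then be kept only if both values meet the thresholds supplied by the user.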


Some modules that may be needed to implement the system:

* HTTP Client.
  Accessing a web page from a URL mechanically.
* Data Cleaning.
  Extracting words that match the keyword list.
  Extracting hypertext references, i.e., href="http://www.njit.edu".
* Apriori Algorithm.
* Web traversal.
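
The data cleaning module could look roughly like the sketch below. It is an illustration under stated assumptions, not the presentation's code: `extract_hrefs` and `match_keywords` are hypothetical names, the regular expression only handles double-quoted href attributes and is a rough stand-in for a real HTML parser, and the keyword list is assumed to be lower-case.

```cpp
#include <cctype>
#include <regex>
#include <set>
#include <string>
#include <vector>

// Collects the values of double-quoted href attributes, e.g.
// href="http://www.njit.edu". Attribute names are matched case-insensitively.
std::vector<std::string> extract_hrefs(const std::string& html) {
    static const std::regex href_re("href\\s*=\\s*\"([^\"]*)\"",
                                    std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), href_re), end;
         it != end; ++it) {
        links.push_back((*it)[1].str());   // capture group 1 is the URL
    }
    return links;
}

// Returns the keywords that occur as whole words in the text. Words are
// runs of letters and digits, lower-cased before the lookup. Markup is not
// stripped, so tag and attribute names are scanned as words too.
std::set<std::string> match_keywords(const std::string& text,
                                     const std::set<std::string>& keywords) {
    std::set<std::string> found;
    std::string word;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        // A trailing space is simulated so the last word is flushed.
        unsigned char c = i < text.size() ? text[i] : ' ';
        if (std::isalnum(c)) {
            word += static_cast<char>(std::tolower(c));
        } else if (!word.empty()) {
            if (keywords.count(word)) found.insert(word);
            word.clear();
        }
    }
    return found;
}
```

The links feed the web traversal module, and the matched keywords form the transaction for the page.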


Building this system allows one to have a code base that can be used for future research and work. An HTTP client is needed to obtain data from the web, web traversal is important in web crawling, and parsing HTML allows one to extract information from web pages.

An interesting problem is how one could traverse a web page and visit N links reachable from that web page.

We can view the WWW as a graph. Each URL is a node on that graph. From each page, we have hypertext references that point to other resources, including other web pages. We consider these other web pages as adjacent nodes.


Assume that you have the following primitives:

    string get_webpage( string url );
    list<string> get_adj_webpages( string webpage );

Using the C++ Standard Template Library, implement Breadth First Search to traverse all adjacent web pages from an initial web page source.

Hint: The following containers might be useful:

    map<string, bool> visited;
    queue<string> q;


    #include <list>
    #include <map>
    #include <queue>
    #include <string>
    using namespace std;

    // Primitives assumed by the exercise.
    string get_webpage( string url );
    list<string> get_adj_webpages( string webpage );

    void bfs( string url )
    {
        // Maps urls to boolean value indicating
        // if they were visited.
        map<string, bool> visited;
        // FIFO queue of urls.
        queue<string> q;
        // List of adjacent urls.
        list<string> adj;
        // Contains web page results.
        string data;

        // Mark initial url as not visited.
        visited[url] = false;
        // Insert into queue the initial url.
        q.push(url);

        // Traverse the web pages.
        while(!q.empty()) {
            // Take the next url off the front of the queue. It is removed
            // whether or not it was already visited, so a url that was
            // queued twice cannot stall the loop.
            url = q.front();
            q.pop();
            if(visited[url] == false) {
                data = get_webpage(url);
                adj = get_adj_webpages(data);
                // Mark as visited.
                visited[url] = true;
                // Insert into queue all adjacent webpages.
                for(list<string>::iterator i = adj.begin(); i != adj.end(); i++) {
                    // If we did not already visit this page...
                    if(visited[(*i)] != true) {
                        q.push((*i));
                        visited[(*i)] = false;
                    }
                }
            }
        }
    }// bfs


We have a given node/url, A, with adjacent nodes/urls B, C, D as follows:

page A adj B, C, D.
page B adj A, E, F.
page C adj G.
page D adj A.

Or as a directed graph:

      B <---------> A <---------> D
     / \            |
    E   F           C
                    |
                    G


(init) visit A
(from A) visit B, C, D
(from B) visit E, F
(from C) visit G

* We do not consider URLs already visited.
* Each time we visit a page, some processing can be done. In this case, we obtain a list of words that we are interested in.
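
The visit order above can be checked with a small self-contained sketch. This is an illustration only: the in-memory `Graph` is a hypothetical stand-in for fetching pages over HTTP, and `bfs_order`, `example_web` and the `max_pages` limit (playing the role of N) are names introduced here, not part of the presentation.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the web: page -> adjacent pages.
typedef std::map<std::string, std::list<std::string> > Graph;

// Breadth-first traversal from `start`, visiting at most `max_pages` pages.
// Returns the pages in the order they were visited.
std::vector<std::string> bfs_order(const Graph& web, const std::string& start,
                                   std::size_t max_pages) {
    std::vector<std::string> order;
    std::set<std::string> seen;   // pages already queued or visited
    std::queue<std::string> q;
    q.push(start);
    seen.insert(start);
    while (!q.empty() && order.size() < max_pages) {
        std::string url = q.front();
        q.pop();
        order.push_back(url);     // "visit": real code would fetch and parse here
        Graph::const_iterator page = web.find(url);
        if (page == web.end()) continue;   // no outgoing links known
        for (const std::string& next : page->second) {
            // insert() reports whether `next` was new, so each page is queued once.
            if (seen.insert(next).second) q.push(next);
        }
    }
    return order;
}

// The example graph given above.
Graph example_web() {
    Graph web;
    web["A"] = {"B", "C", "D"};
    web["B"] = {"A", "E", "F"};
    web["C"] = {"G"};
    web["D"] = {"A"};
    return web;
}
```

Calling `bfs_order(example_web(), "A", 100)` yields A, B, C, D, E, F, G, matching the listing. Marking a page as seen when it is queued, rather than when it is visited, is a design choice that keeps duplicates out of the queue.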


Given that we can extract a set of words from a web page, that we know what URL those words appeared on, and that we can produce support and confidence levels using Apriori, design a simple database using SQL and an RDBMS that allows one to model the following information:

keyword, site, url, support, confidence

and give an example query where, provided the keyword, support and confidence levels, we can obtain the sites and URLs that contain that keyword with the desired support and confidence level.

Site refers to the WWW address, such as www.njit.edu, and URL refers to the location, such as /index.html.

