Apriori Algorithm and the World Wide Web Roger G. Doss CIS 734




The purpose of this presentation is to introduce

an application of the Apriori algorithm to perform

association rule data mining on data gathered from

the World Wide Web.

Specifically, a system will be designed that

gathers information ( web pages ) from a user

specified site, and performs association rule

data mining on that data.

We have already seen the Apriori algorithm applied

to textual data in class.

Given an implementation that can work with

textual data...

What we want to do is to use Apriori in the following

manner:

Given an input of:

* obtain the url

* traverse all adjacent links up to N

* format the data

* compute support and confidence levels

for each word in a user supplied keyword set.

We can envision this system as divided into four

phases:

Phase 0: User input.

Phase 1: Data Acquisition.

Phase 2: Running Apriori on the data.

Phase 3: User output.

Data Acquisition:

Traverse Web Page(URL,N)

|

while N web pages not visited

|

Obtain WebPage via HTTP

|

Parse information

( look for keywords, adjacent links )

|

Store keywords in a file
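The "look for keywords" and "store keywords" steps above can be sketched as follows. This is a minimal sketch, not the presenter's implementation: the helper names extract_keywords and to_transaction_line are hypothetical, and it assumes a word is a run of letters or digits compared in lowercase, with one output line per visited page.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: returns the members of `keywords` that occur in
// `page_text`. A word is a maximal run of letters/digits, lowercased.
std::set<std::string> extract_keywords(const std::string& page_text,
                                       const std::set<std::string>& keywords) {
    std::set<std::string> found;
    std::string word;
    // A trailing space is appended so the final word is flushed by the loop.
    for (char c : page_text + " ") {
        if (std::isalnum(static_cast<unsigned char>(c))) {
            word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        } else if (!word.empty()) {
            if (keywords.count(word)) found.insert(word);
            word.clear();
        }
    }
    return found;
}

// Hypothetical helper: formats one page's keywords as a single
// space-separated line, suitable for appending to a keyword file.
std::string to_transaction_line(const std::set<std::string>& items) {
    std::ostringstream out;
    bool first = true;
    for (const std::string& item : items) {
        if (!first) out << ' ';
        out << item;
        first = false;
    }
    return out.str();
}
```

Writing one line per page keeps the file in a shape that a text-based Apriori implementation could read as one transaction per line, which is an assumption about the input format rather than something the slides specify.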

Running Apriori on the Data:

If we treat the initial web page and each adjacent web page

as a transaction, then each occurrence

of a keyword is an element in that transaction.

At this point, the Apriori algorithm can be run

on the data, producing a set of Association Rules

based on desired Confidence and Support levels.
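To make the two measures concrete, here is a minimal sketch of how support and confidence could be computed once each page is a transaction of keywords. It covers only the measures, not Apriori's candidate generation and pruning, and the names Transaction, support, and confidence are illustrative choices rather than part of the presentation.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

using Transaction = std::set<std::string>;  // keywords found on one page

// Fraction of transactions that contain every item in `itemset`.
double support(const std::vector<Transaction>& transactions,
               const std::set<std::string>& itemset) {
    if (transactions.empty()) return 0.0;
    std::size_t hits = 0;
    for (const Transaction& t : transactions) {
        // std::includes needs sorted ranges; std::set iterates in sorted order.
        if (std::includes(t.begin(), t.end(), itemset.begin(), itemset.end())) {
            ++hits;
        }
    }
    return static_cast<double>(hits) / transactions.size();
}

// Confidence of the rule lhs -> rhs: support(lhs union rhs) / support(lhs).
double confidence(const std::vector<Transaction>& transactions,
                  const std::set<std::string>& lhs,
                  const std::set<std::string>& rhs) {
    std::set<std::string> both = lhs;
    both.insert(rhs.begin(), rhs.end());
    double lhs_support = support(transactions, lhs);
    return lhs_support == 0.0 ? 0.0 : support(transactions, both) / lhs_support;
}
```

For example, if "data" appears on 3 of 4 pages and "data" and "mining" appear together on 2 of them, the rule data -> mining has support 0.5 and confidence 0.5 / 0.75, or about 0.67.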

Some modules that may be needed to implement the

system:

* HTTP Client.

Accessing a web page from a URL mechanically.

* Data Cleaning.

Extracting words that match keyword list.

Extracting hypertext references, i.e.,

href="http://www.njit.edu".

* Apriori Algorithm.

* Web traversal.
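As an illustration of the data cleaning module, the following sketch pulls href values out of a page with std::regex. The helper name extract_hrefs is hypothetical, and the pattern assumes double-quoted attribute values; a real crawler would more likely use an HTML parser and would also need to resolve relative links.

```cpp
#include <list>
#include <regex>
#include <string>

// Hypothetical helper: collects the values of double-quoted href attributes
// in the order they appear. Attribute names are matched case-insensitively.
std::list<std::string> extract_hrefs(const std::string& html) {
    static const std::regex href_re(R"re(href\s*=\s*"([^"]*)")re",
                                    std::regex::icase);
    std::list<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), href_re), end;
         it != end; ++it) {
        links.push_back((*it)[1].str());  // capture group 1 is the URL text
    }
    return links;
}
```

The return type is list<string> so that a function like this could sit behind the get_adj_webpages primitive used later in the slides.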

Building this system allows one to have a code base that can

be used for future research and work. An HTTP client is needed

to obtain data from the web, web traversal is important in

web crawling, and parsing HTML allows one to extract

information from web pages.

An interesting problem is how one could traverse a web page

and visit N links reachable from that web page.

We can view the WWW as a graph. Each URL is a node

on that graph. From each page, we have hyper-text references

that point to other resources, including other web pages.

We consider these other web pages as adjacent nodes.

Assume that you have the following primitives:

string get_webpage( string url );

list<string> get_adj_webpages( string webpage );

Using the C++ Standard Template Library, implement

Breadth First Search to traverse all adjacent web pages

from an initial web page source.

Hint: The following containers might be useful:

map<string, bool> visited;

queue<string> q;

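#include <list>

#include <map>

#include <queue>

#include <string>

using namespace std;

// Primitives assumed to be provided elsewhere.

string get_webpage( string url );

list<string> get_adj_webpages( string webpage );

void bfs( string url )

{

    // Maps urls to boolean value indicating

    // if they were visited.

    map<string, bool> visited;

    // FIFO queue of urls.

    queue<string> q;

    // List of adjacent urls.

    list<string> adj;

    // Contains web page results.

    string data;

    // Mark initial url as not visited.

    visited[url] = false;

    // Insert into queue the initial url.

    q.push(url);

    // Traverse the web pages.

    while(!q.empty()) {

        // Take the next url off the front of the queue. It is removed

        // whether or not it was already visited; otherwise a url that

        // was queued twice would never leave the queue.

        url = q.front();

        q.pop();

        if(visited[url] == false) {

            data = get_webpage(url);

            // Mark as visited.

            visited[url] = true;

            // Obtain the urls referenced by this page.

            adj = get_adj_webpages(data);

            // Insert into queue all adjacent webpages.

            for(list<string>::iterator i = adj.begin(); i != adj.end(); i++) {

                if(visited[(*i)] != true) {

                    q.push((*i));

                    visited[(*i)] = false;

                }

            }

        }

    }

}// bfs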

We have a given node/url, A, with adjacent nodes/urls B,C,D

as follows:

page A adj B,C,D.

page B adj A,E,F.

page C adj G.

page D adj A.

Or as a directed graph:

B <---------> A <---------> D
|             |
E             C
|             |
F             G

(init) visit A

(from A) visit B,C,D

(from B) visit E,F

(from C) visit G

* We do not consider URLs already visited.

* Each time we visit a page, some processing can be done.

In this case, we obtain a list of words that we are interested

in.
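The walk-through above can be reproduced on an in-memory copy of the example graph. This is a sketch under assumptions: the names Graph, bfs_order, and example_graph are illustrative, no HTTP is involved, and pages are marked as seen when they are queued rather than when they are fetched, which gives the same visit order here.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

using Graph = std::map<std::string, std::vector<std::string>>;

// Returns the pages in the order breadth-first search visits them.
std::vector<std::string> bfs_order(const Graph& graph, const std::string& start) {
    std::vector<std::string> order;
    std::set<std::string> seen{start};
    std::queue<std::string> q;
    q.push(start);
    while (!q.empty()) {
        std::string page = q.front();
        q.pop();
        order.push_back(page);  // "visit": per-page processing would go here
        auto it = graph.find(page);
        if (it == graph.end()) continue;  // page with no outgoing links
        for (const std::string& next : it->second) {
            // insert() reports whether the page was new, so each page is queued once.
            if (seen.insert(next).second) q.push(next);
        }
    }
    return order;
}

// The adjacency lists from the example: A adj B,C,D; B adj A,E,F; C adj G; D adj A.
Graph example_graph() {
    return {
        {"A", {"B", "C", "D"}},
        {"B", {"A", "E", "F"}},
        {"C", {"G"}},
        {"D", {"A"}},
    };
}
```

Calling bfs_order(example_graph(), "A") yields A, B, C, D, E, F, G, matching the visit sequence listed above; the back links from B and D to A are skipped because A has already been seen.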

Given that we can extract a set of words from a web page,

we know what URL those words appeared on, and we

can produce support and confidence levels using Apriori,

design a simple database using SQL and a RDBMS

that allows one to model the following information:

keyword, site, url, support, confidence

and give an example query where provided

the keyword, support and confidence levels,

we can obtain the sites and URLs that contain that

keyword with the desired support and confidence level.

Site refers to the WWW address, such as www.njit.edu,

and URL refers to the location, such as /index.html
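Before writing the SQL, it can help to prototype the record and the query in memory. The sketch below is a C++ stand-in, not the SQL answer the exercise asks for: the names KeywordRecord and find_pages are hypothetical, and it assumes "desired support and confidence level" means values at or above given thresholds.

```cpp
#include <string>
#include <utility>
#include <vector>

// One row of the hypothetical table: keyword, site, url, support, confidence.
struct KeywordRecord {
    std::string keyword;
    std::string site;   // WWW address, e.g. www.njit.edu
    std::string url;    // location on that site, e.g. /index.html
    double support;
    double confidence;
};

// Returns (site, url) pairs for rows that match `keyword` and whose support
// and confidence are at least the given thresholds (an assumed reading of
// "desired levels").
std::vector<std::pair<std::string, std::string>> find_pages(
        const std::vector<KeywordRecord>& table, const std::string& keyword,
        double min_support, double min_confidence) {
    std::vector<std::pair<std::string, std::string>> result;
    for (const KeywordRecord& row : table) {
        if (row.keyword == keyword && row.support >= min_support &&
            row.confidence >= min_confidence) {
            result.emplace_back(row.site, row.url);
        }
    }
    return result;
}
```

In a relational design the struct fields would become the columns of a single table, and the loop's condition would become the WHERE clause of a SELECT that returns site and url.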