KDD Cup 2000 Question 1

PowerPoint Slideshow about 'KDD Cup 2000 Question 1' - gary-norris


Presentation Transcript
slide1

KDD Cup 2000

Question 1

slide2

Overview

  • Objective
    • Given a set of page views, predict whether the visitor will view another page or not
  • Data
    • Raw Data - Clicks
    • Aggregated Data - Sessions
      • Some sessions clipped in the middle
      • Indicator: Session continues
  • Methods and Tools
    • Exploratory Data Analysis - SAS
    • Classification Tree – Amdocs Business Insight Tool
      • Decision tree
      • Rules Extraction
      • Modeling
      • Combining models
The Winning Model - Introduction

This model combines …

Artificial intelligence, i.e. Automated procedures

with

Human intuition / Domain knowledge decisions

slide5

Building Main Model

  • Decision Tree: 5 trees built on 34000 cases
  • Rule Generator: 1466 rules, 111 continue rules
  • Sub-models: Best Rule, Hybrid Model, Merged Rules

slide6

Description of sub-models

Each model captures a different aspect of the overall behavior in the data.

Combining or ensembling the models provides the best prediction results.

  • Best Rule
    • Chooses the most accurate rule satisfied by each record
  • Hybrid Model
    • Logistic regression on the rule set plus raw field values defines the score for each record
  • Merged Rules
    • Logistic regression on the rule set defines the score for each record as a combination of the rules the record satisfies
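
The Best Rule and Merged Rules ideas can be illustrated with a minimal sketch. It is not the presenters' implementation: the rule predicates, field names (`page_views`, `minutes`), accuracies, weights and bias are all hypothetical, and turning a rule's accuracy into a continue score is an assumption made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, float]


@dataclass
class Rule:
    name: str
    predicate: Callable[[Record], bool]  # True when the record satisfies the rule
    accuracy: float                      # hypothetical training accuracy of the rule
    predicts_continue: bool              # class the rule predicts


def best_rule_score(record: Record, rules: List[Rule]) -> Optional[float]:
    """Best Rule: score from the most accurate rule the record satisfies."""
    matched = [r for r in rules if r.predicate(record)]
    if not matched:
        return None  # no rule fires; another sub-model has to score the record
    best = max(matched, key=lambda r: r.accuracy)
    # Assumption: a rule's accuracy is read as the probability of its class.
    return best.accuracy if best.predicts_continue else 1.0 - best.accuracy


def merged_rules_score(record: Record, rules: List[Rule],
                       weights: Dict[str, float], bias: float) -> float:
    """Merged Rules: logistic combination of the rules the record satisfies."""
    z = bias + sum(weights[r.name] for r in rules if r.predicate(record))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical rules and logistic-regression coefficients.
rules = [
    Rule("many_views", lambda r: r["page_views"] >= 5, 0.70, True),
    Rule("single_view", lambda r: r["page_views"] == 1, 0.90, False),
    Rule("long_session", lambda r: r["minutes"] >= 10, 0.80, True),
]
weights = {"many_views": 0.5, "single_view": -2.0, "long_session": 1.0}

record = {"page_views": 6, "minutes": 12}
best = best_rule_score(record, rules)
merged = merged_rules_score(record, rules, weights, bias=-1.0)
```

A Hybrid Model score would follow the same logistic form with raw field values added as extra inputs next to the rule indicators.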

slide7

Applying Main Model

  • DATA
  • Decision Tree: 5 trees built on 34000 cases
  • Rule Generator: 1466 rules, 111 continue rules
  • Sub-models: Best Rule, Hybrid Model, Merged Rules
  • Score Model
  • Average Scores
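
The "Average Scores" step can be sketched as a plain unweighted mean of the sub-model scores for one record. The slides do not say whether the average was weighted, so the equal weighting and the example score values below are assumptions.

```python
from statistics import mean
from typing import Sequence


def average_score(sub_model_scores: Sequence[float]) -> float:
    """Combine sub-model scores for one record with an unweighted mean."""
    if not sub_model_scores:
        raise ValueError("at least one sub-model score is required")
    return mean(sub_model_scores)


# Hypothetical scores for a single record from the three sub-models.
scores = {"best_rule": 0.80, "hybrid": 0.60, "merged_rules": 0.70}
combined = average_score(list(scores.values()))
```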

slide9

Small Whitebox - Building The Model

  • Decision Tree
  • Rule Generator
  • Hand selected rules with near perfect accuracy

slide10

Small Whitebox - Applying The Model

  • Rule Generator
  • One-Click
  • Non-crawlers
  • Hand selected rules with near perfect accuracy
  • Score = 1
  • Score = 0
  • Decision Tree

The prediction

The prediction is only slightly better than always choosing the majority class, but that is enough to win first place!

Final Considerations
  • Since both types of errors (false positives and false negatives) are given the same weight, a segment must have a very high probability of continuing to justify not being classified as the majority class.
  • The ratio of continue / not continue in the test set must be estimated as accurately as possible.
  • The cutoff point (which score threshold divides the two classes) must be carefully chosen.
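
The cutoff considerations above can be made concrete with a small sketch: with both error types weighted equally, the cutoff is chosen to minimize the plain error rate on labeled data, and the result is compared with the majority-class baseline. The scores and labels are invented for illustration, and searching only over the observed scores is an assumption, not a procedure described in the slides.

```python
from typing import Sequence, Tuple


def error_rate(scores: Sequence[float], labels: Sequence[int], cutoff: float) -> float:
    """Share of records misclassified when score >= cutoff means 'continue'."""
    wrong = sum((s >= cutoff) != bool(y) for s, y in zip(scores, labels))
    return wrong / len(labels)


def best_cutoff(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (cutoff, error) minimizing the equally weighted error rate."""
    # Candidates: each observed score, plus infinity (never predict 'continue').
    candidates = sorted(set(scores)) + [float("inf")]
    return min(((c, error_rate(scores, labels, c)) for c in candidates),
               key=lambda pair: pair[1])


def majority_baseline_error(labels: Sequence[int]) -> float:
    """Error rate of always predicting the more frequent class."""
    positives = sum(labels)
    return min(positives, len(labels) - positives) / len(labels)


# Hypothetical validation data: label 1 = session continues, 0 = it does not.
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

cutoff, err = best_cutoff(scores, labels)
baseline = majority_baseline_error(labels)
```

In this toy data the chosen cutoff is high: only records with a very high score are moved away from the majority class, which is the behavior the first bullet describes.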