
Download Estimation for KDD Cup 2003

Janez Brank and Jure Leskovec

Jožef Stefan Institute

Ljubljana, Slovenia

Task Description
  • Inputs:
    • Approx. 29000 papers from the “high energy physics – theory” area of arxiv.org
    • For each paper:
      • Full text (TeX file, often very messy)
      • Metadata in a nice, structured file (authors, title, abstract, journal, subject classes)
    • The citation graph (excludes citations pointing outside our dataset)
Task Description
  • Inputs (continued):
    • For papers from 6 months (the training set, 1566 papers)
      • The number of times this paper was downloaded during its first two months in the archive
  • Problem:
    • For papers from 3 months (the test set, 678 papers), predict the number of downloads in their first two months in the archive
    • Only the 50 most frequently downloaded papers from each month will be used for evaluation!
Our Approach
  • Textual documents have traditionally been treated as “bags of words”
    • The number of occurrences of each word matters, but the order of the words is ignored
    • Efficiently represented by sparse vectors
  • We extend this to include other items besides words (“bag of X”)
    • Most of our work was spent trying various features and adjusting their weight (more on that later)
  • Use support vector regression to train a linear model, which is then used to predict the download counts on test papers
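A minimal sketch of this setup, assuming scikit-learn and scipy (which postdate the original work); the random sparse matrix is only a stand-in for the real "bag of X" features:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Stand-in for the "bag of X" matrix: 1566 training papers and ~20k sparse
# features (words, authors, link features, ...), almost all entries zero.
X_train = sp.random(1566, 20000, density=0.001, format="csr", random_state=0)
y_train = rng.poisson(384.2, size=1566)   # fake download counts

# Linear support-vector regression; C = 1 until the final tuning step.
model = SVR(kernel="linear", C=1.0)
model.fit(X_train, y_train)
```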
A Few Initial Observations
  • Our predictions will be evaluated on the 50 most downloaded papers from each month — about 20% of all papers from these months
    • It’s OK to be horribly wrong on other papers
    • Thus we should be optimistic, treating every paper as if it were in the top 20%
    • Maybe we should train the model using only the 20% most downloaded training papers
      • Actually, 30% usually works a little better
      • To evaluate a model, we look at the 20% most downloaded test papers
Cross-Validation

Labeled papers (1566) are split into 10 folds. In each round:

  • Training: take 9 folds (approx. 1409 papers), keep the 30% most frequently downloaded (approx. 423 papers), and train a model on them
  • Evaluation: on the held-out fold (approx. 157 papers), evaluate the model on the 20% most frequently downloaded (approx. 31 papers)
  • Lather, rinse, repeat (10 times); report the average error
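A sketch of this procedure, reusing X_train and y_train (all 1566 labeled papers) from the earlier sketch; top_fraction is a hypothetical helper, not from the original code:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def top_fraction(indices, y, fraction):
    """Keep the given fraction of indices with the highest download counts."""
    order = np.argsort(y[indices])[::-1]
    return indices[order[:int(len(indices) * fraction)]]

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X_train):
    train_top = top_fraction(train_idx, y_train, 0.30)   # train on the top 30%
    test_top = top_fraction(test_idx, y_train, 0.20)     # evaluate on the top 20%
    model = SVR(kernel="linear", C=1.0).fit(X_train[train_top], y_train[train_top])
    errors.append(np.abs(model.predict(X_train[test_top]) - y_train[test_top]).mean())
print(np.mean(errors))   # the reported average error
```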

A Few Initial Observations
  • We are interested in the downloads within 60 days of inclusion in the archive
    • Most of the downloads occur within the first few days, perhaps a week
    • Most are probably coming from the “What’s new” page, which contains only:
      • Author names
      • Institution name (rarely)
      • Title
      • Abstract
    • Citations probably don’t directly influence downloads in the first 60 days
      • But they show which papers are good, and the readers perhaps sense this in some other way from the authors / title / abstract
The Rock Bottom
  • The trivial model: always predict the average download count (computed on the training data)
    • Average download count: 384.2
    • Average error: 152.5 downloads
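In code, the baseline is just the training mean (toy numbers here; the slide's figures appear in the comments):

```python
import numpy as np

y_train = np.array([300, 450, 120, 900])    # toy download counts
y_test = np.array([250, 700])
baseline = y_train.mean()                   # 384.2 on the real training data
print(np.mean(np.abs(y_test - baseline)))   # average error; 152.5 on the real data
```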
Abstract
  • Abstract: use the text of the abstract and title of the paper in the traditional bag-of-words style
    • 19912 features
    • No further feature selection etc.
    • This part of the vector was normalized to unit length (Euclidean norm = 1)
  • Average error: 149.4
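A sketch of this representation, assuming scikit-learn; `papers` is a hypothetical list of metadata records:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

papers = [{"title": "Brane New World", "abstract": "We study brane cosmology ..."},
          {"title": "Lectures on supergravity", "abstract": "An introduction to ..."}]
texts = [p["title"] + " " + p["abstract"] for p in papers]

counts = CountVectorizer().fit_transform(texts)   # 19912 distinct words on the real data
abstract_block = normalize(counts)                # each row scaled to Euclidean norm 1
```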
Author
  • One attribute for each possible author
  • Preprocessing to tidy up the original metadata (a reconstruction is sketched after this list):

Y.S. Myung and Gungwon Kang → myung-y kang-g

  • x_a is nonzero iff a is one of the authors of the paper x
  • This part is normalized to unit length
  • 5716 features
  • Average error: 146.4
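The exact tidy-up rule isn't given on the slide; a plausible reconstruction of the mapping shown above (surname plus first initial, lowercased):

```python
def author_token(name):
    """'Y.S. Myung' -> 'myung-y': surname plus first initial, lowercased."""
    parts = name.replace(".", " ").split()
    return parts[-1].lower() + "-" + parts[0][0].lower()

print(author_token("Y.S. Myung"), author_token("Gungwon Kang"))
# myung-y kang-g
```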
Address
  • Intuition: people are more likely to download a paper if the authors are from a reputable institution
    • Admittedly, the “What’s new” page usually doesn’t mention the institution
    • Nor is it provided in the metadata, so we had to extract it from the TeX files (messy!)
  • Words from the address are represented using the bag-of-words model
    • But they get their own namespace, separate from the abstract and title words (see the sketch below)
    • This part of the vector is also normalizedto unit length
  • Average error: 154.0 ( worse than useless)
Abstract, Author, Address
  • We used Author + Abstract (“AA” for short) as the baseline for adding new features
Using the Citation Graph
  • InDegree, OutDegree
    • These are quite large in comparison to the text-based features (average indegree = approx. 10)
    • We must use weighting, otherwise they will appear too important to the learner
  • InDegree is useful
  • OutDegree is largely useless (which is reasonable)

[Results from the slide: cross-validation error for AA + InDegree at various InDegree weights; the numbers did not survive the transcript]

Using the Citation Graph
  • InLinks = add one feature for each paper i; it will be nonzero in the vector x iff the paper x is referenced by the paper i
    • Normalize this part of the vector to unit length
  • OutLinks = the same, nonzero iff x references i (results on next slide)
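A sketch of how these graph features can be built from the adjacency matrix, assuming networkx, scipy, and scikit-learn; the toy graph stands in for the real citation graph:

```python
import numpy as np
import scipy.sparse as sp
import networkx as nx
from sklearn.preprocessing import normalize

G = nx.DiGraph([(1, 2), (1, 3), (2, 3)])   # toy graph: edge i -> j means i cites j
A = sp.csr_matrix(nx.to_scipy_sparse_array(G, nodelist=sorted(G)))

out_degree = np.asarray(A.sum(axis=1)).ravel()  # OutDegree (largely useless)
in_degree = np.asarray(A.sum(axis=0)).ravel()   # InDegree (useful, but needs a small weight)
out_links = normalize(A)                        # row x nonzero in column i iff x cites i
in_links = normalize(A.T.tocsr())               # row x nonzero in column i iff i cites x
```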
Using the Citation Graph
  • Use HITS to compute a hub value and an authority value for each paper (two new features)
  • Compute PageRank and add this as a new feature
    • Bad: all links point backwards in time (unlike on the web), so PageRank accumulates in the earlier years
  • InDegree, Authority, and PageRank are strongly correlated, so there is no improvement over previous results
  • Hub is strongly correlated with OutDegree, and is just as useless
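With networkx, both algorithms are one-liners (a toy random graph replaces the citation graph here):

```python
import networkx as nx

G = nx.gnp_random_graph(50, 0.1, directed=True, seed=0)   # toy citation graph
hubs, authorities = nx.hits(G)   # two extra features per paper
pagerank = nx.pagerank(G)        # one more feature per paper; on the real graph
                                 # it accumulates in the earlier years
```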
Journal
  • The “Journal” field in the metadata indicates that the paper has been (or will be?) published in a journal
    • Present in about 77% of the papers
    • Already in standardized form, e.g. “Phys. Lett.” (never “Physics Letters”, “Phys. Letters”, etc.)
    • There are over 50 journals, but only 4 have more than 100 training papers
  • Papers from some journals are downloadedmore often than from others:
    • JHEP 248, J. Phys. 104, global average 194
  • Introduce one binary feature for each journal (+ one for “missing”)
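A minimal sketch of the binary encoding, with `None` standing for a missing Journal field:

```python
journals = ["JHEP", "Phys. Lett.", None, "JHEP"]   # one entry per paper
names = sorted({j for j in journals if j is not None})
journal_block = [[int(j == name) for name in names] + [int(j is None)]
                 for j in journals]
# one binary column per journal, plus one final column for "missing"
```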
Miscellaneous Statistics
  • TitleCc, TitleWc: number of characters/words in the title
    • The most frequently downloaded papers have relatively short titles:

      • The holographic principle (2927 downloads)
      • Twenty Years of Debate with Stephen (1540)
      • Brane New World (1351)
      • A tentative theory of large distance physics (1351)
      • (De)Constructing Dimensions (1343)
      • Lectures on supergravity (1308)
      • A Short Survey of Noncommutative Geometry (1246)

Miscellaneous Statistics
  • Average error: 119.561 for TitleCc weight = 0.02
  • The model says that the number of downloads decreases by 0.96 for each additional letter in the title :-)
  • TitleWc is useless
Miscellaneous Statistics
  • AbstractCc, AbstractWc: number of characters/words in the abstract
    • Both useless
  • Number of authors (useless)
  • Year (actually Year – 2000)
    • Almost useless (reduces error from 119.56 to 119.28)
Clustering
  • Each paper was represented by a sparse vector (bag-of-words, using the abstract + title)
  • Use 2-means to split into two clusters, then split each of them recursively
    • Stop splitting if one of the two clusters would have < 600 documents
  • We ended up with 18 clusters
    • Hard to say if they’re meaningful (ask a physicist?)
  • Introduce one binary feature for each cluster (useless)
  • Also a feature (ClusDlAvg) containing the average number of downloads over all training documents from the same cluster
    • Reduces error from 119.59 to 119.30
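A sketch of the recursive 2-means splitting, assuming scikit-learn; the random sparse matrix stands in for the real bag-of-words vectors:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import KMeans

def bisect(X, indices, min_size=600):
    """Split with 2-means, recursing until a child would fall below min_size."""
    if len(indices) < 2 * min_size:   # no split can yield two children >= min_size
        return [indices]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    if min(len(left), len(right)) < min_size:
        return [indices]
    return bisect(X, left, min_size) + bisect(X, right, min_size)

X_bow = sp.random(5000, 200, density=0.05, format="csr", random_state=0)
clusters = bisect(X_bow, np.arange(X_bow.shape[0]))   # 18 clusters on the real data
```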
Tweaking and Tuning
  • AA + 0.005 InDegree + 0.5 InLinks + 0.7 OutLinks + 0.3 Journal + 0.02 TitleCc/5 + 0.6 (Year – 2000) + 0.15 ClusDlAvg: 29.544 / 119.072
  • The “C” parameter for SVM regression was fixed at 1 so far
  • C = 0.7, AA + 0.006 InDegree + 0.7 InLinks + 0.85 OutLinks + 0.35 Journal + 0.03 TitleCc/5 + 0.3 ClusDlAvg: 31.805 / 118.944
    • This is the one we submitted
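A sketch of how the submitted configuration could be assembled, with random stand-ins for the feature blocks; the block widths are guesses from earlier slides, while the weights and C are the ones above:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.svm import SVR

n = 1566
block = lambda cols: sp.random(n, cols, density=0.01, format="csr", random_state=0)

X = sp.hstack([
    block(19912),          # Abstract (weight 1; half of the "AA" baseline)
    block(5716),           # Author   (weight 1; the other half)
    0.006 * block(1),      # InDegree
    0.7   * block(n),      # InLinks
    0.85  * block(n),      # OutLinks
    0.35  * block(51),     # Journal (~50 journals + "missing")
    0.03  * block(1),      # TitleCc / 5
    0.3   * block(1),      # ClusDlAvg
]).tocsr()

y = np.random.default_rng(0).poisson(384.2, size=n)   # fake download counts
model = SVR(kernel="linear", C=0.7).fit(X, y)         # the submitted configuration
```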
Conclusions
  • It’s a nasty dataset!
    • The best model is still disappointingly inaccurate
    • …and not so much better than the trivial model
    • Weighting the features is very important
    • We tried several other features (not mentioned in this presentation) that were of no use
    • Whatever you do, there’s still so much variance left
  • SVM learns well enough here, but it can’t generalize well
    • It isn’t the trivial sort of overfitting that could be removed simply by decreasing the C parameter in SVM’s optimization problem
Further Work
  • What is it that influences readers’ decisions to download a paper?
    • We are mostly using things they can see directly: author, title, abstract
    • But readers are also influenced by their background knowledge:
      • Is X currently a hot topic within this community? (Will reading this paper help me with my own research?)
      • Is Y a well-known author?How likely is the paper to be any good?
    • It isn’t easy to catch these things, and there is a risk of overfitting