Archive overview and projects too
PowerPoint Slideshow about ' Archive overview and projects too' - karina-baldwin


Important links
  • Need to sign up for “library cards”
    • http://www.archive.org/account/login.createaccount.php
  • Then you can access the following pages:
    • www.archive.org/web/researcher/researcher.php
    • www.archive.org/web/researcher/data_available.php
    • www.archive.org/web/researcher/parallel.php
    • www.archive.org/web/researcher/example_research_create_arc.php
Machine overview
  • Data stored on ~200 desktop computers
  • Host names: ia00xxx (e.g., ia00660)
    • Initially, you’ll use ia0010[0-7]
  • Four 160GB drives on each
    • /0, /1, /2, and /3
    • /1-/3 filled to capacity
    • /0 filled to 1/2 capacity
      • /0/tmp is “temp” space for computations
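As a rough illustration of the layout above, the following sketch expands the host pattern ia0010[0-7] and estimates how much data is stored. The function names are hypothetical, and the estimate assumes every machine matches the slide's description exactly (four nominal 160GB drives, three full and one half full, ~200 machines).

```python
# Hypothetical helpers for illustration; not part of any Archive tooling.

DRIVE_GB = 160        # nominal size of each drive, per the slide
MACHINES = 200        # approximate machine count, per the slide


def initial_hosts():
    """Expand the pattern ia0010[0-7] into explicit host names."""
    return [f"ia0010{i}" for i in range(8)]


def data_dirs():
    """Mount points of the four drives: /0, /1, /2, /3."""
    return [f"/{d}" for d in range(4)]


def stored_gb_per_machine():
    """/1-/3 are full and /0 is half full."""
    return 3 * DRIVE_GB + DRIVE_GB // 2


if __name__ == "__main__":
    print(initial_hosts())
    print(data_dirs())
    print(stored_gb_per_machine(), "GB per machine (estimate)")
    print(stored_gb_per_machine() * MACHINES, "GB in total (estimate)")
```

Under these assumptions each machine holds about 560GB, or roughly 112TB across the cluster.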
Your account
  • Fill out form at: http://www.soe.ucsc.edu/~raymie/290g-userinfo.html
  • I’ll take it from there
  • Expect an e-mail
Files
  • ARC files -- contain raw data
    • Multiple documents per file, ~100MB per file
  • DAT files -- contain commonly-used fields
  • CDX files -- index of ARC and DAT
    • /0/tmp/complete.cdx -- per machine
    • Archive-wide CDXs on 6 machines (Wayback)
  • All compressed (ARC on page boundaries)
Programs
  • Unix tools
    • grep, join, cut, awk, perl, screen(!), ...
  • Alexa tools
  • P2
Alexa tools
  • av_arcfilter, av_cat, av_getpage, av_grep, av_prepend_random, av_randomize, av_search, av_sort
P2
  • Based on data-parallel programming model
    • SIMD: single instruction, multiple data
    • As on Thinking Machines computers
  • Idea: run the same command line on all machines
P2
  • p2 program [-c combiner] -p machines
    • program: command-line to be run
    • combiner: program to combine results
    • machines: machines to use
      • “-p /net/ia00100 /net/ia00101”
      • “-p $rack1”
        • $rack[1-5], $arcs
P2 - example
  • p2 uptime -p $ARCS
    • Returns result of uptime on all machines
  • p2 'zcat /0/tmp/complete.cdx.gz | wc -l' -p ..
    • Returns length (in lines) of indexes
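For readers who want to see what the second example computes on each machine, here is a minimal Python sketch of the `zcat ... | wc -l` step. It covers only the local line count, not p2 itself, and the function name count_lines is a hypothetical one chosen for illustration.

```python
import gzip
import sys


def count_lines(path):
    """Count newline characters in a gzip-compressed file,
    which is what `zcat path | wc -l` reports."""
    count = 0
    with gzip.open(path, "rb") as f:
        # Read in 1 MiB chunks so a large index is never held in memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            count += chunk.count(b"\n")
    return count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python count_lines.py /0/tmp/complete.cdx.gz
    print(count_lines(sys.argv[1]))
```

Reading in fixed-size chunks rather than line by line keeps the loop fast even if individual index lines are long.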
P2
  • Output of “subprograms” sent to initiating “p2” program
  • This program “combines” these lines
    • By default, av_cat is used to get them to standard output
    • The -c option allows the user to set a combiner
  • But lines from subprograms can be interleaved
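The combining step can be pictured with a small simulation. The sketch below is not p2's implementation: it uses local threads as stand-ins for remote machines, and every name in it (run_everywhere, cat, total, fake_task) is hypothetical. It shows why output order is not guaranteed and how a custom combiner, in the spirit of the -c option, can reduce the lines to a single result.

```python
import queue
import threading


def run_everywhere(task, machines, combiner):
    """Run task(machine) for each machine in parallel and pass all output
    lines, in arrival order, to combiner. Threads stand in for remote hosts."""
    lines = queue.Queue()
    done = object()  # sentinel: one worker has finished

    def worker(machine):
        try:
            for line in task(machine):
                lines.put(line)
        finally:
            lines.put(done)

    threads = [threading.Thread(target=worker, args=(m,)) for m in machines]
    for t in threads:
        t.start()

    def merged():
        remaining = len(threads)
        while remaining:
            item = lines.get()
            if item is done:
                remaining -= 1
            else:
                yield item

    result = combiner(merged())
    for t in threads:
        t.join()
    return result


def cat(lines):
    """Default-style combiner: pass lines through in arrival order."""
    return list(lines)


def total(lines):
    """Custom combiner: sum one integer per line."""
    return sum(int(line) for line in lines)


def fake_task(machine):
    """Stand-in for a remote command that prints two lines."""
    yield f"{machine}: first"
    yield f"{machine}: second"


if __name__ == "__main__":
    hosts = ["/net/ia00100", "/net/ia00101"]
    # Lines from the two hosts may arrive interleaved in any order.
    print(run_everywhere(fake_task, hosts, cat))
    # A summing combiner gives the same answer regardless of arrival order.
    print(run_everywhere(lambda m: ["10"], hosts, total))
```

A combiner whose result does not depend on line order, such as a sum, sidesteps the interleaving problem. A pass-through combiner does not, so in this sketch the subprogram tags each line with its machine name so that the lines can be separated afterwards.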
Possible projects
  • Crawl catalog
  • Counts & histograms
  • Page-change
  • Word-change study
  • Language ID
  • Table detection
  • RSS download/studies
  • Identifying “soft” 404/30x’s
  • Mirror detection
  • JavaScript link extraction
  • Storage redundancy
  • URL database
  • Validating host counts
  • IP sampling vs. crawls
  • Correcting for virtual hosts