Archive overview and projects too


Important links


  • Sign up for a “library card”:

    • http://www.archive.org/account/login.createaccount.php

  • Then you can access the following pages:

    • www.archive.org/web/researcher/researcher.php

    • www.archive.org/web/researcher/data_available.php

    • www.archive.org/web/researcher/parallel.php

    • www.archive.org/web/researcher/example_research_create_arc.php


Machine overview


  • Data stored on ~200 desktop computers

  • Host names: ia00xxx (e.g., ia00660)

    • Initially, you’ll use ia0010[0-7]

  • Four 160GB drives on each

    • /0, /1, /2, and /3

    • /1-/3 filled to capacity

    • /0 filled to 1/2 capacity

      • /0/tmp is “temp” space for computations
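The naming scheme and drive layout above can be sketched in a few lines of Python. This is only an illustration: the helper names `expand_hosts` and `stored_gb_per_machine` are hypothetical, and the capacity figure is a rough estimate derived from the numbers on this slide (four 160GB drives, three full and one half full).

```python
def expand_hosts(prefix, first, last, width=5):
    """Return zero-padded host names, e.g. ia00100 .. ia00107."""
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


def stored_gb_per_machine(drive_gb=160, full_drives=3, half_full_drives=1):
    """Rough data volume per machine: /1-/3 full, /0 half full."""
    return drive_gb * full_drives + drive_gb * half_full_drives / 2


# The machines used initially: ia0010[0-7]
hosts = expand_hosts("ia", 100, 107)
print(hosts)
print(stored_gb_per_machine())  # 560.0
```

At roughly 560GB per machine, ~200 machines would hold on the order of 112TB, assuming every machine is filled the same way.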


Your account


  • Fill out form at: http://www.soe.ucsc.edu/~raymie/290g-userinfo.html

  • I’ll take it from there

  • Expect an e-mail


Files


  • ARC files -- contain raw data

    • Multiple documents per file, ~100MB per file

  • DAT files -- contain commonly-used fields

  • CDX files -- index of ARC and DAT

    • /0/tmp/complete.cdx -- per machine

    • Archive-wide CDX files on 6 machines (wayback)

  • All are compressed (ARC files on page boundaries)
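One way to read "compressed on page boundaries" is that each document is compressed as its own gzip member, so a single page can be decompressed from a known byte offset without reading the whole file. The Python sketch below illustrates that idea on made-up data; it is an assumption about the layout, not a parser for real ARC files, and the helper names are hypothetical.

```python
import gzip
import zlib


def compress_records(records):
    """Compress each record as a separate gzip member.

    Returns the concatenated bytes and the start offset of each member.
    """
    blob = b""
    offsets = []
    for record in records:
        offsets.append(len(blob))
        blob += gzip.compress(record)
    return blob, offsets


def read_record_at(blob, offset):
    """Decompress exactly one gzip member starting at the given offset."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    return decompressor.decompress(blob[offset:])


blob, offsets = compress_records([b"page one\n", b"page two\n"])
print(read_record_at(blob, offsets[1]))   # only the second page
print(gzip.decompress(blob))              # the whole file, like zcat
```

An index that records such offsets is what makes random access to individual pages practical.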


ARC format


DAT format


Programs


  • Unix tools

    • grep, join, cut, awk, perl, screen(!), ...

  • Alexa tools

  • P2


Alexa tools


  • av_arcfilter, av_cat, av_getpage, av_grep, av_prepend_random, av_randomize, av_search, av_sort



P2

  • Based on data-parallel programming model

    • SIMD: single instruction, multiple data

    • Thinking Machines

  • Idea: run the same command line on all machines
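The data-parallel idea can be imitated locally in Python: apply one task to every machine name and collect the results. This is a sketch only, not how P2 is implemented. Threads stand in for remote machines, and `run_on_all` and `fake_task` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_all(task, machines):
    """Apply the same task to every machine name, in parallel.

    Results come back in the same order as the machine list.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(machines))) as pool:
        return list(pool.map(task, machines))


def fake_task(machine):
    # Hypothetical stand-in for a command that would run on a remote machine.
    return f"{machine}: ok"


for line in run_on_all(fake_task, ["ia00100", "ia00101", "ia00102"]):
    print(line)
```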



P2

  • p2 program [-c combiner] -p machines

    • program: command-line to be run

    • combiner: program to combine results

    • machines: machines to use

      • “-p /net/ia00100 /net/ia00101”

      • “-p $rack1”

        • $rack[1-5], $arcs


P2 example


  • p2 uptime -p $ARCS

    • Returns result of uptime on all machines

  • p2 'zcat /0/tmp/complete.cdx.gz | wc -l' -p ..

    • Returns the length (in lines) of each machine’s index
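For reference, the per-machine part of the second command (`zcat ... | wc -l`) can be approximated in Python as below. The sketch counts newline bytes, which matches what `wc -l` reports. It runs on a small temporary file with placeholder lines, since the real index contents are not shown here; `count_lines_gz` and `make_demo_index` are hypothetical helpers.

```python
import gzip
import os
import tempfile


def count_lines_gz(path, chunk_size=1 << 20):
    """Count newline bytes in a gzip-compressed file, like zcat | wc -l."""
    count = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


def make_demo_index(lines):
    """Write placeholder lines to a temporary .gz file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".cdx.gz")
    os.close(fd)
    with gzip.open(path, "wb") as f:
        for line in lines:
            f.write(line.encode("ascii") + b"\n")
    return path


demo_path = make_demo_index(["entry-1", "entry-2", "entry-3"])
print(count_lines_gz(demo_path))  # 3
os.remove(demo_path)
```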



p2

  • Output of “subprograms” sent to initiating “p2” program

  • This program “combines” these lines

    • By default, av_cat is used to get them to standard output

    • The -c option allows the user to set a combiner

  • But lines from subprograms can be interleaved
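The interleaving caveat can be shown with a small local simulation. Several "subprograms" (threads here, not real remote processes) push lines to a shared queue, so the arrival order is not guaranteed. A combiner that does not depend on order, such as a sum, still gives a stable answer. Tagging each line with its machine name is an assumption made for this sketch, and all function names are hypothetical.

```python
import queue
import threading


def run_subprograms(outputs):
    """Simulate subprograms sending lines back to the initiating program.

    outputs maps a machine name to the lines its subprogram prints.
    Returns all lines in arrival order, which may be interleaved.
    """
    q = queue.Queue()

    def worker(machine, lines):
        for line in lines:
            q.put(f"{machine}\t{line}")

    threads = [threading.Thread(target=worker, args=item)
               for item in outputs.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    merged = []
    while not q.empty():  # safe: all producers have finished
        merged.append(q.get())
    return merged


def cat_combiner(lines):
    """Pass lines through unchanged, in whatever order they arrived."""
    return list(lines)


def sum_combiner(lines):
    """Add up the numeric field of every line; order does not matter."""
    return sum(int(line.split("\t")[1]) for line in lines)


merged = run_subprograms({"ia00100": ["3", "4"], "ia00101": ["5"]})
print(cat_combiner(merged))   # order may vary between runs
print(sum_combiner(merged))   # 12
```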


Possible projects

  • Crawl catalog

  • Counts & histograms

  • Page-change study

  • Word-change study

  • Language identification

  • Table detection

  • RSS download/studies

  • Identifying “soft” 404s/30xs

  • Mirror detection

  • JavaScript link extraction

  • Storage redundancy

  • URL database

  • Validating host counts

  • IP sampling vs. crawls

  • Correcting for virtual hosts
