Important links • Need to sign up for a “library card” • http://www.archive.org/account/login.createaccount.php • Then you can access the following pages: • www.archive.org/web/researcher/researcher.php • www.archive.org/web/researcher/data_available.php • www.archive.org/web/researcher/parallel.php • www.archive.org/web/researcher/example_research_create_arc.php
Machine overview • Data stored on ~200 desktop computers • Host names: ia00xxx (e.g., ia00660) • Initially, you’ll use ia0010[0-7] • Four 160GB drives on each machine • /0, /1, /2, and /3 • /1-/3 filled to capacity • /0 filled to half capacity • /0/tmp is “temp” space for computations
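Before starting a large computation in the scratch area, it is worth checking how much space is actually free. A minimal sketch (the /0/tmp path is specific to the Archive machines; the SCRATCH_DIR fallback to /tmp is only so the snippet runs anywhere):

```shell
# Report free space on the scratch area before a big computation.
# /0/tmp is the Archive-specific scratch path; /tmp is a stand-in
# so this can be tried on any Unix machine.
scratch=${SCRATCH_DIR:-/tmp}

# df -P gives POSIX-stable columns: $4 = available 1K blocks, $6 = mount point.
df -P "$scratch" | awk 'NR == 2 { print $4 " KB free on " $6 }'
```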
Your account • Fill out form at: http://www.soe.ucsc.edu/~raymie/290g-userinfo.html • I’ll take it from there • Expect an e-mail
Files • ARC files -- contain the raw data • Multiple documents per file, ~100MB per file • DAT files -- contain commonly-used fields • CDX files -- indexes of the ARC and DAT files • /0/tmp/complete.cdx -- per-machine index • Archive-wide CDX’s on 6 machines (wayback) • All compressed (ARC on page boundaries)
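Because CDX indexes are plain whitespace-separated text, quick tallies need nothing beyond awk. A small sketch, assuming the URL is the first field (the real field layout is declared in the CDX file's header line, so check that first; the sample lines here are made up):

```shell
# Hypothetical CDX-style sample: whitespace-separated, URL first.
cat > sample.cdx <<'EOF'
example.com/ 20011015 other-fields
example.com/about 20011016 other-fields
archive.org/ 20011017 other-fields
EOF

# Count index entries per host -- the kind of quick tally the
# per-machine /0/tmp/complete.cdx supports. For the gzipped copy,
# start the pipeline with zcat instead of reading the file directly.
awk '{ split($1, parts, "/"); count[parts[1]]++ }
     END { for (h in count) print h, count[h] }' sample.cdx | sort
# → archive.org 1
#   example.com 2
```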
Programs • Unix tools • grep, join, cut, awk, perl, screen(!), ... • Alexa tools • P2
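Much of the day-to-day analysis is just pipelines over these standard tools. A toy example of the join(1) pattern, which combines two per-URL data files on a common key (file names and fields are invented for illustration; note that join requires both inputs sorted on the join field):

```shell
# Two hypothetical per-URL extracts, each keyed by URL in column 1.
printf 'example.com/a 1024\nexample.com/b 2048\n' | sort > sizes.txt
printf 'example.com/a 20011015\nexample.com/b 20011016\n' | sort > dates.txt

# join(1) merges lines sharing the first field; inputs must be sorted.
join sizes.txt dates.txt
# → example.com/a 1024 20011015
#   example.com/b 2048 20011016
```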
Alexa tools • av_arcfilter, av_cat, av_getpage, av_grep, av_prepend_random, av_randomize, av_search, av_sort
P2 • Based on a data-parallel programming model • SIMD: single instruction, multiple data (as on the Thinking Machines computers) • Idea: run the same command line on all machines
P2 • p2 program [-c combiner] -p machines • program: command line to be run • combiner: program to combine results • machines: machines to use • “-p /net/ia00100 /net/ia00101” • “-p $rack1” • Predefined sets: $rack[1-5], $arcs
P2 - example • p2 uptime -p $arcs • Returns the result of uptime on every machine • p2 ‘zcat /0/tmp/complete.cdx.gz | wc -l’ -p .. • Returns the length (in lines) of each machine’s index
p2 • Output of the “subprograms” is sent back to the initiating “p2” program • That program “combines” these lines • By default, av_cat is used to get them to standard output • The -c option lets the user supply a different combiner • Note: lines from different subprograms can be interleaved
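The fan-out/combine behavior described above can be sketched locally. This is a hypothetical stand-in, not the real p2: here "machines" are simulated with background subshells (on the real system each command would run remotely, e.g. over ssh), and the default combiner is plain concatenation to stdout, which is why lines from different subprograms can interleave:

```shell
# my_p2 CMD MACHINE... -- run CMD once per "machine" and merge output.
# Hypothetical sketch of the p2 pattern; the real tool dispatches to
# remote hosts and combines results via av_cat by default.
my_p2() {
    cmd=$1
    shift
    for machine in "$@"; do
        # Real version would be roughly:  ssh "$machine" "$cmd" &
        sh -c "$cmd" &       # background job per machine -> interleaving
    done
    wait                     # collect every subprogram before returning
}

# Piping through sort plays the role of a user-supplied combiner (-c).
my_p2 'echo line from subprogram $$' hostA hostB hostC | sort
```

Each background job writes to the shared stdout, so without a combiner that imposes order, the merged lines arrive in whatever order the subprograms finish.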
Possible projects • Crawl catalog • Counts & histograms • Page-change study • Word-change study • Language id • Table detection • RSS download/studies • Id “soft” 404/30x’s • Mirror detection • Javascript link extraction • Storage redundancy • URL database • Validating host counts • IP sampling vs. crawls • Correcting for virtual hosts