
Working with Command-Line Tools

Working with Command-Line Tools. Danielle Cunniff Plumer School of Information The University of Texas at Austin Summer 2014. Download the dataset.


Presentation Transcript


  1. Working with Command-Line Tools Danielle Cunniff Plumer School of Information The University of Texas at Austin Summer 2014

  2. Download the dataset
     • We will be working with a smallish (34M) dataset consisting of US Trademark Application Images from the USPTO. We will only be working with images from January 4, 2008. The data is made available by PublicResource.org.
     • Wget
     • GNU Wget is a free utility for non-interactive download of files from the Web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.
     • Because we are downloading only a single file, you do not need to specify any options.
     • Open a terminal
         bcadmin@ubuntu:~$ cd Downloads/
         bcadmin@ubuntu:~/Downloads$ wget https://bulk.resource.org/trademark/USTrademarkImages/hr080104.zip

  3. Run a checksum on the zip file
     • md5sum
     • Print or check MD5 (128-bit) checksums. With no FILE, or when FILE is -, read standard input.
     • In terminal
     • Make sure you are in the Downloads directory or other directory containing the zip file
         $ md5sum hr080104.zip
     • Redirect the output to a file
     • Syntax: command and arguments followed by > and name of file for output.
     • In terminal
         $ md5sum hr080104.zip > hr080104zip_md5sum.txt
         $ less hr080104zip_md5sum.txt
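The saved checksum file can also be used later to re-check the download. A minimal sketch, assuming GNU md5sum and its -c (check) option; demo.bin and the scratch directory are hypothetical stand-ins for the real zip file and the Downloads directory.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for hr080104.zip.
printf 'example payload\n' > demo.bin

# Record the checksum in a file, as on the slide.
md5sum demo.bin > demo_md5sum.txt

# Later: verify the file against the recorded checksum.
# md5sum -c reads "checksum  filename" lines and reports the result per file;
# its exit status is 0 only if every listed file still matches.
verify_output=$(md5sum -c demo_md5sum.txt)
verify_status=$?
echo "$verify_output"
echo "exit status: $verify_status"
```

A non-zero exit status means at least one file no longer matches its recorded checksum, which makes the check easy to use inside a script.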

  4. Unzip the file using unzip
     • Unzip
     • unzip will list, test, or extract files from a ZIP archive, commonly found on MS-DOS systems. The default behavior (with no options) is to extract into the current directory (and subdirectories below it) all files from the specified ZIP archive.
     • Option: -d will extract into a directory (directory does not need to exist)
     • In terminal
         $ unzip hr080104.zip -d hr080104
     • tar is another archiving tool, and more powerful in general, but it doesn’t work for zip files. See man tar for details.

  5. Inspect the files
     • Install tree
     • Tree is a recursive directory listing program that produces a depth-indented listing of files
         $ sudo apt-get install tree
     • Look at the files in the unzipped directory
         $ tree hr080104

  6. Tree options
     • Options
         $ man tree
     • -a Includes hidden files (those beginning with a dot ‘.’).
     • -f Prints the full path prefix for each file.
     • -i Makes tree not print the indentation lines, useful when used in conjunction with the -f option.
     • -p Print the file type and permissions for each file (as per ls -l).
     • -s Print the size of each file in bytes along with the name.
     • -h Print the size of each file but in a more human readable way.
     • -D Print the date of the last modification time for the file listed.
     • -o filename Send output to filename.
     • -r Sort the output in reverse alphabetic order.
     • -t Sort the output by last modification time instead of alphabetically.
     • Look at the files again
         $ tree -afihD hr080104 -o hr080104.txt
         $ less hr080104.txt

  7. Make a copy of a few files to play with
         $ mkdir temp
         $ cp hr080104/773621/77362188/* temp
         $ cd temp
         $ ls
     • Remember that you can use the Ubuntu autocomplete options to help avoid typing mistakes
     • tab will complete the name of a directory or a file after you’ve typed the first few characters, starting in the directory you’re currently in.
     • tab tab will show you what files match the characters you’ve entered so far
     • The up and down arrows will let you go back to commands you’ve previously entered.

  8. Corrupt a file
     • Calculate a checksum on one of the .XML files
         $ md5sum 00000001.XML > md5sum.txt
     • Open the file (for simplicity, we’ll use gedit). Be sure to enter the file name correctly; if you see an empty document, gedit has created a new document with nothing in it.
         $ gedit 00000001.XML
     • Change one character, save the file with a new name, and close gedit (either click the x in the top left, or do a Ctrl-C from the command line)
     • Save as 00000001r.XML
     • Run the checksum again, using >> to append the new output to the file you previously created
         $ md5sum 00000001r.XML >> md5sum.txt
     • Compare the two checksums
         $ less md5sum.txt
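Instead of comparing the two checksums by eye, the comparison can be scripted. A minimal sketch, assuming GNU md5sum; original.xml and edited.xml are hypothetical stand-ins for 00000001.XML and 00000001r.XML, and the one-character edit is made with sed rather than gedit.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for 00000001.XML.
printf '<mark>ACME</mark>\n' > original.xml
# Same content with one character changed (E -> X), as if edited by hand.
sed 's/ACME/ACMX/' original.xml > edited.xml

# md5sum prints "checksum  filename"; keep only the checksum field.
sum_original=$(md5sum original.xml | cut -d ' ' -f 1)
sum_edited=$(md5sum edited.xml | cut -d ' ' -f 1)

if [ "$sum_original" = "$sum_edited" ]; then
    result="identical"
else
    result="different"
fi

echo "original: $sum_original"
echo "edited:   $sum_edited"
echo "The checksums are $result"
```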

  9. Corrupt an image file
     • Calculate a checksum on one of the .JPG files
         $ md5sum 00000002.JPG > md5sum_jpg.txt
     • Open the file (for simplicity, we’ll use ghex)
         $ ghex 00000002.JPG
     • Change one character, save the file with a new name, and close ghex
     • Save as 00000002r.JPG
     • Run the checksum again
         $ md5sum 00000002r.JPG >> md5sum_jpg.txt
     • Compare the two checksums
         $ less md5sum_jpg.txt
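A checksum shows that a file changed but not where. As a rough sketch, cmp (from diffutils) can list the differing bytes of two binary files; the two 8-byte files below are hypothetical stand-ins for 00000002.JPG and 00000002r.JPG.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for a binary file: 8 bytes written as octal escapes.
printf '\377\330\377\340\000\020\112\106' > original.bin
# Copy with the 7th byte changed (octal 112 -> 113), as if edited in a hex editor.
printf '\377\330\377\340\000\020\113\106' > edited.bin

# cmp -l lists each differing byte: its 1-based offset, then the two byte
# values in octal. cmp exits non-zero when files differ, hence "|| true".
diff_report=$(cmp -l original.bin edited.bin || true)
diff_count=$(printf '%s\n' "$diff_report" | wc -l | tr -d ' ')

echo "$diff_report"
echo "differing bytes: $diff_count"
```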

  10. JHOVE
     • See http://jhove.sourceforge.net/using.html
     • Install JHOVE
         $ sudo apt-get install jhove
     • Run JHOVE on the XML file in the directory that you DIDN’T edit
         $ jhove 00000001.XML
     • Run JHOVE on the XML file in the directory that you corrupted
         $ jhove 00000001r.XML
     • It might help to open these side-by-side in two terminal windows
     • Repeat for the JPG files. What difference do you see? Why?

  11. Extract metadata with ExifTool
     • See http://www.sno.phy.queensu.ca/~phil/exiftool/
     • Run exiftool on your uncorrupted image file
         $ exiftool 00000002.JPG
     • Try it on the corrupted image file
         $ exiftool 00000002r.JPG
     • Output exiftool results to CSV
         $ cd ..
         $ exiftool -csv temp > out.csv
     • Open results in LibreOffice Calc (be sure to select the “comma” option when importing)

  12. Bulk metadata operations with ExifTool
     • Run exiftool over your complete download
         $ exiftool -r -csv hr080104 > hr080104.csv
     • Open results in LibreOffice Calc
     • For more work with exiftool, see the video tutorials by AVPreserve
     • http://www.avpreserve.com/exiftool-tutorial-series/

  13. FITS
     • FITS is a powerful set of tools for extracting and validating metadata. FITS includes:
     • JHOVE
     • ExifTool
     • National Library of New Zealand Metadata Extractor
     • DROID
     • FFIdent
     • File Utility (windows)
     • To run FITS, locate the script fits.sh on your virtual machine. It is probably located in /home/bcadmin/Tools/fits/. Verify this:
         $ ls /home/bcadmin/Tools/fits/

  14. FITS options
     -i   The input file you want to examine
     -o   The destination of the output XML file
     -r   Process directories recursively when -i is a directory
     -h   Prints the usage message
     -v   Displays the FITS version number
     -x   Convert FITS output to a standard metadata schema
     -xc  Output using a standard metadata schema and include FITS XML
     • If -o is not specified then the output is sent to the console window.
     • The general syntax for our purposes is:
         $ /home/bcadmin/Tools/fits/fits.sh -i input_file -o output_file

  15. Using FITS
     • From the directory containing the temp directory and the hr080104 directory, try the following commands:
         $ /home/bcadmin/Tools/fits/fits.sh -i temp/00000001.XML
     • You will probably see an error, followed by the output of the command printed to the screen. To save the output, add:
         $ /home/bcadmin/Tools/fits/fits.sh -i temp/00000001.XML -o xml_fits.txt
     • Convert the output to a standard metadata schema:
         $ /home/bcadmin/Tools/fits/fits.sh -x -i temp/00000001.XML -o xmlstd_fits.txt
     • Repeat for JPG files. Note the different standard metadata schemas.

  16. Using FITS over directories
     • You can process an entire directory of files with FITS. You need to add the -r (recursive) option if there are sub-directories, and specify a folder to hold the output.
         $ mkdir fits_temp
         $ /home/bcadmin/Tools/fits/fits.sh -x -i temp/ -o fits_temp/
         $ mkdir fits_hr080104
         $ /home/bcadmin/Tools/fits/fits.sh -x -r -i hr080104/ -o fits_hr080104/
     • This will take a long time and you will see a lot of errors.
     • Inspect the results. The main problem is that all the output files are stored in a single directory, so it’s difficult to see which FITS output goes with which file in the original directory.

  17. bash scripting
     • Some of the problems we’ve seen (such as with the FITS output) can be solved by careful use of scripting.
     • For a good introduction to Bash, see
     • The Linux Documentation Project. (n.d.) Bash Tutorial Intro & How-To. Available from http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-1.html
     • Other options include Python and Perl scripting. If you want to do this sort of work professionally, it’s highly recommended that you learn at least one of these.
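As one illustration of how scripting might address the FITS output problem from slide 16, the sketch below writes one report per input file into an output tree that mirrors the input tree. Everything here is an assumption for demonstration: the sample files, the analyze function (a stand-in that only counts bytes), and the .fits.xml naming. On the workshop VM, analyze could instead call fits.sh with the -i and -o options from slide 14.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample tree standing in for hr080104/.
mkdir -p input/a/b
printf 'one\n' > input/a/1.XML
printf 'two\n' > input/a/b/2.JPG

# Stand-in for the real tool (it just records the byte count).
# On the workshop VM this might instead be:
#   /home/bcadmin/Tools/fits/fits.sh -i "$1" -o "$2"
analyze() {
    wc -c < "$1" > "$2"
}

in_dir=input
out_dir=output

# Visit every regular file under in_dir and write its report to the same
# relative path under out_dir, so each report sits where its source would.
# Note: this simple loop does not handle file names containing newlines.
find "$in_dir" -type f | while IFS= read -r src; do
    rel=${src#"$in_dir"/}
    dest="$out_dir/$rel.fits.xml"
    mkdir -p "$(dirname "$dest")"
    analyze "$src" "$dest"
done

# List the reports that were written.
find "$out_dir" -type f | sort
```

Because each report keeps the relative path of its source file, it stays obvious which output belongs to which input.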
