Using a Simple Python Script to Download Data

Using a Simple Python Script to Download Data. Rob Letzler Goldman School of Public Policy July 2005. Overview. Explain the problem Talk about the solution strategy Then walk through the code line by line; and explain the tools and ideas in the solution.

Presentation Transcript


  1. Using a Simple Python Script to Download Data Rob Letzler Goldman School of Public Policy July 2005

  2. Overview • Explain the problem • Talk about the solution strategy • Then walk through the code line by line; and explain the tools and ideas in the solution

  3. What’s not here that we might want to discuss in the future • High-speed numerical Python: a slow language with fast libraries • Writing your own objects • Good program structure • Functional programming: the map, filter, lambda, and reduce commands. Good short overview at: http://scott.andstuff.org/FunctionalPython • (Stata’s generate / replace commands are roughly map, and Stata’s drop if ~X is roughly filter)
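
A rough illustration of the Stata analogy on slide 3, written as a minimal Python 3 sketch (the slides themselves use Python 2). The price list and the threshold of 40 are made-up values, not data from the talk.

```python
from functools import reduce  # in Python 3, reduce lives in functools

# Hypothetical prices, invented for this illustration.
prices = [32.5, 41.0, 28.75, 55.25]

# map: apply a function to every element (roughly Stata's generate / replace).
prices_in_cents = list(map(lambda p: p * 100, prices))

# filter: keep only the elements that pass a test (roughly Stata's drop if ~X).
high_prices = list(filter(lambda p: p > 40, prices))

# reduce: fold the whole list into one value, here a sum.
total = reduce(lambda a, b: a + b, prices)

print(prices_in_cents, high_prices, total)
```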

  4. The Challenge • Download > 1000 daily and monthly electricity market database files from the California Independent System Operator Website.

  5. Overview • Explain the problem • Talk about the solution strategy • Then walk through the code line by line; and explain the tools and ideas in the solution

  6. Solution Strategy • Research the http:// location (URL) of each database • Write Python Code that executes once for each month t from the sample period • Generate strings for the locations of the webpage and local disk file for month t • Open the web page • Create a local disk file • Read the web page and save it in the local disk file

  7. Disclaimer • This is my first Python program. • I fear that I’ve reinvented a lot of wheels. This program uses lots of basic Python functions rather than tapping into libraries and extensions in ways that would create a shorter program. • This program structure, which has a main loop that is not in a function or object, is fine for a simple program but is dangerous for large, complex programs

  8. Overview • Explain the problem • Talk about the solution strategy • Then walk through the code line by line; and explain the tools and ideas in the solution

  9. Python Syntax We’ll Need • Loops • Conditional Statements • Functions • File / web reading and writing • Exception Handling

  10. For Loops in Python • Python loops over the elements of a list, not by updating an integer. • Python requires a colon (:) between a conditional / loop / function declaration and the block of additional statements it affects:
      for item in list:
          do stuff
      • Other programming languages would approach this as:
      for integer i = start to stop {do stuff}
      • Python’s range(start, stop+1) covers the same values as other languages’ start to stop
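
The two loop styles from slide 10 can be sketched as follows, in Python 3. The list contents and variable names are made up for illustration.

```python
months = ["Jan", "Feb", "Mar"]  # example list, invented for this sketch

# Loop directly over the elements of a list.
labels = []
for name in months:
    labels.append(name.upper())

# Loop over integers 1 to 3: range excludes its stop value, hence the + 1.
start, stop = 1, 3
numbers = []
for i in range(start, stop + 1):
    numbers.append(i)

print(labels)
print(numbers)
```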

  11. Solution Strategy • Research the database’s http:// location (URL) • Write Python Code that executes once for each month t from the sample period • Generate strings for the webpage and local disk file for month t • open the web page • create a local disk file • Read the web page and save it in the local file

  12. The Main Loop Part I
      month_length = [31,28,31,30,31,30,31,31,30,31,30,31]  #list of number of days in each month
      for year in range(2001,2005):  #years 2001 to 2004 -- notice ranges include the
                                     #first num, but are strictly less than the last num
          for month in range(1,13):
              if ((year in range(2002,2004)) or (year == 2001 and month > 3) or (year == 2004 and month < 10)):
                  #only begins executing the main block if we are in
                  #the sample period
      Red highlights:
      • Logical operators are the words and and or, not & and |
      • To test whether a and b are the same, use a == b with two equal signs; to put b in a, use a = b with one equal sign.
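
To check which months the condition on slide 12 selects, here is a small Python 3 sketch that collects them in a list instead of downloading anything. The function name sample_months is hypothetical; the condition itself is copied from the slide.

```python
def sample_months():
    """Return the (year, month) pairs that pass the slide's sample-period test."""
    selected = []
    for year in range(2001, 2005):       # 2001 through 2004
        for month in range(1, 13):       # 1 through 12
            in_sample = (
                year in range(2002, 2004)            # all of 2002 and 2003
                or (year == 2001 and month > 3)      # April 2001 onward
                or (year == 2004 and month < 10)     # through September 2004
            )
            if in_sample:
                selected.append((year, month))
    return selected


pairs = sample_months()
print(len(pairs), pairs[0], pairs[-1])
```

Under this condition the sample runs from April 2001 through September 2004, which is 42 months.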

  13. Functions • Functions are groups of statements that other parts of the code can call:
      def FunctionName(parameters):
          statements
          return optional_return_value
      • Functions may return a value. If the function returns a value, you can call it in an assignment statement, like result = FunctionName(inputs) • Functions and objects are crucial tools for designing large programs that are modular, flexible, and reliable. See McConnell, Code Complete, for more detail.
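
A concrete instance of the pattern on slide 13, as a minimal Python 3 sketch. The function average and its inputs are made up for illustration and are not part of the download script.

```python
def average(values):
    # Group of statements that other code can call; returns one value.
    return sum(values) / len(values)


# Because the function returns a value, it can sit on the right of an assignment.
result = average([2, 4, 6])
print(result)
```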

  14. Python passes every argument as a reference to an object. Numbers and strings cannot be changed in place, so they behave as if they were passed by value; lists and other more complex things can be changed through the reference. Each function works with its own local names for these references, which can protect values from being accidentally changed. • If you create a new object in the function, the original will be unaffected: list_var = list_var + ["C", "D"] • If you modify the original object without changing its memory address, the original will be changed: list_var.extend(["C", "D"]) or list_var[1] = "C" • Any variable that is defined outside of a function or object is global and can be changed by any part of the code. Avoid using global variables because it can be difficult to find and fix errors involving changes in them.

  15. Passing by Value and Reference
      import fpformat  #formats numbers as strings
      def python_copies_numbers_but_shares_lists_and_objects(list_input, integer_input):
          integer_input = integer_input*1000  #rebinds the local name only
          list_input.extend(["C","D"])  #changes the caller's list in place
          return integer_input
      def main():
          test_list = ["A","B"]
          test_integer = 5
          updated_integer = python_copies_numbers_but_shares_lists_and_objects(test_list, test_integer)
          print "notice that test_list has changed to "
          print test_list
          print "but that test_integer is still " + fpformat.fix(test_integer,0) + " but the copy we returned has changed to " + fpformat.fix(updated_integer,0)
          return
      main()
      Output: notice that test_list has changed to ['A', 'B', 'C', 'D'] but that test_integer is still 5 but the copy we returned is 5000
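
Slide 15 exercises the in-place case (extend) but not the "create a new object" case from slide 14. The Python 3 sketch below puts the two side by side; the function names rebind and mutate are hypothetical.

```python
def rebind(list_input):
    # Builds a NEW list and binds the local name to it;
    # the caller's list is left untouched.
    list_input = list_input + ["C", "D"]
    return list_input


def mutate(list_input):
    # Changes the caller's list in place through the shared reference.
    list_input.extend(["C", "D"])
    return list_input


original = ["A", "B"]
rebound = rebind(original)
after_rebind = list(original)  # snapshot of the caller's list after rebind()
mutated = mutate(original)

print(after_rebind)  # the caller's list was not changed by rebind()
print(original)      # the caller's list was changed by mutate()
```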

  16. Solution Strategy • Research the http:// location (URL) of each database • Write Python Code that executes once for each month t from the sample period • Generate strings for the webpage and local disk file for month t • Open the web page • Create a local disk file • Read the web page and save it in the local file

  17. Main Loop Then Calls a Function
      month_string = make_two_dgt_string(month)

      import fpformat  # fpformat formats floating point numbers into strings
      def make_two_dgt_string(n):
          #takes a number and adds a leading zero if the number is less than 10
          #assumes that the input number is < 100
          if n > 9:  #check whether we need to pad the date with a leading zero
              n_string = fpformat.fix(n,0)  #if we don't need to pad, convert the number directly to a string
          else:  #pad low numbers with a leading zero
              n_string = "0"+fpformat.fix(n,0)  #otherwise convert to string and add a leading zero to the string.
          return n_string  #either way, return the results.
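
The fpformat module was removed in Python 3, so anyone rerunning this today needs a replacement. One possible sketch, keeping the slide's function name and its assumption that the input is a whole number below 100, uses ordinary string formatting:

```python
def make_two_dgt_string(n):
    # "%02d" formats an integer with at least two digits, padding with a zero.
    # Same assumption as the slide: 0 <= n < 100.
    return "%02d" % n


# str.zfill gives the same result for non-negative integers.
alternative = str(7).zfill(2)

print(make_two_dgt_string(3), make_two_dgt_string(12), alternative)
```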

  18. Main Loop then creates strings and calls more functions
      #now, for each month in the sample, request a price data file
      #generate caiso URL
      load_url = "http://oasis.caiso.com/…&dstartdate="+fpformat.fix(year,0)+month_string…
      #generate file name for my hard disk
      load_file_name = "caiso_price_"+fpformat.fix(year,0)+"-"+month_string+"-"+"1-"+fpformat.fix(end_date,0)+".zip"
      #download and save the requested files.
      get_save_file(load_url,load_file_name)
      #continue looping until we go through every month in the sample...
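
The slide does not show where end_date comes from. The Python 3 sketch below assumes it is the last day of the month and takes it from the standard calendar module, which also handles February 2004 (a leap year, where the month_length list on slide 12 would give 28). The function name build_file_name is hypothetical, and the URL is not reconstructed because the slide elides it.

```python
import calendar


def build_file_name(year, month):
    # calendar.monthrange returns (weekday of the 1st, number of days in month).
    # ASSUMPTION: end_date on the slide is the last day of the month.
    end_date = calendar.monthrange(year, month)[1]
    # Same pattern as the slide: caiso_price_YYYY-MM-1-END.zip
    return "caiso_price_%d-%02d-1-%d.zip" % (year, month, end_date)


print(build_file_name(2001, 4))
print(build_file_name(2004, 2))
```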

  19. Solution Strategy • We have: • Researched the http:// location (URL) of each database • Written Python Code that executes once for each month t from the sample period • Generated strings for the webpage and local disk file for month t • We’ve called but not seen the code that: • opens the web page • creates a local disk file • Reads the web page and saves it in the local file

  20. Connect to the webpage
      import urllib  #provides urlopen for reading web pages
      def get_save_file(url, file_name):
          #this function gets the file specified in URL from the web and then saves it in
          #location FILE_NAME
          #Designates the location in which to save the file
          path = "C:\\rjl\\ca_amp\\download\\price\\"+file_name
          try:
              web_data = urllib.urlopen(url)  #attempt to create a shortcut / handle to the desired web page / web file
          except IOError, msg:
              print "didn't open URL %s: %s" % (url, str(msg))
              return  #nothing was opened, so there is nothing to save

  21. Creating and Using Objects • Many Python libraries are object oriented • An object bundles a kind of data with “member functions” for manipulating that data. • Steps: 1) create (“instantiate”) objects 2) use their functions.
      objectName = libName.constructor(initial values)
      objectName.doSomething(parameters)
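
As one concrete case of the two steps on slide 21, here is a Python 3 sketch using io.StringIO from the standard library, an in-memory object that behaves like a text file. The text it holds is made up for illustration.

```python
import io

# 1) Instantiate: the constructor takes the initial contents.
buffer = io.StringIO("first line\nsecond line\n")

# 2) Use the object's member functions to work with the data it holds.
first = buffer.readline()  # reads up to and including the first newline
rest = buffer.read()       # reads everything that remains
buffer.close()

print(repr(first), repr(rest))
```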

  22. Exceptions • try/except sequences handle routine problems like file-not-found errors ("exceptions") gracefully rather than ending the whole program.
      try:
          SomethingThatMightNotWork  #this will either work, or it will fail and generate an exception of some failureType
      except failureType1:
          {If we get failure type 1, do this and continue from here}
      • Dividing by zero or inverting a singular matrix might throw exceptions. • Limited goto statement: if there is an exception, the program stops executing the try block and jumps immediately to the next except statement that handles that error
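
The pattern on slide 22 can be sketched in Python 3 with two of the failures it mentions, dividing by zero and a missing file. The function names and the file path are hypothetical.

```python
def safe_divide(numerator, denominator):
    try:
        return numerator / denominator
    except ZeroDivisionError as msg:
        # Execution jumps here only if the division failed.
        print("could not divide: %s" % msg)
        return None


def read_first_line(path):
    try:
        with open(path) as f:
            return f.readline()
    except OSError as msg:  # in Python 3, IOError is an alias of OSError
        print("could not read %s: %s" % (path, msg))
        return None


print(safe_divide(6, 3))
print(safe_divide(1, 0))
print(read_first_line("/no/such/dir/no_such_file.txt"))
```

In both functions the program keeps running after the failure instead of stopping with a traceback.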

  23. Create a local file and save the downloaded page
          try:
              f = open(path, "wb")  #create a handle to a new file for "wb": _w_riting in _b_inary
              f.write(web_data.read())  #write into the new file the results from downloading the webpage
              f.close()  #complete writing process.
              print "saved %s" % path
          except IOError, msg:
              print "didn't save %s: %s" % (path, str(msg))
          return  #end the routine
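
Slides 20 and 23 together form one Python 2 function. For readers on Python 3, where urllib.urlopen moved to urllib.request.urlopen, a sketch of the same two steps might look like this. The download_dir parameter and the True/False return value are additions for illustration and are not in the original, which saves to a fixed Windows path and returns nothing.

```python
import os
import urllib.request


def get_save_file(url, file_name, download_dir="."):
    """Fetch url and save its bytes as file_name inside download_dir.

    Returns True on success and False if opening or saving failed.
    """
    path = os.path.join(download_dir, file_name)
    try:
        # Open a handle to the web page / web file and read all of it.
        with urllib.request.urlopen(url) as web_data:
            contents = web_data.read()
    except OSError as msg:  # urllib.error.URLError is a subclass of OSError
        print("didn't open URL %s: %s" % (url, msg))
        return False
    try:
        # "wb" opens the file for writing in binary, as on the slide.
        with open(path, "wb") as f:
            f.write(contents)
    except OSError as msg:
        print("didn't save %s: %s" % (path, msg))
        return False
    print("saved %s" % path)
    return True
```

The with blocks close the web handle and the file even if an error occurs partway through, which the explicit f.close() on the slide does not guarantee.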

  24. File Manipulation in Python Details on files: Python Tutorial Section 7.2

  25. Possible extensions • Unzip the files that we downloaded (easy?):
      import os
      os.system('unzip '+file_name)
      (See http://docs.python.org/lib/module-zipfile.html) • Test that downloaded data have expected characteristics (e.g. four fields per line) using regular expressions • Read in and manipulate the XML databases (harder?) • Enter these file names into a SAS or Stata import / analysis code and run SAS / Stata
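
The first two extensions on slide 25 could be sketched in Python 3 as below, using the zipfile module the slide links to instead of an external unzip program. The function names are hypothetical, and the "four comma-separated fields" layout is only an assumption, since the slides do not show the downloaded file format.

```python
import re
import zipfile


def unzip_file(zip_path, target_dir):
    """Extract every member of zip_path into target_dir; return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# ASSUMED layout: exactly four non-empty, comma-separated fields per line.
FOUR_FIELDS = re.compile(r"^[^,\n]+(,[^,\n]+){3}$")


def lines_with_wrong_field_count(text):
    """Return the lines of text that do not have exactly four fields."""
    return [line for line in text.splitlines() if not FOUR_FIELDS.match(line)]
```

A call such as lines_with_wrong_field_count(open(path).read()) returns an empty list when every line has the expected shape.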

  26. Python can do far more with webpages • Details on web: http://docs.python.org/lib/module-urllib.html • Its sample programs include: • Webchecker.py (checks for broken links on a website) • Websucker.py (downloads a whole website) • I found their code a bit hard to follow. • I used snippets of those programs as examples for this program
