1. Building Your Own Web Spider: Thoughts, Considerations and Problems
2. Who am I (a Computer Systems Technology graduate)
3. Why Discuss This?
4. What Will We Talk About?
5. Why Build a Spider?
6. Current Products
7. Design Considerations, aka ‘Spider Do’s and Don’ts’
8. Do’s and Don’ts #2
9. Do’s and Don’ts #3 & #4
10. Do’s and Don’ts #5 & #6
11. Hurdles
12. Hurdles #2
13. Hurdles #3
14. Simple Spider Sample
15. Simple Spider Sample Continued

```python
# Python 2. These imports (and the RECURSION_LEVEL constant used on the
# next slide) presumably appeared on the earlier "Simple Spider Sample"
# slide, whose body is not captured here.
import re
import sys
import urllib
import urlparse

def getLinks(start_page, page_data):
    url_list = []
    anchor_href_regex = '<\s*a\s*href\s*=\s*[\x27\x22]?([a-zA-Z0-9:/\\\\._-]*)[\x27\x22]?\s*'
    urls = re.findall(anchor_href_regex, page_data)
    for url in urls:
        # Resolve each extracted href against the page it was found on.
        url_list.append(urlparse.urljoin(start_page, url))
    return url_list

def getPage(url):
    page_data = urllib.urlopen(url).read()
    return page_data
```
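For readers on Python 3, the two helpers above can be sketched as follows. The pattern is the slide's regex rewritten as a raw string; note that it only handles bare href values (no query strings, spaces, or extra attributes), and the example HTML and URLs are made up for illustration.

```python
import re
import urllib.request
from urllib.parse import urljoin  # urlparse.urljoin moved here in Python 3

# Same pattern as the slide's anchor_href_regex, as a raw string.
ANCHOR_HREF_REGEX = r'<\s*a\s*href\s*=\s*[\'"]?([a-zA-Z0-9:/\\._-]*)[\'"]?'

def get_links(start_page, page_data):
    # Resolve every extracted href against the page it was found on.
    return [urljoin(start_page, url)
            for url in re.findall(ANCHOR_HREF_REGEX, page_data)]

def get_page(url):
    # urllib.urlopen became urllib.request.urlopen in Python 3;
    # decode the raw bytes to text so the regex can run over them.
    return urllib.request.urlopen(url).read().decode('utf-8', 'replace')

html = '<a href="/about.html">About</a> <a href="http://example.com/x">X</a>'
print(get_links('http://example.com/index.html', html))
# -> ['http://example.com/about.html', 'http://example.com/x']
```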
16. Simple Spider Sample Continued (2)

```python
# Python 2. RECURSION_LEVEL (the crawl depth) is assumed to be defined
# on an earlier slide, e.g. RECURSION_LEVEL = 2.
if __name__ == '__main__':
    end_results = []
    recursion_count = 0
    try:
        page_array = [sys.argv[1]]
    except IndexError:
        print 'Please provide a valid url.'
        sys.exit()
    while recursion_count < RECURSION_LEVEL:
        results = []
        for current_page in page_array:
            page_data = getPage(current_page)
            link_list = getLinks(current_page, page_data)
            for item in link_list:
                # Only keep links whose URL contains the current page's URL.
                if item.find(current_page) != -1:
                    results.append(item)
        results = list(set(results))
        page_array = results
        end_results += results
        end_results = list(set(end_results))
        recursion_count += 1
    for item in end_results:
        print item
```
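The crawl loop above is a breadth-first traversal bounded by a depth constant. A minimal Python 3 sketch of the same loop is below, with two assumptions: the fetch function is injected so the loop can be exercised against an in-memory "site" instead of the network, and the slide's substring filter (`item.find(current_page) != -1`) is replaced by a stricter same-site `startswith` check. `DEPTH_LIMIT`, `crawl`, and the example URLs are all hypothetical names.

```python
import re
from urllib.parse import urljoin

# Stand-ins for the slide's regex and its (unshown) RECURSION_LEVEL.
ANCHOR_HREF_REGEX = r'<\s*a\s*href\s*=\s*[\'"]?([a-zA-Z0-9:/\\._-]*)[\'"]?'
DEPTH_LIMIT = 2

def crawl(start_url, fetch, depth=DEPTH_LIMIT):
    """Breadth-first crawl of pages under start_url.

    fetch(url) -> html string is injected, so tests need no network.
    """
    frontier = [start_url]
    seen = set()
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            for href in re.findall(ANCHOR_HREF_REGEX, fetch(page)):
                link = urljoin(page, href)
                # Same-site filter, a stricter stand-in for the slide's
                # substring check; the set also deduplicates as we go.
                if link.startswith(start_url) and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return sorted(seen)

# Exercise the loop against a tiny in-memory "site".
site = {
    'http://example.com/': '<a href="a.html">A</a>',
    'http://example.com/a.html': '<a href="b.html">B</a>',
    'http://example.com/b.html': 'no links here',
}
print(crawl('http://example.com/', site.get))
# -> ['http://example.com/a.html', 'http://example.com/b.html']
```

Injecting `fetch` also makes it easy to later swap in throttling or a robots.txt check without touching the traversal logic, which relates to the do's and don'ts discussed earlier in the deck.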
17. Q & A