Network Programming, Kansas State University at Salina. Parsing HTML: Topic 3, Chapter 7
Picking information from an HTML page
• A difficult problem
• HTML defines page layout, not content; XML has the advantage here
• Very useful because of the volume of data available
• If the format of the page changes, your program breaks.
HTML
• Definition: a token is one piece of information in an HTML-formatted page
  • HTML tag (usually relates only to formatting)
  • URL or image reference
  • Textual information
• Must look at several tokens to determine the context of the data
• The start-tag/end-tag structure leads parsing code to use finite state machines and stacks ( <TABLE> … </TABLE> )
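The stack idea above can be sketched as follows. This is a minimal illustration, assuming Python 3, where the standard module is named html.parser rather than HTMLParser. The class name NestingTracker, the VOID_TAGS set, and the sample markup are hypothetical and not taken from the course text.

```python
from html.parser import HTMLParser  # Python 3 name; Python 2 uses the HTMLParser module


class NestingTracker(HTMLParser):
    """Hypothetical helper: records the open-tag stack seen at each text token."""

    # Assumed list of tags that never take an end-tag, so they are not pushed.
    VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        HTMLParser.__init__(self)
        self.stack = []      # currently open tags, outermost first
        self.contexts = []   # (path, text) pairs collected while parsing

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start-tag; ignore stray end-tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.contexts.append(("/".join(self.stack), text))


tracker = NestingTracker()
tracker.feed("<table><tr><td><h1>Tim Bower</h1></td></tr></table>")
for path, text in tracker.contexts:
    print(path, "->", text)
```

The stack supplies the context for each piece of text: the same words mean something different inside a title than inside a table cell.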
Tokens

<HTML>
<HEAD>
<TITLE> Tim Bower </TITLE>
</HEAD>
<BODY BGCOLOR="lightyellow">
<TABLE>
<TR>
<TD>
<H1>Tim Bower</H1>

{'data': [], 'type': 'StartTag', 'name': u'html'}
{'data': [], 'type': 'StartTag', 'name': u'head'}
{'data': u'\n ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'title'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'title'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'head'}
{'data': u'\n\n', 'type': 'SpaceCharacters'}
{'data': [(u'bgcolor', u'lightyellow')], 'type': 'StartTag', 'name': u'body'}
{'data': u' \n\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'table'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'tbody'}
{'data': [], 'type': 'StartTag', 'name': u'tr'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'td'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'h1'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': [], 'type': 'EndTag', 'name': u'h1'}
Two main programming strategies
• The call-back approach (HTMLParser, shown in the textbook)
  • Define your own class that extends the HTMLParser class
  • Nice use of inheritance and polymorphism
  • Pass the HTML page to the parser, and it calls methods from your class as needed to process start-tags, data elements, end-tags, and a few other miscellaneous tags.
• The document tree approach
  • The parser builds a tree (a data structure object) based on the page contents
  • You iterate through the tree, or through a list of tokens taken from the tree, looking for the desired data.
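HTMLParser

# Python 2 (in Python 3 the module is named html.parser)
import sys
from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        self.title = ''
        self.readingtitle = 0       # 1 while inside <title> ... </title>
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.readingtitle = 1

    def handle_data(self, data):
        # Collect text only while inside the title element
        if self.readingtitle:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            print "*** %s ***" % self.title
            self.readingtitle = 0

# The HTML file to parse is named on the command line
fd = open(sys.argv[1])
tp = TitleParser()
tp.feed(fd.read())
fd.close()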
Argh! HTMLParser is fragile and hard to debug.

Traceback (most recent call last):
  File "C:\Users\tim\Documents\Classes\Net_Programming\Source_code\Topic 3 - Web\weatherParser.py", line 258, in <module>
    parser.feed(data)
  File "C:\Python25\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python25\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python25\lib\HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "C:\Python25\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "C:\Python25\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 120, column 477
html5lib
• Found on the Python Package Index
• Install setuptools, then use Python to install html5lib (see the README file). Both are on K-State Online.
• Advantages:
  • Robust, standards-based parser
  • Filtering data after the page is parsed is easier to follow and debug than the call-back approach
• Disadvantage:
  • The API for traversing the tree is sparsely documented
html5lib Usage

import html5lib
from html5lib import treebuilders, treewalkers

# Build the tree:
p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
f = open("weather.html", "r")
dom_tree = p.parse(f)
f.close()

# Loop through tokens:
walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)

# Tags to skip over when printing
passtags = [u'a', u'h1', u'h2', u'h3', u'h4', u'em',
            u'strong', u'br', u'img',
            u'dl', u'dt', u'dd']

for token in stream:
    # Don't show uninteresting stuff
    if token.has_key('name'):
        if token['name'] in passtags:
            continue
    print token
The DOM tree alternative
• The DOM tree may be used directly.
• Not documented with html5lib, but the xml.dom package is standard with Python.
• DOM trees are normally used with XML, but html5lib can make a DOM tree from HTML.
• Walk through the tree by examining the child nodes of each node. With knowledge of the page structure, you may be able to go almost directly to the desired information.
• See Chapter 8 and the posted file DOMtry.py.
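Walking child nodes can be sketched with the standard xml.dom.minidom module. This is an illustration only, assuming Python 3 and a small, well-formed sample page; minidom accepts only well-formed XML, whereas html5lib can build a DOM tree from real-world HTML. The helper name find_text and the PAGE string are hypothetical and are not the contents of DOMtry.py.

```python
from xml.dom import minidom

# Assumption: a tiny, well-formed stand-in for a real page.
PAGE = """<html><head><title>Tim Bower</title></head>
<body><table><tr><td><h1>Tim Bower</h1></td></tr></table></body></html>"""


def find_text(node, tag_name):
    """Hypothetical helper: collect the text inside every <tag_name> element."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            if child.tagName == tag_name:
                # Join the text nodes directly under the matching element.
                text = "".join(
                    c.data for c in child.childNodes if c.nodeType == c.TEXT_NODE
                )
                found.append(text.strip())
            # Recurse so nested elements are examined too.
            found.extend(find_text(child, tag_name))
    return found


dom_tree = minidom.parseString(PAGE)
print(find_text(dom_tree, "h1"))

# With knowledge of the page structure, go almost directly to the data.
body = dom_tree.documentElement.getElementsByTagName("body")[0]
first_cell = body.getElementsByTagName("td")[0]
print(first_cell.firstChild.tagName)
```

The recursive walk works on any page layout, while the getElementsByTagName shortcut is shorter but depends on the page structure staying the same.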
html5lib tokens
• The stream of tokens is iterated like a list
• Each token is a dictionary
  • token['data'] is one of:
    • Unicode string
    • Empty list
    • List of tuples for formatting attributes
  • token['type']: StartTag, EndTag, Characters, or SpaceCharacters
  • token['name']: name of the start or end tag (table, tr, td, h1, br, ul, li, …)
• See the example of tokens on the earlier Tokens slide
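The three keys can be exercised on hand-written tokens, with no parser involved. This is a sketch assuming Python 3 and the token shape shown on these slides; other html5lib releases may represent attributes differently. The describe function and the sample list are hypothetical.

```python
# Hand-written sample tokens in the shape listed above; a real stream would
# come from an html5lib tree walker.
tokens = [
    {"type": "StartTag", "name": "body", "data": [("bgcolor", "lightyellow")]},
    {"type": "SpaceCharacters", "data": "\n"},
    {"type": "StartTag", "name": "h1", "data": []},
    {"type": "Characters", "data": "Tim Bower"},
    {"type": "EndTag", "name": "h1", "data": []},
]


def describe(token):
    """Hypothetical helper: summarize one token based on its 'type' key."""
    kind = token["type"]
    if kind == "StartTag":
        # For start tags, 'data' is a list of (attribute, value) tuples.
        attrs = dict(token["data"])
        return "start %s %r" % (token["name"], attrs)
    if kind == "EndTag":
        return "end %s" % token["name"]
    if kind == "Characters":
        # For text tokens, 'data' is the string itself.
        return "text %r" % token["data"]
    return None  # SpaceCharacters and anything else are skipped


for token in tokens:
    summary = describe(token)
    if summary is not None:
        print(summary)
```

Check token['type'] first: only start and end tags carry a 'name' key, and the meaning of 'data' depends on the type.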
html5lib token parsing

doingTitle = False              # True while inside <title> ... </title>
for token in stream:
    if token.has_key('name'):
        if token['name'] in passtags:
            continue            # skip tags we are not interested in
        else:
            tName = token['name']
    tType = token['type']
    if tType == 'StartTag':
        if tName == u'title':
            title = ''
            doingTitle = True
    if tType == 'EndTag':
        if tName == u'title':
            print "*** %s ***\n" % title
            doingTitle = False
    if tType == 'Characters':
        # Text tokens have no 'name'; collect them only inside the title
        if doingTitle:
            title += token['data']