Thursday, May 10, 2012

Bonus Question Blog!


First thoughts: let's step through the program execution.

This trace follows the program in general terms. Whenever I go down a path, it is because the initializer was explicitly called (I don't say so every time, for the sake of space).
  1. Python runs the '__main__' block and calls the HtmlTreeViewer function.
  2. This creates an HtmlTreeParser with "htmlString" as a parameter. One of the arguments is formatter.NullFormatter(), which makes for some interesting reading; it is found in /usr/share/jython/Lib/formatter.py.
    1. HtmlTreeParser is a subclass of htmllib.HTMLParser
      1. Inside htmllib is a class HTMLParser which in turn is a subclass of sgmllib.SGMLParser
        1. In the file sgmllib.py there is SGMLParser, which calls markupbase.ParserBase.reset(self).
    2. I haven't found where HtmlTagTreeModel() is defined yet, but I don't think it is vital to the error at this point.
    3. Call the function feed(initialText), which is part of sgmllib.py.
      1. feed takes in the raw data and begins to process it. On line 138 it calls self.parse_endtag(i), where i=4279 on the error-producing run.
        1. parse_endtag runs along, does its thing, and calls finish_endtag with the tag name it found at offset 4279.
          1. finish_endtag executes the 'else' branch.
          2. "if tag not in self.stack" evaluates to True.
            1. tag='p' and self.stack = ['html', 'body']
            2. The try statement raises the AttributeError and the program calls unknown_endtag. In sgmllib that method is just 'pass', so the call should go all the way up to our code!
  3. Our code gets a call to unknown_endtag, which just passes the buck to endtag().
  4. endtag() runs through all of the tags in the tagStack and checks for a match. If there is no match, it 'pushes' the popped tag back onto the previously popped tag until a match is found. But for our 'p', no match is found, and we cannot pop yet another tag (we are popping from the previously popped tag, i.e. the last one in the tagStack). Note: I am saying "pushing" and "popping", but the code uses "addChildren" etc.
  5. So, because that last tag does not have another child, in the quest to find the end tag for 'p' we end up trying to access self.children on a child that does not exist.
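
The dispatch traced in step 2.3 can be sketched in a few lines. This is a simplified model written for illustration, not the real sgmllib source: the class names MiniParser and TreeParser and the calls log are made up, and the subclass only records the call instead of walking a tag tree the way our endtag() does.

```python
class MiniParser(object):
    """Toy stand-in for the end-tag dispatch in the trace (not sgmllib itself)."""

    def __init__(self):
        self.stack = []   # names of the tags currently open
        self.calls = []   # hypothetical log so the path taken is visible

    def finish_endtag(self, tag):
        if tag not in self.stack:
            try:
                # Look for a dedicated handler such as end_p
                getattr(self, 'end_' + tag)
            except AttributeError:
                # No handler exists, so fall back to unknown_endtag
                self.unknown_endtag(tag)
            else:
                self.calls.append(('unbalanced', tag))
            return
        # Matched case: pop everything opened after the matching tag
        while self.stack.pop() != tag:
            pass
        self.calls.append(('closed', tag))

    def unknown_endtag(self, tag):
        pass  # the base class ignores unknown end tags


class TreeParser(MiniParser):
    """Hypothetical subclass standing in for our override."""

    def unknown_endtag(self, tag):
        self.calls.append(('unknown_endtag', tag))


parser = TreeParser()
parser.stack = ['html', 'body']   # the stack seen on the failing run
parser.finish_endtag('p')         # 'p' was never pushed
print(parser.calls)               # [('unknown_endtag', 'p')]
```

The point of the model is that the stack is left untouched and control lands in the subclass override, which matches what the trace shows just before our endtag() runs.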
Future Work
  • Find out how to fix it (obviously!)
  • Figure out why 'p' is giving us so much trouble
    • When I open the source for www.jython.org in a simple editor, the first </p> end tag (the one we are looking for, I think) is shown in red (why?)
  • Analyze how tags are added to the tagStack and why our 'p' doesn't have a corresponding one.
  • Determine where the fix should be
    • starttag
    • unknown_endtag
    • other?
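
One way to start on the 'p' question is to check whether the page source itself has an end tag with no matching start tag, independently of our parser. Below is a rough sketch under loud assumptions: the function name find_unmatched_endtags and the regular expression are mine, the scan ignores comments, scripts, and tags that never get closed (such as br), and it does not reproduce how sgmllib decides what goes on its stack.

```python
import re

# Rough tag matcher: group 1 is '/' for end tags, group 2 is the tag name
TAG_RE = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>')


def find_unmatched_endtags(html):
    """Return (offset, tag) for each end tag with no open tag on the stack."""
    stack = []
    unmatched = []
    for m in TAG_RE.finditer(html):
        is_end = m.group(1) == '/'
        tag = m.group(2).lower()
        if not is_end:
            # Self-closing tags like <br/> are never pushed
            if not m.group(0).endswith('/>'):
                stack.append(tag)
        elif tag in stack:
            # Pop back to the matching open tag
            while stack.pop() != tag:
                pass
        else:
            unmatched.append((m.start(), tag))
    return unmatched


sample = '<html><body>text</p></body></html>'
print(find_unmatched_endtags(sample))   # [(16, 'p')]
```

If this reports nothing for the saved page source around offset 4279, the document is balanced at that point and the missing 'p' is more likely a matter of how tags get onto the tagStack than of the HTML itself.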
