Wednesday, May 23, 2012

Aha!
 
The last post was a bit off the mark.  While stepping through the sequence of the HtmlTreeViewer was educational, it did not tell us how to fix the problem.  As I have since discovered, there are actually two problems:
 
  1. HtmlTreeViewer does not properly parse the HTML for the page www.jython.com, so it will presumably fail on other pages as well.
  2. For sites that do load (www.google.com, for example), clicking on a leaf node produces an error message, and the program does nothing further.
 
So, let's start with the first problem.  The motivation for this solution comes from this enlightening paragraph in section 9.4 of Jython Essentials:
 
If HTML were perfectly consistent and all tags had separate starts and ends, all the endtag method would have to do is pop the stack. However, a number of HTML tags, such as img or br, don’t have end tags. Ignoring that fact leads to an unbalanced tree, and to tags that are still in the stack at the end of the program. Example 9-5 handles the case pretty simply. The while loop continually pops tags off the stack until the start tag corresponding to the end tag is reached. Any children accumulated along the way are “rolled up” into the parent so that when the start tag is finally reached, it has all the tags beneath it as children, as you would expect. You can see in Figure 9-4, for instance, that the single tags <br> and <p> are both sibling children of the <td> tag—without this loop, <p> would be considered a child of <br>.
 
Clearly, the source of www.jython.com is not perfectly consistent.  So when we reach the end tag that produces the error (the </p> from the earlier blog post), we end up popping every tag off of the tag stack and then try to pop one more from the empty stack.  This occurs because the start tag <p> had already been "rolled up," as Jython Essentials describes.
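To make the failure concrete, here is a minimal sketch of what goes wrong. It is not the book's Example 9-5: the function name naive_endtag is hypothetical, and tags are represented as plain strings rather than the parser's tag objects. It only assumes that the tag stack is an ordinary Python list.

```python
def naive_endtag(tag_stack, tag):
    """Pop until the matching start tag is found, with no guard for an empty stack."""
    while True:
        if tag_stack.pop() == tag:
            return


# Hypothetical stack contents: the <p> start tag is no longer on the stack.
stack = ["html", "body", "div"]

try:
    naive_endtag(stack, "p")
    outcome = "matched"
except IndexError:
    # list.pop() on an empty list raises IndexError
    outcome = "IndexError"

print(outcome)
print(stack)
```

Every tag gets popped while searching for a <p> that is not there, and the next pop on the now-empty list raises the exception.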
 
To fix this problem, rewrite endtag in HtmlTreeParser as follows:
 
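    def endtag(self, tag):
        poppedTags = []
        while True:
            try:
                poppedTag = self.tagStack.pop()
            except IndexError:
                # The stack is empty, so the matching start tag was rolled up
                # earlier.  Restore the popped tags in their original order
                # and ignore this end tag.
                while poppedTags:
                    self.tagStack.append(poppedTags.pop())
                return
            if poppedTag.tagString == tag:
                # Found the start tag: roll up the children of every tag
                # popped along the way into that tag's parent.
                for each in poppedTags:
                    each.myParent.addChildren(each.children)
                    each.children = []
                return
            poppedTags.append(poppedTag)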
                 
Now, when we are processing an endtag "tag," we pop an element off of self.tagStack.  If this element is not "tag," then we simply keep track of it in our new list "poppedTags."   Suppose now that we eventually pop an element that is "tag."  Then, we are done popping new tags, and for each element stored in "poppedTags" we must add to the element's parent the element's children.  This is the "rolling up" talked about in the text.  
 
Problem 1 occurred because the start tag for </p> had been rolled up earlier and could not be found in this way.  Thus, we handle this exception (eventually we pop everything, and then try to pop the empty tagStack) by assuming the start tag was rolled up, putting the popped tags back on the stack, and simply ignoring the end tag that was found.  Not necessarily an ideal solution, but it is effective.  Since www.jython.com now loads, the error introduced by ignoring an unmatched end tag appears to be minimal, perhaps even negligible.
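The two cases described above can be tried outside the viewer with a small standalone sketch. The Tag class here is a hypothetical stand-in for the parser's tag objects (it only assumes the attribute names tagString, myParent, children and the method addChildren seen in the code), and endtag is written as a plain function that checks for an empty stack instead of catching the exception, which gives the same outcome.

```python
class Tag(object):
    """Hypothetical stand-in for the parser's tag nodes."""

    def __init__(self, tagString, myParent=None):
        self.tagString = tagString
        self.myParent = myParent
        self.children = []
        if myParent is not None:
            myParent.children.append(self)

    def addChildren(self, newChildren):
        for child in newChildren:
            child.myParent = self
        self.children.extend(newChildren)


def endtag(tagStack, tag):
    """Return True if the start tag was found, False if the end tag was ignored."""
    poppedTags = []
    while tagStack:
        poppedTag = tagStack.pop()
        if poppedTag.tagString == tag:
            # Roll up the children of each popped tag into its parent.
            for each in poppedTags:
                each.myParent.addChildren(each.children)
                each.children = []
            return True
        poppedTags.append(poppedTag)
    # No matching start tag: restore the stack in its original order.
    while poppedTags:
        tagStack.append(poppedTags.pop())
    return False


# Case 1: <td><br><p></td>, where each start tag became a child of the previous one.
td = Tag("td")
br = Tag("br", td)
p = Tag("p", br)
found = endtag([td, br, p], "td")
td_children = [child.tagString for child in td.children]

# Case 2: a stray </p> whose start tag is not on the stack.
html = Tag("html")
body = Tag("body", html)
stray_stack = [html, body]
ignored = not endtag(stray_stack, "p")
```

In the first case <br> and <p> end up as sibling children of <td>, as in the book's Figure 9-4. In the second case the stack is left exactly as it was and the end tag is dropped.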
 
Problem 2 is much simpler.  When we click on a node that does not have any children, we get an error message (an AttributeError) and the program fails.  If we carefully examine the given "isLeaf" function in HtmlTagTreeModel, we notice the statement
 
return (type(node) == type("")) or (len(node.children) == 0)
 
which ought to be true if "node" has no children (if you have none, the length is 0).  But as written, a node that has no "children" attribute at all does not return True; evaluating node.children raises an AttributeError instead.  Thus, rewrite it as:
 
    def isLeaf(self, node):
        try:
            children = node.children
        except AttributeError:
            # No "children" attribute at all: treat the node as a leaf.
            return True
        return (type(node) == type("")) or (len(children) == 0)
 
and problem solved!
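As a quick check outside the viewer, the same idea can be sketched as a standalone function. The Node class and the name is_leaf are hypothetical stand-ins, not part of HtmlTagTreeModel; the only assumption is that a non-leaf node keeps its children in a list attribute called children.

```python
class Node(object):
    """Hypothetical tree node that keeps its children in a list."""

    def __init__(self, children):
        self.children = children


def is_leaf(node):
    try:
        children = node.children
    except AttributeError:
        # Objects without a "children" attribute (strings, numbers) are leaves.
        return True
    return len(children) == 0


results = [
    is_leaf(Node([])),            # node with an empty child list
    is_leaf(Node([Node([])])),    # node with one child
    is_leaf("some text"),         # plain string, no "children" attribute
    is_leaf(42),                  # any other object without "children"
]
print(results)  # [True, False, True, True]
```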
 
 I hope this helps!
Happy Coding!
 
