Aha!
The last post was a bit off the mark. While stepping through the sequence of the HtmlTreeViewer was educational, it did not tell us how to fix the problem. Actually, as I have since discovered, there are two problems:
- HtmlTreeViewer is not properly parsing the HTML for the page www.jython.com, and we can conclude it will have errors on others as well.
- For the sites that do load, www.google.com for example, when clicking on a leaf node, we get an error message and the program fails to do anything else.
So, let's start with the first problem. The motivation for this solution comes from this enlightening paragraph in section 9.4 of Jython Essentials:
If HTML were perfectly consistent and all tags had separate starts
and ends, all the endtag method would have to do is
pop the stack. However, a number of HTML tags, such as
img or br, don’t have end tags.
Ignoring that fact leads to an unbalanced tree, and to tags that are
still in the stack at the end of the program. Example 9-5 handles the case
pretty simply. The while loop continually pops tags
off the stack until the start tag corresponding to the end tag is
reached. Any children accumulated along the way are “rolled up” into the
parent so that when the start tag is finally reached, it has all the
tags beneath it as children, as you would expect. You can see in Figure 9-4, for instance, that the single tags
<br> and <p> are both
sibling children of the <td> tag—without this
loop, <p> would be considered a child of
<br>.
Clearly, www.jython.com's source is not perfectly consistent. So when we come to the error-producing end tag (the </p> from the earlier blog post), we end up popping every tag off the tag stack and then try to pop once more from the empty stack. This occurs because the start tag <p> was "rolled up" earlier, as Jython Essentials indicates.
To fix this problem, rewrite endtag in HtmlTreeParser as:
    def endtag(self, tag):
        poppedTags = []
        while True:
            try:
                poppedTag = self.tagStack.pop()
            except IndexError:
                # The stack ran out before the start tag was found. Assume it
                # was rolled up earlier: restore the popped tags in their
                # original order and ignore this end tag.
                while poppedTags:
                    self.tagStack.append(poppedTags.pop())
                break
            if poppedTag.tagString == tag:
                # Found the matching start tag: roll the children of every
                # tag popped along the way up into that tag's parent.
                for each in poppedTags:
                    each.myParent.addChildren(each.children)
                    each.children = []
                break
            poppedTags.append(poppedTag)
Now, when we are processing an endtag "tag," we pop an element off of self.tagStack. If this element is not "tag," then we simply keep track of it in our new list "poppedTags." Suppose now that we eventually pop an element that is "tag." Then, we are done popping new tags, and for each element stored in "poppedTags" we must add to the element's parent the element's children. This is the "rolling up" talked about in the text.
Problem 1 occurred because the start tag for </p> had been rolled up earlier and could not be found in this way. Thus, we handle this exception (eventually we pop everything, and try to pop the empty tagStack) by assuming the start tag was rolled up, putting the popped tags back on the stack, and simply ignoring the end tag that was found. Not necessarily an ideal solution, but it is effective. Since www.jython.com now loads, the error from not finding the start tag for this end tag must be minimal, and even negligible.
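To see the two cases side by side, here is a minimal sketch in plain Python. The classes Tag and MiniParser are hypothetical stand-ins, not the book's code; only the names tagString, children, myParent, addChildren, and tagStack come from the post. The starttag method and the True/False return value are assumptions added purely so the behavior can be checked.

```python
class Tag:
    """Hypothetical stand-in for the parser's tag objects."""
    def __init__(self, tagString, myParent=None):
        self.tagString = tagString
        self.myParent = myParent
        self.children = []

    def addChildren(self, newChildren):
        self.children.extend(newChildren)


class MiniParser:
    """Hypothetical stand-in for HtmlTreeParser; only the stack logic is modeled."""
    def __init__(self):
        self.tagStack = []

    def starttag(self, tag):
        # Assumption: a new tag becomes a child of whatever is on top of the stack.
        parent = None
        if self.tagStack:
            parent = self.tagStack[-1]
        node = Tag(tag, parent)
        if parent is not None:
            parent.children.append(node)
        self.tagStack.append(node)
        return node

    def endtag(self, tag):
        # Returns True if the start tag was found (return value added for the demo).
        poppedTags = []
        while True:
            try:
                poppedTag = self.tagStack.pop()
            except IndexError:
                # Start tag not on the stack: restore everything, ignore the end tag.
                while poppedTags:
                    self.tagStack.append(poppedTags.pop())
                return False
            if poppedTag.tagString == tag:
                # Roll up the children of each popped tag into its parent.
                for each in poppedTags:
                    each.myParent.addChildren(each.children)
                    each.children = []
                return True
            poppedTags.append(poppedTag)


def names(tags):
    return [t.tagString for t in tags]


# Case 1: <td><br><p></td> -- br and p end up as sibling children of td.
parser = MiniParser()
td = parser.starttag("td")
parser.starttag("br")
parser.starttag("p")
found = parser.endtag("td")
rolledUp = names(td.children)          # ['br', 'p']

# Case 2: a stray </p> with no <p> on the stack leaves the stack untouched.
parser2 = MiniParser()
parser2.starttag("html")
parser2.starttag("body")
ignored = parser2.endtag("p")
stackAfter = names(parser2.tagStack)   # ['html', 'body']
```

The second case is the one that used to crash: the stack is emptied, the extra pop raises IndexError, and the handler puts everything back as if the end tag had never been seen.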
Problem 2 is much simpler. When we click on a node that has no children attribute, an AttributeError is raised, we get an error message, and the program fails. If we carefully examine the given "isLeaf" function in HtmlTagTreeModel, we notice that the statement
    return (type(node) == type("")) or (len(node.children) == 0)
should be true whenever "node" has no children (if you have none, the length is 0). But when "node" has no children attribute at all, the statement does not return True; evaluating node.children raises an AttributeError instead. Thus, rewrite it to:
    def isLeaf(self, node):
        try:
            node.children
        except AttributeError:
            # No children attribute at all: treat the node as a leaf.
            return True
        return (type(node) == type("")) or (len(node.children) == 0)
and problem solved!
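For anyone who wants to check the difference outside the viewer, here is a small sketch in plain Python. Node and NoChildren are hypothetical stand-ins for whatever objects the tree model hands to isLeaf, and the two functions are written without self so they can be called directly.

```python
class Node:
    """Hypothetical stand-in for a parsed tag node."""
    def __init__(self, children):
        self.children = children


class NoChildren:
    """Hypothetical object that lacks a children attribute."""
    pass


def isLeafOriginal(node):
    return (type(node) == type("")) or (len(node.children) == 0)


def isLeafFixed(node):
    try:
        node.children
    except AttributeError:
        # No children attribute at all: treat the node as a leaf.
        return True
    return (type(node) == type("")) or (len(node.children) == 0)


# The original version raises instead of answering for such an object.
try:
    isLeafOriginal(NoChildren())
    originalFailed = False
except AttributeError:
    originalFailed = True

leafNoAttr = isLeafFixed(NoChildren())        # True
leafEmpty = isLeafFixed(Node([]))             # True
leafParent = isLeafFixed(Node([Node([])]))    # False
```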
I hope this helps!
Happy Coding!