Wednesday, May 23, 2012

Final Working Code for HtmlTreeViewer.py

import htmllib
import formatter
import javax.swing as swing
import java
import urllib

def count(f):
    f.counter = 0
    def real_f(x):
        f.counter +=1
        real_f.counter = f.counter
        return f(x)
    return real_f


class HtmlTreeParser(htmllib.HTMLParser):

    def __init__(self, initialText = ""):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter( ), 0)
        self.model = HtmlTagTreeModel( )
        self.tagStack = []
        self.feed(initialText)

    def currentTag(self):
        if self.tagStack:
            return self.tagStack[len(self.tagStack) - 1]

    def handle_starttag(self, tag, method, attrs):
        self.starttag(tag, attrs)

    def unknown_starttag(self, tag, attrs):
        self.starttag(tag, attrs)

    def starttag(self, tag, attrs):
        tag = Tag(tag, attrs)
        if not self.model.rootTag:
            self.model.rootTag = tag
        if self.tagStack:
            self.currentTag( ).addChild(tag)
        self.tagStack.append(tag)

    def handle_endtag(self, tag, method):
        self.endtag(tag)

    def unknown_endtag(self, tag):
        self.endtag(tag)

    def endtag(self, tag):
        poppedTags = []
        loop = True
        while loop:
            try:
                poppedTag = self.tagStack.pop( )
                if poppedTag.tagString == tag:
                    for each in poppedTags:
                        each.myParent.addChildren(each.children)
                        each.children = []
                    break
                else:
                    poppedTags.append(poppedTag)
            except:
                try:
                    while 1:
                        self.tagStack.append(poppedTags.pop())
                except:
                    loop = False
              
    def handle_data(self, data):
        data = data.strip( )
        if data:
            self.currentTag( ).children.append(data)

class Tag(java.lang.Object):

    def __init__(self, tagString, attrs=[]):
        self.tagString = tagString
        self.children = []
        self.data = ""
        self.arguments = attrs
        self.arguments.sort( )
        self.myParent = None

    def addChild(self, newChild):
        self.children.append(newChild)
        if isinstance(newChild, Tag):
            newChild.myParent = self

    def removeChild(self, oldKid):
        self.children.remove(oldKid)

    def removeChildren(self, children):
        for each in children:
            self.removeChild(each)

    def addChildren(self, children):
        for each in children:
            self.addChild(each)

    def argumentString(self):
        stringList = ["%s = %s" % (key, value)
                for key, value in self.arguments]
        return ', '.join(stringList)

    def toString(self):
        string = "<%s>" % self.tagString
        if self.arguments:
            string += self.argumentString( )
        return string

    def __str__(self):
        return toString(self)


class HtmlTagTreeModel(swing.tree.TreeModel):

    def __init__(self):
        self.rootTag = None

    def addTreeModelListener(self, listener): pass
    def removeTreeModelListener(self, listener): pass
    def valueForPathChanged(path, newValue): pass

    def getChild(self, parent, index):
        return parent.children[index]

    def getChildCount(self, parent):
        return len(parent.children)

    def getIndexOfChild(self, parent, child):
        return parent.children.index(child)

    def getRoot(self):
        return self.rootTag

    def isLeaf(self, node):
        try:
            node.children
        except(AttributeError):
            return True
        return (type(node) == type("")) or (len(node.children) == 0)


class HtmlTreeViewer(swing.JFrame):

    def __init__(self, htmlString="", url=""):
        swing.JFrame.__init__(self, title="HTML Source Browser",
                size=(400, 600))
        if url:
            connection = urllib.urlopen(url)
            htmlString = connection.read( )
        self.parser = HtmlTreeParser(htmlString)
        self.tree = swing.JTree(model=self.parser.model)
        scrollpane = swing.JScrollPane(self.tree)
        self.contentPane.add(scrollpane)
        self.defaultCloseOperation = swing.JFrame.EXIT_ON_CLOSE


if __name__ =="__main__":    HtmlTreeViewer(url="http://www.jython.com").show( )


Aha!
 
The last post was a bit off the mark.  While stepping through the sequence of the HtmlTreeViewer was educational, it did not tell us how to fix the problem.  Actually, as since discovered, there are two problems:
 
  1. HtmlTreeViewer is not properly parsing the HTML for the page www.jython.com, and we can conclude it will have errors on others as well.
  2. For the sites that do load, www.google.com for example, when clicking on a leaf node, we get an error message and the program fails to do anything else.
 
So, let's start with the first problem.  The motivation for this solution comes from this enlightening paragraph in section 9.4 of Jython Essentials:
 
If HTML were perfectly consistent and all tags had separate starts and ends, all the endtag method would have to do is pop the stack. However, a number of HTML tags, such as img or br, don’t have end tags. Ignoring that fact leads to an unbalanced tree, and to tags that are still in the stack at the end of the program. Example 9-5 handles the case pretty simply. The while loop continually pops tags off the stack until the start tag corresponding to the end tag is reached. Any children accumulated along the way are “rolled up” into the parent so that when the start tag is finally reached, it has all the tags beneath it as children, as you would expect. You can see in Figure 9-4, for instance, that the single tags <br> and <p> are both sibling children of the <td> tag—without this loop, <p> would be considered a child of <br>.
 
Clearly, www.jython.com's source is not perfectly consistent.  So when we come to the error producing tag to pop (the </p> from the earlier blog post) we end up popping all of the tags off of the tag stack and even try to pop another.  This occurs because the start tag <p> was "rolled up" as Jython Essentials indicates.
 
To fix this problem, in HtmlTreeParser re-write endtag to be:
 
def endtag(self, tag):
        poppedTags = []
        loop = True
        while loop:
            try:
                poppedTag = self.tagStack.pop( )
                if poppedTag.tagString == tag:
                    for each in poppedTags:
                        each.myParent.addChildren(each.children)
                        each.children = []
                    break
                else:
                    poppedTags.append(poppedTag)
            except:
                try:
                    while 1:
                        self.tagStack.append(poppedTags.pop())
                except:
                    loop = False
                 
Now, when we are processing an endtag "tag," we pop an element off of self.tagStack.  If this element is not "tag," then we simply keep track of it in our new list "poppedTags."   Suppose now that we eventually pop an element that is "tag."  Then, we are done popping new tags, and for each element stored in "poppedTags" we must add to the element's parent the element's children.  This is the "rolling up" talked about in the text.  
 
Problem 1 occured since the start tag to </p> had been rolled up earlier and could not be found in this way.  Thus, we handle this exception (eventually we pop everything, and try to pop the empty tagStack) by assuming the start tag was rolled up and simply ignoring the end tag that was found.  Not necessarily an ideal solution, but it is effective.  Since the source code for www.jython.com works, the error of not finding the start tag to this end tag must me minimal, and even negligible.  
 
Problem 2 is much simpler.  When we click on a node that does not have any children (an AttributeError) we get an error message and the program fails.  If we carefully examine the given "isLeaf" function in HtmlTagTreeModel, we notice that the statement
 
return (type(node) == type("")) or (len(node.children) == 0)
 
is vacuously true if "node" has no children (if you have none, the length is 0).  But as written, this will not return True, it will return an error.  Thus, rewrite to:
 
    def isLeaf(self, node):
        try:
            node.children
        except(AttributeError):
            return True
        return (type(node) == type("")) or (len(node.children) == 0)
 
and problem solved!
 
 I hope this helps!
Happy Coding!
 

Thursday, May 10, 2012

Bonus Question Blog!


First thoughts, let's step through the program execution:

This trace follows the program in general.  If I go down a path, it is because the initializer was explicitly called (I don't say it every time for the sake of space)
  1. python runs the '__main__'  block and calls the HtmlTreeViewer function
  2. Creates a HtmlTreeParser with "htmlString" as parameter.  One of the arguments is formatter.NullFormatter( ) which is some interesting reading found /usr/share/jython/Lib/formatter.py
    1. HtmlTreeParser is a subclass of htmllib.HTMLParser
      1. Inside htmllib is a class HTMLParser which in turn is a subclass of sgmllib.SGMLParser
        1. In the file sgmllib there is SGMLParser which calls markupbase.ParserBase.reset(self)
    2. I haven't found where  HtmlTagTreeModel() is located yet, but I don't think it is vital for the error at this point.
    3. Call function feed(initialText) which is a part of sgmllib.py
      1. feed goes along and takes in the raw data and begins to process it.  On line 138 we call self.parse_endtag(i), where i=4279 on the error producing run.
        1. Parse_endtag runs along, does its thing and calls finish_endtag, again passing the value 4279
          1. finish_endtag, executes the 'else' statement, 
          2. "if tag not in self.stack" returns true
            1. tag='p' self.stack = ['html', 'body']
            2. The try statement throws the AttributeError and the program calls unknown_endtag, which in sgmllib is passed, should be passed all the way up to our code!
  3. Our code gets a call to unknown_endtag which just passes the buck to endtag()
  4.  endtag() runs through all of the tags in the tagStack and checks for a match.  If no match it 'pushes' the popped tag back onto the previously popped tag until a match is found.  But for our 'p', no match is found and we cannot pop yet another (popping from the previously popped <i.e. the last> tag in the tagStack.  Note: I am saying "pushing" or "popping" but the code is "addChildren" etc.
  5. So, because that last tag does not have another child, in the quest to find the end tag for 'p', we attempt to access the empty children's self.children
Future Work
  • Find out how to fix it (obviously!)
  • Figure out why 'p' is giving us so much trouble
    • The source for www.jython.org, when I open it in just a simple editor, the first </p> end tag (the one we are looking for, I think) is in red (why?)  
  • Analyze how tags are added to the tagStack and why our 'p' doesn't have a corresponding one.
  •  Determine where the fix should be
    • starttag
    • unknown_endtag
    • other?
Welcome!
If you are here, it is probably for something specific.  Choose the appropriate link!

For the CS305 group page here is the Group page.

To see some thoughts on the bonus question, here is the blog

For my general help file, written directly in html (sorry for the ugliness!) here is the link to the public Dropbox file.  Note, not all of the links will work (particularly the top ones) since they link to files located in non-public Dropbox files on my local machine.