benlast ([info]benlast) wrote,
@ 2004-06-13 09:12:00
Current mood: bouncy
Current music:Bach - Cantatas

One Of These Things Is Not Like The Other
I've been playing with Leonard Richardson's useful BeautifulSoup module; a lazy, doesn't-care, do-it-anyway parser for HTML.  This is, in turn, because I've been trying to knock up a little lookup application that will do translations using the WordReference site, but without the ads,  popups, popunders or the hassle of clicking on forms.  I'm after something that I can drop a word on and have it translated.  Both HTMLParser and htmllib choked on the output from a page such as these definitions of 'cara', so I turned to BeautifulSoup.

Which did almost exactly what I wanted - it ate the HTML and built me an object tree that I could then walk, filtering out what I didn't need.  Unfortunately, it and I suffered from a small mismatch of worldview.  I use Unicode.  A lot.

I grabbed the webpage using urllib, something like:

import urllib

uo = urllib.FancyURLopener()
uo.addheader('Accept-charset', 'utf-8,*')

# Append the quoted word to the URL; passed as a second argument it would be sent as POST data.
f = uo.open("http://www.wordreference.com/es/en/translation.asp?spen=" + urllib.quote_plus('cara'))

# Decode the response so we have a unicode string; we always get iso-8859-1, no matter what we ask for.
response = f.read().decode('iso-8859-1')

# Finished with the request
f.close()


(It's actually a little more complex - you need to handle the character sets more flexibly, and override the user-agent so that WordReference doesn't block you).
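For the "more flexibly" part, here is a minimal sketch of one way to do it, written so that it also runs on current Python: read the charset from the Content-Type header and fall back to iso-8859-1 when the label is missing, unknown or wrong. The helper name decode_body and its parameters are made up for illustration; they are not part of urllib or of the real application.

```python
from email.message import Message


def decode_body(raw, content_type, fallback='iso-8859-1'):
    """Decode raw response bytes using the charset named in a Content-Type header.

    Hypothetical helper: the name and the fallback policy are assumptions.
    """
    # Let the stdlib header parser pull out the charset parameter, if any.
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset() or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that do not match the label.
        # iso-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode(fallback, 'replace')


labelled = decode_body(b'ma\xc3\xb1ana', 'text/html; charset=utf-8')
unlabelled = decode_body(b'ma\xf1ana', 'text/html')
```

Both calls give back the same Unicode text, whichever encoding the server actually used.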
Anyway, that gets me a Unicode string in response.  I can then pass it to a BeautifulSoup object, with:
soup = BeautifulSoup.BeautifulSoup()
soup.feed(response)

But... calling soup.first() (or a number of other functions) can raise the notorious UnicodeEncodeError.  Hmm.

It turns out that BeautifulSoup is, for want of a better term, Unicode-oblivious.  If you give it a Unicode data source all the internal strings get silently promoted, but there's no specific Unicode handling in there.  This is not a bad approach, and would work very well, if it weren't for the fact that the objects use str(), a lot.  Printing any BeautifulSoup instance invokes str() to return a string representation, which uses the default encoding, which is often 'ascii'.
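The failure is easy to reproduce without BeautifulSoup at all. The sketch below makes the ASCII conversion explicit, which is roughly what str() does behind the scenes on a Unicode string when the default encoding is 'ascii'; spelling it out with encode() is my own framing, chosen so the snippet also runs on current Python.

```python
text = 'ma\u00f1ana'  # "manana" with n-tilde: one character outside ASCII

try:
    # Strict conversion to ASCII, as an implicit str() would attempt.
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# The 'replace' error handler never raises; it substitutes '?' instead.
safe = text.encode('ascii', 'replace')
```

The strict conversion fails on the single non-ASCII character, while the 'replace' version loses that character but always returns something printable.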

Implicit in the design of BeautifulSoup is the assumption that str() is a good way to represent/return the "value" of an object.  For a Tag object, __repr__ calls __str__.  Given that the objects here are derived from a stream of characters, that's not unreasonable, but it misses the point that __str__() is usually supposed to return a printable representation, in the default character encoding.  When the value is Unicode that can't be encoded in the default encoding, that assumption breaks.

I think what would make more sense (from a Unicode point of view) would be to separate the value of the data from the representation of the data, so that one (for example) accessed the NavigableText.string data attribute (via a function wrapper) to get the value, but accepted that str() applied to an instance would do something like:
def __str__(self):
    """Return representation of self, omitting characters that can't be printed."""
    return self.string.encode(sys.getdefaultencoding(),'replace')
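To make the split concrete, here is a small sketch of the idea on a stand-in class. TextNode, value() and printable() are names I've invented for illustration; this is not BeautifulSoup's NavigableText, and it uses an ordinary method rather than __str__ so that it behaves the same on current Python, where __str__ has to return text rather than bytes.

```python
import sys


class TextNode(object):
    """Hypothetical stand-in for a text node; not BeautifulSoup's real class."""

    def __init__(self, string):
        # The value: the full Unicode text, stored untouched.
        self.string = string

    def value(self):
        """Return the value itself, with nothing lost."""
        return self.string

    def printable(self, encoding=None):
        """Return a representation that is safe to emit in the given encoding.

        Characters the encoding can't express are replaced with '?'.
        """
        encoding = encoding or sys.getdefaultencoding()
        return self.string.encode(encoding, 'replace')


node = TextNode('ma\u00f1ana')
exact = node.value()
shown = node.printable('ascii')
```

Code that needs the data asks for value(); code that only wants something to show on a console asks for printable() and accepts the loss.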


Value and representation.  Two things that can look the same, but aren't.

Oh, and I still like BeautifulSoup very much; so much so that I'm using it, with a patch to avoid the problem, submitted to Leonard.




Soup
[info]lordlaraby
2005-12-21 12:54 pm UTC (link)
I have trouble with navigating the tree after finding a navigabletext node. I get the notorious 'NavigableText' object has no attribute 'next'... When doing this:
>>> x = bs.firstText(lambda x: x.find('Member Last Name') != -1)
>>> x
'Member Last Name:'
>>> x.__class__
<class 'uglysoup.NavigableString'>
>>> x.firstNext('td')
Traceback...
I hope I didn't break it.
