benlast @ 2004-06-13 09:12:00
Current music: Bach - Cantatas
One Of These Things Is Not Like The Other
I've been playing with Leonard Richardson's useful BeautifulSoup module: a lazy, doesn't-care, do-it-anyway parser for HTML. This is, in turn, because I've been trying to knock up a little lookup application that will do translations using the WordReference site, but without the ads, popups, popunders or the hassle of clicking on forms. I'm after something that I can drop a word on and have it translated. Both HTMLParser and htmllib choked on the output from pages such as the definitions of 'cara', so I turned to BeautifulSoup.
Which did almost exactly what I wanted - it ate the HTML and built me an object tree that I could then walk, filtering out what I didn't need. Unfortunately, it and I suffered from a small mismatch of worldview. I use Unicode. A lot.
I grabbed the webpage using urllib, something like:
import urllib

uo = urllib.FancyURLopener()
uo.addheader('Accept-charset', 'utf-8,*')
# Append the quoted word to the URL; passed as a second argument it would be sent as POST data.
f = uo.open("http://www.wordreference.com/es/en/translation.asp?spen=" + urllib.quote_plus('cara'))
# Decode the response so we have a unicode string; we always get iso-8859-1, no matter what we ask for.
response = f.read().decode('iso-8859-1')
# Finished with the request
f.close()
(It's actually a little more complex - you need to handle the character sets more flexibly, and override the user-agent so that WordReference doesn't block you).
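As a rough idea of what "more flexibly" could mean, here is a minimal sketch that picks the charset out of a Content-Type header value and falls back to iso-8859-1 when the label is missing, unknown or plain wrong. The function name decode_body and its parameters are my own invention for illustration, not part of urllib or of the real application.

```python
import re


def decode_body(raw, content_type, fallback='iso-8859-1'):
    """Decode raw response bytes using the charset named in a Content-Type value.

    Hypothetical helper: falls back to `fallback` when no usable charset is given.
    """
    match = re.search(r'charset=["\']?([\w.:-]+)', content_type or '', re.I)
    encoding = match.group(1) if match else fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that are not valid in the declared charset.
        # The 'replace' handler keeps this second attempt from raising.
        return raw.decode(fallback, 'replace')
```

With this, a page that claims utf-8 but actually sends iso-8859-1 bytes still comes back as a Unicode string rather than an exception.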
Anyway, that gets me a Unicode string in response. I can then pass it to a BeautifulSoup object, with:
soup = BeautifulSoup.BeautifulSoup()
soup.feed(response)
But... calling soup.first() (or a number of other functions) can throw me the notorious UnicodeEncodeError. Hmm.
It turns out that BeautifulSoup is, for want of a better term, Unicode-oblivious. If you give it a Unicode data source all the internal strings get silently promoted, but there's no specific Unicode handling in there. This is not a bad approach, and would work very well, if it weren't for the fact that the objects use str(), a lot. Printing any BeautifulSoup instance invokes str() to return a string representation, which uses the default encoding, which is often 'ascii'.
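The failure is easy to reproduce without BeautifulSoup at all. In this small sketch the explicit encode('ascii') stands in for what str() attempts on a Unicode value when the default encoding is 'ascii'; the variable names are just for illustration.

```python
text = u'ca\xf1a'  # 'cana' with an n-tilde: one character outside ASCII

try:
    # Strict encoding, which is roughly what str() does with an 'ascii' default.
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# The 'replace' error handler substitutes '?' instead of raising.
lossy = text.encode('ascii', 'replace')
```

The strict call raises, while the 'replace' call returns a lossy but printable byte string.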
Implicit in the design of BeautifulSoup is the assumption that str() is a good way to represent/return the "value" of an object. For a Tag object, __repr__ calls __str__. Given that the objects here are derived from a stream of characters, that's not unreasonable, but it misses the point that __str__() is usually supposed to return a printable representation, in the default character encoding. When the result is Unicode that can't be converted to a byte string in that encoding, that assumption breaks.
I think what would make more sense (from a Unicode point of view) would be to separate the value of the data from the representation of the data, so that one (for example) accessed the NavigableText.string data attribute (via a function wrapper) to get the value, but accepted that str() applied to an instance would do something like:
import sys

def __str__(self):
    """Return representation of self, replacing characters that can't be printed."""
    return self.string.encode(sys.getdefaultencoding(), 'replace')
Value and representation. Two things that can look the same, but aren't.
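To make the split concrete, here is a minimal sketch of a class that keeps the two apart. TextNode and its value() method are hypothetical stand-ins, not BeautifulSoup's own classes or the patch I submitted, and I've hard-coded 'ascii' as the printable encoding purely as an assumption for the example.

```python
class TextNode(object):
    """Hypothetical stand-in for a text node; not BeautifulSoup's own class."""

    def __init__(self, string):
        # The value: the full Unicode text, never altered.
        self.string = string

    def value(self):
        """Return the underlying Unicode text unchanged."""
        return self.string

    def __str__(self):
        """Return an ASCII-only representation, with '?' for anything else."""
        # Assumption: 'ascii' plays the role of the default encoding here.
        return self.string.encode('ascii', 'replace').decode('ascii')


node = TextNode(u'ca\xf1a')
```

Here node.value() hands back the original text intact, while str(node) gives something that is always safe to print but may have lost characters along the way.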
Oh, and I still like BeautifulSoup very much; so much so that I'm using it, with a patch to avoid the problem, submitted to Leonard.