UnicodeDecodeError using BeautifulSoup

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: ordinal not in range(128)

There’s a rather annoying bug in Python 2.5’s sgmllib.py, which is the cause.

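The function convert_charref assumes that ASCII characters have values up to 255; the correct limit is 127. As a result, a numeric character reference such as &#163; is turned into the single byte 0xa3, which the 'ascii' codec then refuses to decode.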
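To make the off-by-a-bit limit concrete, here is a minimal sketch of the range check. It is a simplified model written for Python 3, not the actual sgmllib source; the standalone function and its `limit` parameter are made up for illustration.

```python
# Simplified model of the range check, NOT the real sgmllib code.
# The `limit` parameter is hypothetical; it only exists to compare 255 with 127.
def convert_charref(name, limit):
    """Return a one-byte string for a numeric character reference, or None."""
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= limit:
        return None  # out of range: leave the reference untouched
    return bytes([n])

buggy = convert_charref("163", limit=255)  # accepted, yields the byte 0xa3
fixed = convert_charref("163", limit=127)  # rejected, yields None

# The byte produced under the 255 limit is not valid ASCII.
try:
    buggy.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

With the limit at 127, the reference is never converted to a raw byte, so nothing non-ASCII reaches the decoder.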

3 Responses to “UnicodeDecodeError using BeautifulSoup”

  2. I’m not sure whether the cause is the same, but in Python 2.6, BeautifulSoup is still throwing UnicodeDecodeErrors on valid UTF-8 documents (specifically, the contents of pypi.python.org). You can work around it by dropping everything outside ASCII with BeautifulSoup.BeautifulSoup(text.decode('ascii', 'ignore'))… but, of course, you lose data that way.
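The data loss mentioned in that workaround can be seen without BeautifulSoup at all. The following is a small sketch in Python 3 with a made-up sample string; it only shows what decoding with 'ascii' and 'ignore' throws away, and makes no claim about whether other decodings avoid the parser bug.

```python
# Hypothetical sample input: UTF-8 bytes containing two non-ASCII characters.
raw = "£5 café".encode("utf-8")

# The workaround: every byte outside the ASCII range is silently dropped.
lossy = raw.decode("ascii", "ignore")

# For comparison, decoding with the document's real encoding keeps everything.
full = raw.decode("utf-8")
```

Here `lossy` no longer contains the pound sign or the accented letter, which is exactly the trade-off the comment warns about.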
