Wednesday, June 17, 2009

Python: Working in Unicode

For my ongoing Google Summer of Code project, I need to write a lot of scripts that run analysis on Bengali words. So far I had been getting by with shell scripts and a lot of php-cli scripts. But writing long object-oriented code in PHP seems a bit cumbersome to me (that's my personal opinion, no offense to the die-hard PHP lovers :)), so I decided to move to Python. Most of the scripts that my mentor provided were also in Python, so it made really good sense.

While working with Unicode characters in Python, you'll often come across this type of error message (it took me a while to fix):


UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
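To make the error concrete, here is a minimal sketch that triggers the same kind of failure on purpose by encoding a Bengali word with the ASCII codec. The word and the helper name `encode_ascii` are just illustrations, not taken from my project scripts, and the `except ... as` syntax assumes Python 2.6 or later.

```python
# The Bengali word is written with \u escapes so this file stays pure ASCII.
word = u"\u09ac\u09be\u0982\u09b2\u09be"  # five Bengali code points


def encode_ascii(text):
    """Return the error message raised by ASCII encoding, or None on success."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        return str(err)
    return None


message = encode_ascii(word)
print(message)
```

Plain ASCII text goes through the same helper without any error, which is why the problem only shows up once real Bengali data is involved.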


This error appears when Python falls back to the ASCII codec to encode a unicode string that contains non-ASCII characters. The first step is to declare UTF-8 as the encoding of your source file, so make sure the first few lines of your program look like this.


#!/usr/bin/python
# -*- coding: utf-8 -*-


This enables you to write Unicode characters in your source code. But it does not enable you to print them to the console, and you'll still keep getting the same error I mentioned earlier.

To solve this, add the following code to your /usr/lib/python2.5/sitecustomize.py file (the path may differ depending on your installed Python version):


import sys
sys.setdefaultencoding('utf-8')


I first tried to do this in the source code itself, but it didn't work. I kept getting this error:

AttributeError: 'module' object has no attribute 'setdefaultencoding'
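A quick way to see what is going on is to ask the interpreter whether the function is still there. As far as I understand it, Python 2's site.py removes `sys.setdefaultencoding` once startup is finished, which is why sitecustomize.py (run during startup) can call it but a normal script cannot. The variable name `can_set` is only for illustration; on Python 3 the function does not exist at all and the default encoding is already UTF-8.

```python
import sys

# True only if the interpreter still exposes the setter at this point.
can_set = hasattr(sys, "setdefaultencoding")

# Show the default codec currently in effect and whether it can be changed.
print(sys.getdefaultencoding(), can_set)
```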


I'm no Python expert, but maybe Python's default behavior is not to allow changing the encoding at runtime, just for safety (an increasing number of system tools are written in Python these days, and they run all the time in GNOME and KDE). That'd make more sense.
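If you would rather not touch the system-wide default, one alternative (not what I ended up doing above, just a sketch) is to encode the text to UTF-8 yourself before writing it out. The helper `write_utf8` and the in-memory `io.BytesIO` stream are hypothetical stand-ins for a real file opened in binary mode, and the `io` module assumes Python 2.6 or later.

```python
import io

# The Bengali word is written with \u escapes so this file stays pure ASCII.
word = u"\u09ac\u09be\u0982\u09b2\u09be"


def write_utf8(stream, text):
    """Encode text as UTF-8, write the bytes to a binary stream, return the byte count."""
    data = text.encode("utf-8")
    stream.write(data)
    return len(data)


out = io.BytesIO()  # stand-in for a file opened in binary mode
written = write_utf8(out, word + u"\n")
print(written)
```

Because the encoding is named explicitly, the result no longer depends on whatever default codec the interpreter happens to use.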
