Wednesday, March 23, 2011

unicode() vs. str.decode() for a UTF-8 encoded byte string (Python 2.x)

Is there any reason to prefer unicode(somestring, 'utf8') as opposed to somestring.decode('utf8')?

My only thought is that .decode() is a bound method, so Python may be able to resolve it more efficiently, but correct me if I'm wrong.

From Stack Overflow
  • This is easy to benchmark:

    >>> from timeit import Timer
    >>> ts = Timer("s.decode('utf-8')", "s = 'ééé'")
    >>> ts.timeit()
    8.9185450077056885
    >>> tu = Timer("unicode(s, 'utf-8')", "s = 'ééé'") 
    >>> tu.timeit()
    2.7656929492950439

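    In this run, unicode() is roughly three times faster than s.decode() (about 2.8 s vs. 8.9 s for timeit's default of one million calls).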

    FWIW, I don't know where you got the impression that methods would be faster; it's quite the contrary.

    J.F. Sebastian : Fixed the example output.
    J.F. Sebastian : Python 2.5: 3.0 vs. 0.9; Python 2.6: 2.6 vs. 0.6. That is, `unicode()` is about 4 times faster than `s.decode()`.
  • I'd prefer 'something'.decode(...), since the unicode built-in no longer exists in Python 3.0 (str is the text type there), while text = b'binarydata'.decode(encoding) is still valid.
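
The portability point in the last answer can be sketched as a small helper. This is only an illustration, not something from the thread: the names `PY2`, `text_type`, and `to_text` are hypothetical, and the sketch assumes Python 2.6+ or Python 3.3+ so that `b'...'` literals are accepted.

```python
import sys

# Hypothetical names for illustration; none of them come from the thread.
PY2 = sys.version_info[0] == 2

# The text type is `unicode` on Python 2 and `str` on Python 3.
# The name `unicode` is only evaluated when PY2 is true.
text_type = unicode if PY2 else str  # noqa: F821


def to_text(data, encoding="utf-8"):
    """Return data as text, decoding byte strings with the given encoding."""
    if isinstance(data, text_type):
        return data  # already text, nothing to decode
    # .decode() exists on Python 2 str and on Python 3 bytes alike.
    return data.decode(encoding)


raw = b"\xc3\xa9\xc3\xa9\xc3\xa9"  # the UTF-8 bytes of three 'é' characters
text = to_text(raw)
```

On Python 3, `str(raw, 'utf-8')` is the closest counterpart to the Python 2 call `unicode(raw, 'utf-8')`. The timings quoted above were measured on Python 2 and may not carry over, so it is worth re-running the benchmark on the interpreter you actually use.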
