MySQL charsets: can I perform the conversion in Python?
I have a MySQL database which contains some bad data.
I start with this Unicode string:

    u'TECNOLOGÍA Y EDUCACIÓN'

Encoding to UTF-8 for the database yields:

    'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'

When I send these bytes to the database, using connection charset latin1
and database charset utf8 (yes, I know this is wrong, but it has already
happened many, many times, and the goal now is to figure out the exact
process of corruption so it can be reversed), the data is converted to
this (checked using BINARY()):

    'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'

Double-encoding aside, the result I'd expect here is:

    'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xc2\x93N'

Most of this makes sense: MySQL is interpreting the multi-byte UTF-8
sequences as latin1 and encoding each byte as an individual character.
But the conversion of \x93 -> \xe2\x80\x9c makes no sense. latin1's \x93
does not convert to UTF-8 \xe2\x80\x9c, although \xe2\x80\x9c can be
decoded to Unicode, yielding u'\u201c', which is byte \x93 in the cp1252
charset.
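
The difference between the two interpretations of byte \x93 can be checked
directly in Python. This is a minimal Python 3 sketch (the original strings
above use Python 2 syntax); the variable names are only illustrative. It
also shows that Python's cp1252 codec has no mapping for \x8d, which would
be consistent with \x8d coming out as \xc2\x8d while \x93 comes out as
\xe2\x80\x9c if MySQL's latin1 is cp1252-based with an ISO-8859-1 fallback
for undefined bytes.

```python
# Re-encode byte 0x93 to UTF-8 under two different source charsets.
as_latin1 = b"\x93".decode("latin-1").encode("utf-8")
as_cp1252 = b"\x93".decode("cp1252").encode("utf-8")

print(as_latin1)  # b'\xc2\x93'      (what strict ISO-8859-1 would give)
print(as_cp1252)  # b'\xe2\x80\x9c'  (what the database actually stored)

# Byte 0x8d is undefined in Python's cp1252 codec, so decoding it fails.
try:
    b"\x8d".decode("cp1252")
    cp1252_defines_8d = True
except UnicodeDecodeError:
    cp1252_defines_8d = False

print(cp1252_defines_8d)  # False
```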
Is MySQL combining latin1 and cp1252 when it handles conversions? How can
I replicate the conversion process entirely in Python? I've iterated
through every encoding on the system and none of them work for the entire
string. How, in Python, can I get from

    'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'

back to

    'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'?

Decoding as UTF-8 will handle the first 3/4ths correctly, but that last
one is just wrong, and nothing I've tried will return the correct results.
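
One possible way to reproduce and reverse the corruption, sketched in
Python 3. It rests on an assumption that should be checked against the
MySQL documentation for your server version: that MySQL's latin1 behaves
like cp1252, except that the five bytes cp1252 leaves undefined (0x81,
0x8d, 0x8f, 0x90, 0x9d) map to the same-numbered control characters, as in
ISO-8859-1. The helper names mysql_latin1_decode and mysql_latin1_encode
are hypothetical, not part of any library.

```python
# Bytes undefined in Python's cp1252 codec. ASSUMPTION: MySQL's latin1 maps
# these to U+0081, U+008D, U+008F, U+0090, U+009D, as ISO-8859-1 would.
UNDEFINED_IN_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def mysql_latin1_decode(data):
    """Decode bytes the way MySQL's latin1 is assumed to (hypothetical helper)."""
    return "".join(
        chr(b) if b in UNDEFINED_IN_CP1252 else bytes([b]).decode("cp1252")
        for b in data
    )


def mysql_latin1_encode(text):
    """Inverse of mysql_latin1_decode; raises UnicodeEncodeError for
    characters that have no byte in this charset."""
    return b"".join(
        bytes([ord(ch)]) if ord(ch) in UNDEFINED_IN_CP1252 else ch.encode("cp1252")
        for ch in text
    )


original_utf8 = "TECNOLOGÍA Y EDUCACIÓN".encode("utf-8")

# Forward: UTF-8 bytes misread as latin1, then stored as UTF-8.
corrupted = mysql_latin1_decode(original_utf8).encode("utf-8")
print(corrupted)
# b'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'

# Reverse: decode the stored UTF-8, then map each character back to one byte.
repaired = mysql_latin1_encode(corrupted.decode("utf-8"))
print(repaired == original_utf8)  # True
print(repaired.decode("utf-8"))   # TECNOLOGÍA Y EDUCACIÓN
```

The forward step reproduces the exact bytes reported by BINARY() above,
which is some evidence for the assumption, but it is only tested here on
this one string. Rows that were never double-encoded would need to be
detected and skipped before applying the reverse step.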
Wednesday, August 14, 2013
Posted on 6:22 PM by Unknown