mysql charsets, can I perform the conversion in python? ~ Dell Released

I have a MySQL database which contains some bad data.

I start with this Unicode string:

    u'TECNOLOGÍA Y EDUCACIÓN'

Encoding to utf8 for the database yields:

    'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'

When I send these bytes to the database, using connection charset latin1 and database charset utf8 (yes, I know this is wrong, but this has already happened, many, many times, and the goal now is to figure out the exact process of corruption so it can be reversed), the data is converted to this (checked using BINARY()):

    'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'

Double-encoding aside, the result I'd expect here is:

    'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xc2\x93N'

Most of this makes sense, as it is interpreting the multi-byte utf8 chars as latin1 and encoding each byte as an individual char, but the conversion of \x93 -> \xe2\x80\x9c makes no sense. latin1's \x93 does not convert to utf8 \xe2\x80\x9c, although \xe2\x80\x9c can be converted to unicode, yielding u'\u201c', which is codepoint \x93 in the cp1252 charset.
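The observed bytes are consistent with MySQL treating latin1 as cp1252, which is how the MySQL manual describes its latin1 charset. The sketch below tests that reading in Python 3 syntax (bytes literals instead of the Python 2 strings above). It rests on one assumption: the five bytes that Python's cp1252 codec leaves undefined are passed through as the control code points with the same number, which matches the \x8d -> \xc2\x8d seen in the stored data. The helper name `mysql_latin1_decode` is hypothetical and is not a MySQL or Python API.

```python
# Bytes that Python's cp1252 codec refuses to decode.
# Assumption: MySQL's latin1 maps each one to the code point of the same value.
UNDEFINED_IN_CP1252 = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def mysql_latin1_decode(raw):
    """Decode bytes the way MySQL's latin1 is assumed to (hypothetical helper)."""
    chars = []
    for byte in raw:
        if byte in UNDEFINED_IN_CP1252:
            # Pass the byte through as U+0081, U+008D, and so on.
            chars.append(chr(byte))
        else:
            chars.append(bytes([byte]).decode("cp1252"))
    return "".join(chars)


# The original string, written with escapes to avoid source-encoding issues.
original = "TECNOLOG\u00cdA Y EDUCACI\u00d3N"

# Step 1: the client encodes to utf8 and sends the bytes.
sent = original.encode("utf-8")

# Step 2: the server reads those bytes as latin1 and stores them as utf8.
stored = mysql_latin1_decode(sent).encode("utf-8")

print(repr(stored))
```

If the assumption holds, `stored` reproduces the bytes reported by BINARY(), including the \xe2\x80\x9c that plain latin1 cannot explain.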

Is mysql combining latin1 and cp1252 when it handles conversions? How can I replicate the conversion process entirely in python? I've iterated through every encoding on the system and none of them work for the entire string. How, in python, can I get from

    'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'

back to

    'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'

Decoding as utf8 handles the first three-quarters of the affected bytes correctly, but the last one is just wrong, and nothing I've tried returns the correct result.
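A possible way to reverse the corruption, sketched in Python 3 syntax under the assumption that the server applied cp1252 plus a pass-through for the five bytes cp1252 leaves undefined: decode the stored bytes as utf8, then map each character back to a single byte. No single built-in codec does both halves, which would explain why trying every encoding on the system fails. The helper name `mysql_latin1_encode` is hypothetical.

```python
# Code points that cp1252 cannot encode but that the server is assumed
# to have produced from the bytes of the same value.
UNDEFINED_IN_CP1252 = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def mysql_latin1_encode(text):
    """Map each character back to one byte (hypothetical helper)."""
    out = bytearray()
    for ch in text:
        if ord(ch) in UNDEFINED_IN_CP1252:
            # U+008D and friends go back to the byte with the same value.
            out.append(ord(ch))
        else:
            # Everything else goes through cp1252, so U+201C becomes \x93.
            out.extend(ch.encode("cp1252"))
    return bytes(out)


# The corrupted bytes as reported by BINARY().
stored = b"TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN"

# Undo the server-side step: utf8 decode, then back to single bytes.
repaired = mysql_latin1_encode(stored.decode("utf-8"))

# The repaired bytes are the utf8 that the client originally sent.
text = repaired.decode("utf-8")

print(repr(repaired))
print(text)
```

A character that cp1252 cannot encode raises UnicodeEncodeError here, which is a useful signal that a given row was not corrupted by this particular path and should be left alone.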

Wednesday, August 14, 2013
