Posted By: Anonymous
After noticing an application tended to discard random emails due to incorrect string value errors, I went though and switched many text columns to use the
utf8 column charset and the default column collate (
utf8_general_ci) so that it would accept them. This fixed most of the errors, and made the application stop getting sql errors when it hit non-latin emails, too.
Despite this, some of the emails are still causing the program to hit incorrect string value errrors:
(Incorrect string value: 'xE4xC5xCCxC9xD3xD8...' for column 'contents' at row 1)
The contents column is a
MEDIUMTEXT datatybe which uses the
utf8 column charset and the
utf8_general_ci column collate. There are no flags that I can toggle in this column.
Keeping in mind that I don’t want to touch or even look at the application source code unless absolutely necessary:
- What is causing that error? (yes, I know the emails are full of random garbage, but I thought utf8 would be pretty permissive)
- How can I fix it?
- What are the likely effects of such a fix?
One thing I considered was switching to a utf8 varchar([some large number]) with the binary flag turned on, but I’m rather unfamiliar with MySQL, and have no idea if such a fix makes sense.
"xE4xC5xCCxC9xD3xD8" isn’t valid UTF-8. Tested using Python:
>>> "xE4xC5xCCxC9xD3xD8".decode("utf-8") ... UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid data
If you’re looking for a way to avoid decoding errors within the database, the cp1252 encoding (aka “Windows-1252” aka “Windows Western European”) is the most permissive encoding there is – every byte value is a valid code point.
Of course it’s not going to understand genuine UTF-8 any more, nor any other non-cp1252 encoding, but it sounds like you’re not too concerned about that?