Replacing smart quotes, em-dashes, and ellipses with MySQL or PHP
March 5, 2009
Alternate title: "Help! My Quotes Appear as Question Marks or Other Strange Characters!"
The "Smart quotes" feature in Microsoft Office transforms straight quotes into curly quotes. It also transforms hyphens into em-dashes and three periods into ellipses. While one might think, "How lovely! My document looks almost as if I'm educated!" readers of said document may not. Microsoft, in its infinite wisdom, decided to assign special characters such as the ones we just mentioned to a range of codes above 128. Problem: these codes aren't compatible with other character sets such as ISO-8859-1 or UTF-8, resulting in frustrating issues with non-Microsoft systems.
Keep reading for some PHP and MySQL code to help out with this issue.
Our introduction to this was in a situation where we had people using many different systems submitting articles to one of our programs. We decided that we wanted all our articles to use straight quotes, hyphens, and periods. This was partly for consistency, and partly because these characters are common to many character sets and won't cause incompatibilities. Should your requirements be different, it should be trivial to modify the code below to fit your specific situation.
Here are some MySQL and PHP techniques for replacing all instances of smart quotes, plus the en dash, em dash, and ellipsis with straight quotes, one or two dashes, or three dots. This code should operate with both the Windows-1252 charset, and also UTF-8, an extended character set that is in many situations the "best" character set to use for email and websites.
MySQL:
# First, replace UTF-8 characters.
UPDATE `t` SET `c` = REPLACE(`c`, 0xE28098, "'");
UPDATE `t` SET `c` = REPLACE(`c`, 0xE28099, "'");
UPDATE `t` SET `c` = REPLACE(`c`, 0xE2809C, '"');
UPDATE `t` SET `c` = REPLACE(`c`, 0xE2809D, '"');
UPDATE `t` SET `c` = REPLACE(`c`, 0xE28093, '-');
UPDATE `t` SET `c` = REPLACE(`c`, 0xE28094, '--');
UPDATE `t` SET `c` = REPLACE(`c`, 0xE280A6, '...');
# Next, replace their Windows-1252 equivalents.
UPDATE `t` SET `c` = REPLACE(`c`, char(145), "'");
UPDATE `t` SET `c` = REPLACE(`c`, char(146), "'");
UPDATE `t` SET `c` = REPLACE(`c`, char(147), '"');
UPDATE `t` SET `c` = REPLACE(`c`, char(148), '"');
UPDATE `t` SET `c` = REPLACE(`c`, char(150), '-');
UPDATE `t` SET `c` = REPLACE(`c`, char(151), '--');
UPDATE `t` SET `c` = REPLACE(`c`, char(133), '...');
PHP:
// First, replace UTF-8 characters.
$text = str_replace(
array("\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6"),
array("'", "'", '"', '"', '-', '--', '...'),
$text);
// Next, replace their Windows-1252 equivalents.
$text = str_replace(
array(chr(145), chr(146), chr(147), chr(148), chr(150), chr(151), chr(133)),
array("'", "'", '"', '"', '-', '--', '...'),
$text);
Additionally, here's a table of character codes that you may find useful:
Windows-1252 characters not present in ISO-8859-1
Further reading from Wikipedia:
UTF-8
ISO/IEC 8859-1
Windows-1252
- Ananth
"\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6"
I am glad to hear this is helping.
UPDATE note
SET note_text = REPLACE(note_text, E'\…', E'\.\.\.')
WHERE note_text LIKE E'%\…%';
Thanks so much :)
Cheers from South Africa
One note..
I got an error in parsing the Ellipsis when using chr(133) in php.
(It changed all occurrences of norwegian "Å")
Php ord() gave 226 as the ellipsis and that worked well.
Best Regards
Eirik
I'd like to note that the second part related to Windows-1252 conversion causes problem in my case. My input string is UTF8. When testing with polish 'ą' it cuts off it and all following characters?!
Sample:
Input: 'abcdąefgh'
Output: 'abcd'
I think that for UTF8 input it is better not to convert Windows-1252 characters. For many people it might be obvious, but I'm posting it here in case somebody was struggling with it ;)
My PHP application saves teachers reports and then exports to a word merge and those smart quotes were crashing my program.
Thank you again.