Earlier today, I was banging my head against the wall trying to import some data in a CSV file into MySQL. While my imports have gone well thus far, this time around I was dealing with data involving lots of strange diacritics, runic squiggles and other manners of gibberish that make the world as fun as it can be. In other words, I was dealing with Unicode.
After a couple of false starts in which my imported data somehow ended up corrupted in the database (I could tell because the gibberish looked even more unintelligible than before), a little bird told me that the problem was that I was mixing character sets by trying to shove latin1 into utf8 and whatnot. Before I go any further: messing around with Unicode is injurious to your data, your database, your server, your workstation, your health and that of your neighbour's canine pet. Ensure that you are working on a test database and a test table, ideally at least a couple of miles away from your live server.
Where was I? Yes, so I had to make sure that every step of the import process used the same character set, which in my case was utf8 (with the utf8_general_ci collation). My life was also made needlessly complicated by the fact that I was mixing platforms, with a Windows 7 desktop and a Kubuntu server.
At any rate, the things to do/check include:
- Make sure that every tool you are using, be it the text editor or even the terminal, supports your character set.
- Take care to save the CSV/source file as a document in your preferred encoding. UTF-8 is usually a winner.
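If you are generating the source file yourself, the encoding can be fixed at write time. A minimal Python sketch (the filename and sample rows are made up for illustration):

```python
import csv

# Write the source file explicitly as UTF-8 so MySQL later sees the bytes
# it expects. Note: "utf-8-sig" would prepend a BOM, which LOAD DATA does
# not strip -- plain "utf-8" is the safer choice here.
rows = [("Björk", "Die Ärzte"), ("Motörhead", "Sigur Rós")]
with open("migrate.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

# Reading it back with the same encoding round-trips the diacritics intact.
with open("migrate.csv", encoding="utf-8") as f:
    print(f.read())
```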
- Ensure that all relevant columns have the correct character set. This can be checked in MySQL using the SHOW FULL COLUMNS command (DESC is insufficient, as it does not display collations):

mysql> SHOW FULL COLUMNS FROM my_table;
+-------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+-------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| id | int(10) | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | |
| pid | int(10) | NULL | YES | | 0 | | select,insert,update,references | |
| title | varchar(255) | utf8_general_ci | NO | | NULL | | select,insert,update,references | |
| body | longtext | utf8_general_ci | YES | | NULL | | select,insert,update,references | |
+-------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
4 rows in set (0.00 sec)

If they do not have the right encoding, then an ALTER TABLE like the following is necessary to fix things (CONVERT TO rewrites the existing columns, rather than merely changing the default for columns added later):

mysql> ALTER TABLE my_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

Note the difference between the character set and the collation.
- It might also be worthwhile to change the default character set and collation of the table itself, which affects columns created from then on:

mysql> ALTER TABLE my_table DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
- Switch the character set of the MySQL connection to our preferred encoding:
mysql> \C utf8
Charset changed

This (temporary) change can be confirmed with:
mysql> \s
--------------
mysql Ver 14.14 Distrib 5.1.41, for debian-linux-gnu (i486) using readline 6.1
Connection id: 8731
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server version: 5.1.41-3ubuntu12.10 (Ubuntu)
Protocol version: 10
Connection: stranger via TCP/IP
Client characterset: utf8
Server characterset: utf8
TCP port: 3306
Uptime: 6 hours 11 min 52 sec
Threads: 2 Questions: 54101 Slow queries: 1 Opens: 16527 Flush tables: 1 Open tables: 64 Queries per second avg: 2.424
--------------

- Now, import your CSV file:
mysql> LOAD DATA LOCAL INFILE 'migrate.csv' INTO TABLE my_table CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '#' LINES TERMINATED BY '\r\n' (title, body);
Note the fact that we are explicitly specifying the character set of the source data.
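The FIELDS and LINES clauses have to match the file byte for byte. One way to sanity-check is to write a row with the same conventions from Python (the data here is made up; '#' as the quote character mirrors the ENCLOSED BY clause above):

```python
import csv

# Mirror the LOAD DATA clauses: fields terminated by ',', enclosed by '#',
# lines terminated by '\r\n'.
row = ["Crème brûlée", "Smörgåsbord, naturally"]
with open("migrate.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter=",", quotechar="#",
                        quoting=csv.QUOTE_ALL, lineterminator="\r\n")
    writer.writerow(row)

# Inspect the raw bytes to confirm the enclosures and line terminator.
with open("migrate.csv", "rb") as f:
    print(f.read())
```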
- If you run into an error during the import:
mysql> LOAD DATA LOCAL INFILE 'migrate.csv' INTO TABLE my_table CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '#' LINES TERMINATED BY '\r\n' (title, body);
Query OK, 1 row affected, 2 warnings (0.00 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
Warning (Code 1366): Incorrect string value: '\xFF\xFE#\x00P\x00...' for column 'title' at row 1
Warning (Code 1366): Incorrect string value: '\xE2\x00n\x00 \x00...' for column 'body' at row 1

The above warnings usually indicate that the source file is not in the encoding you told MySQL it was in. This can be verified using the file command:
Jubal@Stranger:$ file migrate1.csv
migrate1.csv: Little-endian UTF-16 Unicode English text, with CRLF, LF line terminators
Jubal@Stranger:$ file migrate2.csv
migrate2.csv: UTF-8 Unicode (with BOM) English text, with CRLF line terminators
UTF-8 is not the same as UTF-16: the first file needs to be converted (or re-saved in the right encoding) before it will load cleanly.
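If file reports UTF-16 (or a BOM), the file can be re-encoded before loading, e.g. with iconv or with a few lines of Python. A minimal sketch, with filenames mirroring the example above and made-up contents:

```python
# Simulate a file as a Windows editor might save it: UTF-16 with a BOM,
# matching the first `file` output above. (newline="" disables newline
# translation so the CRLF terminators survive untouched.)
text = "Björk,Die Ärzte\r\n"
with open("migrate1.csv", "w", encoding="utf-16", newline="") as f:
    f.write(text)  # the utf-16 codec writes a BOM automatically

# Re-encode to UTF-8 without a BOM, so that
# LOAD DATA ... CHARACTER SET utf8 sees what it expects.
with open("migrate1.csv", encoding="utf-16", newline="") as src:
    data = src.read()  # decoding as utf-16 strips the BOM
with open("migrate_utf8.csv", "w", encoding="utf-8", newline="") as dst:
    dst.write(data)
```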
Once every step of the import process shared the same character set, my data came through intact. I hope that this saves time for somebody out there :)
Comments
It did. :o)
re
Thank you, that helped a lot.