You are here

MySQL charset issues while importing data using LOAD DATA INFILE

Submitted by Druss on Tue, 2011-09-27 20:47

Earlier today, I was banging my head against the wall trying to import some data in a CSV file into MySQL. While my imports have gone well thus far, this time around I was dealing with data involving lots of strange diacritics, runic squiggles and other manners of gibberish that make the world as fun as it can be. In other words, I was dealing with Unicode.

After a couple of false starts where I found that my imported data had somehow been corrupted in the database (and I knew this because all the gibberish was looking even more unintelligible than previously). A little bird told me that the problem was that I was mixing character sets by trying to shove latin1 into utf8 and what not. Before I go any further, messing around with unicode is injurious to your data, your database, your server, your workstation, your health and that of your neighbour's canine pet. Ensure that you are working on a test database, a test table and ideally at least a couple of miles away from your live server.

Where was I.. Yes, so I had to make sure that every step of the import process used the same character set, which in my case was utf8 (utf8_general_ci). My life was also made needlessly complicated by the fact that I was mixing platforms as well with a Windows 7 desktop and a Kubuntu server.

At any rate, the things to do/check include:

  • Make sure that every tool that you are using be it the text editor or even the terminal that you are using supports your character set.
  • Take care to save the CSV/source file as a document in your preferred encoding. UTF-8 is usually a winner.
  • Ensure that all relevant columns have the correct character set. This can be done in MySQL using the SHOW FULL COLUMNS command (DESC is insufficient):
    mysql> SHOW FULL COLUMNS FROM my_table;
    +-------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
    | Field | Type         | Collation       | Null | Key | Default | Extra          | Privileges                      | Comment |
    +-------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
    | id    | int(10)      | NULL            | NO   | PRI | NULL    | auto_increment | select,insert,update,references |         |
    | pid   | int(10)      | NULL            | YES  |     | 0       |                | select,insert,update,references |         |
    | title | varchar(255) | utf8_general_ci | NO   |     | NULL    |                | select,insert,update,references |         |
    | body  | longtext     | utf8_general_ci | YES  |     | NULL    |                | select,insert,update,references |         |
    +-------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
    4 rows in set (0.00 sec)

    If they do not have the right encoding then an ALTER TABLE like the following is necessary to fix things:
    mysql> alter table my_table DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
    Note the difference between the character set and the collation.

  • It might also be worthwhile to change the default character set and the collation of the entire table itself:
    ALTER TABLE my_table DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci
  • Switch the character set of the MySQL connection to our preferred encoding:
    mysql> \C utf8
    Charset changed

    This (temporary) change can be confirmed with:

    mysql> \s
    --------------
    mysql  Ver 14.14 Distrib 5.1.41, for debian-linux-gnu (i486) using readline 6.1

    Connection id:          8731
    SSL:                    Not in use
    Current pager:          stdout
    Using outfile:          ''
    Using delimiter:        ;
    Server version:         5.1.41-3ubuntu12.10 (Ubuntu)
    Protocol version:       10
    Connection:             stranger via TCP/IP
    Client characterset:    utf8
    Server characterset:    utf8
    TCP port:               3306
    Uptime:                 6 hours 11 min 52 sec

    Threads: 2  Questions: 54101  Slow queries: 1  Opens: 16527  Flush tables: 1  Open tables: 64  Queries per second avg: 2.424
    --------------
  • Now, import your CSV file:
    mysql> LOAD DATA LOCAL INFILE 'migrate.csv' INTO TABLE my_table CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '#' LINES TERMINATED BY '\r\n' (title, body);

    Note the fact that we are explicitly specifying the character set of the source data.

  • If you run into an error during the import:
    mysql> LOAD DATA LOCAL INFILE 'migrate.csv' INTO TABLE my_table CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '#' LINES TERMINATED BY '\r\n' (title, body);
    Query OK, 1 row affected, 2 warnings (0.00 sec)
    Records: 1  Deleted: 0  Skipped: 0  Warnings: 0

    Warning (Code 1366): Incorrect string value: '\xFF\xFE#\x00P\x00...' for column 'title' at row 1
    Warning (Code 1366): Incorrect string value: '\xE2\x00n\x00 \x00...' for column 'body' at row 1

    The above errors usually indicate that the source file is not in the correct encoding. This can be verified using the file command:

    Jubal@Stranger:$ file migrate1.csv 
    migrate1.csv: Little-endian UTF-16 Unicode English text, with CRLF, LF line terminators
    Jubal@Stranger:$ file migrate2.csv
    migrate2.csv: UTF-8 Unicode (with BOM) English text, with CRLF line terminators

    UTF-8 is not the same as UTF-16.

Once every step of the import process shared the same character set, my data came through intact. I hope that this saves time for somebody out there :)

Comments

Thank you, that helped a lot.