Posts Tagged UTF8
Recently I did some data movement between Teradata and HDFS, and felt it might be worth posting some insight about how to deal with Unicode in Teradata. For folks who only deal with ASCII strings, you can skip the whole post now 🙂
[Error 3577] [SQLState HY000] Row size or Sort Key size overflow.
When come to the topic of Unicode in Teradata, the above error is an inevitable subject. DBA might tell you that your row is bigger than 64K bytes. But that’s normally not the case. If the table contains quite some VARCHAR/CHAR(n) columns, here is how Teradata estimate the column size regardless if CHARACTER SET Latin or Unicode is specified in the table DDL :
|Session CHARSET =||Estimated Column Size in Bytes for VARCHR(n)/CHAR(n)|
|Latin / ASCII||n x 1|
|UTF8||n x 3|
|UTF16||n x 2|
Then Teradata add the other columns byte size together to determine the row size used in the BTEQ/JDBC/ODBC/TPT/FEXP parcels. Even if the row is actually stored on the disk properly within the 64K limit, you still can’t retrieve it if the estimated row size is beyond 64K. This is quite annoying sometimes, because you can store the data but you can’t access it smoothly.
DBA might advise you to shrink the n in VARCHAR(n) or even vertically split the table into 2. But let’s try the following workaround before you restructure your table.
Set Session CharSet to ASCII
This will probably immediately resolve the row size overflow error. But all your French/Spanish/ Chinese/Japanese/Arabic characters will show up as “?”. This is not good!
Set Session CharSet to UTF16
Normally, people think UTF16 will require more space both on disk and over the network protocol, so UTF8 is always the best choice. But in Teradata’s case:
- Unicode strings are stored as UTF16 on disk anyway. Yes, space is wasted; that’s why there is an algorithm compression function to just compress UTF16 to UTF8 in version 13. Only version 14 can store UTF8 on disk.
- The Row Size estimation is 33% smaller in UTF16 mode than in UTF8 mode for VARCHAR/CHAR/CLOB columns.
- UTF16 mode will require more bytes over network transportation than UTF8
- UTF16 mode will occupy more disk space if you export the query output to a file (binary or vartext)
- but when UTF8 mode bombs, UTF16 may let your queries go through and display the proper characters in your SQL client or BI client
Therefore, UTF16 might be a better default choice for the follow clients:
- ODBC for SQL Assistant, Cognos, MicroStrategy, etc.
- JDBC for AquaData Studio, SQuirreL SQL Client, etc.
- CLIv2 for BTEQ -c utf16 -e utf8
For ETL tools, UTF16 provides better tolerance than UTF8 when non-BMP characters are encountered.
- When ETL writer (DML) session is using Teradata ODBC UTF8, the session will reject the records with extended characters (outside the Basic Multilingual Plane). The error message looks like “Failure 6705 An illegally formed character string was encountered during translation”
- The same extended characters will be converted to “�” using Teradata TPT UTF8, and the records can be written into database.
- Informatica’s implementation for Teradata ODBC has problem in UTF16 mode, please be very cautious.
Let’s see a few unicode characters, some of them are BMP, some of them are non-BMP. Stored in UTF8 mode as a Linux text file, some of them (line 3,4,7) are 3 bytes, others (line 1,2,5,6) are 4 bytes.
DDL: CREATE SET TABLE unicode_test ( ustring VARCHAR(40) CHARACTER SET UNICODE NOT CASESPECIFIC ) NO PRIMARY INDEX; $ cat insert8.sql insert into unicode_test values('1 [ 🐱 ]'); insert into unicode_test values('2 [ 🕓 ]'); insert into unicode_test values('3 [ ℃ ]'); insert into unicode_test values('4 [ ↳ ]'); insert into unicode_test values('5 [ 🏠 ]'); insert into unicode_test values('6 [ 🐎 ]'); insert into unicode_test values('7 [ 赞 ]'); $ bteq -c utf8 -e utf8 # only line 3,4,7 go through; line 1,2,5,6 are rejected # *** Failure 6705 An illegally formed character string was encountered during translation. #### UTF16 mode will load the data, but display them as diamond-shape ? #### $ bteq -c utf16 -e utf8 BTEQ -- Enter your SQL request or BTEQ command: run file insert8.sql; select * from unicode_test order by 1; *** Query completed. 7 rows found. One column returned. *** Total elapsed time was 1 second. ustring Char2HexInt(ustring) --------------------- ---------------------------------- 1 [ �� ] 00310020005B0020FFFDFFFD0020005D -- f0 9f 90 b1 (UTF8) 2 [ �� ] 00320020005B0020FFFDFFFD0020005D -- f0 9f 95 93 (UTF8) 3 [ ℃ ] 00330020005B0020210300200020005D -- e2 84 83 (UTF8) 4 [ ↳ ] 00340020005B002021B300200020005D -- e2 86 b3 (UTF8) 5 [ �� ] 00350020005B0020FFFDFFFD0020005D -- f0 9f 8f a0 (UTF8) 6 [ �� ] 00360020005B0020FFFDFFFD0020005D -- f0 9f 90 8e (UTF8) 7 [ 赞 ] 00370020005B00208D5E00200020005D -- e8 b5 9e (UTF8) ###################### without specifying any character set ################# $ bteq BTEQ -- Enter your SQL request or BTEQ command: *** Logon successfully completed. *** Teradata Database Release is 13.10.05.03 *** Teradata Database Version is 13.10.05.03a *** Transaction Semantics are BTET. *** Session Character Set Name is 'ASCII'. run file insert8.sql; select ustring, substr(char2hexint(ustring),1,40), octet_length(ustring) from scratch_ds.unicode_test order by 1; *** Query completed. 7 rows found. 2 columns returned. *** Total elapsed time was 1 second. ustring Substr(Char2HexInt(ustring),1,40) Octet_Length(ustring) ---------------------- ---------------------------------------- --------------------- 1 [ 🐱 ] 00310020005B002000D0009F009000B10020005D 10 2 [ 🕓 ] 00320020005B002000D0009F009500930020005D 10 3 [ ℃ ] 00330020005B002000E200D700B800200020005D 10 4 [ ↳ ] 00340020005B002000E2008600B300200020005D 10 5 [ 🏠 ] 00350020005B002000D0009F008F01610020005D 10 6 [ 🐎 ] 00360020005B002000D0009F0090008E0020005D 10 7 [ 赞 ] 00370020005B002000E800B5009E00200020005D 10 #### But if we query the same result from JDBC or ODBC client, the display is totally messed up #### ustring Char2HexInt(ustring) ---------- ------------------------------------------ 1 [ Ð± ] 00310020005B002000D0009F009000B10020005D 2 [ Ð ] 00320020005B002000D0009F009500930020005D 3 [ â×¸ ] 00330020005B002000E200D700B800200020005D 4 [ â³ ] 00340020005B002000E2008600B300200020005D 5 [ Ðš ] 00350020005B002000D0009F008F01610020005D 6 [ Ð ] 00360020005B002000D0009F0090008E0020005D 7 [ èµ ] 00370020005B002000E800B5009E00200020005D
As you can see from the above sample, Teradata’s UTF16 mode still can’t store those 4-byte and 3-byte unicode beyond BMP, but at least the ETL process will go through. But ORACLE handles these sample characters flawlessly.
Non-BMP Character in HDFS
Those 4-byte non-BMP characters are also causing trouble on HDFS, it can be stored, but can’t be decoded properly at retrieval time. So it will be a good idea to filter them out at the input GUI.
Column Size Calculation in TPT/BTEQ/FEXP /FASTLOAD
When we export/import data to/from files (binary or flat), the following 2 data types need special calculation to define the file schema:
- ASCII mode: VARCHAR/CHAR(n)
- UTF8 mode: VARCHAR/CHAR(n*3)
- UTF16 mode: VARCHAR/CHAR(n*2)
- MaxDecimalDigits = 15: any column with precision > 15 will have to be defined as DECIMAL(15,x) in the file schema
- MaxDecimalDigits = 18: any column with precision > 18 will have to be defined as DECIMAL(18,x) in the file schema
- MaxDecimalDigits = 38: this is not the default setting, so we have to specify it in the script
If we use BTEQ or FASTLOAD to import flat file (or binary data) in ASCII mode without specifying -c UTF8 or -c UTF16, the utility does not conduct any character validation, so the non-BMP and symbol characters can be imported as is, but the string cannot be queried or displayed in JDBC or ODBC client.
With 14.00 and 14.10 available for upgrade, Teradata tries to advise customers to set AcceptReplacementCharacters = FALSE, but NTA2805 states that 14.xx still does not accept as many unicode characters as customers expect. So it seems that AcceptReplacementCharacters = TRUE is still required for a while.