Column name decoding issue after v4.0.3 #194

Closed
jesaerys opened this issue Feb 15, 2017 · 18 comments

@jesaerys

I'm connecting to a Teradata database from Python 3.6 on macOS. I'm seeing strange column names when I use v4.0.5 and v4.0.6 of pyodbc that I wasn't seeing before with v4.0.3. It almost looks like a decoding issue.

In v4.0.3, I can do this:

connection = pyodbc.connect(dsn='mydsn')

# Encoding/decoding for Teradata+Python3.6 as directed in the docs
connection.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
connection.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
connection.setencoding(encoding='utf-8')

cursor = connection.cursor()
cursor.execute("SELECT TOP 5 * FROM mydb.mytable")
cursor.description

Which results in,

(('STATE', str, None, 2, 2, 0, True),
 ('ZIP', str, None, 5, 5, 0, True),
 ('CITY', str, None, 30, 30, 0, True),
 ('ADDRESS', str, None, 30, 30, 0, True))

However, if I repeat the exact same process in v4.0.5 and v4.0.6, I get this:

(('S\x00T\x00A', str, None, 2, 2, 0, True),
 ('Z\x00I', str, None, 5, 5, 0, True),
 ('C\x00I\x00', str, None, 30, 30, 0, True),
 ('A\x00D\x00D\x00R', str, None, 30, 30, 0, True))

Have I set up my connection incorrectly, or do my findings point to a breaking change after v4.0.3?

Thanks.

@mkleehammer
Owner

What I found out after 4.0.3 is that different databases return the column names differently. SQL Server, PostgreSQL, and MySQL all return them as UTF-16LE regardless of the data encoding.

I've added a third decoding option for this, but defaulted it to utf-16le. Please add this:

connection.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-8')

Sorry for the trouble, but each database does things differently instead of following the spec. (In this case, the spec says it should be UTF-16LE.)

I'm considering creating a "factory" that can be configured and then connections created using factory.connect(). I'd also consider adding something like:

factory.autoconfigure('teradata')

with settings people have tested. (I don't have Teradata.)
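
For the curious, a purely hypothetical sketch of what such a factory might look like (none of this exists in pyodbc; the names are placeholders for the idea being floated):

# Hypothetical sketch only -- no ConnectionFactory exists in pyodbc today.
import pyodbc

class ConnectionFactory:
    """Collects encoding/decoding settings, then applies them to each new connection."""

    def __init__(self):
        self._decodings = []
        self._encoding = None

    def setdecoding(self, sqltype, encoding=None, ctype=None):
        self._decodings.append((sqltype, encoding, ctype))

    def setencoding(self, encoding=None):
        self._encoding = encoding

    def connect(self, *args, **kwargs):
        cnxn = pyodbc.connect(*args, **kwargs)
        for sqltype, encoding, ctype in self._decodings:
            cnxn.setdecoding(sqltype, encoding=encoding, ctype=ctype)
        if self._encoding is not None:
            cnxn.setencoding(encoding=self._encoding)
        return cnxn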

In the meantime I'll try to keep the docs up to date. Please let me know if this works for you.

@jesaerys
Author

So here is my updated configuration block,

connection.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
connection.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
connection.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-8')
connection.setencoding(encoding='utf-8')

And here is the new output from cursor.description in my example:

(('S\x00\x00\x00T\x00\x00\x00A\x00', str, None, 2, 2, 0, True),
 ('Z\x00\x00\x00I\x00', str, None, 5, 5, 0, True),
 ('C\x00\x00\x00I\x00\x00\x00', str, None, 30, 30, 0, True),
 ('A\x00\x00\x00D\x00\x00\x00D\x00\x00\x00R\x00', str, None, 30, 30, 0, True))

So the extra decoding setting did have an effect, but it's still not right.
Any ideas? If I knew more about encoding/decoding, I might be able to help troubleshoot this better, sorry.

@jesaerys
Author

Not sure if this is helpful at all: http://info.teradata.com/HTMLPubs/DB_TTU_15_10/index.html#page/Connectivity/B035_2509_035K/2509ch08.10.04.html

I don't fully understand it, but the section "When Making SQLColAttribute Calls" feels like it might be relevant.

@mkleehammer
Owner

Dang - the fact that drivers don't follow specs means I have to support every possible combination of buffers everywhere! I'm calling SQLDescribeColW, which only accepts a SQLWCHAR buffer. Can you try one more thing? It shouldn't be necessary, so I'm not expecting a change, but I'd like to get it tested:

connection.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-8', ctype=pyodbc.SQL_CHAR)

Please note that this could cause a crash if it is what I think it is - it might look right but actually read bytes beyond those in the buffer.

I'll try to find a teradata database I can test with soon. Do you know of any public ways to do so? I thought the only test version they offered was a full VM for something I don't use (I use VirtualBox).

@jesaerys
Author

The extra ctype keyword didn't have any effect on the description output. No crash, either.

It looks like there's some documentation on character sets, but again, unfortunately I don't really understand how all the pieces fit together: http://info.teradata.com/HTMLPubs/DB_TTU_15_10/index.html#page/Connectivity/B035_2509_035K/2509ch08.10.32.html

@jesaerys
Author

jesaerys commented Feb 15, 2017

Oh, and I use my company's instance of Teradata and I'm not aware of a public version. However, I'm more than happy to test whatever you need!

@jesaerys
Author

Is it worth looking into how the teradata package does things?

@mkleehammer
Owner

I'm going to have to find a way to test it so I can dump out buffers raw. In the meantime you can try these and see if they help:

connection.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-16le')
connection.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-16be')

The first is what looks like it should be correct, but it is also the default. Maybe there is a bug and the default is getting overwritten?

@jesaerys
Author

The first gives me the same output as in my original post (as expected, I think). The second gives me Asian characters of some kind :)

I'll keep playing around with different encoding/decoding options on my own, including the CharacterSet connection option for the Teradata driver.
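
(A quick illustration of where those characters come from, assuming plain ASCII column names: decoding little-endian UTF-16 bytes as big-endian turns 'S' followed by a NUL byte into the code unit 0x5300, a CJK ideograph.)

# 'S' + NUL read big-endian is the code unit 0x5300, a CJK character.
print(b'S\x00'.decode('utf-16be'))  # -> '匀' (U+5300)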

@mkleehammer
Owner

OK - in the meantime I'm downloading Teradata Express so I can test in a VM. (I forgot you can run a VMware vmdk in VirtualBox.) However, I'm wondering about the driver - it looks like it uses iODBC. Are you using a standard binary wheel of pyodbc or building your own with iODBC?

@jesaerys
Author

jesaerys commented Feb 15, 2017

I built pyodbc myself against my driver manager, which is iODBC from Homebrew.

Should I try building against unixodbc instead?

@mkleehammer
Owner

There might be a SQLWCHAR size difference then. You might try 'utf-32le' and 'utf-32be'. It'll take me a bit longer to set up with iODBC. Working on it.
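
(For reference, a minimal illustration of the width difference in question; whether this iODBC build really uses 4-byte SQLWCHAR is exactly what is being probed here.)

# Code-unit width is the crux: the same name takes 2 bytes per
# character as UTF-16 units but 4 bytes per character as UTF-32 units.
name = 'STATE'
print(len(name.encode('utf-16le')))  # 10 bytes: 2 per character
print(len(name.encode('utf-32le')))  # 20 bytes: 4 per character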

@mkleehammer
Owner

TL;DR - Use 4.0.7 and set the decoding to:

cnxn.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-32le')

Whew ... it took forever to get to the point where I can test. I don't know if this is buggy behavior in the Teradata driver (my guess) or if iODBC returns 4-byte SQLWCHAR for metadata. Either way, I've added a hack: if the metadata decoding is set to one of the UTF-32 encodings, it assumes sizeof(SQLWCHAR) is 4.

In case you are wondering how all of this started and why we can't use what seemed to work before: I switched to the Unicode version of SQLDescribeCol to handle Unicode column names, and that's not something I wanted to (or could) revert.
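
A minimal sketch of the mismatch, which reproduces the garbled names from earlier in this thread exactly (assuming the driver wrote UTF-32LE data into a buffer that was then sized and decoded on the assumption of 2-byte SQLWCHAR; the exact buffer arithmetic here is an illustration, not the library's internals):

# The driver writes 'STATE' as 4-byte (UTF-32LE) code units; reading
# 5 characters at an assumed 2 bytes each grabs only the first 10 bytes.
raw = 'STATE'.encode('utf-32le')
print(repr(raw[:10].decode('utf-16le')))  # 'S\x00T\x00A' -- the v4.0.5/6 output
print(repr(raw[:10].decode('utf-8')))     # 'S\x00\x00\x00T\x00\x00\x00A\x00' -- the utf-8 attempt
print(raw.decode('utf-32le'))             # 'STATE' -- the 4.0.7 fix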

@jesaerys
Author

Thanks for fixing this, it works perfectly now! The amount of work you put into this project is heroic.

@CoderSong2015

Hi,
I hit a problem where column names couldn't be decoded when connecting to a database (a product of my company) with pyodbc 4.0.17 and unixODBC 2.3.4. I found that unixODBC chooses whether to call the W (wide) version of an ODBC API based on the application type (ANSI or Unicode), but pyodbc calls SQLDescribeColW by default regardless of the application type, so our driver can't handle the inconsistent encoding requirements. Does this design mean that only Unicode will be supported? @mkleehammer

@kurtosis

Just adding another data point: I saw the same issue connecting to Vertica, so at least it's not unique to Teradata. (Python 3.6, macOS, pyodbc 4.0.22)

As described above, setting the metadata decoding to utf-32le worked for me:

conn_vert.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
conn_vert.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
conn_vert.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-32le')
conn_vert.setencoding(encoding='utf-8')

@freshwuzhere

conn_vert.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-32le')

was a winner for me also....

@rsorma04

Everyone: if you're creating a hyper, comment out or delete the following line of code (line 80 in hyper_file_generator.py in the library):

print((col[0].decode('utf-8')))

This immediately fixed the issue for me.
