Python 2 vs Python 3: DevString with bytes #251

mguijarr · 2019-01-25T12:35:14Z

Hi all,

I noticed an intriguing behavior of PyTango running within a Python 3 environment.

Scenario

An old, existing C++ server to control a RS232 serial line. The server has a DevSerReadString
method to read chars from the serial line ; Tango output data type is DevString.
The equipment connected to the serial line sends binary data.

Problem

When calling the command from a Python 2 client, everything works fine -- the returned string corresponds to the bytes sent by the equipment.
When calling the command from a Python 3 client, None is returned.

Minimal example to reproduce the problem

Server in Python 2

from tango.server import run
from tango.server import Device
from tango.server import attribute, command

class BytesTest(Device):
    @command(dtype_out=str)
    def bytes_str1(self):
        return "binary_data\xee\xff"

if __name__ == "__main__":
    run((BytesTest,))

Python 3 client

 Python 3.7.1 (default, Dec 14 2018, 19:28:38) 
 [GCC 7.3.0] :: Anaconda, Inc. on linux
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import tango; dev=tango.DeviceProxy("id00/tango/test")
 >>> print(dev.bytes_str1())
 None
 >>>

Additional information

Of course the problem comes from bytes/unicode handling in Python. It is also probably a
bad idea to use DevString for this kind of data transfer, maybe DevVarCharArray would
have been better...

According to the PyTango documentation, DevString is encoded as latin-1, but it does not seem to be the case in reality.

>>> s = "binary_data\xee\xff"
>>> print(s)
'binary_dataîÿ'
>>> print(s.encode("latin-1"))
b'binary_data\xee\xff'

The text was updated successfully, but these errors were encountered:

AntoineDupre · 2019-01-29T13:49:46Z

Hi Matias,

Tanks for the report, I feel that you have raised some inconsistencies issues in PyTango
I run some extra test and here is my report:

Test report

Command arg_out

I run your example with Python3 Server and a Python3 client and it's working well.

# Client python version 3.7
In [3]: ds37.bytes_str1()
Out[3]: 'cmd_outîÿ'

Python2 server with python2 client is working fine.
Python3 server with pyhon3 client is working fine
As you pointed it, the issue seems to be a compatibility between Python2 server and Python3.

I extended the test with a python3 server and python2 client:

# Client python version 2.7
In [28]: ds37.bytes_str1()
Out[28]: 'cmd_out\xc3\xae\xc3\xbf'

The output fit well with the utf-8 encoding:

In [4]: "binary_data\xee\xff".encode("utf-8")
Out[4]: b'binary_data\xc3\xae\xc3\xbf'

It seems that the tango documentation is some how not telling the truth ...

By looking inside the python boost implementation, I got the following points:

The output of the cpp tango core is a c++ std::String without decoding.
There is nothing in the CppCore that encode or decode in latin-1. (So I guess the encoding is usually manage by the client application layer.)
In command output case, the decoding might be defined by the standard boost decoder (utf-8)

One of issue raised by your example is that PyTango is failing silently:

# code from connection.py

def __Connection__command_inout(self, name, *args, **kwds):
    r = Connection.command_inout_raw(self, name, *args, **kwds)
    if isinstance(r, DeviceData):
        try:
            return r.extract(self.defaultCommandExtractAs)
        except Exception:
            return None
    else:
        return r

In fact, a UnicodeError is raised on the client side because we are using the default boost encoding (Utf-8), and `b"\xee\xff"`` can not be decode in utf-8 (which is quite confusing because any bytes send under b"\x80" would have worked).

What about command arg_in ?

# Python version 3.7
# Let's keep this in mind:
In [1]: "Café".encode("utf-8")
Out[1]: b'Caf\xc3\xa9'

In [2]: "Café".encode("latin-1")
Out[2]: b'Caf\xe9'

My server code

from tango.server import Device
from tango.server import command

class BytesTest(Device):

    @command(dtype_in=str)
    def bytes_in1(self, arg_in):
        print("Server stdout:: arg_in", arg_in)

I tried the same approach with the command argument.
Let's start with python2 only.

# python version 2.7
# first let see how "Café" looks like in python2
In [3]: print(repr("Café"))
'Caf\xc3\xa9'

In [3]: ds27.bytes_in1("Café")
('Server stdout:: arg_in', 'Caf\xc3\xa9')

In [4]: ds27.bytes_in1(" \xee")
('Server stdout:: arg_in', '\xee')

Now let's see between python3 client and python2 server

python version 3.7
In [15]: ds27.bytes_in1("Café")
('Server stdout:: arg_in', 'Caf\xe9')

 ds27.bytes_in1("\xee")
('Server stdout:: arg_in', '\xee')

Python3 client is sending arg_in not in utf-8 format but in latin-1.

Now let's try python3 client with python3 client

# python version 3.7
In [15]: ds37.bytes_in1("\xee")
DevFailed: DevFailed[
DevError[
    desc = UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data
           ...

It's a devfailed, so the error come from the server side.
So ... arg_in are encoded in the client in latin-1 but decoded by the server in utf-8 ...

python version 3.7
# If we pre-encode it is working
ds37.bytes_in1("\xee".encode("utf-8"))
Server stdout:: arg_in î

Attributes

Now let see how attributes behave.

from tango.server import run
from tango.server import Device
from tango.server import attribute
from tango  import AttrWriteType

class BytesTest(Device):
    attr = ""

    bytes_attr_w = attribute(dtype=str, access=AttrWriteType.READ_WRITE)

    def read_bytes_attr_w(self):
        print("read, ",repr(self.attr))
        return self.attr

    def write_bytes_attr_w(self, value):
        print("write, ", repr(value))
        self.attr = value

Python 2.7 server and python3.7 client.

# python version 3.7
In [1]: ds27.bytes_attr_w = "\xee"
Server stdout: ('write, ', "'\\xc3\\xae'")   # client encode in utf-8 
In [2]: ds27.bytes_attr_w
Out[2]: 'î'    # client decode in utf-8
Server stdout: ('read, ', "'\\xc3\\xae'")

Python 3.7 server and python3.7 client.

# python version 3.7
In [1]:  ds37.bytes_attr_w = "\xee"
Server stdout:  write,  'î'     # server decode in utf-8

In [2]: ds37.bytes_attr_w
Server stdout:  read,  'î'                                     
client: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of dat

Failing on the client side, it means that the exception is on the client side.

And python 3.7 server and python 2 client:

In [6]: ds37.bytes_attr_w
Out[6]: '\xee'   # Server encode in latin-1
Server stdout: read,  'î'

Overview

command argout:

Server encode in utf-8
Client decode in utf-8

command argin:

Client encode in latin-1
Server decode in utf-8

Write attribute

Client encode in utf-8
Server decode in utf-8

Read attribute:

Server encode in latin-1
Client decode in utf-8

Conclusion.

First of all, command decoding shall not fail silently.

There is some quiet big inconsistencies in the way of how PyTango deals with string encoding. It is not only a issue between python2 and python3.

Pytango is not always using the default boost encoding/decoding (like in device_data.cpp/insert) and sometimes PyTango force the encoding itself.

This PR can be usefull: #180

I guess for python2 compatibility, we shall probably use latin-1 as a standard (as it is defined in the documentation). It means that python3 user will have to be aware about it (No Unicode code_point above \u00ff can be used).

mguijarr · 2019-01-30T09:49:28Z

@AntoineDupre Thanks a lot for the detailed explanation and for the time you took writing all this !

I agree with your conclusion.

At the moment this is not a blocking issue for us, since we can use direct connection to the serial
line I was talking about -- but it is certainly something to fix in PyTango so let's see if someone takes
the issue.

ajoubertza · 2019-01-30T20:12:03Z

@AntoineDupre Great investigation and report back - thanks! I started looked at it, but got a bit lost in all the boost/C++ wrapper magic. I agree with you conclusion too. The most urgent change is raising an error if the decoding fails.

It would be nice to have automated tests that validate the interaction between different Python versions like this - at least 2.7 and 3.7, but that seems tricky to do with our Travis build...

mguijarr · 2019-01-31T10:18:38Z

@ajoubertza , @AntoineDupre
About boost and C++: are you aware of a pybind11 binding for PyTango ? It is written/currently being written by @grm84 but I have no news about it, I don't know the state. For the future...

grm84 · 2019-01-31T11:23:12Z

@mguijarr, @ajoubertza , @AntoineDupre
Sorry, There's been little progress on pybind11, since the last Tango meeting, due to other work commitments. I hope this will be completed in the future.

tiagocoutinho · 2019-01-31T16:47:06Z

Hi all,
Sorry to join in so late.

First, I don't see how DevSerReadString could work for binary data. I think you should use DevSerReadChar. If you use DevSerReadString to get binary data, if PyTango was working as it should, you risk having exceptions randomly depending if the characters you receive from the serial line fit or not the range 0-127.

Now, some facts first to see if I get this right:

Tango uses c type char as DEV_STRING (not wchar)
As far as I understand from here, CORBA (or at least omniORB) expects c strings (char*) to be delivered as latin-1 encoded.

My understanding is that:

PyTango should just make sure any string that comes from python (in a client or a server, command or attribute) is properly encoded in latin-1 to pass it to the C++ layer:
- in python 2, if a str is given it should assume it is latin-1 from the user
- in python 2, if a unicode is given it should encode as latin-1 str
- in python 3, if bytes is given it should assume it is latin-1 from the user
- in python 3, if a str is given it should encode as latin-1 str
When PyTango receives a DEV_STRING (client or server, attribute or command) from the C++ layer it should assume it is latin-1 encoded and:
- in python 2, create a python str from the direct latin-1, fail if there are non latin-1 valid characters
- in python 3, create a python str decode from latin-1, fail if there are non latin-1 valid characters

I always get confused with this so there might be a mistake above. Please check carefully.
Anyway this is going to be a tough one. I remember when I first did this it involved intercepting the boost standard decoder from char* to python str which was quite tricky.

Any volunteer :-)?

vxgmichel · 2019-01-31T17:27:56Z

fail if there are non latin-1 valid characters

There is no such thing as an invalid latin-1 bytestring :)

>>> bytes(range(256)).decode('latin-1')                                     
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'

However, most unicode code points are not included in latin-1:

>>> "€".encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)

Sounds good otherwise!

andygotz · 2019-02-01T06:04:27Z

Hi,
thanks to all (Anton, Tiago, Vincent) who have commented on this issue. I am not sure there is a problem following Tiago's comments. I propose we take a look at this issue during the kernel meeting next week.

I fully agree with Tiago that any client trying to read binary data from the serial device server MUST use the DevSeReadrChar command. Using strings will cause the first null character to terminate the string.

mguijarr · 2019-02-01T10:11:12Z

It's the all-star game @AntoineDupre @ajoubertza @tiagocoutinho @vxgmichel @andygotz 😄

So, I checked more carefully the calls that we make today to the serial device server I was talking
about first:

DevSerReadLine()
DevSerReadNChar()
DevSerReadRaw()

According to the documentation here, the 3 commands return DevString...

As far as I understand, following @tiagocoutinho proposal, we may be able to replace those calls with:

DevSerReadChar(SL_LINE)
DevSerReadChar(SL_NCHAR) (but then, how do I specify number of characters to read?)
DevSerReadChar(SL_RAW)

Indeed, DevSerReadChar will return bytes in Python 3.

Is it right ?

This does not change the fact that there is a bug/weird behaviour with PyTango as demonstrated
by @AntoineDupre and my issue report.

Thanks for your help guys !

tiagocoutinho · 2019-02-01T14:42:46Z

You are right of course @vxgmichel! Always mess latin-1 with UTF-7 (don't ask me why :-).

@mguijarr, I agree there is a bug in PyTango. Both you and the complete @AntoineDupre analysis make absolute sense.
I made a mistake saying you probably run into trouble with DevSerReadString. Even if the binary data has '\0' in the middle it should work since the corba string is like a C++ string. It contains the buffer and the size in its structure. You would run into trouble only if the DevSerReadString implementation would calculate the string length with something like strlen() which doesn't seem to be the case. Also, as @vxgmichel pointed out there is never an invalid latin-1 so it should be fine.

Having only looked very briefly at the C++ server, my opinion is that you should technically be able to use either DevSerReadString(SL_RAW) or DevSerReadChar(SL_RAW). But, as @andygotz pointed, probably it would be more correct to use DevSerReadChar(SL_RAW).

tiagocoutinho · 2019-02-22T19:56:48Z

I have assigned the bug to me.

Here are my findings so far:

server side	py2	py3
cmd() -> DevString: return bytes	OK	OK
cmd() -> DevString: return unicode	ERR	ERR
cmd(DevString)	OK	ERR
read DevString: return bytes	OK	OK
read DevString: return unicode	OK	OK
write DevString	OK	ERR

client side	py2	py3
cmd() -> DevString	OK	ERR
cmd(DevString) pass bytes	OK	OK
cmd(DevString) pass unicode	OK	OK
read DevString	OK	ERR
write DevString: pass bytes	OK	OK
write DevString: pass unicode	ERR	ERR

In all tests I used the equivalent of a python 3 bytes(range(1,256) or the equivalent string variant when appropriate.

One thing I noticed also is that you cannot put a '\x00'. It will cut the DevString. Need to check if it is on PyTango or C++ side.

Need to check what is happening with TANGO properties as well

tiagocoutinho self-assigned this Feb 22, 2019

tiagocoutinho modified the milestone: 9.2.2 Feb 26, 2019

tiagocoutinho pinned this issue Feb 28, 2019

tiagocoutinho unpinned this issue Mar 11, 2019

tiagocoutinho closed this as completed in d61a363 Mar 13, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python 2 vs Python 3: DevString with bytes #251

Python 2 vs Python 3: DevString with bytes #251

mguijarr commented Jan 25, 2019

AntoineDupre commented Jan 29, 2019 •

edited

Loading

mguijarr commented Jan 30, 2019

ajoubertza commented Jan 30, 2019

mguijarr commented Jan 31, 2019

grm84 commented Jan 31, 2019

tiagocoutinho commented Jan 31, 2019 •

edited

Loading

vxgmichel commented Jan 31, 2019

andygotz commented Feb 1, 2019

mguijarr commented Feb 1, 2019

tiagocoutinho commented Feb 1, 2019

tiagocoutinho commented Feb 22, 2019 •

edited

Loading

Python 2 vs Python 3: DevString with bytes #251

Python 2 vs Python 3: DevString with bytes #251

Comments

mguijarr commented Jan 25, 2019

Scenario

Problem

Minimal example to reproduce the problem

Additional information

AntoineDupre commented Jan 29, 2019 • edited Loading

Test report

Command arg_out

What about command arg_in ?

Attributes

Overview

command argout:

command argin:

Write attribute

Read attribute:

Conclusion.

mguijarr commented Jan 30, 2019

ajoubertza commented Jan 30, 2019

mguijarr commented Jan 31, 2019

grm84 commented Jan 31, 2019

tiagocoutinho commented Jan 31, 2019 • edited Loading

vxgmichel commented Jan 31, 2019

andygotz commented Feb 1, 2019

mguijarr commented Feb 1, 2019

tiagocoutinho commented Feb 1, 2019

tiagocoutinho commented Feb 22, 2019 • edited Loading

AntoineDupre commented Jan 29, 2019 •

edited

Loading

tiagocoutinho commented Jan 31, 2019 •

edited

Loading

tiagocoutinho commented Feb 22, 2019 •

edited

Loading