Skip to content
This repository has been archived by the owner on Apr 7, 2022. It is now read-only.

Python 2 vs Python 3: DevString with bytes #251

Closed
mguijarr opened this issue Jan 25, 2019 · 11 comments
Closed

Python 2 vs Python 3: DevString with bytes #251

mguijarr opened this issue Jan 25, 2019 · 11 comments
Assignees

Comments

@mguijarr
Copy link
Contributor

Hi all,

I noticed an intriguing behavior of PyTango running within a Python 3 environment.

Scenario

An old, existing C++ server to control a RS232 serial line. The server has a DevSerReadString
method to read chars from the serial line ; Tango output data type is DevString.
The equipment connected to the serial line sends binary data.

Problem

When calling the command from a Python 2 client, everything works fine -- the returned string corresponds to the bytes sent by the equipment.
When calling the command from a Python 3 client, None is returned.

Minimal example to reproduce the problem

  • Server in Python 2

    from tango.server import run
    from tango.server import Device
    from tango.server import attribute, command
    
    class BytesTest(Device):
        @command(dtype_out=str)
        def bytes_str1(self):
            return "binary_data\xee\xff"
    
    if __name__ == "__main__":
        run((BytesTest,))
  • Python 3 client

     Python 3.7.1 (default, Dec 14 2018, 19:28:38) 
     [GCC 7.3.0] :: Anaconda, Inc. on linux
     Type "help", "copyright", "credits" or "license" for more information.
     >>> import tango; dev=tango.DeviceProxy("id00/tango/test")
     >>> print(dev.bytes_str1())
     None
     >>> 

Additional information

Of course the problem comes from bytes/unicode handling in Python. It is also probably a
bad idea to use DevString for this kind of data transfer, maybe DevVarCharArray would
have been better...

According to the PyTango documentation, DevString is encoded as latin-1, but it does not seem to be the case in reality.

>>> s = "binary_data\xee\xff"
>>> print(s)
'binary_dataîÿ'
>>> print(s.encode("latin-1"))
b'binary_data\xee\xff'
@AntoineDupre
Copy link
Contributor

AntoineDupre commented Jan 29, 2019

Hi Matias,

Tanks for the report, I feel that you have raised some inconsistencies issues in PyTango
I run some extra test and here is my report:

Test report

Command arg_out

I run your example with Python3 Server and a Python3 client and it's working well.

# Client python version 3.7
In [3]: ds37.bytes_str1()
Out[3]: 'cmd_outîÿ'

Python2 server with python2 client is working fine.
Python3 server with pyhon3 client is working fine
As you pointed it, the issue seems to be a compatibility between Python2 server and Python3.

I extended the test with a python3 server and python2 client:

# Client python version 2.7
In [28]: ds37.bytes_str1()
Out[28]: 'cmd_out\xc3\xae\xc3\xbf'

The output fit well with the utf-8 encoding:

In [4]: "binary_data\xee\xff".encode("utf-8")
Out[4]: b'binary_data\xc3\xae\xc3\xbf'

It seems that the tango documentation is some how not telling the truth ...

By looking inside the python boost implementation, I got the following points:

  • The output of the cpp tango core is a c++ std::String without decoding.
  • There is nothing in the CppCore that encode or decode in latin-1. (So I guess the encoding is usually manage by the client application layer.)
  • In command output case, the decoding might be defined by the standard boost decoder (utf-8)

One of issue raised by your example is that PyTango is failing silently:

# code from connection.py

def __Connection__command_inout(self, name, *args, **kwds):
    r = Connection.command_inout_raw(self, name, *args, **kwds)
    if isinstance(r, DeviceData):
        try:
            return r.extract(self.defaultCommandExtractAs)
        except Exception:
            return None
    else:
        return r

In fact, a UnicodeError is raised on the client side because we are using the default boost encoding (Utf-8), and `b"\xee\xff"`` can not be decode in utf-8 (which is quite confusing because any bytes send under b"\x80" would have worked).

What about command arg_in ?

# Python version 3.7
# Let's keep this in mind:
In [1]: "Café".encode("utf-8")
Out[1]: b'Caf\xc3\xa9'

In [2]: "Café".encode("latin-1")
Out[2]: b'Caf\xe9'

My server code

from tango.server import Device
from tango.server import command

class BytesTest(Device):

    @command(dtype_in=str)
    def bytes_in1(self, arg_in):
        print("Server stdout:: arg_in", arg_in)

I tried the same approach with the command argument.
Let's start with python2 only.

# python version 2.7
# first let see how "Café" looks like in python2
In [3]: print(repr("Café"))
'Caf\xc3\xa9'

In [3]: ds27.bytes_in1("Café")
('Server stdout:: arg_in', 'Caf\xc3\xa9')

In [4]: ds27.bytes_in1(" \xee")
('Server stdout:: arg_in', '\xee')

Now let's see between python3 client and python2 server

python version 3.7
In [15]: ds27.bytes_in1("Café")
('Server stdout:: arg_in', 'Caf\xe9')

 ds27.bytes_in1("\xee")
('Server stdout:: arg_in', '\xee')

Python3 client is sending arg_in not in utf-8 format but in latin-1.

Now let's try python3 client with python3 client

# python version 3.7
In [15]: ds37.bytes_in1("\xee")
DevFailed: DevFailed[
DevError[
    desc = UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data
           ...

It's a devfailed, so the error come from the server side.
So ... arg_in are encoded in the client in latin-1 but decoded by the server in utf-8 ...

python version 3.7
# If we pre-encode it is working
ds37.bytes_in1("\xee".encode("utf-8"))
Server stdout:: arg_in î

Attributes

Now let see how attributes behave.

from tango.server import run
from tango.server import Device
from tango.server import attribute
from tango  import AttrWriteType

class BytesTest(Device):
    attr = ""

    bytes_attr_w = attribute(dtype=str, access=AttrWriteType.READ_WRITE)

    def read_bytes_attr_w(self):
        print("read, ",repr(self.attr))
        return self.attr

    def write_bytes_attr_w(self, value):
        print("write, ", repr(value))
        self.attr = value

Python 2.7 server and python3.7 client.

# python version 3.7
In [1]: ds27.bytes_attr_w = "\xee"
Server stdout: ('write, ', "'\\xc3\\xae'")   # client encode in utf-8 
In [2]: ds27.bytes_attr_w
Out[2]: 'î'    # client decode in utf-8
Server stdout: ('read, ', "'\\xc3\\xae'")

Python 3.7 server and python3.7 client.

# python version 3.7
In [1]:  ds37.bytes_attr_w = "\xee"
Server stdout:  write,  'î'     # server decode in utf-8

In [2]: ds37.bytes_attr_w
Server stdout:  read,  'î'                                     
client: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of dat

Failing on the client side, it means that the exception is on the client side.

And python 3.7 server and python 2 client:

In [6]: ds37.bytes_attr_w
Out[6]: '\xee'   # Server encode in latin-1
Server stdout: read,  'î'

Overview

command argout:

  • Server encode in utf-8
  • Client decode in utf-8

command argin:

  • Client encode in latin-1
  • Server decode in utf-8

Write attribute

  • Client encode in utf-8
  • Server decode in utf-8

Read attribute:

  • Server encode in latin-1
  • Client decode in utf-8

Conclusion.

First of all, command decoding shall not fail silently.

There is some quiet big inconsistencies in the way of how PyTango deals with string encoding. It is not only a issue between python2 and python3.

Pytango is not always using the default boost encoding/decoding (like in device_data.cpp/insert) and sometimes PyTango force the encoding itself.

This PR can be usefull: #180

I guess for python2 compatibility, we shall probably use latin-1 as a standard (as it is defined in the documentation). It means that python3 user will have to be aware about it (No Unicode code_point above \u00ff can be used).

@mguijarr
Copy link
Contributor Author

@AntoineDupre Thanks a lot for the detailed explanation and for the time you took writing all this !

I agree with your conclusion.

At the moment this is not a blocking issue for us, since we can use direct connection to the serial
line I was talking about -- but it is certainly something to fix in PyTango so let's see if someone takes
the issue.

@ajoubertza
Copy link
Member

@AntoineDupre Great investigation and report back - thanks! I started looked at it, but got a bit lost in all the boost/C++ wrapper magic. I agree with you conclusion too. The most urgent change is raising an error if the decoding fails.

It would be nice to have automated tests that validate the interaction between different Python versions like this - at least 2.7 and 3.7, but that seems tricky to do with our Travis build...

@mguijarr
Copy link
Contributor Author

@ajoubertza , @AntoineDupre
About boost and C++: are you aware of a pybind11 binding for PyTango ? It is written/currently being written by @grm84 but I have no news about it, I don't know the state. For the future...

@grm84
Copy link
Collaborator

grm84 commented Jan 31, 2019

@mguijarr, @ajoubertza , @AntoineDupre
Sorry, There's been little progress on pybind11, since the last Tango meeting, due to other work commitments. I hope this will be completed in the future.

@tiagocoutinho
Copy link
Contributor

tiagocoutinho commented Jan 31, 2019

Hi all,
Sorry to join in so late.

First, I don't see how DevSerReadString could work for binary data. I think you should use DevSerReadChar. If you use DevSerReadString to get binary data, if PyTango was working as it should, you risk having exceptions randomly depending if the characters you receive from the serial line fit or not the range 0-127.

Now, some facts first to see if I get this right:

  • Tango uses c type char as DEV_STRING (not wchar)
  • As far as I understand from here, CORBA (or at least omniORB) expects c strings (char*) to be delivered as latin-1 encoded.

My understanding is that:

  • PyTango should just make sure any string that comes from python (in a client or a server, command or attribute) is properly encoded in latin-1 to pass it to the C++ layer:
    • in python 2, if a str is given it should assume it is latin-1 from the user
    • in python 2, if a unicode is given it should encode as latin-1 str
    • in python 3, if bytes is given it should assume it is latin-1 from the user
    • in python 3, if a str is given it should encode as latin-1 str
  • When PyTango receives a DEV_STRING (client or server, attribute or command) from the C++ layer it should assume it is latin-1 encoded and:
    • in python 2, create a python str from the direct latin-1, fail if there are non latin-1 valid characters
    • in python 3, create a python str decode from latin-1, fail if there are non latin-1 valid characters

I always get confused with this so there might be a mistake above. Please check carefully.
Anyway this is going to be a tough one. I remember when I first did this it involved intercepting the boost standard decoder from char* to python str which was quite tricky.

Any volunteer :-)?

@vxgmichel
Copy link
Contributor

fail if there are non latin-1 valid characters

There is no such thing as an invalid latin-1 bytestring :)

>>> bytes(range(256)).decode('latin-1')                                     
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'

However, most unicode code points are not included in latin-1:

>>> "€".encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)

Sounds good otherwise!

@andygotz
Copy link

andygotz commented Feb 1, 2019

Hi,
thanks to all (Anton, Tiago, Vincent) who have commented on this issue. I am not sure there is a problem following Tiago's comments. I propose we take a look at this issue during the kernel meeting next week.

I fully agree with Tiago that any client trying to read binary data from the serial device server MUST use the DevSeReadrChar command. Using strings will cause the first null character to terminate the string.

@mguijarr
Copy link
Contributor Author

mguijarr commented Feb 1, 2019

It's the all-star game @AntoineDupre @ajoubertza @tiagocoutinho @vxgmichel @andygotz 😄

So, I checked more carefully the calls that we make today to the serial device server I was talking
about first:

  • DevSerReadLine()
  • DevSerReadNChar()
  • DevSerReadRaw()

According to the documentation here, the 3 commands return DevString...

As far as I understand, following @tiagocoutinho proposal, we may be able to replace those calls with:

  • DevSerReadChar(SL_LINE)
  • DevSerReadChar(SL_NCHAR) (but then, how do I specify number of characters to read?)
  • DevSerReadChar(SL_RAW)

Indeed, DevSerReadChar will return bytes in Python 3.

Is it right ?

This does not change the fact that there is a bug/weird behaviour with PyTango as demonstrated
by @AntoineDupre and my issue report.

Thanks for your help guys !

@tiagocoutinho
Copy link
Contributor

You are right of course @vxgmichel! Always mess latin-1 with UTF-7 (don't ask me why :-).

@mguijarr, I agree there is a bug in PyTango. Both you and the complete @AntoineDupre analysis make absolute sense.
I made a mistake saying you probably run into trouble with DevSerReadString. Even if the binary data has '\0' in the middle it should work since the corba string is like a C++ string. It contains the buffer and the size in its structure. You would run into trouble only if the DevSerReadString implementation would calculate the string length with something like strlen() which doesn't seem to be the case. Also, as @vxgmichel pointed out there is never an invalid latin-1 so it should be fine.

Having only looked very briefly at the C++ server, my opinion is that you should technically be able to use either DevSerReadString(SL_RAW) or DevSerReadChar(SL_RAW). But, as @andygotz pointed, probably it would be more correct to use DevSerReadChar(SL_RAW).

@tiagocoutinho tiagocoutinho self-assigned this Feb 22, 2019
@tiagocoutinho
Copy link
Contributor

tiagocoutinho commented Feb 22, 2019

I have assigned the bug to me.

Here are my findings so far:

server side py2 py3
cmd() -> DevString: return bytes OK OK
cmd() -> DevString: return unicode ERR ERR
cmd(DevString) OK ERR
read DevString: return bytes OK OK
read DevString: return unicode OK OK
write DevString OK ERR
client side py2 py3
cmd() -> DevString OK ERR
cmd(DevString) pass bytes OK OK
cmd(DevString) pass unicode OK OK
read DevString OK ERR
write DevString: pass bytes OK OK
write DevString: pass unicode ERR ERR

In all tests I used the equivalent of a python 3 bytes(range(1,256) or the equivalent string variant when appropriate.

One thing I noticed also is that you cannot put a '\x00'. It will cut the DevString. Need to check if it is on PyTango or C++ side.

Need to check what is happening with TANGO properties as well

@tiagocoutinho tiagocoutinho modified the milestone: 9.2.2 Feb 26, 2019
@tiagocoutinho tiagocoutinho pinned this issue Feb 28, 2019
@tiagocoutinho tiagocoutinho unpinned this issue Mar 11, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants