Merge pull request #2039 from minrk/pyobj
[DOC] warn about and de-emphasize send/recv_pyobj
minrk authored Oct 22, 2024
2 parents b084632 + f4e9f17 commit 30e3189
Showing 8 changed files with 292 additions and 135 deletions.
16 changes: 7 additions & 9 deletions docs/source/api/zmq.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,10 +10,10 @@
## Basic Classes

````{note}
For typing purposes, `zmq.Context` and `zmq.Socket` are Generics,
For typing purposes, {class}`.zmq.Context` and {class}`.zmq.Socket` are Generics,
which means they will accept any Context or Socket implementation.
The base `zmq.Context()` constructor returns the type
The base {class}`zmq.Context()` constructor returns the type
`zmq.Context[zmq.Socket[bytes]]`.
If you are using type annotations and want to _exclude_ the async subclasses,
use the resolved types instead of the base Generics:
Expand All @@ -32,7 +32,7 @@ sock: zmq.SyncSocket
````

### {class}`Context`
## {class}`Context`

```{eval-rst}
.. autoclass:: Context
Expand All @@ -47,7 +47,7 @@ sock: zmq.SyncSocket
```

### {class}`Socket`
## {class}`Socket`

```{eval-rst}
.. autoclass:: Socket
Expand Down Expand Up @@ -81,7 +81,7 @@ sock: zmq.SyncSocket
```

### {class}`Frame`
## {class}`Frame`

```{eval-rst}
.. autoclass:: Frame
Expand All @@ -90,7 +90,7 @@ sock: zmq.SyncSocket
```

### {class}`MessageTracker`
## {class}`MessageTracker`

```{eval-rst}
.. autoclass:: MessageTracker
Expand All @@ -99,9 +99,7 @@ sock: zmq.SyncSocket
```

## Polling

### {class}`Poller`
## {class}`Poller`

```{eval-rst}
.. autoclass:: Poller
Expand Down
19 changes: 16 additions & 3 deletions docs/source/howto/morethanbindings.md
Expand Up @@ -122,11 +122,24 @@ as first-class methods to the {class}`~.zmq.Socket` class. A socket has the meth
{meth}`~.zmq.Socket.send_json` and {meth}`~.zmq.Socket.send_pyobj`, which correspond to sending an
object over the wire after serializing with {mod}`json` and {mod}`pickle` respectively,
and any object sent via those methods can be reconstructed with the
{meth}`~.zmq.Socket.recv_json` and {meth}`~.zmq.Socket.recv_pyobj` methods. Unicode strings are
other objects that are not unambiguously sendable over the wire, so we include
{meth}`~.zmq.Socket.send_string` and {meth}`~.zmq.Socket.recv_string` that simply send bytes
{meth}`~.zmq.Socket.recv_json` and {meth}`~.zmq.Socket.recv_pyobj` methods.

```{warning}
Deserializing with pickle grants the message sender access to arbitrary code execution on the receiver.
Never use `recv_pyobj` on a socket that might receive messages from untrusted sources
before authenticating the sender.
It's always a good idea to enable CURVE security if you can,
or authenticate messages with e.g. HMAC digests or other signing mechanisms.
```

Text strings are other objects that are not unambiguously sendable over the wire, so we include
{meth}`~.zmq.Socket.send_string` and {meth}`~.zmq.Socket.recv_string` that send bytes
after encoding the message ('utf-8' is the default).
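
As a rough illustration of what these two methods do, the sketch below encodes and decodes text by hand. `encode_text` and `decode_text` are hypothetical helpers, not part of pyzmq; with a real socket, `socket.send_string(text)` should behave like `socket.send(text.encode("utf-8"))`.

```python
def encode_text(text: str, encoding: str = "utf-8") -> bytes:
    """Turn text into the bytes that actually travel over the wire."""
    return text.encode(encoding)


def decode_text(frame: bytes, encoding: str = "utf-8") -> str:
    """Inverse of encode_text; must use the same encoding as the sender."""
    return frame.decode(encoding)


text = "na\u00efve caf\u00e9"
frame = encode_text(text)

# Characters and bytes are different things: the two accented
# characters take more than one byte each in UTF-8.
print(len(text), "characters,", len(frame), "bytes")

# Decoding with a different codec does not fail here, it just yields other text.
print(decode_text(frame) == text, frame.decode("latin-1") == text)
```

This is why text is "not unambiguously sendable": both sides have to agree on the encoding, and the receiver cannot infer it from the bytes alone.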

These are all convenience methods, and users are encouraged to build their own serialization that best suits their application's needs,
especially concerning performance and security.

```{seealso}
- {ref}`Further information <serialization>` on serialization in pyzmq.
```
Expand Down
151 changes: 118 additions & 33 deletions docs/source/howto/serialization.md
Expand Up @@ -8,85 +8,170 @@ When sending messages over a network, you often need to marshall your data into

## Builtin serialization

PyZMQ is primarily bindings for libzmq, but we do provide three builtin serialization
PyZMQ is primarily bindings for libzmq, but we do provide some builtin serialization
methods for convenience, to help Python developers learn libzmq. Python has two primary
packages for serializing objects: {py:mod}`json` and {py:mod}`pickle`, so we provide
simple convenience methods for sending and receiving objects serialized with these
modules. A socket has the methods {meth}`~.Socket.send_json` and
modules for serializing objects in the standard library: {py:mod}`json` and {py:mod}`pickle`,
so pyzmq provides simple convenience methods for sending and receiving objects serialized with these modules.
A socket has the methods {meth}`~.Socket.send_json` and
{meth}`~.Socket.send_pyobj`, which correspond to sending an object over the wire after
serializing with json and pickle respectively, and any object sent via those
methods can be reconstructed with the {meth}`~.Socket.recv_json` and
{meth}`~.Socket.recv_pyobj` methods.

These methods are designed for convenience, not for performance, so developers who want
to emphasize performance should use their own serialized send/recv methods.
```{note}
These methods are meant more for convenience and demonstration purposes, not for performance or safety.
Applications should usually define their own serialized send/recv functions.
```

```{warning}
`send/recv_pyobj` are very basic wrappers around `send(pickle.dumps(obj))` and `pickle.loads(recv())`.
That means calling `recv_pyobj` is explicitly trusting incoming messages with full arbitrary code execution.
Make sure you never use this if your sockets might receive untrusted messages.
You can protect your sockets by e.g.:
- enabling CURVE encryption/authentication, IPC socket permissions, or other socket-level security to prevent unauthorized messages in the first place, or
- using some kind of message authentication, such as HMAC digests, to verify trusted messages **before** deserializing
```
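
To make the warning concrete, here is a deliberately harmless sketch of why unpickling counts as code execution. `Payload` is a hypothetical class, and `eval("6 * 7")` stands in for whatever a hostile sender might choose to run; no socket is involved, since the behavior comes from {py:mod}`pickle` itself.

```python
import pickle


class Payload:
    """Stand-in for a hostile message (hypothetical name)."""

    def __reduce__(self):
        # Instead of the object's state, the pickle records
        # "call eval('6 * 7') when loading".
        return (eval, ("6 * 7",))


wire_bytes = pickle.dumps(Payload())

# The receiver does not get a Payload back:
# pickle.loads() calls eval() on the sender's behalf.
result = pickle.loads(wire_bytes)
print(type(result).__name__, result)
```

The receiver needs no knowledge of `Payload` for this to work, so checking the type of the object *after* `pickle.loads` returns is already too late.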

## Using your own serialization

In general, you will want to provide your own serialization that is optimized for your
application or library availability. This may include using your own preferred
serialization ([^cite_msgpack], [^cite_protobuf]), or adding compression via [^cite_zlib] in the standard
library, or the super fast [^cite_blosc] library.
application goals or library availability. This may include using your own preferred
serialization such as [msgpack] or [msgspec],
or adding compression via {py:mod}`zlib` in the standard library,
or the super fast [blosc] library.
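
As a minimal sketch of this idea using only the standard library, the following pair of functions combines {py:mod}`json` with {py:mod}`zlib`. The names `dump_compressed_json` and `load_compressed_json` are made up for this example, and whether compression pays off depends on your data.

```python
import json
import zlib
from typing import Any


def dump_compressed_json(msg: Any) -> list[bytes]:
    """Serialize msg to JSON, compress it, and return a one-frame message."""
    raw = json.dumps(msg).encode("utf8")
    return [zlib.compress(raw)]


def load_compressed_json(frames: list[bytes]) -> Any:
    """Inverse of dump_compressed_json."""
    raw = zlib.decompress(frames[0])
    return json.loads(raw.decode("utf8"))


msg = {"readings": [0.0] * 1000}
frames = dump_compressed_json(msg)
assert load_compressed_json(frames) == msg

# Repetitive data compresses well; small or random payloads may not.
print(len(json.dumps(msg)), "bytes of JSON ->", len(frames[0]), "bytes on the wire")

# With a socket, these would plug in as:
#   socket.send_serialized(msg, serialize=dump_compressed_json)
#   msg = socket.recv_serialized(load_compressed_json)
```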

```{warning}
If handling a message can _do_ things (especially if you use something like pickle for serialization, which you should _please_ avoid if you can help it),
make sure you never take action on a message without validating its origin.
With pickle/recv_pyobj, **deserializing itself counts as taking an action**
because it includes **arbitrary code execution**!
```

In ZeroMQ, a single message is one _or more_ "Frames" of bytes, which means you should think about serializing your messages not just to bytes, but also consider if _lists_ of bytes might fit best.
Multipart messages let you serialize a message with a metadata header without copying potentially large message contents, and without losing the atomicity of message delivery.
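
A small sketch of that header-plus-payload layout, under the assumption that the header is a JSON-serializable dict and the payload is already bytes. `pack` and `unpack` are hypothetical names, not pyzmq API.

```python
import json


def pack(header: dict, payload: bytes) -> list[bytes]:
    """Return [header_frame, payload_frame]; the payload is passed through untouched."""
    return [json.dumps(header).encode("utf8"), payload]


def unpack(frames: list[bytes]) -> tuple[dict, bytes]:
    """Inverse of pack."""
    header_frame, payload = frames
    return json.loads(header_frame.decode("utf8")), payload


payload = bytes(1_000_000)  # stand-in for a large blob
frames = pack({"name": "blob", "size": len(payload)}, payload)

# Only the small header was serialized; the large frame is the very same object.
print(frames[1] is payload)

header, body = unpack(frames)
print(header["size"] == len(body))

# With a socket: socket.send_multipart(frames) / frames = socket.recv_multipart()
```

Compare this with embedding the payload inside a single JSON or pickle document, which would require building a second, larger buffer that contains a copy of the payload.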

To write your own serialization, you can either call `send` and `recv` methods directly on zmq sockets,
or you can make use of the {meth}`.Socket.send_serialized` / {meth}`.Socket.recv_serialized` methods.
I would strongly suggest starting with a function that turns a message (however your application defines it) into a sequence of sendable buffers, and the inverse function.

For example:

```python
socket.send_json(msg)
msg = socket.recv_json()
```

is equivalent to

```python
import json
from typing import Any


def json_dump_bytes(msg: Any) -> list[bytes]:
    return [json.dumps(msg).encode("utf8")]

There are two simple models for implementing your own serialization: write a function
that takes the socket as an argument, or subclass Socket for use in your own apps.

def json_load_bytes(msg_list: list[bytes]) -> Any:
    return json.loads(msg_list[0].decode("utf8"))


socket.send_multipart(json_dump_bytes(msg))
msg = json_load_bytes(socket.recv_multipart())
# or
socket.send_serialized(msg, serialize=json_dump_bytes)
msg = socket.recv_serialized(json_load_bytes)
```

### Example: pickling Python objects

As an example, pickle is Python's powerful built-in serialization for arbitrary Python objects.
Two potential issues you might face:

1. sometimes it is inefficient, and
1. `pickle.loads` enables arbitrary code execution

For instance, pickles can often be reduced substantially in size by compressing the data.
The following will send *compressed* pickles over the wire:
We also want to make sure we don't call `pickle.loads` on any untrusted messages.
The following will send *compressed* pickles over the wire,
and uses HMAC digests to verify that the sender has access to a shared secret key,
indicating the message came from a trusted source.

```python
import hashlib
import hmac
import pickle
import zlib


def send_zipped_pickle(socket, obj, flags=0, protocol=pickle.HIGHEST_PROTOCOL):
    """pickle an object, and zip the pickle before sending it"""
def sign(key: bytes, msg: bytes) -> bytes:
    """Compute the HMAC digest of msg, given signing key `key`"""
    return hmac.HMAC(
        key,
        msg,
        digestmod=hashlib.sha256,
    ).digest()


def send_signed_zipped_pickle(
    socket, obj, flags=0, *, key, protocol=pickle.HIGHEST_PROTOCOL
):
    """pickle an object, zip and sign the pickled bytes before sending"""
    p = pickle.dumps(obj, protocol)
    z = zlib.compress(p)
    return socket.send(z, flags=flags)
    signature = sign(key, z)
    return socket.send_multipart([signature, z], flags=flags)


def recv_zipped_pickle(socket, flags=0):
    """inverse of send_zipped_pickle"""
    z = socket.recv(flags)
def recv_signed_zipped_pickle(socket, flags=0, *, key):
    """inverse of send_signed_zipped_pickle"""
    sig, z = socket.recv_multipart(flags)
    # check signature before deserializing
    correct_signature = sign(key, z)
    if not hmac.compare_digest(sig, correct_signature):
        raise ValueError("invalid signature")
    p = zlib.decompress(z)
    return pickle.loads(p)
```
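
The same signing scheme can be exercised without any sockets, which makes the failure modes easier to see. This is only a sketch: `pack_signed` and `unpack_signed` are hypothetical socket-free variants of the functions above, and the key is a placeholder that a real application would load from a secure location.

```python
import hashlib
import hmac
import pickle
import zlib


def sign(key: bytes, msg: bytes) -> bytes:
    """Compute the HMAC-SHA256 digest of msg with the shared key."""
    return hmac.new(key, msg, digestmod=hashlib.sha256).digest()


def pack_signed(obj, key: bytes) -> list[bytes]:
    """Return [signature, compressed_pickle] frames for obj."""
    z = zlib.compress(pickle.dumps(obj))
    return [sign(key, z), z]


def unpack_signed(frames: list[bytes], key: bytes):
    """Verify the signature first; only then decompress and unpickle."""
    sig, z = frames
    if not hmac.compare_digest(sig, sign(key, z)):
        raise ValueError("invalid signature")
    return pickle.loads(zlib.decompress(z))


key = b"not-a-real-secret"  # placeholder key for the example
frames = pack_signed({"hi": 1234}, key)
print(unpack_signed(frames, key))

# Modified payloads and wrong keys are both rejected before pickle.loads runs.
tampered = [frames[0], frames[1] + b"x"]
for bad_frames, bad_key in [(tampered, key), (frames, b"wrong-key")]:
    try:
        unpack_signed(bad_frames, bad_key)
    except ValueError as e:
        print("rejected:", e)
```

Note that a signature of this kind only shows that the sender holds the key. It does not stop a valid message from being recorded and replayed, and it does not encrypt the contents.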

### Example: numpy arrays

A common data structure in Python is the numpy array. PyZMQ supports sending
numpy arrays without copying any data, since they provide the Python buffer interface.
However just the buffer is not enough information to reconstruct the array on the
receiving side. Here is an example of a send/recv pair that allows non-copying
However, just the buffer is not enough information to reconstruct the array on the
receiving side because it arrives as a 1-D array of bytes.
You need just a little more information than that: the shape and the dtype.

Here is an example of a send/recv pair that allows non-copying
sends/recvs of numpy arrays including the dtype/shape data necessary for reconstructing
the array.
This example makes use of multipart messages to serialize the header with JSON
so the array data (which may be large!) doesn't need any unnecessary copies.

```python
import numpy
import zmq


def send_array(socket, A, flags=0, copy=True, track=False):
def send_array(
    socket: zmq.Socket,
    A: numpy.ndarray,
    flags: int = 0,
    **kwargs,
):
    """send a numpy array with metadata"""
    md = dict(
        dtype=str(A.dtype),
        shape=A.shape,
    )
    socket.send_json(md, flags | zmq.SNDMORE)
    return socket.send(A, flags, copy=copy, track=track)
    return socket.send(A, flags, **kwargs)


def recv_array(socket, flags=0, copy=True, track=False):
def recv_array(socket: zmq.Socket, flags: int = 0, **kwargs) -> numpy.ndarray:
    """recv a numpy array"""
    md = socket.recv_json(flags=flags)
    msg = socket.recv(flags=flags, copy=copy, track=track)
    buf = memoryview(msg)
    A = numpy.frombuffer(buf, dtype=md["dtype"])
    msg = socket.recv(flags=flags, **kwargs)
    A = numpy.frombuffer(msg, dtype=md["dtype"])
    return A.reshape(md["shape"])
```
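
The reason the metadata frame is needed can be shown without numpy or a socket. The sketch below uses the standard-library `array` module and `memoryview.cast` as a stand-in for `numpy.frombuffer(...).reshape(...)`; the `md` dict and its keys are hypothetical and only mirror the dtype/shape header used above.

```python
import array

# Six doubles that the sender thinks of as a 2x3 matrix.
original = array.array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
raw = original.tobytes()  # what arrives in a frame: flat bytes, nothing else

# Hypothetical metadata header, playing the role of dtype and shape.
md = {"format": "d", "shape": [2, 3]}

flat = memoryview(raw)
# On its own, the buffer is just a 1-D run of unsigned bytes.
print(flat.format, flat.ndim)

# With the header, the same memory can be viewed as typed 2-D data, without copying.
restored = flat.cast(md["format"], shape=md["shape"])
print(restored.tolist())
```

Without the header, a receiver could not tell six doubles from twelve 32-bit floats, or a 2x3 layout from a 3x2 one.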

[^cite_msgpack]: Message Pack serialization library <https://msgpack.org>

[^cite_protobuf]: Google Protocol Buffers <https://github.com/protocolbuffers/protobuf>

[^cite_zlib]: Python stdlib module for zip compression: {py:mod}`zlib`

[^cite_blosc]: Blosc: A blocking, shuffling and loss-less (and crazy-fast) compression library <https://www.blosc.org>
[blosc]: https://www.blosc.org
[msgpack]: https://msgpack.org
[msgspec]: https://jcristharif.com/msgspec/
18 changes: 9 additions & 9 deletions examples/gevent/simple.py
@@ -1,4 +1,4 @@
from typing import Optional
from __future__ import annotations

from gevent import spawn, spawn_later

Expand All @@ -10,13 +10,13 @@
sock = ctx.socket(zmq.PUSH)
sock.bind('ipc:///tmp/zmqtest')

spawn(sock.send_pyobj, ('this', 'is', 'a', 'python', 'tuple'))
spawn_later(1, sock.send_pyobj, {'hi': 1234})
spawn(sock.send_json, ['this', 'is', 'a', 'list'])
spawn_later(1, sock.send_json, {'hi': 1234})
spawn_later(
    2, sock.send_pyobj, ({'this': ['is a more complicated object', ':)']}, 42, 42, 42)
    2, sock.send_json, ({'this': ['is a more complicated object', ':)']}, 42, 42, 42)
)
spawn_later(3, sock.send_pyobj, 'foobar')
spawn_later(4, sock.send_pyobj, 'quit')
spawn_later(3, sock.send_json, 'foobar')
spawn_later(4, sock.send_json, 'quit')


# client
Expand All @@ -27,14 +27,14 @@

def get_objs(sock: zmq.Socket):
    while True:
        o = sock.recv_pyobj()
        print('received python object:', o)
        o = sock.recv_json()
        print('received:', o)
        if o == 'quit':
            print('exiting.')
            break


def print_every(s: str, t: Optional[float] = None):
def print_every(s: str, t: float | None = None):
    print(s)
    if t:
        spawn_later(t, print_every, s, t)
Expand Down