Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(t) replication spawn error #2766 #2772

Closed

Conversation

phillxnet
Copy link
Member

Update replication code re Py3.*

  • Modernise previously missed replication imports re Py3.*
  • Force bytes format for replication messages and commands. Required as zmq needs bytes format for these.
  • Minor modification re Pythnon 3 behaviour re dict.keys(), we replied on an implicit Python 2 behaviour.
  • Move to Fstrings.
  • Parameter/return type hinting.
  • Removed an unused local variable.
  • black format update

Closes #2766

Supersedes prior draft PR #2769 to accommodate required re-base.

Update replication code re Py3.*
- Modernise previously missed replication imports re Py3.*
- Force bytes format for replication messages and commands.
Required as zmq needs bytes format for these.
- Minor modification re Pythnon 3 behaviour re dict.keys(),
we replied on an implicit Python 2 behaviour.
- Move to Fstrings.
- Parameter/return type hinting.
- Removed an unused local variable.
- black format update
- complete fstrings conversion for all files modified.
@phillxnet
Copy link
Member Author

phillxnet commented Dec 29, 2023

Ongoing dev notes:

To date we now have:

Sender:

[29/Dec/2023 15:50:02] DEBUG [smart_manager.replication.listener_broker:285] new-send request received for 1
[29/Dec/2023 15:50:02] DEBUG [smart_manager.replication.listener_broker:116] Starting a new Sender(029ea547-da0b-4c23-b4f9-53c02bb7c283_1).
[29/Dec/2023 15:50:02] DEBUG [smart_manager.replication.sender:108] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-1 Initial greeting: {'pool': 'test-pool', 'share': 'rep-source-share', 'snap': 'rep-source-share_1_replication_1', 'incremental': False, 'uuid': '029ea547-da0b-4c23-b4f9-53c02bb7c283'}
[29/Dec/2023 15:50:03] DEBUG [system.osi:235] Running command: /usr/bin/systemctl --lines=0 status sshd
[29/Dec/2023 15:50:04] DEBUG [system.osi:235] Running command: /usr/sbin/btrfs subvolume list /mnt2/rock-pool
[29/Dec/2023 15:50:10] DEBUG [smart_manager.replication.sender:291] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-1. No response from receiver. Number of retry attempts left: 9
[29/Dec/2023 15:50:10] DEBUG [smart_manager.replication.sender:108] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-1 Initial greeting: {'pool': 'test-pool', 'share': 'rep-source-share', 'snap': 'rep-source-share_1_replication_1', 'incremental': False, 'uuid': '029ea547-da0b-4c23-b4f9-53c02bb7c283'}
[29/Dec/2023 15:50:10] ERROR [smart_manager.replication.sender:75] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-1. b'b\'receiver-init-error\' received for 029ea547-da0b-4c23-b4f9-53c02bb7c283-1. extended reply: b"Receiver(b\'029ea547-da0b-4c23-b4f9-53c02bb7c283-1\') already exists. Will not start a new one.". Aborting.'. Exception: b'b\'receiver-init-error\' received for 029ea547-da0b-4c23-b4f9-53c02bb7c283-1. extended reply: b"Receiver(b\'029ea547-da0b-4c23-b4f9-53c02bb7c283-1\') already exists. Will not start a new one.". Aborting.'

Receiver:

[edited]

[29/Dec/2023 15:50:02] DEBUG [smart_manager.replication.listener_broker:237] initial greeting from b'029ea547-da0b-4c23-b4f9-53c02bb7c283-1'
[29/Dec/2023 15:50:02] DEBUG [smart_manager.replication.listener_broker:268] New Receiver(b'029ea547-da0b-4c23-b4f9-53c02bb7c283-1') started.
[29/Dec/2023 15:50:02] DEBUG [smart_manager.replication.receiver:153] Id: b'029ea547-da0b-4c23-b4f9-53c02bb7c283-1'. Starting a new Receiver for meta: {'pool': 'test-pool', 'share': 'rep-source-share', 'snap': 'rep-source-share_1_replication_1', 'incremental': False, 'uuid': '029ea547-da0b-4c23-b4f9-53c02bb7c283'}
[29/Dec/2023 15:50:02] ERROR [smart_manager.replication.receiver:86] b'Failed to get the sender ip for appliance: 029ea547-da0b-4c23-b4f9-53c02bb7c283'. Exception: Appliance matching query does not exist.

...
[29/Dec/2023 15:50:10] DEBUG [smart_manager.replication.listener_broker:237] initial greeting from b'029ea547-da0b-4c23-b4f9-53c02bb7c283-1'
[29/Dec/2023 15:50:10] ERROR [smart_manager.replication.listener_broker:252] Receiver(b'029ea547-da0b-4c23-b4f9-53c02bb7c283-1') already exists. Will not start a new one.

- Improve diagnostic content of receiver failing to
retrieve senders IP address from sent appliance ID.
- more debug logging
- remove receiver 'latest_snap or b""' to simplify for debug.
@phillxnet
Copy link
Member Author

Sender

[29/Dec/2023 19:50:02] DEBUG [smart_manager.replication.listener_broker:285] new-send request received for 6
[29/Dec/2023 19:50:02] DEBUG [smart_manager.replication.listener_broker:116] Starting a new Sender(029ea547-da0b-4c23-b4f9-53c02bb7c283_6).
[29/Dec/2023 19:50:02] DEBUG [smart_manager.replication.sender:108] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-6 Initial greeting: {'pool': 'test-pool', 'share': 'rep-source-share', 'snap': 'rep-source-share_6_replication_1', 'incremental': False, 'uuid': '029ea547-da0b-4c23-b4f9-53c02bb7c283'}
[29/Dec/2023 19:50:03] DEBUG [system.osi:235] Running command: /usr/sbin/btrfs subvolume list /mnt2/rock-pool
[29/Dec/2023 19:50:09] DEBUG [smart_manager.replication.sender:253] Sender socks dict = {}
[29/Dec/2023 19:50:09] DEBUG [smart_manager.replication.sender:292] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-6. No response from receiver. Number of retry attempts left: 9
[29/Dec/2023 19:50:09] DEBUG [smart_manager.replication.sender:108] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-6 Initial greeting: {'pool': 'test-pool', 'share': 'rep-source-share', 'snap': 'rep-source-share_6_replication_1', 'incremental': False, 'uuid': '029ea547-da0b-4c23-b4f9-53c02bb7c283'}
[29/Dec/2023 19:50:09] DEBUG [smart_manager.replication.sender:253] Sender socks dict = {<zmq.Socket(zmq.DEALER) at 0x7fcdcb809cc0>: 1}
[29/Dec/2023 19:50:09] ERROR [smart_manager.replication.sender:75] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-6. b'b\'receiver-init-error\' received for 029ea547-da0b-4c23-b4f9-53c02bb7c283-6. extended reply: b"Receiver(b\'029ea547-da0b-4c23-b4f9-53c02bb7c283-6\') already exists. Will not start a new one.". Aborting.'. Exception: b'b\'receiver-init-error\' received for 029ea547-da0b-4c23-b4f9-53c02bb7c283-6. extended reply: b"Receiver(b\'029ea547-da0b-4c23-b4f9-53c02bb7c283-6\') already exists. Will not start a new one.". Aborting.'

Receiver

[29/Dec/2023 19:50:02] DEBUG [smart_manager.replication.listener_broker:237] initial greeting from b'029ea547-da0b-4c23-b4f9-53c02bb7c283-6'
[29/Dec/2023 19:50:02] DEBUG [smart_manager.replication.listener_broker:268] New Receiver(b'029ea547-da0b-4c23-b4f9-53c02bb7c283-6') started.
[29/Dec/2023 19:50:02] DEBUG [smart_manager.replication.receiver:155] Id: b'029ea547-da0b-4c23-b4f9-53c02bb7c283-6'. Starting a new Receiver for meta: {'pool': 'test-pool', 'share': 'rep-source-share', 'snap': 'rep-source-share_6_replication_1', 'incremental': False, 'uuid': '029ea547-da0b-4c23-b4f9-53c02bb7c283'}
[29/Dec/2023 19:50:03] DEBUG [smart_manager.replication.receiver:129] _send_recv called with command: b'receiver-ready', msg: b''.
[29/Dec/2023 19:50:09] DEBUG [smart_manager.replication.listener_broker:237] initial greeting from b'029ea547-da0b-4c23-b4f9-53c02bb7c283-6'
[29/Dec/2023 19:50:09] ERROR [smart_manager.replication.listener_broker:252] Receiver(b'029ea547-da0b-4c23-b4f9-53c02bb7c283-6') already exists. Will not start a new one.

- More debug logging.
- reduce retry iterations from 10 to 3.
- more type hints.
- Yet more debug logging.
- Remove use of None from within zmq command/message passing:
to help with stricter type hinting.
- return prior latest_snap name send as message in receiver-ready.
- remove libzmq socker.set_hwm.
- refactor poll -> poller socks -> events for readability.
- additional explanatory comments re sockets etc.
- formatting typo re fstrings.
- additional typing.
- Enable tracker on receiver's response: to assist debug.
- Enable tracker on listender_broker and sender:
- add zmq_version and libzmq_version properties to sender,
receiver also now has these.
- Enable trackers on listener_broker.
- more type-hinting.
@phillxnet
Copy link
Member Author

Sender

[04/Jan/2024 20:04:57] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:04:59] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:05:01] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = [(<zmq.Socket(zmq.ROUTER) at 0x7f6881a04670>, 1)]
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.listener_broker:311] new-send request received for 7
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.listener_broker:122] Starting a new Sender(029ea547-da0b-4c23-b4f9-53c02bb7c283_7).
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.sender:101] _init_greeting() CALLED
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.sender:127] SENDER: _send_recv(command=b'sender-ready', msg=b'{"pool": "test-pool", "share": "rep-share", "snap": "rep-share_7_replication_1", "incremental": false, "uuid": "029ea547-da0b-4c23-b4f9-53c02bb7c283"}')
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.sender:132] Waiting max 2 seconds for send of commmand (b'sender-ready')
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.sender:122] _send_recv(command=b'sender-ready', msg=b'{"pool": "test-pool", "share": "rep-share", "snap": "rep-share_7_replication_1", "incremental": false, "uuid": "029ea547-da0b-4c23-b4f9-53c02bb7c283"}') -> b'sender-ready', b'send_only-succeeded'
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.sender:123] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-7 Initial greeting Done
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.util:69] Replication create_replica_trail(rid=7, snap_name=rep-share_7_replication_1)
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.sender:154] Sender _delete_old_snaps(share_path=/mnt2/rock-pool/.snapshots/rep-share)
[04/Jan/2024 20:05:03] DEBUG [system.osi:235] Running command: /usr/sbin/btrfs subvolume list /mnt2/rock-pool
[04/Jan/2024 20:05:04] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:05:06] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:05:08] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:05:09] DEBUG [smart_manager.replication.sender:274] Sender: EVENT_LIST poll = []
[04/Jan/2024 20:05:09] DEBUG [smart_manager.replication.sender:276] SENDER events dict = {}
[04/Jan/2024 20:05:09] DEBUG [smart_manager.replication.sender:281] EVENTS EMPTY
[04/Jan/2024 20:05:09] DEBUG [smart_manager.replication.sender:320] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-7. No response from receiver. Number of retry attempts left: 9
[04/Jan/2024 20:05:09] DEBUG [smart_manager.replication.sender:101] _init_greeting() CALLED
[04/Jan/2024 20:05:09] DEBUG [smart_manager.replication.sender:127] SENDER: _send_recv(command=b'sender-ready', msg=b'{"pool": "test-pool", "share": "rep-share", "snap": "rep-share_7_replication_1", "incremental": false, "uuid": "029ea547-da0b-4c23-b4f9-53c02bb7c283"}')
[04/Jan/2024 20:05:09] DEBUG [smart_manager.replication.sender:132] Waiting max 2 seconds for send of commmand (b'sender-ready')
[04/Jan/2024 20:05:10] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:05:11] ERROR [smart_manager.replication.sender:80] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-7. b"Failed while send-recv-ing command(b'sender-ready')". Exception: 
[04/Jan/2024 20:05:11] DEBUG [smart_manager.replication.util:41] update_replica_status(rtid=17, data={'status': 'failed', 'error': 'b"Failed while send-recv-ing command(b\'sender-ready\')". Exception: '})
[04/Jan/2024 20:05:12] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:05:12] DEBUG [smart_manager.replication.listener_broker:67] Sender(029ea547-da0b-4c23-b4f9-53c02bb7c283_7) exited. exitcode: 3
[04/Jan/2024 20:05:14] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:05:16] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:05:18] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
...
[04/Jan/2024 20:09:56] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:09:58] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:10:00] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:10:02] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = [(<zmq.Socket(zmq.ROUTER) at 0x7f6881a04670>, 1)]
[04/Jan/2024 20:10:02] DEBUG [smart_manager.replication.listener_broker:311] new-send request received for 7
[04/Jan/2024 20:10:02] DEBUG [smart_manager.replication.listener_broker:159] previous backup failed for Sender(029ea547-da0b-4c23-b4f9-53c02bb7c283_7). Starting a new one. Attempt 1/10.
[04/Jan/2024 20:10:02] DEBUG [smart_manager.replication.listener_broker:168] No record of last successful ReplicaTrail for Sender(029ea547-da0b-4c23-b4f9-53c02bb7c283_7). Will start a new Full Sender.
[04/Jan/2024 20:10:02] DEBUG [smart_manager.replication.sender:101] _init_greeting() CALLED
[04/Jan/2024 20:10:02] DEBUG [smart_manager.replication.sender:127] SENDER: _send_recv(command=b'sender-ready', msg=b'{"pool": "test-pool", "share": "rep-share", "snap": "rep-share_7_replication_1", "incremental": false, "uuid": "029ea547-da0b-4c23-b4f9-53c02bb7c283"}')
[04/Jan/2024 20:10:02] DEBUG [smart_manager.replication.sender:132] Waiting max 2 seconds for send of commmand (b'sender-ready')
[04/Jan/2024 20:10:04] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:10:04] DEBUG [smart_manager.replication.listener_broker:69] Active Senders: dict_keys(['029ea547-da0b-4c23-b4f9-53c02bb7c283_7'])
[04/Jan/2024 20:10:04] ERROR [smart_manager.replication.sender:80] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-7. b"Failed while send-recv-ing command(b'sender-ready')". Exception: 
[04/Jan/2024 20:10:06] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:10:08] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:10:10] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:10:12] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:10:14] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:10:14] DEBUG [smart_manager.replication.listener_broker:67] Sender(029ea547-da0b-4c23-b4f9-53c02bb7c283_7) exited. exitcode: 3
[04/Jan/2024 20:10:16] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:10:18] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:10:20] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
...

Receiver

Note: in the following (no Web-UI login) there are no further logs. I.e. we have a likely hard crash of the replication subsystem:

  • after: listener_broker:333] command: b'receiver-ready', sending 'ACK' to backend. (Successfully received by local receiver).
  • just after: listener_broker:341] command: b'receiver-ready', sending to frontend.
[04/Jan/2024 20:05:00] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = [(<zmq.Socket(zmq.ROUTER) at 0x7f93c8dfcad0>, 1)]
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.listener_broker:248] frontend.recv_multipart() -> command=b'sender-ready', msg=b'{"pool": "test-pool", "share": "rep-share", "snap": "rep-share_7_replication_1", "incremental": false, "uuid": "029ea547-da0b-4c23-b4f9-53c02bb7c283"}'
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.listener_broker:262] initial greeting command 'b'sender-ready'' received from b'029ea547-da0b-4c23-b4f9-53c02bb7c283-7'
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.listener_broker:293] New Receiver(b'029ea547-da0b-4c23-b4f9-53c02bb7c283-7') started.
[04/Jan/2024 20:05:02] DEBUG [smart_manager.replication.receiver:171] Id: b'029ea547-da0b-4c23-b4f9-53c02bb7c283-7'. Starting a new Receiver for meta: {'pool': 'test-pool', 'share': 'rep-share', 'snap': 'rep-share_7_replication_1', 'incremental': False, 'uuid': '029ea547-da0b-4c23-b4f9-53c02bb7c283'}
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.util:86] create_rshare(data={'share': '029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share', 'appliance': '192.168.2.196', 'src_share': 'rep-share'})
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.util:100] create_receive_trail(rid=7, data={'snap_name': 'rep-share_7_replication_1'})
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.util:103] create_receive_trail() -> 7
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.receiver:138] RECEIVER: _send_recv called with command: b'receiver-ready', msg: b'place-holder'.
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = [(<zmq.Socket(zmq.ROUTER) at 0x7f93c8dfc670>, 1)]
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.listener_broker:327] Identity/address b'029ea547-da0b-4c23-b4f9-53c02bb7c283-7', found in remove_senders.keys()
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.listener_broker:333] command: b'receiver-ready', sending 'ACK' to backend.
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.receiver:144] Waiting max 2 seconds for send of commmand (b'receiver-ready')
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.receiver:150] Id: b'029ea547-da0b-4c23-b4f9-53c02bb7c283-7' RECEIVER: _send_recv command: b'receiver-ready' rcommand: b'ACK'
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.receiver:153] remote message: b''
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.listener_broker:336] Waiting max 2 seconds for send of commmand (b'receiver-ready')
[04/Jan/2024 20:05:03] DEBUG [smart_manager.replication.listener_broker:341] command: b'receiver-ready', sending to frontend.

Again we have no further log entries here: but expect continued polling log entries on this receiver. Before adding tracking we did not have this hard failure, but we also did not receive (at the sender machine) the 'receiver-ready' command.

This initially suggest we have a threading issue with the updated Py3.11 & updated pyzmq. Pyzmq contexts are thread safe, but sockets are not. This situation may relate to:

@phillxnet
Copy link
Member Author

This initially suggest we have a threading issue with the updated Py3.11 & updated pyzmq.

Not so quick, there was a remaining Py3 byte/str encode in the newly added track: now fixed. And we have finally progressed past the sender & receiver setup/nogotiation-chat and onto the still buggy (Py2.7/Py3.11 transition) send fsdata stage:

[05/Jan/2024 10:50:04] DEBUG [smart_manager.replication.sender:291] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-8. command(b'receiver-ready') and message(b'place-holder') received. Proceeding to send fsdata.

Sender

[05/Jan/2024 10:50:02] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = [(<zmq.Socket(zmq.ROUTER) at 0x7f6c5ce00670>, 1)]
[05/Jan/2024 10:50:02] DEBUG [smart_manager.replication.listener_broker:314] new-send request received for 8
[05/Jan/2024 10:50:02] DEBUG [smart_manager.replication.listener_broker:122] Starting a new Sender(029ea547-da0b-4c23-b4f9-53c02bb7c283_8).
[05/Jan/2024 10:50:02] DEBUG [smart_manager.replication.sender:101] _init_greeting() CALLED
[05/Jan/2024 10:50:02] DEBUG [smart_manager.replication.sender:127] SENDER: _send_recv(command=b'sender-ready', msg=b'{"pool": "test-pool", "share": "rep-share", "snap": "rep-share_8_replication_1", "incremental": false, "uuid": "029ea547-da0b-4c23-b4f9-53c02bb7c283"}')
[05/Jan/2024 10:50:02] DEBUG [smart_manager.replication.sender:132] Waiting max 2 seconds for send of commmand (b'sender-ready')
[05/Jan/2024 10:50:02] DEBUG [smart_manager.replication.sender:122] _send_recv(command=b'sender-ready', msg=b'{"pool": "test-pool", "share": "rep-share", "snap": "rep-share_8_replication_1", "incremental": false, "uuid": "029ea547-da0b-4c23-b4f9-53c02bb7c283"}') -> b'sender-ready', b'send_only-succeeded'
[05/Jan/2024 10:50:02] DEBUG [smart_manager.replication.sender:123] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-8 Initial greeting Done
[05/Jan/2024 10:50:02] DEBUG [smart_manager.replication.util:69] Replication create_replica_trail(rid=8, snap_name=rep-share_8_replication_1)
[05/Jan/2024 10:50:04] DEBUG [smart_manager.replication.sender:154] Sender _delete_old_snaps(share_path=/mnt2/rock-pool/.snapshots/rep-share)
[05/Jan/2024 10:50:04] DEBUG [system.osi:235] Running command: /usr/sbin/btrfs subvolume list /mnt2/rock-pool
[05/Jan/2024 10:50:04] DEBUG [smart_manager.replication.sender:274] Sender: EVENT_LIST poll = [(<zmq.Socket(zmq.DEALER) at 0x7f6c5ce01320>, 1)]
[05/Jan/2024 10:50:04] DEBUG [smart_manager.replication.sender:276] SENDER events dict = {<zmq.Socket(zmq.DEALER) at 0x7f6c5ce01320>: 1}
[05/Jan/2024 10:50:04] DEBUG [smart_manager.replication.sender:279] events index (<zmq.Socket(zmq.DEALER) at 0x7f6c5ce01320>), has value 1
[05/Jan/2024 10:50:04] DEBUG [smart_manager.replication.sender:286] command = b'receiver-ready', of type <class 'bytes'>
[05/Jan/2024 10:50:04] DEBUG [smart_manager.replication.sender:291] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-8. command(b'receiver-ready') and message(b'place-holder') received. Proceeding to send fsdata.
[05/Jan/2024 10:50:04] DEBUG [smart_manager.replication.sender:335] init btrfs 'send' cmd ['/sbin/btrfs', 'send', '/mnt2/rock-pool/.snapshots/rep-share/rep-share_8_replication_1']
[05/Jan/2024 10:50:04] INFO [smart_manager.replication.sender:344] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-8. Sending full replica: /mnt2/rock-pool/.snapshots/rep-share/rep-share_8_replication_1
[05/Jan/2024 10:50:04] DEBUG [smart_manager.replication.sender:127] SENDER: _send_recv(command=b'', msg=None)
[05/Jan/2024 10:50:04] ERROR [smart_manager.replication.sender:80] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-8. b"Failed while send-recv-ing command(b'')". Exception: Frame 1 (None) does not support the buffer interface.
[05/Jan/2024 10:50:04] DEBUG [smart_manager.replication.util:41] update_replica_status(rtid=18, data={'status': 'failed', 'error': 'b"Failed while send-recv-ing command(b\'\')". Exception: Frame 1 (None) does not support the buffer interface.'})
[05/Jan/2024 10:50:04] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[05/Jan/2024 10:50:06] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[05/Jan/2024 10:50:08] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[05/Jan/2024 10:50:10] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[05/Jan/2024 10:50:10] DEBUG [smart_manager.replication.listener_broker:67] Sender(029ea547-da0b-4c23-b4f9-53c02bb7c283_8) exited. exitcode: 3

- more str/byte issues.
- iostream behaviour differs, in-dev modifications.
@phillxnet
Copy link
Member Author

With yet more tinkering we have our first Py3.11 'btrfs send' byte stream (captured via a Popen.stdout PIPE instance) sent over Pyzmq:

[05/Jan/2024 18:40:02] DEBUG [smart_manager.replication.sender:123] _send_recv(command=b'sender-ready', msg=b'{"pool": "test-pool", "share": "rep-share", "snap": "rep-share_16_replication_1", "incremental": false, "uuid": "029ea547-da0b-4c23-b4f9-53c02bb7c283"}') -> b'sender-ready', b'send_only-succeeded'
[05/Jan/2024 18:40:02] DEBUG [smart_manager.replication.sender:124] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-16 Initial greeting Done
[05/Jan/2024 18:40:02] DEBUG [smart_manager.replication.util:69] Replication create_replica_trail(rid=16, snap_name=rep-share_16_replication_1)
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:154] Sender _delete_old_snaps(share_path=/mnt2/rock-pool/.snapshots/rep-share)
[05/Jan/2024 18:40:04] DEBUG [system.osi:235] Running command: /usr/sbin/btrfs subvolume list /mnt2/rock-pool
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:274] Sender: EVENT_LIST poll = [(<zmq.Socket(zmq.DEALER) at 0x7fb32da012b0>, 1)]
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:276] SENDER events dict = {<zmq.Socket(zmq.DEALER) at 0x7fb32da012b0>: 1}
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:279] events index (<zmq.Socket(zmq.DEALER) at 0x7fb32da012b0>), has value 1
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:286] command = b'receiver-ready', of type <class 'bytes'>
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:291] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-16. command(b'receiver-ready') and message(b'place-holder') received. Proceeding to send fsdata.
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:335] Initial btrfs 'send' cmd ['/sbin/btrfs', 'send', '/mnt2/rock-pool/.snapshots/rep-share/rep-share_16_replication_1']
[05/Jan/2024 18:40:04] INFO [smart_manager.replication.sender:344] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-16. Sending full replica: /mnt2/rock-pool/.snapshots/rep-share/rep-share_16_replication_1
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:127] SENDER: _send_recv(command=b'', msg=b'')
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:132] Waiting max 2 seconds for send of commmand (b'')
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:377] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-16. send process finished for 029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share_16_replication_1. rc: 0. stderr: b'At subvol /mnt2/rock-pool/.snapshots/rep-share/rep-share_16_replication_1\n'
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:127] SENDER: _send_recv(command=b'', msg=b'btrfs-stream\x00\x01\x00\x00\x00>\x00\x00\x00\x01\x00\x1f=\xf9k\x0f\x00\x1a\x00rep-share_16_replication_1\x01\x00\x10\x00\xec2\xe4\\\xceizF\xb7\x8d\n\xecmeo)\x02\x00\x08\x00\t\x01\x00\x00\x00\x00\x00\x00\x1c\x00\x00\x00\x13\x00\x027-\x8c\x0f\x00\x00\x00\x06\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x07\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00\x12\x00YUu5\x0f\x00\x00\x00\x05\x00\x08\x00\xed\x01\x00\x00\x00\x00\x00\x004\x00\x00\x00\x14\x00w\xd9\x05\x7f\x0f\x00\x00\x00\x0b\x00\x0c\x00\r\x97\x96e\x00\x00\x00\x00{\xcd\xd5(\n\x00\x0c\x00\r\x97\x96e\x00\x00\x00\x00{\xcd\xd5(\t\x00\x0c\x00\r\x97\x96e\x00\x00\x00\x00{\xcd\xd5(\x00\x00\x00\x00\x15\x00Pl\xc9\x9d')
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:132] Waiting max 2 seconds for send of commmand (b'')
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:127] SENDER: _send_recv(command=b'btrfs-send-stream-finished', msg=b'')
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:132] Waiting max 2 seconds for send of commmand (b'btrfs-send-stream-finished')
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.sender:148] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-16 Server: 192.168.2.199:10002 scommand: b'btrfs-send-stream-finished' rcommand: b'receiver-error'
[05/Jan/2024 18:40:04] DEBUG [smart_manager.replication.util:41] update_replica_status(rtid=26, data={'status': 'succeeded', 'kb_sent': 0.2197265625})
[05/Jan/2024 18:40:04] ERROR [storageadmin.util:45] Exception: Field 'kb_sent' expected a number but got '0.2197265625'.
Traceback (most recent call last):
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/django/db/models/fields/__init__.py", line 2053, in get_prep_value
    return int(value)
           ^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '0.2197265625'

The above exception was the direct cause of the following exception:

With a little more tidy up to do re yet more type issues.

- additional type-hinting and fix for very low send byte count.
- harmonize on btrfs binary location to fs.btrfs for replication.
- readability refactoring.
- more byte/str fixes re Py2.7/Py3
@phillxnet
Copy link
Member Author

So prior kb_sent related type handling now resolved, and we have our first successful/error-free full first send, but on the second replication event (incremental=True this time) we have a remaining type issue re:

Sender

[08/Jan/2024 18:50:02] DEBUG [smart_manager.replication.sender:120] _send_recv(command=b'sender-ready', msg=b'{"pool": "test-pool", "share": "rep-share", "snap": "rep-share_3_replication_4", "incremental": true, "uuid": "029ea547-da0b-4c23-b4f9-53c02bb7c283"}') -> b'sender-ready', b'send_only-succeeded'
...
[08/Jan/2024 18:50:03] DEBUG [smart_manager.replication.sender:283] command = b'receiver-ready', of type <class 'bytes'>
[08/Jan/2024 18:50:03] ERROR [smart_manager.replication.sender:78] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-3. b"Mismatch on starting snapshot for btrfs-send. Sender picked rep-share_3_replication_1 but Receiver wants b'rep-share_3_replication_1', which takes precedence.". Exception: b"Mismatch on starting snapshot for btrfs-send. Sender picked rep-share_3_replication_1 but Receiver wants b'rep-share_3_replication_1', which takes precedence.". No succeeded trail found for b'rep-share_3_replication_1'.
[08/Jan/2024 18:50:03] DEBUG [smart_manager.replication.util:43] update_replica_status(rtid=4, data={'status': 'failed', 'error': 'b"Mismatch on starting snapshot for btrfs-send. Sender picked rep-share_3_replication_1 but Receiver wants b\'rep-share_3_replication_1\', which takes precedence.". Exception: b"Mismatch on starting snapshot for btrfs-send. Sender picked rep-share_3_replication_1 but Receiver wants b\'rep-share_3_replication_1\', which takes precedence.". No succeeded trail found for b\'rep-share_3_replication_1\'.'})
[08/Jan/2024 18:50:04] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[08/Jan/2024 18:50:04] DEBUG [smart_manager.replication.listener_broker:67] Sender(029ea547-da0b-4c23-b4f9-53c02bb7c283_3) exited. exitcode: 3
[08/Jan/2024 18:50:06] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []
[08/Jan/2024 18:50:08] DEBUG [smart_manager.replication.listener_broker:242] EVENT_LIST poll = []

I.e. we are likely comparing str with bytes: but otherwise this looks to be working as intended given they are equal otherwise!

Sender picked rep-share_3_replication_1 but Receiver wants b'rep-share_3_replication_1'

Next session I address this and we are into testing the full sequence of 3 events prior to replication settling into its stable state.

Almost there now.

- minor additional refactoring for clarity.
- keep receiver self.share/snap naming as str, encode before
send only.
- more type-hints.
- During debug logging we only show the first 180 bytes of the
message, this avoids log-spamming MBs of btrfs-send stream data.
- Send btrfs-send byte stream in 10MB chunks.
@phillxnet
Copy link
Member Author

This PR branch now has functional replication, but has only been tested on extremely trivial subvol data size.
In my next session I will test more real-life data set sizes and assess if the initial 10MB send stream chunking size is at all workable. We also have some as yet untested logging on that front.

- Avoid logging btrfs data stream contents entirely.
- Set read1() bytes read to 100MB max.
@phillxnet
Copy link
Member Author

This development branch is now basically as-was functional, i.e. we have what looks like our prior function before the Py2.7 to Py3.11 moves of late. I'll move to squash and re-present so we can merge and get this out to a wider testing via the now overdue testing branch release.

@phillxnet phillxnet closed this Jan 15, 2024
@phillxnet phillxnet deleted the 2766-(t)-replication-spawn-error branch January 15, 2024 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant