(t) replication spawn error #2766 #2777

Merged

phillxnet

Update replication code re Py2.7 to Py3.11.

  • Modernise previously missed replication imports re Py3.*
  • Force bytes format for replication messages and commands. Zmq requires bytes format.
  • Minor modification re Python 3 behaviour re dict.keys(): we previously relied on an implicit Python 2 behaviour.
  • Move to f-strings for all issue focused files.
  • Parameter/return type hinting.
  • Removed an unused local variable.
  • black format update
  • Improve error diagnostic content of receiver failing to retrieve senders IP address from sent appliance ID.
  • Improve debug logging.
  • Remove receiver 'latest_snap or b""' argument to improve readability.
  • reduce retry iterations from 10 to 3.
  • Remove use of None from within zmq command/message passing: to help with stricter type hinting.
  • remove libzmq socket.set_hwm.
  • refactor poll -> poller socks -> events for readability.
  • additional explanatory comments re sockets etc.
  • Enable tracker on listener_broker, sender, and receiver's response: improves robustness, and aids in debugging.
  • add zmq_version and libzmq_version properties to sender and receiver.
  • adapt iostream behaviour: this differs between Py2.7 & Py3.*.
  • Fix existing bug re very low send byte count.
  • harmonize on btrfs binary location to fs.btrfs for replication.
  • readability refactoring improvements.
  • keep receiver self.share/snap naming as str, encode before send only.
  • Avoid logging btrfs data stream contents.
  • Set read1() bytes read to 100MB max.
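
Two of the items above (bytes-only ZMQ messages and the Python 3 dict.keys() change) can be sketched roughly as follows. This is a minimal illustration only, not the code from this pull request: the helper `to_frames`, the command name `sender-ready`, and the `senders` dict are hypothetical names made up for the example.

```python
def to_frames(*parts: str) -> list[bytes]:
    """Encode str parts to bytes frames; ZMQ messages must be bytes."""
    return [part.encode("utf-8") for part in parts]


# Hypothetical command and snapshot name, kept as str until just before sending.
frames = to_frames("sender-ready", "rep-share_1_replication_1")
# With pyzmq these frames could then be passed to socket.send_multipart(frames).

# Hypothetical registry of sender processes keyed by an id.
senders = {"id-1": "proc-a", "id-2": "proc-b"}
# Python 2 allowed senders.keys()[0] because keys() returned a list.
# In Python 3 keys() returns a view that cannot be indexed, so be explicit:
first_key = next(iter(senders))
key_list = list(senders)

print(frames, first_key, key_list)
```

Keeping names as str internally and encoding only at the send boundary matches the "encode before send only" item in the list above.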

Fixes #2766
Closes #2748

Caveats

This PR essentially restores our prior replication behaviour with only a few minor fixes. That is the intention: we first have to restore our prior functional level before improving upon it. So we are just moving from a working Py2.7 implementation to a working (within prior limitations) Py3.11 one, accounting for the updates to our other underlying dependencies here re: "Update pyzmq dependency to latest #2746" #2747. We also likely have an excess of debug logging, especially within the fast iterations of the send/receive while loops that manage our stdout -> stdin piping of the 'btrfs send' and 'btrfs receive' commands. This does end up straining our logging subsystem with anything other than trivial payloads. But given this is in debug mode only, it is proposed as acceptable. We also need far more field testing: hence the push to publish what we have to date.

Likely in the future we should add a proper stream manager to our btrfs send byte-stream. But again, we are here restoring what we had functionally, with a view to enabling improvements under our new dependencies of Py3.* and the much newer ZMQ libraries.
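
For readers unfamiliar with the stdout read loop discussed in these caveats, the following is a rough sketch of reading a child process's stdout in capped chunks via read1(), using the 100MB cap named in the change list. It is an assumption-laden illustration, not the replication code itself: `stream_chunks` and `demo_cmd` are hypothetical, and a small Python child process stands in for 'btrfs send'.

```python
import subprocess
import sys

MAX_CHUNK = 100 * 1024 * 1024  # 100MB cap per read1() call, per the change list


def stream_chunks(cmd: list[str], max_chunk: int = MAX_CHUNK):
    """Yield bytes chunks from cmd's stdout until end-of-stream."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        while True:
            # read1() returns whatever is available, up to max_chunk bytes.
            chunk = proc.stdout.read1(max_chunk)
            if not chunk:  # b"" signals EOF on the pipe
                break
            yield chunk


# Stand-in for 'btrfs send': a child process writing 300000 bytes to stdout.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'x' * 300000)",
]
total = sum(len(chunk) for chunk in stream_chunks(demo_cmd))
print(total)
```

Note that the loop logs nothing about chunk contents, in line with the "Avoid logging btrfs data stream contents" item; each chunk would instead be forwarded as a bytes message.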

@phillxnet

Testing info to follow.

@phillxnet

Rpmbuilds completed successfully for the 15.4 and 15.5 target OSs on the x86_64 arch and are now being tested for prior replication function.

@phillxnet

phillxnet commented Jan 15, 2024

Full replication cycle logs:

A reciprocal "Appliance" relationship was established as follows:

reciprocal-appliance-entries

Using the default cliapp credentials.

A replication task was configured from Leap 15.4 to 15.5 as shown below.

5-minute-replicatino-task

This resulted in the following edited logs:

15.4 (sender)

[15/Jan/2024 16:30:04] INFO [smart_manager.replication.sender:342] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-1. Sending full replica: /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_1
[15/Jan/2024 16:35:04] INFO [smart_manager.replication.sender:336] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-1. Sending incremental replica between /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_1 -- /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_2
[15/Jan/2024 16:40:03] INFO [smart_manager.replication.sender:336] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-1. Sending incremental replica between /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_2 -- /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_3
[15/Jan/2024 16:45:04] INFO [smart_manager.replication.sender:336] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-1. Sending incremental replica between /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_3 -- /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_4
[15/Jan/2024 16:50:04] INFO [smart_manager.replication.sender:336] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-1. Sending incremental replica between /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_4 -- /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_5
[15/Jan/2024 16:55:04] INFO [smart_manager.replication.sender:336] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-1. Sending incremental replica between /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_5 -- /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_6
[15/Jan/2024 17:00:04] INFO [smart_manager.replication.sender:336] Id: 029ea547-da0b-4c23-b4f9-53c02bb7c283-1. Sending incremental replica between /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_6 -- /mnt2/rock-pool/.snapshots/rep-share/rep-share_1_replication_7

15.5 (receiver)

[15/Jan/2024 16:45:03] INFO [storageadmin.views.snapshot:62] Supplanting share (029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share) with snapshot (.snapshots/029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share/rep-share_1_replication_1).
[15/Jan/2024 16:45:03] INFO [storageadmin.views.snapshot:100] Moving snapshot (/mnt2/test-pool/.snapshots/029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share/rep-share_1_replication_1) to prior share's pool location (/mnt2/test-pool/029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share)
[15/Jan/2024 16:50:03] INFO [storageadmin.views.snapshot:62] Supplanting share (029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share) with snapshot (rep-share_1_replication_2).
[15/Jan/2024 16:50:04] INFO [storageadmin.views.snapshot:100] Moving snapshot (/mnt2/test-pool/.snapshots/029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share/rep-share_1_replication_2) to prior share's pool location (/mnt2/test-pool/029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share)
[15/Jan/2024 16:55:03] INFO [storageadmin.views.snapshot:62] Supplanting share (029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share) with snapshot (rep-share_1_replication_3).
[15/Jan/2024 16:55:04] INFO [storageadmin.views.snapshot:100] Moving snapshot (/mnt2/test-pool/.snapshots/029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share/rep-share_1_replication_3) to prior share's pool location (/mnt2/test-pool/029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share)
[15/Jan/2024 17:00:04] INFO [storageadmin.views.snapshot:62] Supplanting share (029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share) with snapshot (rep-share_1_replication_4).
[15/Jan/2024 17:00:04] INFO [storageadmin.views.snapshot:100] Moving snapshot (/mnt2/test-pool/.snapshots/029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share/rep-share_1_replication_4) to prior share's pool location (/mnt2/test-pool/029ea547-da0b-4c23-b4f9-53c02bb7c283_rep-share)

Web-UI feedback

Slightly later on, as more replication events had occurred:

Overviews:

Web-UI-replication-report

Replication events tables

replication-eventd-report-tables

@phillxnet

The stable state achieved by the test procedure above results in the sending and receiving systems each having 3 replication snapshots on their associated replication shares:

3-snaps-on-sender-and-receiver

@phillxnet

Given the above, we look to have restored our prior replication behaviour. As such I'll move to merging this pull request, as it is currently holding up our next testing channel/git-branch package release, planned to be 5.0.6-0: hopefully to be made available in the next 24 hours.

Note that the above test only involved a trivial 1.7 GB replication share payload, with a likely impractical 5 minute interval: but this test procedure was intended only to demonstrate a return of our prior behaviour (full happy-path replication cycle). There are many improvements (i.e. performance related) to be made in this code/capability area but that is not the focus/intent of this pull request; only to restore what function we previously had: but under Py3.11 and fully updated ZMQ.

@phillxnet phillxnet merged commit 08fbecc into rockstor:testing Jan 15, 2024
@phillxnet phillxnet deleted the 2766-(t)-replication-spawn-error branch January 15, 2024 17:37