Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relaxed --wait <number-of-confirmations> for use on CI #1427

Closed
lidel opened this issue Jul 26, 2021 · 3 comments · Fixed by #1444
Closed

Relaxed --wait <number-of-confirmations> for use on CI #1427

lidel opened this issue Jul 26, 2021 · 3 comments · Fixed by #1444
Assignees
Labels
kind/bug A bug in existing code (including security flaws) P1 High: Likely tackled by core team if no one steps up

Comments

@lidel
Copy link
Contributor

lidel commented Jul 26, 2021

Current behavior

Current behavior of pin add --wait seems to be waiting for "pinned" status everywhere (no matter what replication factor was set? I tried to pass --replication-min|max and those seem to be ignored as well).

This effectively breaks CI build if any of the cluster peers is slow/unresponsive, and even when all of them are online, makes build unnecessarily long.

In recent days, it was causing issues for both project teams and stewards (dist.ipfs.io).

I ended up not using --wait and decided to poll for status in userland instead

Proposed enhancement

I believe the most useful feature for CI would be to allow for "relaxed" --wait where we block only until "n" replica confirmations are received.

I propose we add a feature where a number of replica confirmations can be passed, and then we wait until that number of "pinned" confirmations is received:

ipfs-cluster-ctl pin add --wait 2 --replication 4 <CID>

The above would pin with replication factor 4 but wait only for 2 confirmations before unblocking CI job.

I imagine that when the number is missing, we would default to either --replication[-min] (if provided), and then fallback to number of peers when no replication is passed.

Alternative approach

Alternative is to make --wait just follow implicit/explicit --replication[-min], but I believe decoupling waiting from replication factor makes it more useful / flexible.

cc @olizilla @hsanjuan @dholms (#1311)

@lidel lidel added the need/triage Needs initial labeling and prioritization label Jul 26, 2021
@ipfs-cluster ipfs-cluster deleted a comment from welcome bot Jul 26, 2021
@hsanjuan
Copy link
Collaborator

--wait should wait until all peers report either pinned or remote (which should be the equivalent of reaching rep-factor-fax.

https://github.com/ipfs/ipfs-cluster/blob/ea4dd7ed983d56c81f780fc84e150deda9e68500/api/rest/client/methods.go#L353-L360

Adding an option to wait sounds good, but it seems something else was happening, which suggest a bug in current logic or an issue at some other level. The wait should have returned on errors too.

Are you sure that you saw what you think you saw? The item was probably not yet pinned in all the places it should have been pinned.

@lidel
Copy link
Contributor Author

lidel commented Jul 30, 2021

I remember trying multiple things and it looked like reproducible: had pin add --wait --replication 2 waiting forever, even tho --debug responses used for status polling revealed that >2 peers had the CID in pinned status.

There was only one peer in unpinned state (infra confirmed we had issues with cluster at that time), and everything else was pinned, which feels like a bug in situation where cluster is partially out of sync perhaps?

@hsanjuan
Copy link
Collaborator

There was only one peer in unpinned state

Ok, that was probably the problem. I'll call this a bug.

@hsanjuan hsanjuan added kind/bug A bug in existing code (including security flaws) P1 High: Likely tackled by core team if no one steps up and removed need/triage Needs initial labeling and prioritization labels Aug 11, 2021
@hsanjuan hsanjuan self-assigned this Aug 11, 2021
@hsanjuan hsanjuan added this to the Release v0.14.1 milestone Aug 11, 2021
hsanjuan added a commit that referenced this issue Aug 13, 2021
Fixes #1427. Currently, if --wait is used when pinning it will wait until all
statuses reported for a pin are either Pinned or Remote. If a peer was lagging
behind and not syncing the state properly (reporting "unpinned" for example),
that would be enough to block waiting.

This modifies the behaviour of wait to return when replication_factor_min is
reached, regardless of what other statuses are.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws) P1 High: Likely tackled by core team if no one steps up
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants