[BUG] Retrieval Error: error generated by data transfer: deal data transfer failed #6299
Comments
Possible dup of #6297, but the error message at the end is different. |
@dirkmc this seems like a go-data-transfer issue. WDYT? |
I believe that "timed out waiting 30s for Complete message from remote peer" is fixed by #6300. @mgoelzer, could you please update to the latest master and let us know if it works for you? |
@dirkmc I upgraded to the latest master (commit The problem still happens. The error message is identical except that it now says Here is the full console output:
Here is the lotus daemon log file covering just the period of the retrieval attempt: new-2min-error.log EDIT: In case it's significant, the retrieval proceeds rapidly until it gets to the console output line |
Here's another example of this same problem with a different miner.
Command Output
$ lotus client retrieve --miner f023977 bafykbzacebhjwmbodmo4etb2mv7ufphbupzg5cf4tnjbxghmjqtukso4wzewg /dev/null
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
> Recv: 0 B, Paid 0 FIL, ClientEventDealProposed (DealStatusWaitForAcceptance)
> Recv: 960 B, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 52.95 KiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 1.052 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 1.052 MiB, Paid 0 FIL, ClientEventDataTransferError (DealStatusErrored)
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
ERROR: retrieval failed: Retrieve: Retrieval Error: error generated by data transfer: deal data transfer failed: 12D3KooWSuUQ1Mx97EbMxRNAvPCYab9XstBxT6iSJkqL97zUnVh5-12D3KooWSykmgDXcge3xSHc5Lf3p5dxVRvKbToJgkJ6LFcjQ9h7V-1621630583961168147: timed out waiting 2m0s for Accept message from remote peer
Logs
Lotus Version
|
I don't think latency/timeouts are the problem here, unless there's a deadlock or performance issue on the miner at the finalisation stage, that eventually clears, and thus increasing the client timeout helps. However, the logs don't suggest that. Latency is sub-second. Here's what I observe:
The client (receiver of data) knows the transfer is complete and records it in the data transfer FSM. This happens here: At this point, the client is awaiting a Complete message from the miner. Thirty (30) seconds later, the client still hasn't received it:
It gives up and sends a Cancel request to the miner. We receive the Cancel response from the miner in ~400ms (rough round trip, accounting for processing on both ends too), and terminate the channel. Next step: we've established that the miner actually fails to send the complete message. Investigate why this happens. |
Here is where the client awaits the Complete message from the miner, and aborts the channel if it doesn't arrive within the timeout: https://github.com/filecoin-project/go-data-transfer/blob/77b948c4eb91aac1638cdd2759d8aff39216ffe2/channelmonitor/channelmonitor.go#L299 |
@mgoelzer What Lotus version is the miner running? I'd like to reproduce this on the Sofia miner if possible. |
I think we covered this in our sync chat yesterday, but just for posterity: we don't know what version of Lotus the miners are running in any of these issues (this one, #6321, #6297) because these are just random miners on the network. We don't have any |
@mgoelzer I believe we are already collecting this info, it's just not exposed over the API. You could even use the libp2p repl to try to identify the peer off band. Here's a proposal for a new JSON-RPC operation: #6327 |
@mgoelzer There are two separate issues mentioned in this thread:
I'm going to disable the |
Note: Need to confirm that this is indeed a Markets v1.3.0/data-transfer v1.5.0 regression by seeing repeated success running a Lotus v1.9.0 client against a Lotus v1.9.0 miner. |
I retried the retrieval in the issue description above using a build of Lotus based on PR #6332. I notice that I now get different output:
I've seen this @aarshkshah1992 I will send you the debug log file out of band, to avoid inadvertently leaking any wallet keys or other secrets. |
@mgoelzer This error is misleading and probably a non-issue. The answer lies here: From your comment here:
From the logs:
The Retrieval Data transfer module uses a libp2p I'll look into why we aren't able to unmarshal that response message but that is NOT the issue here. |
Closing this ticket, as the legacy Lotus/Lotus-Miner Markets sub-system reached EOL on 31 January 2023 and has been completely removed from the code: |
Describe the bug
Trying to do a retrieval of a particular (miner, CID) pair from a machine that has already performed that retrieval unsuccessfully returns this error:
Version (run lotus version): This is a custom build of Lotus built from PR #6149. It contains markets 1.3.
To Reproduce
dealbot-mainnet (I can provide host and credentials out of band)
lotus client retrieve --miner f0215497 mAXCg5AIgcwUbnEkjzoZgSVAbcO3KXldc/i+2Q3IoGYojEnq1w4s /dev/null
Expected behavior
I expect the retrieval to complete successfully.
Logs
Here is the daemon log from only the time period when the lotus retrieve command was running: loglines.txt