Retry on HTTP 50x errors #603

TomAugspurger · 2025-01-29T17:16:26Z

This updates our remote IO HTTP handler to check the status code of the response. If we get a 50x error, we'll retry up to some limit.

Closes #601

copy-pr-bot · 2025-01-29T17:16:30Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

copy-pr-bot · 2025-01-29T20:57:36Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

madsbk

Looks good @TomAugspurger.
I agree, it would be good to make the code list configurable. Or at least, define it as a constexpr somewhere.

TomAugspurger · 2025-02-03T13:56:09Z

Apologies for the force-commit. The commits made from my devcontainer last week weren't being signed for some reason.

TomAugspurger · 2025-02-03T22:26:12Z

The two CI failures appear to be from the 6-hour timeout on the github action: https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434?pr=603#step:9:1556

context canceled
python/kvikio/tests/test_benchmarks.py::test_http_io[cupy] 
Error: The operation was canceled.

I assume that's unrelated to the changes here. If possible, it might be best to rerun just those failed jobs?

bdice · 2025-02-03T23:19:29Z

If there are jobs with hangs, we need to diagnose those offline and not rerun them. Consuming a GPU runner for 6 hours is not good, especially with our limited supply of ARM nodes.

TomAugspurger · 2025-02-04T02:45:42Z

Makes sense. https://github.com/rapidsai/kvikio/actions/runs/13117029669 (from #1465) also took much longer than normal on these same two matrix entries.

https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707886?pr=603 is one of the slow jobs. Its timeline:

16:23:01 started Run tests
16:23:53-16:26:25 compiling numcodecs at
- something to look into: Why are we compiling numcodecs? See if it can provide a wheel and save us some time.
16:27:13 started pytest
16:28:05 last successful test finished
22:20:06 Run canceled while running python/kvikio/tests/test_benchmarks.py::test_http_io[cupy]
~1 second ago: I realized my code is almost surely to blame :)

A bit strange it passed on conda though. I'll take a look.

TomAugspurger · 2025-02-04T12:43:07Z

That said, https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434 (testing #608) also timed out after 6 hours on the same test, and it was running at around the same time.

That test seems to use the run_cmd fixture to run a benchmark in a subprocess. I don't think we have logs to confirm it, but it's almost surely hanging while starting that subprocess or within it. I'll look into adding a timeout mechanism to run_cmd (cc @kingcrimsontianyu, just in case your PR hits that timeout again, no need for you to investigate too).

TomAugspurger · 2025-02-04T14:01:48Z

I wish I were more confident, but the hang is probably happening in

kvikio/python/kvikio/tests/conftest.py

Lines 22 to 24 in 74653a3

    
    res: subprocess.CompletedProcess = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, cwd=cwd
    )  # type: ignore

We could probably catch most of these by setting a timeout in that subprocess.run call. However, that's not the easiest to integrate into the rest of the run_cmd fixture, since it uses blocking calls to .send() and .recv() to send test commands and receive results, and those don't have timeout parameters. If we raise a TimeoutError there, run_cmd would hang on the .recv() since the server never writes anything to the pipe.
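A minimal sketch of the subprocess timeout idea, assuming a hypothetical helper named run_with_timeout (it is not part of conftest.py) and an arbitrary two-second limit. subprocess.run accepts a timeout argument; when it expires, the child is killed and subprocess.TimeoutExpired is raised, which the caller can turn into a failure instead of a hang:

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, cwd=None):
    """Run cmd and return (returncode, output); returncode is None on timeout.

    Hypothetical helper for illustration only; it is not the run_cmd fixture.
    """
    try:
        res = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            cwd=cwd,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child and waits for it before re-raising.
        return None, exc.output or b""
    return res.returncode, res.stdout


# A child that would otherwise block for 30 seconds is cut off after 2.
code, output = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=2
)
```

This only bounds the subprocess itself. The blocking .send() and .recv() calls in run_cmd would still need their own handling, which is the integration problem described above.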

I'd recommend two things:

1. Add pytest-timeout as a test dependency, and ensure that these tests have a timeout. With small timeouts and added time.sleep commands in the http_io.py file I've confirmed that pytest-timeout does interrupt the individual tests and the test process finishes.
2. Investigate the cause of the hangs in the first place. I'm like 99% sure that we should be setting CURLOPT_TIMEOUT somewhere in libcurl.cpp before we perform any requests. That means we would need to pick a default and expose it to the user as a configuration value / parameter for requests made by kvikio. That should probably be done as a separate PR (Set timeouts for HTTP requests #613).

TomAugspurger · 2025-02-04T16:18:58Z

The two wheel test failures are from segfaults, somewhere in the call to open_http while running python/kvikio/tests/test_examples.py::test_http_io: https://github.com/rapidsai/kvikio/actions/runs/13137420435/job/36656808281?pr=603#step:9:1578

Looking into it.

Edit: I'm not able to reproduce this locally. pytest-timeout works by setting a SIGALRM timer at test start and clearing it at test end. The only thing related to signals I see in kvikio is us setting CURLOPT_NOSIGNAL = 1 at

kvikio/cpp/src/shim/libcurl.cpp

Lines 101 to 104 in 74653a3

    
    // Need CURLOPT_NOSIGNAL to support threading, see
    // <https://curl.se/libcurl/c/CURLOPT_NOSIGNAL.html>
    setopt(CURLOPT_NOSIGNAL, 1L);

Based on the docs, it sounds like there's a risk of clashing over the use of SIGALRM:

This option may cause libcurl to use the SIGALRM signal to timeout system calls on builds not using asynch DNS. In Unix-like systems, this might cause signals to be used unless CURLOPT_NOSIGNAL is set.

but we are using CURLOPT_NOSIGNAL so I'm not sure.

b641240 updates the timeout to use threads instead.
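For reference, the SIGALRM mechanism described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not pytest-timeout's actual code: the names TimedOut and call_with_alarm are hypothetical, and the approach only works on Unix-like systems in the main thread.

```python
import signal
import time


class TimedOut(Exception):
    """Raised by the alarm handler when the time limit expires."""


def call_with_alarm(func, seconds):
    """Call func(), raising TimedOut if it runs longer than `seconds`.

    Unix-only and main-thread-only, because it relies on SIGALRM.
    """

    def _on_alarm(signum, frame):
        raise TimedOut(f"call exceeded {seconds} s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # arm the timer at start
    try:
        return func()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # clear the timer at end
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


# A five-second sleep is interrupted after roughly 0.2 seconds.
try:
    call_with_alarm(lambda: time.sleep(5), seconds=0.2)
    timed_out = False
except TimedOut:
    timed_out = True
```

A process has a single SIGALRM handler and a single real-time interval timer, so any other component that also relies on SIGALRM would interfere with this scheme. That is the kind of clash being speculated about here.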

TomAugspurger · 2025-02-07T14:39:38Z

Thanks for the reviews and help everyone! All the comments should be addressed.

The branch is targeting 25.04 and CI is passing.

Edit: this should wait for #626.

kingcrimsontianyu · 2025-02-11T14:48:36Z

cpp/src/shim/libcurl.cpp

+      // Retry only if one of the specified status codes is returned
+      // TODO: Parse the Retry-After header, if it exists.
+      // TODO: configurable maximum wait.
+      ss << "HTTP " << http_code << std::endl;


In the dev branch, the string stream is used to collect pieces of information to form an ensemble message for the exception. I'm wondering what is the intended action here. Don't we want to print the http_code to the standard output (std::cout)?

I'm not sure. If this were in Python I'd recommend using a logger so that end-user applications have control over what happens to the log messages, including printing them to stdout. Do you know whether we have something similar at the C++ layer?

The C++ solution would be either implementing our own logger class, or using a good third-party logging library such as spdlog or glog. We don't have a dedicated logger yet in KvikIO, except this simplistic macro here that uses the standard error stream:

kvikio/cpp/include/kvikio/error.hpp

Line 118 in c38038e

#define KVIKIO_LOG_ERROR(err_msg) kvikio::detail::log_error(err_msg, __LINE__, __FILE__)

kvikio/cpp/src/error.cpp

Line 25 in c38038e

void log_error(std::string_view err_msg, int line_number, char const* filename)

For this PR, I think we can/should make do with std::cout, and we will ponder the logger design later. 😃

Also cc @madsbk about the idea of having a logger in KvikIO that can output to basic sinks such as stdout, stderr streams or files.

Thanks. I've gone with std::cout for now. I have a test at the python level asserting a few things about the output.

We have https://github.com/rapidsai/rapids-logger for logging. It wraps spdlog, which we do very carefully to avoid exposing spdlog symbols in our libraries.

Any objections to holding off on rapids-logger for this PR? It probably makes sense to make that change across the library.

Nothing is needed for this PR. Just wanted to make you aware of the new tool in the toolbox!

bdice

I'd like for our choices of HTTP retry delay times to be more reasonable. (Apologies for missing this earlier.)

See: https://github.com/rapidsai/kvikio/pull/603/files#r1951116338

- Better values for initial and max delay
- shorten the test

kingcrimsontianyu

Approved the C++ code. Thanks for the work. Great feature to have!

bdice

Some suggestions for the retry logic.

bdice · 2025-02-11T17:31:52Z

cpp/src/shim/libcurl.cpp

+  int attempt_count       = 1;
+  int base_delay          = 500;   // milliseconds
+  int max_delay           = 4000;  // milliseconds


We should start from attempt_count = 0 so that our first delay is 500ms (500 * (1 << 0) = 500).

Suggested change

-  int attempt_count       = 1;
-  int base_delay          = 500;   // milliseconds
-  int max_delay           = 4000;  // milliseconds
+  auto attempt_count        = 0;
+  auto constexpr base_delay = 500;   // milliseconds
+  auto constexpr max_delay  = 4000;  // milliseconds

Thanks for catching that. I'll implement this by adjusting the backoff to be base_delay * (1 << std::min(attempt_count - 1, 4)), i.e. subtracting 1 there but keeping the initial attempt at 1. Then the loop logic can still be compared directly against the user-provided max_attempts.
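To make the arithmetic concrete, here is a small Python sketch of the backoff expression quoted above. The function name backoff_delay_ms and the parameter max_shift are hypothetical; only the formula, the 500 ms base, and the shift cap of 4 come from the discussion.

```python
def backoff_delay_ms(attempt_count, base_delay_ms=500, max_shift=4):
    """Delay before the next retry, with attempts counted from 1.

    Mirrors base_delay * (1 << std::min(attempt_count - 1, 4)).
    """
    if attempt_count < 1:
        raise ValueError("attempt_count starts at 1")
    return base_delay_ms * (1 << min(attempt_count - 1, max_shift))


delays = [backoff_delay_ms(n) for n in range(1, 8)]
# delays == [500, 1000, 2000, 4000, 8000, 8000, 8000]
```

With attempts counted from 1, the first delay is 500 ms as requested, and capping the shift at 4 bounds the delay at 16 times the base.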

- redo backoff computation
- bounds check
- adjust attempt_count
- reformat error messages
- removed upper bound check

kingcrimsontianyu · 2025-02-13T15:40:04Z

cpp/src/shim/libcurl.cpp

@@ -116,19 +119,55 @@ CURL* CurlHandle::handle() noexcept { return _handle.get(); }

 void CurlHandle::perform()


I would suggest reorganizing the logic below to improve readability, for example by separating the handling of the curl error code from the HTTP status code, and by breaking early to reduce the indentation level. This is what I have in mind. Let me know what you think:

    // Untested code. Please check!
    void CurlHandle::perform()
    {
      long http_code               = 0;
      auto constexpr base_delay_ms = 500;
      auto delay_multiplier        = 1;
      auto backoff_delay_ms        = 0;
      auto max_delay_ms            = kvikio::defaults::http_max_delay_ms();
      auto& http_status_codes      = kvikio::defaults::http_status_codes();
      auto attempt_count           = 0;

      while (true) {
        ++attempt_count;
        auto err = curl_easy_perform(handle());

        if (err != CURLE_OK) {
          std::string msg(_errbuf);  // We can do this because we always initialize `_errbuf` as empty.
          std::stringstream ss;
          ss << "curl_easy_perform() error near " << _source_file << ":" << _source_line;
          if (msg.empty()) {
            ss << "(" << curl_easy_strerror(err) << ")";
          } else {
            ss << "(" << msg << ")";
          }
          throw std::runtime_error(ss.str());
        }

        curl_easy_getinfo(handle(), CURLINFO_RESPONSE_CODE, &http_code);

        // Check if we should retry based on HTTP status code
        if (std::find(http_status_codes.begin(), http_status_codes.end(), http_code) ==
            http_status_codes.end()) {
          // No retry needed
          break;
        }

        // Retry only if one of the specified status codes is returned
        // TODO: Parse the Retry-After header, if it exists.
        // TODO: configurable maximum wait.

        // Current status report
        std::cout << "KvikIO: Retrying HTTP request. Got HTTP code " << http_code << " after "
                  << backoff_delay_ms << "ms (attempt " << attempt_count << ")." << std::endl;

        // Prepare for the next attempt
        // backoff and retry again. With a base value of 500ms, we retry after
        // 500ms, 1s, 2s, 4s, ...
        backoff_delay_ms = base_delay_ms * delay_multiplier;
        delay_multiplier <<= 1;

        if (backoff_delay_ms > max_delay_ms) {
          std::stringstream ss;
          ss << "KvikIO: HTTP request reached maximum delay (" << max_delay_ms
             << "). Got HTTP code " << http_code << ".";
          throw std::runtime_error(ss.str());
        }

        std::this_thread::sleep_for(std::chrono::milliseconds(backoff_delay_ms));
      }
    }

I was wondering whether a refactor like this made sense now. Let me take a look.

17d5fe0 has something, if you're able to take another look. That's a bit of a compromise between the earlier setup and your suggestion:

- It does use the attempt_count in the while loop condition, instead of while (true). But the case where we've exceeded our maximum attempts is moved out of the loop, and runs when we break.
- I've added the early return for the case where things are OK, reducing the indentation level.
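The control flow described in these two points can be sketched in Python as follows. This is an illustration, not the code in 17d5fe0: perform_with_retry and its parameters are hypothetical, send stands in for curl_easy_perform plus reading the response code, and the retry_codes tuple is an example rather than KvikIO's default list.

```python
import time


def perform_with_retry(
    send,
    retry_codes=(500, 502, 503, 504),  # illustrative, not KvikIO's defaults
    max_attempts=3,
    base_delay_ms=500,
    sleep=time.sleep,
):
    """Call send() until it returns a non-retryable HTTP code.

    attempt_count starts at 1 and drives the loop condition; the
    "too many attempts" error is raised after the loop, not inside it.
    """
    attempt_count = 1
    http_code = None
    while attempt_count <= max_attempts:
        http_code = send()
        if http_code not in retry_codes:
            return http_code  # early return when things are OK
        if attempt_count < max_attempts:
            delay_ms = base_delay_ms * (1 << min(attempt_count - 1, 4))
            print(
                f"Got HTTP code {http_code}; retrying after {delay_ms} ms "
                f"(attempt {attempt_count} of {max_attempts})."
            )
            sleep(delay_ms / 1000)
        attempt_count += 1
    raise RuntimeError(
        f"HTTP request failed after {max_attempts} attempts. "
        f"Got HTTP code {http_code}."
    )


# Two 503 responses followed by a 200; sleeps are recorded, not performed.
responses = iter([503, 503, 200])
slept = []
result = perform_with_retry(lambda: next(responses), sleep=slept.append)
```

Passing the sleep function as a parameter keeps the example fast to run and makes the delays observable, which is similar in spirit to the mocking idea raised in this thread.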

Yes. The changes look good to me! Please do test it a bit.

Side note: Hopefully we will improve the way of testing in the future through mocking (#634).

I think we have OK testing for this through the Python tests here.

That checks that we fail after two attempts, and that the expected message is printed after the first attempt fails.

TomAugspurger commented Jan 29, 2025

cpp/src/shim/libcurl.cpp

TomAugspurger marked this pull request as ready for review January 29, 2025 22:31

TomAugspurger requested review from a team as code owners January 29, 2025 22:31

bdice reviewed Jan 29, 2025

cpp/doxygen/main_page.md

cpp/include/kvikio/defaults.hpp

cpp/src/shim/libcurl.cpp

madsbk added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Jan 30, 2025

madsbk reviewed Jan 30, 2025

cpp/src/shim/libcurl.cpp

python/kvikio/tests/test_http_io.py

TomAugspurger commented Jan 30, 2025

cpp/src/defaults.cpp

madsbk reviewed Jan 31, 2025

cpp/src/defaults.cpp

Retry on HTTP 50x errors

e2934fa

This updates our remote IO HTTP handler to check the status code of the response. If we get a 50x error, we'll retry up to some limit.

TomAugspurger force-pushed the tom/retry-http branch from 4913430 to e2934fa Compare February 3, 2025 13:55

TomAugspurger added 2 commits February 3, 2025 15:00

Throw 500 errors on HEAD too

2ef47e4

Added C++ tests for parse_http_status_codes

bf33697

Added timeouts to benchmark tests

d7d377b

TomAugspurger mentioned this pull request Feb 4, 2025

Set timeouts for HTTP requests #613

Open

TomAugspurger requested a review from a team as a code owner February 4, 2025 14:14

TomAugspurger requested a review from AyodeAwe February 4, 2025 14:14

TomAugspurger added 2 commits February 4, 2025 17:15

docfix

d89bf5e

Use threads

b641240

Merge branch 'branch-25.04' into tom/retry-http

ec31221

kingcrimsontianyu reviewed Feb 10, 2025

cpp/src/defaults.cpp

kingcrimsontianyu reviewed Feb 10, 2025

cpp/src/defaults.cpp

TomAugspurger added 2 commits February 10, 2025 20:37

Apply fixes

1644485

back to const

7edb900

kingcrimsontianyu reviewed Feb 11, 2025