Retry on HTTP 50x errors #603
base: branch-25.04
Conversation
Looks good @TomAugspurger.
I agree, it would be good to make the code list configurable. Or at least, define it as a constexpr somewhere.
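For example, the retryable codes could live in one place as a constant, something like this sketch (the code list shown and the helper name are illustrative, not the actual KvikIO definitions):

// Sketch: keep the retryable HTTP status codes in a single constexpr list.
#include <algorithm>
#include <array>

inline constexpr std::array<long, 4> retryable_http_codes{500, 502, 503, 504};

inline bool is_retryable(long http_code)
{
  return std::find(retryable_http_codes.begin(), retryable_http_codes.end(), http_code) !=
         retryable_http_codes.end();
}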
This updates our remote IO HTTP handler to check the status code of the response. If we get a 50x error, we'll retry up to some limit.
Force-pushed 4913430 to e2934fa
Apologies for the force-push. The commits made from my devcontainer last week weren't being signed for some reason.
The two CI failures appear to be from the 6-hour timeout on the github action: https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434?pr=603#step:9:1556
I assume that's unrelated to the changes here. If possible, it might be best to rerun just those failed jobs?
If there are jobs with hangs, we need to diagnose those offline and not rerun them. Consuming a GPU runner for 6 hours is not good, especially with our limited supply of ARM nodes.
Makes sense. https://github.com/rapidsai/kvikio/actions/runs/13117029669 (from #1465) also took much longer than normal on these same two matrix entries. https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707886?pr=603 is one of the slow jobs. That
A bit strange it passed on conda though. I'll take a look.
That said, https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434 (testing #608) also timed out after 6 hours on the same test, and it was running at around the same time. That test seems to use the
I wish I were more confident, but the hang is probably happening in kvikio/python/kvikio/tests/conftest.py, lines 22 to 24 (at 74653a3), in subprocess.call. However, that's not the easiest to integrate into the rest of that run_cmd fixture, since it's using blocking calls to .send() and .recv() to send test commands and receive results, and those don't have timeout parameters. If we raise a TimeoutError there, run_cmd would hang on the .recv() since the server never writes anything to the pipe.
I'd recommend two things
The two wheel test failures are from segfaults, somewhere in the call shown at kvikio/cpp/src/shim/libcurl.cpp, lines 101 to 104 (at 74653a3). Looking into it. Edit: I'm not able to reproduce this locally.
…but we are using… b641240 updates the timeout to use threads instead.
Thanks for the reviews and help everyone! All the comments should be addressed. The branch is targeting 25.04 and CI is passing. Edit: this should wait for #626.
cpp/src/shim/libcurl.cpp
Outdated
// Retry only if one of the specified status codes is returned
// TODO: Parse the Retry-After header, if it exists.
// TODO: configurable maximum wait.
ss << "HTTP " << http_code << std::endl;
In the dev branch, the string stream is used to collect pieces of information to form an ensemble message for the exception. I'm wondering what the intended action here is. Don't we want to print the http_code to the standard output (std::cout)?
I'm not sure. If this were in Python I'd recommend using a logger so that end-user applications have control over what happens to the log messages, including printing them to stdout. Do you know whether we have something similar at the C++ layer?
The C++ solution would be either implementing our own logger class, or using a good third-party logging library such as spdlog or glog. We don't have a dedicated logger yet in KvikIO, except this simplistic macro that uses the standard error stream:

kvikio/cpp/include/kvikio/error.hpp, line 118 (at c38038e):
#define KVIKIO_LOG_ERROR(err_msg) kvikio::detail::log_error(err_msg, __LINE__, __FILE__)

and line 25 (at c38038e):
void log_error(std::string_view err_msg, int line_number, char const* filename)
For this PR, I think we can/should make do with std::cout, and we will ponder the logger design later. 😃
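For illustration, a minimal sketch of what a stdout-based companion to that macro could look like (the name KVIKIO_LOG_INFO and its exact signature are hypothetical, not an existing KvikIO API):

// Hypothetical sketch only, modeled on the KVIKIO_LOG_ERROR pattern above.
#include <iostream>
#include <string_view>

namespace kvikio::detail {
inline void log_info(std::string_view msg, int line_number, char const* filename)
{
  // One-line status message on standard output, tagged with the call site.
  std::cout << "KvikIO [" << filename << ":" << line_number << "]: " << msg << std::endl;
}
}  // namespace kvikio::detail

#define KVIKIO_LOG_INFO(msg) kvikio::detail::log_info(msg, __LINE__, __FILE__)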
Also cc @madsbk about the idea of having a logger in KvikIO that can output to basic sinks such as stdout, stderr streams or files.
Thanks. I've gone with std::cout for now. I have a test at the Python level asserting a few things about the output.
We have https://github.com/rapidsai/rapids-logger for logging. It wraps spdlog, which we do very carefully to avoid exposing spdlog symbols in our libraries.
Any objections to holding off on rapids-logger for this PR? It probably makes sense to make that change across the library.
Nothing is needed for this PR. Just wanted to make you aware of the new tool in the toolbox!
I'd like for our choices of HTTP retry delay times to be more reasonable. (Apologies for missing this earlier.)
See: https://github.com/rapidsai/kvikio/pull/603/files#r1951116338
- Better values for initial and max delay - shorten the test
Approved the C++ code. Thanks for the work. Great feature to have!
Some suggestions for the retry logic.
cpp/src/shim/libcurl.cpp
Outdated
int attempt_count = 1;
int base_delay = 500;  // milliseconds
int max_delay = 4000;  // milliseconds
We should start from attempt_count = 0 so that our first delay is 500 ms (500 * (1 << 0) = 500).
Suggested change:
- int attempt_count = 1;
- int base_delay = 500;  // milliseconds
- int max_delay = 4000;  // milliseconds
+ auto attempt_count = 0;
+ auto constexpr base_delay = 500;  // milliseconds
+ auto constexpr max_delay = 4000;  // milliseconds
Thanks for catching that. I'll implement this by adjusting the backoff to be base_delay * (1 << std::min(attempt_count - 1, 4)), so subtracting 1 there, but keeping the initial attempt at 1. Then the loop logic can still be compared directly against the user-provided max_attempts.
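For reference, a small standalone sketch of what that capped formula produces over successive attempts (illustrative only, not the PR's actual code; note the shift cap of 4 allows delays up to 8000 ms, independent of any separate max-delay setting):

// Illustrative only: print the delays produced by
// base_delay * (1 << std::min(attempt_count - 1, 4)) for attempts 1..8.
#include <algorithm>
#include <iostream>

int main()
{
  int constexpr base_delay = 500;  // milliseconds
  for (int attempt_count = 1; attempt_count <= 8; ++attempt_count) {
    int delay = base_delay * (1 << std::min(attempt_count - 1, 4));
    std::cout << "attempt " << attempt_count << ": wait " << delay << " ms\n";
  }
  // Prints 500, 1000, 2000, 4000, 8000, 8000, 8000, 8000 ms.
}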
@@ -116,19 +119,55 @@ CURL* CurlHandle::handle() noexcept { return _handle.get(); }

void CurlHandle::perform()
I would suggest reorganizing the logic below to improve readability, such as separating the handling of the curl error code from the HTTP code, and breaking early to reduce the indentation level. This is what I have in mind. Let me know your thoughts:
// Untested code. Please check!
void CurlHandle::perform()
{
long http_code = 0;
auto constexpr base_delay_ms = 500;
auto delay_multiplier = 1;
auto backoff_delay_ms = 0;
auto max_delay_ms = kvikio::defaults::http_max_delay_ms();
auto& http_status_codes = kvikio::defaults::http_status_codes();
auto attempt_count = 0;
while (true) {
++attempt_count;
auto err = curl_easy_perform(handle());
if (err != CURLE_OK) {
std::string msg(_errbuf); // We can do this because we always initialize `_errbuf` as empty.
std::stringstream ss;
ss << "curl_easy_perform() error near " << _source_file << ":" << _source_line;
if (msg.empty()) {
ss << "(" << curl_easy_strerror(err) << ")";
} else {
ss << "(" << msg << ")";
}
throw std::runtime_error(ss.str());
}
curl_easy_getinfo(handle(), CURLINFO_RESPONSE_CODE, &http_code);
// Check if we should retry based on HTTP status code
if (std::find(http_status_codes.begin(), http_status_codes.end(), http_code) ==
http_status_codes.end()) {
// No retry needed
break;
}
// Retry only if one of the specified status codes is returned
// TODO: Parse the Retry-After header, if it exists.
// TODO: configurable maximum wait.
// Current status report
std::cout << "KvikIO: Retrying HTTP request. Got HTTP code " << http_code << " after "
<< backoff_delay_ms << "ms (attempt " << attempt_count << ")." << std::endl;
// Prepare for the next attempt
// backoff and retry again. With a base value of 500ms, we retry after
// 500ms, 1s, 2s, 4s, ...
backoff_delay_ms = base_delay_ms * delay_multiplier;
delay_multiplier <<= 1;
if (backoff_delay_ms > max_delay_ms) {
std::stringstream ss;
ss << "KvikIO: HTTP request reached maximum delay (" << max_delay_ms << "). Got HTTP code "
<< http_code << ".";
throw std::runtime_error(ss.str());
}
std::this_thread::sleep_for(std::chrono::milliseconds(backoff_delay_ms));
}
}
I was wondering whether a refactor like this made sense now. Let me take a look.
17d5fe0 has something, if you're able to take another look. That's a bit of a compromise between the earlier setup and your suggestion (a rough sketch of the shape follows below):
- It does use the attempt_count in the while loop condition, instead of while (true). But the case where we've exceeded our maximum attempts is moved out of the loop, and runs when we break.
- I've added the early return for the case where things are OK, reducing the indentation level.
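Roughly, the structure described in those two bullets might look like this (a sketch of the shape only, not the actual code in 17d5fe0; the request callback and the fixed 500 ms sleep are placeholders):

// Shape sketch: the attempt count drives the loop, success returns early,
// and the "too many attempts" case lives after the loop.
#include <chrono>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <thread>

void perform_sketch(std::function<long()> const& do_request, int max_attempts)
{
  long http_code     = 0;
  int attempt_count  = 0;
  while (++attempt_count <= max_attempts) {
    http_code = do_request();
    if (http_code < 500 || http_code >= 600) { return; }  // early return when things are OK
    std::cout << "KvikIO: retrying, got HTTP " << http_code << " (attempt " << attempt_count
              << ")." << std::endl;
    if (attempt_count < max_attempts) {
      std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }
  }
  // Reached only after exhausting the attempts.
  throw std::runtime_error("KvikIO: max attempts reached, last HTTP code " +
                           std::to_string(http_code));
}

int main()
{
  int calls = 0;
  // Simulated request: two 503 responses, then a 200.
  perform_sketch([&] { return ++calls < 3 ? 503L : 200L; }, 5);
}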
Yes. The changes look good to me! Please do test it a bit.
Side note: Hopefully we will improve the way of testing in the future through mocking (#634).
I think we have OK testing for this through the Python tests here. That checks that we fail after two attempts, and that the expected message is printed after the first attempt fails.
Closes #601