Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry on HTTP 50x errors #603

Open
wants to merge 25 commits into
base: branch-25.04
Choose a base branch
from

Conversation

TomAugspurger
Copy link

This updates our remote IO HTTP handler to check the status code of the response. If we get a 50x error, we'll retry up to some limit.

Closes #601

Copy link

copy-pr-bot bot commented Jan 29, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

cpp/src/shim/libcurl.cpp Outdated Show resolved Hide resolved
cpp/src/shim/libcurl.cpp Outdated Show resolved Hide resolved
cpp/src/shim/libcurl.cpp Outdated Show resolved Hide resolved
cpp/src/shim/libcurl.cpp Outdated Show resolved Hide resolved
cpp/src/shim/libcurl.cpp Outdated Show resolved Hide resolved
python/kvikio/kvikio/utils.py Show resolved Hide resolved
Copy link

copy-pr-bot bot commented Jan 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@TomAugspurger TomAugspurger marked this pull request as ready for review January 29, 2025 22:31
@TomAugspurger TomAugspurger requested review from a team as code owners January 29, 2025 22:31
cpp/doxygen/main_page.md Outdated Show resolved Hide resolved
cpp/include/kvikio/defaults.hpp Outdated Show resolved Hide resolved
cpp/src/shim/libcurl.cpp Outdated Show resolved Hide resolved
@madsbk madsbk added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jan 30, 2025
Copy link
Member

@madsbk madsbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @TomAugspurger.
I agree, it would be good to make the code list configurable. Or at least, define it as a constexpr somewhere.

cpp/src/shim/libcurl.cpp Outdated Show resolved Hide resolved
python/kvikio/tests/test_http_io.py Outdated Show resolved Hide resolved
cpp/src/defaults.cpp Outdated Show resolved Hide resolved
cpp/src/defaults.cpp Outdated Show resolved Hide resolved
This updates our remote IO HTTP handler to check the status code of the
response. If we get a 50x error, we'll retry up to some limit.
@TomAugspurger
Copy link
Author

Apologies for the force-commit. The commits made from my devcontainer last week weren't being signed for some reason.

@TomAugspurger
Copy link
Author

The two CI failures appear to be from the 6-hour timeout on the github action: https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434?pr=603#step:9:1556

context canceled
python/kvikio/tests/test_benchmarks.py::test_http_io[cupy] 
Error: The operation was canceled.

I assume that's unrelated to the changes here. If possible, it might be best to rerun just those failed jobs?

@bdice
Copy link
Contributor

bdice commented Feb 3, 2025

If there are jobs with hangs, we need to diagnose those offline and not rerun them. Consuming a GPU runner for 6 hours is not good, especially with our limited supply of ARM nodes.

@TomAugspurger
Copy link
Author

Makes sense. https://github.com/rapidsai/kvikio/actions/runs/13117029669 (from #1465) also took much longer than normal on these same two matrix entries.

https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707886?pr=603 is one of the slow jobs. That

  • 16:23:01 started Run tests
  • 16:23:53-16:26:25 compiling numcodecs at
    • something to look into: Why are we compiling numcodecs? See if it can provide a wheel and save us some time.
  • 16:27:13 started pytest
  • 16:28:05 last successful test finished
  • 22:20:06 Run canceled while running python/kvikio/tests/test_benchmarks.py::test_http_io[cupy]
  • ~1 second ago: I realized my code is almost surely to blame :)

A bit strange it passed on conda though. I'll take a look.

@TomAugspurger
Copy link
Author

That said, https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434 (testing #608) also timed out after 6 hours on the same test, and it was running at around the same time.

That test seems to use the run_cmd fixture to run a benchmark in a subprocess. I don't think we have logs to confirm it, but it's almost surely hanging while starting that subprocess or within it. I'll look into adding a timeout mechanism to run_cmd (cc @kingcrimsontianyu, just in case your PR hits that timeout again, no need for you to investigate too).

@TomAugspurger
Copy link
Author

TomAugspurger commented Feb 4, 2025

I wish I were more confident, but the hang is probably happening in

res: subprocess.CompletedProcess = subprocess.run(
cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, cwd=cwd
) # type: ignore
. We could probably catch most of these by setting a timeout in that subprocess.call. However, that's not the easiest to integrate into the rest of that run_cmd fixture, since it's using blocking calls to .send() and .recv() to send test commands and receive results, and those don't have timeout parameters. If we raise a TimeoutError there, run_cmd would hang on the .recv() since the server never writes anything to the pipe.

I'd recommend two things

  1. Add pytest-timeout as a test dependency, and ensure that these tests have a timeout. With small timeouts and added time.sleep commands in the http_io.py file I've confirmed that pytest-timeout does interrupt the individual tests and the test process finishes.
  2. Investigate the cause of the hangs in the first place. I'm like 99% sure that we should be setting CURLOPT_TIMEOUT somewhere in libcurl.cpp before we perform any requests. Which means we would need to pick a default and expose that up through to the user as a configuration value / parameter for requests made by kvikio. That should probably be done as a separate PR (Set timeouts for HTTP requests #613).

@TomAugspurger
Copy link
Author

TomAugspurger commented Feb 4, 2025

The two wheel test failures are from segfaults, somewhere in the call to open_http while running python/kvikio/tests/test_examples.py::test_http_io: https://github.com/rapidsai/kvikio/actions/runs/13137420435/job/36656808281?pr=603#step:9:1578

Looking into it.

Edit: I'm not able to reproduce this locally. pytest-timeout works by setting a SIGALRM timer at test start and clearing it at test end. The only thing related to signals I see in kvikio is us setting CURLOPT_NOSIGNAL =1 at

// Need CURLOPT_NOSIGNAL to support threading, see
// <https://curl.se/libcurl/c/CURLOPT_NOSIGNAL.html>
setopt(CURLOPT_NOSIGNAL, 1L);
. Based on the docs, it sounds like there's a risk for clashing in the use of SIGALRM

This option may cause libcurl to use the SIGALRM signal to timeout system calls on builds not using asynch DNS. In Unix-like systems, this might cause signals to be used unless CURLOPT_NOSIGNAL is set.

but we are using CURLOPT_NOSIGNAL so I'm not sure.

b641240 updates the timeout to use threads instead.

@TomAugspurger
Copy link
Author

TomAugspurger commented Feb 7, 2025

Thanks for the reviews and help everyone! All the comments should be addressed.

The branch is targeting 25.04 and CI is passing.

Edit: this should wait for #626.

// Retry only if one of the specified status codes is returned
// TODO: Parse the Retry-After header, if it exists.
// TODO: configurable maximum wait.
ss << "HTTP " << http_code << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the dev branch, the string stream is used to collect pieces of information to form an ensemble message for the exception. I'm wondering what is the intended action here. Don't we want to print the http_code to the standard output (std::cout)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. If this were in Python I'd recommend using a logger so that end-user applications have control over what happens to the log messages, including printing them to stdout. Do you know whether we have something similar at the C++ layer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C++ solution would be either implementing our own logger class, or using a good third-party logging library such as spdlog or glog. We don't have a dedicated logger yet in KvikIO, except this simplistic macro here that uses the standard error stream:

#define KVIKIO_LOG_ERROR(err_msg) kvikio::detail::log_error(err_msg, __LINE__, __FILE__)

void log_error(std::string_view err_msg, int line_number, char const* filename)

For this PR, I think we can/should make do with std::cout, and we will ponder the logger design later. 😃

Copy link
Contributor

@kingcrimsontianyu kingcrimsontianyu Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also cc @madsbk about the idea of having a logger in KvikIO that can output to basic sinks such as stdout, stderr streams or files.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I've gone with std::cout for now. I have a test at the python level asserting a few things about the output.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have https://github.com/rapidsai/rapids-logger for logging. It wraps spdlog, which we do very carefully to avoid exposing spdlog symbols in our libraries.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any objections to holding off on rapids-logger for this PR? It probably makes sense to make that change across the library.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing is needed for this PR. Just wanted to make you aware of the new tool in the toolbox!

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like for our choices of HTTP retry delay times to be more reasonable. (Apologies for missing this earlier.)

See: https://github.com/rapidsai/kvikio/pull/603/files#r1951116338

- Better values for inital and max delay
- shorten the test
Copy link
Contributor

@kingcrimsontianyu kingcrimsontianyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved the C++ code. Thanks for the work. Great feature to have!

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions for the retry logic.

cpp/src/defaults.cpp Outdated Show resolved Hide resolved
cpp/src/shim/libcurl.cpp Outdated Show resolved Hide resolved
Comment on lines 123 to 125
int attempt_count = 1;
int base_delay = 500; // milliseconds
int max_delay = 4000; // milliseconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should start from attempt_count = 0 so that our first delay is 500ms (500 * (1 << 0) = 500).

Suggested change
int attempt_count = 1;
int base_delay = 500; // milliseconds
int max_delay = 4000; // milliseconds
auto attempt_count = 0;
auto constexpr base_delay = 500; // milliseconds
auto constexpr max_delay = 4000; // milliseconds

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that. I'll implement this by adjusting the backoff to be base_delay * (1 << std::min(attempt_count - 1, 4));, so subtracting 1 there, but keeping the initial attempt at 1. Then the loop logic can still be compared directly against the user-provided max_attempts.

cpp/src/shim/libcurl.cpp Outdated Show resolved Hide resolved
cpp/src/shim/libcurl.cpp Outdated Show resolved Hide resolved
cpp/src/shim/libcurl.cpp Outdated Show resolved Hide resolved
python/kvikio/tests/test_http_io.py Outdated Show resolved Hide resolved
- redo backoff computation
    - bounds check
    - adjust attempt_count
- reformat error messages
- removed upper bound check
@@ -116,19 +119,55 @@ CURL* CurlHandle::handle() noexcept { return _handle.get(); }

void CurlHandle::perform()
Copy link
Contributor

@kingcrimsontianyu kingcrimsontianyu Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest reorganizing the logic below to improve readability, such as separating handling of error code from http code, early breaking to reduce the indentation level. This is what I have in mind. Let me know of your thought:

// Untested code. Please check!
void CurlHandle::perform()
{
  long http_code               = 0;
  auto constexpr base_delay_ms = 500;
  auto delay_multiplier        = 1;
  auto backoff_delay_ms        = 0;
  auto max_delay_ms            = kvikio::defaults::http_max_delay_ms();
  auto& http_status_codes      = kvikio::defaults::http_status_codes();
  auto attempt_count           = 0;

  while (true) {
    ++attempt_count;
    auto err = curl_easy_perform(handle());
    if (err != CURLE_OK) {
      std::string msg(_errbuf);  // We can do this because we always initialize `_errbuf` as empty.
      std::stringstream ss;
      ss << "curl_easy_perform() error near " << _source_file << ":" << _source_line;
      if (msg.empty()) {
        ss << "(" << curl_easy_strerror(err) << ")";
      } else {
        ss << "(" << msg << ")";
      }
      throw std::runtime_error(ss.str());
    }

    curl_easy_getinfo(handle(), CURLINFO_RESPONSE_CODE, &http_code);

    // Check if we should retry based on HTTP status code
    if (std::find(http_status_codes.begin(), http_status_codes.end(), http_code) ==
        http_status_codes.end()) {
      // No retry needed
      break;
    }

    // Retry only if one of the specified status codes is returned
    // TODO: Parse the Retry-After header, if it exists.
    // TODO: configurable maximum wait.

    // Current status report
    std::cout << "KvikIO: Retrying HTTP request. Got HTTP code " << http_code << " after "
              << backoff_delay_ms << "ms (attempt " << attempt_count << ")." << std::endl;

    // Prepare for the next attempt
    // backoff and retry again. With a base value of 500ms, we retry after
    // 500ms, 1s, 2s, 4s, ...
    backoff_delay_ms = base_delay_ms * delay_multiplier;
    delay_multiplier <<= 1;

    if (backoff_delay_ms > max_delay_ms) {
      std::stringstream ss;
      ss << "KvikIO: HTTP request reached maximum delay (" << max_delay_ms << "). Got HTTP code "
         << http_code << ".";
      throw std::runtime_error(ss.str());
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(backoff_delay_ms));
  }
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering whether a refactor like this made sense now. Let me take a look.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

17d5fe0 has something, if you're able to take another look. That's a bit of a compromise between the earlier setup and your suggestion:

  • It does use the attempt_count in the while loop condition, instead of while (true). But the case where we've exceeded our maximum attempts is moved out of the loop, and runs when we break
  • I've added the early return for the case where things are OK, reducing the indentation level

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The changes look good to me! Please do test it a bit.

Side note: Hopefully we will improve the way of testing in the future through mocking (#634).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have OK testing for this through the Python tests here.

That checks that we fail after two attempts, and the the expected message printed after the first attempt failed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remote IO should retry certain error codes
5 participants