Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple source rate limiting #2149

Open
wants to merge 5 commits into
base: devel
Choose a base branch
from
Open

Simple source rate limiting #2149

wants to merge 5 commits into from

Conversation

sh-rp
Copy link
Collaborator

@sh-rp sh-rp commented Dec 15, 2024

Description

This PR adds a simple rate limiting mechanism controlled by the PipeIterator which should be very useful for certain api based use-cases.

Copy link

netlify bot commented Dec 15, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 9b9c9ec
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/675f4962dd7ecd000859ce8c

# Conflicts:
#	docs/website/docs/general-usage/source.md
@sh-rp sh-rp marked this pull request as ready for review December 15, 2024 19:14
@sh-rp sh-rp force-pushed the feat/simple-rate-limiting branch from bac447c to 9b9c9ec Compare December 15, 2024 21:25
@sh-rp
Copy link
Collaborator Author

sh-rp commented Dec 16, 2024

The timing on mac is again working quite differently from the other platforms. Maybe we should add sth to exec_info to be able to determine mac. and specifically allow more generous timeframes if tests are run there.

@sh-rp sh-rp requested a review from rudolfix December 16, 2024 09:35
@sh-rp sh-rp self-assigned this Dec 16, 2024
@@ -52,6 +54,8 @@ class PipeIteratorConfiguration(BaseConfiguration):
futures_poll_interval: float = 0.01
copy_on_fork: bool = False
next_item_mode: str = "round_robin"
rate_limit: Optional[float] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be a callback? Or even better a generic RateLimiter instance for example that can be shared between different pipes and dynamically updated?

I am thinking about #1485 (comment)

which describes an API that defines a global rate limit across all api endpoints.

If I have different resources (real-world example: https://github.com/dlt-hub/verified-sources/pull/587/files#diff-0fc4db143e89ab087ef737341bc4d93a631141a0bb633c31831ca22bb072c146R100-R105) that are in different pipes but share one limit, then I can't easily express this with this API as I wouldn't know how many other requests are made in other pipes (you can see from the code above that the pipes even may depend on the user configuration).
It also wouldn't be possible to make the rate limit dynamic. E.g. let's say I start a pipeline run that takes 2h but during that time an external system uses a certain amount of the resource for a while, then I'd like to adjust the rate limiting here. I.e.:

t0: I start pipe 1, I have 900 requests/s
t1: another system starts working, uses 100 requests/s, I want to reduce pipe 1 to 800r/s
t2: the other system stops it's work, I want to up the requests of pipe 1 to 900 aain

or

t0: I start pipe1 and pipe2 parallel, I have 900 req/s in total, so they share 450req/s each
t1: pipe2 is much faster and finishes, I want to increase the req/s for pipe1 to 900

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joscha could you maybe collect your requirements for rate limiting in a ticket. I'm not sure if this change here will make it into the code soon, but it is meant as fairly simple rate limiting mechanism to address requirements we hear a lot from the community. The idea to use exponential backoff and to respect well defined rate limiting header also is under consideration and would go into the rest api layer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can most certainly put it in a ticket. I am also not suggesting that having rate liming here is bad, I am just not sure the current implementation is flexible enough for more complex systems. Having one way to define rate limits and exponential backoff would be amazing, as it then would mean we only have one way to define it and next time we add the feature to another layer (like the RESTClient for example) we can just reuse what's already known.

Thinking in terms of how sources and/or pipes are structured and limiting them does also not always easily translate to the underlying system these source(s)/resource(s) pull data from. For resources based on an external REST API for example we possibly wouldn't be interested in limiting the work in the resource, but only how many requests are leaving the system (possibly even based on the endpoint as often APIs define different limits for different endpoints but a dlt resource might be using more than one type)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants