-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache + minimize registry requests in dbt deps
#4983
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide. |
fire_event(RegistryProgressMakingGETRequest(url=url)) | ||
resp = requests.get(url, timeout=30) | ||
fire_event(RegistryProgressGETResponse(url=url, resp_code=resp.status_code)) | ||
resp.raise_for_status() | ||
|
||
json_response = ( | ||
resp.json() # if this fails, it should raise requests.exceptions.JSONDecodeError and trigger retry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One observation here is that if we have standards about not just that the response is JSON, but that the JSON is constructed in a very specific way, we would need those logic checks here.
e.g. the def package(name, registry_base_url=None):
block makes some assumptions about the structure of the json itself that resp.json()
won't catch if the response is JSON, but not the right JSON. My hypothesis is that ~60% of the current failures we see in Cloud logging, that's exactly what's happening. We get a JSON back, just not the JSON we want, hence the failure with the NoneType in the package()
call.
In the case of separate invocations of deps, we would still, I would imagine, want to retry if the cache was empty -- so I am gathering url = _get_url(path, registry_base_url)
is doing that with the changes above in if url in REGISTRY_CACHE.keys()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not the right JSON. My hypothesis is that ~60% of the current failures we see in Cloud logging, that's exactly what's happening. We get a JSON back, just not the JSON we want, hence the failure with the NoneType in the
package()
call.
@barberscott That's illuminating!
@emmyoop It sounds to me like we might need, in addition to validating whether the JSON is a valid list/dict (#4982), to be able to either:
- Pass a set of "expected" keys into
_get
(e.g.versions
,namespace
,name
,downloads
, ...), and have it raise aContentDecodingError
if a valid-but-incorrect/incomplete JSON body is returned - Move more validation logic, such as
RegistryPackageMetadata.from_dict()
, fromdeps/registry.py
toclients/registry.py
, within the methods where we can retry those requests. It's less-clean code, and worse separation of concerns... but it's also an acknowledgment that these concerns are not easily separated.
The appeal of the change in this PR is that it narrows us down to making only two types of requests: api/v1/index.json
and api/v1/{packagename}.json
. Some cluttering of the code for two doesn't feel quite as costly as it would for many.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of separate invocations of deps, we would still, I would imagine, want to retry if the cache was empty
Yeah, this cache won't be persisted at all between separate invocations. I feel even better saying that after having switched out my janky "global" for the Python stdlib approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this. Speeds things up and also reduces the chance of errors.
I think, as you pointed out, you would be able to use it without the global. If you define _get_cached = memoized(_get)
and replace the _get
in _get_with_retries()
with get_cached
. I did a quick test and it seems to work.
core/dbt/events/types.py
Outdated
@dataclass | ||
class RegistryCacheHit(DebugLevel): | ||
url: str | ||
code: str = "M010" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super minor but M010
lives somewhere else. I think it's up to M022
now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. I think the memoized
approach preempts the log message entirely. No need to take a victory lap for a thing we should have been doing anyway :)
core/dbt/clients/registry.py
Outdated
@@ -49,6 +64,7 @@ def index(registry_base_url=None): | |||
index_cached = memoized(index) | |||
|
|||
|
|||
# not used, and this "endpoint" returns 'null'. here for backwards compatibility? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this in #4942 since it wasn't called.
RegistryProgressMakingGETRequest(url="") | ||
RegistryProgressGETResponse(url="", resp_code=1234) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd noticed that these were missing from the mypy
type checking list. No longer relevant to this PR, happy to leave in or split out as a separate (tiny) tech debt ticket, i.e. ensure complete type checking coverage for all event types defined in the file
return json_response | ||
|
||
|
||
_get_cached = memoized(_get) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the indenting right here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so? Could alternatively try using the @cache
decorator on _get
above: https://docs.python.org/3/library/functools.html#functools.cache
index_cached = memoized(index) | ||
|
||
|
||
def packages(registry_base_url=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just not used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not used anywhere, and this API doesn't return anything: https://hub.getdbt.com/api/v1/packages.json
@@ -80,7 +83,8 @@ def package(name, registry_base_url=None): | |||
|
|||
|
|||
def package_version(name, version, registry_base_url=None): | |||
return _get_with_retries("api/v1/{}/{}.json".format(name, version), registry_base_url) | |||
response = package(name, registry_base_url) | |||
return response["versions"][version] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the API return all the available versions when we don't pass a version to it? If the version doesn't exist (is that possible?), do we handle that correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The API returns the full package specification when no version is passed. See e.g. https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json
- I've tested this — if the version doesn't exist, that's caught earlier, in
deps.registry.resolved()
:
dbt-core/core/dbt/deps/registry.py
Lines 140 to 141 in 204d535
if not target: | |
package_version_not_found(self.package, range_, installable) |
That gets the set of available versions from get_available_versions
(defined below), which is hitting the same endpoint as this method. E.g.
09:12:54 Encountered an error:
Could not find a matching version for package dbt-labs/dbt_utils
Requested range: =0.141.24, =0.141.24
Available versions: ['0.0.1', '0.1.0', '0.1.1', '0.1.2', '0.1.3', '0.1.4', '0.1.5', '0.1.6', '0.1.7', '0.1.8', '0.1.9', '0.1.10', '0.1.11', '0.1.12', '0.1.13', '0.1.14', '0.1.15', '0.1.16', '0.1.17', '0.1.18', '0.1.19', '0.1.20', '0.1.21', '0.1.22', '0.1.23', '0.1.24', '0.1.25', '0.2.0', '0.2.1', '0.2.2', '0.2.3', '0.2.4', '0.2.5', '0.3.0', '0.4.0', '0.4.1', '0.5.0', '0.5.1', '0.6.0', '0.6.1', '0.6.2', '0.6.3', '0.6.4', '0.6.5', '0.6.6', '0.7.0', '0.7.1', '0.7.2', '0.7.3', '0.7.4', '0.7.5', '0.7.6', '0.8.0', '0.8.1', '0.8.2']
|
||
|
||
def index(registry_base_url=None): | ||
return _get_with_retries("api/v1/index.json", registry_base_url) | ||
|
||
|
||
# is this redundant, now that all _get responses are being cached? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
I've pulled this branch into #4982 since it was very intertwined. |
Happily closing in favor of #4982! |
resolves #4342
Toplines: Up to 80% faster + 80% fewer requests
Description
dbt deps
by caching responses in memoryThoughts:
dbt_utils
(https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json) is ~25KB, if my back-of-google-search arithmetic is to be believedmemoize
(a.k.a.functools.cache
) to cache the registry index here, and explicitly call the cached version here. I'm sure that's the better the way to do this, rather than the jankglobal
I'm using :)Anecdotal findings
Using our internal analytics project
Before:
logs/dbt.log
After:
logs/dbt.log
That's almost 80% fewer requests :)
This would be most effective in projects that:
dbt_utils
is a common favorite)Checklist