Cache + minimize registry requests in dbt deps
#4983
@@ -20,7 +20,7 @@ def _get_url(url, registry_base_url=None):
 
 
 def _get_with_retries(path, registry_base_url=None):
-    get_fn = functools.partial(_get, path, registry_base_url)
+    get_fn = functools.partial(_get_cached, path, registry_base_url)
     return connection_exception_retry(get_fn, 5)
 
 
@@ -31,28 +31,31 @@ def _get(path, registry_base_url=None):
     fire_event(RegistryProgressGETResponse(url=url, resp_code=resp.status_code))
     resp.raise_for_status()
 
+    json_response = (
+        resp.json()  # if this fails, it should raise requests.exceptions.JSONDecodeError and trigger retry
+    )
     # It is unexpected for the content of the response to be None so if it is, raising this error
     # will cause this function to retry (if called within _get_with_retries) and hopefully get
     # a response. This seems to happen when there's an issue with the Hub.
     # See https://github.com/dbt-labs/dbt-core/issues/4577
-    if resp.json() is None:
+    if json_response is None:
         raise requests.exceptions.ContentDecodingError(
             "Request error: The response is None", response=resp
         )
-    return resp.json()
+    return json_response
 
 
+_get_cached = memoized(_get)

Review comment: Is the indenting right here?
Reply: I think so? Could alternatively try using the

+
+
 def index(registry_base_url=None):
     return _get_with_retries("api/v1/index.json", registry_base_url)
 
 
+# is this redundant, now that all _get responses are being cached?

Review comment: Good catch!

 index_cached = memoized(index)
 
 
-def packages(registry_base_url=None):

Review comment: Is this just not used anywhere?
Reply: Not used anywhere, and this API doesn't return anything: https://hub.getdbt.com/api/v1/packages.json

-    return _get_with_retries("api/v1/packages.json", registry_base_url)
-
-
 def package(name, registry_base_url=None):
     response = _get_with_retries("api/v1/{}.json".format(name), registry_base_url)
 
@@ -80,7 +83,8 @@ def package(name, registry_base_url=None):
 
 
 def package_version(name, version, registry_base_url=None):
-    return _get_with_retries("api/v1/{}/{}.json".format(name, version), registry_base_url)
+    response = package(name, registry_base_url)
+    return response["versions"][version]

Review comment: Does the API return all the available versions when we don't pass a version to it? If the version doesn't exist (is that possible?), do we handle that correctly?
Reply: dbt-core/core/dbt/deps/registry.py Lines 140 to 141 in 204d535
That gets the set of available versions from

 
 
 def get_available_versions(name):
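The question above (what happens when the requested version is missing from the package response) can be illustrated with a small sketch. This is not the dbt implementation: `VersionNotFound`, the dict-taking `package_version` signature, and the sample response are all hypothetical, and the response shape is assumed only from the `response["versions"][version]` lookup in the diff.

```python
class VersionNotFound(Exception):
    """Hypothetical error type for this sketch; not part of dbt."""


def package_version(package_response: dict, version: str) -> dict:
    """Look up one version inside an already-fetched package response.

    Assumes the response has a "versions" mapping keyed by version string,
    as implied by response["versions"][version] in the diff.
    """
    versions = package_response.get("versions") or {}
    try:
        return versions[version]
    except KeyError:
        available = ", ".join(sorted(versions)) or "none"
        raise VersionNotFound(
            f"Version {version!r} not found; available: {available}"
        ) from None


# Made-up stand-in for a cached api/v1/{name}.json response.
fake_response = {
    "name": "example_pkg",
    "versions": {"0.1.0": {"version": "0.1.0"}, "0.2.0": {"version": "0.2.0"}},
}

found = package_version(fake_response, "0.2.0")
```

Raising a dedicated error instead of letting a bare `KeyError` escape is one option; per the reply above, dbt appears to check the set of available versions before this lookup, so the extra guard may be unnecessary in practice.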
@@ -2422,6 +2422,8 @@ def message(self) -> str:
     GitNothingToDo(sha="")
     GitProgressUpdatedCheckoutRange(start_sha="", end_sha="")
     GitProgressCheckedOutAt(end_sha="")
+    RegistryProgressMakingGETRequest(url="")
+    RegistryProgressGETResponse(url="", resp_code=1234)

Review comment on lines +2425 to +2426: I'd noticed that these were missing from the

     SystemErrorRetrievingModTime(path="")
     SystemCouldNotWrite(path="", reason="", exc=Exception(""))
     SystemExecutingCmd(cmd=[""])
One observation here is that if we have standards about not just that the response is JSON, but that the JSON is constructed in a very specific way, we would need those logic checks here.
e.g. the `def package(name, registry_base_url=None):` block makes some assumptions about the structure of the JSON itself that `resp.json()` won't catch if the response is JSON, but not the right JSON. My hypothesis is that in ~60% of the current failures we see in Cloud logging, that's exactly what's happening. We get a JSON back, just not the JSON we want, hence the failure with the NoneType in the `package()` call.
In the case of separate invocations of deps, we would still, I would imagine, want to retry if the cache was empty -- so I am gathering `url = _get_url(path, registry_base_url)` is doing that with the changes above in `if url in REGISTRY_CACHE.keys()`.
@jtcohen6 your recommended changes in #4982 probably sort that and would get pulled in here.
@barberscott That's illuminating!
@emmyoop It sounds to me like we might need, in addition to validating whether the JSON is a valid list/dict (#4982), to be able to either:
- [...] `_get` (e.g. `versions`, `namespace`, `name`, `downloads`, ...), and have it raise a `ContentDecodingError` if a valid-but-incorrect/incomplete JSON body is returned
- [...] `RegistryPackageMetadata.from_dict()`, from `deps/registry.py` to `clients/registry.py`, within the methods where we can retry those requests. It's less-clean code, and worse separation of concerns... but it's also an acknowledgment that these concerns are not easily separated.

The appeal of the change in this PR is that it narrows us down to making only two types of requests: `api/v1/index.json` and `api/v1/{packagename}.json`. Some cluttering of the code for two doesn't feel quite as costly as it would for many.
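A rough sketch of the first option in the comment above (checking for expected keys and raising so the request can be retried) might look like the following. Everything here is an assumption for illustration: `ContentDecodingError` is a local stand-in for `requests.exceptions.ContentDecodingError` so the example has no third-party imports, `EXPECTED_KEYS` simply reuses the keys named in the comment, and `validate_package_body` and `get_with_retries` are hypothetical helpers rather than dbt functions.

```python
class ContentDecodingError(Exception):
    """Local stand-in for requests.exceptions.ContentDecodingError."""


# Assumed key set, copied from the examples named in the review comment.
EXPECTED_KEYS = frozenset({"name", "namespace", "versions", "downloads"})


def validate_package_body(body, expected_keys=EXPECTED_KEYS):
    """Return body unchanged if it is a dict containing every expected key."""
    if not isinstance(body, dict):
        raise ContentDecodingError(
            f"Expected a JSON object, got {type(body).__name__}"
        )
    missing = sorted(expected_keys.difference(body))
    if missing:
        raise ContentDecodingError(f"Response is missing keys: {missing}")
    return body


def get_with_retries(fetch, max_attempts=5):
    """Call fetch() until it returns a valid body or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return validate_package_body(fetch())
        except ContentDecodingError:
            if attempt == max_attempts:
                raise


# Simulated responses: None, then valid-but-incomplete JSON, then a good body.
responses = iter([
    None,
    {"name": "x"},
    {"name": "x", "namespace": "y", "versions": {}, "downloads": {}},
])
result = get_with_retries(lambda: next(responses))
```

Because validation happens inside the retried call, a valid-but-incomplete body is treated the same way as an empty one: it triggers another attempt instead of surfacing later as a `NoneType` error.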
Yeah, this cache won't be persisted at all between separate invocations. I feel even better saying that after having switched out my janky "global" for the Python stdlib approach.
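For readers unfamiliar with stdlib caching, here is a minimal sketch of an in-process cache for a fetch function. The thread does not say which stdlib tool was adopted; `functools.lru_cache` is an assumption, and `get_cached`, `calls`, and the simulated failure are hypothetical.

```python
import functools

calls = []  # records every real "request" so cache hits are visible


@functools.lru_cache(maxsize=None)
def get_cached(path, registry_base_url=None):
    """Pretend to fetch a registry path; results are cached per argument set."""
    calls.append(path)
    if path.endswith("broken.json"):
        # lru_cache does not store exceptions, so a failed call runs again
        # the next time it is requested (which is what a retry wrapper needs).
        raise ValueError("simulated bad response")
    return {"path": path}


first = get_cached("api/v1/index.json")
second = get_cached("api/v1/index.json")  # served from the cache

for _ in range(2):
    try:
        get_cached("api/v1/broken.json")
    except ValueError:
        pass  # each attempt reaches the function body again
```

The cache lives in the memory of the running process, so a second invocation starts empty. One caveat: the cached dict is returned by reference, so callers that mutate it would alter what later callers see.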