Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache + minimize registry requests in dbt deps #4983

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 12 additions & 8 deletions core/dbt/clients/registry.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ def _get_url(url, registry_base_url=None):


def _get_with_retries(path, registry_base_url=None):
get_fn = functools.partial(_get, path, registry_base_url)
get_fn = functools.partial(_get_cached, path, registry_base_url)
return connection_exception_retry(get_fn, 5)


Expand All @@ -31,28 +31,31 @@ def _get(path, registry_base_url=None):
fire_event(RegistryProgressGETResponse(url=url, resp_code=resp.status_code))
resp.raise_for_status()

json_response = (
resp.json() # if this fails, it should raise requests.exceptions.JSONDecodeError and trigger retry
Copy link

@barberscott barberscott Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One observation here is that if we have standards about not just that the response is JSON, but that the JSON is constructed in a very specific way, we would need those logic checks here.

e.g. the def package(name, registry_base_url=None): block makes some assumptions about the structure of the json itself that resp.json() won't catch if the response is JSON, but not the right JSON. My hypothesis is that ~60% of the current failures we see in Cloud logging, that's exactly what's happening. We get a JSON back, just not the JSON we want, hence the failure with the NoneType in the package() call.

In the case of separate invocations of deps, we would still, I would imagine, want to retry if the cache was empty -- so I am gathering url = _get_url(path, registry_base_url) is doing that with the changes above in if url in REGISTRY_CACHE.keys().

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtcohen6 your recommended changes in #4982 probably sort that and would get pulled in here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not the right JSON. My hypothesis is that ~60% of the current failures we see in Cloud logging, that's exactly what's happening. We get a JSON back, just not the JSON we want, hence the failure with the NoneType in the package() call.

@barberscott That's illuminating!

@emmyoop It sounds to me like we might need, in addition to validating whether the JSON is a valid list/dict (#4982), to be able to either:

  • Pass a set of "expected" keys into _get (e.g. versions, namespace, name, downloads, ...), and have it raise a ContentDecodingError if a valid-but-incorrect/incomplete JSON body is returned
  • Move more validation logic, such as RegistryPackageMetadata.from_dict(), from deps/registry.py to clients/registry.py, within the methods where we can retry those requests. It's less-clean code, and worse separation of concerns... but it's also an acknowledgment that these concerns are not easily separated.

The appeal of the change in this PR is that it narrows us down to making only two types of requests: api/v1/index.json and api/v1/{packagename}.json. Some cluttering of the code for two doesn't feel quite as costly as it would for many.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of separate invocations of deps, we would still, I would imagine, want to retry if the cache was empty

Yeah, this cache won't be persisted at all between separate invocations. I feel even better saying that after having switched out my janky "global" for the Python stdlib approach.

)
# It is unexpected for the content of the response to be None so if it is, raising this error
# will cause this function to retry (if called within _get_with_retries) and hopefully get
# a response. This seems to happen when there's an issue with the Hub.
# See https://github.com/dbt-labs/dbt-core/issues/4577
if resp.json() is None:
if json_response is None:
raise requests.exceptions.ContentDecodingError(
"Request error: The response is None", response=resp
)
return resp.json()
return json_response


_get_cached = memoized(_get)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the indenting right here?

Copy link
Contributor Author

@jtcohen6 jtcohen6 Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so? Could alternatively try using the @cache decorator on _get above: https://docs.python.org/3/library/functools.html#functools.cache



def index(registry_base_url=None):
return _get_with_retries("api/v1/index.json", registry_base_url)


# is this redundant, now that all _get responses are being cached?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

index_cached = memoized(index)


def packages(registry_base_url=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just not used anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used anywhere, and this API doesn't return anything: https://hub.getdbt.com/api/v1/packages.json

@emmyoop found the same thing in #4982

return _get_with_retries("api/v1/packages.json", registry_base_url)


def package(name, registry_base_url=None):
response = _get_with_retries("api/v1/{}.json".format(name), registry_base_url)

Expand Down Expand Up @@ -80,7 +83,8 @@ def package(name, registry_base_url=None):


def package_version(name, version, registry_base_url=None):
return _get_with_retries("api/v1/{}/{}.json".format(name, version), registry_base_url)
response = package(name, registry_base_url)
return response["versions"][version]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the API return all the available versions when we don't pass a version to it? If the version doesn't exist (is that possible?), do we handle that correctly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not target:
package_version_not_found(self.package, range_, installable)

That gets the set of available versions from get_available_versions (defined below), which is hitting the same endpoint as this method. E.g.

09:12:54  Encountered an error:
Could not find a matching version for package dbt-labs/dbt_utils
  Requested range: =0.141.24, =0.141.24
  Available versions: ['0.0.1', '0.1.0', '0.1.1', '0.1.2', '0.1.3', '0.1.4', '0.1.5', '0.1.6', '0.1.7', '0.1.8', '0.1.9', '0.1.10', '0.1.11', '0.1.12', '0.1.13', '0.1.14', '0.1.15', '0.1.16', '0.1.17', '0.1.18', '0.1.19', '0.1.20', '0.1.21', '0.1.22', '0.1.23', '0.1.24', '0.1.25', '0.2.0', '0.2.1', '0.2.2', '0.2.3', '0.2.4', '0.2.5', '0.3.0', '0.4.0', '0.4.1', '0.5.0', '0.5.1', '0.6.0', '0.6.1', '0.6.2', '0.6.3', '0.6.4', '0.6.5', '0.6.6', '0.7.0', '0.7.1', '0.7.2', '0.7.3', '0.7.4', '0.7.5', '0.7.6', '0.8.0', '0.8.1', '0.8.2']



def get_available_versions(name):
Expand Down
2 changes: 2 additions & 0 deletions core/dbt/events/types.py
Original file line number Diff line number Diff line change
Expand Up @@ -2422,6 +2422,8 @@ def message(self) -> str:
GitNothingToDo(sha="")
GitProgressUpdatedCheckoutRange(start_sha="", end_sha="")
GitProgressCheckedOutAt(end_sha="")
RegistryProgressMakingGETRequest(url="")
RegistryProgressGETResponse(url="", resp_code=1234)
Comment on lines +2425 to +2426
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd noticed that these were missing from the mypy type checking list. No longer relevant to this PR, happy to leave in or split out as a separate (tiny) tech debt ticket, i.e. ensure complete type checking coverage for all event types defined in the file

SystemErrorRetrievingModTime(path="")
SystemCouldNotWrite(path="", reason="", exc=Exception(""))
SystemExecutingCmd(cmd=[""])
Expand Down