
Consistently use the upload(path, IO) and download(path) -> IO across file-related operations #148

Merged: 3 commits into main from fix/104, Jun 8, 2023

Conversation

@nfx (Contributor) commented Jun 2, 2023

Changes

  • added w.workspace.upload & w.workspace.download
  • added w.dbfs.upload & w.dbfs.download
  • added w.files.upload & w.files.download
  • modified low-level client to work with raw streams and debug messages correctly
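The shared shape of these upload/download pairs can be sketched as a small protocol. This is a toy illustration, not the SDK's actual classes: `FileIO` and `InMemoryFiles` are hypothetical names, and an in-memory dict stands in for the real workspace/DBFS/files backends.

```python
import io
import typing


class FileIO(typing.Protocol):
    """Hypothetical protocol capturing the shared upload/download shape."""

    def upload(self, path: str, src: typing.BinaryIO) -> None: ...

    def download(self, path: str) -> typing.BinaryIO: ...


class InMemoryFiles:
    """Toy in-memory stand-in for w.workspace / w.dbfs / w.files."""

    def __init__(self):
        self._store = {}

    def upload(self, path: str, src: typing.BinaryIO) -> None:
        # read the whole stream and remember it under the given path
        self._store[path] = src.read()

    def download(self, path: str) -> typing.BinaryIO:
        # hand back a fresh binary stream over the stored bytes
        return io.BytesIO(self._store[path])


files = InMemoryFiles()
files.upload('/tmp/a.txt', io.BytesIO(b'hello'))
assert files.download('/tmp/a.txt').read() == b'hello'
```

The point of the unification is exactly this symmetry: every client accepts a binary stream on upload and returns one on download.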

Fix #104

Tests

New integration tests.

TODO:
- more type safety
- remove `direct_download` from ExportRequest

@codecov-commenter commented Jun 2, 2023

Codecov Report

Patch coverage: 56.00%; project coverage change: +0.04% 🎉

Comparison is base (136d7e1) 53.18% compared to head (2618e3c) 53.23%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #148      +/-   ##
==========================================
+ Coverage   53.18%   53.23%   +0.04%     
==========================================
  Files          29       30       +1     
  Lines       17900    17956      +56     
==========================================
+ Hits         9521     9558      +37     
- Misses       8379     8398      +19     
Impacted Files                        Coverage Δ
databricks/sdk/service/sql.py         55.41% <ø> (ø)
databricks/sdk/mixins/workspace.py    44.44% <44.44%> (ø)
databricks/sdk/core.py                67.33% <45.45%> (+0.61%) ⬆️
databricks/sdk/mixins/files.py        75.09% <62.50%> (ø)
databricks/sdk/__init__.py            73.14% <100.00%> (+0.50%) ⬆️
databricks/sdk/dbutils.py             79.67% <100.00%> (ø)

... and 1 file with indirect coverage changes


@nfx nfx changed the title Added w.workspace.direct_download method Consistently use the upload(path, IO) and download(path) -> IO across file-related operations Jun 7, 2023
    path: str,
    content: typing.BinaryIO,
    *,
    format: ExportFormat = ExportFormat.AUTO,


This technically has different semantics than the Workspace API, which defaults to format: AUTO when the format is not specified.

Not opposed to this, but just want to make sure that this is a conscious choice.

@nfx (Contributor, Author) replied:

@jerryjam-db I still have to add a docstring for this, but is it okay to mention: "if you specify a file extension, the format is determined automatically; otherwise use format=AUTO and language=PYTHON to create a notebook"?

@nfx (Contributor, Author) replied:

@jerryjam-db I've added path-extension checking in the SDK, so that by default this simple code can import a notebook without passing the language parameter:

    py = f'/Users/{w.current_user.me().user_name}/notebook-{random(12)}.py'

    w.workspace.upload(py, io.BytesIO(b'print(1)'))
    with w.workspace.download(py) as f:
        content = f.read()
        assert content == b'# Databricks notebook source\nprint(1)'

Plain files are handled via:

    py_file = f'/Users/{w.current_user.me().user_name}/file-{random(12)}.py'

    w.workspace.upload(py_file, io.BytesIO(b'print(1)'), format=ExportFormat.AUTO)
    with w.workspace.download(py_file) as f:
        content = f.read()
        assert content == b'print(1)'

    w.workspace.delete(py_file)
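The kind of path-extension inference described above can be sketched as follows. The helper name and the mapping are assumptions for illustration; the SDK's real extension table may cover more languages and formats.

```python
import pathlib
import typing

# Hypothetical mapping from file suffix to notebook language.
_EXT_TO_LANGUAGE = {'.py': 'PYTHON', '.scala': 'SCALA', '.sql': 'SQL', '.r': 'R'}


def infer_language(path: str) -> typing.Optional[str]:
    """Return a notebook language inferred from the path's extension, or None."""
    suffix = pathlib.PurePosixPath(path).suffix.lower()
    return _EXT_TO_LANGUAGE.get(suffix)


assert infer_language('/Users/me/notebook.py') == 'PYTHON'
assert infer_language('/Users/me/data.csv') is None
```

With inference like this, `upload` can pick the language from the path and only fall back to explicit `format=`/`language=` arguments when the extension is unknown.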


Do we include some information in the User Agent to indicate that these requests are coming from the Databricks SDK? This will help for tracking on our side.

@nfx (Contributor, Author) replied:

Yes, let me share the spec with you.
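For illustration only (the actual spec mentioned above is not reproduced here), a segmented product/version User-Agent string might be assembled like this; the function name and segment layout are assumptions, not the SDK's real format:

```python
def user_agent(product: str, product_version: str,
               sdk_version: str, py_version: str) -> str:
    """Hypothetical sketch: compose product/version segments so the
    server can attribute requests to the SDK and the calling product."""
    return f'{product}/{product_version} databricks-sdk-py/{sdk_version} python/{py_version}'


ua = user_agent('my-app', '1.0', '0.1.9', '3.10')
assert ua == 'my-app/1.0 databricks-sdk-py/0.1.9 python/3.10'
```

The value of a segmented format is that each component (application, SDK, runtime) can be parsed independently for server-side tracking.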

    sb.append(f'< {response.status_code} {response.reason}')
    if response.content:
        if raw and 'Content-Type' in response.headers:

Why check for the Content-Type header here?

@nfx (Contributor, Author) replied:

Raw streams with Transfer-Encoding: chunked do not have a Content-Type header.
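A sketch of the guard this implies (a hypothetical helper, not the SDK's actual logging code): buffering a raw stream just to log it would consume it, so the debug logger only considers the body when it knows what kind of content it is dealing with.

```python
def should_log_body(raw: bool, headers: dict) -> bool:
    """Decide whether a response body is safe to mention in a debug log.

    Raw streamed responses (Transfer-Encoding: chunked) often carry no
    Content-Type header; reading them just to log a message would
    consume the stream the caller still needs.
    """
    if raw:
        return 'Content-Type' in headers
    return True


assert should_log_body(raw=True, headers={'Transfer-Encoding': 'chunked'}) is False
assert should_log_body(raw=False, headers={}) is True
```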

Resolved review threads on databricks/sdk/mixins/files.py (outdated) and databricks/sdk/core.py.
    data = {'path': path, 'format': format.value}
    if language:
        data['language'] = language.value
    return self._api.do('POST', '/api/2.0/workspace/import', files={'content': content}, data=data)

The `data` kwarg in requests appears to be for form data, not a JSON body.

What is the resulting request content type here?
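For context: when `requests` is given both `data=` and `files=`, it encodes a multipart/form-data body in which the `data` dict becomes plain form fields. A minimal stdlib sketch of roughly what that encoding produces (simplified; `requests` adds per-part headers this sketch omits):

```python
import uuid


def multipart_form(fields: dict, files: dict):
    """Build a simplified multipart/form-data body and its content type,
    mimicking what `requests` does when data= and files= are combined."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        # plain form fields from the data= dict
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode())
    for name, payload in files.items():
        # binary file parts from the files= dict
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"; filename="{name}"\r\n\r\n'.encode() + payload + b'\r\n')
    body = b''.join(parts) + f'--{boundary}--\r\n'.encode()
    return body, f'multipart/form-data; boundary={boundary}'


body, ctype = multipart_form({'path': '/a', 'format': 'AUTO'}, {'content': b'print(1)'})
assert ctype.startswith('multipart/form-data')
assert b'name="path"' in body and b'print(1)' in body
```

So the resulting request content type here would be multipart/form-data rather than application/json.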


Separate question: does this same API work for uploading regular files? I notice we use the /workspace-files//import-file API elsewhere to import regular files.

@nfx (Contributor, Author) replied, Jun 8, 2023:

This is roughly the equivalent of [screenshot omitted],

as per the request of @kahing and @jerryjam-db to do less base64 encoding on both the client and control-plane side.

    self._api.do('PUT', f'/api/2.0/fs/files{path}', data=src)  # files for the workspace upload

    def download(self, path: str) -> BinaryIO:
        return self._api.do('GET', f'/api/2.0/fs/files{path}', raw=True)

Is path always absolute?

@nfx (Contributor, Author) replied:

Yes.
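Since the path is interpolated directly after `/api/2.0/fs/files` in the URL, a leading slash is effectively required. A hypothetical guard (not present in the SDK itself) would look like:

```python
def validate_path(path: str) -> str:
    """Hypothetical check: the Files API path is appended verbatim to
    '/api/2.0/fs/files', so it must be absolute."""
    if not path.startswith('/'):
        raise ValueError(f'path must be absolute: {path!r}')
    return path


assert validate_path('/tmp/x.bin') == '/tmp/x.bin'
try:
    validate_path('relative.bin')
    raise AssertionError('expected ValueError')
except ValueError:
    pass
```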

    self._api = api_client

    def upload(self, path: str, src: BinaryIO):
        self._api.do('PUT', f'/api/2.0/fs/files{path}', data=src)  # files for the workspace upload

These endpoints don't appear yet in the OpenAPI spec; is that intended?

@nfx (Contributor, Author) replied:

@bogdanghita-db hasn't created one yet.

    @@ -34,6 +35,7 @@ class WorkspaceClient:
             self.config = config
             self.dbutils = dbutils.RemoteDbUtils(self.config)
             self.api_client = client.ApiClient(self.config)
    +        self.files = FilesMixin(self.api_client)

Does this need to be added here explicitly because there is no Files tag in the OpenAPI spec?

@nfx (Contributor, Author) replied:

Yes, temporarily.

@mgyucht (Contributor) left a review:

A few questions. Nice to unify these things together.

@nfx nfx merged commit 087cf3f into main Jun 8, 2023
@nfx nfx deleted the fix/104 branch June 8, 2023 12:27
@nfx nfx mentioned this pull request Jun 9, 2023
nfx added a commit that referenced this pull request Jun 9, 2023
# Version changelog

## 0.1.9

* Added new services from OpenAPI spec
([#145](#145),
[#159](#159)).
* Added consistent usage of the `upload(path, IO)` and `download(path)
-> IO` across file-related operations
([#148](#148)).
* Added Databricks Metadata Service credential provider
([#139](#139),
[#130](#130)).
* Added exposing runtime credential provider without changing user
namespace
([#140](#140)).
* Added a check for `is not None` for primitive fields in `as_dict()`
([#147](#147)).
* Fixed bug related to boolean flags and convert `True` to `true` in
query strings
([#156](#156)).
* Fixed generation of external entities
([#146](#146)).
* Make u2m authentication work with new CLI
([#150](#150)).
Successfully merging this pull request may close these issues.

Notebook export with "direct_download=True" fails
5 participants