Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add network resources #491

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open

Conversation

ohappykust
Copy link

@ohappykust ohappykust commented Nov 17, 2024

Foreword

In my company we use Apache Superset as a general BI solution. So, we encountered the case when we need to get data (CSV file in our case) from external resource (FTP in our case) and use data in our datasets. I found Shillelagh as the main solution to solve our problem, but for full use it was necessary to add to Shillelagh the ability to communicate with external network resources. Now I'm introducing new feature in Shillelagh - Network Resources.

Summary

A network resource is a resource located on a remote server and provides access to data over a network protocol.

The implementation of a network resource is a specific implementation of the functionality for obtaining data on a specific protocol from network resources that provide access to data on this protocol.

So, a file adapter in Shillelagh is an entity that answers the question "How to process specific data?", and a network resource answers the question "How to get this specific data to process it?".

NetworkResource in Shillelagh implement a single point for receiving data from external sources, after which this data will be used by the appropriate adapter for processing them. In turn, the Adapters' implementations no longer needs to worry about how to receive data, since this process is encapsulated in NetworkResourceImplementation class.

As part of the current PR, I have written three network resource implementations, these are HTTP(S), FTP and SFTP.

So, now we can use in our SQL query URL like:

sftp://user:password@some.sftp.host.com:22/path/to/file.ext

or

ftp://user:password@some.ftp.host.com:21/path/to/file.ext

We can omit the port in the URL if they are set by default: 22 for SFTP and 21 for FTP.

Using ContextVar approach

In my implementation, I used ContextVar as the global state containing the current instance of NetworkResource. This is due to two points:

  1. I failed to throw the created NetworkResource instance through the arguments to the adapter due to their serialization process into a string.

  2. When we are looking for a suitable adapter, generalizing, we cycle through all the adapters, calling the supports() method for each, which in turn can use a NetworkResource and create a connection to an external resource. I also wanted to reuse this connection at the moment of receiving data from an external resource.

Testing instructions

In the examples directory of the repository, I created two additional examples of Shillelagh working with FTP and SFTP. Before launching them, you should run the corresponding services in docker/docker-compose.yml

You can also up your FTP/SFTP resource or use an existing one to test this functionality.

For discussion

Right now I'm using a very simple method of getting the content type in the FTP and SFTP implementations with a simple check for the extension of the requested file in the URL. The good thing would be to go to an external resource and ask for the actual type of content. But I am concerned about the possible amount of data on an external resource: for a simple check, you will need to download all the data and it may happen that no adapter will be able to process the received data, in this case we could lose a lot of time.

Should I leave the current implementation or still go to a remote resource? Just with the second option, using ContextVar to save the created instance of NetworkResource is very suitable, so as not to request data again.

examples/csvfile_ftp.py Fixed Show fixed Hide fixed
examples/csvfile_sftp.py Fixed Show fixed Hide fixed
@betodealmeida
Copy link
Owner

This is awesome! I've been thinking about decoupling transport from format for a while now, but never had the change to work on it. Will take a look this week!

@ohappykust
Copy link
Author

Hi, @betodealmeida! I'm glad you found the changes interesting!

About the failed integration test job: are you sure the Weather API key used in GitHub Actions is currently valid?

I registered with WeatherAPI and obtained an API key. After that, I created a secret SHILLELAGH_ADAPTER_KWARGS in GitHub Actions with the value {"weatherapi": {"api_key": "e9852******"}}, and the WeatherAPI integration test passed successfully.

I also made the following fixes in previous commits:

  1. Fixed connection strings in the FTP and SFTP examples, so CodeQL should now succeed.
  2. Resolved EOFError and AssertionError in the FTP and SFTP integration tests.

@betodealmeida
Copy link
Owner

About the failed integration test job: are you sure the Weather API key used in GitHub Actions is currently valid?

The integration tests always fail for non-committers, since the secrets are not exposed to them. Otherwise someone could write a PR that grabs the secrets and sends them to a host they control.

@ohappykust
Copy link
Author

About the failed integration test job: are you sure the Weather API key used in GitHub Actions is currently valid?

The integration tests always fail for non-committers, since the secrets are not exposed to them. Otherwise someone could write a PR that grabs the secrets and sends them to a host they control.

Oh, got it! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants