This repository has been archived by the owner on Mar 7, 2024. It is now read-only.

Consume less resources #3

Closed
mariusandra opened this issue May 27, 2021 · 2 comments

Comments


mariusandra commented May 27, 2021

For a client using the Snowflake export:

After running it for a while, we're sure that we are consuming too many Snowflake credits via the current import method. We'll need to change to an hourly import. Do you think the current Snowflake plugin can be made to do that well? Last time [name] switched it to hourly he got errors.

Our export code continuously sends INSERT statements to upload data. This requires a Snowflake compute warehouse to be running at all times, and that's $$$.

After a call with Craig from Snowflake, we found out:

  • As an alternative, he suggested uploading documents one by one to Snowflake's internal “S3” (an internal stage) and then, for example, asking Snowflake to read them all in once per hour, incurring less compute cost.
  • Uploading files to that internal S3-like location is apparently not possible with the Node.js integration, but it is possible with the Python integration, so we would have to do a bit of work to get this working well. It seems doable, though.


Twixes commented Jul 5, 2021

So @yakkomajuri, for some context:
I tried to use the snowflakedb/snowflake-connector-nodejs#169 branch to upload buffer dumps into Snowflake internal stages, but unfortunately that's not possible with it. As far as I understand, when uploading to an internal stage there isn't actually a Snowflake API endpoint; instead, a URL to the underlying storage (AWS S3 / GCP GCS / whatever storage Azure has) is obtained and then used for the upload, and that requires some Snowflake-y authentication which is pretty opaque. Also (AFAIU), that PR only supports S3 for now. It seems to be largely reverse-engineered from the Python library, and much reverse engineering is still left to be done there to make the feature actually functional.

Hence… I would suggest going for that external stage route in the end – GCS and/or S3. Less user-friendly, but much less of a behemoth to engineer.
In that case the plugin would need credentials for (a rough config sketch follows the list):

  • Snowflake (like now)
  • GCS or S3 (new)
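Purely to illustrate those two credential groups, here's a hypothetical shape the plugin config could take in TypeScript – every field name below is made up for the sketch, not the plugin's actual configuration:

```ts
// Hypothetical plugin config shape – field names are illustrative only,
// not the plugin's real configuration schema.
interface SnowflakeExportConfig {
    // Snowflake (like now)
    snowflakeAccount: string
    snowflakeUsername: string
    snowflakePassword: string
    snowflakeDatabase: string
    snowflakeSchema: string
    snowflakeTable: string
    // Object storage backing the external stage (new)
    stageType: 'S3' | 'GCS'
    bucketName: string
    // S3 variant
    awsAccessKeyId?: string
    awsSecretAccessKey?: string
    awsRegion?: string
    // GCS variant
    gcsCredentialsJson?: string // contents of a service account key file
}
```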

The way it could work (rough sketch after the list) would be:

  • execute a Snowflake query to make sure that the appropriate external stage exists, connected to the object storage bucket we know about
  • use exportEvents to regularly dump JSON files with events into object storage
  • every hour, execute a Snowflake query to ingest events from these dumps in the bucket (the interval can be shorter than an hour in dev mode, but it has to stay infrequent for prod users to keep their costs down)
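A minimal sketch of that flow, assuming the plugin scaffold's setupPlugin/exportEvents/runEveryHour hooks, the snowflake-sdk Node driver and aws-sdk for the S3 variant – the stage/table names and config keys are placeholders (reusing the hypothetical config above), not the final implementation:

```ts
// Rough sketch, S3 variant only. Hook names come from the PostHog plugin scaffold;
// stage/table names and config keys are placeholders, not the final implementation.
import * as snowflake from 'snowflake-sdk'
import * as AWS from 'aws-sdk'

const STAGE = 'POSTHOG_EXPORT_STAGE' // hypothetical external stage name
const TABLE = 'POSTHOG_EVENTS' // hypothetical target table with a single VARIANT column

// Promisified wrapper around the callback-based snowflake-sdk execute()
function executeQuery(connection: any, sqlText: string): Promise<void> {
    return new Promise((resolve, reject) => {
        connection.execute({ sqlText, complete: (err: any) => (err ? reject(err) : resolve()) })
    })
}

export async function setupPlugin({ config, global }: any): Promise<void> {
    // Connect to Snowflake once and make sure the external stage exists,
    // connected to the object storage bucket we know about
    const connection = snowflake.createConnection({
        account: config.snowflakeAccount,
        username: config.snowflakeUsername,
        password: config.snowflakePassword,
        database: config.snowflakeDatabase,
        schema: config.snowflakeSchema,
    })
    await new Promise<void>((resolve, reject) => connection.connect((err) => (err ? reject(err) : resolve())))
    await executeQuery(
        connection,
        `CREATE STAGE IF NOT EXISTS ${STAGE}
            URL = 's3://${config.bucketName}/'
            CREDENTIALS = (AWS_KEY_ID = '${config.awsAccessKeyId}' AWS_SECRET_KEY = '${config.awsSecretAccessKey}')
            FILE_FORMAT = (TYPE = JSON)`
    )
    global.connection = connection
    global.s3 = new AWS.S3({
        accessKeyId: config.awsAccessKeyId,
        secretAccessKey: config.awsSecretAccessKey,
        region: config.awsRegion,
    })
}

export async function exportEvents(events: any[], { config, global }: any): Promise<void> {
    // Dump this batch of events into the bucket as newline-delimited JSON
    const body = events.map((event) => JSON.stringify(event)).join('\n')
    const key = `events/${new Date().toISOString()}-${Math.random().toString(36).slice(2)}.json`
    await global.s3.upload({ Bucket: config.bucketName, Key: key, Body: body }).promise()
}

export async function runEveryHour({ global }: any): Promise<void> {
    // COPY INTO tracks which staged files it has already loaded, so re-running it
    // every hour only ingests new dumps
    await executeQuery(global.connection, `COPY INTO ${TABLE} FROM @${STAGE} FILE_FORMAT = (TYPE = JSON)`)
}
```

The nice part is that the COPY step only needs a warehouse for the duration of the hourly load rather than around the clock, which is the whole point cost-wise.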

I would actually advise against keeping the current temporary-table-based way of doing things (which was discussed in #4), since it's kind of convoluted and just not valuable for anyone. It's cool as a toy, but due to these cost concerns, no one would actually want to use it.

You can build on #4, which is also a refactor of the plugin, but feel free to work on your own fresh PR if that makes it easier.


Twixes commented Jul 13, 2021

Done thanks to #5 (export via AWS) and #7 (export via GCP).

@Twixes Twixes closed this as completed Jul 13, 2021