This repository has been archived by the owner on Mar 7, 2024. It is now read-only.

Consume less resources #3

Closed
mariusandra opened this issue May 27, 2021 · 2 comments

Comments


mariusandra commented May 27, 2021

For a client using the Snowflake export:

After running it for a while, we're sure that we are consuming too many Snowflake credits via the current import method. We'll need to change to an hourly import. Do you think the current Snowflake plugin can be made to do that well? Last time [name] switched it to hourly he got errors.

Our export code continuously sends INSERT statements to upload data. This requires a Snowflake compute warehouse to be running at all times, and that's $$$.

After a call with Craig from Snowflake, we found out:

  • As an alternative, he suggested uploading documents one by one to Snowflake's internal “S3” (an internal stage) and then, for example, asking Snowflake to read them all in once per hour, incurring less compute cost.
  • Uploading files to that internal S3-like location is apparently not possible with the Node.js integration, but it is possible with the Python integration, so we would have to do a bit of work to get this working well. It seems doable, though.


Twixes commented Jul 5, 2021

So @yakkomajuri, for some context:
I tried to use the snowflakedb/snowflake-connector-nodejs#169 branch to upload buffer dumps into Snowflake internal stages, but unfortunately that's not possible with it. As far as I understand, when uploading to an internal stage there isn't actually a Snowflake API endpoint; instead, a URL to the underlying storage (AWS S3 / GCP GCS / whatever storage Azure has) is obtained and then used for the upload, and that requires some Snowflake-y authentication which is pretty opaque. Also (AFAIU), that PR only supports S3 for now. It seems to be largely reverse-engineered from the Python library, and much reverse engineering is still left to be done there to make the feature actually functional.

Hence… I would suggest going for that external stage route in the end – GCS and/or S3. Less user-friendly, but much less of a behemoth to engineer.
In that case the plugin would need credentials for (a rough config sketch follows the list):

  • Snowflake (like now)
  • GCS or S3 (new)
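Purely to illustrate those two credential groups, here's a hypothetical shape the plugin config could take in TypeScript – every field name below is made up for the sketch, not the plugin's actual configuration:

```ts
// Hypothetical plugin config shape – field names are illustrative only,
// not the plugin's real configuration schema.
interface SnowflakeExportConfig {
    // Snowflake (like now)
    snowflakeAccount: string
    snowflakeUsername: string
    snowflakePassword: string
    snowflakeDatabase: string
    snowflakeSchema: string
    snowflakeTable: string
    // Object storage backing the external stage (new)
    stageType: 'S3' | 'GCS'
    bucketName: string
    // S3 variant
    awsAccessKeyId?: string
    awsSecretAccessKey?: string
    awsRegion?: string
    // GCS variant
    gcsCredentialsJson?: string // contents of a service account key file
}
```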

The way it could work (rough sketch after the list) would be:

  • execute a Snowflake query to make sure that the appropriate external stage exists, connected to the object storage bucket we know about
  • use exportEvents to regularly dump JSON files with events into object storage
  • every hour, execute a Snowflake query to ingest events from these dumps in the bucket (the interval can be shorter than an hour in dev mode, but it has to stay infrequent for prod users to keep their costs down)
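A minimal sketch of that flow, assuming the plugin scaffold's setupPlugin/exportEvents/runEveryHour hooks, the snowflake-sdk Node driver and aws-sdk for the S3 variant – the stage/table names and config keys are placeholders (reusing the hypothetical config above), not the final implementation:

```ts
// Rough sketch, S3 variant only. Hook names come from the PostHog plugin scaffold;
// stage/table names and config keys are placeholders, not the final implementation.
import * as snowflake from 'snowflake-sdk'
import * as AWS from 'aws-sdk'

const STAGE = 'POSTHOG_EXPORT_STAGE' // hypothetical external stage name
const TABLE = 'POSTHOG_EVENTS' // hypothetical target table with a single VARIANT column

// Promisified wrapper around the callback-based snowflake-sdk execute()
function executeQuery(connection: any, sqlText: string): Promise<void> {
    return new Promise((resolve, reject) => {
        connection.execute({ sqlText, complete: (err: any) => (err ? reject(err) : resolve()) })
    })
}

export async function setupPlugin({ config, global }: any): Promise<void> {
    // Connect to Snowflake once and make sure the external stage exists,
    // connected to the object storage bucket we know about
    const connection = snowflake.createConnection({
        account: config.snowflakeAccount,
        username: config.snowflakeUsername,
        password: config.snowflakePassword,
        database: config.snowflakeDatabase,
        schema: config.snowflakeSchema,
    })
    await new Promise<void>((resolve, reject) => connection.connect((err) => (err ? reject(err) : resolve())))
    await executeQuery(
        connection,
        `CREATE STAGE IF NOT EXISTS ${STAGE}
            URL = 's3://${config.bucketName}/'
            CREDENTIALS = (AWS_KEY_ID = '${config.awsAccessKeyId}' AWS_SECRET_KEY = '${config.awsSecretAccessKey}')
            FILE_FORMAT = (TYPE = JSON)`
    )
    global.connection = connection
    global.s3 = new AWS.S3({
        accessKeyId: config.awsAccessKeyId,
        secretAccessKey: config.awsSecretAccessKey,
        region: config.awsRegion,
    })
}

export async function exportEvents(events: any[], { config, global }: any): Promise<void> {
    // Dump this batch of events into the bucket as newline-delimited JSON
    const body = events.map((event) => JSON.stringify(event)).join('\n')
    const key = `events/${new Date().toISOString()}-${Math.random().toString(36).slice(2)}.json`
    await global.s3.upload({ Bucket: config.bucketName, Key: key, Body: body }).promise()
}

export async function runEveryHour({ global }: any): Promise<void> {
    // COPY INTO tracks which staged files it has already loaded, so re-running it
    // every hour only ingests new dumps
    await executeQuery(global.connection, `COPY INTO ${TABLE} FROM @${STAGE} FILE_FORMAT = (TYPE = JSON)`)
}
```

The nice part is that the COPY step only needs a warehouse for the duration of the hourly load rather than around the clock, which is the whole point cost-wise.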

I would actually advise against keeping the current temporary-table-based way of doing things (which was discussed in #4), since it's kind of convoluted and just not valuable for anyone. It's cool as a toy, but due to these cost concerns, no one would actually want to use it.

You can build on #4, which is also a refactor of the plugin, but feel free to work on your own fresh PR if that makes it easier.


Twixes commented Jul 13, 2021

Done thanks to #5 (export via AWS) and #7 (export via GCP).

@Twixes Twixes closed this as completed Jul 13, 2021