Skip to content
This repository has been archived by the owner on Aug 13, 2024. It is now read-only.

Configurable max event size for webhook updates #94

Closed
alexanderdean opened this issue Dec 27, 2016 · 12 comments
Closed

Configurable max event size for webhook updates #94

alexanderdean opened this issue Dec 27, 2016 · 12 comments
Assignees
Milestone

Comments

@alexanderdean
Copy link
Contributor

Waiting for webhook to finish sending events... done! Warning: some events failed to send

I think it's related to overly-large payloads (too much stdout/err). We should be truncating this farely aggressively...

@alexanderdean
Copy link
Contributor Author

@ihortom has some examples of pipelines that always successfully emit the "job started", but never the "job failed". Excessive stdout/stderr in-between is the most likely culprit methinks.

Are we successfully truncating the stdio to a safe limit?

@ninjabear
Copy link
Contributor

We're truncating it to 10kb for both (#87). Factotum will stop sending updates if it retrieves three non-200 responses (with a random backoff inbetween). I will reduce the limit further, 1kb is probably sufficient (~2kb total) unless there's an official maximum I can use (obviously preferable). Information on the responses should be available in the log file (factotum.log)

@alexanderdean
Copy link
Contributor Author

10kb should be totally fine - I think there must be another bug at work. How sure are you that the 10kb is being enforced?

@ninjabear
Copy link
Contributor

Sorry, bit misleading wording there it's 10kb for each stdout/stderr + the rest of the payload. So a message can end up at around ~20kb if both are full. I must have tested this with a collector at some point, but I will double check to be sure.

I'm fairly confident that it's being enforced as it's covered by these tests but I'll spend some time looking into it to make sure this is the case.

@alexanderdean
Copy link
Contributor Author

Hey @ninjabear -

Factotum will stop sending updates if it retrieves three non-200 responses (with a random backoff inbetween).

Aha - this is consistent with what I'm seeing, where Factotum takes a couple of minutes to exit even after all child processes are complete, and then fails to send the webhook. That will be the retries.

@ninjabear
Copy link
Contributor

ninjabear commented Jan 27, 2017

That makes sense. We can be sure that this is the case by checking the log file - this should contain the string Failed to send webhook update to followed by a HTTP response code. The log file is in ./.factotum/factotum.log

@alexanderdean
Copy link
Contributor Author

Bringing forwards as this impacts on the front-end's utility

@ninjabear ninjabear changed the title Factotum webhooks not always firing? Factotum webhook payload is too large Mar 16, 2017
@ninjabear
Copy link
Contributor

ninjabear commented Mar 16, 2017

We should probably give a better summary of which updates were failed to be transmitted and their reason too.

Something like this:

Failed to send XX update event to 255.255.255.255:8080 at <time>. The server responded with X.

@ninjabear
Copy link
Contributor

I've been thinking about this a bit more, and I'm not sure our current fix will resolve this problem completely.

The thing is, the payload will always be arbitrarily large because we attach things like the factfile base64 encoded. If I had a max 5k payload then I can only send 3k (minus the constant bits) of stdout/stderr. This number changes as sizes of other things change.

I've also worked out why later events fail, and that is because the stdout/stderr for every task is included[1] with each event. This means as a job is processed, the event size steadily increases until we hit the max event size.

It's designed like this because each update contains all information about a job in a stateless way. From any single update you can work out the full state of the job. This hugely simplifies grokking and storage later (and is the reason the UI can work as it does).

I think we need to know what exactly the limit of an event size is, and if we can be 100% sure that's what's going on. This limit could be enforced in a number of places, starting at the load balancer and running through the http framework each collector uses right into the Snowplow code that handles requests. This means testing with Snowplow Mini is out unfortunately.

@ungn as a side note we need to update iglu central on the max-sizes of these events. Don't forget there's also heartbeats and a couple of other tickets in this milestone.

/cc @jbeemster / @alexanderdean if you have any ideas about the absolute max event size (perhaps this is defined I just don't know about it)

[1] : https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.factotum/task_update/jsonschema/1-0-0#L126-L133 / https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.factotum/job_update/jsonschema/1-0-0#L116-L123

@alexanderdean
Copy link
Contributor Author

@ninjabear - this all makes total sense. A few thoughts:

  • The abs max event size is unfortunately undefined in Snowplow. I agree this ought to change
  • We ought to make the abs max event size configurable in Factotum as we can't 100% assume that the events will only be sent to Snowplow
  • Can we explore de facto max event sizes using Avalanche @jbeemster ?
  • If we can define the abs max event size, can we use this to degrade the webhook payload gracefully? I.e. we come up with two or three tiers of information by priority, and start truncating the lowest priority till we come under the limit?

@ungn ungn mentioned this issue Jun 5, 2017
16 tasks
@ninjabear
Copy link
Contributor

We have confirmed the problem here was downstream; we believe Factotum sends all webhooks correctly.

As such I'm going to remove the bug label. We should however revisit the idea of the maximum size events Factotum should send on a lower priority - I'll leave this ticket open for that.

@ninjabear ninjabear removed the bug label Jul 28, 2017
@ninjabear ninjabear changed the title Factotum webhook payload is too large Configurable max event size for webhook updates Jul 28, 2017
@jbeemster
Copy link
Contributor

Updating this issue with our latest thinking on this issue - the maximum size of an individual event should really only be limited by the maximum record size that can be handled downstream. Stream Collectors can handle up-to 1MB payloads into Kinesis which is vastly more than the current limit.

Further to this the actual truncation of the output can be quite detrimental to the usage of these webhooks events in processing downstream.

In short I think we really just need to make this fully configurable so that we can start to figure out what the maximum webhook sizes really are that can be handled and to start recommending and using that size - we shouldn't be truncating without cause!

In reality the abs limits should be:

  1. Kinesis: 1MB per event/record
  2. Elasticsearch: Handles records up to hundreds of megabytes in size - no issue here
  3. Redshift: Records will be truncated on insert if they exceeded the row definition
  • Not an error as such but could reduce the usefulness of stored records...

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants