-
Notifications
You must be signed in to change notification settings - Fork 12
Configurable max event size for webhook updates #94
Comments
@ihortom has some examples of pipelines that always successfully emit the "job started", but never the "job failed". Excessive stdout/stderr in-between is the most likely culprit methinks. Are we successfully truncating the stdio to a safe limit? |
We're truncating it to 10kb for both (#87). Factotum will stop sending updates if it retrieves three non-200 responses (with a random backoff inbetween). I will reduce the limit further, 1kb is probably sufficient (~2kb total) unless there's an official maximum I can use (obviously preferable). Information on the responses should be available in the log file ( |
10kb should be totally fine - I think there must be another bug at work. How sure are you that the 10kb is being enforced? |
Sorry, bit misleading wording there it's 10kb for each stdout/stderr + the rest of the payload. So a message can end up at around ~20kb if both are full. I must have tested this with a collector at some point, but I will double check to be sure. I'm fairly confident that it's being enforced as it's covered by these tests but I'll spend some time looking into it to make sure this is the case. |
Hey @ninjabear -
Aha - this is consistent with what I'm seeing, where Factotum takes a couple of minutes to exit even after all child processes are complete, and then fails to send the webhook. That will be the retries. |
That makes sense. We can be sure that this is the case by checking the log file - this should contain the string |
Bringing forwards as this impacts on the front-end's utility |
We should probably give a better summary of which updates were failed to be transmitted and their reason too. Something like this:
|
I've been thinking about this a bit more, and I'm not sure our current fix will resolve this problem completely. The thing is, the payload will always be arbitrarily large because we attach things like the factfile base64 encoded. If I had a max 5k payload then I can only send 3k (minus the constant bits) of stdout/stderr. This number changes as sizes of other things change. I've also worked out why later events fail, and that is because the stdout/stderr for every task is included[1] with each event. This means as a job is processed, the event size steadily increases until we hit the max event size. It's designed like this because each update contains all information about a job in a stateless way. From any single update you can work out the full state of the job. This hugely simplifies grokking and storage later (and is the reason the UI can work as it does). I think we need to know what exactly the limit of an event size is, and if we can be 100% sure that's what's going on. This limit could be enforced in a number of places, starting at the load balancer and running through the http framework each collector uses right into the Snowplow code that handles requests. This means testing with Snowplow Mini is out unfortunately. @ungn as a side note we need to update iglu central on the max-sizes of these events. Don't forget there's also heartbeats and a couple of other tickets in this milestone. /cc @jbeemster / @alexanderdean if you have any ideas about the absolute max event size (perhaps this is defined I just don't know about it) [1] : https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.factotum/task_update/jsonschema/1-0-0#L126-L133 / https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.factotum/job_update/jsonschema/1-0-0#L116-L123 |
@ninjabear - this all makes total sense. A few thoughts:
|
We have confirmed the problem here was downstream; we believe Factotum sends all webhooks correctly. As such I'm going to remove the bug label. We should however revisit the idea of the maximum size events Factotum should send on a lower priority - I'll leave this ticket open for that. |
Updating this issue with our latest thinking on this issue - the maximum size of an individual event should really only be limited by the maximum record size that can be handled downstream. Stream Collectors can handle up-to 1MB payloads into Kinesis which is vastly more than the current limit. Further to this the actual truncation of the output can be quite detrimental to the usage of these webhooks events in processing downstream. In short I think we really just need to make this fully configurable so that we can start to figure out what the maximum webhook sizes really are that can be handled and to start recommending and using that size - we shouldn't be truncating without cause! In reality the abs limits should be:
|
I think it's related to overly-large payloads (too much stdout/err). We should be truncating this farely aggressively...
The text was updated successfully, but these errors were encountered: