Prevents backlog when writing to Elasticsearch #1765

codefromthecrypt · 2017-10-10T12:32:56Z

In the past, delayed or otherwise unhealthy Elasticsearch clusters could
create a backlog, leading to an OOM arising from the HTTP dispatcher ready
queue. This change prevents a ready queue from forming instead. This means
we drop spans when the backend isn't responding instead of crashing the server.

Fixes #1760
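
The idea of dropping instead of queueing can be illustrated with a minimal sketch. This is not the code in this pull request: `DroppingLimiter`, `tryStart`, `finish`, and `maxRequests` are hypothetical names, and the sketch only assumes that a fixed number of requests may be in flight at once (for example, a limit such as ES_MAX_REQUESTS). A request that cannot get a permit is counted as dropped right away, so nothing accumulates in memory while the backend is slow.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical limiter: bounds in-flight requests and drops overflow instead of queueing it. */
final class DroppingLimiter {
  private final int maxRequests;
  private final Semaphore permits;
  private final AtomicLong dropped = new AtomicLong();

  DroppingLimiter(int maxRequests) {
    this.maxRequests = maxRequests;
    this.permits = new Semaphore(maxRequests);
  }

  /** Returns true if the caller may send now; false means the request was dropped. */
  boolean tryStart() {
    if (permits.tryAcquire()) return true; // never blocks, so no backlog can form
    dropped.incrementAndGet();
    return false;
  }

  /** Must be called exactly once for every successful tryStart(), on success or failure. */
  void finish() {
    permits.release();
  }

  /** Number of requests currently in flight, derived from the semaphore. */
  int inFlight() {
    return maxRequests - permits.availablePermits();
  }

  /** Total number of requests rejected because the limit was reached. */
  long dropped() {
    return dropped.get();
  }
}
```

With a limit of 2, a third concurrent `tryStart()` returns false and increments the drop counter; once an earlier request calls `finish()`, new requests are accepted again. Reading the in-flight count from the semaphore avoids depending on the size of any client-side queue.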


codefromthecrypt · 2017-10-10T12:33:09Z

cc @openzipkin/elasticsearch

codefromthecrypt · 2017-10-10T12:49:11Z

Actually, I was able to crash the server even with this change when setting ES_MAX_REQUESTS=2 and running 5 concurrent senders of 10 spans. It took about 5 minutes to kill the server.

zipkin                      | 2017-10-10 12:42:26.056  INFO 5 --- [nio-9411-exec-1] o.s.web.servlet.DispatcherServlet        : FrameworkServlet 'dispatcherServlet': initialization completed in 138 ms
zipkin                      | OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000f8f2f000, 107704320, 0) failed; error='Cannot allocate memory' (errno=12)
zipkin                      | #
zipkin                      | # There is insufficient memory for the Java Runtime Environment to continue.
zipkin                      | # Native memory allocation (mmap) failed to map 107704320 bytes for committing reserved memory.
zipkin                      | # An error report file with more information is saved as:
zipkin                      | # /zipkin/hs_err_pid5.log
zipkin exited with code 1


codefromthecrypt · 2017-10-10T14:08:20Z

Ran with less logging per #1766.
I simulated a surge, and dropping spans definitely works.

I watched things recover via Prometheus (openzipkin-attic/docker-zipkin#135) and a Grafana dashboard: https://grafana.com/dashboards/1598/

@Logic-32 I'm going to merge this and cut 2.2.0 (which has the Prometheus setup I used). Please set up a dashboard and alerts. Also, you probably want to add the Elasticsearch queue length to whatever that is. There are probably many places to improve, but I hope this makes things better.

codefromthecrypt mentioned this pull request Oct 10, 2017

Investigate how to limit backlog on Elasticsearch collector #1760

Closed

Attempt to use semaphore as might be more reliable to read than queue…

69bb3d4

… size

codefromthecrypt merged commit 972072a into master Oct 10, 2017

codefromthecrypt deleted the es-backlog branch October 10, 2017 14:08

codefromthecrypt mentioned this pull request Oct 10, 2017

Moves log messages behind drop messages to debug level #1766

Closed
