Teraslice Elasticsearch Reader ES6 Worker Error #3962
Comments
I guess it's possible this should be filed on the Elasticsearch asset instead.
There don't appear to be any errors in the ES data node logs that correlate with these slice errors, and the clusters were all in an OK state: green, with no GCs bigger or longer than usual.
I've tracked down a worker that experienced one of these slice errors, and it didn't really have anything else to add. Just the procedural stuff (redacted a bit, and some of these lines are clipped):
It's worth pointing out that the jobs that had slice errors were reading from ES 6.5.4, but we have one other job reading from ES 6.8.6 that has NOT had a slice failure.
I started tracing which lines of code were being run through the stack trace and will list them below to give a better idea of what's going on:
Error above happens here:
TSError above occurs here on line 839: teraslice/packages/elasticsearch-api/index.js Lines 829 to 848 in d442bfa
Occurs in teraslice/packages/job-components/src/operations/fetcher.ts Lines 16 to 19 in 7656856
Occurs in
I've validated that the fix in terascope/elasticsearch-assets#1365 resolves the issue above. I used Chaos Mesh to fail incoming HTTP requests to Elasticsearch 6.5.4 using
Elasticsearch used:
Job file used:

    {
        "name": "es-to-noop",
        "lifecycle": "once",
        "workers": 1,
        "log_level": "info",
        "assets": [
            "elasticsearch:4.0.5"
        ],
        "operations": [
            {
                "_op": "elasticsearch_reader",
                "connection": "es6",
                "index": "random-data-1",
                "size": 2500,
                "date_field_name": "created"
            },
            {
                "_op": "noop"
            }
        ]
    }

Worker logs using
Could not reproduce it on elasticsearch-assets:v4.2.1.
For what it's worth, here is the scheduled Chaos Mesh "experiment" that I ran against it, in case we ever need to reproduce this:

    kind: Schedule
    apiVersion: chaos-mesh.org/v1alpha1
    metadata:
      namespace: services-dev1
      name: abort-3-2
      annotations:
        experiment.chaos-mesh.org/pause: 'false'
    spec:
      schedule: '*/1 * * * *'  # this is just cron syntax for "every minute do this"
      startingDeadlineSeconds: null
      concurrencyPolicy: Forbid
      historyLimit: 1
      type: HTTPChaos
      httpChaos:
        selector:
          namespaces:
            - services-dev1
          labelSelectors:
            app: elasticsearch
        mode: all
        target: Response
        abort: true
        port: 9200
        path: '*'
        duration: 500ms
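The experiment aborts Elasticsearch responses for a short window once a minute, so any request in flight during that window fails. The thread doesn't show what terascope/elasticsearch-assets#1365 actually changes, so the following is only a generic sketch of how a caller could tolerate that kind of brief fault: retry with exponential backoff, and only for errors judged transient. `retryTransient`, `RetryOptions`, and the `isTransient` predicate are hypothetical names for illustration, not part of Teraslice or the Elasticsearch asset.

```typescript
// Hypothetical helper for illustration; not taken from Teraslice or the ES asset.
interface RetryOptions {
  retries: number;     // maximum number of retries after the first attempt
  baseDelayMs: number; // delay before the first retry; doubles on each retry
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryTransient<T>(
  fn: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  { retries, baseDelayMs }: RetryOptions
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on non-transient errors, or once the retry budget is spent.
      if (!isTransient(err) || attempt >= retries) throw err;
      // Exponential backoff: baseDelayMs, 2x, 4x, ...
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

With a 500ms abort window, a base delay of a few hundred milliseconds and two or three retries would be enough to step past the fault, assuming the failure really is transient. What counts as transient is left to the caller's predicate, since the error text isn't shown here.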
We have been doing a large re-index and have occasionally (about 1 in 200k slices) seen slice failures with the following error attached to the slice. This job is using ES asset version 4.0.5 and the following API config (redacted), and it is reading from an Elasticsearch 6.5.4 cluster. It writes to Kafka, but that doesn't seem relevant.
I'm not sure ... is this a worker processing a slice experiencing a Transport fault talking to ES6?
The Teraslice cluster version is as follows:
Edit: Changed ES cluster version from 6.5.2 to 6.5.4.