the ring never removes old ingester even if the ingester pod is evicted #1521
If the ingester shut down cleanly, even on eviction, then it would not be in the ring. So, the first task is to find out why it did not shut down cleanly, and if possible fix that. Everything else you report is deliberate. We return not-ready to halt a rolling update.
Actually I don’t understand what "shut down cleanly" means here.
I mean the ingester went through its exit sequence, rather than being abruptly terminated from outside. There are two main cases: hand-over to another ingester, and flush to store. In both cases the time required is a function of how much data is in memory. When using an explicitly provisioned store (eg DynamoDB) it would be nice to scale up specifically for a “save everything” operation. There’s no code to do that currently.
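For readers new to the codebase, here is a minimal Go sketch of the two clean-exit paths described above. This is not the actual Cortex shutdown code; the function names (`shutdown`, `findHandoverTarget`, `transferTo`, `flushAll`) are invented for illustration, and the real logic lives in the ingester lifecycler.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// shutdown sketches the two clean-exit paths: first try to hand over
// in-memory series to a joining ingester; if no target is found, flush
// everything to the chunk store. Both paths take time proportional to the
// amount of data held in memory, which is why the grace period matters.
func shutdown(findHandoverTarget func() (string, error), transferTo func(string) error, flushAll func() error) error {
	if target, err := findHandoverTarget(); err == nil {
		fmt.Println("handing over in-memory data to", target)
		return transferTo(target)
	}
	fmt.Println("no hand-over target found, flushing all data to the store")
	return flushAll()
}

func main() {
	// Simulated environment: no pending ingester to hand over to, so we flush.
	err := shutdown(
		func() (string, error) { return "", errors.New("no pending ingester") },
		func(addr string) error { return nil },
		func() error { time.Sleep(10 * time.Millisecond); return nil }, // stands in for a long flush
	)
	fmt.Println("clean exit:", err == nil)
}
```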
I tried to reproduce the problem by deleting the pod with --force, and a new ingester pod was created by the deployment controller immediately. I'm confused because of my -distributor.replication-factor setting, so is there anything I misunderstood? I wonder when the ring adds an ingester and when it removes one. Does Consul do it by itself, or does the ingester tell it what to do? I noticed that when an ingester starts and shuts down, it tells the ring. But what if the ingester shuts down uncleanly - is there any solution to automatically clean the unhealthy pod out of the ring? By the way, after I restarted my Consul, the ring only had the active ingester and everything worked well.
I know...
so the replicationFactor is 2 now, instead of what I set in -distributor.replication-factor.
That sounds like the same problem as #1290; see cortex/pkg/ring/replication_strategy.go, line 20 at 7cf0690.
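To make the "RF is suddenly 2" observation concrete, here is a hedged Go sketch of the quorum logic around that line. The types and names (`Instance`, `filterForWrite`) are illustrative, not the real Cortex API: the idea is that when a stale entry is still part of the replica set, the effective replication factor is raised to the set size, so a single healthy ingester can no longer satisfy the write quorum.

```go
package main

import (
	"fmt"
	"time"
)

// Instance is a simplified stand-in for a ring entry; the real entry also
// carries tokens, state, zone, etc.
type Instance struct {
	Addr          string
	LastHeartbeat time.Time
}

func (i Instance) healthy(timeout time.Duration, now time.Time) bool {
	return now.Sub(i.LastHeartbeat) <= timeout
}

// filterForWrite sketches the quorum logic: if the replica set picked from
// the ring is larger than the configured replication factor (for example
// because a dead ingester still owns tokens), the effective replication
// factor is raised to the set size, and a write needs a quorum of that
// effective factor.
func filterForWrite(replicaSet []Instance, configuredRF int, heartbeatTimeout time.Duration, now time.Time) ([]Instance, error) {
	rf := configuredRF
	if len(replicaSet) > rf {
		rf = len(replicaSet) // this is why RF "becomes 2" with one live and one stale entry
	}

	var healthy []Instance
	for _, ing := range replicaSet {
		if ing.healthy(heartbeatTimeout, now) {
			healthy = append(healthy, ing)
		}
	}

	minSuccess := (rf / 2) + 1
	if len(healthy) < minSuccess {
		return nil, fmt.Errorf("at least %d live replicas required, could only find %d", minSuccess, len(healthy))
	}
	return healthy, nil
}

func main() {
	now := time.Now()
	set := []Instance{
		{Addr: "ingester-new", LastHeartbeat: now},
		{Addr: "ingester-stale", LastHeartbeat: now.Add(-10 * time.Minute)},
	}
	_, err := filterForWrite(set, 1, time.Minute, now)
	fmt.Println(err) // quorum of 2 needed, only 1 healthy, so writes fail
}
```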
Actually, if I deploy one ingester and the replicationFactor is 1, and the ingester pod is evicted because of low memory, the kubelet starts another ingester pod. However, the previous ingester didn't exit cleanly, so the corresponding entry in the Consul ring will never be cleaned up. So at this moment:
However, there are two problems here:
@bboreham @tomwilkie @csmarchbanks Any ideas?
The current design requires that you set this value. Your point 1 seems the same as #1290. Point 2 isn't handled because we don't have enough experience of situations that need this. We would probably add it as an option if someone were to submit a PR.
@bboreham If the ingester is killed because of OOM (actually the ingester consumes a lot of memory and this is very common in k8s, at least in my k8s environment), then it will never have a chance to exit cleanly. For point 1, I think the replicationFactor is configured by the user, so keeping it constant may be more reasonable. For point 2, I may need to read more code to better understand the design intent. If it's necessary, I'd like to make a PR to fix it.
"killed because of OOM" is not the same thing as "evicted". A pod that is OOM-killed will restart with the same identity on the same node, hence pick up the same entry in the Cortex ring. |
@bboreham You are right. In our environment the kubelet was configured with hard eviction, so the ingester pod was evicted without a grace period. However, even when configuring the kubelet with soft eviction, I have no idea how to pick a suitable grace period. And if the ingester still can't exit cleanly within the configured grace period, the problem is still not solved.
Hi. Having a look at the code, my assumption is the following (please correct me if I'm wrong):
Some questions:
After changing the aforementioned code (lines 278-280) to the following, I stopped receiving that error.
Also, when I set the relevant option, note the loop in cortex/pkg/ingester/client/pool.go, lines 75 to 92 at 1ca4ad0, which removes unhealthy clients from the pool.
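For context, here is a rough Go sketch of what that client-pool cleanup does. The names (`pool`, `cleanUnhealthy`, `healthChecker`) are illustrative rather than the exact Cortex code; the point, confirmed in the next two comments, is that it only prunes the distributor's in-memory map of gRPC clients and never touches the ring stored in Consul.

```go
package main

import (
	"fmt"
	"sync"
)

// healthChecker stands in for an ingester gRPC client; only the health check
// matters for this sketch.
type healthChecker interface {
	HealthCheck() error
}

// pool is a simplified picture of the distributor-side client pool: a map of
// ingester address to client, periodically pruned of clients that fail their
// health check. Pruning only touches this in-memory map; the entry in the
// ring (Consul/etcd) is left untouched.
type pool struct {
	mtx     sync.Mutex
	clients map[string]healthChecker
}

func (p *pool) cleanUnhealthy() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	for addr, c := range p.clients {
		if err := c.HealthCheck(); err != nil {
			fmt.Printf("removing ingester client for %s: %v\n", addr, err)
			delete(p.clients, addr) // deleting while ranging is safe in Go
		}
	}
}

// fakeClient lets the sketch run without a real gRPC connection.
type fakeClient struct{ err error }

func (f fakeClient) HealthCheck() error { return f.err }

func main() {
	p := &pool{clients: map[string]healthChecker{
		"ingester-new":   fakeClient{},
		"ingester-stale": fakeClient{err: fmt.Errorf("connection refused")},
	}}
	p.cleanUnhealthy()
	fmt.Println("clients left in pool:", len(p.clients)) // 1 - but the ring still has 2 entries
}
```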
I tried this, but it just removes the unhealthy ingester from the distributor pool (which holds ingester clients) instead of removing it from the Consul ring. It doesn't work for me.
Don't read too much into the words - that's removing them from one data structure in memory. There is no code to remove ingesters from the ring when they are suspected to be dead, and this was deliberate.
@bboreham Then what is the purpose of that code?
Risky, easy to get wrong, not necessary day one.
that was to fix #217 |
Here's an example scenario we want to avoid: Cortex is running under Kubernetes, and a rolling update begins:
Now, if we allow the rolling update to proceed, the same thing will happen in each case and we will lose the unflushed data from all ingesters, which could be a significant proportion of all data in the last 12 hours. With the current code the rolling update is halted because there will be an "unhealthy" entry for the old ingester in the ring, and this means the new ingester will never show "ready" to Kubernetes.
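A hedged Go sketch of the readiness behaviour described above, with invented names (`ringEntry`, `readinessHandler`); this is not the actual Cortex readiness check, just the shape of it: the handler returns 503 while any ring entry's heartbeat is stale, which is exactly what keeps the new pod not-ready and stalls the rolling update.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// ringEntry is a minimal stand-in for an ingester's entry in the ring.
type ringEntry struct {
	id            string
	lastHeartbeat time.Time
}

// allHealthy reports whether every ring entry has heartbeated recently.
func allHealthy(entries []ringEntry, timeout time.Duration, now time.Time) bool {
	for _, e := range entries {
		if now.Sub(e.lastHeartbeat) > timeout {
			return false
		}
	}
	return true
}

// readinessHandler returns 503 while any ring entry looks unhealthy, which is
// what keeps the new pod not-ready and halts the rolling update.
func readinessHandler(getRing func() []ringEntry, heartbeatTimeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !allHealthy(getRing(), heartbeatTimeout, time.Now()) {
			http.Error(w, "not ready: unhealthy instances in the ring", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ready")
	}
}

func main() {
	ring := []ringEntry{
		{id: "ingester-new", lastHeartbeat: time.Now()},
		{id: "ingester-old", lastHeartbeat: time.Now().Add(-10 * time.Minute)}, // stale entry left by an unclean shutdown
	}
	http.Handle("/ready", readinessHandler(func() []ringEntry { return ring }, time.Minute))
	// GET /ready keeps returning 503 until the stale entry heartbeats again or is forgotten.
	_ = http.ListenAndServe(":8080", nil)
}
```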
@bboreham Yes, it's exactly the scenario we encountered and it's annoying.
I think you would find losing half the data more annoying than having to operate the system manually when there is a fault.
I also hit this issue. I was able to work around it by completely wiping the slate clean, but it's not ideal.
If you indeed hit the same issue, please follow the steps in #1521 (comment). If your issue is different, please file it separately.
Hi -- FYI I've found another way to hit this. Suppose you have 3 StatefulSet replicas with the "Ordered" pod management policy:
I experienced this running with preemptible nodes (I know, I know) and confirmed with manual testing. If the "Parallel" policy is used instead then pod-1 & pod-2 start in parallel and pick up their former places in the ring.
Why is pod-0 marked as unhealthy? I can't understand this.
Now that chunks storage is deprecated and we use blocks storage, we no longer "hand-over" from one ingester to another. Happy to hear experience reports from people who did automate it.
Loki has an ingester.autoforget_unhealthy option. Would it be possible to add the same functionality into Cortex? Or is there another way to facilitate the same behaviour as Loki's autoforget_unhealthy?
I've read through this issue and the linked issues, and it's still unclear to me whether there is a way to have the ingester ring self-heal in case of unclean shutdowns. Not needing human operator intervention would be extremely valuable to us, as we are losing much more data due to ingesters being down compared to what we would lose by auto-forgetting unhealthy ingesters from the ring.
+1 |
ingester.autoforget_unhealthy would surely fix restarted pods; alternatively, Cortex pods could always register themselves with the same name to avoid this scenario.
ingester.autoforget_unhealthy would be amazing when deploying to AWS with spot instances, where ingesters get destroyed and spun up again. Exposing the Cortex ring status web interface to manually remove unhealthy ingesters is not practical, and it is a security concern.
@rafilkmp3 Thanks for your input on that... I'm using k8s for this and will switch the ingesters to a StatefulSet, which should fix this issue (forcing the pods into consistent names). The other approach was going to be a quick job that would query the endpoint and remove the unhealthy ingesters, but the StatefulSet approach feels much cleaner.
Nobody has coded one for Cortex, to my knowledge.
We tell you not to do this in the docs.
I would be happy to take a stab at writing it.
+1 for this feature. This is especially useful in the distributor ring - a distributor is totally safe to forget if it has been unhealthy for a long time (e.g. 2 days). In that case it is safe to assume it was an unclean shutdown and it will never come back. Another thing is that in the newest Cortex release we introduced a related setting.
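To make the requested behaviour concrete, here is a minimal Go sketch of an autoforget policy, loosely in the spirit of the Loki change referenced below (grafana/loki#3919). The names (`member`, `autoforget`) and the in-memory map are illustrative only; a real implementation would have to read and CAS the ring in the KV store (Consul/etcd) rather than mutate a local map.

```go
package main

import (
	"fmt"
	"time"
)

// member is a minimal view of a ring entry as it might be read from the KV store.
type member struct {
	lastHeartbeat time.Time
}

// autoforget removes entries whose heartbeat is older than forgetAfter, on
// the assumption that they were shut down uncleanly and will never return.
// It returns the IDs that were forgotten.
func autoforget(ring map[string]member, forgetAfter time.Duration, now time.Time) []string {
	var forgotten []string
	for id, m := range ring {
		if now.Sub(m.lastHeartbeat) > forgetAfter {
			delete(ring, id)
			forgotten = append(forgotten, id)
		}
	}
	return forgotten
}

func main() {
	now := time.Now()
	ring := map[string]member{
		"ingester-0": {lastHeartbeat: now},
		"ingester-1": {lastHeartbeat: now.Add(-3 * time.Hour)}, // left behind by an unclean shutdown
	}
	// Forget anything that has been unhealthy for more than an hour.
	for _, id := range autoforget(ring, time.Hour, now) {
		fmt.Println("forgot", id)
	}
}
```

The trade-off debated earlier in this thread applies: forgetting too eagerly risks losing unflushed data that a halted rollout would otherwise protect, so the threshold needs to be well beyond any plausible recovery time.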
Implementation adapted from grafana/loki#3919. Related to cortexproject#1521. Signed-off-by: Josh Carp <jm.carp@gmail.com>
How did you do this? Can you share your config?
Would be nice.
Is there any way to auto-forget unhealthy ring members in Cortex?
In a Kubernetes & Helm based scenario, these Helm values could be a workaround: ingester:
initContainers:
- name: cleanup-unhealthy-ingesters
image: alpine
command:
- sh
- -c
- 'apk add curl jq && curl -H "Accept: application/json" http://cortex-distributor:8080/ingester/ring | jq ".shards[] | select(.state==\"UNHEALTHY\") | .id" | xargs -I{} curl -d "forget={}" -H "Accept: application/json" http://cortex-distributor:8080/ingester/ring' Please be aware that you need to change the two urls in conformance to your Helm release name. Here it is |
We ended up adding these Kubernetes resources for an automatic cleanup of unhealthy ingesters:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-ingester-cleanup-script
  namespace: cortex
data:
  script: |
    while true; do
      which curl > /dev/null 2>&1
      if [ $? -eq 1 ]; then
        apk add curl
      fi
      which jq > /dev/null 2>&1
      if [ $? -eq 1 ]; then
        apk add jq
      fi
      curl -H "Accept: application/json" http://cortex-distributor:8080/ingester/ring |
        jq ".shards[] | select(.state==\"Unhealthy\") | .id" |
        sed 's|"||g' |
        xargs -I{} curl -d "forget={}" -d 'csrf_token=$__CSRF_TOKEN_PLACEHOLDER__' -H "Accept: application/json" http://cortex-distributor:8080/ingester/ring
      sleep 3
    done
    true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cortex-ingester-cleanup
  namespace: cortex
  labels:
    app: cortex-ingester-cleanup
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cortex-ingester-cleanup
  template:
    metadata:
      labels:
        app: cortex-ingester-cleanup
        revision: '1'
    spec:
      containers:
        - name: cortex-ingester-cleanup
          image: alpine
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
          command:
            - sh
            - -c
            - "apk add bash && exec bash /cortex-ingester-cleanup.sh"
          volumeMounts:
            - name: cortex-ingester-cleanup-script
              mountPath: /cortex-ingester-cleanup.sh
              subPath: script
      volumes:
        - name: cortex-ingester-cleanup-script
          configMap:
            name: cortex-ingester-cleanup-script
```
I think the reason is that when 2 pods are terminated at the same time, then with the Ordered policy one pod will start first. That pod will be shown as ACTIVE in the ring, but on the k8s side it is not ready. I checked the log of that pod and it showed this log.
Got bitten by this terribly several times now, and lost a lot of time and data :-( - would really love to see something like ingester.autoforget_unhealthy.
Where do I find the value for the placeholder in the cleanup script above?
I have a similar problem to #1502.
When my ingester pod was evicted, a new ingester pod was created.
Now the ring has two ingesters, but only one (the new one) is healthy. The old one will not be removed from the ring, even if I delete the evicted pod manually.
The ring information is as follows:
`
`
and the ingester's status is always not ready, with the distributor logging this error:
level=warn ts=2019-07-19T03:41:45.413839063Z caller=server.go:1995 traceID=daf4028f530860f msg="POST /api/prom/push (500) 727.847µs Response: \"at least 1 live ingesters required, could only find 0\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 3742; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.11.0; X-Forwarded-For: 172.16.0.17; X-Forwarded-Host: perf.monitorefk.huawei.com; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Original-Uri: /api/prom/push; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 172.16.0.17; X-Request-Id: 62a470dc6de7a83c8974e3411fa63e40; X-Scheme: https; X-Scope-Orgid: custom; "
I wonder if there is any solution to deal with this situation automatically?
Maybe to check the replication factor and remove unhealthy excess ingesters from the ring?