Clickhouse Operator exception causes clickhouse cluster exception #890
Could you share your ClickHouseInstallation manifest?
Multiple copies of what do you need? All clusters are defined in a ConfigMap and mounted as a file (see the sketch below).
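As an aside, a minimal sketch of how one might inspect that generated cluster definition inside a pod. The pod name follows the operator's usual chi-{installation}-{cluster}-{shard}-{replica}-0 pattern and the chop-generated-remote_servers.xml file name is an assumption about the operator's generated config; adjust both for your deployment.

```bash
# Sketch: inspect the remote_servers definition the operator generated and
# mounted into a ClickHouse pod (pod and file names are assumptions).
kubectl -n <namespace> exec chi-clickhouse-czhfe-0-0-0 -- \
  cat /etc/clickhouse-server/config.d/chop-generated-remote_servers.xml

# List the ConfigMaps the operator created for this installation.
kubectl -n <namespace> get configmap | grep chi-clickhouse
```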
1. ClickHouseInstallation manifest:

apiVersion: "clickhouse.altinity.com/v1"
kind: ClickHouseInstallation
metadata:
  name: clickhouse
spec:
  defaults:
    templates:
      dataVolumeClaimTemplate: clickhouse-data
      podTemplate: clickhouse
      serviceTemplate: clickhouse-default
  configuration:
    zookeeper:
      nodes:
        - host: zookeeper-0.zookeeper-headless
          port: 2181
        - host: zookeeper-1.zookeeper-headless
          port: 2181
        - host: zookeeper-2.zookeeper-headless
          port: 2181
    clusters:
      - name: czhfe
        layout:
          shardsCount: 2
          replicasCount: 2
    profiles:
      default/distributed_aggregation_memory_efficient: "1"
      default/max_bytes_before_external_sort: "6184752906"
      default/max_bytes_before_external_group_by: "3865470566"
    settings:
      disable_internal_dns_cache: "1"
      max_server_memory_usage: "7730941133"
      prometheus/asynchronous_metrics: "true"
      prometheus/endpoint: /metrics
      prometheus/events: "true"
      prometheus/metrics: "true"
      prometheus/port: "8001"
      prometheus/status_info: "true"
    users:
      clickhouse_admin/networks/ip: "::/0"
      clickhouse_admin/password: "xxxxxxxxxx"
      clickhouse_admin/profile: default
      clickhouse_admin/access_management: 1
  templates:
    podTemplates:
      - name: clickhouse
        podDistribution:
          - type: ShardAntiAffinity
            scope: Shard
        spec:
          containers:
            - name: clickhouse-pod
              image: clickhouse/clickhouse-server:21.8.14.5
              ports:
                - name: metrics
                  containerPort: 8001
              resources:
                requests:
                  memory: "8Gi"
                  cpu: "1"
                limits:
                  memory: "8Gi"
                  cpu: "1"
              env:
                - name: TZ
                  value: "Asia/Shanghai"
              lifecycle:
                preStop:
                  exec:
                    command: [ "/bin/sh","-c","clickhouse stop" ]
              livenessProbe:
                initialDelaySeconds: 10
                failureThreshold: 3
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 10
                httpGet:
                  port: http
                  scheme: HTTP
                  path: /ping
              readinessProbe:
                initialDelaySeconds: 5
                failureThreshold: 3
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 10
                httpGet:
                  port: http
                  scheme: HTTP
                  path: /ping
              startupProbe:
                initialDelaySeconds: 20
                failureThreshold: 30
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 10
                httpGet:
                  port: http
                  scheme: HTTP
                  path: /ping
          terminationGracePeriodSeconds: 60
    volumeClaimTemplates:
      - name: clickhouse-data
        reclaimPolicy: Retain
        spec:
          storageClassName: csi-disk-ssd
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi
    serviceTemplates:
      - name: clickhouse-default
        generateName: clickhouse-server
        spec:
          ports:
            - name: http
              port: 8123
            - name: tcp
              port: 9000
          type: ClusterIP

2. This refers to whether the clickhouse-operator can run multiple copies, not to clickhouse-server.
3. This problem may occur whenever the clickhouse-operator has restarted.
I don't see
Could you share the following query result?
SELECT hostName() h, * FROM cluster('all-sharded',system.query_log) WHERE query_id='52fafb46-2a44-4028-9a3b-0e8d82237a02' FORMAT Vertical
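If it helps, a hedged sketch of a narrower follow-up query that also returns the original query text and the exception fields; the columns used (type, event_time, query, exception_code, exception) are standard system.query_log columns, and the query_id is the one requested above:

```bash
# Sketch: pull the failing query's text and exception details from all shards.
clickhouse-client -q "
  SELECT hostName() AS h, type, event_time, query, exception_code, exception
  FROM cluster('all-sharded', system.query_log)
  WHERE query_id = '52fafb46-2a44-4028-9a3b-0e8d82237a02'
  FORMAT Vertical"
```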
It's possible, but only for the same version of clickhouse-operator. If you have multiple, could you share the output of the following command?
kubectl get deploy --all-namespaces -l app=clickhouse-operator
I don't know whether clickhouse-operator can run with more than one replica (i.e. the Deployment's replicas in Kubernetes set to greater than 1), and I'm not sure whether running more than one replica has any impact.
The shared results are incomplete; you skipped the original query (you shared only the exception-related fields), so I don't know what exactly happened.
OK, so now you have only one instance. Could you share the following query results?
kubectl get chi --all-namespaces
kubectl exec chi-clickhouse-skyline-0-0-0 -n <namespace_where_chi_installed> -- clickhouse-client -mn -q "SHOW CREATE TABLE `482239998746169344`.`log_detail` FORMAT Vertical; SELECT * FROM system.clusters FORMAT Vertical"
Sorry, some of the table structures are not so easy to share. Can clickhouse-operator run multiple copies? Does it have any impact?
I already asked and you did not answer: multiple copies of what?
Of clickhouse-operator. I'm asking whether the replicas of the clickhouse-operator Deployment in Kubernetes can be set to greater than 1 (i.e. multiple replicas).
Yes, clickhouse-operator can be run in different namespaces. You shared the following command output,
and it means you have only ONE copy of clickhouse-operator, so I ask again, please share
Yes. What I want to confirm now is whether the clickhouse-operator Deployment in the same namespace can run multiple copies (see the sketch below for checking the replica count).
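For what it's worth, a quick sketch for checking (and, if needed, resetting) the operator's replica count; the Deployment name clickhouse-operator and the kube-system namespace are the usual defaults and are assumptions here:

```bash
# Sketch: show the configured replica count of the operator Deployment
# (name/namespace are assumed defaults; adjust if yours differ).
kubectl -n kube-system get deployment clickhouse-operator \
  -o jsonpath='{.spec.replicas}{"\n"}'

# Keep a single replica running.
kubectl -n kube-system scale deployment clickhouse-operator --replicas=1
```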
An abnormal restart of the clickhouse-operator will cause it to reconcile after restarting. Isn't that the same problem mentioned in issue #855?
I recently ran into the same symptoms after upgrading the operator from 0.13.5 to 0.18.2, but with one operator in the namespace. The cluster was removed even though the CRD had been upgraded. It seems the root cause is that if the install spec is broken (see below), the cluster may be removed. Here are the clickhouse-operator logs:
We had to revert and manually add "" to
@chanadian |
I followed the upgrade instructions listed here: https://github.com/Altinity/clickhouse-operator/blob/master/docs/operator_upgrade.md
Sequence:
Steps 3 and 4 were done almost simultaneously, back to back. Should there have been a wait? Note:
You should first update /spec/templating/policy to the new standard, setting it to auto or manual (it defaulted to "" before this), then update the CRD, and finally update the clickhouse-operator, to avoid this problem (this should be considered a bug in the clickhouse-operator). I hit this problem before in #842. For example (see also the loop sketch below):
kubectl patch clickhouseinstallations {chi} \
  -n {namespace} \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/templating/policy", "value": "manual"}]'
@chanadian , yes, template CRD needs to be updated as well! Thanks for the catch! |
Hi, we are running into a similar issue with operator v0.18.4: during a scale-up, when we ran out of resources for new CH nodes, the operator got stuck in a bad state and lost some clusters from the settings, which caused queries to return errors with
e.g. the operator is managing 3 CH clusters in a namespace: |
@cw9 do you mean |
Nope, I meant 3 ClickHouse clusters in one k8s namespace operated by one operator |
@cw9 so, could you clarify, and share |
in that namespace I have the following clusters:
ok. could you share
@cw9 separate clusters shall be deployed on separate pods. Could you share
@Slach yes, separate clusters are all deployed on separate pods, and each node should have the same remote_servers.xml, so the nodes can see all the clusters. We usually use cluster3 as the query-federation nodes to combine data from cluster1 and cluster2, but during this incident, when cluster1 was failing to scale up, cluster2 was missing from remote_servers.xml even though its pods were up and running normally, and this caused cluster3 to only be able to query cluster1.
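One way to confirm that kind of drop-out is to compare system.clusters on a cluster3 node against the expected cluster list; a minimal sketch, assuming the cluster names cluster1 and cluster2 used in this thread:

```bash
# Sketch: on a cluster3 node, list the remote clusters ClickHouse knows about.
# If cluster2 is absent from the output, it is missing from remote_servers.xml.
clickhouse-client -q "
  SELECT cluster, shard_num, replica_num, host_name, port
  FROM system.clusters
  WHERE cluster IN ('cluster1', 'cluster2')
  ORDER BY cluster, shard_num, replica_num
  FORMAT PrettyCompact"
```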
This should have been fixed in an earlier release, but additional rules were added in https://github.com/Altinity/clickhouse-operator/releases/tag/release-0.22.1
Environment
Clickhouse Operator version: 0.17.0
Question
A ClickHouse Operator restart (OOM, update, etc.) will cause ClickHouse exceptions. The error is as follows: