Add cassandra_nodetool check #511
426abd8 to ac0045b
@zippolyte Should this also include the service check provided by the TLP team?
I don't know, maybe it should be removed since we already have a service check with the jmxcheck? I left it in on purpose since it doesn't check exactly the same connection (connection to the JMX server vs. the actual Cassandra node).
…ns-core into hippo/cassandra_check
@zippolyte @masci next steps on this and the dashboards?
cassandra_check/README.md
Outdated
@@ -0,0 +1,29 @@
# Cassandra Check |
We should change the name of this check; I think having both "Cassandra" and "Cassandra Check" would be confusing.
At this point, why not have a "Cassandra Nodetool" check? This would be explicit about what it does, plus we could support more nodetool commands other than status in the future with the same check.
@irabinovitch any thoughts?
cassandra_check/check.py
Outdated
class CassandraCheck(AgentCheck):

    datacenter_name_re = re.compile('^Datacenter: (.*)')
    host_status_re = re.compile('^(?P<status>[UD])[NLJM].* (?P<owns>(\d+\.\d+%)|\?).*')
Since we're here, why not parse all the info nodetool returns?
Specifically:
- Status (gauge)
- State (tag)
- Address (tag)
- Load (gauge)
- Tokens (we can probably ignore, this param is not easy to change on a cluster)
- Owns (gauge)
- Host ID (tag)
- Rack (tag, this would be the availability zone on EC2)
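As a rough illustration of the list above, one host line of nodetool status output could be split into those fields along these lines. This is only a sketch: the column order (status/state, address, load, tokens, owns, host ID, rack) is assumed from typical nodetool status output, and NODE_LINE_RE and parse_node_line are hypothetical names, not part of the check in this PR.

```python
import re

# Hypothetical pattern. Assumes the usual column order of `nodetool status`:
# <status><state>  address  load  tokens  owns  host-id  rack
NODE_LINE_RE = re.compile(
    r'^(?P<status>[UD])(?P<state>[NLJM])\s+'
    r'(?P<address>\S+)\s+'
    r'(?P<load>\?|\d+(?:\.\d+)?\s+\w+)\s+'   # e.g. "47.66 KB", or "?" when unknown
    r'(?P<tokens>\d+)\s+'
    r'(?P<owns>\?|\d+(?:\.\d+)?%)\s+'        # e.g. "33.3%", or "?" when unknown
    r'(?P<host_id>\S+)\s+'
    r'(?P<rack>\S+)\s*$'
)


def parse_node_line(line):
    """Return a dict of fields for one host line, or None if the line is not a host line."""
    match = NODE_LINE_RE.match(line)
    if match is None:
        return None
    owns = match.group('owns')
    return {
        'up': 1 if match.group('status') == 'U' else 0,   # candidate gauge
        'state': match.group('state'),                    # candidate tag
        'address': match.group('address'),                # candidate tag
        'load': match.group('load'),                      # raw text, unit left unparsed
        'owns': None if owns == '?' else float(owns.rstrip('%')),
        'host_id': match.group('host_id'),                # candidate tag
        'rack': match.group('rack'),                      # candidate tag
    }
```

Tokens are matched but deliberately not returned, in line with the suggestion to ignore them.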
Should we really send the state as a tag? Apart from the normal state, these are temporary states.
For the status, should this be a gauge (1 for up, 0 for down), a service check, or both? In the check they wrote, TLP sent a gauge so they could use it to fill a host map and see down hosts at a glance, so I can see the use of sending a gauge here.
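To make the "both" option concrete, here is a minimal sketch of what a single node's status could turn into. It avoids any Agent dependency by returning plain tuples instead of submitting anything; the OK/CRITICAL values follow the usual Datadog service check convention (0 and 2), and the gauge name and the function name are assumptions for illustration only.

```python
# Service check status codes, defined locally so the sketch has no Agent
# dependency (assumed convention: OK=0, CRITICAL=2).
OK, CRITICAL = 0, 2


def status_submissions(status, tags):
    """Describe what could be sent for one node: a 1/0 gauge and a service check.

    `status` is the first character of a nodetool status host line ('U' or 'D').
    The gauge name below is hypothetical.
    """
    is_up = status == 'U'
    return [
        ('gauge', 'cassandra.nodetool.status.status', 1 if is_up else 0, tags),
        ('service_check', 'cassandra.nodetool.node_up', OK if is_up else CRITICAL, tags),
    ]
```

The gauge is what would feed a host map, while the service check is what would drive an up/down monitor, so sending both is cheap if both views are wanted.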
cassandra_check/conf.yaml.example
Outdated
instances:
# Configuration options:
# nodetool: a command or path to nodetool (e.g. /usr/bin/nodetool or docker exec container nodetool)
Can we move this to init_config?
cassandra_check/conf.yaml.example
Outdated
@@ -0,0 +1,16 @@
init_config:
Can you fix the format of the config file? Comments should go before actual options and config values can be part of the commented option, for example:
# command or path to nodetool (e.g. /usr/bin/nodetool or docker exec container nodetool)
# nodetool: /usr/bin/nodetool
# the list of keyspaces to monitor
# - keyspaces: []
cassandra_check/metadata.csv
Outdated
@@ -0,0 +1,3 @@
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name
cassandra.replication_availability,gauge,,,,Percentage of data available per keyspace times replication factor,+1,cassandra_check,available data |
We could use the name of the nodetool command as a prefix for future-proofing, like:
cassandra.status.replication_availability
or even cassandra.nodetool.status.replication_availability
@irabinovitch I left some comments, can you have a look?
Let's fix the metadata catalog and we're good to go! 👍
cassandra_nodetool/metadata.csv
Outdated
@@ -0,0 +1,3 @@
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name
cassandra.nodetool.status.replication_availability,gauge,,,,Percentage of data available per keyspace times replication factor,+1,cassandra_nodetool,available data
cassandra.nodetool.status.replication_factor,gauge,,,,Replication factor per keyspace,0,cassandra_nodetool,replication factor |
Can you add the new metrics to the catalog?
4a15ce6 to db4aa03
One comment about docs, then we can 🚢
-----------
- instance #0 [OK]
- Collected 39 metrics, 0 events & 7 service checks
Can you mention the cassandra.nodetool.node_up service check?
Let's ship it! 🚢
Also remove process.conf for Agent 6
What does this PR do?
Add a python check for cassandra.
Motivation
It was not possible to get this information via the JMXCheck.
Versioning
manifest.json
CHANGELOG.md
Additional Notes
This check uses the cassandra python driver. Docs here. The driver tries to connect to a cassandra node, and if it succeeds, it automatically discovers the other nodes of the cluster. It then tries to connect to those other nodes, and if it can't, it marks those nodes as down.
The check uses this flag to count the number of replicas that are down for a keyspace.
It now uses the nodetool utility, which makes calls to JMX functions under the hood. Specifically, it uses the status command, documented here.
The check sends the cumulative percentage of data owned by each UP node of the cluster for a given keyspace, as well as the replication factor for this keyspace.
I removed the service check from The Last Pickle because it required the cassandra driver dependency, which weighs 32MB unpackaged and 13MB packaged.
We'll have to discuss if it's worth keeping.
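For reference, the availability figure described in these notes (the cumulative ownership of UP nodes for a keyspace) can be sketched as below. This is an illustration under assumptions, not the code in this PR: it works on already-captured nodetool status text rather than invoking nodetool, HOST_STATUS_RE is modelled on the host_status_re pattern quoted in the review, and up_ownership is a hypothetical helper name.

```python
import re

# Modelled on the host_status_re pattern from the review diff: the first
# character is the status (U/D), and the ownership is the last "NN.N%" or "?"
# column preceded by a space.
HOST_STATUS_RE = re.compile(r'^(?P<status>[UD])[NLJM].* (?P<owns>(\d+\.\d+%)|\?).*')


def up_ownership(status_output):
    """Sum the ownership percentages of all UP nodes in `nodetool status` text.

    Nodes that are down, or whose ownership is reported as "?", contribute nothing.
    """
    total = 0.0
    for line in status_output.splitlines():
        match = HOST_STATUS_RE.match(line)
        if match is None or match.group('status') != 'U':
            continue
        owns = match.group('owns')
        if owns != '?':
            total += float(owns[:-1])  # strip the trailing '%'
    return total
```

With a replication factor of 2 and every node up, the sum comes out near 200; each node that goes down subtracts its share, which is why the metric is described as a percentage of data available times the replication factor.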