Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cassandra_nodetool check #511

Merged
merged 17 commits into from
Aug 7, 2017
Merged

Add cassandra_nodetool check #511

merged 17 commits into from
Aug 7, 2017

Conversation

zippolyte
Copy link
Contributor

@zippolyte zippolyte commented Jun 29, 2017

What does this PR do?

Add a python check for cassandra.

Motivation

It was not possible to get this information via the JMXCheck.

Versioning

  • Bumped the version check in manifest.json
  • Updated CHANGELOG.md

Additional Notes

This check uses the cassandra python driver. Docs here.
The driver tries to connect to a cassandra node, and if it succeed, it discovers automatically the other nodes of the cluster. It then tries to connect to those other nodes, and if it can't, it marks those nodes as down.
The check uses this flag to count the number of replicas that are down for a keyspace.

It now uses the nodetool utility that makes calls to JMX functions under the hood. Specifically, it uses the status command, documented here.

The check sends the cumulated percentage of data owned by each UP node of the cluster for a given keyspace, as well as the replication factor for this keyspace.

I removed the service check from The Last Pickle because it required the cassandra driver dependency that weighs 32MB unpackaged and 13MB packaged.
We'll have to discuss if it's worth keeping.

@zippolyte zippolyte force-pushed the hippo/cassandra_check branch from 426abd8 to ac0045b Compare June 29, 2017 23:14
@irabinovitch
Copy link
Contributor

@zippolyte Should this also include the service check provided by the TLP team?

@zippolyte
Copy link
Contributor Author

I don't know, maybe it should be removed since we already have a service check with the jmxcheck ? I left it on purpose since it doesn't check for exactly the same connection (connection to the jmx server vs the actual cassandra).

@irabinovitch
Copy link
Contributor

@zippolyte @masci next steps on this and the dashboards?

@@ -0,0 +1,29 @@
# Cassandra Check
Copy link
Contributor

@masci masci Jul 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should change the name of this check, I think having "Cassandra" and "Cassandra Check" would be confusing.
At this point why not having a "Cassandra Nodetool" check? This would be explicit in what it does, plus we could support more nodetool commands other than status in the future with the same check.
@irabinovitch any thoughts?

class CassandraCheck(AgentCheck):

datacenter_name_re = re.compile('^Datacenter: (.*)')
host_status_re = re.compile('^(?P<status>[UD])[NLJM].* (?P<owns>(\d+\.\d+%)|\?).*')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we're here, why not parsing all the info nodetool returns?
Specifically:

  • Status (gauge)
  • State (tag)
  • Address (tag)
  • Load (gauge)
  • Tokens (we can probably ignore, this param is not easy to change on a cluster)
  • Owns (gauge)
  • Host ID (tag)
  • Rack (tag, this would be the availability zone on EC2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really send the state as a tag ? Appart from the state normal, these are temporary states.
For the status should this be a gauge (1 for up, 0 for down) ? or a service check ? both ? In the check they wrote, TLP sent a gauge so they could use it to fill a host map and see down host at a glance, so I can see the use of sending a gauge here.


instances:
# Configuration options:
# nodetool: a command or path to nodetool (e.g. /usr/bin/nodetool or docker exec container nodetool)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this to init_config?

@@ -0,0 +1,16 @@
init_config:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix the format of the config file? Comments should go before actual options and config values can be part of the commented option, for example:

# command or path to nodetool (e.g. /usr/bin/nodetool or docker exec container nodetool)
# nodetool: /usr/bin/nodetool

# the list of keyspaces to monitor
# - keyspaces: []

@@ -0,0 +1,3 @@
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name
cassandra.replication_availability,gauge,,,,Percentage of data available per keyspace times replication factor,+1,cassandra_check,available data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could use the name of the nodetool command as a prefix for future proofing, like:
cassandra.status.replication_availability or even cassandra.nodetool.status.replication_availability

@masci
Copy link
Contributor

masci commented Jul 27, 2017

@irabinovitch I left some comments, can you have a look?

@masci masci self-assigned this Jul 27, 2017
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's fix the metadata catalog and we're good to go! 👍

@@ -0,0 +1,3 @@
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name
cassandra.nodetool.status.replication_availability,gauge,,,,Percentage of data available per keyspace times replication factor,+1,cassandra_nodetool,available data
cassandra.nodetool.status.replication_factor,gauge,,,,Replication factor per keyspace,0,cassandra_nodetool,replication factor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the new metrics to the catalog?

@masci masci added this to the 5.17 milestone Jul 28, 2017
@zippolyte zippolyte force-pushed the hippo/cassandra_check branch from 4a15ce6 to db4aa03 Compare July 28, 2017 18:55
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment about docs, then we can 🚢

-----------
- instance #0 [OK]
- Collected 39 metrics, 0 events & 7 service checks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you mention the 'cassandra.nodetool.node_up service check?

Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's ship it! 🚢

@zippolyte zippolyte merged commit 12abb12 into master Aug 7, 2017
@zippolyte zippolyte deleted the hippo/cassandra_check branch August 7, 2017 15:06
@ofek ofek changed the title [cassandra_check] Add cassandra_check Add cassandra_nodetool check Oct 13, 2018
gml3ff pushed a commit that referenced this pull request May 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants