Add cassandra_nodetool check #511

Merged on Aug 7, 2017 (17 commits)
3 changes: 3 additions & 0 deletions .travis.yml
@@ -43,6 +43,9 @@ env:
- TRAVIS_FLAVOR=cassandra FLAVOR_VERSION=2.0.17
- TRAVIS_FLAVOR=cassandra FLAVOR_VERSION=2.1.14
- TRAVIS_FLAVOR=cassandra FLAVOR_VERSION=2.2.10
- TRAVIS_FLAVOR=cassandra_nodetool FLAVOR_VERSION=2.0.17
- TRAVIS_FLAVOR=cassandra_nodetool FLAVOR_VERSION=2.1.14
- TRAVIS_FLAVOR=cassandra_nodetool FLAVOR_VERSION=2.2.10
- TRAVIS_FLAVOR=couch FLAVOR_VERSION=1.6.1
- TRAVIS_FLAVOR=consul FLAVOR_VERSION=v0.6.4
- TRAVIS_FLAVOR=consul FLAVOR_VERSION=0.7.2
8 changes: 8 additions & 0 deletions cassandra_nodetool/CHANGELOG.md
@@ -0,0 +1,8 @@
# CHANGELOG - Cassandra Nodetool Check

0.1.0 / Unreleased
==================

### Changes

* [FEATURE] adds cassandra_nodetool integration.
62 changes: 62 additions & 0 deletions cassandra_nodetool/README.md
@@ -0,0 +1,62 @@
# Agent Check: Cassandra Nodetool

# Overview

This check collects metrics for your Cassandra cluster that are not available through the [JMX integration](https://github.com/DataDog/integrations-core/tree/master/cassandra).
It uses the `nodetool` utility to collect them.

# Installation

The cassandra_nodetool check is packaged with the Agent, so simply [install the Agent](https://app.datadoghq.com/account/settings#agent) on your Cassandra nodes.
If you need the newest version of the check, install the `dd-check-cassandra_nodetool` package.

# Configuration

Create a file `cassandra_nodetool.yaml` in the Agent's `conf.d` directory:
```
init_config:
  # command or path to nodetool (e.g. /usr/bin/nodetool or docker exec container nodetool)
  # can be overridden per instance
  # nodetool: /usr/bin/nodetool

instances:

  # the list of keyspaces to monitor
  - keyspaces: []

    # host that nodetool will connect to.
    # host: localhost

    # the port JMX is listening on for connections.
    # port: 7199

    # a set of credentials to connect to the host. These are the credentials for the JMX server.
    # For the check to work, this user must have read/write access so that nodetool can execute the `status` command
    # username:
    # password:

    # a list of additional tags to be sent with the metrics
    # tags: []
```
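
As a rough illustration of how these settings end up on the command line, the sketch below mirrors the command construction in `check.py`: one `nodetool ... status -- <keyspace>` invocation per configured keyspace. The helper name `build_nodetool_commands` is hypothetical, and converting the port with `str()` is an assumption made here so that an integer `port: 7199` from YAML is accepted.

```python
import shlex


def build_nodetool_commands(instance, default_nodetool="/usr/bin/nodetool"):
    """Hypothetical helper: return one nodetool argument list per keyspace."""
    # `nodetool` may be a full command such as `docker exec container nodetool`
    base = shlex.split(instance.get("nodetool", default_nodetool))
    host = instance.get("host", "localhost")
    # Assumption: normalize the port to a string, since YAML may parse it as an int
    port = str(instance.get("port", 7199))
    base += ["-h", host, "-p", port]

    username = instance.get("username", "")
    password = instance.get("password", "")
    # Credentials are only passed when both are set
    if username and password:
        base += ["-u", username, "-pw", password]

    return [base + ["status", "--", keyspace]
            for keyspace in instance.get("keyspaces", [])]


commands = build_nodetool_commands({
    "nodetool": "docker exec container nodetool",
    "keyspaces": ["test"],
})
print(commands[0])
```

With an empty `keyspaces` list no command is built at all, which matches the check: it only runs `nodetool status` inside the loop over keyspaces.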

# Validation

When you run `datadog-agent info` you should see something like the following:

    Checks
    ======

      cassandra_nodetool
      -----------
        - instance #0 [OK]
        - Collected 39 metrics, 0 events & 7 service checks

Review comment (Contributor): Can you mention the `cassandra.nodetool.node_up` service check?

# Compatibility

The `cassandra_nodetool` check is compatible with all major platforms.

# Service Checks

**cassandra.nodetool.node_up**:

The Agent sends this service check for each node of the monitored cluster. It returns CRITICAL if the node is down, otherwise OK.
136 changes: 136 additions & 0 deletions cassandra_nodetool/check.py
@@ -0,0 +1,136 @@
# (C) Datadog, Inc. 2010-2016
# All rights reserved
# Licensed under Simplified BSD License (see LICENSE)

# stdlib
from collections import defaultdict
import re
import shlex

# project
from checks import AgentCheck
from utils.subprocess_output import get_subprocess_output

EVENT_TYPE = SOURCE_TYPE_NAME = 'cassandra_nodetool'
DEFAULT_HOST = 'localhost'
DEFAULT_PORT = '7199'
TO_BYTES = {
    'B': 1,
    'KB': 1e3,
    'MB': 1e6,
    'GB': 1e9,
    'TB': 1e12,
}

class CassandraNodetoolCheck(AgentCheck):

    datacenter_name_re = re.compile(r'^Datacenter: (.*)')
    node_status_re = re.compile(r'^(?P<status>[UD])[NLJM] +(?P<address>\d+\.\d+\.\d+\.\d+) +'
                                r'(?P<load>\d+\.\d*) (?P<load_unit>(K|M|G|T)?B) +\d+ +'
                                r'(?P<owns>(\d+\.\d+)|\?)%? +(?P<id>[a-fA-F0-9-]*) +(?P<rack>.*)')

    def __init__(self, name, init_config, agentConfig, instances=None):
        AgentCheck.__init__(self, name, init_config, agentConfig, instances)
        self.nodetool_cmd = init_config.get("nodetool", "/usr/bin/nodetool")

    def check(self, instance):
        # Allow to specify a complete command for nodetool such as `docker exec container nodetool`
        nodetool_cmd = shlex.split(instance.get("nodetool", self.nodetool_cmd))
        host = instance.get("host", DEFAULT_HOST)
        # YAML may parse the port as an integer; the command list needs strings
        port = str(instance.get("port", DEFAULT_PORT))
        keyspaces = instance.get("keyspaces", [])
        username = instance.get("username", "")
        password = instance.get("password", "")
        tags = instance.get("tags", [])

        # Flag to send service checks only once and not for every keyspace
        send_service_checks = True

        for keyspace in keyspaces:
            # Build the nodetool command
            cmd = nodetool_cmd + ['-h', host, '-p', port]
            if username and password:
                cmd += ['-u', username, '-pw', password]
            cmd += ['status', '--', keyspace]

            # Execute the command
            out, err, _ = get_subprocess_output(cmd, self.log, False)
            if err or 'Error:' in out:
                self.log.error('Error executing nodetool status: %s', err or out)
                continue
            nodes = self._process_nodetool_output(out)

            percent_up_by_dc = defaultdict(float)
            percent_total_by_dc = defaultdict(float)
            # Send the stats per node and compute the stats per datacenter
            for node in nodes:

                node_tags = ['node_address:%s' % node['address'],
                             'node_id:%s' % node['id'],
                             'datacenter:%s' % node['datacenter'],
                             'rack:%s' % node['rack']]

                # nodetool prints `?` when it can't compute the value of `owns` for certain keyspaces (e.g. system)
                # don't send metric in this case
                if node['owns'] != '?':
                    owns = float(node['owns'])
                    if node['status'] == 'U':
                        percent_up_by_dc[node['datacenter']] += owns
                    percent_total_by_dc[node['datacenter']] += owns
                    self.gauge('cassandra.nodetool.status.owns', owns,
                               tags=tags + node_tags + ['keyspace:%s' % keyspace])

                # Send service check only once for each node
                if send_service_checks:
                    status = AgentCheck.OK if node['status'] == 'U' else AgentCheck.CRITICAL
                    self.service_check('cassandra.nodetool.node_up', status, tags + node_tags)

                self.gauge('cassandra.nodetool.status.status', 1 if node['status'] == 'U' else 0,
                           tags=tags + node_tags)
                self.gauge('cassandra.nodetool.status.load', float(node['load']) * TO_BYTES[node['load_unit']],
                           tags=tags + node_tags)

            # All service checks have been sent, don't resend
            send_service_checks = False

            # Send the stats per datacenter
            for datacenter, percent_up in percent_up_by_dc.items():
                self.gauge('cassandra.nodetool.status.replication_availability', percent_up,
                           tags=tags + ['keyspace:%s' % keyspace, 'datacenter:%s' % datacenter])
            for datacenter, percent_total in percent_total_by_dc.items():
                self.gauge('cassandra.nodetool.status.replication_factor', int(round(percent_total / 100)),
                           tags=tags + ['keyspace:%s' % keyspace, 'datacenter:%s' % datacenter])

    def _process_nodetool_output(self, output):
        nodes = []
        datacenter_name = ""
        for line in output.splitlines():
            # Output of nodetool
            # Datacenter: dc1
            # ===============
            # Status=Up/Down
            # |/ State=Normal/Leaving/Joining/Moving
            # --  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
            # UN  172.21.0.3  184.8 KB   256     38.4%             7501ef03-eb63-4db0-95e6-20bfeb7cdd87  RAC1
            # UN  172.21.0.4  223.34 KB  256     39.5%             e521a2a4-39d3-4311-a195-667bf56450f4  RAC1

            match = self.datacenter_name_re.search(line)
            if match:
                datacenter_name = match.group(1)
                continue

            match = self.node_status_re.search(line)
            if match:
                node = {
                    'status': match.group('status'),
                    'address': match.group('address'),
                    'load': match.group('load'),
                    'load_unit': match.group('load_unit'),
                    'owns': match.group('owns'),
                    'id': match.group('id'),
                    'rack': match.group('rack'),
                    'datacenter': datacenter_name
                }
                nodes.append(node)

        return nodes
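
The parsing step can be tried outside the Agent. The following is a minimal standalone sketch, assuming the same two regular expressions as the check and a shortened copy of the fixture output; `parse_status` and `SAMPLE` are illustrative names, not part of this PR.

```python
import re

DATACENTER_RE = re.compile(r'^Datacenter: (.*)')
NODE_RE = re.compile(
    r'^(?P<status>[UD])[NLJM] +(?P<address>\d+\.\d+\.\d+\.\d+) +'
    r'(?P<load>\d+\.\d*) (?P<load_unit>(K|M|G|T)?B) +\d+ +'
    r'(?P<owns>(\d+\.\d+)|\?)%? +(?P<id>[a-fA-F0-9-]*) +(?P<rack>.*)'
)

# Shortened sample of `nodetool status` output (taken from the fixture in this PR)
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
DN  172.21.0.6  178.43 KB  256     35.4%             f86d2d7a-e5c7-4c46-b36e-df08c565171a  rack1
UN  172.21.0.3  184.8 KB   256     31.0%             7501ef03-eb63-4db0-95e6-20bfeb7cdd87  RAC1
"""


def parse_status(output):
    """Return one dict per node line, tagged with the enclosing datacenter."""
    nodes = []
    datacenter = ""
    for line in output.splitlines():
        dc_match = DATACENTER_RE.search(line)
        if dc_match:
            # Every node line that follows belongs to this datacenter
            datacenter = dc_match.group(1)
            continue
        node_match = NODE_RE.search(line)
        if node_match:
            node = node_match.groupdict()
            node['datacenter'] = datacenter
            nodes.append(node)
    return nodes


nodes = parse_status(SAMPLE)
for node in nodes:
    print(node['datacenter'], node['status'], node['address'], node['owns'])
```

Header, separator, and legend lines match neither expression and are skipped, so only the `DN` and `UN` rows produce entries.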
94 changes: 94 additions & 0 deletions cassandra_nodetool/ci/cassandra_nodetool.rake
@@ -0,0 +1,94 @@
require 'ci/common'

def cassandra_nodetool_version
ENV['FLAVOR_VERSION'] || '2.1.14' # '2.0.17'
end

container_name = 'dd-test-cassandra'
container_name2 = 'dd-test-cassandra2'

container_port = 7199
cassandra_jmx_options = "-Dcom.sun.management.jmxremote.port=#{container_port}
-Dcom.sun.management.jmxremote.rmi.port=#{container_port}
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=true
-Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password
-Djava.rmi.server.hostname=localhost"

namespace :ci do
  namespace :cassandra_nodetool do |flavor|
    task before_install: ['ci:common:before_install'] do
      sh %(docker kill #{container_name} 2>/dev/null || true)
      sh %(docker rm #{container_name} 2>/dev/null || true)
      sh %(docker kill #{container_name2} 2>/dev/null || true)
      sh %(docker rm #{container_name2} 2>/dev/null || true)
      sh %(rm -f #{__dir__}/jmxremote.password.tmp)
    end

    task :install do
      Rake::Task['ci:common:install'].invoke('cassandra_nodetool')
      sh %(docker create --expose #{container_port} \
           -p #{container_port}:#{container_port} -e JMX_PORT=#{container_port} \
           -e LOCAL_JMX=no -e JVM_EXTRA_OPTS="#{cassandra_jmx_options}" --name #{container_name} cassandra:#{cassandra_nodetool_version})
      sh %(cp #{__dir__}/jmxremote.password #{__dir__}/jmxremote.password.tmp)
      sh %(chmod 400 #{__dir__}/jmxremote.password.tmp)
      sh %(docker cp #{__dir__}/jmxremote.password.tmp #{container_name}:/etc/cassandra/jmxremote.password)
      sh %(rm -f #{__dir__}/jmxremote.password.tmp)
      sh %(docker start #{container_name})

      sh %(docker create --name #{container_name2} \
           -e CASSANDRA_SEEDS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' #{container_name})" \
           cassandra:#{cassandra_nodetool_version})
      sh %(docker start #{container_name2})
    end

    task before_script: ['ci:common:before_script'] do
      # Wait.for container_port
      wait_on_docker_logs(container_name, 20, 'Listening for thrift clients', "Created default superuser role 'cassandra'")
      wait_on_docker_logs(container_name2, 40, 'Listening for thrift clients', 'Not starting RPC server as requested')
      sh %(docker exec #{container_name} cqlsh -e "CREATE KEYSPACE test WITH REPLICATION={'class':'SimpleStrategy', 'replication_factor':2}")
    end

    task script: ['ci:common:script'] do
      this_provides = [
        'cassandra_nodetool'
      ]
      Rake::Task['ci:common:run_tests'].invoke(this_provides)
    end

    task before_cache: ['ci:common:before_cache']

    task cleanup: ['ci:common:cleanup'] do
      sh %(docker kill #{container_name} 2>/dev/null || true)
      sh %(docker rm #{container_name} 2>/dev/null || true)
      sh %(docker kill #{container_name2} 2>/dev/null || true)
      sh %(docker rm #{container_name2} 2>/dev/null || true)
      sh %(rm -f #{__dir__}/jmxremote.password.tmp)
    end

    task :execute do
      exception = nil
      begin
        %w(before_install install before_script).each do |u|
          Rake::Task["#{flavor.scope.path}:#{u}"].invoke
        end
        if !ENV['SKIP_TEST']
          Rake::Task["#{flavor.scope.path}:script"].invoke
        else
          puts 'Skipping tests'.yellow
        end
        Rake::Task["#{flavor.scope.path}:before_cache"].invoke
      rescue => e
        exception = e
        puts "Failed task: #{e.class} #{e.message}".red
      end
      if ENV['SKIP_CLEANUP']
        puts 'Skipping cleanup, disposable environments are great'.yellow
      else
        puts 'Cleaning up'
        Rake::Task["#{flavor.scope.path}:cleanup"].invoke
      end
      raise exception if exception
    end
  end
end
15 changes: 15 additions & 0 deletions cassandra_nodetool/ci/fixtures/nodetool_output
@@ -0,0 +1,15 @@
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
DN  172.21.0.6  178.43 KB  256     35.4%             f86d2d7a-e5c7-4c46-b36e-df08c565171a  rack1
UN  172.21.0.3  184.8 KB   256     31.0%             7501ef03-eb63-4db0-95e6-20bfeb7cdd87  RAC1
UN  172.21.0.2  182.05 KB  256     33.5%             fa859fcc-5e76-44ce-9609-1f314bdf21c1  RAC1
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
UN  172.21.0.5  216.75 KB  256     100.0%            2250363b-7453-48f2-b6cb-ef79cad0612b  RAC1
UN  172.21.0.4  223.34 KB  256     100.0%            e521a2a4-39d3-4311-a195-667bf56450f4  RAC1
1 change: 1 addition & 0 deletions cassandra_nodetool/ci/jmxremote.password
@@ -0,0 +1 @@
controlRole QED
23 changes: 23 additions & 0 deletions cassandra_nodetool/conf.yaml.example
@@ -0,0 +1,23 @@
init_config:
  # command or path to nodetool (e.g. /usr/bin/nodetool or docker exec container nodetool)
  # can be overridden per instance
  # nodetool: /usr/bin/nodetool

instances:

  # the list of keyspaces to monitor
  - keyspaces: []

    # host that nodetool will connect to.
    # host: localhost

    # the port JMX is listening on for connections.
    # port: 7199

    # a set of credentials to connect to the host. These are the credentials for the JMX server.
    # For the check to work, this user must have read/write access so that nodetool can execute the `status` command
    # username:
    # password:

    # a list of additional tags to be sent with the metrics
    # tags: []
12 changes: 12 additions & 0 deletions cassandra_nodetool/manifest.json
@@ -0,0 +1,12 @@
{
  "maintainer": "help@datadoghq.com",
  "manifest_version": "0.1.0",
  "max_agent_version": "6.0.0",
  "min_agent_version": "5.6.3",
  "name": "cassandra_nodetool",
  "short_description": "monitor cassandra using the nodetool utility",
  "guid": "00e4a8bd-8ec2-4bb4-b725-6aaa91618d13",
  "support": "contrib",
  "supported_os": ["linux","mac_os","windows"],
  "version": "0.1.0"
}
6 changes: 6 additions & 0 deletions cassandra_nodetool/metadata.csv
@@ -0,0 +1,6 @@
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name
cassandra.nodetool.status.replication_availability,gauge,,percent,,Percentage of data available per keyspace times replication factor,1,cassandra_nodetool,available data
cassandra.nodetool.status.replication_factor,gauge,,,,Replication factor per keyspace,0,cassandra_nodetool,replication factor
cassandra.nodetool.status.status,gauge,,,,Node status: up (1) or down (0),1,cassandra_nodetool,node status
cassandra.nodetool.status.owns,gauge,,percent,,Percentage of the data owned by the node per datacenter times the replication factor,0,cassandra_nodetool,owns
cassandra.nodetool.status.load,gauge,,byte,,Amount of file system data under the cassandra data directory without snapshot content,0,cassandra_nodetool,load
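
To make the two per-datacenter metrics concrete, here is a small worked sketch using the `Owns` values from the `nodetool_output` fixture. It follows the aggregation in `check.py` (sum of `owns` over up nodes, and the total `owns` divided by 100 and rounded); the `summarize` helper and the simplified node dicts are illustrative assumptions.

```python
from collections import defaultdict

# Status and Owns values copied from the nodetool_output fixture
nodes = [
    {"datacenter": "dc1", "status": "D", "owns": 35.4},
    {"datacenter": "dc1", "status": "U", "owns": 31.0},
    {"datacenter": "dc1", "status": "U", "owns": 33.5},
    {"datacenter": "dc2", "status": "U", "owns": 100.0},
    {"datacenter": "dc2", "status": "U", "owns": 100.0},
]


def summarize(nodes):
    """Hypothetical helper: aggregate per-node `owns` into per-datacenter values."""
    percent_up = defaultdict(float)
    percent_total = defaultdict(float)
    for node in nodes:
        if node["status"] == "U":
            percent_up[node["datacenter"]] += node["owns"]
        percent_total[node["datacenter"]] += node["owns"]
    # Note: unlike the check, this reports 0.0 availability for a datacenter
    # with no up nodes instead of omitting the metric.
    return {
        dc: {
            "replication_availability": percent_up[dc],
            "replication_factor": int(round(percent_total[dc] / 100)),
        }
        for dc in percent_total
    }


summary = summarize(nodes)
for dc in sorted(summary):
    print(dc, summary[dc])
```

In `dc1` the down node's 35.4% is excluded from the availability sum, leaving 64.5, while the total of roughly 99.9% rounds to a replication factor of 1. In `dc2` both nodes own 100%, giving an availability of 200 and a replication factor of 2.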
1 change: 1 addition & 0 deletions cassandra_nodetool/requirements.txt
@@ -0,0 +1 @@
# integration pip requirements