Autodiscovery guide updates #1357

Merged Jul 7, 2017 (45 commits)

@kmshultz kmshultz commented Jun 16, 2017

TODO:

  • Verify and finish list of caveats under #how-it-works/#different-execution.
  • Clean up #service-identifiers section at the end
  • Add more detail to Kubernetes instructions, i.e. actual commands

@kmshultz
Contributor Author

I've introduced a few new terms here:

  • service identifier — the document needs a convenient way to refer to image name/label. I considered 'container identifier' but that may not be so backend-agnostic (one day when we have non-Docker backends)
  • template source — what files, KV stores, and K8s annotations are. It would seem we already had a term like this—Configuration Backend—except it only applies to KV stores. Calling them all configuration backends could create confusion that e.g. 'kubernetes' is a valid value for SD_CONFIG_BACKEND.

Voice any opposition to these terms, dear readers, or forever hold your peace.

cc @technovangelist if others concur on these terms, we should use them in the Autodiscovery video, too.

Contributor

@xvello xvello left a comment

Some additions, but a solid improvement on the existing page!

There are a few caveats:

- The Agent caches templates it gets from KV stores. Changes to KV store templates require a restart of the Agent.
- something about precedence of templates when multiple sources provided?
Contributor

KV-store templates take precedence over conf.d/auto_conf files indeed

Contributor Author

@xvello got it. And KV takes precedence over k8s annotations, which take precedence over files?

Contributor

k8s annotations take precedence over templates (KV then files)


There are a few caveats:

- The Agent caches templates it gets from KV stores. Changes to KV store templates require a restart of the Agent.
Contributor

Reading the source, it looks like we are watching for changes in the KV-store and reloading templates when one is detected. @hkaj is that working on all three supported backends?

@@ -48,155 +86,169 @@ By default, the Datadog Agent includes Autodiscovery support for:
- Redis
- Riak

These are provided by the configuration templates in the Datadog Agent `conf.d/auto_conf` directory.
Storing templates as local files is easy to understand and doesn't require an external service. The downside is that you must redeploy the Agent container each time you change, add, or remove templates. You may also have to maintain your own docker-dd-agent container if you want to add your own templates.
Contributor

Should we mention the /conf.d volume they can expose, instead of rebuilding the image? See https://github.com/DataDog/docker-dd-agent/blob/master/entrypoint.sh#L172

Contributor Author

Ok, got it. But I think no matter what, users must restart Agent containers to get new check configs, correct? i.e. there is no SIGHUP or similar way to tell Agent to reload the files?

Contributor

In the case of host deployment, restarting the collector (supervisorctl restart datadog-agent:collector) is enough. But copying conf files from /conf.d is handled by the container's entrypoint, so a container restart is indeed needed


Let's take the example of the port variable: a RabbitMQ container with the management module enabled has 6 exposed ports by default. The list of ports as seen by the agent is: `[4369, 5671, 5672, 15671, 15672, 25672]`. **Notice the order. The Agent always sorts values in ascending order.**
Autodiscovery supports Consul, etcd, and Zookeeper as template sources. To use a KV store, configure its parameters in `datadog.conf` or in environment variables passed to docker-dd-agent when starting the container.

### Configuring etcd or Consul in `datadog.conf`
If you are using Consul and the Consul cluster requires token authentication, set `consul_token`.
Contributor

FYI, 5.15 will support user/pass auth for etcd. See DataDog/dd-agent#3357


You can also add a network name suffix to the `%%host%%` variable—`%%host_bridge%%`, `%%host_swarm%%`, etc—for containers attached to multiple networks. When `%%host%%` does not have a suffix, Autodiscovery picks the container's bridge network IP address.
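
The suffix rule in the paragraph above can be sketched roughly as follows. This is only an illustration, not the Agent's code: `resolve_host` and the shape of the `networks` mapping are hypothetical, and the fallback to the bridge network simply mirrors the sentence above.

```python
import re

# Matches %%host%% or %%host_<network>%% (pattern is an assumption for this sketch).
HOST_VAR = re.compile(r"^%%host(?:_(?P<network>\w+))?%%$")


def resolve_host(variable, networks):
    """Return the container IP that a host template variable refers to.

    `networks` maps network name -> container IP on that network
    (hypothetical shape, chosen for illustration).
    """
    match = HOST_VAR.match(variable)
    if match is None:
        raise ValueError("not a host template variable: %r" % variable)
    # No suffix: use the bridge network IP, as the guide text describes.
    network = match.group("network") or "bridge"
    return networks[network]


networks = {"bridge": "172.17.0.2", "swarm": "10.0.0.17"}
print(resolve_host("%%host%%", networks))        # 172.17.0.2
print(resolve_host("%%host_swarm%%", networks))  # 10.0.0.17
```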

### Service Identifiers
Contributor

This section should be expanded with examples and pushed further up the page.
See the ticket I sent you about user confusion on the matching

Contributor Author

Will do, I hadn't addressed this section just yet. I did link to it from earlier in the page; you think it needs to be more prominent, though?

Contributor

Well, I had two support cases that could've been solved by using container labels instead of toying around with the image names, so I'd vote for pushing it up. But that's probably sampling bias

@technovangelist
Contributor

technovangelist commented Jun 21, 2017

re service identifiers. since services are a thing in k8s and a different thing in docker swarm, is it worth/possible to choose a different word, esp since this service is different from that service? in fact 'service' is a very overloaded term in this article


Datadog automatically keeps track of what is running where, thanks to its Autodiscovery feature. Autodiscovery allows you to define configuration templates that will be applied automatically to monitor your containers.
The Datadog Agent can automatically track which services are running where, thanks to its Autodiscovery feature. Autodiscovery lets you define configuration templates for Agent checks and specify which container types each check should apply to. The Agent enables, disables, and recompiles static check configurations from the templates as containers come and go. When your NGINX container moves from 10.0.0.6 to 10.0.0.17, Autodiscovery helps the Agent update its NGINX check configuration with the new IP address so it can keep collecting NGINX metrics without any action on your part.
Contributor

...which container types...??? are they different types or just different containers?

The agent....recompiles.... ?? Recompiles? That sounds super impressive but is that accurate? maybe I am assuming too much from that word.

Contributor Author

Hmm, yes, 'containers' is probably better, since individual containers of the same type may be targeted via labels.

Contributor Author

Agreed, recompiles is a little puffy. 're-generates'?


## How it works
<div class="alert alert-info">
Autodiscovery was previously called Service Discovery. It is still known as Service Discovery in the Agent's code and in configuration options.
Contributor

this sounds a bit awkward. It was previously called...and it still is...

Contributor Author

"Autodiscovery was previously called Service Discovery, and it's still called that in the Agent's code and in configuration options."


The Autodiscovery feature watches for Docker events like when a container is created, destroyed, started or stopped. When one of these happens, the Agent identifies which service is impacted, loads the configuration template for this image, and automatically sets up its checks.
In a traditional non-container environment, Datadog Agent configuration is, like the environment in which it runs, static. The Agent reads check configurations from disk when it starts, and as long as it's running, it continuously applies every configured check. The configuration files are static, and any network-related options configured within them serve to identify specific instances of a monitored service. When an Agent check cannot connect to such a service, it's probably not because the service was re-homed somewhere else; either the service is down, or the check configuration was unable to reach the service to begin with. In any case, the Agent continues to attempt connecting to the service until an administrator steps in to troubleshoot.
Contributor

Thats a lot, of commas, in the, first, sentence.

for the second sentence, the way I read that is that if the config changes, the changes are applied right away without restarting the agent, but then you say the files are static. a bit confusing

When an Agent check cannot connect to such a service.... you aren't talking about a service and havent defined one..

either the service is down, or the check configuration was unable to reach the service to begin with. - the configuration cannot reach it, or the check cannot??

until an administrator steps in to troubleshoot... just the act of troubleshooting stops the attempt to connect? wont it still try. If the admin successfully solves the problem rather than just troubleshooting, i hope it will still attempt to connect.

Contributor Author

> Thats a lot, of commas, in the, first, sentence.

In a traditional non-container environment, Datadog Agent configuration is—like the environment in which it runs—static.

Contributor Author

> for the second sentence, the way I read that is that if the config changes, the changes are applied right away without restarting the agent

Which words tripped you up, 'continuously applies'? If so, is 'continuously runs' better?

Contributor Author

> When an Agent check cannot connect to such a service.... you aren't talking about a service and havent defined one..

The previous sentence says "any network-related options configured within them serve to identify specific instances of a monitored service". Would a parenthetical example have made it clear? "...instances of a monitored service (e.g. a redis instance)".


Configuration templates can be defined by simple template files or as single key-value stores using etcd or Consul.
With Autodiscovery enabled, the Agent runs checks differently.
Contributor

is this sentence necessary?

Contributor Author

It's kind of blog posty, but it makes the transition to the next two sections less jarring. It matches the section names, i.e. 'Different '


To use Autodiscovery, you'll first need to run the Datadog Agent as a service.
First, Autodiscovery uses **templates** for check configuration, wherein two template variables—`%%host%%` and `%%port%%`—appear in place of any normally-hardcoded network option values. Because orchestration platforms like Docker Swarm deploy (and redeploy) containers on arbitrary hosts, static configuration files are not suitable for checks that collect data from network endpoints. For example: a template for the Agent's [Go Expvar check](https://github.com/DataDog/integrations-core/blob/master/go_expvar/conf.yaml.example) would contain the option `expvar_url: http://%%host%%:%%port%%`. For containers that have more than one IP or exposed port, Autodiscovery can pick the right one(s) using [template variable indexes](#template-variable-indexes).
Contributor

normally-hardcoded? is it better to just say hardcoded?

%%port%%`—appear .... missing a space?

Contributor Author

I intentionally left the space out. We haven't been consistent in the docs on whether we have a space on either side of an em-dash, but lately I've been leaving them out.
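
For readers of this thread, the template idea in the diff line above (an option such as `expvar_url: http://%%host%%:%%port%%`) could be sketched like this. The `render` helper, the IP address, and the port number are hypothetical and only show the substitution step, not how the Agent implements it.

```python
def render(value, host, port):
    """Replace %%host%% and %%port%% in a single template option value."""
    return value.replace("%%host%%", host).replace("%%port%%", str(port))


# Template option taken from the guide text; the container address is made up.
template = {"expvar_url": "http://%%host%%:%%port%%"}
instance = {key: render(val, "10.0.0.17", 8080) for key, val in template.items()}
print(instance)  # {'expvar_url': 'http://10.0.0.17:8080'}
```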


Let's take the example of the port variable: a RabbitMQ container with the management module enabled has 6 exposed ports by default. The list of ports as seen by the agent is: `[4369, 5671, 5672, 15671, 15672, 25672]`. **Notice the order. The Agent always sorts values in ascending order.**
Autodiscovery supports Consul, etcd, and Zookeeper as template sources. To use a KV store, configure its parameters in `datadog.conf` or in environment variables passed to docker-dd-agent when starting the container.
Contributor

Autodiscovery supports.... key value based template sources

Contributor Author

That's implied by the heading.
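
As a side note on the removed RabbitMQ line in this hunk, the "sorted in ascending order" rule it describes can be illustrated as below. `pick_port` is a hypothetical helper; it only shows why a given index always refers to the same port once the list is sorted, and says nothing about the exact variable syntax.

```python
def pick_port(exposed_ports, index):
    """Return the port at `index` after sorting numerically ascending."""
    return sorted(exposed_ports)[index]


# RabbitMQ ports from the guide text, deliberately shuffled here.
ports = [15672, 4369, 25672, 5672, 15671, 5671]
print(sorted(ports))        # [4369, 5671, 5672, 15671, 15672, 25672]
print(pick_port(ports, 0))  # 4369
print(pick_port(ports, 4))  # 15672
```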


To pass the settings listed above as environment variables when starting the Datadog Agent in Docker Swarm, you would run the command:
Each template is defined as a three-tuple: check name, `init_config`, and `instances`. The `docker_images` option from the previous section, which was used to provide service identifiers to Autodiscovery, is not required here; for KV store template sources, service identifiers appear as first-level keys under `check_config`. (Also note, the file-based template in the previous section didn't need a check name; the Agent infers it from the filename.)
Contributor

would the term be three tuple, or just tuple?

Contributor Author

3-tuple is most accurate.
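
To make the 3-tuple concrete, here is a rough sketch of how one template might be laid out as key-value entries. The `/datadog/check_configs` prefix, the `httpd` identifier, and the `kv_entries` helper are assumptions for illustration; the key names `check_names`, `init_configs`, and `instances` come from the guide text.

```python
import json


def kv_entries(identifier, check_names, init_configs, instances,
               prefix="/datadog/check_configs"):
    """Build the three KV entries for one service identifier.

    The prefix is an assumed default; each value is stored as a JSON list.
    """
    base = "%s/%s" % (prefix, identifier)
    return {
        base + "/check_names": json.dumps(check_names),
        base + "/init_configs": json.dumps(init_configs),
        base + "/instances": json.dumps(instances),
    }


entries = kv_entries(
    "httpd",
    ["apache"],
    [{}],
    [{"apache_status_url": "http://%%host%%/server-status?auto"}],
)
for key in sorted(entries):
    print(key, "=", entries[key])
```

Note that the identifier is part of the key path, which is why no `docker_images` option is needed here.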


### Template structure in key-value stores
Notice that each of the three values is a list. Autodiscovery assembles list items into check configurations based on shared list indexes. In this case, it composes the first (and only) check from `check_names[0]`, `init_configs[0]` and `instances[0]`.
Contributor

is it worth saying it's a YAML-formatted list (or is it a JSON-formatted list)?


Note that in the structure above, you may have multiple checks for a single container. For example, you may run a Java service that provides an HTTP API, using the HTTP check and the JMX integration at the same time. To declare that in templates, simply add elements to the `check_names`, `init_configs`, and `instances` lists. These elements will be matched together based on their index in their respective lists.
Again, the list orders matter. The HTTP check will only work if all its elements have the same index (1) across the lists (they do).
Contributor

list orders or list order?

Contributor

> The HTTP check will only work if all its elements have the same index (1) across the lists (they do).
I don't understand this
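
One way to read the sentence being questioned: the three lists are combined position by position, so the HTTP check is whatever sits at index 1 in each list. A small sketch, where the `assemble` helper and the option values are hypothetical and only the list names come from the guide:

```python
check_names = ["apache", "http_check"]
init_configs = [{}, {}]
instances = [
    {"apache_status_url": "http://%%host%%/server-status?auto"},
    {"name": "web", "url": "http://%%host%%", "timeout": 1},
]


def assemble(check_names, init_configs, instances):
    """Pair up list items that share the same index into check configs."""
    if not (len(check_names) == len(init_configs) == len(instances)):
        raise ValueError("the three lists must have the same length")
    return [
        {"check_name": name, "init_config": init, "instance": inst}
        for name, init, inst in zip(check_names, init_configs, instances)
    ]


checks = assemble(check_names, init_configs, instances)
print(checks[1]["check_name"])       # http_check
print(checks[1]["instance"]["url"])  # http://%%host%%
```

If the HTTP instance were moved to index 0 while `http_check` stayed at index 1, the Apache check would receive the HTTP options and vice versa.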

metadata:
name: apache
annotations:
service-discovery.datadoghq.com/apache.check_names: '["apache","http_check"]'
Contributor

awesome, didn't know you could add multiple checks, but makes sense.

Contributor

IIRC that's 5.14 only (must check the milestone). We could document that to avoid support cases with old versions

@technovangelist
Contributor

this feels like a long document. @kmshultz you talked about splitting it up into autodisco on k8s, autodisco on docker, etc. any further ideas on this? If i just care about docker swarm, i probably don't want to be confused about k8s.

@kmshultz
Contributor Author

Re: splitting this into smaller documents, I agree we should do that. But as the docs site is currently organized, there's no good way to do that, i.e. since the only semblance of organization we have now is /guides, /references, /integrations, and a few others. We could put AD usage instructions into the K8s integration page, the ECS page, etc (though there is no Docker Swarm integration page), but I'd rather have those pages link over to this guide using the relevant anchor, i.e. /guides/autodiscovery#kubernetes. Integrations pages should only 1) describe how to configure the integration, and 2) list their published metrics, events, and service checks.

When we redesign the docs site soon, we can organize it some other way, i.e. /agent/install#kubernetes, /agent/usage#kubernetes. But even then, I think some of the platform-agnostic info in this doc will remain here (and then, we'll no longer call it a "guide").

Thoughts @technovangelist @jyee @irabinovitch?

@kmshultz
Contributor Author

@technovangelist

> re service identifiers. since services are a thing in k8s and a different thing in docker swarm, is it worth/possible to choose a different word, esp since this service is different from that service? in fact 'service' is a very overloaded term in this article

Perhaps 'container identifier' is indeed best. Care to offer a suggestion?

@kmshultz
Contributor Author

kmshultz commented Jul 5, 2017

@xvello I've added some more content here. Mind giving it another look, please? 🙇

@kmshultz kmshultz changed the title from "[WIP] Autodiscovery guide updates" to "Autodiscovery guide updates" Jul 5, 2017
kind: guide
listorder: 10
---

Docker is being [adopted rapidly](https://www.datadoghq.com/docker-adoption/) and platforms like Docker Swarm, Kubernetes and Amazon's ECS make running services easier and more resilient by managing orchestration and replication across hosts. But all of that makes monitoring more difficult. How can you monitor a service which is dynamically shifting from one host to another?
Docker is being [adopted rapidly](https://www.datadoghq.com/docker-adoption/). Orchestration platforms like Docker Swarm, Kubernetes, and Amazon ECS make running Docker-ized services easier and more resilient by managing orchestration and replication across hosts. But all of that makes monitoring more difficult. How can you reliably monitor a service which is unpredictably shifting from one host to another?
Contributor

s/Docker/Containers/ ?

Contributor

is unpredictably the right word? We didn't like dynamically?

would "dynamically being shifted" make more sense?

Contributor

As docker is the only supported runtime for now, I'd go with writing docker explicitly, but we'll transition to a generic "containers" once rkt is supported

Contributor Author

unpredictably, dynamically... I think it's a matter of taste. To me, "unpredictably" better evokes the problem. Do you think it's inaccurate, or you just don't like the tone/sound of it?

Contributor Author

> As docker is the only supported runtime for now, I'd go with writing docker explicitly, but we'll transition to a generic "containers" once rkt is supported

Indeed, and if we avoided saying Docker here, we'd want to remove it from the page's title, too.

## How it works
<div class="alert alert-info">
Autodiscovery was previously called Service Discovery. It's still called Service Discovery in the Agent's code and in Autodiscovery configuration options.
</div>
Contributor

Do we need to mention this?

Contributor

@xvello xvello Jul 6, 2017

I think so, because current clients know and might search for the old term.
That could even be at the top of the page

Contributor Author

@xvello at first I did have it at the top, just below the page title. Do you think that's better?

Contributor

yes, as in "just so you know... now let's get it over with"


If you use Kubernetes, see the [Kubernetes integration page](http://docs.datadoghq.com/integrations/kubernetes/#installation) for instructions on running docker-dd-agent. If you use Amazon ECS, see [its integration page](http://docs.datadoghq.com/integrations/ecs/#installation).

If you use Docker Swarm, run the following command on one of your manager nodes:

docker service create \
Contributor

We don't have agent install doc for swarm yet, we could add it here

But the agent setup page needs a revamp as the k8s and docker ones are getting more and more complex (different ways to deploy all mushed in the same page with no hierarchy). I'm torn between revamping these pages or just create doc.dd.com pages and link to them.


The configuration templates in `conf.d/auto_conf` directory are nearly identical to the example YAML configuration files provided in [the Datadog `conf.d` directory](https://github.com/DataDog/dd-agent/tree/master/conf.d), but with one important field added. The `docker_images` field is required and identifies the container image(s) to which the configuration template should be applied.
1. Add them to each host that runs docker-dd-agent and [mount the directory that contains them](https://github.com/DataDog/docker-dd-agent#configuration-files) into the docker-dd-agent container when starting it
1. Package them into your own release of docker-dd-agent
Contributor

I'd go with:

Build a custom docker image based on docker-dd-agent with your custom templates added in /etc/dd-agent/conf.d/auto_conf


## Configuration templates with key-value stores
If this is too limiting—if you need to apply different check configurations to different containers running the same image—use [labels](#container-labels) to identify the containers. Label each container differently, then add each label to any template file's `docker_images` list (yes, `docker_images` is where to put _any_ kind of container identifier, not just images).
Contributor

👍
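
The label approach in the paragraph under review might be pictured like this. It is a sketch under assumptions: the label key `com.datadoghq.sd.check.id`, the rule that a label value stands in for the image name when present, and both helper functions are not confirmed anywhere in this thread.

```python
# Assumed label key; treat it as a placeholder if your Agent version differs.
LABEL_KEY = "com.datadoghq.sd.check.id"


def config_id(image_name, labels):
    """Return the identifier used to look up templates for a container.

    Assumption: a label value, when present, is used instead of the image name.
    """
    return labels.get(LABEL_KEY) or image_name


def template_applies(docker_images, image_name, labels):
    """True if the template's docker_images list names this container."""
    return config_id(image_name, labels) in docker_images


cache_template = ["redis-cache"]  # docker_images list holding a label value
print(template_applies(cache_template, "redis", {LABEL_KEY: "redis-cache"}))  # True
print(template_applies(cache_template, "redis", {}))                          # False
```

Two containers running the same `redis` image can then receive different check configurations, which is the use case from the support tickets mentioned earlier in the review.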


If you provide a template for the same check type via multiple template sources, the Agent will prefer, in increasing order of preference:

* Files
Contributor

IIRC files are last in the precedence order. Annotations are first for sure, so order is Annotations -> K/V -> files

I think it's better to list them in order of lookup instead of increasing order of preference (as that's the opposite, right?)

Contributor Author

Ok, maybe this is confusing because I put it in ascending order of precedence. I thought another engineer told me KV takes highest precedence but perhaps I misunderstood.
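
For reference, the lookup order stated in the review comment above (annotations, then K/V, then files) could be sketched as follows. `pick_template`, the source names, and the shape of `sources` are hypothetical; the sketch only shows "first source that has a template wins".

```python
# Lookup order as described in the review comment: first match wins.
LOOKUP_ORDER = ("annotations", "kv_store", "files")


def pick_template(key, sources):
    """Return (source_name, template) from the first source that defines `key`.

    `sources` maps source name -> {key: template}; returns None if no source
    has a template for `key`.
    """
    for source in LOOKUP_ORDER:
        template = sources.get(source, {}).get(key)
        if template is not None:
            return source, template
    return None


sources = {
    "files": {"redis": "file template"},
    "kv_store": {"redis": "kv template"},
    "annotations": {},
}
print(pick_template("redis", sources))  # ('kv_store', 'kv template')
```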

@kmshultz
Contributor Author

kmshultz commented Jul 6, 2017

> I'm torn between revamping these pages or just create doc.dd.com pages and link to them.

@xvello Re: the Agent install instructions for each platform, I agree we need a separate space for this. With the Docs redesign underway very soon, I'm planning to have a dedicated section for the Agent (i.e. docs.datadoghq.com/agent) where we comprehensively cover its architecture (different daemons), install instructions for different platforms (similar to dogweb), all configuration options and environment variables, etc.

I don't like having the install docs embedded in Integrations pages. And I'm not a big fan of having them in dogweb, either, though I realize it's a nice first-time-user flow to login and be presented with a one-liner to copy and paste.

@kmshultz kmshultz merged commit 16cd602 into master Jul 7, 2017
@kmshultz kmshultz deleted the kent/autodiscovery-guide-updates branch July 20, 2017 16:01