Autodiscovery guide updates #1357
Conversation
I've introduced a few new terms here:
Voice any opposition to these terms, dear readers, or forever hold your peace. cc @technovangelist: if others concur on these terms, we should use them in the Autodiscovery video, too.
Some additions, but a solid improvement on the existing page!
content/guides/autodiscovery.md
Outdated
There are a few caveats:

- The Agent caches templates it gets from KV stores. Changes to KV store templates require a restart of the Agent.
- something about precedence of templates when multiple sources provided?
KV-store templates take precedence over conf.d/auto_conf files indeed
@xvello got it. And KV takes precedence over k8s annotations, which take precedence over files?
k8s annotations take precedence over templates (KV then files)
content/guides/autodiscovery.md
Outdated
There are a few caveats:

- The Agent caches templates it gets from KV stores. Changes to KV store templates require a restart of the Agent.
Reading the source, it looks like we are watching for changes in the KV-store and reloading templates when one is detected. @hkaj is that working on all three supported backends?
content/guides/autodiscovery.md
Outdated
@@ -48,155 +86,169 @@ By default, the Datadog Agent includes Autodiscovery support for:
- Redis
- Riak

These are provided by the configuration templates in the Datadog Agent `conf.d/auto_conf` directory.
Storing templates as local files is easy to understand and doesn't require an external service. The downside is that you must redeploy the Agent container each time you change, add, or remove templates. You may also have to maintain your own docker-dd-agent container if you want to add your own templates.
Should we mention the /conf.d volume they can expose, instead of rebuilding the image? See https://github.com/DataDog/docker-dd-agent/blob/master/entrypoint.sh#L172
Ok, got it. But I think no matter what, users must restart Agent containers to get new check configs, correct? i.e. there is no SIGHUP or similar way to tell Agent to reload the files?
In the case of host deployment, restarting the collector (`supervisorctl restart datadog-agent:collector`) is enough. But copying conf files from /conf.d is handled by the container's entrypoint, so a container restart is indeed needed.
content/guides/autodiscovery.md
Outdated
Let's take the example of the port variable: a RabbitMQ container with the management module enabled has 6 exposed ports by default. The list of ports as seen by the agent is: `[4369, 5671, 5672, 15671, 15672, 25672]`. **Notice the order. The Agent always sorts values in ascending order.**
Autodiscovery supports Consul, etcd, and Zookeeper as template sources. To use a KV store, configure its parameters in `datadog.conf` or in environment variables passed to docker-dd-agent when starting the container.
We could link "environment variables" to https://github.com/DataDog/docker-dd-agent/#environment-variables
content/guides/autodiscovery.md
Outdated
### Configuring etcd or Consul in `datadog.conf`
If you are using Consul and the Consul cluster requires token authentication, set `consul_token`.
FYI, 5.15 will support user/pass auth for etcd. See DataDog/dd-agent#3357
content/guides/autodiscovery.md
Outdated
You can also add a network name suffix to the `%%host%%` variable—`%%host_bridge%%`, `%%host_swarm%%`, etc—for containers attached to multiple networks. When `%%host%%` does not have a suffix, Autodiscovery picks the container's bridge network IP address.

### Service Identifiers
This section should be expanded with examples and pushed further up the page.
See the ticket I sent you about user confusion on the matching
Will do, I hadn't addressed this section just yet. I did link to it from earlier in the page; you think it needs to be more prominent, though?
Well, I had two support cases that could've been solved by using container labels instead of toying around with the image names, so I'd vote for pushing it up. But that's probably sampling bias.
Re: service identifiers. Since services are a thing in k8s and a different thing in Docker Swarm, is it worth/possible to choose a different word, esp. since this service is different from that service? In fact 'service' is a very overloaded term in this article.
content/guides/autodiscovery.md
Outdated
Datadog automatically keeps track of what is running where, thanks to its Autodiscovery feature. Autodiscovery allows you to define configuration templates that will be applied automatically to monitor your containers.
The Datadog Agent can automatically track which services are running where, thanks to its Autodiscovery feature. Autodiscovery lets you define configuration templates for Agent checks and specify which container types each check should apply to. The Agent enables, disables, and recompiles static check configurations from the templates as containers come and go. When your NGINX container moves from 10.0.0.6 to 10.0.0.17, Autodiscovery helps the Agent update its NGINX check configuration with the new IP address so it can keep collecting NGINX metrics without any action on your part.
...which container types...??? are they different types or just different containers?
The agent....recompiles.... ?? Recompiles? That sounds super impressive but is that accurate? maybe I am assuming too much from that word.
Hmm, yes, 'containers' is probably better, since individual containers of the same type may be targeted via labels.
Agreed, recompiles is a little puffy. 're-generates'?
content/guides/autodiscovery.md
Outdated
## How it works
<div class="alert alert-info">
Autodiscovery was previously called Service Discovery. It is still known as Service Discovery in the Agent's code and in configuration options.
this sounds a bit awkward. It was previously called...and it still is...
"Autodiscovery was previously called Service Discovery, and it's still called that in the Agent's code and in configuration options."
content/guides/autodiscovery.md
Outdated
The Autodiscovery feature watches for Docker events like when a container is created, destroyed, started or stopped. When one of these happens, the Agent identifies which service is impacted, loads the configuration template for this image, and automatically sets up its checks.
In a traditional non-container environment, Datadog Agent configuration is, like the environment in which it runs, static. The Agent reads check configurations from disk when it starts, and as long as it's running, it continuously applies every configured check. The configuration files are static, and any network-related options configured within them serve to identify specific instances of a monitored service. When an Agent check cannot connect to such a service, it's probably not because the service was re-homed somewhere else; either the service is down, or the check configuration was unable to reach the service to begin with. In any case, the Agent continues to attempt connecting to the service until an administrator steps in to troubleshoot.
Thats a lot, of commas, in the, first, sentence.
For the second sentence, the way I read that is that if the config changes, the changes are applied right away without restarting the agent, but then you say the files are static. A bit confusing.
When an Agent check cannot connect to such a service.... you aren't talking about a service and haven't defined one.
either the service is down, or the check configuration was unable to reach the service to begin with. - the configuration cannot reach it, or the check cannot?
until an administrator steps in to troubleshoot... just the act of troubleshooting stops the attempt to connect? Won't it still try? If the admin successfully solves the problem rather than just troubleshooting, I hope it will still attempt to connect.
Thats a lot, of commas, in the, first, sentence.
In a traditional non-container environment, Datadog Agent configuration is—like the environment in which it runs—static.
for the second sentence, the way I read that is that if the config changes, the changes are applied right away without restarting the agent
Which words tripped you up, 'continuously applies'? If so, is 'continuously runs' better?
When an Agent check cannot connect to such a service.... you aren't talking about a service and haven't defined one.
The previous sentence says "any network-related options configured within them serve to identify specific instances of a monitored service". Would a parenthetical example have made it clear? "...instances of a monitored service (e.g. a redis instance)".
content/guides/autodiscovery.md
Outdated
Configuration templates can be defined by simple template files or as single key-value stores using etcd or Consul.
With Autodiscovery enabled, the Agent runs checks differently.
is this sentence necessary?
It's kind of blog posty, but it makes the transition to the next two sections less jarring. It matches the section names, i.e. 'Different '
content/guides/autodiscovery.md
Outdated
To use Autodiscovery, you'll first need to run the Datadog Agent as a service.
First, Autodiscovery uses **templates** for check configuration, wherein two template variables—`%%host%%` and `%%port%%`—appear in place of any normally-hardcoded network option values. Because orchestration platforms like Docker Swarm deploy (and redeploy) containers on arbitrary hosts, static configuration files are not suitable for checks that collect data from network endpoints. For example: a template for the Agent's [Go Expvar check](https://github.com/DataDog/integrations-core/blob/master/go_expvar/conf.yaml.example) would contain the option `expvar_url: http://%%host%%:%%port%%`. For containers that have more than one IP or exposed port, Autodiscovery can pick the right one(s) using [template variable indexes](#template-variable-indexes).
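For concreteness, here's a rough sketch of how such a templated check configuration could look as an `auto_conf` file; the image name and file layout below are assumptions for illustration, not taken from this diff:

```yaml
# go_expvar.yaml (hypothetical auto_conf template)
# %%host%% and %%port%% are resolved per container by Autodiscovery.
docker_images:
  - my-go-service        # illustrative image identifier (assumption)

init_config:

instances:
  - expvar_url: http://%%host%%:%%port%%
    # For containers attached to several networks, a suffixed variable
    # such as %%host_bridge%% could be used instead of plain %%host%%.
```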
normally-hardcoded? is it better to just say hardcoded?
%%port%%`—appear .... missing a space?
I intentionally left the space out. We haven't been consistent in the docs on whether we have a space on either side of an em-dash, but lately I've been leaving them out.
content/guides/autodiscovery.md
Outdated
Let's take the example of the port variable: a RabbitMQ container with the management module enabled has 6 exposed ports by default. The list of ports as seen by the agent is: `[4369, 5671, 5672, 15671, 15672, 25672]`. **Notice the order. The Agent always sorts values in ascending order.**
Autodiscovery supports Consul, etcd, and Zookeeper as template sources. To use a KV store, configure its parameters in `datadog.conf` or in environment variables passed to docker-dd-agent when starting the container.
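To make the ordering point concrete, here's a sketch of how an indexed template variable might pick one of those RabbitMQ ports, assuming a zero-based `%%port_N%%` syntax (see the template-variable-indexes section of the guide for the exact form; the credentials are placeholders):

```yaml
# Ports as sorted by the Agent: [4369, 5671, 5672, 15671, 15672, 25672]
instances:
  - rabbitmq_api_url: http://%%host%%:%%port_4%%/api/   # index 4 -> 15672, the management port
    rabbitmq_user: guest   # placeholder credentials (assumption)
    rabbitmq_pass: guest
```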
Autodiscovery supports.... key value based template sources
That's implied by the heading.
content/guides/autodiscovery.md
Outdated
To pass the settings listed above as environment variables when starting the Datadog Agent in Docker Swarm, you would run the command:
Each template is defined as a three-tuple: check name, `init_config`, and `instances`. The `docker_images` option from the previous section, which was used to provide service identifiers to Autodiscovery, is not required here; for KV store template sources, service identifiers appear as first-level keys under `check_config`. (Also note, the file-based template in the previous section didn't need a check name; the Agent infers it from the filename.)
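As a rough illustration of that key layout (the `datadog/check_configs` root shown here is an assumption; the actual prefix comes from the KV-store settings in `datadog.conf`), the keys for an nginx container might look like:

```yaml
# Hypothetical etcd/Consul key hierarchy, rendered as YAML for readability.
# "nginx" is the service identifier, i.e. the first-level key under check_configs.
datadog:
  check_configs:
    nginx:
      check_names:  '["nginx"]'
      init_configs: '[{}]'
      instances:    '[{"nginx_status_url": "http://%%host%%:%%port%%/nginx_status"}]'
```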
would the term be three tuple, or just tuple?
3-tuple is most accurate.
content/guides/autodiscovery.md
Outdated
### Template structure in key-value stores
Notice that each of the three values is a list. Autodiscovery assembles list items into check configurations based on shared list indexes. In this case, it composes the first (and only) check from `check_names[0]`, `init_configs[0]` and `instances[0]`.
is it worth saying it's a YAML-formatted list (or is it a JSON-formatted list)?
content/guides/autodiscovery.md
Outdated
Note that in the structure above, you may have multiple checks for a single container. For example you may run a Java service that provides an HTTP API, using the HTTP check and the JMX integration at the same time. To declare that in templates, simply add elements to the `check_names`, `init_configs`, and `instances` lists. These elements will be matched together based on their index in their respective lists.
Again, the list orders matter. The HTTP check will only work if all its elements have the same index (1) across the lists (they do).
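For what it's worth, here is a sketch of that two-check case as KV-store values; the JMX-flavored option names are placeholders rather than anything specified in this PR:

```yaml
# Items are matched across the three lists by index:
# index 0 -> the JMX-based check, index 1 -> the HTTP check.
check_names:  '["jmx", "http_check"]'
init_configs: '[{}, {}]'
instances:    '[{"host": "%%host%%", "port": "%%port_0%%"},
               {"name": "java-api", "url": "http://%%host%%:%%port_1%%"}]'
```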
list orders or list order?
The HTTP check will only work if all its elements have the same index (1) across the lists (they do).
I don't understand this
metadata:
  name: apache
  annotations:
    service-discovery.datadoghq.com/apache.check_names: '["apache","http_check"]'
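For reference, the matching `init_configs` and `instances` annotations would presumably round out that two-check example; a sketch with illustrative option values (not part of this diff):

```yaml
metadata:
  name: apache
  annotations:
    service-discovery.datadoghq.com/apache.check_names: '["apache","http_check"]'
    service-discovery.datadoghq.com/apache.init_configs: '[{},{}]'
    service-discovery.datadoghq.com/apache.instances: '[{"apache_status_url": "http://%%host%%/server-status?auto"},{"name": "apache", "url": "http://%%host%%"}]'
```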
awesome, didn't know you could add multiple checks, but makes sense.
IIRC that's 5.14 only (must check the milestone). We could document that to avoid support cases with old versions
This feels like a long document. @kmshultz you talked about splitting it up into autodisco on k8s, autodisco on docker, etc. Any further ideas on this? If I just care about Docker Swarm, I probably don't want to be confused about k8s.
Re: splitting this into smaller documents, I agree we should do that. But as the docs site is currently organized, there's no good way to do that, i.e. since the only semblance of organization we have now is When we redesign the docs site soon, we can organize it some other way, i.e. Thoughts @technovangelist @jyee @irabinovitch?
Perhaps 'container identifier' is indeed best. Care to offer a suggestion?
@xvello I've added some more content here. Mind giving it another look, please? 🙇
content/guides/autodiscovery.md
Outdated
kind: guide
listorder: 10
---

Docker is being [adopted rapidly](https://www.datadoghq.com/docker-adoption/) and platforms like Docker Swarm, Kubernetes and Amazon's ECS make running services easier and more resilient by managing orchestration and replication across hosts. But all of that makes monitoring more difficult. How can you monitor a service which is dynamically shifting from one host to another?
Docker is being [adopted rapidly](https://www.datadoghq.com/docker-adoption/). Orchestration platforms like Docker Swarm, Kubernetes, and Amazon ECS make running Docker-ized services easier and more resilient by managing orchestration and replication across hosts. But all of that makes monitoring more difficult. How can you reliably monitor a service which is unpredictably shifting from one host to another?
s/Docker/Containers/ ?
is unpredictably the right word? We didn't like dynamically?
would "dynamically being shifted" make more sense?
As docker is the only supported runtime for now, I'd go with writing docker explicitly, but we'll transition to a generic "containers" once rkt is supported
unpredictably, dynamically... I think it's a matter of taste. To me, "unpredictably" better evokes the problem. Do you think it's inaccurate, or you just don't like the tone/sound of it?
As docker is the only supported runtime for now, I'd go with writing docker explicitly, but we'll transition to a generic "containers" once rkt is supported
Indeed, and if we avoided saying Docker here, we'd want to remove it from the page's title, too.
content/guides/autodiscovery.md
Outdated
## How it works
<div class="alert alert-info">
Autodiscovery was previously called Service Discovery. It's still called Service Discovery in the Agent's code and in Autodiscovery configuration options.
</div>
Do we need to mention this?
I think so, because current clients know the old term and might search for it.
That could even be at the top of the page
@xvello at first I did have it at the top, just below the page title. Do you think that's better?
yes, as in "just so you know... now let's get it over with"
content/guides/autodiscovery.md
Outdated
If you use Kubernetes, see the [Kubernetes integration page](http://docs.datadoghq.com/integrations/kubernetes/#installation) for instructions on running docker-dd-agent. If you use Amazon ECS, see [its integration page](http://docs.datadoghq.com/integrations/ecs/#installation).

If you use Docker Swarm, run the following command on one of your manager nodes:

docker service create \
We don't have an agent install doc for Swarm yet; we could add it here.
But the agent setup pages need a revamp, as the k8s and docker ones are getting more and more complex (different ways to deploy all mushed in the same page with no hierarchy). I'm torn between revamping these pages or just creating doc.dd.com pages and linking to them.
content/guides/autodiscovery.md
Outdated
The configuration templates in `conf.d/auto_conf` directory are nearly identical to the example YAML configuration files provided in [the Datadog `conf.d` directory](https://github.com/DataDog/dd-agent/tree/master/conf.d), but with one important field added. The `docker_images` field is required and identifies the container image(s) to which the configuration template should be applied.
1. Add them to each host that runs docker-dd-agent and [mount the directory that contains them](https://github.com/DataDog/docker-dd-agent#configuration-files) into the docker-dd-agent container when starting it
1. Package them into your own release of docker-dd-agent
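For readers following along, a minimal sketch of what one of these file-based templates looks like with the extra `docker_images` field (values modeled loosely on the Redis template; treat them as illustrative):

```yaml
# conf.d/auto_conf/redisdb.yaml
# The check name is inferred from the filename; docker_images lists the
# container identifiers this template applies to.
docker_images:
  - redis

init_config:

instances:
  - host: "%%host%%"
    port: "%%port%%"
```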
I'd go with:
Build a custom docker image based on docker-dd-agent with your custom templates added in the /etc/dd-agent/conf.d/auto_conf
content/guides/autodiscovery.md
Outdated
## Configuration templates with key-value stores
If this is too limiting—if you need to apply different check configurations to different containers running the same image—use [labels](#container-labels) to identify the containers. Label each container differently, then add each label to any template file's `docker_images` list (yes, `docker_images` is where to put _any_ kind of container identifier, not just images).
👍
content/guides/autodiscovery.md
Outdated
If you provide a template for the same check type via multiple template sources, the Agent will prefer, in increasing order of preference:

* Files
IIRC files are last in the precedence order. Annotations are first for sure, so order is Annotations -> K/V -> files
I think it's better to list them in order of lookup instead of increasing order of preference (as that's the opposite, right?)
Ok, maybe this is confusing because I put it in ascending order of precedence. I thought another engineer told me KV takes highest precedence but perhaps I misunderstood.
@xvello Re: the Agent install instructions for each platform, I agree we need a separate space for this. With the Docs redesign underway very soon, I'm planning to have a dedicated section for the Agent (i.e. docs.datadoghq.com/agent) where we comprehensively cover its architecture (different daemons), install instructions for different platforms (similar to dogweb), all configuration options and environment variables, etc. I don't like having the install docs embedded in Integrations pages. And I'm not a big fan of having them in dogweb, either, though I realize it's a nice first-time-user flow to log in and be presented with a one-liner to copy and paste.
TODO: