diff --git a/README.md b/README.md index cddde019a..93628756d 100644 --- a/README.md +++ b/README.md @@ -28,7 +28,7 @@ This table represents the supported components of AWS OTel Collector in 2020. Th | prometheusreceiver | attributesprocessor | `awsxrayexporter` | healthcheckextension | | otlpreceiver | resourceprocessor | `awsemfexporter` | pprofextension | | `awsecscontainermetricsreceiver`| queuedprocessor | `awsprometheusremotewriteexporter` | zpagesextension | -| `awsxrayreceiver` | batchprocessor | loggingexporter | | +| `awsxrayreceiver` | batchprocessor | loggingexporter | `ecsobserver` | | `statsdreceiver` | memorylimiter | otlpexporter | | | zipkinreceiver | tailsamplingprocessor | fileexporter | | | jaegerreceiver | probabilisticsamplerprocessor | otlphttpexporter | | @@ -42,24 +42,31 @@ This table represents the supported components of AWS OTel Collector in 2020. Th #### AWS OTel Collector AWS Components + * [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector/) * [Trace X-Ray Exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/master/exporter/awsxrayexporter) * [Metrics EMF Exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/master/exporter/awsemfexporter/README.md) * [ECS Container Metrics Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/master/receiver/awsecscontainermetricsreceiver) * [StatsD Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/statsdreceiver) +* [ECS Observer Extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/observer/ecsobserver) ### Getting Started + #### Prerequisites + To build AWS OTel Collector locally, you will need to have Golang installed. You can download and install Golang [here](https://golang.org/doc/install). 
#### AWS OTel Collector Configuration + We build a [default configuration](https://github.com/aws-observability/aws-otel-collector/blob/main/config.yaml) into our Docker image and the other release formats, so you can run AWS OTel Collector out of the box with the default settings. AWS OTel Collector also uses the same configuration syntax/design as [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector), so you can customize or port your existing OpenTelemetry Collector configuration files when running AWS OTel Collector. Please refer to the `Try out AWS OTel Collector` section for details on configuring AWS OTel Collector. #### Try out AWS OTel Collector + AWS OTel Collector supports all AWS computing platforms as well as Docker and Kubernetes. Here are some examples of how to run AWS OTel Collector to send telemetry data: + * [Run it with Docker](https://github.com/aws-observability/aws-otel-collector/blob/main/docs/developers/docker-demo.md) * [Run it with ECS](https://github.com/aws-observability/aws-otel-collector/blob/main/docs/developers/ecs-demo.md) * [Run it with EKS](https://github.com/aws-observability/aws-otel-collector/blob/main/docs/developers/eks-demo.md) @@ -68,11 +75,18 @@ AWS OTel Collector supports all AWS computing platforms and docker/kubernetes. 
* [Run it on AWS Debian EC2](https://github.com/aws-observability/aws-otel-collector/blob/main/docs/developers/debian-deb-demo.md) #### Build Your Own Artifacts + Use the following instructions to build your own AWS OTel Collector artifacts: + * [Build Docker Image](https://github.com/aws-observability/aws-otel-collector/blob/main/docs/developers/build-docker.md) * [Build RPM/Deb/MSI](https://github.com/aws-observability/aws-otel-collector/blob/main/docs/developers/build-aoc.md) +### Development + +See [docs/developers](docs/developers/README.md). + ### Release Process + * [Release new version](RELEASING.md) ### Benchmark @@ -80,7 +94,6 @@ Use the following instruction to build your own AWS OTel Collector artifacts: The latest performance model result is [here](https://github.com/aws-observability/aws-otel-collector/blob/main/docs/performance_model.md). The performance test was conducted by following these [instructions](https://github.com/aws-observability/aws-otel-test-framework/blob/terraform/docs/get-performance-model.md). - - ### License + AWS OTel Collector is licensed under the Apache 2.0 license. diff --git a/deployment-template/ecs/aws-otel-container-insights-prometheus-ec2-deployment-cfn.yaml b/deployment-template/ecs/aws-otel-container-insights-prometheus-ec2-deployment-cfn.yaml new file mode 100644 index 000000000..69aaa2770 --- /dev/null +++ b/deployment-template/ecs/aws-otel-container-insights-prometheus-ec2-deployment-cfn.yaml @@ -0,0 +1,323 @@ +Parameters: + ClusterName: + Type: String + Description: Enter the name of the ECS cluster from which you want to collect Prometheus metrics + # IAM + CreateIAMRoles: + Type: String + Default: 'False' + AllowedValues: + - 'True' + - 'False' + Description: Create new default IAM roles or use existing ones. + ConstraintDescription: must specify True or False. 
+ TaskRoleArn: + Type: String + Default: Default + Description: Enter the role ARN you want to use as the ECS task role + ExecutionRoleArn: + Type: String + Default: Default + Description: Enter the role ARN you want to use as the ECS execution role + # Collector + CollectorImage: + Type: String + Default: 'public.ecr.aws/aws-observability/aws-otel-collector:latest' +Conditions: + CreateRoles: !Equals + - !Ref CreateIAMRoles + - 'True' + DefaultTaskRole: !Equals + - !Ref TaskRoleArn + - Default + DefaultExecutionRole: !Equals + - !Ref ExecutionRoleArn + - Default +Resources: + ECSTaskDefinition: + Type: 'AWS::ECS::TaskDefinition' + Properties: + Family: !Sub 'adot-container-insights-prometheus-${ClusterName}' + TaskRoleArn: !If + - CreateRoles + - !GetAtt + - ECSTaskRole + - Arn + - !If + - DefaultTaskRole + - !Sub 'arn:aws:iam::${AWS::AccountId}:role/AWSOTelRole' + - !Ref TaskRoleArn + ExecutionRoleArn: !If + - CreateRoles + - !GetAtt + - ECSExecutionRole + - Arn + - !If + - DefaultExecutionRole + - !Sub 'arn:aws:iam::${AWS::AccountId}:role/AWSOTelExecutionRole' + - !Ref ExecutionRoleArn + NetworkMode: bridge + ContainerDefinitions: + - LogConfiguration: + LogDriver: awslogs + Options: + awslogs-create-group: 'True' + awslogs-group: !Sub '/ecs/aws-otel-collector/${ClusterName}' + awslogs-region: !Ref 'AWS::Region' + awslogs-stream-prefix: ecs + Image: !Ref CollectorImage + Name: aws-collector + Secrets: + - Name: AOT_CONFIG_CONTENT + ValueFrom: !Sub 'AmazonCloudWatch-AOC-ECS-Prometheus-${ClusterName}' + Memory: '512' + RequiresCompatibilities: + - EC2 + Cpu: '256' + ECSReplicaService: + Type: 'AWS::ECS::Service' + Properties: + TaskDefinition: !Ref ECSTaskDefinition + Cluster: !Ref ClusterName + LaunchType: EC2 + SchedulingStrategy: REPLICA + DesiredCount: 1 + ServiceName: adot-container-insights-prometheus-service + ECSTaskRole: + Type: 'AWS::IAM::Role' + Condition: CreateRoles + Properties: + Description: Allows ECS tasks to call AWS services on your behalf. 
+ AssumeRolePolicyDocument: + Version: 2012-10-17 + Statement: + - Sid: '' + Effect: Allow + Principal: + Service: ecs-tasks.amazonaws.com + Action: 'sts:AssumeRole' + Policies: + - PolicyName: AWSOpenTelemetryPolicy + PolicyDocument: + Version: 2012-10-17 + Statement: + - Effect: Allow + Action: + - 'logs:PutLogEvents' + - 'logs:CreateLogGroup' + - 'logs:CreateLogStream' + - 'logs:DescribeLogStreams' + - 'logs:DescribeLogGroups' + - 'xray:PutTraceSegments' + - 'xray:PutTelemetryRecords' + - 'xray:GetSamplingRules' + - 'xray:GetSamplingTargets' + - 'xray:GetSamplingStatisticSummaries' + - 'ssm:GetParameters' + Resource: '*' + - PolicyName: AWSOpenTelemetryPolicyPrometheusECSDiscovery + PolicyDocument: + Version: 2012-10-17 + Statement: + - Effect: Allow + Action: + - 'ec2:DescribeInstances' + - 'ecs:ListTasks' + - 'ecs:ListServices' + - 'ecs:DescribeContainerInstances' + - 'ecs:DescribeServices' + - 'ecs:DescribeTasks' + - 'ecs:DescribeTaskDefinition' + Resource: '*' + RoleName: AWSOTelRolePrometheusECS + ECSExecutionRole: + Type: 'AWS::IAM::Role' + Condition: CreateRoles + Properties: + Description: >- + Allows the ECS container agent to make calls to the Amazon ECS API on your + behalf. 
+ AssumeRolePolicyDocument: + Version: 2012-10-17 + Statement: + - Sid: '' + Effect: Allow + Principal: + Service: ecs-tasks.amazonaws.com + Action: 'sts:AssumeRole' + ManagedPolicyArns: + - 'arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy' + - 'arn:aws:iam::aws:policy/CloudWatchLogsFullAccess' + - 'arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess' + RoleName: AWSOTelExecutionRolePrometheusECS + AocConfigSSMParameter: + Type: AWS::SSM::Parameter + Properties: + Name: !Sub 'AmazonCloudWatch-AOC-ECS-Prometheus-${ClusterName}' + Type: String + Tier: Intelligent-Tiering + Description: !Sub 'CWAgent SSM Parameter with App Mesh and Java EMF Definition for ECS Cluster: ${ClusterName}' + Value: !Sub |- + + extensions: + ecs_observer: + cluster_name: '${ClusterName}' + cluster_region: '${AWS::Region}' + result_file: '/etc/ecs_sd_targets.yaml' + refresh_interval: 60s + job_label_name: prometheus_job + # nginx and nginx plus + # https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights-Prometheus-Setup-nginx-ecs.html + services: + - name_pattern: '^.*nginx-service$' + metrics_ports: + - 9113 + job_name: nginx-prometheus-exporter + # jmx + docker_labels: + - port_label: 'ECS_PROMETHEUS_EXPORTER_PORT' + # App Mesh, port and metrics are from envoy sidecar + task_definitions: + - arn_pattern: '.*:task-definition/.*-ColorTeller-(white):[0-9]+' + metrics_path: '/stats/prometheus' + metrics_ports: + - 9901 + job_name: ecs-appmesh-color + - arn_pattern: '.*:task-definition/.*-ColorGateway:[0-9]+' + metrics_path: '/stats/prometheus' + metrics_ports: + - 9901 + job_name: ecs-appmesh-color + + receivers: + prometheus: + config: + scrape_configs: + - job_name: "ecssd" + file_sd_configs: + - files: + - '/etc/ecs_sd_targets.yaml' + relabel_configs: + - source_labels: [ __meta_ecs_cluster_name ] # ClusterName + action: replace + target_label: ClusterName + - source_labels: [ __meta_ecs_service_name ] # ServiceName + action: replace + target_label: 
ServiceName + - source_labels: [ __meta_ecs_task_definition_family ] # TaskDefinitionFamily + action: replace + target_label: TaskDefinitionFamily + - source_labels: [ __meta_ecs_container_name ] # container_name + action: replace + target_label: container_name + - action: labelmap # docker labels + regex: ^__meta_ecs_container_labels_(.+)$ + replacement: '$$1' + + exporters: + awsemf: + region: '${AWS::Region}' + namespace: ECS/ContainerInsights/Prometheus + log_group_name: "/aws/ecs/containerinsights/${ClusterName}/prometheus" + dimension_rollup_option: NoDimensionRollup + metric_declarations: + # nginx + - dimensions: [ [ ClusterName, TaskDefinitionFamily, ServiceName ] ] + label_matchers: + - label_names: + - ServiceName + regex: '^.*nginx-service$' + metric_name_selectors: + - "^nginx_.*$" + # nginx plus + - dimensions: [ [ ClusterName, TaskDefinitionFamily, ServiceName ] ] + label_matchers: + - label_names: + - ServiceName + regex: '^.*nginx-plus-service$' + metric_name_selectors: + - "^nginxplus_connections_accepted$" + - "^nginxplus_connections_active$" + - "^nginxplus_connections_dropped$" + - "^nginxplus_connections_idle$" + - "^nginxplus_http_requests_total$" + - "^nginxplus_ssl_handshakes$" + - "^nginxplus_ssl_handshakes_failed$" + - "^nginxplus_up$" + - "^nginxplus_upstream_server_health_checks_fails$" + - dimensions: [ [ ClusterName, TaskDefinitionFamily, ServiceName, upstream ] ] + label_matchers: + - label_names: + - ServiceName + regex: '^.*nginx-plus-service$' + metric_name_selectors: + - "^nginxplus_upstream_server_response_time$" + - dimensions: [ [ ClusterName, TaskDefinitionFamily, ServiceName, code ] ] + label_matchers: + - label_names: + - ServiceName + regex: '^.*nginx-plus-service$' + metric_name_selectors: + - "^nginxplus_upstream_server_responses$" + - "^nginxplus_server_zone_responses$" + # jmx + - dimensions: [ [ ClusterName, TaskDefinitionFamily, area ] ] + label_matchers: + - label_names: + - Java_EMF_Metrics + regex: ^true$ + 
metric_name_selectors: + - "^jvm_memory_bytes_used$" + - dimensions: [ [ ClusterName, TaskDefinitionFamily, pool ] ] + label_matchers: + - label_names: + - Java_EMF_Metrics + regex: ^true$ + metric_name_selectors: + - "^jvm_memory_pool_bytes_used$" + - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ] + label_matchers: + - label_names: + - Java_EMF_Metrics + regex: ^true$ + metric_name_selectors: + - "^jvm_threads_(current|daemon)$" + - "^jvm_classes_loaded$" + - "^java_lang_operatingsystem_(freephysicalmemorysize|totalphysicalmemorysize|freeswapspacesize|totalswapspacesize|systemcpuload|processcpuload|availableprocessors|openfiledescriptorcount)$" + - "^catalina_manager_(rejectedsessions|activesessions)$" + - "^jvm_gc_collection_seconds_(count|sum)$" + - "^catalina_globalrequestprocessor_(bytesreceived|bytessent|requestcount|errorcount|processingtime)$" + # AppMesh envoy + - dimensions: [ [ "ClusterName","TaskDefinitionFamily" ] ] + label_matchers: + - label_names: + - container_name + regex: ^envoy$ + metric_name_selectors: + - "^envoy_http_downstream_rq_(total|xx)$" + - "^envoy_cluster_upstream_cx_(r|t)x_bytes_total$" + - "^envoy_cluster_membership_(healthy|total)$" + - "^envoy_server_memory_(allocated|heap_size)$" + - "^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$" + - "^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$" + - "^envoy_http_downstream_cx_destroy_remote_active_rq$" + - "^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$" + - "^envoy_cluster_upstream_rq_retry$" + - "^envoy_cluster_upstream_rq_retry_(success|overflow)$" + - "^envoy_server_(version|uptime|live)$" + - dimensions: [ [ "ClusterName","TaskDefinitionFamily","envoy_http_conn_manager_prefix","envoy_response_code_class" ] ] + label_matchers: + - label_names: + - container_name + regex: ^envoy$ + metric_name_selectors: + - 
"^envoy_http_downstream_rq_xx$" + + + service: + extensions: [ ecs_observer ] + pipelines: + metrics: + receivers: [ prometheus ] + exporters: [ awsemf ] + diff --git a/docs/developers/README.md b/docs/developers/README.md new file mode 100644 index 000000000..5b1263332 --- /dev/null +++ b/docs/developers/README.md @@ -0,0 +1,16 @@ +# Developer Documentation + +Build + +- [Binary](build-aoc.md) +- [Docker Image](build-docker.md) + +Container Insights for Prometheus Support + +- [EKS](container-insight-install-aoc.md) +- [ECS](container-insights-ecs-prometheus.md) + +EC2 + +- [Linux RPM](linux-rpm-demo.md) +- [Winows](windows-other-demo.md) \ No newline at end of file diff --git a/docs/developers/container-insights-ecs-prometheus.md b/docs/developers/container-insights-ecs-prometheus.md new file mode 100644 index 000000000..8b8bb9cf8 --- /dev/null +++ b/docs/developers/container-insights-ecs-prometheus.md @@ -0,0 +1,173 @@ +# Container Insight ECS Prometheus + +NOTE: This is doc for developing this feature, for user doc please +check [user guide](https://aws-otel.github.io/docs/getting-started/container-insights/ecs-prometheus). + +## Links + +- [ecsobserver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/observer/ecsobserver) + discover ECS tasks +- [Integration test](https://github.com/aws-observability/aws-otel-test-framework/pull/308) + +## Quick Start + +After building your own image `12346.dkr.ecr.us-west-2.amazonaws.com/aoc:ecssd-0.2` you can +use [this cfn](../../deployment-template/ecs/aws-otel-container-insights-prometheus-ec2-deployment-cfn.yaml) +to launch the collector on ECS EC2 cluster. 
+ +```bash +export CLUSTER_NAME=aoc-prometheus-dashboard-1 +export CREATE_IAM_ROLES=True +export COLLECTOR_IMAGE=12346.dkr.ecr.us-west-2.amazonaws.com/aoc:ecssd-0.2 + +aws cloudformation create-stack --stack-name AOC-Prometheus-ECS-${CLUSTER_NAME} \ + --template-body file://aws-otel-container-insights-prometheus-ec2-deployment-cfn.yaml \ + --parameters ParameterKey=ClusterName,ParameterValue=${CLUSTER_NAME} \ + ParameterKey=CreateIAMRoles,ParameterValue=${CREATE_IAM_ROLES} \ + ParameterKey=CollectorImage,ParameterValue=${COLLECTOR_IMAGE} \ + --capabilities CAPABILITY_NAMED_IAM +``` + +It will create the following resources: + +- SSM parameter +- IAM roles, `AWSOTelRolePrometheusECS` and `AWSOTelExecutionRolePrometheusECS` by default +- ECS task definition and replica service + +If you need to test your image frequently, you will want a script that updates the SSM parameter, pushes the image, and scales the service down +to 0 and back to 1; the CloudFormation stack is a bit slow for iteration. (The script is left as an exercise for the reader; +hint: the AWS CLI can't upload an SSM parameter value from a file name.) + +## Internal + +NOTE: some problems (or problematic solutions...) also apply to (or are copied from) Container Insights EKS Prometheus. + +To understand the codebase, check the README +in [ecsobserver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/observer/ecsobserver). You can also +use [cloudwatch-agent](https://github.com/aws/amazon-cloudwatch-agent/tree/master/internal/ecsservicediscovery) as a +reference. + +### Label, Relabel and Dimension + +Labels are key-value pairs, e.g. `env=prod`; they are called dimensions in CloudWatch. There is no direct translation from +label to dimension because CloudWatch limits how many dimensions a metric can have. Metric declarations allow picking some +labels as dimensions. There is also dimension rollup, but we disable it using `NoDimensionRollup`. + +For the built-in dashboards to work, specific metric dimensions are required.
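The label-to-dimension handling just described can be sketched as a tiny simulation (hypothetical plain-Python code, not collector code; the relabel map, dimension set, and label values below are made-up examples):

```python
# Hypothetical sketch of the label -> dimension flow (not collector code).
# Step 1: relabeling copies __meta_ecs_* labels to plain names before the
# receiver drops all __-prefixed labels.
# Step 2: a metric declaration picks some labels as CloudWatch dimensions;
# the rest become structured-log fields.

RELABELS = {
    "__meta_ecs_task_definition_family": "TaskDefinitionFamily",
    "__meta_ecs_cluster_name": "ClusterName",
}
DIMENSIONS = {"ClusterName", "TaskDefinitionFamily"}

def relabel(labels):
    out = dict(labels)
    for src, target in RELABELS.items():
        if src in out:
            out[target] = out[src]
    # the receiver drops internal __-prefixed labels after relabeling
    return {k: v for k, v in out.items() if not k.startswith("__")}

def split_dimensions(labels):
    dims = {k: v for k, v in labels.items() if k in DIMENSIONS}
    fields = {k: v for k, v in labels.items() if k not in DIMENSIONS}
    return dims, fields

scraped = {
    "__meta_ecs_task_definition_family": "MyTask",
    "__meta_ecs_cluster_name": "demo",
    "container_name": "nginx",
}
dims, fields = split_dimensions(relabel(scraped))
print(dims)    # {'TaskDefinitionFamily': 'MyTask', 'ClusterName': 'demo'}
print(fields)  # {'container_name': 'nginx'}
```

The real pipeline does the same two stages with `relabel_configs` in the prometheus receiver and `metric_declarations` in the EMF exporter.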
In `ecsobserver`, we export labels +with a `__meta_ecs_` prefix (e.g. `__meta_ecs_task_definition_family`), which is different from cloudwatch-agent. +The `__` prefix is the convention in Prometheus's built-in discovery implementations, so we followed it when +porting the discovery logic. Getting a dimension like `TaskDefinitionFamily` into CloudWatch takes two steps: + +- Prometheus relabeling ensures we carry the label down the pipeline; otherwise all `__`-prefixed labels are + dropped. `__meta_ecs_task_definition_family=MyTask` becomes `TaskDefinitionFamily=MyTask`. +- The EMF exporter's metric declaration picks some labels as metric dimensions; other labels become structured log fields. `MyTask` + becomes a dimension value that shows up in the dashboard when you slice and dice. + +```yaml +receivers: + prometheus: + config: + scrape_configs: + - job_name: "ecssd" + relabel_configs: # Relabel here because label with __ prefix will be dropped by receiver. + - source_labels: [ __meta_ecs_task_definition_family ] # TaskDefinitionFamily + action: replace + target_label: TaskDefinitionFamily + +exporters: + awsemf: + metric_declarations: + - dimensions: [ [ ClusterName, TaskDefinitionFamily, ServiceName ] ] # dimension names are same as our relabeled keys. + label_matchers: + - label_names: + - ServiceName + regex: '^.*nginx-service$' + metric_name_selectors: + - "^nginx_.*$" +``` + +### job Label + +We allow users to specify different names using `job_name` in the config. They are NOT exported as `job`; instead, the value +from `job_label_name` is used as the exported label key (e.g. `prometheus_job`). Then we use the `metricstransform` +processor to rename `prometheus_job` back to `job`. + +Why don't we just use `job` directly? The short answer is that the prometheus receiver does not support specifying `job` in discovery +output. We use `file_sd` as the actual discovery implementation to bridge our discovery result, so all the targets are +under the job `ecssd` in the prometheus config.
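The `job_name` → `prometheus_job` → `job` round trip described above can be sketched in a few hypothetical lines (illustration only; the target address and label values are made up):

```python
# Hypothetical sketch of the job-label round trip (not collector code):
# ecs_observer cannot set 'job' directly in its discovery output, so it
# writes the configured job_name under a custom key (job_label_name), and
# a metricstransform-style rename turns that key back into 'job'.

def discover_target(job_name, job_label_name="prometheus_job"):
    # what ecs_observer writes into result_file for one matched target
    return {"__address__": "10.0.0.1:9113", job_label_name: job_name}

def rename_label(labels, old="prometheus_job", new="job"):
    # what the metricstransform processor's update_label operation achieves
    out = dict(labels)
    if old in out:
        out[new] = out.pop(old)
    return out

target = discover_target("nginx-prometheus-exporter")
print(rename_label(target))
# {'__address__': '10.0.0.1:9113', 'job': 'nginx-prometheus-exporter'}
```

Note that the actual rename happens on metric labels after scraping, not on the discovery targets; the sketch only illustrates the key flip.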
However, the prometheus receiver does not behave exactly like Prometheus: it +relies on the job name for detecting metric types. If we export a target with job `nginx-prometheus-exporter`, the receiver will +look up the metadata cache using `nginx-prometheus-exporter` while the only job in the cache is `ecssd`, so the resulting metric +type is unknown. The comment in +[this PR](https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/3785#discussion_r654028642) +gives more detail and links to the upstream issue. + +```yaml +extensions: + ecs_observer: # extension type is ecs_observer + # custom name for 'job' so we can rename it back to 'job' using metricstransform processor + job_label_name: prometheus_job + result_file: '/etc/ecs_sd_targets.yaml' + services: + - name_pattern: '^.*nginx-service$' # NGINX + metrics_ports: + - 9113 + job_name: nginx-prometheus-exporter + +receivers: + prometheus: + config: + scrape_configs: + - job_name: "ecssd" + file_sd_configs: + - files: + - '/etc/ecs_sd_targets.yaml' # MUST match the file name in ecs_observer.result_file + +processors: + metricstransform: + transforms: + - include: ".*" # Rename customized job label back to job + match_type: regexp + action: update + operations: + - label: prometheus_job # must match the value configured in ecs_observer + new_label: job + action: update_label + +``` + +### prom_metric_type Label + +`prom_metric_type` is a label only used by the CloudWatch built-in dashboards. To support it, we changed the EMF exporter +to look up resource attributes +and [change its output when the receiver is prometheus](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/f02e8a03a15a64cd94f0cc5364dc67a9c58343fd/exporter/awsemfexporter/metric_translator.go#L146-L162). However, `receiver` is not a default attribute, so we insert it manually using the `resource` processor. 
In other words, +our solution only works when the prometheus receiver is the only metrics receiver sending metrics to the CloudWatch EMF exporter +in the pipeline. + +```yaml +processors: + resource: + attributes: + - key: receiver # Insert receiver: prometheus for CloudWatch EMF Exporter to add prom_metric_type + value: "prometheus" + action: insert +``` + +## Future Work + +### Cluster name auto detection + +Unlike EKS, ECS has a reliable way to discover the current cluster using an endpoint provided by the ECS agent. We didn't include +it in the initial release because we already +have [two components](https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/3188) +with duplicated metadata-client code. + +To implement this feature, just check the metadata API when the user gives an empty cluster name. Scraping metrics in cluster A using a +collector running in cluster B is a valid use case, so we shouldn't override the cluster name if the user has already provided one. +In fact, the collector can run anywhere as long as it can connect to the AWS API and the ECS tasks. + +## Changelog + +- 2021-06-23 @pingleig initialized the doc, ported + from [#435](https://github.com/aws-observability/aws-otel-collector/pull/435) \ No newline at end of file