update url for the new opensearch otel demo repo #9

Merged · 7 commits · May 31, 2023
1 change: 1 addition & 0 deletions .env
@@ -10,6 +10,7 @@ ENV_PLATFORM=local
# OpenTelemetry Collector
OTEL_COLLECTOR_HOST=otelcol
OTEL_COLLECTOR_PORT=4317
OTEL_COLLECTOR_HEALTH_CHECK_PORT=1313
OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_COLLECTOR_HOST}:${OTEL_COLLECTOR_PORT}
PUBLIC_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4318/v1/traces

15 changes: 13 additions & 2 deletions .github/README.md
@@ -17,6 +17,14 @@ git clone https://github.com/opensearch/opentelemetry-demo.git
cd opentelemetry-demo
docker compose up -d
```
#### Known Issues
In some cases a few services might not start correctly; if that happens, re-run the docker compose command:
```
docker compose up -d
```
**_This is a known issue we are currently working to resolve._**

---

### Services

@@ -35,9 +43,12 @@ The next instructions are similar and use the same docker compose file.
```
docker compose up -d
```
- **Note:** The docker compose `--no-build` flag is used to fetch released docker images from [ghcr](http://ghcr.io/open-telemetry/demo) instead of building from source.
- Removing the `--no-build` command line option will rebuild all images from source. It may take more than 20 minutes to build if the flag is omitted.
+ **Note:**
+
+ _The docker compose `--no-build` flag is used to fetch released docker images from [ghcr](http://ghcr.io/open-telemetry/demo) instead of building from source.
+ Removing the `--no-build` command line option will rebuild all images from source. It may take more than 20 minutes to build if the flag is omitted._

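Not part of this diff: for illustration, fetching the prebuilt images instead of building locally uses the flag described in the note above:

```
docker compose up --no-build -d
```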
+ ---
### Explore and analyze the data With OpenSearch Observability
Review revised OpenSearch [Observability Architecture](architecture.md)

8 changes: 4 additions & 4 deletions .github/architecture.md
@@ -37,15 +37,15 @@ Backend supportive services
- [Frontend Nginx Proxy](http://nginx:90) *(replacement for _Frontend-Proxy_)*
- See [description](../src/nginx-otel/README.md)
- [OpenSearch](https://opensearch-node1:9200)
- - See [description](https://github.com/YANG-DB/opentelemetry-demo/blob/12d52cbb23bbf4226f6de2dfec840482a0a7d054/docker-compose.yml#L697)
+ - See [description](https://github.com/opensearch-project/opentelemetry-demo/blob/12d52cbb23bbf4226f6de2dfec840482a0a7d054/docker-compose.yml#L697)
- [Dashboards](http://opensearch-dashboards:5601)
- - See [description](https://github.com/YANG-DB/opentelemetry-demo/blob/12d52cbb23bbf4226f6de2dfec840482a0a7d054/docker-compose.yml#L747)
+ - See [description](https://github.com/opensearch-project/opentelemetry-demo/blob/12d52cbb23bbf4226f6de2dfec840482a0a7d054/docker-compose.yml#L747)
- [Prometheus](http://prometheus:9090)
- - See [description](https://github.com/YANG-DB/opentelemetry-demo/blob/12d52cbb23bbf4226f6de2dfec840482a0a7d054/docker-compose.yml#L674)
+ - See [description](https://github.com/opensearch-project/opentelemetry-demo/blob/12d52cbb23bbf4226f6de2dfec840482a0a7d054/docker-compose.yml#L674)
- [Feature-Flag](http://feature-flag-service:8881)
- See [description](../src/featureflagservice/README.md)
- [Grafana](http://grafana:3000)
- - See [description](https://github.com/YANG-DB/opentelemetry-demo/blob/12d52cbb23bbf4226f6de2dfec840482a0a7d054/docker-compose.yml#L637)
+ - See [description](https://github.com/opensearch-project/opentelemetry-demo/blob/12d52cbb23bbf4226f6de2dfec840482a0a7d054/docker-compose.yml#L637)

### Services Topology
The next diagram shows the docker compose services dependencies
9 changes: 5 additions & 4 deletions docker-compose.yml
@@ -678,10 +678,11 @@ services:
- ./src/otelcollector/otelcol-config.yml:/etc/otelcol-config.yml
- ./src/otelcollector/otelcol-config-extras.yml:/etc/otelcol-config-extras.yml
ports:
- "4317" # OTLP over gRPC receiver
- "4318:4318" # OTLP over HTTP receiver
- "9464" # Prometheus exporter
- "8888" # metrics endpoint
- "4317" # OTLP over gRPC receiver
- "4318:4318" # OTLP over HTTP receiver
- "13133:13133" # health check port
- "9464" # Prometheus exporter
- "8888" # metrics endpoint
depends_on:
- jaeger-agent
- data-prepper
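For context (not part of this diff): the new `13133:13133` mapping exposes the collector's health check endpoint. A minimal sketch of the collector-side setting, assuming the standard `health_check` extension:

```yaml
# otelcol-config.yml (sketch): enable the health check endpoint
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # matches the "13133:13133" port mapping above

service:
  extensions: [health_check]
```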
3 changes: 3 additions & 0 deletions src/dataprepper/README.md
@@ -50,3 +50,6 @@ There are some additional `trace.group` related fields which are not part of the
- traceGroupFields.durationInNanos - A derived field, the durationInNanos of the trace's root span.

```
### Metrics from Traces Processors

New processors are being added that create metrics from the logs and traces passing through [Data Prepper](https://opensearch.org/blog/Announcing-Data-Prepper-2.1.0/).
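Not part of this diff: a minimal sketch of the kind of Data Prepper metrics pipeline this refers to. The plugin names follow Data Prepper's documented `otel_metrics_source` and `otel_metrics_raw_processor`; the sink settings are assumptions, not this repository's actual configuration:

```yaml
# pipelines.yaml (sketch): ingest OTLP metrics and write them to OpenSearch
metrics-pipeline:
  source:
    otel_metrics_source:
      ssl: false
  processor:
    - otel_metrics_raw_processor:
  sink:
    - opensearch:
        hosts: ["https://opensearch-node1:9200"]
        username: admin        # demo credentials; do not use in production
        password: admin
        insecure: true
        index: otel-metrics-%{yyyy.MM.dd}   # assumed index pattern
```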
53 changes: 38 additions & 15 deletions src/integrations/src/install.py
@@ -13,8 +13,13 @@
# For testing only. Don't store credentials in code.
auth = ('admin', 'admin')

- # Configure logging
- logging.basicConfig(format='%(asctime)s [%(levelname)s] %(message)s', level=logging.INFO)
+ # Configure logging to file
+ logging.basicConfig(
+     filename='application.log',  # The file where the logs should be written to
+     filemode='a',  # Append mode, use 'w' for overwriting each time the script is run
+     format='%(asctime)s [%(levelname)s] %(message)s',
+     level=logging.INFO
+ )
logger = logging.getLogger(__name__)

# Create the client with SSL/TLS enabled, but hostname verification disabled.
@@ -25,13 +30,15 @@
use_ssl = True,
verify_certs = False,
ssl_assert_hostname = False,
- ssl_show_warn = False
+ ssl_show_warn = False,
+ timeout=20,
+ max_retries=50,  # Increasing max_retries from 3 (default) to 50
+ retry_on_timeout=True
)
# verify connection to opensearch
def test_connection():
- max_retries = 30  # Maximum number of retries
- retry_interval = 10  # Wait for 10 seconds between retries
+ max_retries = 100  # Maximum number of retries
+ retry_interval = 20  # Wait for 20 seconds between retries

for i in range(max_retries):
try:
@@ -42,15 +49,15 @@ def test_connection():
verify=False # Disable SSL verification
)
response.raise_for_status() # Raise an exception if the request failed
- print('Successfully connected to OpenSearch')
+ logger.info('Successfully connected to OpenSearch')
return # Exit the function if connection is successful
except requests.HTTPError as e:
logging.error(f'Failed to connect to OpenSearch, error: {str(e)}')

- print(f'Attempt {i + 1} failed, waiting for {retry_interval} seconds before retrying...')
+ logger.info(f'Attempt {i + 1} failed, waiting for {retry_interval} seconds before retrying...')
time.sleep(retry_interval)

- print(f'Failed to connect to OpenSearch after {max_retries} attempts')
+ logger.error(f'Failed to connect to OpenSearch after {max_retries} attempts')
exit(1) # Exit the program with an error code

# create mapping components to compose the different observability categories
@@ -62,7 +69,7 @@ def create_mapping_components(client):
mapping = json.load(f)

template_name = os.path.splitext(filename)[0] # Remove the .mapping extension
- print(f'About to load template: {template_name}')
+ logger.info(f'About to load template: {template_name}')
# Create the component template
try:
response = requests.put(
Expand All @@ -74,7 +81,6 @@ def create_mapping_components(client):
)
response.raise_for_status() # Raise an exception if the request failed
logger.info(f'Successfully created component template: {template_name}')
- print(f'Successfully created component template: {template_name}')
except requests.HTTPError as e:
logger.error(f'Failed to create component template: {template_name}, error: {str(e)}')

@@ -87,12 +93,12 @@ def create_mapping_templates(client):
mapping = json.load(f)

template_name = os.path.splitext(filename)[0] # Remove the .mapping extension
- print(f'About to created index template: {template_name}')
+ logger.info(f'About to create index template: {template_name}')

# Create the template
try:
client.indices.put_index_template(name=template_name, body=mapping)
- print(f'Successfully created index template: {template_name}')
+ logger.info(f'Successfully created index template: {template_name}')
except OpenSearchException as e:
logger.error(f'Failed to create index template: {template_name}, error: {str(e)}')

@@ -143,10 +149,28 @@ def load_dashboards():
# create the data_streams based on the list given in the data-stream.json file
def create_datasources():
datasource_dir = '../datasource/'
+ # get current list of data sources
+ try:
+     response = requests.get(
+         url=f'https://{opensearch_host}:9200/_plugins/_query/_datasources',
+         auth=HTTPBasicAuth('admin', 'admin'),
+         verify=False,  # Disable SSL verification
+         headers={'Content-Type': 'application/json'}
+     )
+     response.raise_for_status()  # Raise an exception if the request failed
+     current_datasources = response.json()  # convert response to json
+ except requests.HTTPError as e:
+     logger.error(f'Failed to fetch datasources, error: {str(e)}')

for filename in os.listdir(datasource_dir):
if filename.endswith('.json'):
with open(os.path.join(datasource_dir, filename), 'r') as f:
datasource = json.load(f)
+ # check if datasource already exists
+ if any(d['name'] == datasource['name'] for d in current_datasources):
+     logger.info(f'Datasource already exists: {filename}')
+     continue  # Skip to the next datasource if this one already exists

try:
response = requests.post(
url=f'https://{opensearch_host}:9200/_plugins/_query/_datasources',
Expand All @@ -157,12 +181,11 @@ def create_datasources():
)
response.raise_for_status() # Raise an exception if the request failed
logger.info(f'Successfully created datasource: {filename}')
- print(f'Successfully created datasource: {filename}')
except requests.HTTPError as e:
- print(f'Failed to create datasource: {filename}, error: {str(e)}')
+ logger.error(f'Failed to create datasource: {filename}, error: {str(e)}')



if __name__ == '__main__':
# import all assets
test_connection()
129 changes: 129 additions & 0 deletions src/otelcollector/README.md
@@ -0,0 +1,129 @@
# OTEL Collector Pipeline

The following diagram describes the observability signals pipeline defined inside the OTEL Collector:

```mermaid
%%{
init: {
'theme': 'base',
'themeVariables': {
'primaryColor': '#BB2528',
'primaryTextColor': '#fff',
'primaryBorderColor': '#7C0000',
'lineColor': '#F8B229',
'secondaryColor': '#006100',
'tertiaryColor': '#fff'
}
}
}%%

graph LR
otlp-->otlpReceiverTraces["otlp Receiver"]
style otlpReceiverTraces fill:#f9d,stroke:#333,stroke-width:4px
otlp-->otlpReceiverMetrics["otlp Receiver"]
style otlpReceiverMetrics fill:#f9d,stroke:#333,stroke-width:4px
otlp-->otlpReceiverLogs["otlp Receiver"]
style otlpReceiverLogs fill:#f9d,stroke:#333,stroke-width:4px
otlpServiceGraph-->otlpServiceGraphReceiver["otlp/servicegraph Receiver"]
spanmetrics-->spanmetricsReceiver["spanmetrics Receiver"]

subgraph Traces Pipeline
otlpReceiverTraces-->tracesProcessor["Traces Processor"]
style tracesProcessor fill:#9cf,stroke:#333,stroke-width:4px
tracesProcessor-->otlpExporter["otlp Exporter"]
tracesProcessor-->loggingExporterTraces["Logging Exporter"]

tracesProcessor-->spanmetricsExporter["Spanmetrics Exporter"]
tracesProcessor-->otlp2Exporter["otlp/2 Exporter"]
end

subgraph Metrics/Servicegraph Pipeline
otlpServiceGraphReceiver-->metricsServiceGraphProcessor["Metrics/ServiceGraph Processor"]
style otlpServiceGraphReceiver fill:#f9d,stroke:#333,stroke-width:4px
metricsServiceGraphProcessor-->prometheusServiceGraphExporter["Prometheus/ServiceGraph Exporter"]
style metricsServiceGraphProcessor fill:#9cf,stroke:#333,stroke-width:4px
end

subgraph Metrics Pipeline
otlpReceiverMetrics-->metricsProcessor["Metrics Processor"]
style metricsProcessor fill:#9cf,stroke:#333,stroke-width:4px

spanmetricsReceiver-->metricsProcessor
style spanmetricsReceiver fill:#f9d,stroke:#333,stroke-width:4px
metricsProcessor-->prometheusExporter["Prometheus Exporter"]
metricsProcessor-->loggingExporterMetrics["Logging Exporter"]
end

subgraph Logs Pipeline
otlpReceiverLogs-->logsProcessor["Logs Processor"]
style logsProcessor fill:#9cf,stroke:#333,stroke-width:4px
logsProcessor-->loggingExporterLogs["Logging Exporter"]
end


```

### Traces
The traces pipeline consists of a receiver, multiple processors, and multiple exporters.

**Receiver (otlp):**
This is where the data comes in. In this configuration, the traces pipeline uses the `otlp` receiver (OTLP stands for OpenTelemetry Protocol). The receiver is configured to accept data over both gRPC and HTTP, and the HTTP protocol is configured to allow CORS from any origin.
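A receiver block consistent with this description might look like the following sketch (not a verbatim copy of the demo's configuration):

```yaml
receivers:
  otlp:
    protocols:
      grpc:        # OTLP over gRPC, port 4317 by default
      http:        # OTLP over HTTP, port 4318 by default
        cors:
          allowed_origins:
            - "*"  # allow CORS from any origin
```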

**Processors (memory_limiter, batch, servicegraph):**
Once the data is received, it is processed before being exported. The processors in the traces pipeline are listed below, with a configuration sketch after the list:

1. **memory_limiter:** This processor checks memory usage every second (check_interval: 1s) and ensures it does not exceed 4000 MiB (limit_mib: 4000). It also allows for a spike limit of 800 MiB (spike_limit_mib: 800).

2. **batch:** This processor batches together traces before sending them on to the exporters, improving efficiency.

3. **servicegraph:** This processor is specifically designed for creating a service graph from the traces. It is configured with certain parameters for handling latency histogram buckets, dimensions, store configurations, and so on.
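A sketch of these three processors, using the limits quoted above; the `servicegraph` parameters shown are illustrative assumptions rather than the demo's actual values:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 800
  batch:
  servicegraph:
    metrics_exporter: prometheus/servicegraph          # assumed exporter name
    latency_histogram_buckets: [100ms, 250ms, 1s, 10s] # illustrative buckets
    store:                                             # illustrative store settings
      ttl: 2s
      max_items: 200
```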

**Exporters (otlp, logging, spanmetrics, otlp/2):**
After processing, the data is sent to the configured exporters; a wiring sketch follows the list:

1. **otlp:** This exporter sends data to the endpoint `jaeger:4317` over OTLP, with the TLS setting in insecure mode (no transport encryption).

2. **logging:** This exporter logs the traces.

3. **spanmetrics:** This exporter is defined as a connector in the configuration; it derives metrics from spans and feeds them into the metrics pipeline, where it appears as the `spanmetrics` receiver.

4. **otlp/2:** This exporter sends data to the endpoint `data-prepper:21890` over OTLP, with the TLS setting in insecure mode (no transport encryption).
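Putting the traces pipeline together, a sketch of the exporters and the service wiring (endpoints are taken from the descriptions above; the `spanmetrics` connector declaration and the exact processor order are assumptions):

```yaml
exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true
  otlp/2:
    endpoint: data-prepper:21890
    tls:
      insecure: true
  logging:

connectors:
  spanmetrics:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, servicegraph]
      exporters: [otlp, logging, spanmetrics, otlp/2]
```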

### Metrics
**Metrics Pipeline**

This pipeline handles metric data.

- **Receivers (otlp, spanmetrics):**

Metric data comes in from the `otlp` receiver and the `spanmetrics` receiver.
- **Processors (filter, memory_limiter, batch):**
The data is then processed:
1. **filter:** This processor excludes specific metrics. In this configuration, it is set to strictly exclude the `queueSize` metric.
2. **memory_limiter:** Similar to the traces pipeline, this processor ensures memory usage doesn't exceed a certain limit.
3. **batch:** This processor batches together metrics before sending them to the exporters, enhancing efficiency.

- **Exporters (prometheus, logging):**
The processed data is then exported (see the sketch below):
1. **prometheus:** This exporter sends metrics to the endpoint `otelcol:9464`. It also converts resource information to Prometheus labels and enables OpenMetrics.
2. **logging:** This exporter logs the metrics.
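A sketch of the metrics pipeline combining the filter rule and the exporter settings described above (field names follow the collector's `filter` and `prometheus` components; values are the ones quoted in this section):

```yaml
processors:
  filter:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - queueSize

exporters:
  prometheus:
    endpoint: otelcol:9464
    resource_to_telemetry_conversion:
      enabled: true            # resource attributes become Prometheus labels
    enable_open_metrics: true

service:
  pipelines:
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [filter, memory_limiter, batch]
      exporters: [prometheus, logging]
```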

### Logs

**Logs Pipeline**

This pipeline handles log data.

- **Receiver (otlp):**

Log data comes in from the otlp receiver.
- **Processors (memory_limiter, batch):**
The data is then processed:
1. **memory_limiter:** Similar to the traces and metrics pipelines, this processor ensures memory usage doesn't exceed a certain limit.
2. **batch:** This processor batches together logs before sending them to the exporter, enhancing efficiency.

- **Exporter (logging):**

The processed data is then exported:
1. **logging:** This exporter logs the log records.
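A sketch of the logs pipeline wiring implied by this section:

```yaml
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [logging]
```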