diff --git a/AUTHORS b/AUTHORS
index 2a4d786..a495042 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -1,4 +1,4 @@
 # aci-exporter authors
 # List alphabetically by surname
-Anders Håål
\ No newline at end of file
+Anders Håål
\ No newline at end of file
diff --git a/README.md b/README.md
index c9f4d20..8cc0311 100644
--- a/README.md
+++ b/README.md
@@ -4,38 +4,53 @@ aci-exporter - A Cisco ACI Prometheus exporter
 # Overview
 The aci-exporter provide metrics from a Cisco ACI fabric by using the ACI Rest API against ACPI controller(s).
+The exporter also has the capability to directly scrape individual spines and leafs using the aci-exporter's inbuilt
+http based service discovery. Doing direct spine and leaf queries is typically useful in very large fabrics, where doing all
+api calls through the apic can put a high load on the apic and result in high response times.
+
+The aci-exporter has been tested on a fabric with more than 500 spines and leafs. To achieve this the exporter uses a
+number of key features:
+- Dynamic service discovery of all spine and leaf nodes in the fabric
+- Using node queries to scrape individual spine and leaf nodes
+- Parallel page requests when queries include the `order-by` statement
 The exporter can return data both in the [Prometheus](https://prometheus.io/) and the [Openmetrics](https://openmetrics.io/) (v1) exposition format.
-The metrics that are exported is configured by definitions of a query. The query can be of any supported ACI class.
+The metrics that are exported are configured by definitions of queries. A query can be of any supported ACI class.
+
+The exporter is written in Go and is a single binary with no dependencies.
 ![Dashboard example](images/aci_obf.png)
+>If you are looking for a complete way to monitor your ACI fabric, including the aci-exporter, Prometheus,
+Loki, and Grafana, you should check out [ACI Monitor Stack](https://github.com/datacenter/aci-monitoring-stack).
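As a quick illustration of the scrape flow described above, a minimal sketch of building the probe URL (the port 9643 is the exporter's default used elsewhere in this README, and `cisco_sandbox` is the example fabric name from the configuration examples; both are assumptions here):

```shell
# Build the probe URL for a fabric named "cisco_sandbox", asking only for the
# built-in "faults" query.
probe_url='http://localhost:9643/probe?target=cisco_sandbox&queries=faults'
echo "$probe_url"
```

Fetching this URL, e.g. with `curl -s`, returns the metrics for that fabric in the Prometheus exposition format.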
+
 # How to configure queries
 The exporter provides three types of query configuration:
 - Class queries - one query, many metrics - These are applicable where one query can result in multiple metric names sharing the same labels.
-A good example is queries on interfaces, ethpmPhysIf, that results in metrics for speed, state, etc.
+A good example is queries on interfaces, class ethpmPhysIf, that result in metrics for speed, state, etc.
 - Group class queries - multiple queries, one metric - These are applicable when multiple queries result in a single
-metrics name but with configured, common and uniq labels.
+metric name with common and unique labels.
 Example of this is the metric `health`, where all the different objects health require different queries, but they are all health. So instead of xyz_health it becomes health and some label with value xyz.
 - Compound queries - multiple queries, one metric and fixed labels - These are applicable where multiple queries result
-in single metric name with configured labels. This is typical when counting different entities with
-`?rsp-subtree-include=count` since no labels are returned that can be used for labels.
+in a single metric name with configured labels. This is typical when counting different entities with a filter like
+`?rsp-subtree-include=count`. Since no labels are returned, fixed labels are used.
 There also some so-called built-in queries. These are hard coded queries.
 > Example of queries can be found in the `example-config.yaml` file.
 > Make sure you understand the ACI api before changing or creating new ones.
+# High level features
 ## Configuration directory (Since version 0.7.0)
-In addition to configure all queries in the configuration file they can also be configured in different files in the
+In addition to configuring all queries in the configuration file, they can also be configured in different files in the
 configuration directory.
This is by default the directory `config.d` located in the same directory as the configuration file.
 Instead of having all queries in a single file it is possible to divide by type and/or purpose.
@@ -102,7 +117,8 @@ The export has some standard metric "built-in". These are:
 The configuration should by default be in the file `config.yaml`. It is also an option to place `class_queries`, `compound_queries` and/or `group_class_queries` in different files in a directory, a directory by default named `config.d` that is in the same directory path as the configuration file.
-> The name of the directory can be changed using the `-config_dir` argument.
+> The name of the directory can be changed using the `-config_dir` argument, the `config_dir: ..` entry in the config
+> file, or by using environment variables.
 If queries has the same name they will be overridden by the order they are parsed and finally query name in the configuration file, default, `config.yaml` will have the highest priority.
@@ -211,7 +227,7 @@ of children objets, and for all the children classes we like to get the `hiAlarm
 ```yaml
 value_name: ethpmDOMStats.children.[.*].attributes.hiAlarm
 ```
-The `.*` will be substituted with the children class name. So that means it can also be used as a label like:
+The `.*` will be substituted with the children class name. That means it can also be used as a label like:
 ```yaml
 labels:
   # this will be the child class name
@@ -423,6 +439,190 @@ The aci-exporter will attach the following labels to all metrics
 - `aci` the name of the ACI. This is done by an API call.
 - `fabric` the name of the configuration.
+
+# Use aci-exporter in large fabric setups (since 0.8.0)
+In large fabrics the aci-exporter provides a way to distribute the api calls to the individual spine and leaf nodes
+instead of using a single apic (or multiple behind a LB).
+This configuration depends on the aci-exporter's dynamic service discovery used by Prometheus.
The discovery detects all
+the current nodes in the fabric, including the apics, based on the `topSystems` class. To collect metrics the same
+`/probe` api is used with the addition of the query parameter `node` that is set to the spine or leaf node
+to scrape.
+
+## Service discovery
+The service discovery is exposed on the `/sd` endpoint where the query parameter `target` is the name of a fabric in the
+`config.yaml` file, e.g. `'http://localhost:9643/sd?target=xyz'`. The output can look like this:
+
+```json
+  ....
+  },
+  {
+    "targets": [
+      "sydney#172.16.0.68"
+    ],
+    "labels": {
+      "__meta_aci_exporter_fabric": "sydney",
+      "__meta_address": "10.3.96.66",
+      "__meta_dn": "topology/pod-2/node-202/sys",
+      "__meta_fabricDomain": "fab2",
+      "__meta_fabricId": "1",
+      "__meta_id": "202",
+      "__meta_inbMgmtAddr": "0.0.0.0",
+      "__meta_name": "leaf202",
+      "__meta_nameAlias": "",
+      "__meta_nodeType": "unspecified",
+      "__meta_oobMgmtAddr": "172.16.0.68",
+      "__meta_podId": "2",
+      "__meta_role": "leaf",
+      "__meta_serial": "FDO2442054U",
+      "__meta_siteId": "2",
+      "__meta_state": "in-service",
+      "__meta_version": "n9000-16.0(5h)"
+    }
+  },
+  .....
+```
+As described in the example above, `targets` is by default set to the fabric name defined in the aci-exporter
+config.yaml (exposed as the label `__meta_aci_exporter_fabric`) and the `oobMgmtAddr` field (the label `__meta_oobMgmtAddr`), separated with a `#` character.
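A minimal shell sketch of splitting a discovered target of the default form, using the example target from the output above:

```shell
# Split a discovered target of the default form fabric#oobMgmtAddr.
target='sydney#172.16.0.68'
fabric="${target%%#*}"   # part before the first '#'
node="${target#*#}"      # part after the first '#'
echo "$fabric $node"
```

In practice Prometheus performs the equivalent split using relabeling, as the configuration that follows shows.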
+The `#` character can be used in the prometheus config as a separator to get both query parameters needed to access a
+single node of a spine or leaf:
+```yaml
+  relabel_configs:
+    - source_labels: [ __meta_role ]
+      # Only run this job for spine and leaf roles
+      regex: "(spine|leaf)"
+      action: "keep"
+
+    # Get the target param from __address__ that is fabric#oobMgmtAddr by default
+    - source_labels: [ __address__ ]
+      separator: "#"
+      regex: (.*)#(.*)
+      replacement: "$1"
+      target_label: __param_target
+
+    # Get the node param from __address__ that is fabric#oobMgmtAddr by default
+    - source_labels: [ __address__ ]
+      separator: "#"
+      regex: (.*)#(.*)
+      replacement: "$2"
+      target_label: __param_node
+```
+
+If the endpoint is called without a query parameter, service discovery is done for all configured fabrics.
+The discovery response can now be used in the prometheus configuration as described in the example file
+[`prometheus/prometheus_nodes.yml`](prometheus/prometheus_nodes.yml).
+
+> In the directory `config_node.d` there is a selection of queries that work for node based queries.
+
+What the service discovery returns is highly configurable, both for the targets and the labels
+returned.
+
+Overriding the defaults can be done for all fabrics, but also for individual fabrics. The individual configuration always
+takes precedence.
+
+For the targets the default is to return the fabric name and `oobMgmtAddr`, but if all fabrics instead use the `inbMgmtAddr`
+for access, this can be changed in the `config.yaml`:
+
+```yaml
+# Common service discovery
+service_discovery:
+  target_format: "%s#%s"
+  target_fields:
+    - aci_exporter_fabric
+    - inbMgmtAddr
+```
+
+For each fabric the discovery can also be overridden using the same definitions as above but on the fabric level.
+```yaml
+fabrics:
+  # This is the Cisco provided sandbox that is open for testing
+  cisco_sandbox:
+    username: admin
+    password:
+    apic:
+      - https://sandboxapicdc.cisco.com
+    service_discovery:
+      target_format: "%s#%s"
+      target_fields:
+        - aci_exporter_fabric
+        - inbMgmtAddr
+```
+
+All fields returned by the `topSystems` class query can be used as targets and labels.
+
+## Fabric service discovery
+The service discovery will also return entries for the configured aci-exporter fabrics themselves, with
+the following content:
+
+```json
+  {
+    "targets": [
+      "sydney"
+    ],
+    "labels": {
+      "__meta_fabricDomain": "fab2",
+      "__meta_role": "aci_exporter_fabric"
+    }
+  }
+
+```
+
+This can now be used from the prometheus configuration to do the "classic" apic queries like:
+```yaml
+  - job_name: 'aci'
+    scrape_interval: 1m
+    scrape_timeout: 30s
+    metrics_path: /probe
+    params:
+      queries:
+        - health,fabric_node_info,object_count,max_capacity
+
+    http_sd_configs:
+      - url: "http://localhost:9643/sd"
+        refresh_interval: 5m
+
+    relabel_configs:
+      - source_labels: [ __meta_role ]
+        regex: "aci_exporter_fabric"
+        action: "keep"
+
+      - source_labels: [ __address__ ]
+        target_label: __param_target
+      - source_labels: [ __param_target ]
+        target_label: instance
+      - target_label: __address__
+        replacement: 127.0.0.1:9643
+```
+Please review the [`prometheus/prometheus_nodes.yml`](prometheus/prometheus_nodes.yml) example. With discovery there is
+no need for any static configuration, and only two job configurations are needed to manage all configured aci fabrics.
+
+## Configure node queries
+There is no difference in how a node query is configured in the aci-exporter compared to an apic query, except:
+1. Not all queries are supported on the node
+2. When extracting label values there is no information about the node id or pod id. These must be managed by
+discovery and relabeling, see [`prometheus/prometheus_nodes.yml`](prometheus/prometheus_nodes.yml)
+3.
The resulting DN is different between the apic api and the node api. From the apic we typically do label extraction using
+```yaml
+  labels:
+    # The field in the json used to parse the labels from
+    - property_name: ethpmPhysIf.attributes.dn
+      # The regex where the name enclosed in (?P<...>) is the label name
+      regex: "^topology/pod-(?P<podid>[1-9][0-9]*)/node-(?P<nodeid>[1-9][0-9]*)/sys/phys-\\[(?P<interface>[^\\]]+)\\]/"
+```
+In the above the topology path is part of the response. But for a node based query the same would be:
+```yaml
+  labels:
+    # The field in the json used to parse the labels from
+    - property_name: ethpmPhysIf.attributes.dn
+      # The regex where the name enclosed in (?P<...>) is the label name
+      regex: "^sys/phys-\\[(?P<interface>[^\\]]+)\\]/"
+```
+> As mentioned above, the podid and nodeid are added to the time series using Prometheus relabeling.
+
+4. Node queries must have named queries in the Prometheus config
+
+> It is highly recommended to do direct spine and leaf node queries if the fabric is large, both in the number of nodes
+> and in the number of objects in the fabric.
+> Most queries should be possible to do directly on the nodes.
+
 # Configuration
 > For configuration options please see the `example-config.yml` file.
@@ -466,7 +666,7 @@ The metrics created by the aci-exporter is controlled by the following attribute
 - `name` the name of the metric
 - `type` the type of the metric, if not set it will default to gauge. If the type is a counter the metric name will be postfix with `_total`
-- `unit` a base unit like bytes, seconds etc. If defined the metrics name will be postfixed with the unit
+- `unit` a base unit like bytes, seconds etc.
If defined the metric name will be postfixed with the unit
 - `help` the description text of the metrics, if not set it will default to `Missing description`
 With the following settings:
@@ -483,6 +683,34 @@
 The metric output will be like:
 # TYPE aci_uptime_seconds_total counter
 aci_uptime_seconds_total{.......} 98657
 ```
+## Paging support
+For large fabrics the response latency can increase, and even the max number of response items may not be enough. For these large
+fabrics it is possible to use paged requests, where the aci-exporter fetches the response page by page.
+
+Paging is **only** supported if the query has been specified with an `order-by` on the class `dn`, like:
+```yaml
+class_queries:
+  bgp_peers:
+    class_name: bgpPeer
+    query_parameter: '?order-by=bgpPeer.dn&rsp-subtree=children&rsp-subtree-class=bgpPeerEntry'
+    ....
+```
+
+The paged request is by default done sequentially, but parallel paging is supported. To use parallel
+paging, the following configuration can be done in the configuration file:
+```yaml
+httpclient:
+  # this is the max and also the default value
+  pagesize: 1000
+  # enable parallel paging, default is false
+  parallel_paging: true
+```
+It is also possible to set the configuration through environment variables:
+```shell
+ACI_EXPORTER_HTTPCLIENT_PAGESIZE=1000
+ACI_EXPORTER_HTTPCLIENT_PARALLEL_PAGING=true
+```
+
 ## Metric output formatting
 There is a number of options to control the output format. The configuration related to the formatting is defined in
 the `metric_format` section of the configuration file.
@@ -519,6 +747,7 @@ faulty configuration. They will just not be part of the metric output.
 Any access failures to apic[s] are written to the log.
 # Installation
+Get the latest release from the [release page](https://github.com/opsdis/aci-exporter/releases).
 ## Build
 ```shell
@@ -556,9 +785,14 @@ The target is a named fabric in the configuration file.
 There is also possible to run a limited number of queries by using the query parameter `queries`.
This should be a comma separated list of the query names in the config file. It may also contain built-in query names.
+
 ```shell
 curl -s 'http://localhost:9643/probe?target=cisco_sandbox&queries=node_health,faults'
 ```
+In addition to queries as a comma separated list, it is also possible to repeat `queries` as a query parameter.
+```shell
+curl -s 'http://localhost:9643/probe?target=cisco_sandbox&queries=node_health&queries=faults'
+```
 ## Run in standalone query mode (beta and may change in future releases)
 It is possible to run the aci-exporter in a standalone query mode. This mode enable to run a APIC query against
@@ -577,7 +811,9 @@ To get the metrics in openmetrics format use the header `Accept: application/ope
 Please see the example file prometheus/prometheus.yml.
 # Docker
-The aci-export can be build and run as a docker container and it supports multi-arch.
+Pre-built Docker images are available on [packages](https://github.com/opsdis/aci-exporter/pkgs/container/aci-exporter).
+
+The aci-exporter can be built and run as a Docker container, and it supports multi-arch.
 ```shell
 docker buildx build . -t regystry/aci-exporter:Version --platform=linux/arm64,linux/amd64 --push
 ```
@@ -595,6 +831,9 @@ Just change `ACI_EXPORTER_CONFIG` to use different configuration files.
 Thanks to https://github.com/RavuAlHemio/prometheus_aci_exporter for the inspiration of the configuration of queries.
 Please check out that project especially if you like to contribute to a Python project.
+Special thanks to [camrossi](https://github.com/camrossi) for his deep knowledge of Cisco ACI, all valuable ideas, and
+endless testing.
+
 # License
 This work is licensed under the GNU GENERAL PUBLIC LICENSE Version 3.
diff --git a/aci-api.go b/aci-api.go
index ccd92fb..5fff373 100644
--- a/aci-api.go
+++ b/aci-api.go
@@ -8,8 +8,6 @@
 // GNU General Public License for more details.
 // You should have received a copy of the GNU General Public License
 // along with this program.
If not, see . -// -// Copyright 2020-2023 Opsdis package main @@ -19,7 +17,6 @@ import ( "errors" "fmt" "strconv" - "strings" "time" "github.com/umisama/go-regexpcache" @@ -32,60 +29,31 @@ import ( var arrayExtension = regexpcache.MustCompile("^(?P.*)\\.\\[(?P.*)\\](?P.*)") -func newAciAPI(ctx context.Context, fabricConfig *Fabric, configQueries AllQueries, queryFilter string) *aciAPI { - - executeQueries := configQueries - queryArray := strings.Split(queryFilter, ",") - if queryArray[0] != "" { - // If there are some queries named - executeQueries.ClassQueries = ClassQueries{} - executeQueries.CompoundClassQueries = CompoundClassQueries{} - executeQueries.GroupClassQueries = GroupClassQueries{} - // Find the named queries for the different type - for _, queryName := range queryArray { - for configQueryName := range configQueries.ClassQueries { - if queryName == configQueryName { - executeQueries.ClassQueries[configQueryName] = configQueries.ClassQueries[configQueryName] - } - } - for k := range configQueries.CompoundClassQueries { - if queryName == k { - executeQueries.CompoundClassQueries[k] = configQueries.CompoundClassQueries[k] - } - } - for k := range configQueries.GroupClassQueries { - if queryName == k { - executeQueries.GroupClassQueries[k] = configQueries.GroupClassQueries[k] - } - } - } - } else { - // Use all configured - executeQueries = configQueries - } +func newAciAPI(ctx context.Context, fabricConfig *Fabric, configQueries AllQueries, queryArray []string, node *string) *aciAPI { + executeQueries := queriesToExecute(configQueries, queryArray) api := &aciAPI{ ctx: ctx, - connection: newAciConnection(ctx, fabricConfig), + connection: newAciConnection(fabricConfig, node), metricPrefix: viper.GetString("prefix"), configQueries: executeQueries.ClassQueries, configCompoundQueries: executeQueries.CompoundClassQueries, configGroupQueries: executeQueries.GroupClassQueries, - confgBuiltInQueries: BuiltinQueries{}, + configBuiltInQueries: BuiltinQueries{}, } 
// Make sure all built in queries are handled - if queryArray[0] != "" { + if queryArray != nil { // If query parameter queries is used for _, v := range queryArray { if v == "faults" { - api.confgBuiltInQueries["faults"] = api.faults + api.configBuiltInQueries["faults"] = api.faults } // Add all other builtin with if statements } } else { // If query parameter queries is NOT used, include all - api.confgBuiltInQueries["faults"] = api.faults + api.configBuiltInQueries["faults"] = api.faults } return api @@ -98,7 +66,38 @@ type aciAPI struct { configQueries ClassQueries configCompoundQueries CompoundClassQueries configGroupQueries GroupClassQueries - confgBuiltInQueries BuiltinQueries + configBuiltInQueries BuiltinQueries +} + +func queriesToExecute(configQueries AllQueries, queryArray []string) AllQueries { + if queryArray == nil { + // Default is all configured queries to execute + return configQueries + } + executeQueries := AllQueries{} + executeQueries.ClassQueries = ClassQueries{} + executeQueries.CompoundClassQueries = CompoundClassQueries{} + executeQueries.GroupClassQueries = GroupClassQueries{} + + // Find the named queries for the different type + for _, queryName := range queryArray { + for configQueryName := range configQueries.ClassQueries { + if queryName == configQueryName { + executeQueries.ClassQueries[configQueryName] = configQueries.ClassQueries[configQueryName] + } + } + for k := range configQueries.CompoundClassQueries { + if queryName == k { + executeQueries.CompoundClassQueries[k] = configQueries.CompoundClassQueries[k] + } + } + for k := range configQueries.GroupClassQueries { + if queryName == k { + executeQueries.GroupClassQueries[k] = configQueries.GroupClassQueries[k] + } + } + } + return executeQueries } // CollectMetrics Gather all aci metrics and return name of the aci fabric, slice of metrics and status of @@ -107,7 +106,7 @@ func (p aciAPI) CollectMetrics() (string, []MetricDefinition, error) { var metrics []MetricDefinition start 
:= time.Now() - err := p.connection.login() + err := p.connection.login(p.ctx) // defer p.connection.logout() if err != nil { @@ -145,9 +144,9 @@ func (p aciAPI) CollectMetrics() (string, []MetricDefinition, error) { metrics = append(metrics, *p.scrape(end.Seconds())) metrics = append(metrics, *p.up(1.0)) log.WithFields(log.Fields{ - "requestid": p.ctx.Value("requestid"), - "exec_time": end.Microseconds(), - "fabric": fmt.Sprintf("%v", p.ctx.Value("fabric")), + LogFieldRequestID: p.ctx.Value(LogFieldRequestID), + LogFieldExecTime: end.Microseconds(), + LogFieldFabric: fmt.Sprintf("%v", p.ctx.Value(LogFieldFabric)), }).Info("total scrape time ") return aciName, metrics, nil } @@ -192,11 +191,11 @@ func (p aciAPI) up(state float64) *MetricDefinition { func (p aciAPI) configuredBuiltInMetrics(chall chan []MetricDefinition) { var metricDefinitions []MetricDefinition ch := make(chan []MetricDefinition) - for _, fun := range p.confgBuiltInQueries { + for _, fun := range p.configBuiltInQueries { go fun(ch) } - for range p.confgBuiltInQueries { + for range p.configBuiltInQueries { metricDefinitions = append(metricDefinitions, <-ch...) 
} @@ -204,11 +203,11 @@ func (p aciAPI) configuredBuiltInMetrics(chall chan []MetricDefinition) { } func (p aciAPI) faults(ch chan []MetricDefinition) { - data, err := p.connection.getByQuery("faults") + data, err := p.connection.GetByQuery(p.ctx, "faults") if err != nil { log.WithFields(log.Fields{ - "requestid": p.ctx.Value("requestid"), - "fabric": fmt.Sprintf("%v", p.ctx.Value("fabric")), + LogFieldRequestID: p.ctx.Value(LogFieldRequestID), + LogFieldFabric: fmt.Sprintf("%v", p.ctx.Value(LogFieldFabric)), }).Error("faults not supported", err) ch <- nil return @@ -308,11 +307,15 @@ func (p aciAPI) faults(ch chan []MetricDefinition) { } func (p aciAPI) getAciName() (string, error) { + // Do not query aci name when query a node + if p.connection.Node != nil { + return "", nil + } if p.connection.fabricConfig.AciName != "" { return p.connection.fabricConfig.AciName, nil } - data, err := p.connection.getByClassQuery("infraCont", "?query-target=self") + data, err := p.connection.GetByClassQuery(p.ctx, "infraCont", "?query-target=self") if err != nil { return "", err @@ -350,7 +353,7 @@ func (p aciAPI) getCompoundMetrics(ch chan []MetricDefinition, v *CompoundClassQ var metrics []Metric for _, classLabel := range v.ClassNames { metric := Metric{} - data, _ := p.connection.getByClassQuery(classLabel.Class, classLabel.QueryParameter) + data, _ := p.connection.GetByClassQuery(p.ctx, classLabel.Class, classLabel.QueryParameter) if classLabel.ValueName == "" { metric.Value = p.toFloat(gjson.Get(data, fmt.Sprintf("imdata.0.%s", v.Metrics[0].ValueName)).Str) } else { @@ -441,12 +444,12 @@ func (p aciAPI) getGroupClassMetrics(ch chan []MetricDefinition, v GroupClassQue func (p aciAPI) getClassMetrics(ch chan []MetricDefinition, v *ClassQuery) { var metricDefinitions []MetricDefinition - data, err := p.connection.getByClassQuery(v.ClassName, v.QueryParameter) + data, err := p.connection.GetByClassQuery(p.ctx, v.ClassName, v.QueryParameter) if err != nil { 
log.WithFields(log.Fields{ - "requestid": p.ctx.Value("requestid"), - "fabric": fmt.Sprintf("%v", p.ctx.Value("fabric")), + LogFieldRequestID: p.ctx.Value(LogFieldRequestID), + LogFieldFabric: fmt.Sprintf("%v", p.ctx.Value(LogFieldFabric)), }).Error(fmt.Sprintf("%s not supported", v.ClassName), err) ch <- nil return @@ -630,16 +633,15 @@ func (p aciAPI) toRatio(value string) float64 { func (p aciAPI) toFloat(value string) float64 { rate, err := strconv.ParseFloat(value, 64) if err != nil { - // if the value a date time convert to timestamp + // if the value is a date time convert to timestamp t, err := time.Parse(time.RFC3339, value) - rate = float64(t.Unix()) if err != nil { log.WithFields(log.Fields{ "value": value, }).Info("could not convert value to float, will return 0.0 ") return 0.0 } - + rate = float64(t.Unix()) } return rate } diff --git a/aci-client.go b/aci-client.go new file mode 100644 index 0000000..cb66ac1 --- /dev/null +++ b/aci-client.go @@ -0,0 +1,431 @@ +// This program is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. +// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// You should have received a copy of the GNU General Public License +// along with this program. If not, see . 
+ +package main + +import ( + "bytes" + "context" + "encoding/json" + "fmt" + "io" + "net/http" + "strings" + "time" + + log "github.com/sirupsen/logrus" + "github.com/spf13/viper" + "github.com/tidwall/gjson" +) + +type AciClient interface { + Get(ctx context.Context, url string) ([]byte, int, error) +} + +func NewAciClient(client http.Client, headers map[string]string, token *AciToken, fabricName string, url string) AciClient { + + if strings.Contains(url, "order-by") { + if viper.GetBool("HTTPClient.parallel_paging") { + return &AciClientParallelPage{ + Client: client, + Headers: headers, + Token: token, + FabricName: fabricName, + PageSize: viper.GetInt("HTTPClient.pagesize"), + } + } + return &AciClientSequentialPage{ + Client: client, + Headers: headers, + Token: token, + FabricName: fabricName, + PageSize: viper.GetInt("HTTPClient.pagesize"), + } + } + + return &AciClientSequential{ + Client: client, + Headers: headers, + Token: token, + FabricName: fabricName, + } +} + +type AciClientSequential struct { + Client http.Client + Headers map[string]string + Token *AciToken + FabricName string +} + +func (acs *AciClientSequential) Get(ctx context.Context, url string) ([]byte, int, error) { + req, err := http.NewRequest("GET", url, bytes.NewBuffer([]byte{})) + if err != nil { + log.WithFields(log.Fields{ + LogFieldRequestID: ctx.Value(LogFieldRequestID), + LogFieldFabric: fmt.Sprintf("%v", acs.FabricName), + }).Error(err) + return nil, 0, err + } + for k, v := range acs.Headers { + req.Header.Set(k, v) + } + + cookie := http.Cookie{ + Name: HeaderAPICCookie, + Value: acs.Token.token, + Path: "", + Domain: "", + Expires: time.Time{}, + RawExpires: "", + MaxAge: 0, + Secure: false, + HttpOnly: false, + SameSite: 0, + Raw: "", + Unparsed: nil, + } + + req.AddCookie(&cookie) + + resp, err := acs.Client.Do(req) + if err != nil { + log.WithFields(log.Fields{ + LogFieldRequestID: ctx.Value(LogFieldRequestID), + LogFieldFabric: fmt.Sprintf("%v", acs.FabricName), + 
}).Error(err) + return nil, 0, err + } + + defer resp.Body.Close() + + if resp.StatusCode == http.StatusOK { + bodyBytes, err := io.ReadAll(resp.Body) + if err != nil { + log.WithFields(log.Fields{ + LogFieldRequestID: ctx.Value(LogFieldRequestID), + LogFieldFabric: fmt.Sprintf("%v", acs.FabricName), + }).Error(err) + return nil, resp.StatusCode, err + } + + return bodyBytes, resp.StatusCode, nil + } + return nil, resp.StatusCode, fmt.Errorf(ACIApiReturnedStatusCode, resp.StatusCode) +} + +type ACIResponse struct { + TotalCount uint64 `json:"totalCount"` + ImData []map[string]interface{} `json:"imdata"` +} + +type AciClientSequentialPage struct { + Client http.Client + Headers map[string]string + Token *AciToken + FabricName string + PageSize int +} + +func (acsp *AciClientSequentialPage) Get(ctx context.Context, url string) ([]byte, int, error) { + + aciResponse := ACIResponse{ + TotalCount: 0, + ImData: make([]map[string]interface{}, 0, acsp.PageSize), + } + + pagedUrl := "" + + // do a single call to get totalCount + if strings.Contains(url, "?") { + pagedUrl = "%s&page-size=%d&page=%d" + } else { + pagedUrl = "%s?page-size=%d&page=%d" + } + + // First request to determine the total count + bodyBytes, status, err := acsp.getPage(ctx, url, pagedUrl, 0) + if err != nil { + return nil, status, err + } + + aciResponse.TotalCount = gjson.Get(string(bodyBytes), "totalCount").Uint() + _ = json.Unmarshal(bodyBytes, &aciResponse) + + numberOfPages := aciResponse.TotalCount / uint64(acsp.PageSize) + + for ii := 1; ii < int(numberOfPages)+1; ii++ { + bodyBytes, status, err = acsp.getPage(ctx, url, pagedUrl, ii) + if err != nil { + return nil, status, err + } + + aciResponsePage := ACIResponse{ + TotalCount: 0, + ImData: make([]map[string]interface{}, 0, acsp.PageSize), + } + + _ = json.Unmarshal(bodyBytes, &aciResponsePage) + aciResponse.ImData = append(aciResponse.ImData, aciResponsePage.ImData...) 
+	}
+
+	data, _ := json.Marshal(aciResponse)
+
+	return data, status, nil
+
+}
+
+func (acsp *AciClientSequentialPage) getPage(ctx context.Context, url string, pagedUrl string, page int) ([]byte, int, error) {
+	req, err := http.NewRequest("GET", fmt.Sprintf(pagedUrl, url, acsp.PageSize, page), bytes.NewBuffer([]byte{}))
+	if err != nil {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acsp.FabricName),
+		}).Error(err)
+		return nil, 0, err
+	}
+	for k, v := range acsp.Headers {
+		req.Header.Set(k, v)
+	}
+
+	cookie := http.Cookie{
+		Name:       HeaderAPICCookie,
+		Value:      acsp.Token.token,
+		Path:       "",
+		Domain:     "",
+		Expires:    time.Time{},
+		RawExpires: "",
+		MaxAge:     0,
+		Secure:     false,
+		HttpOnly:   false,
+		SameSite:   0,
+		Raw:        "",
+		Unparsed:   nil,
+	}
+
+	req.AddCookie(&cookie)
+
+	resp, err := acsp.Client.Do(req)
+	if err != nil {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acsp.FabricName),
+		}).Error(err)
+		return nil, 0, err
+	}
+
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acsp.FabricName),
+			"status":          resp.StatusCode,
+		}).Error(ErrMsgInvalidStatusCode)
+		return nil, resp.StatusCode, fmt.Errorf(ACIApiReturnedStatusCode, resp.StatusCode)
+	}
+
+	bodyBytes, err := io.ReadAll(resp.Body)
+	if err != nil {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acsp.FabricName),
+		}).Error(err)
+		return nil, resp.StatusCode, err
+	}
+	return bodyBytes, resp.StatusCode, nil
+}
+
+type AciClientParallelPage struct {
+	Client     http.Client
+	Headers    map[string]string
+	Token      *AciToken
+	FabricName string
+	PageSize   int
+}
+
+func (acpp *AciClientParallelPage) Get(ctx context.Context, url string) ([]byte, int, error) {
+
+	aciResponse := ACIResponse{
+		TotalCount: 0,
+		ImData:     make([]map[string]interface{}, 0, acpp.PageSize),
+	}
+
+	pagedUrl := ""
+
+	// do a single call to get totalCount
+	if strings.Contains(url, "?") {
+		pagedUrl = "%s&page-size=%d&page=%d"
+	} else {
+		pagedUrl = "%s?page-size=%d&page=%d"
+	}
+
+	// First request to determine the total count
+	bodyBytes, status, err := acpp.getPage(ctx, url, pagedUrl, 0)
+	if err != nil {
+		return nil, status, err
+	}
+
+	aciResponse.TotalCount = gjson.Get(string(bodyBytes), "totalCount").Uint()
+	_ = json.Unmarshal(bodyBytes, &aciResponse)
+
+	numberOfPages := aciResponse.TotalCount / uint64(acpp.PageSize)
+	ch := make(chan ACIResponse)
+	for ii := 1; ii < int(numberOfPages)+1; ii++ {
+		go acpp.getParallelPage(ctx, url, pagedUrl, ii, ch)
+		log.Info(fmt.Sprintf("Send page %d", ii))
+	}
+	for i := 1; i < int(numberOfPages)+1; i++ {
+		comm := <-ch
+		for _, imData := range comm.ImData {
+			aciResponse.ImData = append(aciResponse.ImData, imData)
+		}
+		log.Info(fmt.Sprintf("Fetched page %d", i))
+	}
+
+	data, _ := json.Marshal(aciResponse)
+
+	return data, status, nil
+
+}
+
+func (acpp *AciClientParallelPage) getPage(ctx context.Context, url string, pagedUrl string, page int) ([]byte, int, error) {
+	req, err := http.NewRequest("GET", fmt.Sprintf(pagedUrl, url, acpp.PageSize, page), bytes.NewBuffer([]byte{}))
+	if err != nil {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acpp.FabricName),
+		}).Error(err)
+		return nil, 0, err
+	}
+	for k, v := range acpp.Headers {
+		req.Header.Set(k, v)
+	}
+
+	cookie := http.Cookie{
+		Name:       HeaderAPICCookie,
+		Value:      acpp.Token.token,
+		Path:       "",
+		Domain:     "",
+		Expires:    time.Time{},
+		RawExpires: "",
+		MaxAge:     0,
+		Secure:     false,
+		HttpOnly:   false,
+		SameSite:   0,
+		Raw:        "",
+		Unparsed:   nil,
+	}
+
+	req.AddCookie(&cookie)
+
+	resp, err := acpp.Client.Do(req)
+	if err != nil {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acpp.FabricName),
+		}).Error(err)
+		return nil, 0, err
+	}
+
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acpp.FabricName),
+			"status":          resp.StatusCode,
+		}).Error(ErrMsgInvalidStatusCode)
+		return nil, resp.StatusCode, fmt.Errorf(ACIApiReturnedStatusCode, resp.StatusCode)
+	}
+
+	bodyBytes, err := io.ReadAll(resp.Body)
+	if err != nil {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acpp.FabricName),
+		}).Error(err)
+		return nil, resp.StatusCode, err
+	}
+	return bodyBytes, resp.StatusCode, nil
+}
+
+func (acpp *AciClientParallelPage) getParallelPage(ctx context.Context, url string, pagedUrl string, page int, ch chan ACIResponse) {
+	aciResponse := ACIResponse{
+		TotalCount: 0,
+		ImData:     make([]map[string]interface{}, 0, acpp.PageSize),
+	}
+
+	req, err := http.NewRequest("GET", fmt.Sprintf(pagedUrl, url, acpp.PageSize, page), bytes.NewBuffer([]byte{}))
+	if err != nil {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acpp.FabricName),
+		}).Error(err)
+		ch <- aciResponse
+		return
+	}
+	for k, v := range acpp.Headers {
+		req.Header.Set(k, v)
+	}
+
+	cookie := http.Cookie{
+		Name:       HeaderAPICCookie,
+		Value:      acpp.Token.token,
+		Path:       "",
+		Domain:     "",
+		Expires:    time.Time{},
+		RawExpires: "",
+		MaxAge:     0,
+		Secure:     false,
+		HttpOnly:   false,
+		SameSite:   0,
+		Raw:        "",
+		Unparsed:   nil,
+	}
+
+	req.AddCookie(&cookie)
+
+	resp, err := acpp.Client.Do(req)
+	if err != nil {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acpp.FabricName),
+		}).Error(err)
+		ch <- aciResponse
+		return
+	}
+
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acpp.FabricName),
+			"status":          resp.StatusCode,
+		}).Error(ErrMsgInvalidStatusCode)
+		ch <- aciResponse
+		return
+	}
+
+	bodyBytes, err := io.ReadAll(resp.Body)
+	if err != nil {
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", acpp.FabricName),
+		}).Error(err)
+		ch <- aciResponse
+		return
+	}
+	_ = json.Unmarshal(bodyBytes, &aciResponse)
+	ch <- aciResponse
+	return
+}
diff --git a/aci-connection.go b/aci-connection.go
index 1cc77d0..af9615a 100644
--- a/aci-connection.go
+++ b/aci-connection.go
@@ -8,8 +8,6 @@
 // GNU General Public License for more details.
 // You should have received a copy of the GNU General Public License
 // along with this program. If not, see .
-//
-// Copyright 2020-2023 Opsdis
 
 package main
 
@@ -64,7 +62,6 @@ type AciToken struct {
 
 // AciConnection is the connection object
 type AciConnection struct {
-	ctx              context.Context
 	fabricConfig     *Fabric
 	activeController *int
 	URLMap           map[string]string
@@ -72,14 +69,24 @@ type AciConnection struct {
 	Client     http.Client
 	token      *AciToken
 	tokenMutex sync.Mutex
-	//responseTime *prometheus.HistogramVec
+	// If a node query this is set to the instance
+	Node *string
 }
 
-var connectionCache = make(map[*Fabric]*AciConnection)
+var connectionCache = make(map[string]*AciConnection)
 
-func newAciConnection(ctx context.Context, fabricConfig *Fabric) *AciConnection {
+// cacheName returns a unique name for the connection. Every connection is unique per fabric and node with own
+// cache entry
+func cacheName(aciName string, node *string) string {
+	if node == nil {
+		return aciName
+	}
+	return aciName + *node
+}
 
-	val, ok := connectionCache[fabricConfig]
+func newAciConnection(fabricConfig *Fabric, node *string) *AciConnection {
+	// Check if we have a connection in the cache
+	val, ok := connectionCache[cacheName(fabricConfig.FabricName, node)]
 	if ok {
 		return val
 	}
@@ -102,36 +109,43 @@ func newAciConnection(ctx context.Context, fabricConfig *Fabric) *AciConnection
 	urlMap["faults"] = "/api/class/faultCountsWithDetails.json"
 
 	con := &AciConnection{
-		ctx:              ctx,
 		fabricConfig:     fabricConfig,
 		activeController: new(int),
 		URLMap:           urlMap,
 		Headers:          headers,
 		Client:           *httpClient,
-		//responseTime: responseTime,
+		Node:             node,
 	}
-	connectionCache[fabricConfig] = con
-	return connectionCache[fabricConfig]
+	connectionCache[cacheName(fabricConfig.FabricName, node)] = con
+	return connectionCache[cacheName(fabricConfig.FabricName, node)]
 }
 
 // login get the existing token if valid or do a full /login
-func (c *AciConnection) login() error {
+func (c *AciConnection) login(ctx context.Context) error {
 
-	err, done := c.tokenProcessing()
+	err, done := c.tokenProcessing(ctx)
 	if done {
 		return err
 	}
-	return c.loginProcessing()
+	return c.loginProcessing(ctx)
 }
 
 // loginProcessing do a full /login
-func (c *AciConnection) loginProcessing() error {
+func (c *AciConnection) loginProcessing(ctx context.Context) error {
 	c.tokenMutex.Lock()
 	defer c.tokenMutex.Unlock()
+	if c.Node != nil {
+		return c.nodeLogin(ctx)
+	} else {
+		return c.apicLogin(ctx)
+	}
+}
+
+func (c *AciConnection) apicLogin(ctx context.Context) error {
 	for i, controller := range c.fabricConfig.Apic {
-		response, status, err := c.doPostJSON("login", fmt.Sprintf("%s%s", controller, c.URLMap["login"]),
+		response, status, err := c.doPostJSON(ctx, "login", fmt.Sprintf("%s%s", controller, c.URLMap["login"]),
 			[]byte(fmt.Sprintf("{\"aaaUser\":{\"attributes\":{\"name\":\"%s\",\"pwd\":\"%s\"}}}", c.fabricConfig.Username, c.fabricConfig.Password)))
 
 		if err != nil || status != 200 {
@@ -139,18 +153,20 @@ func (c *AciConnection) loginProcessing() error {
 			err = fmt.Errorf("failed to login to %s, try next apic", controller)
 
 			log.WithFields(log.Fields{
-				"requestid": c.ctx.Value("requestid"),
-				"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
-				"token":     fmt.Sprintf("login"),
+				LogFieldRequestID: ctx.Value(LogFieldRequestID),
+				"fabric":          fmt.Sprintf("%v", c.fabricConfig.FabricName),
+				"token":           "login",
+				"controller":      controller,
 			}).Error(err)
 		} else {
 			c.newToken(response)
 			*c.activeController = i
 			log.WithFields(log.Fields{
-				"requestid": c.ctx.Value("requestid"),
-				"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
-				"token":     fmt.Sprintf("login"),
+				LogFieldRequestID: ctx.Value(LogFieldRequestID),
+				LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
+				"token":           "login",
+				"controller":      controller,
 			}).Info(fmt.Sprintf("Using apic %s", controller))
 			return nil
 		}
@@ -158,48 +174,72 @@ func (c *AciConnection) loginProcessing() error {
 	return fmt.Errorf("failed to login to any apic controllers")
 }
 
+func (c *AciConnection) nodeLogin(ctx context.Context) error {
+	// Node query
+	response, status, err := c.doPostJSON(ctx, "login", fmt.Sprintf("%s%s", *c.Node, c.URLMap["login"]),
+		[]byte(fmt.Sprintf("{\"aaaUser\":{\"attributes\":{\"name\":\"%s\",\"pwd\":\"%s\"}}}", c.fabricConfig.Username, c.fabricConfig.Password)))
+
+	if err != nil || status != 200 {
+		err = fmt.Errorf("failed to login to %s", *c.Node)
+		log.WithFields(log.Fields{
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
+			"token":           fmt.Sprintf("login"),
+			"node":            *c.Node,
+		}).Error(err)
+		return fmt.Errorf("failed to login to node")
+	}
+
+	c.newToken(response)
+	log.WithFields(log.Fields{
+		LogFieldRequestID: ctx.Value(LogFieldRequestID),
+		LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
+		"token":           fmt.Sprintf("login"),
+		"node":            *c.Node,
+	}).Info(fmt.Sprintf("Using node"))
+	return nil
+}
+
 // tokenProcessing if token are valid reuse or try to do a /refresh
-func (c *AciConnection) tokenProcessing() (error, bool) {
+func (c *AciConnection) tokenProcessing(ctx context.Context) (error, bool) {
 	if c.token != nil {
 		c.tokenMutex.Lock()
 		defer c.tokenMutex.Unlock()
 		if c.token.lifetime < time.Now().Unix() {
 			log.WithFields(log.Fields{
-				"requestid": c.ctx.Value("requestid"),
-				"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
-				"token":     fmt.Sprintf("lifetime"),
+				LogFieldRequestID: ctx.Value(LogFieldRequestID),
+				LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
+				"token":           fmt.Sprintf("lifetime"),
 			}).Info("token reached lifetime seconds")
 			return nil, false
 		} else if c.token.expire < time.Now().Unix() {
-			response, status, err := c.get("refresh", fmt.Sprintf("%s%s", c.fabricConfig.Apic[*c.activeController], c.URLMap["refresh"]))
+			response, status, err := c.get(ctx, "refresh", fmt.Sprintf("%s%s", c.fabricConfig.Apic[*c.activeController], c.URLMap["refresh"]))
 			if err != nil || status != 200 {
-				//errRe = fmt.Errorf("failed to refresh token %s", c.fabricConfig.Apic[*c.activeController])
 				log.WithFields(log.Fields{
-					"requestid": c.ctx.Value("requestid"),
-					"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
-					"token":     fmt.Sprintf("refresh"),
+					LogFieldRequestID: ctx.Value(LogFieldRequestID),
+					LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
+					"token":           fmt.Sprintf("refresh"),
 				}).Warning(err)
 				refreshFailedMetric.With(prometheus.Labels{
-					"fabric": fmt.Sprintf("%v", c.ctx.Value("fabric"))}).Inc()
+					LogFieldFabric: fmt.Sprintf("%v", c.fabricConfig.FabricName)}).Inc()
 				return err, false
 			} else {
 				c.refreshToken(response)
 				log.WithFields(log.Fields{
-					"requestid": c.ctx.Value("requestid"),
-					"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
-					"token":     fmt.Sprintf("refresh"),
+					LogFieldRequestID: ctx.Value(LogFieldRequestID),
+					LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
+					"token":           fmt.Sprintf("refresh"),
 				}).Info("refresh token")
 				refreshMetric.With(prometheus.Labels{
-					"fabric": fmt.Sprintf("%v", c.ctx.Value("fabric"))}).Inc()
+					LogFieldFabric: fmt.Sprintf("%v", c.fabricConfig.FabricName)}).Inc()
 				return nil, true
 			}
 		} else {
 			log.WithFields(log.Fields{
-				"requestid": c.ctx.Value("requestid"),
-				"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
+				LogFieldRequestID: ctx.Value(LogFieldRequestID),
+				LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
 				"token": fmt.Sprintf("valid"),
-				"valid_time_seconds": c.token.expire - time.Now().Unix(),
-			}).Info("token still valid")
+				"valid_time_seconds": c.token.expire - time.Now().Unix()}).Debug("token still valid")
 			return nil, true
 		}
 	}
@@ -232,73 +272,80 @@ func (c *AciConnection) refreshToken(response []byte) {
 	}
 }
 
-func (c *AciConnection) logout() bool {
-	_, status, err := c.doPostJSON("logout", fmt.Sprintf("%s%s", c.fabricConfig.Apic[*c.activeController], c.URLMap["logout"]),
-		[]byte(fmt.Sprintf("{\"aaaUser\":{\"attributes\":{\"name\":\"%s\"}}}", c.fabricConfig.Username)))
-	if err != nil || status != 200 {
-		log.WithFields(log.Fields{
-			"requestid": c.ctx.Value("requestid"),
-			"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
-		}).Error(err)
-		return false
-	}
-	return true
-}
-
-func (c *AciConnection) getByQuery(table string) (string, error) {
-	data, _, err := c.get(table, fmt.Sprintf("%s%s", c.fabricConfig.Apic[*c.activeController], c.URLMap[table]))
+func (c *AciConnection) GetByQuery(ctx context.Context, table string) (string, error) {
+	data, _, err := c.get(ctx, table, fmt.Sprintf("%s%s", c.fabricConfig.Apic[*c.activeController], c.URLMap[table]))
 	if err != nil {
 		log.WithFields(log.Fields{
-			"requestid": c.ctx.Value("requestid"),
-			"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
		}).Error(fmt.Sprintf("Request %s failed - %s.", c.URLMap[table], err))
 		return "", err
 	}
 	return string(data), nil
 }
 
-func (c *AciConnection) getByClassQuery(class string, query string) (string, error) {
-	data, _, err := c.get(class, fmt.Sprintf("%s/api/class/%s.json%s", c.fabricConfig.Apic[*c.activeController], class, query))
-	if err != nil {
-		log.WithFields(log.Fields{
-			"requestid": c.ctx.Value("requestid"),
-			"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
-		}).Error(fmt.Sprintf("Class request %s failed - %s.", class, err))
-		return "", err
+func (c *AciConnection) GetByClassQuery(ctx context.Context, class string, query string) (string, error) {
+	if c.Node == nil {
+		// A apic query
+		data, _, err := c.get(ctx, class, fmt.Sprintf("%s/api/class/%s.json%s", c.fabricConfig.Apic[*c.activeController], class, query))
+		if err != nil {
+			log.WithFields(log.Fields{
+				LogFieldRequestID: ctx.Value(LogFieldRequestID),
+				LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
+			}).Error(fmt.Sprintf("Class request %s failed - %s.", class, err))
+			return "", err
+		}
+		return string(data), nil
+	} else {
+		// A node query
+		data, _, err := c.get(ctx, class, fmt.Sprintf("%s/api/class/%s.json%s", *c.Node, class, query))
+		if err != nil {
+			log.WithFields(log.Fields{
+				LogFieldRequestID: ctx.Value(LogFieldRequestID),
+				LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
+				"node":            c.Node,
+			}).Error(fmt.Sprintf("Class request %s failed - %s.", class, err))
+			return "", err
+		}
+		return string(data), nil
 	}
-	return string(data), nil
 }
 
-func (c *AciConnection) get(label string, url string) ([]byte, int, error) {
+func (c *AciConnection) get(ctx context.Context, label string, url string) ([]byte, int, error) {
 	start := time.Now()
-	body, status, err := c.doGet(url)
+	//body, status, err := c.doGet(ctx, url)
+
+	aciClient := NewAciClient(c.Client, c.Headers, c.token, c.fabricConfig.FabricName, url)
+
+	body, status, err := aciClient.Get(ctx, url)
+
 	responseTime := time.Since(start).Seconds()
 	responseTimeMetric.With(prometheus.Labels{
-		"fabric": fmt.Sprintf("%v", c.ctx.Value("fabric")),
-		"class":  label,
-		"method": "GET",
-		"status": strconv.Itoa(status)}).Observe(responseTime)
+		LogFieldFabric: fmt.Sprintf("%v", c.fabricConfig.FabricName),
+		"class":        label,
+		"method":       "GET",
+		"status":       strconv.Itoa(status)}).Observe(responseTime)
 
 	log.WithFields(log.Fields{
-		"method":    "GET",
-		"uri":       url,
-		"class":     label,
-		"status":    status,
-		"length":    len(body),
-		"requestid": c.ctx.Value("requestid"),
-		"exec_time": time.Since(start).Microseconds(),
-		"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
+		"method":          "GET",
+		"uri":             url,
+		"class":           label,
+		"status":          status,
+		"length":          len(body),
+		LogFieldRequestID: ctx.Value(LogFieldRequestID),
+		LogFieldExecTime:  time.Since(start).Microseconds(),
+		LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
	}).Info("api call fabric")
 	return body, status, err
 }
 
-func (c *AciConnection) doGet(url string) ([]byte, int, error) {
+func (c *AciConnection) doGet(ctx context.Context, url string) ([]byte, int, error) {
 	req, err := http.NewRequest("GET", url, bytes.NewBuffer([]byte{}))
 	if err != nil {
 		log.WithFields(log.Fields{
-			"requestid": c.ctx.Value("requestid"),
-			"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
		}).Error(err)
 		return nil, 0, err
 	}
@@ -326,8 +373,8 @@ func (c *AciConnection) doGet(url string) ([]byte, int, error) {
 	resp, err := c.Client.Do(req)
 	if err != nil {
 		log.WithFields(log.Fields{
-			"requestid": c.ctx.Value("requestid"),
-			"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
		}).Error(err)
 		return nil, 0, err
 	}
@@ -338,8 +385,8 @@ func (c *AciConnection) doGet(url string) ([]byte, int, error) {
 	bodyBytes, err := io.ReadAll(resp.Body)
 	if err != nil {
 		log.WithFields(log.Fields{
-			"requestid": c.ctx.Value("requestid"),
-			"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
		}).Error(err)
 		return nil, resp.StatusCode, err
 	}
@@ -349,13 +396,13 @@ func (c *AciConnection) doGet(url string) ([]byte, int, error) {
 	return nil, resp.StatusCode, fmt.Errorf("ACI api returned %d", resp.StatusCode)
 }
 
-func (c *AciConnection) doPostJSON(label string, url string, requestBody []byte) ([]byte, int, error) {
+func (c *AciConnection) doPostJSON(ctx context.Context, label string, url string, requestBody []byte) ([]byte, int, error) {
 
 	req, err := http.NewRequest("POST", url, bytes.NewBuffer(requestBody))
 	if err != nil {
 		log.WithFields(log.Fields{
-			"requestid": c.ctx.Value("requestid"),
-			"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
		}).Error(err)
 		return nil, 0, err
 	}
@@ -370,8 +417,8 @@ func (c *AciConnection) doPostJSON(label string, url string, requestBody []byte)
 
 	if err != nil {
 		log.WithFields(log.Fields{
-			"requestid": c.ctx.Value("requestid"),
-			"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
		}).Error(err)
 		return nil, 0, err
 	}
@@ -379,18 +426,18 @@ func (c *AciConnection) doPostJSON(label string, url string, requestBody []byte)
 
 	var status = resp.StatusCode
 	responseTimeMetric.With(prometheus.Labels{
-		"fabric": fmt.Sprintf("%v", c.ctx.Value("fabric")),
-		"class":  label,
-		"method": "POST",
-		"status": strconv.Itoa(status)}).Observe(responseTime)
+		LogFieldFabric: fmt.Sprintf("%v", c.fabricConfig.FabricName),
+		"class":        label,
+		"method":       "POST",
+		"status":       strconv.Itoa(status)}).Observe(responseTime)
 
 	log.WithFields(log.Fields{
-		"method":    "POST",
-		"uri":       url,
-		"status":    status,
-		"requestid": c.ctx.Value("requestid"),
-		"exec_time": time.Since(start).Microseconds(),
-		"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
+		"method":          "POST",
+		"uri":             url,
+		"status":          status,
+		LogFieldRequestID: ctx.Value(LogFieldRequestID),
+		LogFieldExecTime:  time.Since(start).Microseconds(),
+		LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
	}).Info("api call fabric")
 
 	defer resp.Body.Close()
@@ -399,8 +446,8 @@ func (c *AciConnection) doPostJSON(label string, url string, requestBody []byte)
 	bodyBytes, err := io.ReadAll(resp.Body)
 	if err != nil {
 		log.WithFields(log.Fields{
-			"requestid": c.ctx.Value("requestid"),
-			"fabric":    fmt.Sprintf("%v", c.ctx.Value("fabric")),
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldFabric:    fmt.Sprintf("%v", c.fabricConfig.FabricName),
		}).Error(err)
 		return nil, resp.StatusCode, err
 	}
diff --git a/aci-exporter.go b/aci-exporter.go
index 2a0ac76..d66425d 100644
--- a/aci-exporter.go
+++ b/aci-exporter.go
@@ -8,16 +8,16 @@
 // GNU General Public License for more details.
 // You should have received a copy of the GNU General Public License
 // along with this program. If not, see .
-//
-// Copyright 2020-2023 Opsdis
 
 package main
 
 import (
 	"context"
+	"encoding/json"
 	"flag"
 	"fmt"
 	"net/http/pprof"
+	"net/url"
 	"os"
 	"path/filepath"
 	"strconv"
@@ -36,6 +36,16 @@ import (
 	"github.com/spf13/viper"
 )
 
+// Common constants
+const (
+	HeaderAPICCookie         = "APIC-cookie"
+	ErrMsgInvalidStatusCode  = "Not a valid status code"
+	LogFieldRequestID        = "requestid"
+	LogFieldFabric           = "fabric"
+	LogFieldExecTime         = "exec_time"
+	ACIApiReturnedStatusCode = "ACI api returned %d"
+)
+
 type loggingResponseWriter struct {
 	http.ResponseWriter
 	statusCode int
@@ -56,6 +66,16 @@ func (lrw *loggingResponseWriter) Write(b []byte) (int, error) {
 	return n, err
 }
 
+func isFlagPassed(name string) bool {
+	found := false
+	flag.Visit(func(f *flag.Flag) {
+		if f.Name == name {
+			found = true
+		}
+	})
+	return found
+}
+
 var version = "undefined"
 
 func main() {
@@ -71,7 +91,7 @@ func main() {
 	flag.Int("p", viper.GetInt("port"), "The port to start on")
 	logFile := flag.String("logfile", viper.GetString("logfile"), "Set log file, default stdout")
 	logFormat := flag.String("logformat", viper.GetString("logformat"), "Set log format to text or json, default json")
-	logLevel := flag.String("loglevel", viper.GetString("loglevel"), "Set log log level, default info")
+	logLevel := flag.String("loglevel", viper.GetString("loglevel"), "Set log level, default info")
 	config := flag.String("config", viper.GetString("config"), "Set configuration file, default config.yaml")
 	usage := flag.Bool("u", false, "Show usage")
 	writeConfig := flag.Bool("default", false, "Write default config named aci_exporter_default_config.yaml. If config.d directory exist all queries will be merged into single file.")
@@ -91,6 +111,7 @@ func main() {
 		fmt.Printf("aci-exporter, version %s\n", version)
 		os.Exit(0)
 	}
+
 	log.SetFormatter(&log.JSONFormatter{})
 	if *logFormat == "text" {
 		log.SetFormatter(&log.TextFormatter{})
@@ -128,7 +149,12 @@ func main() {
 	}
 
 	if *cli {
-		fmt.Printf("%s", cliQuery(fabric, class, query))
+		data, err := cliQuery(context.TODO(), fabric, class, query)
+		if err != nil {
+			fmt.Printf("Error %s", err)
+			os.Exit(1)
+		}
+		fmt.Printf("%s", data)
 		os.Exit(0)
 	}
 
@@ -158,6 +184,39 @@ func main() {
 		os.Exit(1)
 	}
 
+	// Check if the arguments for loglevel, logfile and logformat is set on command line
+	// if not use the values from the config file if exists or defaults
+	if !isFlagPassed("loglevel") {
+		level, err := log.ParseLevel(viper.GetString("loglevel"))
+		if err != nil {
+			log.Error(fmt.Sprintf("Not supported log level - %s", err))
+			os.Exit(1)
+		}
+		log.SetLevel(level)
+	}
+
+	if !isFlagPassed("logfile") {
+		if viper.GetString("logfile") != "" {
+			f, err := os.OpenFile(viper.GetString("logfile"), os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
+			if err != nil {
+				log.Error(fmt.Sprintf("Error open logfile %s - %s", viper.GetString("logfile"), err))
+				os.Exit(1)
+			}
+			log.SetOutput(f)
+		}
+	}
+
+	if !isFlagPassed("logformat") {
+		if viper.GetString("logFormat") == "text" {
+			log.SetFormatter(&log.TextFormatter{})
+		}
+	}
+
+	if !isFlagPassed("config_dir") {
+		dirName := viper.GetString("config_dir")
+		configDirName = &dirName
+	}
+
 	// Read all config from config file and directory
 	var queries = AllQueries{}
 
@@ -179,21 +238,17 @@ func main() {
 
 	err = viper.UnmarshalKey("qroup_class_queries", &queries.GroupClassQueries)
 	if err != nil {
-		log.Error("Unable to decode compound_queries into struct - ", err)
+		log.Error("Unable to decode qroup_class_queries into struct - ", err)
 		os.Exit(1)
 	}
 
-	err = viper.UnmarshalKey("group_class_queries", &queries.GroupClassQueries)
-	if err != nil {
-		log.Error("Unable to decode compound_queries into struct - ", err)
-		os.Exit(1)
-	}
 	allQueries := AllQueries{
 		ClassQueries:         queries.ClassQueries,
 		CompoundClassQueries: queries.CompoundClassQueries,
 		GroupClassQueries:    queries.GroupClassQueries,
 	}
 
+	// Init all fabrics
 	allFabrics := make(map[string]*Fabric)
 
 	err = viper.UnmarshalKey("fabrics", &allFabrics)
@@ -202,11 +257,27 @@ func main() {
 		os.Exit(1)
 	}
 
+	// Init discovery settings
+	for fabricName := range allFabrics {
+		if allFabrics[fabricName].DiscoveryConfig.TargetFields == nil {
+			allFabrics[fabricName].DiscoveryConfig.TargetFields = viper.GetStringSlice("service_discovery.target_fields")
+		}
+		if allFabrics[fabricName].DiscoveryConfig.LabelsKeys == nil {
+			allFabrics[fabricName].DiscoveryConfig.LabelsKeys = viper.GetStringSlice("service_discovery.labels")
+		}
+		if allFabrics[fabricName].DiscoveryConfig.TargetFormat == "" {
+			allFabrics[fabricName].DiscoveryConfig.TargetFormat = viper.GetString("service_discovery.target_format")
+		}
+	}
 
 	// Overwrite username or password for APIC by environment variables if set
 	for fabricName := range allFabrics {
 		fabricEnv(fabricName, allFabrics)
 	}
 
+	for fabricName, fabric := range allFabrics {
+		fabric.FabricName = fabricName
+	}
+
 	if val, exists := os.LookupEnv(fmt.Sprintf("%s_FABRIC_NAMES", ExporterNameAsEnv())); exists == true && val != "" {
 		for _, fabricName := range strings.Split(val, ",") {
 			fabricEnv(fabricName, allFabrics)
@@ -215,7 +286,7 @@ func main() {
 
 	for fabricName := range allFabrics {
 		log.WithFields(log.Fields{
-			"fabric": fabricName,
+			LogFieldFabric: fabricName,
		}).Info("Configured fabric")
 	}
 
@@ -233,6 +304,7 @@ func main() {
 	// Setup handler for aci destinations
 	http.Handle("/probe", logCall(promMonitor(http.HandlerFunc(handler.getMonitorMetrics), responseTime, "/probe")))
 	http.Handle("/alive", logCall(promMonitor(http.HandlerFunc(alive), responseTime, "/alive")))
+	http.Handle("/sd", logCall(promMonitor(http.HandlerFunc(handler.discovery), responseTime, "/sd")))
 
 	// Setup handler for exporter metrics
 	http.Handle("/metrics", promhttp.HandlerFor(
@@ -323,38 +395,39 @@ func fabricEnv(fabricName string, allFabrics map[string]*Fabric) {
 	}
 }
 
-func cliQuery(fabric *string, class *string, query *string) string {
+func cliQuery(ctx context.Context, fabric *string, class *string, query *string) (string, error) {
 	err := viper.ReadInConfig()
 	if err != nil {
 		log.Error("Configuration file not valid - ", err)
-		os.Exit(1)
+		return "", err
 	}
 	username := viper.GetString(fmt.Sprintf("fabrics.%s.username", *fabric))
 	password := viper.GetString(fmt.Sprintf("fabrics.%s.password", *fabric))
 	apicControllers := viper.GetStringSlice(fmt.Sprintf("fabrics.%s.apic", *fabric))
 	aciName := viper.GetString(fmt.Sprintf("fabrics.%s.aci_name", *fabric))
 
-	fabricConfig := Fabric{Username: username, Password: password, Apic: apicControllers, AciName: aciName}
-	ctx := context.TODO()
-	con := *newAciConnection(ctx, &fabricConfig)
-	err = con.login()
+	fabricConfig := Fabric{Username: username, Password: password, Apic: apicControllers, FabricName: *fabric, AciName: aciName}
+
+	con := newAciConnection(&fabricConfig, nil)
+	err = con.login(ctx)
 	if err != nil {
 		fmt.Printf("Login error %s", err)
-		return ""
+		return "", err
 	}
 
-	defer con.logout()
+
 	var data string
-	if string((*query)[0]) != "?" {
-		data, err = con.getByClassQuery(*class, fmt.Sprintf("?%s", *query))
+	if len(*query) > 0 && string((*query)[0]) != "?" {
+		data, err = con.GetByClassQuery(ctx, *class, fmt.Sprintf("?%s", *query))
 	} else {
-		data, err = con.getByClassQuery(*class, *query)
+		data, err = con.GetByClassQuery(ctx, *class, *query)
 	}
 	if err != nil {
 		fmt.Printf("Error %s", err)
+		return "", err
 	}
 
-	return fmt.Sprintf("%s", data)
+	return fmt.Sprintf("%s", data), nil
 }
 
 type HandlerInit struct {
@@ -362,6 +435,66 @@ type HandlerInit struct {
 	AllFabrics map[string]*Fabric
 }
 
+func (h HandlerInit) discovery(w http.ResponseWriter, r *http.Request) {
+	ctx := r.Context()
+
+	fabric := r.URL.Query().Get("target")
+	if fabric != strings.ToLower(fabric) {
+		w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
+		w.Header().Set("Content-Length", "0")
+		log.WithFields(log.Fields{
+			LogFieldFabric: fabric,
+		}).Warning("fabric target must be in lower case")
+		lrw := loggingResponseWriter{ResponseWriter: w}
+		lrw.WriteHeader(400)
+		return
+	}
+
+	if fabric != "" {
+		_, ok := h.AllFabrics[fabric]
+		if !ok {
+			w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
+			w.Header().Set("Content-Length", "0")
+			log.WithFields(log.Fields{
+				LogFieldFabric: fabric,
+			}).Warning("fabric target do not exists")
+			lrw := loggingResponseWriter{ResponseWriter: w}
+			lrw.WriteHeader(404)
+			return
+		}
+	}
+
+	discovery := Discovery{
+		Fabric:  fabric,
+		Fabrics: h.AllFabrics,
+	}
+
+	lrw := loggingResponseWriter{ResponseWriter: w}
+
+	serviceDiscoveries, err := discovery.DoDiscovery(ctx)
+	if err != nil || len(serviceDiscoveries) == 0 {
+		lrw.WriteHeader(http.StatusServiceUnavailable)
+		return
+	}
+
+	w.Header().Set("Content-Type", "application/json; charset=utf-8")
+	lrw.WriteHeader(http.StatusOK)
+	enc := json.NewEncoder(w)
+	enc.SetIndent("", " ")
+	if err := enc.Encode(serviceDiscoveries); err != nil {
+		lrw.WriteHeader(http.StatusInternalServerError)
+		return
+	}
+}
+
+func formatQueries(queries string) string {
+	var trimQueries string
+	trimQueries = strings.ReplaceAll(queries, " ", "")
+	trimQueries = strings.ReplaceAll(trimQueries, "\n", "")
+	trimQueries = strings.ReplaceAll(trimQueries, "\t", "")
+	return trimQueries
+}
+
 func (h HandlerInit) getMonitorMetrics(w http.ResponseWriter, r *http.Request) {
 
 	openmetrics := false
@@ -370,14 +503,42 @@ func (h HandlerInit) getMonitorMetrics(w http.ResponseWriter, r *http.Request) {
 		openmetrics = true
 	}
 
+	var node *string
 	fabric := r.URL.Query().Get("target")
-	queries := r.URL.Query().Get("queries")
+	queryArray := r.URL.Query()["queries"]
+	nodeName := r.URL.Query().Get("node")
+
+	if nodeName != "" {
+		// Check if the nodeName is a valid url if not append https://
+		if queryArray == nil {
+			lrw := loggingResponseWriter{ResponseWriter: w}
+			lrw.WriteHeader(400)
+			return
+		}
+		_, err := url.ParseRequestURI(nodeName)
+		if err != nil {
+			nodeName = fmt.Sprintf("https://%s", nodeName)
+		}
+		node = &nodeName
+	} else {
+		node = nil
+	}
+
+	// Handle multiple queries
+	var queries []string
+	for _, queryString := range queryArray {
+		// If the queries query parameter include a comma, split it and add to the queries array
+		querySplit := strings.Split(queryString, ",")
+		for _, query := range querySplit {
+			queries = append(queries, strings.TrimSpace(query))
+		}
+	}
 
 	if fabric != strings.ToLower(fabric) {
 		w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
 		w.Header().Set("Content-Length", "0")
 		log.WithFields(log.Fields{
-			"fabric": fabric,
+			LogFieldFabric: fabric,
		}).Warning("fabric target must be in lower case")
 		lrw := loggingResponseWriter{ResponseWriter: w}
 		lrw.WriteHeader(400)
@@ -390,7 +551,7 @@ func (h HandlerInit) getMonitorMetrics(w http.ResponseWriter, r *http.Request) {
 		w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
 		w.Header().Set("Content-Length", "0")
 		log.WithFields(log.Fields{
-			"fabric": fabric,
+			LogFieldFabric: fabric,
		}).Warning("fabric target do not exists")
 		lrw := loggingResponseWriter{ResponseWriter: w}
 		lrw.WriteHeader(404)
@@ -398,15 +559,15 @@ func (h HandlerInit) getMonitorMetrics(w http.ResponseWriter, r *http.Request) {
 	}
 
 	ctx := r.Context()
-	ctx = context.WithValue(ctx, "fabric", fabric)
-	api := *newAciAPI(ctx, h.AllFabrics[fabric], h.AllQueries, queries)
+	ctx = context.WithValue(ctx, LogFieldFabric, fabric)
+	api := newAciAPI(ctx, h.AllFabrics[fabric], h.AllQueries, queries, node)
 
 	start := time.Now()
 	aciName, metrics, err := api.CollectMetrics()
 	log.WithFields(log.Fields{
-		"requestid": ctx.Value("requestid"),
-		"exec_time": time.Since(start).Microseconds(),
-		"fabric":    fmt.Sprintf("%v", ctx.Value("fabric")),
+		LogFieldRequestID: ctx.Value(LogFieldRequestID),
+		LogFieldExecTime:  time.Since(start).Microseconds(),
+		LogFieldFabric:    fmt.Sprintf("%v", ctx.Value(LogFieldFabric)),
	}).Info("total query collection time")
 
 	commonLabels := make(map[string]string)
@@ -419,9 +580,9 @@ func (h HandlerInit) getMonitorMetrics(w http.ResponseWriter, r *http.Request) {
 	var bodyText = Metrics2Prometheus(metrics, api.metricPrefix, commonLabels, metricsFormat)
 
 	log.WithFields(log.Fields{
-		"requestid": ctx.Value("requestid"),
-		"exec_time": time.Since(start).Microseconds(),
-		"fabric":    fmt.Sprintf("%v", ctx.Value("fabric")),
+		LogFieldRequestID: ctx.Value(LogFieldRequestID),
+		LogFieldExecTime:  time.Since(start).Microseconds(),
+		LogFieldFabric:    fmt.Sprintf("%v", ctx.Value(LogFieldFabric)),
	}).Info("metrics to prometheus format")
 
 	if openmetrics {
@@ -438,7 +599,8 @@ func (h HandlerInit) getMonitorMetrics(w http.ResponseWriter, r *http.Request) {
 	if err != nil {
 		lrw.WriteHeader(503)
 	}
-	w.Write([]byte(bodyText))
+	_, _ = w.Write([]byte(bodyText))
+
 	return
 }
 
@@ -450,7 +612,7 @@ func alive(w http.ResponseWriter, r *http.Request) {
 
 	lrw := loggingResponseWriter{ResponseWriter: w}
 	lrw.WriteHeader(200)
-	w.Write([]byte(alive))
+	_, _ = w.Write([]byte(alive))
 }
 
 func nextRequestID() ksuid.KSUID {
@@ -465,18 +627,18 @@ func logCall(next http.Handler) http.Handler {
 		lrw := loggingResponseWriter{ResponseWriter: w}
 		requestId := nextRequestID()
 
-		ctx := context.WithValue(r.Context(), "requestid", requestId)
+		ctx := context.WithValue(r.Context(), LogFieldRequestID, requestId)
 		next.ServeHTTP(&lrw, r.WithContext(ctx)) // call original
 		w.Header().Set("Content-Length", strconv.Itoa(lrw.length))
 		log.WithFields(log.Fields{
-			"method":    r.Method,
-			"uri":       r.RequestURI,
-			"fabric":    r.URL.Query().Get("target"),
-			"status":    lrw.statusCode,
-			"length":    lrw.length,
-			"requestid": requestId,
-			"exec_time": time.Since(start).Microseconds(),
+			"method":          r.Method,
+			"uri":             r.RequestURI,
+			LogFieldFabric:    r.URL.Query().Get("target"),
+			"status":          lrw.statusCode,
+			"length":          lrw.length,
+			LogFieldRequestID: ctx.Value(LogFieldRequestID),
+			LogFieldExecTime:  time.Since(start).Microseconds(),
		}).Info("api call")
 	})
 }
@@ -485,13 +647,9 @@ func promMonitor(next http.Handler, ops *prometheus.HistogramVec, endpoint strin
 	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		start := time.Now()
-
 		lrw := loggingResponseWriter{ResponseWriter: w}
-
 		next.ServeHTTP(&lrw, r) // call original
-
 		response := time.Since(start).Seconds()
-
 		ops.With(prometheus.Labels{"url": endpoint, "status": strconv.Itoa(lrw.statusCode)}).Observe(response)
 	})
 }
diff --git a/config.d/interface.yaml b/config.d/interface.yaml
index e80f9f4..b7746d0 100644
--- a/config.d/interface.yaml
+++ b/config.d/interface.yaml
@@ -39,6 +39,45 @@ class_queries:
         # The regex where the string enclosed in the P is the label name
         regex: "^topology/pod-(?P[1-9][0-9]*)/node-(?P[1-9][0-9]*)/sys/phys-\\[(?P[^\\]]+)\\]/"
 
+  node_interface_info:
+    # Interface speed and status
+    class_name: ethpmPhysIf
+    metrics:
+      # The name of the metrics without prefix and unit
+      - name: interface_oper_speed
+        value_name: ethpmPhysIf.attributes.operSpeed
+        unit: bps
+        type: gauge
+        help: The current operational speed of the interface, in bits per second.
+        value_transform:
+          'unknown': 0
+          '100M': 100000000
+          '1G': 1000000000
+          '10G': 10000000000
+          '25G': 25000000000
+          '40G': 40000000000
+          '100G': 100000000000
+          '400G': 400000000000
+      - name: interface_oper_state
+        # The field in the json that is used as the metric value, qualified path (gjson) under imdata
+        value_name: ethpmPhysIf.attributes.operSt
+        # Type
+        type: gauge
+        # Help text without prefix of metrics name
+        help: The current operational state of the interface. (0=unknown, 1=down, 2=up, 3=link-up)
+        # A string to float64 transform table of the value
+        value_transform:
+          'unknown': 0
+          'down': 1
+          'up': 2
+          'link-up': 3
+    # The labels to extract as regex
+    labels:
+      # The field in the json used to parse the labels from
+      - property_name: ethpmPhysIf.attributes.dn
+        # The regex where the string enclosed in the P is the label name
+        regex: "^sys/phys-\\[(?P[^\\]]+)\\]/"
+
   interface_rx_stats:
     class_name: eqptIngrBytes5min
     metrics:
diff --git a/config_node.d/interface.yaml b/config_node.d/interface.yaml
new file mode 100644
index 0000000..489a9fd
--- /dev/null
+++ b/config_node.d/interface.yaml
@@ -0,0 +1,147 @@
+class_queries:
+
+  interface_info:
+    # Interface speed and status
+    class_name: ethpmPhysIf
+    metrics:
+      # The name of the metrics without prefix and unit
+      - name: interface_oper_speed
+        value_name: ethpmPhysIf.attributes.operSpeed
+        unit: bps
+        type: gauge
+        help: The current operational speed of the interface, in bits per second.
+        value_transform:
+          'unknown': 0
+          '100M': 100000000
+          '1G': 1000000000
+          '10G': 10000000000
+          '25G': 25000000000
+          '40G': 40000000000
+          '100G': 100000000000
+          '400G': 400000000000
+      - name: interface_oper_state
+        # The field in the json that is used as the metric value, qualified path (gjson) under imdata
+        value_name: ethpmPhysIf.attributes.operSt
+        # Type
+        type: gauge
+        # Help text without prefix of metrics name
+        help: The current operational state of the interface. (0=unknown, 1=down, 2=up, 3=link-up)
+        # A string to float64 transform table of the value
+        value_transform:
+          'unknown': 0
+          'down': 1
+          'up': 2
+          'link-up': 3
+    # The labels to extract as regex
+    labels:
+      # The field in the json used to parse the labels from
+      - property_name: ethpmPhysIf.attributes.dn
+        # The regex where the string enclosed in the P is the label name
+        regex: "^sys/phys-\\[(?P[^\\]]+)\\]/"
+
+  interface_rx_stats:
+    class_name: eqptIngrBytes5min
+    metrics:
+      - name: interface_rx_unicast
+        value_name: eqptIngrBytes5min.attributes.unicastCum
+        type: counter
+        unit: bytes
+        help: The number of unicast bytes received on the interface since it was integrated into the fabric.
+      - name: interface_rx_multicast
+        value_name: eqptIngrBytes5min.attributes.multicastCum
+        type: counter
+        unit: bytes
+        help: The number of multicast bytes received on the interface since it was integrated into the fabric.
+      - name: interface_rx_broadcast
+        value_name: eqptIngrBytes5min.attributes.floodCum
+        type: counter
+        unit: bytes
+        help: The number of broadcast bytes received on the interface since it was integrated into the fabric.
+    labels:
+      - property_name: eqptIngrBytes5min.attributes.dn
+        regex: "^sys/(?P[a-z]+)-\\[(?P[^\\]]+)\\]/"
+
+  interface_tx_stats:
+    class_name: eqptEgrBytes5min
+    metrics:
+      - name: interface_tx_unicast
+        value_name: eqptEgrBytes5min.attributes.unicastCum
+        type: counter
+        unit: bytes
+        help: The number of unicast bytes transmitted on the interface since it was integrated into the fabric.
+      - name: interface_tx_multicast
+        value_name: eqptEgrBytes5min.attributes.multicastCum
+        type: counter
+        unit: bytes
+        help: The number of multicast bytes transmitted on the interface since it was integrated into the fabric.
+      - name: interface_tx_broadcast
+        value_name: eqptEgrBytes5min.attributes.floodCum
+        type: counter
+        unit: bytes
+        help: The number of broadcast bytes transmitted on the interface since it was integrated into the fabric.
+ labels: + - property_name: eqptEgrBytes5min.attributes.dn + regex: "^sys/(?P[a-z]+)-\\[(?P[^\\]]+)\\]/" + + interface_rx_err_stats: + class_name: eqptIngrDropPkts5min + metrics: + - name: interface_rx_buffer_dropped + value_name: eqptIngrDropPkts5min.attributes.bufferCum + type: counter + unit: pkts + help: The number of packets dropped by the interface due to a + buffer overrun while receiving since it was integrated into the + fabric. + - name: interface_rx_error_dropped + value_name: eqptIngrDropPkts5min.attributes.errorCum + type: counter + unit: pkts + help: The number of packets dropped by the interface due to a + packet error while receiving since it was integrated into the + fabric. + - name: interface_rx_forwarding_dropped + value_name: eqptIngrDropPkts5min.attributes.forwardingCum + type: counter + unit: pkts + help: The number of packets dropped by the interface due to a + forwarding issue while receiving since it was integrated into the + fabric. + - name: interface_rx_loadbal_dropped + value_name: eqptIngrDropPkts5min.attributes.lbCum + type: counter + unit: pkts + help: The number of packets dropped by the interface due to a + load balancing issue while receiving since it was integrated into + the fabric. + labels: + - property_name: eqptIngrDropPkts5min.attributes.dn + regex: "^sys/(?P[a-z]+)-\\[(?P[^\\]]+)\\]/" + + interface_tx_err_stats: + class_name: eqptEgrDropPkts5min + metrics: + - name: interface_tx_queue_dropped + value_name: eqptEgrDropPkts5min.attributes.afdWredCum + type: counter + unit: pkts + help: The number of packets dropped by the interface during queue + management while transmitting since it was integrated into the + fabric. + - name: interface_tx_buffer_dropped + value_name: eqptEgrDropPkts5min.attributes.bufferCum + type: counter + unit: pkts + help: The number of packets dropped by the interface due to a + buffer overrun while transmitting since it was integrated into the + fabric. 
+ - name: interface_tx_error_dropped + value_name: eqptEgrDropPkts5min.attributes.errorCum + type: counter + unit: pkts + help: The number of packets dropped by the interface due to a + packet error while transmitting since it was integrated into the + fabric. + labels: + - property_name: eqptEgrDropPkts5min.attributes.dn + regex: "^sys/(?P[a-z]+)-\\[(?P[^\\]]+)\\]/" diff --git a/config_node.d/node.yaml b/config_node.d/node.yaml new file mode 100644 index 0000000..b58ff01 --- /dev/null +++ b/config_node.d/node.yaml @@ -0,0 +1,311 @@ +class_queries: + node_cpu: + class_name: procSysCPU5min + metrics: + - name: node_cpu_user + value_name: procSysCPU5min.attributes.userLast + type: "gauge" + unit: "ratio" + help: "Returns the user space cpu load of a fabric node" + # Recalculate the metrics value. The expression supports simple math expressions - https://github.com/Knetic/govaluate + # The name must be value. + # This example recalculates a percentage like 90 to 0.9 + value_calculation: "value / 100" + - name: node_cpu_kernel + value_name: procSysCPU5min.attributes.kernelLast + type: "gauge" + unit: "ratio" + help: "Returns the kernel space cpu load of a fabric node" + value_calculation: "value / 100" + labels: + - property_name: procSysCPU5min.attributes.dn + regex: "^sys/procsys/CDprocSysCPU5min" + + node_memory: + class_name: procSysMem5min + metrics: + - name: node_memory_used + value_name: procSysMem5min.attributes.usedLast + type: "gauge" + unit: "bytes" + help: "Returns the used memory of a fabric node" + - name: node_memory_free + value_name: procSysMem5min.attributes.freeLast + type: "gauge" + unit: "bytes" + help: "Returns the free memory of a fabric node" + labels: + - property_name: procSysMem5min.attributes.dn + regex: "^sys/procsys/CDprocSysMem5min" + + node_temperature_sup: + class_name: eqptSensor + query_parameter: '?query-target-filter=wcard(eqptSensor.dn,"sup")' + metrics: + - name: node_temperature + value_name:
eqptSensor.attributes.value + type: "gauge" + help: "Returns the temperature by sensor of a fabric node" + labels: + - property_name: eqptSensor.attributes.dn + regex: "^sys/ch/supslot-(?P[1-9][0-9]*)/sup/sensor-(?P[1-9][0-9]*)" + + node_temperature_board: + class_name: eqptSensor + query_parameter: '?query-target-filter=wcard(eqptSensor.dn,"bslot")' + metrics: + - name: node_temperature + value_name: eqptSensor.attributes.value + type: "gauge" + help: "Returns the temperature by sensor of a fabric node" + labels: + - property_name: eqptSensor.attributes.dn + regex: "^sys/ch/bslot/board/sensor-(?P[1-9][0-9]*)" + + fru_power_usage: + class_name: eqptFruPower5min + metrics: + - name: fru_power_drawn_avg + value_name: eqptFruPower5min.attributes.drawnAvg + type: gauge + - name: fru_power_drawn_last + value_name: eqptFruPower5min.attributes.drawnLast + type: gauge + - name: fru_power_drawn_max + value_name: eqptFruPower5min.attributes.drawnMax + type: gauge + - name: fru_power_drawn_min + value_name: eqptFruPower5min.attributes.drawnMin + type: gauge + labels: + - property_name: eqptFruPower5min.attributes.dn + regex: "^sys/ch/(?P.*[0-9]+)/" + + ps_power_usage: + class_name: eqptPsPower5min + metrics: + - name: psu_power_drawn_avg + value_name: eqptPsPower5min.attributes.drawnAvg + - name: psu_power_drawn_last + value_name: eqptPsPower5min.attributes.drawnLast + - name: psu_power_drawn_max + value_name: eqptPsPower5min.attributes.drawnMax + - name: psu_power_drawn_min + value_name: eqptPsPower5min.attributes.drawnMin + - name: psu_power_drawn_base + value_name: eqptPsPower5min.attributes.drawnTrBase + - name: psu_power_drawn_ttl + value_name: eqptPsPower5min.attributes.drawnTtl + - name: psu_supplied_avg + value_name: eqptPsPower5min.attributes.suppliedAvg + - name: psu_supplied_last + value_name: eqptPsPower5min.attributes.suppliedLast + - name: psu_supplied_max + value_name: eqptPsPower5min.attributes.suppliedMax + - name: psu_supplied_min + value_name: 
eqptPsPower5min.attributes.suppliedMin + - name: psu_supplied_base + value_name: eqptPsPower5min.attributes.suppliedTrBase + labels: + - property_name: eqptPsPower5min.attributes.dn + regex: "^sys/ch/psuslot-(?P[1-9][0-9]*)/" + + node_scale_profiles: + class_name: configprofileEntity + metrics: + - name: node_bd_capacity + value_name: configprofileEntity.attributes.bd + type: gauge + help: Max number of BDs a node supports + - name: node_ipv4_capacity + value_name: configprofileEntity.attributes.epIpv4 + type: gauge + help: Max number of IPv4 endpoints a node supports + - name: node_ipv6_capacity + value_name: configprofileEntity.attributes.epIpv6 + type: gauge + help: Max number of IPv6 endpoints a node supports + - name: node_epg_capacity + value_name: configprofileEntity.attributes.epg + type: gauge + help: Max number of EPGs a node supports + - name: node_esg_capacity + value_name: configprofileEntity.attributes.esg + type: gauge + help: Max number of ESGs a node supports + - name: node_esgIp_capacity + value_name: configprofileEntity.attributes.esgIp + type: gauge + help: Max number of IP-based classifications for ESGs a node supports + - name: node_esgMac_capacity + value_name: configprofileEntity.attributes.esgMac + type: gauge + help: Max number of MAC-based classifications for ESGs a node supports + - name: node_lpm_capacity + value_name: configprofileEntity.attributes.lpm + type: gauge + help: Max number of Longest Prefix Match entries a node supports + - name: node_slash128_capacity + value_name: configprofileEntity.attributes.slash128 + type: gauge + help: Max number of /128 routes a node supports + - name: node_slash32_capacity + value_name: configprofileEntity.attributes.slash32 + type: gauge + help: Max number of /32 routes a node supports + - name: max_proxy_db_capacity + value_name: configprofileEntity.attributes.syntheticIp + type: gauge + help: Max capacity of the proxy_db for each spine; the minimum is the max scale for the whole fabric. Ignore values of 0, as those are leaf nodes + - name: node_vrf_capacity + value_name: configprofileEntity.attributes.vrf + type: gauge + help: Max number of VRFs a node supports + labels: + - property_name: configprofileEntity.attributes.name + regex: "^(?P.*)" + + node_active_scale_profile: + class_name: topoctrlFwdScaleProf + metrics: + - name: node_active_scale_profile + value_name: topoctrlFwdScaleProf.attributes.modTs + # Use the time the profile was applied + value_regex_transformation: "(?P.*)" + value_calculation: "date" + labels: + # active_profile "default" is the same as configured_profile "high-policy"; this is an inconsistency in the object model + - property_name: topoctrlFwdScaleProf.attributes.profType + regex: "^(?P.*)" + - property_name: topoctrlFwdScaleProf.attributes.currentProfile + regex: "cfgent-(?P.*)" + + node_tcam_current: + class_name: eqptcapacityPolUsage5min + metrics: + - name: node_policy_cum + value_name: eqptcapacityPolUsage5min.attributes.polUsageCum + type: "gauge" + - name: node_policy_base + value_name: eqptcapacityPolUsage5min.attributes.polUsageBase + type: "gauge" + - name: node_policy_capacity_cum + value_name: eqptcapacityPolUsage5min.attributes.polUsageCapCum + type: "gauge" + - name: node_policy_capacity_base + value_name: eqptcapacityPolUsage5min.attributes.polUsageCapBase + type: "gauge" + + node_labels_current: + class_name: eqptcapacityPGLabelUsage5min + metrics: + - name: node_labels_cum + value_name: eqptcapacityPGLabelUsage5min.attributes.pgLblUsageCum + type: "gauge" + - name: node_labels_base + value_name: eqptcapacityPGLabelUsage5min.attributes.pgLblUsageBase + type: "gauge" + - name: node_labels_capacity_cum + value_name: eqptcapacityPGLabelUsage5min.attributes.pgLblCapCum + type: "gauge" + - name: node_labels_capacity_base + value_name: eqptcapacityPGLabelUsage5min.attributes.pgLblCapBase + type: "gauge" + + node_mac_current: + class_name: eqptcapacityL2TotalUsage5min + metrics: + - name: node_mac_current +
value_name: eqptcapacityL2TotalUsage5min.attributes.totalEpLast + type: "gauge" + - name: node_mac_capacity + value_name: eqptcapacityL2TotalUsage5min.attributes.totalEpCapLast + type: "gauge" + + node_ipv4_current: + class_name: eqptcapacityL3TotalUsage5min + metrics: + - name: node_ipv4_current + value_name: eqptcapacityL3TotalUsage5min.attributes.v4TotalEpLast + type: "gauge" + + node_ipv6_current: + class_name: eqptcapacityL3TotalUsage5min + metrics: + - name: node_ipv6_current + value_name: eqptcapacityL3TotalUsage5min.attributes.v6TotalEpLast + type: "gauge" + + node_mcast_current: + class_name: eqptcapacityMcastUsage5min + metrics: + - name: node_mcast_cum + value_name: eqptcapacityMcastUsage5min.attributes.localEpCum + type: "gauge" + - name: node_mcast_base + value_name: eqptcapacityMcastUsage5min.attributes.localEpBase + type: "gauge" + - name: node_mcast_capacity_cum + value_name: eqptcapacityMcastUsage5min.attributes.localEpCapCum + type: "gauge" + - name: node_mcast_capacity_base + value_name: eqptcapacityMcastUsage5min.attributes.localEpCapBase + type: "gauge" + + node_vlan_current: + class_name: eqptcapacityVlanUsage5min + metrics: + - name: node_vlan_cum + value_name: eqptcapacityVlanUsage5min.attributes.totalCum + type: "gauge" + - name: node_vlan_base + value_name: eqptcapacityVlanUsage5min.attributes.totalBase + type: "gauge" + - name: node_vlan_capacity_cum + value_name: eqptcapacityVlanUsage5min.attributes.totalCapCum + type: "gauge" + - name: node_vlan_capacity_base + value_name: eqptcapacityVlanUsage5min.attributes.totalCapBase + type: "gauge" + + node_lpm_current: + class_name: eqptcapacityPrefixEntries5min + metrics: + - name: node_lpm_current + value_name: eqptcapacityPrefixEntries5min.attributes.extNormalizedLast + type: "gauge" + + node_slash32_current: + class_name: eqptcapacityL3v4Usage325min + metrics: + - name: node_slash32_cum + value_name: eqptcapacityL3v4Usage325min.attributes.v4TotalCum + type: "gauge" + - name: node_slash32_base 
+ value_name: eqptcapacityL3v4Usage325min.attributes.v4TotalBase + type: "gauge" + + node_slash128_current: + class_name: eqptcapacityL3v6Usage1285min + metrics: + - name: node_slash128_current + value_name: eqptcapacityL3v6Usage1285min.attributes.v6TotalCum + type: "gauge" + - name: node_slash128_base + value_name: eqptcapacityL3v6Usage1285min.attributes.v6TotalBase + type: "gauge" + + node_scale_ctx: + # The class names are a bit confusing: + ## fvEpP: EPGs count + ## fvEPSelectorDef / fvMacBdSelectorDef : IP / MAC Based ESG Selector + ## l3Dom: VRFs + class_name: ctxClassCnt + query_parameter: '?rsp-subtree-class=l2BD,fvEpP,l3Dom,fvMacBdSelectorDef,fvEPSelectorDef' + metrics: + - name: node_scale_ctx + value_name: ctxClassCnt.attributes.count + type: "gauge" + labels: + - property_name: ctxClassCnt.attributes.name + regex: "^(?P.*)" diff --git a/config_node.d/ospf.yaml b/config_node.d/ospf.yaml new file mode 100644 index 0000000..50eb12f --- /dev/null +++ b/config_node.d/ospf.yaml @@ -0,0 +1,25 @@ +class_queries: + # OSPF Neighbors + ospf_neighbors: + class_name: ospfAdjEp + query_parameter: '?order-by=ospfAdjEp.dn&rsp-subtree-include=required&rsp-subtree-class=ospfAdjStats&rsp-subtree=children' + metrics: + - name: ospf_neighbors + # The metric value is the last time the connection changed state + value_name: ospfAdjEp.children.[ospfAdjStats].attributes.lastStChgTs + value_regex_transformation: "(?P.*)" + value_calculation: "date" + labels: + - property_name: ospfAdjEp.attributes.dn + regex: "^topology/pod-(?P[1-9][0-9]*)/node-(?P[1-9][0-9]*).*/dom-(?P.*)/if-\\[(?P.*)\\]" + - property_name: ospfAdjEp.attributes.area + regex: "(?P.*)" + - property_name: ospfAdjEp.attributes.id + regex: "(?P.*)" + - property_name: ospfAdjEp.attributes.operSt + regex: "(?P.*)" + - property_name: ospfAdjEp.attributes.peerIp + regex: "(?P.*)" + staticlabels: + - key: type + value: ospf diff --git a/config_node.d/system.yaml b/config_node.d/system.yaml new file mode 100644 index
0000000..82f6f65 --- /dev/null +++ b/config_node.d/system.yaml @@ -0,0 +1,294 @@ +class_queries: + node_top: + class_name: topSystem + metrics: + - name: node_id + value_name: topSystem.attributes.id + type: "gauge" + labels: + - property_name: topSystem.attributes.name + regex: "^(?P.*)" + node_top2: + class_name: topSystem + metrics: + - name: node_id2 + value_name: topSystem.attributes.id + type: "gauge" + labels: + - property_name: topSystem.attributes.name + regex: "^(?P.*)" + + fabric_node_info: + # Get all the fabric nodes (Controllers, Spines and Leaves) + class_name: fabricNode + query_parameter: '?order-by=fabricNode.dn' + metrics: + - name: fabric_node + # In this case we are not looking for a value, just the labels for info + value_name: + type: "gauge" + help: "Returns the info of the fabric node (controller, spine or leaf)" + unit: "info" + value_calculation: "1" + labels: + - property_name: fabricNode.attributes.name + regex: "^(?P.*)" + - property_name: fabricNode.attributes.address + regex: "^(?P.*)" + - property_name: fabricNode.attributes.role + regex: "^(?P.*)" + - property_name: fabricNode.attributes.adminSt + regex: "^(?P.*)" + - property_name: fabricNode.attributes.serial + regex: "^(?P.*)" + - property_name: fabricNode.attributes.model + regex: "^(?P.*)" + - property_name: fabricNode.attributes.version + regex: "^(?P.*)" + - property_name: fabricNode.attributes.dn + regex: "^topology/pod-(?P[1-9][0-9]*)/node-(?P[1-9][0-9]*)" + + uptime_topsystem: + class_name: topSystem + query_parameter: "?rsp-subtree-include=health" + metrics: + - name: uptime + type: counter + unit: seconds + help: The uptime since boot + value_name: topSystem.attributes.systemUpTime + value_regex_transformation: "([0-9].*):([0-2][0-9]):([0-6][0-9]):([0-6][0-9])\\..*" + value_calculation: "value1 * 86400 + value2 * 3600 + value3 * 60 + value4" + labels: + - property_name: topSystem.attributes.dn + regex: "^topology/pod-(?P[1-9][0-9]*)/node-(?P[1-9][0-9]*)/sys" + - property_name:
topSystem.attributes.state + regex: "^(?P.*)" + - property_name: topSystem.attributes.oobMgmtAddr + regex: "^(?P.*)" + - property_name: topSystem.attributes.name + regex: "^(?P.*)" + - property_name: topSystem.attributes.role + regex: "^(?P.*)" + + max_capacity: + class_name: fvcapRule + # Additional query parameters for the class query, must start with ? and be separated by & + query_parameter: '?query-target-filter=ne(fvcapRule.userConstraint,"feature-unavailable")' + metrics: + - name: max_capacity + value_name: fvcapRule.attributes.constraint + type: gauge + help: Returns the max capacity of the fabric + labels: + - property_name: fvcapRule.attributes.subj + regex: "^(?P.*)" + + apic_node_info: + class_name: infraWiNode + metrics: + - name: infra_node + # In this case we are not looking for a value just the labels for info + value_name: + type: "gauge" + help: "Returns the info of the apic node" + unit: "info" + # Since this is an info metrics the value is always 1 + value_calculation: "1" + labels: + - property_name: infraWiNode.attributes.nodeName + regex: "^(?P.*)" + - property_name: infraWiNode.attributes.addr + regex: "^(?P.*)" + - property_name: infraWiNode.attributes.health + regex: "^(?P.*)" + - property_name: infraWiNode.attributes.apicMode + regex: "^(?P.*)" + - property_name: infraWiNode.attributes.adminSt + regex: "^(?P.*)" + - property_name: infraWiNode.attributes.operSt + regex: "^(?P.*)" + - property_name: infraWiNode.attributes.failoverStatus + regex: "^(?P.*)" + - property_name: infraWiNode.attributes.podId + regex: "^(?P.*)" + +compound_queries: + object_count: + # NOT + classnames: + - class_name: fvCtx + # The label value that will be set to the "labelname: class" + label_value: fvCtx + query_parameter: '?rsp-subtree-include=count' + - class_name: fvCEp + label_value: fvCEp + query_parameter: '?rsp-subtree-include=count' + - class_name: fvCEp + label_value: fvCEpIp + query_parameter: 
'?rsp-subtree-include=required,count&rsp-subtree-class=fvIp&rsp-subtree=children' + - class_name: fvAEPg + label_value: fvAEPg + query_parameter: '?rsp-subtree-include=count' + - class_name: fvBD + label_value: fvBD + query_parameter: '?rsp-subtree-include=count' + - class_name: fvTenant + label_value: fvTenant + query_parameter: '?rsp-subtree-include=count' + - class_name: vnsCDev + label_value: vnsCDev + query_parameter: '?rsp-subtree-include=count' + - class_name: vnsGraphInst + label_value: vnsGraphInst + query_parameter: '?rsp-subtree-include=count' + - class_name: fvIP + label_value: fvIP + query_parameter: '?rsp-subtree-include=count' + - class_name: fvSyntheticIp + label_value: fvSyntheticIp + query_parameter: '?rsp-subtree-include=count' + - class_name: eqptLC + label_value: eqptLC + query_parameter: '?rsp-subtree-include=count' + labelname: class + metrics: + - name: object_instances + value_name: moCount.attributes.count + type: gauge + help: Returns the current count of objects for ACI classes + + node_count: + classnames: + - class_name: topSystem + label_value: spine + query_parameter: '?query-target-filter=eq(topSystem.role,"spine")&rsp-subtree-include=count' + - class_name: topSystem + label_value: leaf + query_parameter: '?query-target-filter=eq(topSystem.role,"leaf")&rsp-subtree-include=count' + - class_name: topSystem + label_value: controller + query_parameter: '?query-target-filter=eq(topSystem.role,"controller")&rsp-subtree-include=count' + labelname: type + metrics: + - name: nodes + value_name: moCount.attributes.count + type: gauge + help: Returns the current count of nodes + +# Group class queries +group_class_queries: + # Gather all different health related metrics + health: + name: health + unit: ratio + type: gauge + help: Returns health score + queries: + - node_health: + class_name: topSystem + query_parameter: "?rsp-subtree-include=health" + metrics: + - value_name: topSystem.children.@reverse.0.healthInst.attributes.cur + 
value_calculation: "value / 100" + labels: + - property_name: topSystem.attributes.dn + regex: "^topology/pod-(?P[1-9][0-9]*)/node-(?P[1-9][0-9]*)/sys" + - property_name: topSystem.attributes.state + regex: "^(?P.*)" + - property_name: topSystem.attributes.oobMgmtAddr + regex: "^(?P.*)" + - property_name: topSystem.attributes.name + regex: "^(?P.*)" + - property_name: topSystem.attributes.role + regex: "^(?P.*)" + # A label for the class query + staticlabels: + - key: class + value: topSystem + + - fabric_health: + class_name: fabricHealthTotal + query_parameter: '?query-target-filter=wcard(fabricHealthTotal.dn,"topology/.*/health")' + metrics: + - + value_name: fabricHealthTotal.attributes.cur + value_calculation: "value / 100" + labels: + - property_name: fabricHealthTotal.attributes.dn + regex: "^topology/pod-(?P[1-9][0-9]*)/health" + staticlabels: + - key: class + value: fabricHealthTotal + + - contract: + class_name: fvCtx + query_parameter: '?rsp-subtree-include=health,required' + metrics: + - + value_name: fvCtx.children.[healthInst].attributes.cur + value_calculation: "value / 100" + labels: + - property_name: fvCtx.attributes.dn + regex: "^uni/tn-(?P.*)/ctx-(?P.*)" + staticlabels: + - key: class + value: fvCtx + + - bridge_domain_health_by_label: + class_name: fvBD + query_parameter: '?rsp-subtree-include=health,required' + metrics: + - + value_name: fvBD.children.[healthInst].attributes.cur + value_calculation: "value / 100" + labels: + - property_name: fvBD.attributes.dn + regex: "^uni/tn-(?P.*)/BD-(?P.*)" + staticlabels: + - key: class + value: fvBD + + - tenant: + class_name: fvTenant + query_parameter: '?rsp-subtree-include=health,required' + metrics: + - + value_name: fvTenant.children.[healthInst].attributes.cur + value_calculation: "value / 100" + labels: + - property_name: fvTenant.attributes.dn + regex: "^(?P.*)" + staticlabels: + - key: class + value: fvTenant + + - ap: + class_name: fvAp + query_parameter: '?rsp-subtree-include=health,required' 
+ metrics: + - + value_name: fvAp.children.[healthInst].attributes.cur + value_calculation: "value / 100" + labels: + - property_name: fvAp.attributes.dn + regex: "^uni/tn-(?P.*)/ap-(?P.*)" + staticlabels: + - key: class + value: fvAp + + - aepg: + class_name: fvAEPg + query_parameter: '?rsp-subtree-include=health,required' + metrics: + - + value_name: fvAEPg.children.[healthInst].attributes.cur + value_calculation: "value / 100" + labels: + - property_name: fvAEPg.attributes.dn + regex: "^uni/tn-(?P.*)/ap-(?P.*)/epg-(?P.*)" + staticlabels: + - key: class + value: fvAEPg + + diff --git a/config_node.d/vlan.yaml b/config_node.d/vlan.yaml new file mode 100644 index 0000000..637f2a4 --- /dev/null +++ b/config_node.d/vlan.yaml @@ -0,0 +1,115 @@ +class_queries: + vlans: + class_name: fvnsEncapBlk + metrics: + - name: vlans_from + value_name: fvnsEncapBlk.attributes.from + type: gauge + help: The from vlan + value_regex_transformation: "vlan-(.*)" + - name: vlans_to + value_name: fvnsEncapBlk.attributes.to + type: gauge + help: The to vlan + value_regex_transformation: "vlan-(.*)" + labels: + - property_name: fvnsEncapBlk.attributes.dn + regex: "^uni/infra/vlanns-\\[(?P.+)\\]-static/from-\\[(?P.+)\\]-to-\\[(?P.+)\\]" + + static_binding_info: + class_name: fvAEPg + query_parameter: "?rsp-subtree-include=required&rsp-subtree-class=fvRsPathAtt&rsp-subtree=children" + metrics: + - name: static_binding + value_name: fvAEPg.children.[fvRsPathAtt].attributes.encap + type: gauge + value_regex_transformation: "vlan-(.*)" + help: "Static binding info" + labels: + - property_name: fvAEPg.attributes.dn + regex: "^uni/tn-(?P.*)/ap-(?P.*)/epg-(?P.*)" + - property_name: fvAEPg.attributes.[.*].attributes.tDn + regex: "^topology/pod-(?P[1-9][0-9]*)/(protpaths|paths)-(?P[1-9][0-9].*)/pathep-\\[(?P.+)\\]" + - property_name: fvAEPg.attributes.[.*].attributes.encap + regex: "^(?P.*)" + + dynamic_binding_info: + class_name: vlanCktEp + query_parameter: 
'?rsp-subtree-include=required&rsp-subtree-class=l2RsPathDomAtt&rsp-subtree=children' + metrics: + - name: dynamic_binding + value_name: vlanCktEp.children.[l2RsPathDomAtt].attributes.operSt + type: gauge + value_transform: + 'unknown': 0 + 'down': 1 + 'up': 2 + 'link-up': 3 + labels: + - property_name: vlanCktEp.attributes.epgDn + regex: "^uni/tn-(?P.*)/ap-(?P.*)/epg-(?P.*)" + - property_name: vlanCktEp.attributes.encap + regex: "^vlan-(?P.*)" + - property_name: vlanCktEp.attributes.pcTag + regex: "^(?P.*)" + - property_name: vlanCktEp.children.[l2RsPathDomAtt].attributes.tDn + regex: "^topology/pod-(?P[1-9][0-9]*)/node-(?P[1-9][0-9]+)/sys/conng/path-\\[(?P[^\\]]+)\\]" + + epg_port_vlan_binding: + class_name: vlanCktEp + query_parameter: '?order-by=vlanCktEp.dn&rsp-subtree-include=required&rsp-subtree-class=l2RsPathDomAtt&rsp-subtree=children' + metrics: + - name: epg_port_vlan_binding + value_name: vlanCktEp.children.[l2RsPathDomAtt].attributes.operSt + type: gauge + value_transform: + 'unknown': 0 + 'down': 1 + 'up': 2 + 'link-up': 3 + labels: + - property_name: vlanCktEp.attributes.epgDn + regex: "^uni/tn-(?P.*)/ap-(?P.*)/epg-(?P.*)" + - property_name: vlanCktEp.attributes.encap + regex: "^vlan-(?P.*)" + - property_name: vlanCktEp.attributes.pcTag + regex: "^(?P.*)" + - property_name: vlanCktEp.children.[l2RsPathDomAtt].attributes.tDn + regex: "^topology/pod-(?P[1-9][0-9]*)/node-(?P[1-9][0-9]+)/sys/conng/path-\\[(?P[^\\]]+)\\]" + + epg_port_vxlan_binding: + class_name: vxlanCktEp + query_parameter: '?order-by=vxlanCktEp.dn&rsp-subtree-include=required&rsp-subtree-class=l2RsPathDomAtt&rsp-subtree=children' + metrics: + - name: epg_port_vxlan_binding + value_name: vxlanCktEp.children.[l2RsPathDomAtt].attributes.operSt + type: gauge + value_transform: + 'unknown': 0 + 'down': 1 + 'up': 2 + 'link-up': 3 + labels: + - property_name: vxlanCktEp.attributes.epgDn + regex: "^uni/tn-(?P.*)/ap-(?P.*)/epg-(?P.*)" + - property_name: vxlanCktEp.attributes.encap + regex: 
"^vxlan-(?P.*)" + - property_name: vxlanCktEp.attributes.pcTag + regex: "^(?P.*)" + - property_name: vxlanCktEp.children.[l2RsPathDomAtt].attributes.tDn + regex: "^topology/pod-(?P[1-9][0-9]*)/node-(?P[1-9][0-9]+)/sys/conng/path-\\[(?P[^\\]]+)\\]" + + epg_to_port: + # TODO + class_name: vlanCktEp + query_parameter: '?rsp-subtree-include=required&rsp-subtree-class=l2RsPathDomAtt&rsp-subtree=children' + #query_parameter: '' + metrics: + - name: dynamic_binding + value_name: vlanCktEp.attributes.pcTag + type: gauge + labels: + - property_name: vlanCktEp.attributes.epgDn + regex: "^uni/tn-(?P.*)/ap-(?P.*)/epg-(?P.*)" + - property_name: vlanCktEp.children.[l2RsPathDomAtt].attributes.tDn + regex: "^topology/pod-(?P[1-9][0-9]*)/node-(?P[1-9][0-9]+)/sys/conng/path-\\[(?P[^\\]]+)\\]" diff --git a/configclassqueries.go b/configclassqueries.go index 7bb27dc..0b8343e 100644 --- a/configclassqueries.go +++ b/configclassqueries.go @@ -8,8 +8,6 @@ // GNU General Public License for more details. // You should have received a copy of the GNU General Public License // along with this program. If not, see . -// -// Copyright 2020 Opsdis AB package main diff --git a/defaults.go b/defaults.go index 570d071..1c6c947 100644 --- a/defaults.go +++ b/defaults.go @@ -8,8 +8,6 @@ // GNU General Public License for more details. // You should have received a copy of the GNU General Public License // along with this program. If not, see . 
-// -// Copyright 2020 Opsdis AB package main @@ -49,6 +47,8 @@ func SetDefaultValues() { viper.BindEnv("logfile") viper.SetDefault("logformat", "json") viper.BindEnv("logformat") + viper.SetDefault("loglevel", "info") + viper.BindEnv("loglevel") viper.SetDefault("config", "config") viper.BindEnv("config") viper.SetDefault("config_dir", "config.d") @@ -69,6 +69,14 @@ func SetDefaultValues() { viper.SetDefault("HTTPClient.keepalive", 15) viper.BindEnv("HTTPClient.keepalive") + // The page size when using paging + viper.SetDefault("HTTPClient.pagesize", 1000) + viper.BindEnv("HTTPClient.pagesize") + + // If parallel paging is enabled the exporter will fetch multiple pages at the same time + viper.SetDefault("HTTPClient.parallel_paging", false) + viper.BindEnv("HTTPClient.parallel_paging") + // This is currently not used viper.SetDefault("HTTPClient.tlshandshaketimeout", 10) viper.BindEnv("HTTPClient.tlshandshaketimeout") @@ -83,4 +91,11 @@ func SetDefaultValues() { viper.SetDefault("httpserver.write_timeout", 0) viper.BindEnv("httpserver.write_timeout") + // Service discovery + viper.SetDefault("service_discovery.labels", []string{"address", "dn", "fabricDomain", "fabricId", "id", + "inbMgmtAddr", "name", "nameAlias", "nodeType", "oobMgmtAddr", "podId", "role", "serial", "siteId", "state", + "version", + }) + viper.SetDefault("service_discovery.target_fields", []string{"aci_exporter_fabric", "oobMgmtAddr"}) + viper.SetDefault("service_discovery.target_format", "%s#%s") } diff --git a/discovery.go b/discovery.go new file mode 100644 index 0000000..321b804 --- /dev/null +++ b/discovery.go @@ -0,0 +1,281 @@ +// This program is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. 
+// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// You should have received a copy of the GNU General Public License +// along with this program. If not, see <http://www.gnu.org/licenses/>. + +package main + +import ( + "context" + "encoding/json" + "fmt" + "reflect" + "strings" + + log "github.com/sirupsen/logrus" + "github.com/tidwall/gjson" +) + +type ServiceDiscovery struct { + Targets []string `json:"targets"` + Labels map[string]string `json:"labels"` +} + +func NewServiceDiscovery() ServiceDiscovery { + return ServiceDiscovery{ + Targets: make([]string, 0), + Labels: make(map[string]string), + } +} + +type DiscoveryConfiguration struct { + LabelsKeys []string `mapstructure:"labels"` + TargetFields []string `mapstructure:"target_fields"` + TargetFormat string `mapstructure:"target_format"` +} + +type Discovery struct { + Fabric string + Fabrics map[string]*Fabric +} + +func (d Discovery) DoDiscovery(ctx context.Context) ([]ServiceDiscovery, error) { + + var serviceDiscoveries []ServiceDiscovery + var topSystems []TopSystem + if d.Fabric != "" { + aci, err := d.getInfraCont(ctx, d.Fabric) + if err != nil { + return serviceDiscoveries, err + } + topSystems = d.getTopSystem(ctx, d.Fabric) + sds, _ := d.parseToDiscoveryFormat(d.Fabric, topSystems) + serviceDiscoveries = append(serviceDiscoveries, sds...)
+ // Add the fabric as a target + fabricSd := NewServiceDiscovery() + fabricSd.Targets = append(fabricSd.Targets, d.Fabric) + fabricSd.Labels["__meta_role"] = "aci_exporter_fabric" + fabricSd.Labels["__meta_fabricDomain"] = aci + serviceDiscoveries = append(serviceDiscoveries, fabricSd) + } else { + for key := range d.Fabrics { + aci, err := d.getInfraCont(ctx, key) + if err != nil { + continue + } + topSystems := d.getTopSystem(ctx, key) + sds, _ := d.parseToDiscoveryFormat(key, topSystems) + serviceDiscoveries = append(serviceDiscoveries, sds...) + fabricSd := NewServiceDiscovery() + fabricSd.Targets = append(fabricSd.Targets, key) + fabricSd.Labels["__meta_role"] = "aci_exporter_fabric" + fabricSd.Labels["__meta_fabricDomain"] = aci + serviceDiscoveries = append(serviceDiscoveries, fabricSd) + + } + } + + return serviceDiscoveries, nil +} + +// p.connection.GetByClassQuery("infraCont", "?query-target=self") +func (d Discovery) getInfraCont(ctx context.Context, fabricName string) (string, error) { + class := "infraCont" + query := "?query-target=self" + data, err := cliQuery(ctx, &fabricName, &class, &query) + + if err != nil { + log.WithFields(log.Fields{ + "function": "discovery", + "class": class, + "fabric": fabricName, + }).Error(err) + return "", err + } + + if len(gjson.Get(data, "imdata.#.infraCont.attributes.fbDmNm").Array()) == 0 { + err = fmt.Errorf("could not determine ACI name, no data returned from APIC") + log.WithFields(log.Fields{ + "function": "discovery", + "class": class, + "fabric": fabricName, + }).Error(err) + return "", err + } + + aciName := gjson.Get(data, "imdata.#.infraCont.attributes.fbDmNm").Array()[0].Str + + if aciName != "" { + return aciName, nil + } + + err = fmt.Errorf("could not determine ACI name") + log.WithFields(log.Fields{ + "function": "discovery", + "class": class, + "fabric": fabricName, + }).Error(err) + return "", err +} + +func (d Discovery) getTopSystem(ctx context.Context, fabricName string) []TopSystem { + class 
:= "topSystem" + query := "" + data, err := cliQuery(ctx, &fabricName, &class, &query) + if err != nil { + log.WithFields(log.Fields{ + "function": "discovery", + "class": class, + "fabric": fabricName, + }).Error(err) + return nil + } + + var topSystems []TopSystem + result := gjson.Get(data, "imdata") + result.ForEach(func(key, value gjson.Result) bool { + topSystemJson := gjson.Get(value.Raw, "topSystem.attributes").Raw + topSystem := &TopSystem{} + topSystem.ACIExporterFabric = fabricName + _ = json.Unmarshal([]byte(topSystemJson), topSystem) + topSystems = append(topSystems, *topSystem) + return true + }) + + return topSystems +} + +func (d Discovery) parseToDiscoveryFormat(fabricName string, topSystems []TopSystem) ([]ServiceDiscovery, error) { + var serviceDiscovery []ServiceDiscovery + for _, topSystem := range topSystems { + sd := &ServiceDiscovery{} + targetValue := make([]interface{}, len(d.Fabrics[fabricName].DiscoveryConfig.TargetFields)) + for i, field := range d.Fabrics[fabricName].DiscoveryConfig.TargetFields { + val, err := d.getField(&topSystem, field) + targetValue[i] = val + if err != nil { + return serviceDiscovery, err + } + } + + sd.Targets = append(sd.Targets, fmt.Sprintf(d.Fabrics[fabricName].DiscoveryConfig.TargetFormat, targetValue...)) + + sd.Labels = make(map[string]string) + sd.Labels[fmt.Sprintf("__meta_%s", "aci_exporter_fabric")] = topSystem.ACIExporterFabric + + for _, labelName := range d.Fabrics[fabricName].DiscoveryConfig.LabelsKeys { + labelValue, err := d.getField(&topSystem, labelName) + if err != nil { + return serviceDiscovery, err + } + sd.Labels[fmt.Sprintf("__meta_%s", labelName)] = labelValue + } + serviceDiscovery = append(serviceDiscovery, *sd) + } + return serviceDiscovery, nil +} + +func (d Discovery) getField(item interface{}, fieldName string) (string, error) { + v := reflect.ValueOf(item).Elem() + if !v.CanAddr() { + log.WithFields(log.Fields{ + "function": "discovery", + "fabric": d.Fabric, + "fieldName": 
fieldName, + }).Error("cannot assign to the item passed, item must be a pointer in order to assign") + return "", fmt.Errorf("cannot assign to the item passed, item must be a pointer in order to assign") + } + // It's possible we can cache this, which is why we precompute all of these ahead of time. + findJsonName := func(t reflect.StructTag) (string, error) { + if jt, ok := t.Lookup("json"); ok { + return strings.Split(jt, ",")[0], nil + } + log.WithFields(log.Fields{ + "function": "discovery", + "fabric": d.Fabric, + "fieldName": fieldName, + }).Error("tag provided does not define a json tag") + return "", fmt.Errorf("tag provided does not define a json tag") + } + fieldNames := map[string]int{} + for i := 0; i < v.NumField(); i++ { + typeField := v.Type().Field(i) + tag := typeField.Tag + jname, _ := findJsonName(tag) + fieldNames[jname] = i + } + + fieldNum, ok := fieldNames[fieldName] + if !ok { + log.WithFields(log.Fields{ + "function": "discovery", + "fabric": d.Fabric, + "fieldName": fieldName, + }).Error("field does not exist within the provided item") + return "", fmt.Errorf("field does not exist within the provided item") + } + fieldVal := v.Field(fieldNum) + return fieldVal.String(), nil +} + +type TopSystem struct { + Address string `json:"address"` + BootstrapState string `json:"bootstrapState"` + ChildAction string `json:"childAction"` + ClusterTimeDiff string `json:"clusterTimeDiff"` + ConfigIssues string `json:"configIssues"` + ControlPlaneMTU string `json:"controlPlaneMTU"` + CurrentTime string `json:"currentTime"` + Dn string `json:"dn"` + EnforceSubnetCheck string `json:"enforceSubnetCheck"` + EtepAddr string `json:"etepAddr"` + FabricDomain string `json:"fabricDomain"` + FabricID string `json:"fabricId"` + FabricMAC string `json:"fabricMAC"` + ID string `json:"id"` + InbMgmtAddr string `json:"inbMgmtAddr"` + InbMgmtAddr6 string `json:"inbMgmtAddr6"` + InbMgmtAddr6Mask string `json:"inbMgmtAddr6Mask"` + InbMgmtAddrMask string `json:"inbMgmtAddrMask"`
+ InbMgmtGateway string `json:"inbMgmtGateway"` + InbMgmtGateway6 string `json:"inbMgmtGateway6"` + LastRebootTime string `json:"lastRebootTime"` + LastResetReason string `json:"lastResetReason"` + LcOwn string `json:"lcOwn"` + ModTs string `json:"modTs"` + Mode string `json:"mode"` + MonPolDn string `json:"monPolDn"` + Name string `json:"name"` + NameAlias string `json:"nameAlias"` + NodeType string `json:"nodeType"` + OobMgmtAddr string `json:"oobMgmtAddr"` + OobMgmtAddr6 string `json:"oobMgmtAddr6"` + OobMgmtAddr6Mask string `json:"oobMgmtAddr6Mask"` + OobMgmtAddrMask string `json:"oobMgmtAddrMask"` + OobMgmtGateway string `json:"oobMgmtGateway"` + OobMgmtGateway6 string `json:"oobMgmtGateway6"` + PodID string `json:"podId"` + RemoteNetworkID string `json:"remoteNetworkId"` + RemoteNode string `json:"remoteNode"` + RlOperPodID string `json:"rlOperPodId"` + RlRoutableMode string `json:"rlRoutableMode"` + RldirectMode string `json:"rldirectMode"` + Role string `json:"role"` + Serial string `json:"serial"` + ServerType string `json:"serverType"` + SiteID string `json:"siteId"` + State string `json:"state"` + Status string `json:"status"` + SystemUpTime string `json:"systemUpTime"` + TepPool string `json:"tepPool"` + UnicastXrEpLearnDisable string `json:"unicastXrEpLearnDisable"` + Version string `json:"version"` + VirtualMode string `json:"virtualMode"` + ACIExporterFabric string `json:"aci_exporter_fabric"` +} diff --git a/fabricconfig.go b/fabricconfig.go index 919d85f..a88ee61 100644 --- a/fabricconfig.go +++ b/fabricconfig.go @@ -8,14 +8,14 @@ // GNU General Public License for more details. // You should have received a copy of the GNU General Public License // along with this program. If not, see <http://www.gnu.org/licenses/>.
-// -// Copyright 2020 Opsdis AB package main type Fabric struct { - Username string `mapstructure:"username"` - Password string `mapstructure:"password"` - Apic []string - AciName string `mapstructure:"aci_name"` + Username string `mapstructure:"username"` + Password string `mapstructure:"password"` + Apic []string `mapstructure:"apic"` + AciName string `mapstructure:"aci_name"` + FabricName string `mapstructure:"fabric_name"` + DiscoveryConfig DiscoveryConfiguration `mapstructure:"service_discovery"` } diff --git a/go.mod b/go.mod index ffc7b94..ac6a7ad 100644 --- a/go.mod +++ b/go.mod @@ -10,6 +10,7 @@ require ( github.com/spf13/viper v1.7.0 github.com/tidwall/gjson v1.9.3 github.com/umisama/go-regexpcache v0.0.0-20150417035358-2444a542492f + gopkg.in/yaml.v2 v2.3.0 ) require ( @@ -37,5 +38,4 @@ require ( golang.org/x/text v0.3.8 // indirect google.golang.org/protobuf v1.26.0-rc.1 // indirect gopkg.in/ini.v1 v1.51.0 // indirect - gopkg.in/yaml.v2 v2.3.0 // indirect ) diff --git a/httpclient.go b/httpclient.go index ca3f251..fc4d65c 100644 --- a/httpclient.go +++ b/httpclient.go @@ -8,8 +8,6 @@ // GNU General Public License for more details. // You should have received a copy of the GNU General Public License // along with this program. If not, see <http://www.gnu.org/licenses/>. -// -// Copyright 2020-2023 Opsdis package main diff --git a/metric.go b/metric.go index 44bcffd..0589c05 100644 --- a/metric.go +++ b/metric.go @@ -8,8 +8,6 @@ // GNU General Public License for more details. // You should have received a copy of the GNU General Public License // along with this program. If not, see <http://www.gnu.org/licenses/>. -// -// Copyright 2020 Opsdis AB package main diff --git a/prometheus/prometheus_nodes.yml b/prometheus/prometheus_nodes.yml new file mode 100644 index 0000000..ba18372 --- /dev/null +++ b/prometheus/prometheus_nodes.yml @@ -0,0 +1,96 @@ +# example for aci-exporter +global: + scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
+ evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute. + # scrape_timeout is set to the global default (10s). + +# Scrape configurations for the aci-exporter. +scrape_configs: + + # Job for APIC queries + - job_name: 'aci' + scrape_interval: 1m + scrape_timeout: 30s + metrics_path: /probe + params: + queries: + - health,fabric_node_info,object_count,max_capacity + + http_sd_configs: + # Discover all fabrics + # To discover an individual fabric use - url: "http://localhost:9643/sd?target=<fabric_name>" + - url: "http://localhost:9643/sd" + refresh_interval: 5m + + relabel_configs: + - source_labels: [ __meta_role ] + # Only include the aci_exporter_fabric __meta_role + regex: "aci_exporter_fabric" + action: "keep" + + - source_labels: [ __address__ ] + target_label: __param_target + - source_labels: [ __param_target ] + target_label: instance + - target_label: __address__ + replacement: 127.0.0.1:9643 + + # Job for ACI nodes based on discovery + - job_name: 'aci_nodes' + scrape_interval: 1m + scrape_timeout: 30s + metrics_path: /probe + params: + # OBS make sure to specify queries that only work for nodes AND have the correct label regex for node-based responses + queries: + - interface_info + - interface_rx_stats + - interface_tx_stats + - interface_rx_err_stats + - interface_tx_err_stats + + http_sd_configs: + # Discover all fabrics + # To discover an individual fabric use - url: "http://localhost:9643/sd?target=<fabric_name>" + - url: "http://localhost:9643/sd" + refresh_interval: 5m + + relabel_configs: + - source_labels: [ __meta_role ] + # Only include the spine and leaf __meta_role + regex: "(spine|leaf)" + action: "keep" + + # Get the target param from __address__ that is <fabric>#<node> by default + - source_labels: [ __address__ ] + separator: "#" + regex: (.*)#(.*) + replacement: "$1" + target_label: __param_target + + # Get the node param from __address__ that is <fabric>#<node> by default + - source_labels: [
__address__ ] + separator: "#" + regex: (.*)#(.*) + replacement: "$2" + target_label: __param_node + + # Set instance to the ip/hostname from the __param_node + - source_labels: [ __param_node ] + target_label: instance + + # Add labels from discovery + - source_labels: [ __meta_fabricDomain ] + target_label: aci + - source_labels: [ __meta_id ] + target_label: nodeid + - source_labels: [ __meta_podId ] + target_label: podid + - source_labels: [ __meta_role ] + target_label: role + - source_labels: [ __meta_name ] + target_label: name + + - target_label: __address__ + replacement: 127.0.0.1:9643
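
For reference, the `http_sd_configs` jobs above consume the standard Prometheus HTTP SD payload that the aci-exporter `/sd` endpoint builds from the `ServiceDiscovery` struct in `discovery.go`. A minimal sketch of the response — one fabric entry plus one leaf entry, where the leaf target follows the default `service_discovery.target_format` of `%s#%s` applied to `aci_exporter_fabric` and `oobMgmtAddr`; all concrete values (fabric name, node id, addresses) are illustrative, not taken from a real fabric:

```json
[
  {
    "targets": ["fabric1"],
    "labels": {
      "__meta_role": "aci_exporter_fabric",
      "__meta_fabricDomain": "fab1"
    }
  },
  {
    "targets": ["fabric1#192.0.2.11"],
    "labels": {
      "__meta_aci_exporter_fabric": "fabric1",
      "__meta_role": "leaf",
      "__meta_id": "101",
      "__meta_podId": "1",
      "__meta_name": "leaf101",
      "__meta_oobMgmtAddr": "192.0.2.11"
    }
  }
]
```

The relabel rules in `prometheus_nodes.yml` then split the `#`-separated target into `__param_target` (fabric) and `__param_node` (node address) and keep only entries whose `__meta_role` matches the job.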