[inputs.opcua] Regression between 1.18.3 and 1.19.1 #9523

R290 · 2021-07-20T19:23:18Z

Relevant telegraf.conf:

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Log at debug level.
  debug = true
  ## Log only error level messages.
  # quiet = false

  ## Log target controls the destination for logs and can be one of "file",
  ## "stderr" or, on Windows, "eventlog".  When set to "file", the output file
  ## is determined by the "logfile" setting.
  # logtarget = "file"

  ## Name of the file to be logged to when using the "file" logtarget.  If set to
  ## the empty string then logs are written to stderr.
  # logfile = ""

  ## The logfile will be rotated after the time interval specified.  When set
  ## to 0 no time based rotation is performed.  Logs are rotated only when
  ## written to, if there is no log activity rotation may be delayed.
  # logfile_rotation_interval = "0d"

  ## The logfile will be rotated when it becomes larger than the specified
  ## size.  When set to 0 no size based rotation is performed.
  # logfile_rotation_max_size = "0MB"

  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  # logfile_rotation_max_archives = 5

  ## Pick a timezone to use when logging or type 'local' for local time.
  ## Example: America/Chicago
  # log_with_timezone = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do no set the "host" tag in the telegraf agent.
  omit_hostname = false

[[inputs.opcua]]
  ## Metric name
  # name = "opcua"
  #
  ## OPC UA Endpoint URL
  endpoint = "opc.tcp://endpoint:4840"
  #
  ## Maximum time allowed to establish a connect to the endpoint.
  # connect_timeout = "10s"
  #
  ## Maximum time allowed for a request over the estabilished connection.
  # request_timeout = "5s"
  #
  ## Security policy, one of "None", "Basic128Rsa15", "Basic256",
  ## "Basic256Sha256", or "auto"
  # security_policy = "auto"
  #
  ## Security mode, one of "None", "Sign", "SignAndEncrypt", or "auto"
  security_mode = "None"
  #
  ## Path to cert.pem. Required when security mode or policy isn't "None".
  ## If cert path is not supplied, self-signed cert and key will be generated.
  # certificate = "/etc/telegraf/cert.pem"
  #
  ## Path to private key.pem. Required when security mode or policy isn't "None".
  ## If key path is not supplied, self-signed cert and key will be generated.
  # private_key = "/etc/telegraf/key.pem"
  #
  ## Authentication Method, one of "Certificate", "UserName", or "Anonymous".  To
  ## authenticate using a specific ID, select 'Certificate' or 'UserName'
  # auth_method = "Anonymous"
  #
  ## Username. Required for auth_method = "UserName"
  # username = ""
  #
  ## Password. Required for auth_method = "UserName"
  # password = ""
  #
  ## Node ID configuration
  ## name              - field name to use in the output
  ## namespace         - OPC UA namespace of the node (integer value 0 thru 3)
  ## identifier_type   - OPC UA ID type (s=string, i=numeric, g=guid, b=opaque)
  ## identifier        - OPC UA ID (tag as shown in opcua browser)
  ## Example:
  ## {name="ProductUri", namespace="0", identifier_type="i", identifier="2262"}
  nodes = [  # a few tens of nodes ]

System info:

Telegraf version 1.18.3 and 1.19.1 in a docker container
Siemens OPC UA server on a PLC

Steps to reproduce:

Start both the 1.18.3 and 1.19.1 telegraf container with the same agent and input configuration
Wait about an hour
A timeout occurs

Expected behavior:

Succesful recovery in the next collection interval. Version 1.18.3 would give the following output (with debug=false):

2021-07-19T08:04:20Z W! [inputs.opcua] Collection took longer than expected; not complete after interval of 10s
2021-07-19T08:04:20Z E! [inputs.opcua] Error in plugin: RegisterNodes Read failed: The operation timed out. StatusBadTimeout (0x800A0000)

Other than this output the data collection would resume the next interval. The timeout does resurface about every hour, I'm not sure if this related to the plugin or something else.

Actual behavior:

Collection completely stops. Version 1.19.1 gives the following output (with debug=true):

2021-07-19T08:04:12Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2021-07-19T08:04:20Z D! [inputs.opcua] Previous collection has not completed; scheduled collection skipped
2021-07-19T08:04:20Z W! [inputs.opcua] Collection took longer than expected; not complete after interval of 10s
2021-07-19T08:04:20Z E! [inputs.opcua] Error in plugin: RegisterNodes Read failed: The operation timed out. StatusBadTimeout (0x800A0000)
2021-07-19T08:04:22Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2021-07-19T08:04:30Z E! [inputs.opcua] Error in plugin: opcua: sechan: secure channel not open.
2021-07-19T08:04:32Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2021-07-19T08:04:40Z E! [inputs.opcua] Error in plugin: opcua: sechan: secure channel not open.

The secure channel not open error keeps repeating for every collection interval until the container is manually restarted. No more data is passed to the output.

Additional info:

I preferred the behavior of 1.18.3. I also tested 1.19.1 with the open62541 dummy server and there no timeouts occur. Probably caused by 8e7da35. My proposed solution would be a similar approach as #8871: print the error but don't return it.

The text was updated successfully, but these errors were encountered:

srebhan · 2021-07-21T14:00:03Z

@R290 can you please check if PR #9524 fixes your problem?

R290 · 2021-07-22T20:48:18Z

Yes the problem is fixed, it has run for 24 hours without the need for a restart. Thank you for the quick PR. Now there is there remaining question of what is actually causing the timeouts, but that is not within the scope of this issue.

R290 added the bug unexpected problem or unintended behavior label Jul 20, 2021

telegraf-tiger bot added area/opcua platform/windows labels Jul 20, 2021

srebhan mentioned this issue Jul 21, 2021

fix: Do not return on disconnect to avoid breaking reconnect #9524

Merged

2 tasks

R290 mentioned this issue Jul 27, 2021

Do not skip good quality nodes after a bad quality node is encountered #9550

Merged

2 tasks

reimda added the regression something that used to work, but is now broken label Aug 3, 2021

reimda closed this as completed in #9524 Aug 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[inputs.opcua] Regression between 1.18.3 and 1.19.1 #9523

[inputs.opcua] Regression between 1.18.3 and 1.19.1 #9523

R290 commented Jul 20, 2021

srebhan commented Jul 21, 2021

R290 commented Jul 22, 2021

[inputs.opcua] Regression between 1.18.3 and 1.19.1 #9523

[inputs.opcua] Regression between 1.18.3 and 1.19.1 #9523

Comments

R290 commented Jul 20, 2021

Relevant telegraf.conf:

System info:

Steps to reproduce:

Expected behavior:

Actual behavior:

Additional info:

srebhan commented Jul 21, 2021

R290 commented Jul 22, 2021