Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[inputs.opcua] Regression between 1.18.3 and 1.19.1 #9523

Closed
R290 opened this issue Jul 20, 2021 · 2 comments · Fixed by #9524
Closed

[inputs.opcua] Regression between 1.18.3 and 1.19.1 #9523

R290 opened this issue Jul 20, 2021 · 2 comments · Fixed by #9524
Labels
area/opcua bug unexpected problem or unintended behavior platform/windows regression something that used to work, but is now broken

Comments

@R290
Copy link
Contributor

R290 commented Jul 20, 2021

Relevant telegraf.conf:

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Log at debug level.
  debug = true
  ## Log only error level messages.
  # quiet = false

  ## Log target controls the destination for logs and can be one of "file",
  ## "stderr" or, on Windows, "eventlog".  When set to "file", the output file
  ## is determined by the "logfile" setting.
  # logtarget = "file"

  ## Name of the file to be logged to when using the "file" logtarget.  If set to
  ## the empty string then logs are written to stderr.
  # logfile = ""

  ## The logfile will be rotated after the time interval specified.  When set
  ## to 0 no time based rotation is performed.  Logs are rotated only when
  ## written to, if there is no log activity rotation may be delayed.
  # logfile_rotation_interval = "0d"

  ## The logfile will be rotated when it becomes larger than the specified
  ## size.  When set to 0 no size based rotation is performed.
  # logfile_rotation_max_size = "0MB"

  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  # logfile_rotation_max_archives = 5

  ## Pick a timezone to use when logging or type 'local' for local time.
  ## Example: America/Chicago
  # log_with_timezone = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do no set the "host" tag in the telegraf agent.
  omit_hostname = false

[[inputs.opcua]]
  ## Metric name
  # name = "opcua"
  #
  ## OPC UA Endpoint URL
  endpoint = "opc.tcp://endpoint:4840"
  #
  ## Maximum time allowed to establish a connect to the endpoint.
  # connect_timeout = "10s"
  #
  ## Maximum time allowed for a request over the estabilished connection.
  # request_timeout = "5s"
  #
  ## Security policy, one of "None", "Basic128Rsa15", "Basic256",
  ## "Basic256Sha256", or "auto"
  # security_policy = "auto"
  #
  ## Security mode, one of "None", "Sign", "SignAndEncrypt", or "auto"
  security_mode = "None"
  #
  ## Path to cert.pem. Required when security mode or policy isn't "None".
  ## If cert path is not supplied, self-signed cert and key will be generated.
  # certificate = "/etc/telegraf/cert.pem"
  #
  ## Path to private key.pem. Required when security mode or policy isn't "None".
  ## If key path is not supplied, self-signed cert and key will be generated.
  # private_key = "/etc/telegraf/key.pem"
  #
  ## Authentication Method, one of "Certificate", "UserName", or "Anonymous".  To
  ## authenticate using a specific ID, select 'Certificate' or 'UserName'
  # auth_method = "Anonymous"
  #
  ## Username. Required for auth_method = "UserName"
  # username = ""
  #
  ## Password. Required for auth_method = "UserName"
  # password = ""
  #
  ## Node ID configuration
  ## name              - field name to use in the output
  ## namespace         - OPC UA namespace of the node (integer value 0 thru 3)
  ## identifier_type   - OPC UA ID type (s=string, i=numeric, g=guid, b=opaque)
  ## identifier        - OPC UA ID (tag as shown in opcua browser)
  ## Example:
  ## {name="ProductUri", namespace="0", identifier_type="i", identifier="2262"}
  nodes = [  # a few tens of nodes ]                 

System info:

  • Telegraf version 1.18.3 and 1.19.1 in a docker container
  • Siemens OPC UA server on a PLC

Steps to reproduce:

  1. Start both the 1.18.3 and 1.19.1 telegraf container with the same agent and input configuration
  2. Wait about an hour
  3. A timeout occurs

Expected behavior:

Succesful recovery in the next collection interval. Version 1.18.3 would give the following output (with debug=false):

2021-07-19T08:04:20Z W! [inputs.opcua] Collection took longer than expected; not complete after interval of 10s
2021-07-19T08:04:20Z E! [inputs.opcua] Error in plugin: RegisterNodes Read failed: The operation timed out. StatusBadTimeout (0x800A0000)

Other than this output the data collection would resume the next interval. The timeout does resurface about every hour, I'm not sure if this related to the plugin or something else.

Actual behavior:

Collection completely stops. Version 1.19.1 gives the following output (with debug=true):

2021-07-19T08:04:12Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2021-07-19T08:04:20Z D! [inputs.opcua] Previous collection has not completed; scheduled collection skipped
2021-07-19T08:04:20Z W! [inputs.opcua] Collection took longer than expected; not complete after interval of 10s
2021-07-19T08:04:20Z E! [inputs.opcua] Error in plugin: RegisterNodes Read failed: The operation timed out. StatusBadTimeout (0x800A0000)
2021-07-19T08:04:22Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2021-07-19T08:04:30Z E! [inputs.opcua] Error in plugin: opcua: sechan: secure channel not open.
2021-07-19T08:04:32Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2021-07-19T08:04:40Z E! [inputs.opcua] Error in plugin: opcua: sechan: secure channel not open.

The secure channel not open error keeps repeating for every collection interval until the container is manually restarted. No more data is passed to the output.

Additional info:

I preferred the behavior of 1.18.3. I also tested 1.19.1 with the open62541 dummy server and there no timeouts occur. Probably caused by 8e7da35. My proposed solution would be a similar approach as #8871: print the error but don't return it.

@R290 R290 added the bug unexpected problem or unintended behavior label Jul 20, 2021
@srebhan
Copy link
Member

srebhan commented Jul 21, 2021

@R290 can you please check if PR #9524 fixes your problem?

@R290
Copy link
Contributor Author

R290 commented Jul 22, 2021

Yes the problem is fixed, it has run for 24 hours without the need for a restart. Thank you for the quick PR. Now there is there remaining question of what is actually causing the timeouts, but that is not within the scope of this issue.

@reimda reimda added the regression something that used to work, but is now broken label Aug 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/opcua bug unexpected problem or unintended behavior platform/windows regression something that used to work, but is now broken
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants