Not Writing SNMP Data to Influx #4326

Closed
MrHamel opened this issue Jun 21, 2018 · 16 comments
Labels
area/snmp bug unexpected problem or unintended behavior

Comments

@MrHamel

MrHamel commented Jun 21, 2018

Relevant telegraf.conf:

telegraf.conf Changes:

interval = "25s"

(influxdb)
urls = ["http://127.0.0.1:8086"]

Example SNMP Conf:

[[inputs.snmp]]
  agents = [ "(SNIPPED)" ]
  version = 2
  community = "(SNIPPED)"
  name = "snmp"

  [[inputs.snmp.field]]
    name = "hostname"
    oid = "RFC1213-MIB::sysName.0"
    is_tag = true

  [[inputs.snmp.table]]
    name = "snmp"
    inherit_tags = [ "hostname" ]
    oid = "IF-MIB::ifXTable"

    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

System info:

CentOS 6.9 64-Bit

Telegraf v1.7.0 (git: release-1.7 f4d22dd)
InfluxDB v1.5.2 (git: 1.5 02d7d4f043b34ecb4e9b2dbec298c6f9450c2a32)

Steps to reproduce:

Start Influx (with a default config) and Telegraf with SNMP configs in the telegraf.d folder.

Expected behavior:

SNMP data is stored in Influxdb like normal, so Grafana can extract and display it.

Actual behavior:

Influx receives the POST data for SNMP and acknowledges it with an HTTP 204 response code, but does not write it to disk. As a result, Grafana cannot show networking graphs, although my DNS and HTTP monitors work just fine.

Additional info:

My Telegraf and InfluxDB are for the most part stock installs; the only recent change was a reboot of the server. I'm puzzled about where to go next with troubleshooting, since strace on Influx forks off into oblivion and I can't see where the problem is coming from.

CentOS 6
Grafana, Influx, and Telegraf were installed from their respective YUM repositories.

P.S. --debug and --test do not show any errors whatsoever.

@solune

solune commented Jun 22, 2018

same problem with centos 7.5, telegraf-1.7.0-1, influxdb-1.5.2-1

@Audiobuzz

Audiobuzz commented Jun 26, 2018

Same problem with raspbian stretch. Upgraded to telegraf 1.7 and now not all of my network stats are making their way into influx

influxdb 1.5.2
telegraf 1.7.0

[Update: it looks like it's getting written, but the tagging has changed so I can no longer filter by ifName in Grafana]

@russorat russorat added the bug unexpected problem or unintended behavior label Jun 26, 2018
@danielnelson
Contributor

@Audiobuzz Can you run telegraf --input-filter snmp --test with both Telegraf 1.7 and the version that was previously working so that we can examine the changes?

@mpetersen42

mpetersen42 commented Jul 5, 2018

I'm seeing the same issue. network_interface is fine, but network_interface_x doesn't get there.

Telegraf 1.7:

> network_interface,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 ifAdminStatus=1i,ifDescr="Ethernet47",ifInDiscards=0i,ifInErrors=0i,ifInNUcastPkts=825362i,ifInOctets=1674467248i,ifInUcastPkts=709374454i,ifInUnknownProtos=0i,ifLastChange=3330870583i,ifMtu=9214i,ifOperStatus=1i,ifOutDiscards=75128i,ifOutErrors=0i,ifOutNUcastPkts=3664089963i,ifOutOctets=1223040315i,ifOutQLen=0i,ifOutUcastPkts=914710215i,ifPhysAddress="00:1c:73:dd:e3:46",ifSpecific=".1.3.6.1.2.1.10.7",ifSpeed=4294967295i,ifType=6i 1530814001000000000
> network_interface_x,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 ifAlias="CHIWKSPRD189_Grimes",ifConnectorPresent=1i,ifCounterDiscontinuityTime=0i,ifHCInBroadcastPkts=126990i,ifHCInMulticastPkts=698372i,ifHCInOctets=96163770320i,ifHCInUcastPkts=709374643i,ifHCOutBroadcastPkts=52775161i,ifHCOutMulticastPkts=12201249394i,ifHCOutOctets=2251785903419i,ifHCOutUcastPkts=914710215i,ifHighSpeed=10000i,ifInBroadcastPkts=126990i,ifInMulticastPkts=698372i,ifLinkUpDownTrapEnable=1i,ifOutBroadcastPkts=52775161i,ifOutMulticastPkts=3611314802i,ifPromiscuousMode=2i 1530814002000000000
> network_interface_stats,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 dot3StatsAlignmentErrors=0i,dot3StatsCarrierSenseErrors=0i,dot3StatsDeferredTransmissions=0i,dot3StatsDuplexStatus=3i,dot3StatsEtherChipSet=".0.0",dot3StatsExcessiveCollisions=0i,dot3StatsFCSErrors=0i,dot3StatsFrameTooLongs=0i,dot3StatsInternalMacReceiveErrors=0i,dot3StatsInternalMacTransmitErrors=0i,dot3StatsLateCollisions=0i,dot3StatsMultipleCollisionFrames=0i,dot3StatsRateControlAbility=2i,dot3StatsRateControlStatus=3i,dot3StatsSQETestErrors=0i,dot3StatsSingleCollisionFrames=0i,dot3StatsSymbolErrors=0i 1530814002000000000

Telegraf 1.6.1:

> network_interface,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 ifType=6i,ifInOctets=1677652782i,ifInUnknownProtos=0i,ifOutNUcastPkts=3664756309i,ifSpecific=".1.3.6.1.2.1.10.7",ifPhysAddress="00:1c:73:dd:e3:46",ifInErrors=0i,ifLastChange=3330870583i,ifInDiscards=0i,ifOutOctets=1327030272i,ifOutUcastPkts=914742115i,ifOutDiscards=75128i,ifMtu=9214i,ifAdminStatus=1i,ifOperStatus=1i,ifInUcastPkts=709398892i,ifInNUcastPkts=825367i,ifOutErrors=0i,ifOutQLen=0i,ifDescr="Ethernet47",ifSpeed=4294967295i 1530814173000000000
> network_interface_x,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 ifHCInOctets=96166962889i,ifHCOutOctets=2251892630235i,ifPromiscuousMode=2i,ifCounterDiscontinuityTime=0i,ifInMulticastPkts=698377i,ifOutMulticastPkts=3612001519i,ifHCInMulticastPkts=698377i,ifHCOutMulticastPkts=12201936111i,ifHCOutBroadcastPkts=52775231i,ifHighSpeed=10000i,ifAlias="CHIWKSPRD189_Grimes",ifName="Ethernet47",ifInBroadcastPkts=126990i,ifOutBroadcastPkts=52775231i,ifHCInBroadcastPkts=126990i,ifHCOutUcastPkts=914742333i,ifHCInUcastPkts=709399107i,ifLinkUpDownTrapEnable=1i,ifConnectorPresent=1i 1530814174000000000
> network_interface_stats,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 dot3StatsSingleCollisionFrames=0i,dot3StatsMultipleCollisionFrames=0i,dot3StatsDeferredTransmissions=0i,dot3StatsExcessiveCollisions=0i,dot3StatsDuplexStatus=3i,dot3StatsSQETestErrors=0i,dot3StatsLateCollisions=0i,dot3StatsInternalMacTransmitErrors=0i,dot3StatsInternalMacReceiveErrors=0i,dot3StatsSymbolErrors=0i,dot3StatsRateControlAbility=2i,dot3StatsFrameTooLongs=0i,dot3StatsAlignmentErrors=0i,dot3StatsFCSErrors=0i,dot3StatsCarrierSenseErrors=0i,dot3StatsEtherChipSet=".0.0",dot3StatsRateControlStatus=3i 1530814174000000000

@danielnelson
Contributor

@mpetersen42 So when running Telegraf 1.7, select * from network_interface_x would have no results at all?

@mpetersen42

mpetersen42 commented Jul 5, 2018

@danielnelson Technically you'd need a WHERE clause with a switchname collected by 1.7.0 and a time range after the 1.7.0 upgrade to get no results, but yes. network_interface_stats was fine as well.

@danielnelson
Contributor

@mpetersen42 Could you test with 1.6.4? This version introduced the change between your two output samples. (ifName is only added as a tag)

@mpetersen42

mpetersen42 commented Jul 5, 2018

@danielnelson

1.6.4 is NOT working.

> network_interface,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 ifOutNUcastPkts=3669019002i,ifOutErrors=0i,ifSpecific=".1.3.6.1.2.1.10.7",ifInUcastPkts=710282107i,ifInUnknownProtos=0i,ifOutOctets=2155683084i,ifOutUcastPkts=915577703i,ifType=6i,ifPhysAddress="00:1c:73:dd:e3:46",ifOutDiscards=75128i,ifInErrors=0i,ifOutQLen=0i,ifMtu=9214i,ifSpeed=4294967295i,ifOperStatus=1i,ifInDiscards=0i,ifInNUcastPkts=826110i,ifDescr="Ethernet47",ifAdminStatus=1i,ifLastChange=3330870583i,ifInOctets=1810137734i 1530831730000000000
> network_interface_x,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 ifHCInOctets=96299418246i,ifHCInUcastPkts=710282107i,ifHCOutMulticastPkts=12206133281i,ifPromiscuousMode=2i,ifOutMulticastPkts=3616198689i,ifHCInBroadcastPkts=127101i,ifHCOutBroadcastPkts=52820313i,ifConnectorPresent=1i,ifAlias="CHIWKSPRD189_Grimes",ifInMulticastPkts=699009i,ifInBroadcastPkts=127101i,ifOutBroadcastPkts=52820313i,ifHCInMulticastPkts=699009i,ifHCOutUcastPkts=915577703i,ifHCOutOctets=2252718546188i,ifLinkUpDownTrapEnable=1i,ifHighSpeed=10000i,ifCounterDiscontinuityTime=0i 1530831731000000000
> network_interface_stats,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 dot3StatsFrameTooLongs=0i,dot3StatsDuplexStatus=3i,dot3StatsRateControlStatus=3i,dot3StatsSymbolErrors=0i,dot3StatsSQETestErrors=0i,dot3StatsLateCollisions=0i,dot3StatsInternalMacTransmitErrors=0i,dot3StatsCarrierSenseErrors=0i,dot3StatsInternalMacReceiveErrors=0i,dot3StatsFCSErrors=0i,dot3StatsSingleCollisionFrames=0i,dot3StatsExcessiveCollisions=0i,dot3StatsEtherChipSet=".0.0",dot3StatsRateControlAbility=2i,dot3StatsAlignmentErrors=0i,dot3StatsMultipleCollisionFrames=0i,dot3StatsDeferredTransmissions=0i 1530831731000000000

@mpetersen42

mpetersen42 commented Jul 5, 2018

Here's my SNMP config (in /etc/telegraf/telegraf.d/):

[[inputs.snmp]]
  agents = [<snip switch names>]
  version = 2
  community = "<snip community string>"
  name = "network"
  tagexclude = [ "ifIndex", "dot3StatsIndex", "host", "agent_host" ]
  interval = "10s"

  [[inputs.snmp.field]]
    name = "switchname"
    oid = "RFC1213-MIB::sysName.0"
    is_tag = true

  [[inputs.snmp.table]]
    name = "network"
    inherit_tags = [ "switchname" ]
    oid = "HOST-RESOURCES-MIB::hrProcessorTable"

  [[inputs.snmp.table]]
    name = "network_interface"
    inherit_tags = [ "switchname" ]
    oid = "IF-MIB::ifTable"

    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

  # IF-MIB::ifXTable contains newer High Capacity (HC) counters that do not overflow as fast for a few of the ifTable counters
  [[inputs.snmp.table]]
    name = "network_interface_x"
    inherit_tags = [ "switchname" ]
    oid = "IF-MIB::ifXTable"

    # Interface tag - used to identify network in metrics database
    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

  # EtherLike-MIB::dot3StatsTable contains detailed ethernet-level information about what kind of errors have been logged on a network (such as FCS error, frame too long, etc)
  [[inputs.snmp.table]]
    name = "network_interface_stats"
    inherit_tags = [ "switchname" ]
    oid = "EtherLike-MIB::dot3StatsTable"

    # Interface tag - used to identify network in metrics database
    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

@danielnelson
Contributor

Interesting, could you check 1.6.3 too then?

@mpetersen42

mpetersen42 commented Jul 6, 2018

@danielnelson 1.6.3 seems to work.

> network_interface,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 ifOutQLen=0i,ifDescr="Ethernet47",ifMtu=9214i,ifSpeed=1000000000i,ifAdminStatus=1i,ifInUcastPkts=711722197i,ifInNUcastPkts=827106i,ifInDiscards=0i,ifInErrors=0i,ifOutOctets=2377725104i,ifOutDiscards=76055i,ifOutErrors=0i,ifType=6i,ifOperStatus=1i,ifInOctets=2118086630i,ifInUnknownProtos=0i,ifOutUcastPkts=917337878i,ifOutNUcastPkts=3695348397i,ifPhysAddress="00:1c:73:dd:e3:46",ifLastChange=3383112189i,ifSpecific=".1.3.6.1.2.1.10.7" 1530887198000000000
> network_interface_x,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 ifHighSpeed=1000i,ifConnectorPresent=1i,ifName="Ethernet47",ifInMulticastPkts=699791i,ifHCInUcastPkts=711722197i,ifHCInBroadcastPkts=127315i,ifHCOutUcastPkts=917337878i,ifOutMulticastPkts=3642459269i,ifPromiscuousMode=2i,ifCounterDiscontinuityTime=0i,ifOutBroadcastPkts=52889128i,ifHCInOctets=96607367142i,ifHCInMulticastPkts=699791i,ifHCOutBroadcastPkts=52889128i,ifAlias="CHIWKSPRD189_Grimes",ifInBroadcastPkts=127315i,ifHCOutOctets=2257235555504i,ifHCOutMulticastPkts=12232393861i,ifLinkUpDownTrapEnable=1i 1530887199000000000
> network_interface_stats,env=PROD,owner=Team\ Gray,site=RIV,ifName=Ethernet47,switchname=chi7050t10G001 dot3StatsFCSErrors=0i,dot3StatsMultipleCollisionFrames=0i,dot3StatsFrameTooLongs=0i,dot3StatsEtherChipSet=".0.0",dot3StatsDuplexStatus=3i,dot3StatsAlignmentErrors=0i,dot3StatsSingleCollisionFrames=0i,dot3StatsExcessiveCollisions=0i,dot3StatsSymbolErrors=0i,dot3StatsRateControlAbility=2i,dot3StatsSQETestErrors=0i,dot3StatsDeferredTransmissions=0i,dot3StatsLateCollisions=0i,dot3StatsInternalMacTransmitErrors=0i,dot3StatsCarrierSenseErrors=0i,dot3StatsInternalMacReceiveErrors=0i,dot3StatsRateControlStatus=3i 1530887199000000000

@danielnelson
Contributor

So it was broken in 1.6.4, but what's odd is that both of these metrics are valid. Perhaps the data is still being added but the query you are using is now broken.

For instance, if you were doing this before:

select * from network_interface_x where ifName = 'wlan0'

I believe you will now need to do this because ifName is only a tag:

select * from network_interface_x where ifName::tag = 'wlan0'

@mpetersen42

@danielnelson

I guess it is something with the query. I've always had ifName set up as a tag (see config above), so I don't know exactly what is going on or why the query results are different. With a fresh install of Telegraf 1.7.1 and InfluxDB 1.5.4, this gets results:

SELECT non_negative_derivative(mean("ifHCInOctets"), 1s) /8 AS "Traffic In", non_negative_derivative(mean("ifHCOutOctets"), 1s) /-8 AS "Traffic Out", non_negative_derivative(mean("ifHCInMulticastPkts"), 1s) /-8 AS "Multicast In", non_negative_derivative(mean("ifHCOutMulticastPkts"), 1s) /-8 AS "Multicast Out" FROM "network_interface_x" WHERE ("switchname" =~ /^CHI7150S002$/ AND "ifName" =~ /^Ethernet1$/) AND time >= now() - 1m GROUP BY time(15s) fill(null)

And this gets the same results:

SELECT non_negative_derivative(mean("ifHCInOctets"), 1s) /8 AS "Traffic In", non_negative_derivative(mean("ifHCOutOctets"), 1s) /-8 AS "Traffic Out", non_negative_derivative(mean("ifHCInMulticastPkts"), 1s) /-8 AS "Multicast In", non_negative_derivative(mean("ifHCOutMulticastPkts"), 1s) /-8 AS "Multicast Out" FROM "network_interface_x" WHERE ("switchname" =~ /^CHI7150S002$/ AND ifName::tag =~ /^Ethernet1$/) AND time >= now() - 1m GROUP BY time(15s) fill(null)

But on the existing install with Telegraf 1.7.1 and InfluxDB 1.5.4 this is empty:

SELECT non_negative_derivative(mean("ifHCInOctets"), 1s) /8 AS "Traffic In", non_negative_derivative(mean("ifHCOutOctets"), 1s) /-8 AS "Traffic Out", non_negative_derivative(mean("ifHCInMulticastPkts"), 1s) /-8 AS "Multicast In", non_negative_derivative(mean("ifHCOutMulticastPkts"), 1s) /-8 AS "Multicast Out" FROM "network_interface_x" WHERE ("switchname" =~ /^CHI7150S002$/ AND "ifName" =~ /^Ethernet1$/) AND time >= now() - 1m GROUP BY time(15s) fill(null)

While this gets expected results:

SELECT non_negative_derivative(mean("ifHCInOctets"), 1s) /8 AS "Traffic In", non_negative_derivative(mean("ifHCOutOctets"), 1s) /-8 AS "Traffic Out", non_negative_derivative(mean("ifHCInMulticastPkts"), 1s) /-8 AS "Multicast In", non_negative_derivative(mean("ifHCOutMulticastPkts"), 1s) /-8 AS "Multicast Out" FROM "network_interface_x" WHERE ("switchname" =~ /^CHI7150S002$/ AND ifName::tag =~ /^Ethernet1$/) AND time >= now() - 1m GROUP BY time(15s) fill(null)

@danielnelson
Contributor

I believe this was accidentally caused by #4203, in which we fixed a bug that was causing snmp fields marked with is_tag to be added as both a tag and a field.
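To make the difference visible in the `--test` output posted earlier in this thread, here is a minimal Python sketch that lists keys appearing as both a tag and a field in a line-protocol line. It is not part of Telegraf or InfluxDB; the helper names `split_unescaped` and `tag_and_field_keys` are hypothetical, the sample lines are shortened from the output above, and the parser is simplified (it assumes no literal double quotes in the measurement or tag section).

```python
def split_unescaped(text, sep):
    """Split text on sep, skipping separators that are backslash-escaped
    or inside a double-quoted string."""
    parts, buf, in_quotes, i = [], [], False, 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            buf.append(text[i:i + 2])  # keep the escape sequence intact
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    parts.append("".join(buf))
    return parts


def tag_and_field_keys(line):
    """Return (tag_keys, field_keys) for a single line-protocol line."""
    line = line.lstrip("> ").strip()  # drop the "> " prefix printed by --test
    head, fields = split_unescaped(line, " ")[:2]
    tag_keys = {kv.split("=", 1)[0] for kv in split_unescaped(head, ",")[1:]}
    field_keys = {kv.split("=", 1)[0] for kv in split_unescaped(fields, ",")}
    return tag_keys, field_keys


# Shortened samples: the older output carries ifName as a tag AND a field.
old_line = r'> network_interface_x,owner=Team\ Gray,ifName=Ethernet47 ifHighSpeed=10000i,ifAlias="CHIWKSPRD189_Grimes",ifName="Ethernet47" 1530814174000000000'
new_line = r'> network_interface_x,owner=Team\ Gray,ifName=Ethernet47 ifAlias="CHIWKSPRD189_Grimes",ifHighSpeed=10000i 1530814002000000000'

for label, line in (("old", old_line), ("new", new_line)):
    tags, fields = tag_and_field_keys(line)
    print(label, "keys present as both tag and field:", sorted(tags & fields))
```

A key that shows up in both sets is the situation in which an unqualified `ifName` in a WHERE clause and `ifName::tag` can behave differently.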

The requirement to add ::tag should not apply to queries that span only the new InfluxDB shards created after the change. Here is how you can check the duration of your shard groups (168 hours in this example):

$ influx
Connected to http://localhost:8086 version unknown
InfluxDB shell version: unknown
> use telegraf
Using database telegraf
> show retention policies
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true

So in my case I would need to wait up to 168h and then any queries for this time and newer should work as before.
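As a rough illustration of that waiting period, the following Python sketch parses a `shardGroupDuration` value such as the one printed above and computes the latest time by which a new shard group must have started. It is an upper bound only and assumes nothing about how shard group boundaries are aligned; the function names and the upgrade timestamp are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_duration(text):
    """Parse a duration such as '168h0m0s' into a timedelta (h, m, s units only)."""
    parts = re.findall(r"(\d+)([hms])", text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum((timedelta(**{_UNITS[u]: int(n)}) for n, u in parts), timedelta())


def latest_new_shard_start(upgraded_at, shard_group_duration):
    """Upper bound: a new shard group begins no later than one full
    shard group duration after the upgrade."""
    return upgraded_at + parse_duration(shard_group_duration)


# Hypothetical upgrade time; substitute the moment Telegraf was upgraded.
upgraded_at = datetime(2018, 7, 5, 12, 0, tzinfo=timezone.utc)
print(latest_new_shard_start(upgraded_at, "168h0m0s"))
```

Queries whose time range starts after the printed timestamp should touch only shards written after the change.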

There isn't really a smooth way to transition from the prior situation without rewriting your historical data; even if we added a config switch, it wouldn't be possible to migrate gradually. I believe it may be best to modify your queries to use ::tag, at least until your old data has expired. This is safe both before and after the change.

However, I know that updating all queries will be an issue for some. I wonder if we should add an option that would continue to add fields marked with is_tag as both tags and fields?

@Touchedegris

@mpetersen42 Could you test with 1.6.4? This version introduced the change between your two output samples. (ifName is only added as a tag)

In some cases, it would be nice to have tags also stored as field values. In Grafana dashboards, the singlestat plugin cannot display tag values, only field values. With dynamic dashboards (with variables), I was trying to show my tag value somewhere in the dashboard. Should this fix be optional (with a parameter like is_tag_and_value = true)?

@danielnelson
Contributor

@Touchedegris I don't think we will add an option for this, and instead rely on the query workaround.

I think you should be able to duplicate your tag to a field though using a combination of the regex and converter processors. First copy the tag using regex, then convert one of the copies to a field in converter. Use the order option to enforce ordering.
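The two-step idea can be modelled in plain Python to show why the ordering matters. This is only a conceptual sketch of the data flow, not Telegraf's processor API or configuration; the dict-based metric, the function names, and the key `ifName_copy` are all hypothetical.

```python
def copy_tag(metric, key, result_key):
    """Step 1 (the regex processor's role here): copy a tag under a new key."""
    tags = dict(metric["tags"])
    if key in tags:
        tags[result_key] = tags[key]
    return {**metric, "tags": tags}


def tag_to_field(metric, key):
    """Step 2 (the converter processor's role here): move a tag into the fields."""
    tags, fields = dict(metric["tags"]), dict(metric["fields"])
    if key in tags:
        fields[key] = tags.pop(key)
    return {**metric, "tags": tags, "fields": fields}


metric = {
    "name": "network_interface_x",
    "tags": {"ifName": "Ethernet47"},
    "fields": {"ifHighSpeed": 10000},
}

# Correct order: copy first, then convert the copy.
ordered = tag_to_field(copy_tag(metric, "ifName", "ifName_copy"), "ifName_copy")
print(ordered["tags"], ordered["fields"])

# Wrong order: the copy does not exist yet when the conversion runs,
# so it stays a tag and no field is created.
unordered = copy_tag(tag_to_field(metric, "ifName_copy"), "ifName", "ifName_copy")
print(unordered["tags"], unordered["fields"])
```

In the correctly ordered run the original `ifName` tag is preserved and its value also becomes available as a field, which is what a panel that can only display field values would need.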

7 participants