Track Zkid counter reported from zookeeper srvr command #8938

Closed
matschaffer opened this issue Nov 6, 2018 · 20 comments

Labels: enhancement, good first issue, Metricbeat, module, Team:Integrations

Comments

@matschaffer
Contributor

matschaffer commented Nov 6, 2018

It would be useful to track at least the Zxid counter reported from the srvr four letter word.

This could help identify cases where the transaction rate changes abruptly due to a poorly-coded zk client.

Opening this to track as an enhancement request.

@ruflin ruflin added the module label Nov 6, 2018
@ruflin ruflin added the Team:Integrations Label for the Integrations team label Nov 21, 2018
@ruflin
Contributor

ruflin commented Jan 22, 2019

I had a quick look at the output of the command:

Clients:
 /172.17.0.1:36782[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/0/0
Received: 2
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x0
Mode: standalone
Node count: 4

@matschaffer To get started, would the Zxid value be enough? Could you share a non-zero Zxid value so we can make sure we match the right content? What are the other metrics you are most interested in?

@webmat
Contributor

webmat commented Jan 22, 2019

LOL I see the wrong Mat was pinged.

I do have a suggestion nonetheless: IDs should always be keywords anyway :-)

Just my 2¢ :-)

@matschaffer
Contributor Author

zxid is definitely the main one that's been called out. The latency is interesting though, since mntr doesn't provide min/max afaik.

Of course if it's feasible to provide all of it from beats and we can later drop fields via config, that's great too. It's small enough that I'm not worried about transfer/storage sizes.

I'll pull some staging examples today and we can go from there.

@matschaffer
Contributor Author

(for some definitions of "today") Anyway here's your output from one of our staging hosts:

Zookeeper version: 3.5.3-beta-8ce24f9e675cbefffb8f21a47e06b42864475a60, built on 04/03/2017 16:19 GMT
Clients:
 /127.0.0.1:55470[1](queued=0,recved=108675,sent=108675)
 /127.0.0.1:58008[1](queued=0,recved=22774,sent=23216)
 /127.0.0.1:56374[1](queued=0,recved=670785,sent=693985)
 /127.0.0.1:55510[1](queued=0,recved=38149,sent=38152)
 /127.0.0.1:34000[1](queued=0,recved=728189,sent=728196)
 /127.0.0.1:59224[1](queued=0,recved=14530,sent=14562)
 /127.0.0.1:53840[1](queued=0,recved=29889,sent=29891)
 /127.0.0.1:56312[1](queued=0,recved=648515,sent=648515)
 /127.0.0.1:60836[1](queued=0,recved=106564,sent=107135)
 /127.0.0.1:36820[1](queued=0,recved=33903,sent=36094)
 /127.0.0.1:58010[1](queued=0,recved=17896,sent=17988)
 /127.0.0.1:60604[1](queued=0,recved=93970,sent=94025)
 /127.0.0.1:60608[1](queued=0,recved=93932,sent=93967)
 /127.0.0.1:35020[1](queued=0,recved=4336,sent=4367)
 /127.0.0.1:34822[1](queued=0,recved=4530,sent=4561)
 /127.0.0.1:33784[1](queued=0,recved=728187,sent=728191)
 /127.0.0.1:59278[1](queued=0,recved=14365,sent=14404)
 /127.0.0.1:43770[1](queued=0,recved=141817,sent=141817)
 /127.0.0.1:59286[1](queued=0,recved=20932,sent=21313)
 /127.0.0.1:34872[1](queued=0,recved=4442,sent=4473)
 /127.0.0.1:56482[1](queued=0,recved=282269,sent=305702)
 /127.0.0.1:60704[1](queued=0,recved=148863,sent=149354)
 /127.0.0.1:60596[1](queued=0,recved=93973,sent=94028)
 /127.0.0.1:57350[1](queued=0,recved=641426,sent=641426)
 /127.0.0.1:57990[1](queued=0,recved=17793,sent=17793)
 /127.0.0.1:34828[1](queued=0,recved=4519,sent=4551)
 /127.0.0.1:57574[1](queued=0,recved=108795,sent=111849)
 /127.0.0.1:56416[1](queued=0,recved=670446,sent=693555)
 /127.0.0.1:57994[1](queued=0,recved=17918,sent=18012)
 /127.0.0.1:34848[1](queued=0,recved=4480,sent=4511)
 /127.0.0.1:56226[1](queued=0,recved=476748,sent=615826)
 /127.0.0.1:60712[1](queued=0,recved=153108,sent=153689)
 /127.0.0.1:59228[1](queued=0,recved=21022,sent=21405)
 /127.0.0.1:34178[1](queued=0,recved=3899,sent=3899)
 /127.0.0.1:56536[1](queued=0,recved=647234,sent=647234)
 /127.0.0.1:56346[1](queued=0,recved=648248,sent=648248)
 /127.0.0.1:39442[1](queued=0,recved=728117,sent=728119)
 /127.0.0.1:57224[1](queued=0,recved=23761,sent=24225)
 /127.0.0.1:34856[1](queued=0,recved=4465,sent=4498)
 /127.0.0.1:56298[1](queued=0,recved=648558,sent=648558)
 /127.0.0.1:56508[1](queued=0,recved=669866,sent=692876)
 /127.0.0.1:60698[1](queued=0,recved=148901,sent=149397)
 /127.0.0.1:60838[1](queued=0,recved=93375,sent=93375)
 /127.0.0.1:56210[1](queued=0,recved=672031,sent=695455)
 /127.0.0.1:60722[1](queued=0,recved=151429,sent=151977)
 /127.0.0.1:34858[1](queued=0,recved=4463,sent=4494)
 /127.0.0.1:60716[1](queued=0,recved=148894,sent=149393)
 /127.0.0.1:57816[1](queued=0,recved=661157,sent=683953)
 /127.0.0.1:59822[1](queued=0,recved=12527,sent=12527)
 /127.0.0.1:56206[1](queued=0,recved=649175,sent=649175)
 /127.0.0.1:35026[1](queued=0,recved=4323,sent=4355)
 /127.0.0.1:56300[1](queued=0,recved=671292,sent=694575)
 /127.0.0.1:56408[1](queued=0,recved=269207,sent=286879)
 /127.0.0.1:60642[1](queued=0,recved=93973,sent=94029)
 /127.0.0.1:57290[1](queued=0,recved=641892,sent=641892)
 /127.0.0.1:34860[1](queued=0,recved=4463,sent=4495)
 /127.0.0.1:33770[1](queued=0,recved=728255,sent=728267)
 /127.0.0.1:49636[1](queued=0,recved=584468,sent=584468)
 /127.0.0.1:60644[1](queued=0,recved=93965,sent=94022)
 /127.0.0.1:33262[1](queued=0,recved=91455,sent=91455)
 /127.0.0.1:41338[1](queued=0,recved=93389,sent=96221)
 /127.0.0.1:43606[1](queued=0,recved=142364,sent=142364)
 /127.0.0.1:60724[1](queued=0,recved=149341,sent=149964)
 /127.0.0.1:60638[1](queued=0,recved=93951,sent=94006)
 /127.0.0.1:57852[1](queued=0,recved=638753,sent=638753)
 /127.0.0.1:60672[1](queued=0,recved=93652,sent=93652)
 /127.0.0.1:60598[1](queued=0,recved=93982,sent=94041)
 /127.0.0.1:60602[1](queued=0,recved=93974,sent=94030)
 /127.0.0.1:57992[1](queued=0,recved=22785,sent=23227)
 /127.0.0.1:57844[1](queued=0,recved=660998,sent=683789)
 /127.0.0.1:60572[1](queued=0,recved=93979,sent=94034)
 /127.0.0.1:34892[1](queued=0,recved=4391,sent=4422)
 /127.0.0.1:56448[1](queued=0,recved=647559,sent=647559)
 /127.0.0.1:49832[1](queued=0,recved=583810,sent=583810)
 /127.0.0.1:42784[1](queued=0,recved=695274,sent=695274)
 /127.0.0.1:60522[1](queued=0,recved=93843,sent=93843)
 /127.0.0.1:60726[1](queued=0,recved=148837,sent=149327)
 /127.0.0.1:54558[1](queued=0,recved=27744,sent=27744)
 /127.0.0.1:49514[1](queued=0,recved=202401,sent=214305)
 /127.0.0.1:34050[1](queued=0,recved=41676,sent=42017)
 /127.0.0.1:56276[1](queued=0,recved=648664,sent=648664)
 /127.0.0.1:60714[1](queued=0,recved=148853,sent=149344)
 /127.0.0.1:32806[1](queued=0,recved=433546,sent=546517)
 /127.0.0.1:57572[1](queued=0,recved=108827,sent=111888)
 /127.0.0.1:33756[1](queued=0,recved=728253,sent=728267)
 /127.0.0.1:36828[1](queued=0,recved=34738,sent=37419)
 /127.0.0.1:57854[1](queued=0,recved=661012,sent=683802)
 /127.0.0.1:57496[1](queued=0,recved=254180,sent=263494)
 /127.0.0.1:60600[1](queued=0,recved=93974,sent=94029)
 /127.0.0.1:59284[1](queued=0,recved=14287,sent=14369)
 /127.0.0.1:56306[1](queued=0,recved=326326,sent=374945)
 /127.0.0.1:57896[1](queued=0,recved=638559,sent=638559)
 /127.0.0.1:57898[1](queued=0,recved=660790,sent=683551)
 /127.0.0.1:60606[1](queued=0,recved=93993,sent=94053)
 /127.0.0.1:34052[1](queued=0,recved=4310,sent=4310)
 /127.0.0.1:60718[1](queued=0,recved=148847,sent=149336)
 /127.0.0.1:57222[1](queued=0,recved=19903,sent=19948)
 /127.0.0.1:34890[1](queued=0,recved=4394,sent=4427)
 /127.0.0.1:49668[1](queued=0,recved=590502,sent=596779)
 /127.0.0.1:55942[1](queued=0,recved=651041,sent=651041)
 /127.0.0.1:39112[1](queued=0,recved=73614,sent=73614)
 /127.0.0.1:60574[1](queued=0,recved=94010,sent=94082)
 /127.0.0.1:60528[1](queued=0,recved=93829,sent=93829)
 /127.0.0.1:55954[1](queued=0,recved=650966,sent=650966)
 /127.0.0.1:57538[1](queued=0,recved=18669,sent=18669)
 /127.0.0.1:56414[1](queued=0,recved=647868,sent=647868)
 /127.0.0.1:58006[1](queued=0,recved=17772,sent=17772)
 /127.0.0.1:60570[1](queued=0,recved=93974,sent=94027)
 /127.0.0.1:43020[1](queued=0,recved=57919,sent=60145)
 /127.0.0.1:59330[1](queued=0,recved=14135,sent=14213)
 /127.0.0.1:34838[1](queued=0,recved=4497,sent=4528)
 /127.0.0.1:43652[1](queued=0,recved=153498,sent=192676)
 /127.0.0.1:34866[1](queued=0,recved=4448,sent=4480)
 /127.0.0.1:33910[1](queued=0,recved=728074,sent=728081)
 /127.0.0.1:57518[1](queued=0,recved=256239,sent=321891)
 /172.X.Y.Z:57664[0](queued=0,recved=1,sent=0)
 /127.0.0.1:57950[1](queued=0,recved=17891,sent=17891)
 /127.0.0.1:56394[1](queued=0,recved=262153,sent=271990)
 /127.0.0.1:54550[1](queued=0,recved=27751,sent=27751)
 /127.0.0.1:56472[1](queued=0,recved=670039,sent=693097)
 /127.0.0.1:60696[1](queued=0,recved=148811,sent=149298)
 /127.0.0.1:59810[1](queued=0,recved=12547,sent=12547)
 /127.0.0.1:56388[1](queued=0,recved=648051,sent=648051)
 /127.0.0.1:60640[1](queued=0,recved=93965,sent=94019)
 /127.0.0.1:48170[1](queued=0,recved=18722,sent=19241)
 /127.0.0.1:45966[1](queued=0,recved=224527,sent=224527)
 /127.0.0.1:60500[1](queued=0,recved=144436,sent=144458)
 /127.0.0.1:60720[1](queued=0,recved=148904,sent=149399)
 /172.X.Y.Z:55944[1](queued=0,recved=273064,sent=273068)
 /127.0.0.1:60690[1](queued=0,recved=150345,sent=150843)
 /127.0.0.1:56506[1](queued=0,recved=647387,sent=647387)
 /127.0.0.1:60700[1](queued=0,recved=165862,sent=166354)
 /127.0.0.1:57542[1](queued=0,recved=18803,sent=18907)
 /127.0.0.1:34876[1](queued=0,recved=4426,sent=4457)
 /127.0.0.1:60634[1](queued=0,recved=93974,sent=94031)

Latency min/avg/max: 0/0/3383
Received: 103508131
Sent: 109405284
Connections: 135
Outstanding: 0
Zxid: 0x700601132
Mode: follower
Node count: 68534

@ruflin
Contributor

ruflin commented Jan 24, 2019

This is great, thanks. For Zxid, I expect we ingest it as a keyword, the way it is here?

@ruflin
Contributor

ruflin commented Jan 24, 2019

BTW: Seeing how many clients you have, this could be a separate metricset sending one event per entry for the above.

@matschaffer
Contributor Author

Yeah, being able to identify those queued=0,recved=273064,sent=273068 per client seems like it could be useful.

I thought I saw @zenitraM mention that the zxid is an incrementing number, but it's probably best if he (or @pmoust or @stejacks) weighed in. I'm not super familiar with how this data has been used in troubleshooting so far, but they should be 😄

@sayden
Contributor

sayden commented Jan 24, 2019

Just my 2¢, maybe we should also add some more info (in different steps). According to Zookeeper documentation: Three of the more interesting commands: "stat" gives some general information about the server and connected clients, while "srvr" and "cons" give extended details on server and connections respectively

So maybe we should create stat, srvr and cons Metricsets (beginning with what @matschaffer was asking for first, of course 😄 )

@sayden
Contributor

sayden commented Jan 25, 2019

According to the Zookeeper docs, the zxid is a 64-bit number with two parts: the high-order 32 bits hold an epoch (each leader in the cluster will have a different epoch) and the low-order 32 bits hold a counter (for the transactions "inspected" by the previous leader). Because it has two parts, the zxid can be represented both as a number and as a pair of integers, (epoch, count). Wonderful. 🙄

Also written there: Zookeeper guarantees a total order of messages, and the zxid is the Zookeeper Transaction ID, which is the incrementing number that @matschaffer was mentioning.
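
As a minimal sketch of the epoch/counter split described above, assuming the Zxid arrives as the hex string printed by srvr (the function name split_zxid is hypothetical and not part of Metricbeat):

```python
from typing import Tuple


def split_zxid(zxid_hex: str) -> Tuple[int, int]:
    """Split a hex zxid string into (epoch, count)."""
    value = int(zxid_hex, 16)    # base 16 also accepts the "0x" prefix
    epoch = value >> 32          # high-order 32 bits
    count = value & 0xFFFFFFFF   # low-order 32 bits
    return epoch, count


# The staging value posted earlier in this thread.
print(split_zxid("0x700601132"))  # (7, 6295858)
```

With the staging value, the epoch comes out as 7 and the counter as 6295858 (0x601132).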

Ok so I have been having fun with regexp to capture all the data that is returned from those commands. Now I wonder how to structure it between metricsets. The idea I had is the following:

  • a server metricset (or some better name if you have an idea) that monitors the Zookeeper server as a side-car installation. It fetches and parses the information coming from srvr which includes zxid. The data would look like this:
{
	"version": "Zookeeper version: 3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 04:05 GMT",
	"latency": {
		"min": 0,
		"avg": 0,
		"max": 0
	},
	"received": 9,
	"sent": 8,
	"connections": 1,
	"outstanding": 0,
	"zxid": {
		"epoch": 23,
		"count": 3442
	},
	"mode": "standalone",
	"node_count": 4,
	"proposal_sizes": {
		"last": -1,
		"min": -1,
		"max": -1
	}
}
  • a connected_clients metricset (or some better name if you have an idea) which sends one event for each connected client with the info that appears there. Forgive me, but I don't know what the number in brackets, [0] or [1], means; I was guessing that maybe it shows whether the connection is still active or not. It would look similar to this:
{
    "client":{
        "ip":"127.0.0.1",
        "port":1234
    },
    "queued":0,
    "received":1234,
    "sent":12345
}
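
A rough sketch of how one client line could be turned into the event shape above. The regular expression and the name parse_client_line are assumptions for illustration, not the Metricbeat implementation. It assumes IPv4-style addresses as in the samples in this thread, and it skips the bracketed number because its meaning has not been established here.

```python
import re

# Matches lines such as:
#   /127.0.0.1:55470[1](queued=0,recved=108675,sent=108675)
CLIENT_LINE = re.compile(
    r"^\s*/(?P<ip>[^:\[\]]+):(?P<port>\d+)"
    r"\[[^\]]*\]"  # bracketed value, deliberately not interpreted
    r"\(queued=(?P<queued>\d+),recved=(?P<received>\d+),sent=(?P<sent>\d+)\)\s*$"
)


def parse_client_line(line):
    """Return an event dict for one client line, or None if it does not match."""
    match = CLIENT_LINE.match(line)
    if match is None:
        return None
    return {
        "client": {"ip": match["ip"], "port": int(match["port"])},
        "queued": int(match["queued"]),
        "received": int(match["received"]),
        "sent": int(match["sent"]),
    }


print(parse_client_line(" /127.0.0.1:55470[1](queued=0,recved=108675,sent=108675)"))
```

Lines that are not client entries (for example "Zxid: 0x0") return None, so the same function can be run over every line of the output.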

I can already start with the server metricset, which contains the zxid data and is pretty straightforward.
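
To make the server event concrete, here is a hedged sketch that parses the summary lines of the srvr output shown earlier in this thread. The function name parse_srvr is hypothetical. Latency values are read as floats on the assumption that some versions may print a fractional average, and proposal_sizes is left out because no sample of that line appears in this thread.

```python
def parse_srvr(output):
    """Turn the summary lines of `srvr` output into a nested dict."""
    int_fields = {
        "Received": "received",
        "Sent": "sent",
        "Connections": "connections",
        "Outstanding": "outstanding",
        "Node count": "node_count",
    }
    event = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines carry no key/value pair
        key, value = key.strip(), value.strip()
        if key == "Zookeeper version":
            event["version"] = value  # text after the first colon
        elif key == "Latency min/avg/max":
            low, avg, high = (float(part) for part in value.split("/"))
            event["latency"] = {"min": low, "avg": avg, "max": high}
        elif key == "Zxid":
            raw = int(value, 16)  # accepts the "0x" prefix
            event["zxid"] = {"epoch": raw >> 32, "count": raw & 0xFFFFFFFF}
        elif key == "Mode":
            event["mode"] = value
        elif key in int_fields:
            event[int_fields[key]] = int(value)
        # Anything else (the "Clients:" header, client entries) is ignored.
    return event
```

Client entries also contain a colon, but their key (the address) matches none of the known names, so they fall through untouched.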

@ruflin
Contributor

ruflin commented Jan 25, 2019

I like that idea. I was not aware there is also a srvr command. server metricset sounds good as a name. Let's focus on the server metricset first. The data structure also LGTM, except that "Zookeeper version:" is inside version ;-).

@zenitraM @pmoust @stejacks Could you comment on the expected format for zxid?

@sayden For the connection metricset, it's better to open a separate issue so we can have a discussion there.

@pmoust
Member

pmoust commented Jan 25, 2019

Could you comment on the expected format for zxid?

Per ZK docs;

The zxid has two parts: the epoch and a counter. In our implementation the zxid is a 64-bit number. We use the high order 32-bits for the epoch and the low order 32-bits for the counter. Because it has two parts represent the zxid both as a number and as a pair of integers, (epoch, count). The epoch number represents a change in leadership. Each time a new leader comes into power it will have its own epoch number. We have a simple algorithm to assign a unique zxid to a proposal: the leader simply increments the zxid to obtain a unique zxid for each proposal. Leadership activation will ensure that only one leader uses a given epoch, so our simple algorithm guarantees that every proposal will have a unique id.

I would like to have both representations. The single representation of zxid, and the hi/lo of epoch and count.
Reasoning being that zxid is how one would do a targeted search to a unique transaction, epoch points to a leader at the time of event (and can help infer leader events) and count is obvious.

_________________________________
|           zxid (64bit)        |
| epoch (32bit) | count (32bit) |
|_______________________________|

In that sense, I would see them as top-level items of the server metricset: zxid being a long, epoch and count being integers. Does that resonate with you?

Note: I am assuming that, since ZK is a project that predates Java 8, when they mention a 64-bit integer they mean signed, so the ES long datatype would cover it. If not, we can make it a keyword, I suppose. Elasticsearch won't be supporting unsigned 64-bit AFAIK(?).

@ruflin
Contributor

ruflin commented Jan 25, 2019

The only problem here is that we can't have zxid as key and object, so I would propose:

"zxid": {
	"original": "0x700601132",
	"epoch": 23,
	"count": 3442
},

For the original field, do you expect typical numeric queries / aggregations like > or sum? If not, we could leave it in the original format and ingest it as keyword.
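
One thing to weigh here, sketched with made-up short zxid values next to the staging one: if comparisons on a keyword follow plain string ordering (an assumption about how the field would be queried), the hex strings do not sort the way the numbers do.

```python
# "0x9" and "0x10" are hypothetical values chosen to show the effect.
zxids = ["0x9", "0x10", "0x700601132"]

lexical = sorted(zxids)                             # string ordering
numeric = sorted(zxids, key=lambda z: int(z, 16))   # numeric ordering

print(lexical)  # ['0x10', '0x700601132', '0x9']
print(numeric)  # ['0x9', '0x10', '0x700601132']
```

So > style queries would only behave as expected on a numeric representation, while exact-match lookups work fine on the keyword.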

@pmoust
Member

pmoust commented Jan 25, 2019

The only problem here is that we can't have zxid as key and object

Agreed - this is why I mentioned above that the three of them should be top-level items for that metricset. I am not that excited about having .zxid.original; it seems more intuitive to have .zxid, .epoch and .count, i.e.:

{
	"version": "Zookeeper version: 3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 04:05 GMT",
	"latency": {
		"min": 0,
		"avg": 0,
		"max": 0
	},
	"received": 9,
	"sent": 8,
	"connections": 1,
	"outstanding": 0,
	"zxid": "0x700601132",
	"epoch": 23,
	"count": 3442,
	"mode": "standalone",
	"node_count": 4,
	"proposal_sizes": {
		"last": -1,
		"min": -1,
		"max": -1
	}
}

For the original field, do you expect typical numeric queries / aggregations like > or sum? If not, we could leave it in the original format and ingest it as keyword.

I'd say that sum wouldn't make a lot of sense for .zxid (but would for count per epoch). Perhaps comparisons would be used on .zxid, but I don't have a solid example where I'd use that over a @timestamp.

In that spirit, I don't have a strong opinion on the numeric vs keyword type, if keyword makes more sense then fine by me. Deferring back to @stejacks @mattfield @matschaffer

@sayden
Contributor

sayden commented Jan 25, 2019

A PR has been opened here: #10341. It is still open to as many changes as necessary 😉

@pmoust
Member

pmoust commented Jan 28, 2019

Closed by #10341

Thanks @sayden @ruflin @matschaffer

@pmoust pmoust closed this as completed Jan 28, 2019
@pmoust pmoust reopened this Jan 29, 2019
@pmoust
Member

pmoust commented Jan 29, 2019

Sorry - this is to track stat not just srvr

@ruflin
Contributor

ruflin commented Jan 30, 2019

@pmoust I assume the part missing now is the client list stats? I suggest we still close this issue and open a follow-up issue containing only that, to have a more focused discussion.

@matschaffer
Contributor Author

matschaffer commented Feb 1, 2019

Looks like we don't include cons by default in our typical zk configs (interesting note, srvr seems to work regardless of 4lw.commands.whitelist), so will have to investigate that output first before I can be sure about followup tickets. But I do see Zxid in srvr which was the main thing we were targeting. Will re-title this ticket accordingly.

@matschaffer matschaffer changed the title Zookeeper stat metricset Track Zkid counter reported from zookeeper srvr command Feb 1, 2019
@matschaffer
Contributor Author

Opened #10475 to follow up. Will confirm the cons output there.

@ruflin
Contributor

ruflin commented Feb 1, 2019

Thanks @matschaffer
