-
Notifications
You must be signed in to change notification settings - Fork 801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Text parser optimization (~4.5x perf) #282
Conversation
Those aspects of the text format are not optional, they must be implemented to have a correct parser. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Performance improvements would be great, however the format has many ways it can be represented and this code only parses a subset of the potential valid input. If you can manage to make it work with that, that'd be great.
@@ -181,6 +181,10 @@ def __eq__(self, other): | |||
self.type == other.type and | |||
self.samples == other.samples) | |||
|
|||
def __repr__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A __str__
would make more sense I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed it to __str__
and having repr
call str
because it's useful for comparing unit test output:
E First differing element 0:
E <Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1), (u'a', {u'foo': u'baez'}, 2), (u'a', {u'foo': u'buz'}, 3)]>
E <Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1.0), (u'a', {u'foo': u'baz'}, 2.0), (u'a', {u'foo': u'buz'}, 3.0)]>
E
E - [<Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1), (u'a', {u'foo': u'baez'}, 2), (u'a', {u'foo': u'buz'}, 3)]>]
E ? -
E
E + [<Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1.0), (u'a', {u'foo': u'baz'}, 2.0), (u'a', {u'foo': u'buz'}, 3.0)]>]
E ?
instead of
E First differing element 0:
E <prometheus_client.core.CounterMetricFamily object at 0x108018c50>
E <prometheus_client.core.Metric object at 0x108018cd0>
E
E - [<prometheus_client.core.CounterMetricFamily object at 0x108018c50>]
E ? ------- ------ ^
E
E + [<prometheus_client.core.Metric object at 0x108018cd0>]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually str calls repr rather than the other way around. repr should usually be an instantiatable version of object, while str is more human readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified repr
to fit what it's supposed to be, still way more readable for tests outputs 👍
prometheus_client/parser.py
Outdated
slash = True | ||
else: | ||
result.append(char) | ||
def _replace_escaping(s): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Help and label values have different escaping rules (double quote is the difference), you need two functions for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added another function specific for Help
prometheus_client/parser.py
Outdated
return ''.join(result) | ||
def _parse_labels(labels_string): | ||
labels = {} | ||
# return if we don't have valid labels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep comments as full sentances
prometheus_client/parser.py
Outdated
|
||
# we don't have labels | ||
except ValueError: | ||
# detect what separator is used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any mix of any number of spaces and tabs is permitted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some additional unit tests around this 👍
prometheus_client/parser.py
Outdated
label_start, label_end = text.index("{"), text.rindex("}") | ||
# the name is before the labels | ||
name = text[:label_start].strip() | ||
# we ignore the starting curly brace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there can be whitespace after the brace, and basically everywhere else between things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be covered, I added one additional test case in test_spaces
prometheus_client/parser.py
Outdated
state = 'name' | ||
# detect the labels in the text | ||
try: | ||
label_start, label_end = text.index("{"), text.rindex("}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A label value could contain a }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This already taken into account with the use of rindex
, added a test case to validate this point.
prometheus_client/parser.py
Outdated
name = text[:label_start].strip() | ||
# we ignore the starting curly brace | ||
label = text[label_start + 1:label_end] | ||
# the value is after the label end (ignoring curly brace and space) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can be a trailing comma after the last "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be already covered, test_commas
is validating this
41edcb8
to
a25d444
Compare
Signed-off-by: Pierre Margueritte <mfpierre@gmail.com>
a25d444
to
1d7190c
Compare
i = 0 | ||
while i < len(value_substr): | ||
i = value_substr.index('"', i) | ||
if value_substr[i - 1] != "\\": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if if you have x=""
as the label? i - 1
will be -1, which might have unexpected results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a unit-test around empty labels, but this works fine 👍
Signed-off-by: Pierre <mfpierre@gmail.com>
Thanks! |
Hey @brian-brazil, any plans to do a release soon? 🙇 |
I've added it to my todo list |
Hi,
As we extensively use this library for parsing prometheus text format, we have observed that for big payloads (like https://github.com/kubernetes/kube-state-metrics for a big cluster) that can contains 400k+ lines the text parsing can be quite slow. (up to 27secs)
We tried to optimize the parser by dropping the state machine and leveraging native python functions such as
index
orfind
and it gives us an average of x5 performances.Here are some benchmark using
timeit
:For the KSM payload (400k lines) the parsing goes from ~27sec to ~4.7sec
Note: We could go up to almost 10x performance if we dropped some edge-cases treatment (like escaping, tab/space, etc...) could we consider a "strict" parsing mode that we could optionally use for "good citizens"?