Text parser optimization (~4.5x perf) #282

Merged
merged 2 commits into from
Jun 8, 2018

@mfpierre mfpierre commented Jun 7, 2018

Hi,

As we extensively use this library for parsing the Prometheus text format, we have observed that for big payloads (like https://github.com/kubernetes/kube-state-metrics for a big cluster), which can contain 400k+ lines, the text parsing can be quite slow (up to 27 seconds).

We tried to optimize the parser by dropping the state machine and leveraging native Python string methods such as index or find, which gives us an average speedup of about 5x.

Here are some benchmarks using timeit:

call (x100000): _parse_sample('simple_metric 1.513767429e+09')
previous implementation: 1.10845804214
new implementation: 0.277444839478
improvement: x3.99523755508

call (x100000): _parse_sample('kube_service_labels{label_app="kube-state-metrics",label_chart="kube-state-metrics-0.5.0",label_heritage="Tiller",label_release="ungaged-panther",namespace="default",service="ungaged-panther-kube-state-metrics"} 1')
previous implementation: 7.58089208603
new implementation: 1.48280119896
improvement: x5.11254785291

For the KSM payload (400k lines) the parsing goes from ~27sec to ~4.7sec
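
The general idea of replacing a per-character loop with built-in string search can be sketched as follows. This is only an illustration under stated assumptions: `split_by_loop`, `split_by_find`, and `compare` are hypothetical names, not functions from the library, and they handle only the simple unlabelled `name value` case.

```python
import timeit


def split_by_loop(line):
    """Walk the line one character at a time, state-machine style."""
    name = []
    for i, char in enumerate(line):
        if char in " \t":
            return "".join(name), line[i:].strip()
        name.append(char)
    return "".join(name), ""


def split_by_find(line):
    """Let str.find locate the first separator instead of looping in Python."""
    space, tab = line.find(" "), line.find("\t")
    candidates = [pos for pos in (space, tab) if pos != -1]
    if not candidates:
        return line, ""
    cut = min(candidates)
    return line[:cut], line[cut:].strip()


def compare(line, number=100000):
    """Return (loop_seconds, find_seconds) for `number` calls of each variant."""
    loop_time = timeit.timeit(lambda: split_by_loop(line), number=number)
    find_time = timeit.timeit(lambda: split_by_find(line), number=number)
    return loop_time, find_time
```

Calling `compare('simple_metric 1.513767429e+09')` returns the two timings; the actual ratio depends on the interpreter and the input, so no figure is claimed here.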

Note: We could reach almost 10x performance if we dropped the handling of some edge cases (escaping, tab/space, etc.). Could we consider a "strict" parsing mode that we could optionally use for "good citizens"?

@brian-brazil
Contributor

> could we consider a "strict" parsing mode that we could optionally use for "good citizens"?

Those aspects of the text format are not optional; they must be implemented to have a correct parser.

Contributor

@brian-brazil brian-brazil left a comment

Performance improvements would be great; however, the format has many ways it can be represented, and this code only parses a subset of the potentially valid input. If you can manage to make it work with all of that, that'd be great.

@@ -181,6 +181,10 @@ def __eq__(self, other):
                self.type == other.type and
                self.samples == other.samples)

    def __repr__(self):
Contributor

@brian-brazil brian-brazil Jun 7, 2018

A __str__ would make more sense I think

Contributor Author

Renamed it to __str__ and made repr call str, because it's useful for comparing unit test output:

E       First differing element 0:
E       <Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1), (u'a', {u'foo': u'baez'}, 2), (u'a', {u'foo': u'buz'}, 3)]>
E       <Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1.0), (u'a', {u'foo': u'baz'}, 2.0), (u'a', {u'foo': u'buz'}, 3.0)]>
E
E       - [<Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1), (u'a', {u'foo': u'baez'}, 2), (u'a', {u'foo': u'buz'}, 3)]>]
E       ?                                                                                                                  -
E
E       + [<Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1.0), (u'a', {u'foo': u'baz'}, 2.0), (u'a', {u'foo': u'buz'}, 3.0)]>]
E       ?

instead of

E       First differing element 0:
E       <prometheus_client.core.CounterMetricFamily object at 0x108018c50>
E       <prometheus_client.core.Metric object at 0x108018cd0>
E
E       - [<prometheus_client.core.CounterMetricFamily object at 0x108018c50>]
E       ?                          -------      ------                    ^
E
E       + [<prometheus_client.core.Metric object at 0x108018cd0>]

Contributor

Usually str calls repr rather than the other way around. repr should usually be an instantiable version of the object, while str is more human-readable.
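
A minimal sketch of that convention, using a hypothetical three-field `Metric` stand-in rather than the library's real class: `__repr__` returns something that looks like a constructor call, and `__str__` returns a shorter human-oriented form. If `__str__` were left out, `str()` would fall back to `__repr__`.

```python
class Metric(object):
    """Hypothetical stand-in for illustration, reduced to three fields."""

    def __init__(self, name, documentation, typ):
        self.name = name
        self.documentation = documentation
        self.type = typ

    def __repr__(self):
        # Looks like a constructor call, so it could be pasted back into code.
        return "Metric({!r}, {!r}, {!r})".format(
            self.name, self.documentation, self.type)

    def __str__(self):
        # Shorter form meant for people reading logs or test output.
        return "{} ({}): {}".format(self.name, self.type, self.documentation)
```

Lists print their elements with `repr`, which is why a readable `__repr__` is what shows up in unit test diffs of metric lists.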

Contributor Author

Modified repr to fit what it's supposed to be; it is still far more readable in test output 👍

slash = True
else:
result.append(char)
def _replace_escaping(s):
Contributor

Help and label values have different escaping rules (the double quote is the difference), so you need two functions for this.
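
The difference can be sketched as below. In the text format, HELP text escapes backslash and line feed (`\\`, `\n`), while label values additionally escape the double quote (`\"`). The function and constant names here are hypothetical, and the loop is written for clarity rather than speed; it is not the code from this pull request.

```python
def _unescape(text, escapes):
    """Replace the backslash escapes listed in `escapes`; leave others as-is."""
    result = []
    i = 0
    while i < len(text):
        char = text[i]
        if char == "\\" and i + 1 < len(text) and text[i + 1] in escapes:
            result.append(escapes[text[i + 1]])
            i += 2
        else:
            result.append(char)
            i += 1
    return "".join(result)


# HELP text: only backslash and line feed are escaped.
HELP_ESCAPES = {"\\": "\\", "n": "\n"}
# Label values: the double quote is escaped as well.
LABEL_ESCAPES = {"\\": "\\", "n": "\n", '"': '"'}


def unescape_help(text):
    return _unescape(text, HELP_ESCAPES)


def unescape_label_value(text):
    return _unescape(text, LABEL_ESCAPES)
```

With these definitions, the two-character sequence `\"` becomes a bare `"` in a label value but is left untouched in HELP text.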

Contributor Author

added another function specific for Help

return ''.join(result)
def _parse_labels(labels_string):
labels = {}
# return if we don't have valid labels
Contributor

Please keep comments as full sentences.


# we don't have labels
except ValueError:
# detect what separator is used
Contributor

Any mix of any number of spaces and tabs is permitted
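
One way to tolerate any mix of spaces and tabs, sketched under the assumption that the text after the metric name and labels holds a value and an optional integer timestamp. `split_value_and_timestamp` is a hypothetical helper, not the code under review.

```python
def split_value_and_timestamp(rest):
    """Split the text after the name/labels into (value, timestamp)."""
    # With no argument, str.split treats any run of whitespace
    # (spaces, tabs, or a mix) as a single separator.
    parts = rest.split()
    if not parts:
        raise ValueError("missing sample value")
    value = float(parts[0])
    timestamp = int(parts[1]) if len(parts) > 1 else None
    return value, timestamp
```

This avoids having to detect which separator a given line uses.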

Contributor Author

Added some additional unit tests around this 👍

label_start, label_end = text.index("{"), text.rindex("}")
# the name is before the labels
name = text[:label_start].strip()
# we ignore the starting curly brace
Contributor

There can be whitespace after the brace, and basically everywhere else between tokens.

Contributor Author

this should be covered, I added one additional test case in test_spaces

state = 'name'
# detect the labels in the text
try:
label_start, label_end = text.index("{"), text.rindex("}")
Contributor

A label value could contain a }

Contributor Author

This is already taken into account by the use of rindex; added a test case to validate this point.
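
Why index plus rindex copes with a brace inside a label value can be sketched as follows. `split_labelled_sample` is a hypothetical helper that mirrors the slicing in the diff; it assumes that nothing after the closing brace (the value and optional timestamp) can itself contain `}`.

```python
def split_labelled_sample(text):
    """Split 'name{labels} value' into its three raw parts."""
    label_start = text.index("{")   # first brace opens the label set
    label_end = text.rindex("}")    # last brace closes it
    name = text[:label_start].strip()
    labels = text[label_start + 1:label_end]
    value = text[label_end + 1:].strip()
    return name, labels, value
```

Braces inside a quoted label value sit between those two positions, so they end up in the raw labels string untouched.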

name = text[:label_start].strip()
# we ignore the starting curly brace
label = text[label_start + 1:label_end]
# the value is after the label end (ignoring curly brace and space)
Contributor

There can be a trailing comma after the last "

Contributor Author

should be already covered, test_commas is validating this

@mfpierre mfpierre force-pushed the JulienBalestra/parser branch 5 times, most recently from 41edcb8 to a25d444 Compare June 7, 2018 14:54
Signed-off-by: Pierre Margueritte <mfpierre@gmail.com>
@mfpierre mfpierre force-pushed the JulienBalestra/parser branch from a25d444 to 1d7190c Compare June 7, 2018 15:00
i = 0
while i < len(value_substr):
i = value_substr.index('"', i)
if value_substr[i - 1] != "\\":
Contributor

What if you have x="" as the label? i - 1 will be -1, which might have unexpected results.

Contributor Author

added a unit-test around empty labels, but this works fine 👍
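
For reference, the concern is that in Python an index of -1 wraps around to the last character, so `value_substr[i - 1]` with `i == 0` inspects the end of the string rather than "nothing". A guarded variant could look like the sketch below. `find_closing_quote` is a hypothetical helper, not the merged code, and counting the run of preceding backslashes (so that an escaped backslash before the quote is not mistaken for an escaped quote) goes beyond what this thread discussed.

```python
def find_closing_quote(value_substr):
    """Return the index of the first double quote that is not escaped."""
    i = 0
    while True:
        i = value_substr.index('"', i)  # raises ValueError if none is left
        # Count the backslashes immediately before the quote, never
        # stepping below index 0 (no negative-index wrap-around).
        backslashes = 0
        j = i - 1
        while j >= 0 and value_substr[j] == "\\":
            backslashes += 1
            j -= 1
        if backslashes % 2 == 0:
            return i
        i += 1
```

For an empty label value the quote sits at index 0 and is returned directly, whatever the rest of the string ends with.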

Signed-off-by: Pierre <mfpierre@gmail.com>
@mfpierre mfpierre changed the title Text parser optimization Text parser optimization (~5x perf) Jun 7, 2018
@mfpierre mfpierre changed the title Text parser optimization (~5x perf) Text parser optimization (~4.5x perf) Jun 7, 2018
@brian-brazil brian-brazil merged commit dc15164 into prometheus:master Jun 8, 2018
@brian-brazil
Contributor

Thanks!

@mfpierre
Contributor Author

mfpierre commented Jul 9, 2018

Hey @brian-brazil, any plans to do a release soon? 🙇

@brian-brazil
Contributor

I've added it to my todo list
