fix: Uses pull query metrics for all paths, not just /query #5983
Conversation
Thanks for the patch. Should we be adding tests which check that metrics are computed correctly through every API path?
Yes, we probably should. The main difficulty is that …
```java
    statement,
    executionContext,
    serviceContext,
    pullQueryContext,
    routingOptions
);
```
There was a reason for these measurements to be in the StreamedQueryResource. They measure a more realistic latency that also accounts for query parsing etc., basically everything that happens before the request reaches the PullQueryExecutor.
+1
My original thinking was that CPU-bound work like parsing should be an order of magnitude or two faster than IO like reading from disk or network. I do recall that in earlier benchmarking we didn't seem to be IO bound, but that's probably a separate issue.
If what we really want is latency for the whole endpoint, which I agree is preferable, it seems like it would be good to have a more generic latency metric for all resources -- that way we don't special-case this. Not sure if that's possible with Vert.x, but I imagine it is. To keep consistency with existing metric naming, and to make it work with the websocket (which is probably harder to make generic), I'll stick to the pull query use case for now. I'll change this part back and add in the web socket.
That's a good point - might be worth asking @purplefox and/or searching around Vert.x to see if there is a generic way to get metrics on all requests. I know this is possible in Jetty, but don't know Vert.x.
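For what it's worth, Vert.x Web does allow a catch-all route handler, which could time every request generically. A minimal sketch, assuming a hypothetical `RequestLatencyHandler` wired into the router (this is not the ksqlDB implementation):

```java
import io.vertx.ext.web.Router;
import io.vertx.ext.web.RoutingContext;

public final class RequestLatencyHandler {

  // Register before all other routes so every request is timed.
  public static void install(final Router router) {
    router.route().handler(RequestLatencyHandler::time);
  }

  private static void time(final RoutingContext ctx) {
    final long startNanos = System.nanoTime();
    // Fires after the response body has been fully written.
    ctx.addBodyEndHandler(v -> recordLatency(System.nanoTime() - startNanos));
    ctx.next();
  }

  private static void recordLatency(final long nanos) {
    // Hook into whatever metrics registry the server uses.
  }
}
```

I believe Vert.x also ships optional metrics SPI integrations (e.g. the Dropwizard metrics module) that can report HTTP server latencies without any application code.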
It's a bit of a PITA that we now have two ways of handling pull queries. However, be that as it may, we obviously need to capture metrics from both, in some semblance of a standard way. As @vpapavas says, moving the latency capture completely into the executor means we're no longer including the time taken to do things like auth checks in the latency. This is obviously bad!
The best solution would be to remove the old code path for pull queries. But this isn't possible short term (though we should at least start this process, e.g. get the UI team to move to the new HTTP2 endpoint and mark the old websocket endpoint as deprecated).
Given the two code paths, the challenge is therefore to change the code in a way that encapsulates the collection of the metrics and has the minimum duplicate code in the two paths.
Looking at the two code paths, they are drastically different in the way they are architected. This leads me to believe the best approach is for each path to capture the start time of the request (in nanoseconds) and pass it down the call stack into PullQueryExecutor so it can log the latency. I don't actually like this pattern, but I don't think capturing the metrics at a higher level is feasible. I think this only works because there isn't any substantial processing after the executor has done its thing, but that's worth double-checking. Ideally, we should be capturing any JSON serialization costs, etc.
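A minimal sketch of the start-time pass-down pattern being proposed; the `execute` signature here is illustrative, not the final one from the PR:

```java
import org.apache.kafka.common.utils.Time;

// At the edge of each endpoint (HTTP resource or websocket), capture the start
// time once, before auth checks, parsing, and validation:
final long startTimeNanos = Time.SYSTEM.nanoseconds();

// ...auth checks, statement parsing, validation...

// Pass the start time down the call stack (illustrative signature):
pullQueryExecutor.execute(statement, serviceContext, startTimeNanos);

// Inside PullQueryExecutor, once the result has been produced:
pullQueryMetrics.ifPresent(metrics ->
    metrics.recordLatency(Time.SYSTEM.nanoseconds() - startTimeNanos));
```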
```java
    statement,
    executionContext,
    serviceContext,
    pullQueryContext,
    routingOptions
);
```
+1
I agree with the general sentiment, though I'm surprised we don't have a framework-level latency measurer for web socket requests and endpoint requests. I'd also be surprised if this made much of a difference, since in theory we should be IO bound, at least for large tables.
I had a version where I separated out the timing and did it after serialization, and in some sense I prefer this since the timing isn't encapsulated anyway. But per all of your other suggestions about encapsulating the metrics object in the …
LGTM
The auth check is potentially a cross-machine call! ;)
LGTM with a few suggestions.
```diff
  final TableRows tableRows = result.getTableRows();

- return new PullQueryPublisher(
+ final QueryPublisher queryPublisher = new PullQueryPublisher(
```
nit: inline `queryPublisher` again.
Done
```diff
  public PullQueryExecutor(
-     final KsqlExecutionContext executionContext,
+     final KsqlEngine ksqlEngine,
```
Probably better to keep this as `KsqlExecutionContext` and pass in the service id... less coupling.
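A sketch of the suggested constructor shape, with illustrative field and parameter names:

```java
import java.util.Objects;

public PullQueryExecutor(
    final KsqlExecutionContext executionContext, // narrower interface than KsqlEngine
    final String serviceId                       // pass only the value that is needed
) {
  this.executionContext = Objects.requireNonNull(executionContext, "executionContext");
  this.serviceId = Objects.requireNonNull(serviceId, "serviceId");
}
```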
Done
```java
pullQueryMetrics.ifPresent(metrics ->
    metrics.recordLatency(Time.SYSTEM.nanoseconds() - startTimeNanos));
```
Why not just have:

```diff
- pullQueryMetrics.ifPresent(metrics ->
-     metrics.recordLatency(Time.SYSTEM.nanoseconds() - startTimeNanos));
+ pullQueryMetrics.ifPresent(metrics ->
+     metrics.recordLatency(startTimeNanos));
```

And change `recordLatency` to take the start time in ns, encapsulating the `Time.SYSTEM.nanoseconds()` call in a single place...?
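A sketch of what the encapsulated version might look like, assuming the metrics class wraps a Kafka `Sensor` (as the diff further down shows):

```java
import org.apache.kafka.common.utils.Time;

// Takes the request start time and computes the elapsed time itself, so the
// Time.SYSTEM.nanoseconds() call lives in exactly one place.
public void recordLatency(final long startTimeNanos) {
  this.latencySensor.record(Time.SYSTEM.nanoseconds() - startTimeNanos);
}
```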
Done
```diff
- public void recordLatency(final double value) {
-   this.latencySensor.record(value);
+ public void recordLatency(final long nanoSeconds) {
```
As suggested above, change to taking the start time, rather than the elapsed time?
Done
```diff
@@ -80,10 +81,11 @@ public void setUp() {
     publisher = new PullQueryPublisher(
         serviceContext,
         statement,
-        pullQueryExecutor);
+        pullQueryExecutor,
+        0L);
```
Can we go with a non-zero start time please? Stick it in a constant, e.g. `START_TIME = 1234L`, and update tests to use `START_TIME`.
Done
```diff
@@ -75,8 +74,7 @@ public void shouldThrowExceptionIfConfigDisabled() {
     // When:
     final Exception e = assertThrows(
         KsqlException.class,
-        () -> pullQueryExecutor.execute(query, engine.getServiceContext(), Optional.empty(),
-            Optional.empty())
+        () -> pullQueryExecutor.execute(query, engine.getServiceContext(), Optional.empty(), 0L)
```
Would be good to test that it's passing an appropriate start time here. Can we inject a clock?
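One way to do that is to inject Kafka's `Time` abstraction instead of reading the system clock directly; a hedged sketch (the constructor wiring is an assumption):

```java
import org.apache.kafka.common.utils.MockTime;

// Production code would receive a Time instance (Time.SYSTEM by default);
// tests can substitute MockTime to control the clock deterministically:
final MockTime time = new MockTime();
final long startTimeNanos = time.nanoseconds();
time.sleep(5); // advance the mock clock by 5 ms
final long elapsed = time.nanoseconds() - startTimeNanos; // exactly 5_000_000 ns
// assert that the executor passed startTimeNanos and recorded `elapsed`
```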
Done.
```java
@Test
public void shouldRecordRequestRate() {
  // Given:
  pullMetrics.recordRate(3);

  // When:
  final double rate = getMetricValue("-rate");

  // Then:
  assertThat(rate, closeTo(0.03, 0.001));
}
```
Rather than deleting this test, change it to:

```java
@Test
public void shouldRecordRequestRate() {
  // Given:
  pullMetrics.recordLatency(3000);

  // When:
  final double rate = getMetricValue("-rate");

  // Then:
  assertThat(rate, closeTo(0.03, 0.001)); // <-- numbers need changing.
}
```

Otherwise there's no test checking that the rate is updated when `recordLatency` is called!
I had removed the rate variable since it didn't appear to be read at all in practice, and since you had recommended removing it, but I'm now thinking you just meant the method `recordRate`. Alright, I'll add this variable to the `recordLatency` call and add back this test.
fix: Uses pull query metrics for all paths, not just /query (confluentinc#5983)
Description
Fixes an issue with pull query metrics not being used in every pull query code path. Now they are used not only on /query but also on the web socket code path.
Testing done
Ran normal tests. Also verified locally that the metrics are now in use.