Add tests for benchmarks #1341
Conversation
12a4218 to a86eef7
SonarCloud Quality Gate failed.
Have you looked into https://json-schema.org/? This is a way to validate JSON objects, so it could be used instead of the manual checking currently done.
We don't have a fixed schema, because the input object can contain information from different benchmarks (matrix_statistics, spmv, solver, ...) and we only check for what is necessary. However, I am not doing any validation here (only sanitization to make the output deterministic). Are you talking about the other PR?
You are validating the benchmark output by comparing it to some predefined output. I think this is exactly the use case for this schema validation. Of course we would have different schemas for the different benchmark types. Here is an example for the blas schema (it might not be complete since I'm missing the other operations): {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "array",
"items": {
"type": "object",
"properties": {
"n": {"type": "integer"},
"blas": {
"type": "object",
"patternProperties": {
"^copy|axpy|scal": {
"type": "object",
"properties": {
"time": {"type": "number"},
"flops": {"type": "number"},
"bandwidth": {"type": "number"},
"repetitions": {"type": "integer"},
"completed": {"type": "boolean"}
},
"required": [
"time",
"flops",
"bandwidth",
"repetitions",
"completed"
]
}
}
}
},
"required": [
"n",
"blas"
]
}
}
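(For reference, a minimal sketch of how such a schema could be checked from a test, assuming the third-party Python jsonschema package; the file names are hypothetical and not part of this PR.)

```python
import json

import jsonschema  # third-party package, assumed to be available

# Hypothetical file names for the schema and a benchmark's JSON output.
with open("blas.schema.json") as f:
    schema = json.load(f)
with open("blas_output.json") as f:
    output = json.load(f)

# Raises jsonschema.exceptions.ValidationError if the output violates the schema.
jsonschema.validate(instance=output, schema=schema)
```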
@MarcelKoch Schemas might be more resilient to change, but they require significantly more work to adapt to changes (we can't just regenerate the output) and don't allow checking whether we left other parts of the input unchanged. On top of that, we have multiple types of output sections (text, DEBUG logger output, JSON arrays, JSON objects) that would need to be parsed.
Not generating the expected output can also be a good thing, because otherwise we would keep errors from incorrect benchmarks. IMO changing the schema would not be too much work in most cases, and I think the schema would be a more common approach to this kind of test.
The schema would only be for the JSON arrays and objects; the other parts would be tested as they currently are. Although tbh I'm not sure what the value is of testing the debug logger output. But I need to think about that a bit more.
Having the reference output be part of the repository and checking stderr and stdout is a really fundamental goal of this approach, because it lets us notice when something in the output changes, either because we changed something on purpose or because we caused an accidental change. Either way the test fails, and we need to explicitly change the reference outputs. The change also gets preserved in git, so you see in the code review which parts of the benchmark (and its internals in the DEBUG case) changed how.
All of that would also be valid with the schema approach.
I'm not sure what the first part means, but on the second part: do we really care about the output order? This would also depend on how the JSON implementation handles object properties; at least, I don't think that there is a predefined order on the properties. (I think it would be possible to check the order of array entries.)
nlohmann-json and RapidJSON both support preserving the order of objects in the output, and a change in the order is usually the side-effect of a change in the code. Whether the side-effect was intended or accidental, it's still useful to be notified of it.
From our offline discussion: the schema-based approach will be put on the backlog, and for now the purely text-based approach is sufficient. Perhaps after #1323 would be a good point to reevaluate the use of schemas.
Could that be summarized somehow, or is it only the previous points? Either here or in Notion. I've also recently seen fruitful uses of JSON schema, so the more info on the topic the better.
The important bullet points for me were
Not a full review, just some thoughts so far. In general I think that if test_framework.py grows in size, using pytest is the better alternative. And I would assume that the whole sanitization could be simplified by splitting the traversal of the dictionaries/lists from the actual sanitization. Also, some docstrings would improve readability.
I am a bit unsure which route to go here. In general, I would recommend using something like pytest (https://docs.pytest.org/en/7.3.x/) because it ships all kinds of useful functionality, but if the scope of these Python tests stays limited and doesn't grow over time, the current approach might be OK.
Our containers currently don't have pytest installed, but long-term this might be useful. Does pytest provide any kind of utilities for long text diffs? I don't see any right now.
Good question, I don't know about long text diffs, but its comparison of dictionaries is quite good, and IMO that should be the preferred route anyway: read the JSON into Python dicts (reference and test output) -> sanitize -> compare both dictionaries.
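(A minimal sketch of that flow as a pytest test; the file names and the toy sanitizer are placeholders, not the ones used in this PR.)

```python
import json


def sanitize(value):
    """Toy sanitizer: recursively replace floats, which hold non-deterministic timings."""
    if isinstance(value, dict):
        return {key: sanitize(val) for key, val in value.items()}
    if isinstance(value, list):
        return [sanitize(val) for val in value]
    return "<removed>" if isinstance(value, float) else value


def test_benchmark_output_matches_reference():
    # Placeholder file names for the reference output and the freshly generated output.
    with open("reference.json") as ref_file, open("output.json") as out_file:
        reference = json.load(ref_file)
        output = json.load(out_file)
    # On failure, pytest reports a structured diff between the two dictionaries.
    assert sanitize(output) == sanitize(reference)
```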
The thing I am interested in here is the text output; the JSON objects are far from the only important part. The only reason to parse the JSON input is that it is easier to sanitize the results in parsed form. Text diffs on pretty-printed JSON are really convenient, so I'm not sure there is much to be improved by moving to dicts?
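(A minimal illustration of the text-diff route using only the standard library; the function names are placeholders.)

```python
import difflib
import json


def pretty(obj):
    """Pretty-print a parsed JSON object with a fixed layout so diffs are stable line by line."""
    return json.dumps(obj, indent=4).splitlines(keepends=True)


def diff_against_reference(reference_obj, output_obj):
    """Return a unified text diff between the pretty-printed reference and test output."""
    return "".join(difflib.unified_diff(pretty(reference_obj), pretty(output_obj),
                                        fromfile="reference", tofile="output"))
```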
benchmark/test/test_framework.py.in
return sanitize_json(value, sanitize_all)


def sanitize_json(parsed_input, sanitize_all=False):
The function name is a bit misleading, since the parsed input has already been deserialised to a Python object. I would probably write a function like this:
def recursive_apply(parsed_input: dict, sanitizer: collections.abc.Callable) -> dict:
    """Recurses into parsed_input and applies sanitizer if the values are not dictionary or list types"""
    ...
Then the sanitizer should also know whether sanitize_all was set. This way you could untangle the calls from sanitize_json to sanitize_json_single and potentially back to sanitize_json.
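(A minimal sketch of the suggested split, not taken from the PR; the sanitizer rule shown is only a placeholder.)

```python
import collections.abc


def recursive_apply(parsed_input, sanitizer: collections.abc.Callable):
    """Recurse into dicts and lists and apply sanitizer to every leaf value."""
    if isinstance(parsed_input, dict):
        return {key: recursive_apply(value, sanitizer)
                for key, value in parsed_input.items()}
    if isinstance(parsed_input, list):
        return [recursive_apply(value, sanitizer) for value in parsed_input]
    return sanitizer(parsed_input)


def make_sanitizer(sanitize_all: bool):
    """Build a sanitizer closure that already knows whether sanitize_all was set."""
    def sanitize(value):
        # Placeholder rule: hide all floats, or every leaf value if sanitize_all is set.
        if sanitize_all or isinstance(value, float):
            return "<removed>"
        return value
    return sanitize
```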
This might be really helpful in case the sanitization needs to become more advanced, but I believe that with the complexity involved in such a stateful sanitizer object, the current implementation is still easier to maintain. I am also not sure if we will need more complex sanitization in the foreseeable future, since we basically only need to get rid of a handful of individual floats, arrays of floats, and implementation-dependent strings.
- no more -detailed information in the output
- moved the range annotation closer to the hot loop
Co-authored-by: Gregor Olenik <gregor.olenik@kit.edu>
- add missing newline
- remove disable test outputs
- fix docstrings
- fix duplicate matrix_statistics test
Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
Kudos, SonarCloud Quality Gate passed!
Codecov Report

@@            Coverage Diff             @@
##           develop    #1341      +/-  ##
===========================================
+ Coverage    90.35%   91.26%   +0.91%
===========================================
  Files          598      598
  Lines        50781    50706      -75
===========================================
+ Hits         45881    46276     +395
+ Misses        4900     4430     -470
Pulled out of #1323, this is everything from that PR without the nlohmann-json and flag changes.