Audit takes an unreasonable amount of time. Limited to 1 CPU core #2502
Comments
Are you able to determine whether this is caused by requests to the API server getting throttled? Unfortunately this would be hard to tease out, since there are no logs that signify scraping the API server has finished, but maybe the K8s client logs throttling? @ritazh should we add a "resource listing complete" log line?
I can confirm that there are no instances of client-side throttling or API server-side throttling.
How were you able to determine this?
Another way to experiment with whether this is due to single-threaded Rego execution vs. something about the execution process would be to dump the contents of the cluster to disk and run the evaluation against that dump.
Another thought: what are your constraint templates like? Do they use external data? Any Rego calls to http.send()?
When there is throttling on OPA Gatekeeper, we typically see error messages about throttling in the logs.
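For reference, a minimal Go sketch of where client-side throttling comes from: the rate limiter attached to client-go's rest.Config (its QPS and Burst fields). The tuning values below are illustrative assumptions for an experiment, not Gatekeeper's defaults.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newFastClient builds an in-cluster client with raised rate limits.
// client-go's defaults are low (QPS=5, Burst=10); heavy LIST fan-outs can
// hit them and surface as "client-side throttling" messages in the logs.
func newFastClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 100   // assumed tuning value for the experiment
	cfg.Burst = 200 // assumed tuning value for the experiment
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := newFastClient(); err != nil {
		fmt.Println("failed to build client:", err)
	}
}
```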
Thanks for the speedy response, @maxsmythe.
Not sure if there is a faster way than looping over
Audit controller is single-threaded, but it doesn't necessarily have to be. Mostly I want to be sure we're addressing the correct problem and am trying to figure out what we can learn without needing to wait for code that gives better profiling data. Generally, pure Rego shouldn't take on the order of hours to execute, even over large datasets, unless there are a lot of referential constraints that scale poorly with the size of the data footprint. It could also be network latency/throttling, which tends to be more likely for severe slowdowns. It could also be blocking calls (such as I/O requests), in which case some work queueing could help. There are a number of possibilities for the root cause, each with different solutions.
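To illustrate the "doesn't necessarily have to be single-threaded" point, here is a minimal, hypothetical Go worker-pool sketch (not Gatekeeper's actual audit code); `reviewObject` is a placeholder for whatever per-resource evaluation the audit performs.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// reviewObject stands in for evaluating all constraints against one resource.
func reviewObject(obj string) string {
	return "reviewed " + obj
}

// reviewAll fans per-object reviews out to one worker per available core.
func reviewAll(objs []string) []string {
	workers := runtime.GOMAXPROCS(0)
	jobs := make(chan string)
	results := make(chan string, len(objs))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for obj := range jobs {
				results <- reviewObject(obj)
			}
		}()
	}
	for _, obj := range objs {
		jobs <- obj
	}
	close(jobs)
	wg.Wait()
	close(results)

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(reviewAll([]string{"pod/a", "pod/b", "deploy/c"}))
}
```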
If you are willing to post the logs from the audit pod, that might be interesting to look at as well.
Log dump
@maxsmythe the place in the log where you see
Thank you for the logs! It looks like the runtime is ~80 minutes. TBH I don't think I have enough data to root-cause this as-is. Adding a log line for when scraping is complete and evaluation begins would help disambiguate slow prep from slow execution. Also, @acpana, the ability to surface per-template metrics like here: would help us know if there is a specific constraint template that is slow, which might imply a specific cause/fix. @bhattchaitanya are you able to run a binary built off a PR rather than a proper release?
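A rough sketch of the kind of instrumentation being proposed — a log line when listing finishes plus per-template timings — assuming hypothetical helpers `listClusterObjects` and `evaluateTemplate` and example template names:

```go
package main

import (
	"log"
	"time"
)

// Placeholders for the scrape and evaluation phases being timed.
func listClusterObjects() []string              { return []string{"obj1", "obj2"} }
func evaluateTemplate(tmpl string, objs []string) {}

func main() {
	start := time.Now()
	objs := listClusterObjects()
	// Marks the boundary between "prep" (listing) and "execution" (evaluation).
	log.Printf("resource listing complete: %d objects in %s", len(objs), time.Since(start))

	for _, tmpl := range []string{"K8sRequiredLabels", "K8sAllowedRepos"} {
		t0 := time.Now()
		evaluateTemplate(tmpl, objs)
		// Per-template timing would reveal whether one template dominates.
		log.Printf("template %s evaluated in %s", tmpl, time.Since(t0))
	}
}
```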
Do the PR versions have a Docker image on Docker Hub? If yes, then we can test it. @maxsmythe
IIRC we create an image for every merged PR. Lemme create a PR that adds a log line...
#2503 should at least give us an idea of where the majority of time is spent.
Sorry for the delay and thank you for opening the PR!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
What steps did you take and what happened:
The audit manager is limited to a single CPU core, which slows down the audit. It does not spin up multiple goroutines to make use of all available CPU cores.
Setting GOMAXPROCS and --max-serving-threads had no effect.
What did you expect to happen:
The audit controller should spin up multiple goroutines in the audit cycle based on constraint kinds or other factors to reduce the audit time. In large clusters, audits take hours to complete, which is a serious limitation.
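One possible shape for that expectation, sketched under the assumption of an errgroup-style fan-out per constraint kind (not the project's actual design); `auditKind` is a hypothetical stand-in for per-kind evaluation.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// auditKind is a placeholder: list and evaluate all constraints of one kind.
func auditKind(ctx context.Context, kind string) error {
	fmt.Println("audited", kind)
	return nil
}

// auditCycle runs one goroutine per constraint kind, capped at a fixed limit.
func auditCycle(ctx context.Context, kinds []string) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(4) // assumed concurrency cap for illustration
	for _, kind := range kinds {
		kind := kind // capture loop variable (pre-Go 1.22 semantics)
		g.Go(func() error { return auditKind(ctx, kind) })
	}
	return g.Wait()
}

func main() {
	_ = auditCycle(context.Background(), []string{"K8sRequiredLabels", "K8sPSPPrivilegedContainer"})
}
```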
Environment:
Kubernetes version (use kubectl version): 1.24