FEATURE: Refresh aws token #54
Comments
Hi Victor, hmm, since the client is defined internally in this operator, we can add a method there to refresh the token without affecting the operator in general. You can find the client at kubernetes_job_operator/client. Sadly I have little time at the moment to address this, but if you find a solution we can integrate it.
Hi Zav, thanks for your quick response! Yup, I've already seen that the original client is wrapped in the operator's. Anyway, knowing that you're not fully available right now, I will try to write some code here. If some other contributor wants to join me, that would be a good help. Wish me luck :)
I've got something that works for us, but it's completely custom for our tricky use case. Apart from that, I was forced to base the fix on your 2.0 version because for some reason newer ones are not working. Maybe my colleague @carlosliarte (I think you already know him) can give you more details on that. Given I'm in a hurry, I've decided to fork your repo and evolve our custom version from your 2.0 until we both have more time to look at all these details carefully (I wasn't able to run any tests here to check whether our changes break some other supported usage). The idea is to merge those changes into your latest version, make that version work in every case (including ours), and get back to this source. Probably by that time the main issue in the official k8s client will be solved and we will just need to upgrade the client version, do some QA and publish a new release. For the record, these are the (ugly) changes that worked for us: https://github.com/duferdev/KubernetesJobOperator/pull/1 I will leave them in our preproduction environment for a while to check that they're stable. Feel free to close this issue if you want.
Hi, thanks. I'll take a look at that and see if and how I can integrate this idea into the operator. I would prefer it to be an option. Either the token can be recreated, or the client can be recreated on disconnect. It will take a while though, so apologies for that. I have started a new project recently and the load is high.
Hi, I see you propagated the config file. Can you share your config? I have not read the Amazon documentation yet.
Hi Zav,
Yep, absolutely. Another improvement would be to catch 401 responses, reload the config, and retry the request once per Unauthorized error.
Don't worry, it seems we have it under control right now. Also, this is open source, right? :) We understand your situation. You are doing enough and we thank you for that.
Do you mean the kube conf file? I already posted it in the issue description. This is the config format. The masked parameters are personal tokens, usernames and so on. I don't see how they could be helpful. Anyway, if I'm wrong or you mean something else, just tell me. Thanks!
Hi, I think I still need to take a look at the underlying process. The PR you did over there, showing the changes that make the config work, was helpful. I may have time for this in two or three weeks. Otherwise, if you find a solution, I would love a PR. Best
Also, could you try a simpler command and verify that the issue repeats? E.g.:
...
kind: Pod
...
spec:
  containers:
    - ...
      command: |
        # Loop for roughly 16 minutes (99 sleeps of 10 s), past the 15 min token lifetime.
        num=1
        while true; do
          echo sleep 10
          sleep 10
          num=$((num + 1))
          if [ "$num" -eq 100 ]; then break; fi
        done
And just sleep for a very long time in your pod? Will that create the same error?
Yes, for sure it will if the execution lasts longer than 15 min, because of AWS EKS security constraints. Check: https://aws.github.io/aws-eks-best-practices/security/docs/iam/
Right now, it seems that the only way we can refresh those credentials is by reloading the whole config and creating a new client with it. The open issue linked in the issue description explains why in detail. But again, this only happens in some corner cases where you are using the k8s Python client against an AWS EKS cluster. If you want to reproduce this locally, you will need to bring up a k8s cluster that emulates the AWS behaviour, expiring k8s API tokens after 15 min.
But that would not matter. We can recreate the client internally in the wrapper and download/update new creds. If you don't mind trying that with the sleep command I sent, that would produce an example that we can put in the examples until I am available to solve the issue for good.
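To illustrate the idea of recreating the client inside the wrapper, here is a minimal sketch that rebuilds the client once it is older than a chosen age. `RefreshingClient`, `factory`, `max_age` and `clock` are hypothetical names for illustration, not part of the operator; in a real setup the factory could be something like `kubernetes.config.new_client_from_config`, which reloads the kube config and returns a new client.

```python
import time
from typing import Any, Callable


class RefreshingClient:
    """Hypothetical wrapper: rebuilds the inner client once it is too old."""

    def __init__(
        self,
        factory: Callable[[], Any],
        max_age: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._factory = factory  # builds a client with fresh credentials
        self._max_age = max_age  # assumption: kept below the 15 min token lifetime
        self._clock = clock      # injectable so the behaviour can be tested
        self._client = None
        self._created_at = 0.0

    def get(self) -> Any:
        """Return the cached client, recreating it when missing or expired."""
        now = self._clock()
        if self._client is None or now - self._created_at >= self._max_age:
            self._client = self._factory()
            self._created_at = now
        return self._client
```

Callers would ask for `get()` before each request instead of holding on to a client, so a long-running job never uses credentials older than `max_age` seconds.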
Would love a PR on that last one.
Hi, I still have not had time to fix this. Also, I have no access to the AWS cloud as of now.
Feature description
So basically we are facing this issue located in the k8s python client:
kubernetes-client/python#741
This has been open for more than one year. As far as I understand, once the client is instantiated, there is no way to refresh its credentials, given that it caches them. Because of that, when the credentials expire, client calls respond with 401. Meanwhile, people have started to develop workarounds in their own solutions. The most popular seems to be creating a new client with fresh credentials every time we need to call the client.
So basically that's what I'm proposing. Because this workaround could be ugly for people using the operator who are not suffering from this issue, we should consider adding some kind of feature toggle through the operator params or something like that.
This is my naive proposal given my limited knowledge of this, but any other proposal would be appreciated.
In our case, we can't run any job that lasts more than 15 minutes: our pipelines crash, and our pods aren't deleted and go stale (increasing our infra costs, since we are using AWS Fargate).
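A rough sketch of what the proposed toggle could look like, assuming a hypothetical `ClientProvider` helper and a hypothetical `refresh_per_call` flag (neither exists in the operator today). With the flag off, the client is built once and reused; with it on, every call gets a freshly built client.

```python
from typing import Any, Callable


class ClientProvider:
    """Hypothetical helper: hands out a client, optionally rebuilt per call."""

    def __init__(self, factory: Callable[[], Any], refresh_per_call: bool = False):
        self._factory = factory
        self._refresh_per_call = refresh_per_call
        self._cached = None

    def client(self) -> Any:
        if self._refresh_per_call:
            # Workaround path: never reuse a client, so credentials are never stale.
            return self._factory()
        if self._cached is None:
            # Default path: build once and reuse, as the operator does today.
            self._cached = self._factory()
        return self._cached
```

Keeping the default at `False` would leave the behaviour unchanged for users who are not affected by expiring tokens.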
Describe alternatives you've considered
Try to catch 401 responses and refresh credentials there somehow.
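This alternative could be sketched roughly as follows. `ReauthClient` and `factory` are hypothetical names, and the code assumes that the raised error exposes the HTTP status code in a `status` attribute (as far as I know, the k8s Python client's `ApiException` does, but treat that as an assumption here).

```python
from typing import Any, Callable


class ReauthClient:
    """Hypothetical wrapper: on a 401, rebuild the client and retry once."""

    def __init__(self, factory: Callable[[], Any]):
        self._factory = factory
        self._client = factory()

    def call(self, request: Callable[[Any], Any]) -> Any:
        try:
            return request(self._client)
        except Exception as exc:
            # Assumption: the error carries the HTTP status in `status`.
            if getattr(exc, "status", None) != 401:
                raise
        # Credentials were rejected: reload them and retry exactly once.
        # A second failure propagates to the caller.
        self._client = self._factory()
        return request(self._client)
```

Retrying only once keeps a genuinely invalid configuration from looping forever.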
Additional context
We are running everything on AWS. We have MWAA (Airflow 2.0) with DAGs running tasks like this:
Yaml file looks like
As you can see, we are running spark applications.
Here is the kube conf file:
Here is the way we get the token and the reason why it lasts for just 15 min:
https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/get-token.html
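For reference, the command documented there prints an ExecCredential JSON document whose `status` block holds the token and its `expirationTimestamp`. A minimal sketch of reading the remaining lifetime from such a document, assuming the timestamp uses the `YYYY-MM-DDTHH:MM:SSZ` form (`token_seconds_left` is a hypothetical helper name):

```python
import json
from datetime import datetime, timezone


def token_seconds_left(exec_credential_json: str, now: datetime) -> float:
    """Seconds until the token in an ExecCredential document expires."""
    status = json.loads(exec_credential_json)["status"]
    # Assumption: timestamps look like 2021-05-01T12:15:00Z and are in UTC.
    expires = datetime.strptime(
        status["expirationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (expires - now).total_seconds()
```

A wrapper could compare this value against a safety margin to decide when to rebuild the client, instead of waiting for a 401.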
And here is an example of what a failed client call log looks like: