-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Azure AD authentication support (and SDK upgrade) #9
Conversation
c6aa5f7
to
ff9d776
Compare
In case it's helpful context, I added If the removal of those two fields truly the only breaking change in this PR I hope the Flyte team will take this PR as soon as possible. Right now the only way you can authenticate into Azure with stow is with Azure connection strings. It is much better security practice to use workload identities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making this change! This has been bothering me for a while...
Can you detail how you tested this change? What kind of workflows have you ran against an installation with this stow version?
@gvashishtha Thanks for that context and makes sense! If something like LocalStack existed for Azure Blob Services, then those options would be key for local testing. Alas, there is no such thing. |
The testing section covers most of the testing that was put into this. For the scope of this change, I really wanted to stick to the stow fork completely independent of Flyte. I considered that a separate level of testing and one I'd like help with.
I did happen to test a Flyte build together with this branch and threw together a few basic workflows (the gist in the testing section shows one). However, I am new to flyte so if there are specific things to test for this type of change, I don't have the context for that. Also, the workflows I wrote were in python and so I seemed to be testing out the FlyteKit python SDK more so than anything to do with this code. |
This is awesome! The one additional test I would like to see is a map task. The reason being, there is a different code path that runs to collect outputs of a map task in propeller. Something like this would do: @task
def my_task(i: int) -> string:
return f"Hello: {i}"
@workflow
def my_wf(numbers: typing.List[int]) -> typing.List[str]:
return map_task(my_task)(i=numbers) Besides the few comments above, this looks great to me! I don't think we need to delay it further. Let's address the back compat fail early issue and optionally the nits and mark it for review! |
@EngHabu - Great to hear, thank you!
I updated the gist about local testing to include that. Below is a console screen shot of what the example task looked like after being ran. One of the first tasks we had for this updated stow code within our environment was to use a Also, we have gone through the exercise of getting a build of flyte with this branch deployed using the Aside from that, I wanted to call out a few things:
|
… Azure AD or shared keys. This required a larger-than-expected series of changes because the Azure SDK now very different than the one stow used before. This is an early commit to get changes out there for discussion. Included here is: - An upgrade of the Azure SDK used for the project (needed to support AD auth). This, unfortunately also brought in a whole series if indirect upgrades so testing will be needed. - Support for authenticating via either Azure AD or Shared keys - Removal of several azure config values (ConfigAPIVersion, ConfigUseHTTPS) and addition of ConfigUploadConcurrency. - Removal of the multi-part upload code for Azure, this is now included in the SDK - Addition of new presigned url tests into the general stow test/ module. - Addition of a new terraform module to stand up an azure storage account for testing. Signed-off-by: Terence Kent <terence.kent@mcg.com>
Signed-off-by: Terence Kent <terence.kent@mcg.com>
Accepting len(var)==0 vs var=="" Co-authored-by: Haytham Abuelfutuh <haytham@afutuh.com> Signed-off-by: Terence Kent <terence.kent@mcg.com>
Accepting improved error message Co-authored-by: Haytham Abuelfutuh <haytham@afutuh.com> Signed-off-by: Terence Kent <terence.kent@mcg.com>
…ity at this point: - Removed all inline context.Background() calls when using Azure SDK methods. Now set an explicit `ctx` variable at the top of each function. - Added comments around the clock skew code, and moved all clock skews to 15 minutes (the azure-recommended value). Also left a comment explaining the decision. - Made the use of removed config values an error state for validation. Added a test to ensure this happens. - Moved some errant logging in the `checkMetadata` test method to only show up in inequality, to keep test output clean. Signed-off-by: Terence Kent <terence.kent@mcg.com>
1bf10b8
to
db70879
Compare
Yeah noticed that, I believe that's alright... Depending on how you configure flyte, it can avoid issuing presigned urls to begin with..
Yeah that will be an issue (annoyance) I suspect. Why do you need to manually restart? why does it not recover on its own (i.e. Go into a crashloop backoff and eventually work) ? |
Oh, I should have been clear - it's only when running locally that I have to restart. When it's deployed in a cluster the pod just restarts so you might not even notice it. |
@tkent makes sense.. Is that a regression after moving to the "new" APIs? or did that behavior exist before? |
… ignored failures due to creating an already-existing container! The code from the old SDK to ignore "already exists" was preserved, but not working with the new SDK. - Updated the check to work with the new SDK. - Added a general test to confirm this behavior. Note: It's debatable if this should be the default behavior for all stow implementations. However, it seems like whatever the desired behavior is, it should be consistent across all implementations. Signed-off-by: Terence Kent <terence.kent@mcg.com>
@EngHabu Ha, you know, I never checked! Turns out it that this is a regression. The previous implementation silently ignored "Already Exists" errors and the new one did not. I've updated the code to match the old behavior, added a test for it, and the issue no longer shows up. There is a little awkwardness around ignoring "Already Exists" failures since it doesn't appear to be consistent between the various stow implementations, but it sure seems like the desired behavior. |
Thank you for the update, @tkent! |
@EngHabu Yup! The changes should be minimal from last time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing all the feedback! and for your contribution
What's here
This PR adds support for using Azure AD authentication to the
azure
module.About the changes
The latest version of the azure golang SDK was needed for this and, unfortunately, this required a very large set of changes. The
azure
module previously used thestorage
SDK, which has been removed, and the new SDK is very different. The key changes included are:RequestPreSigner
)use_https
andapi_version
configuration options (more on that below)upload_concurrency
configuration option.base_url
todomain_suffix
.ContainerPreSignRequest
)BigFileUpload
)Things to talk about
The Azure SDK changes forced a few decisions that I want to call out:
The removal of
use_https
This just seems like a misconfiguration risk with no clear benefit. MSFT recommends using HTTPS instead of HTTP for storage account access (here) and I can't imagine a scene where
http
would be preferred.The removal of
api_version
Newer versions of the SDK do not really support changing the API version. If you check the SDK code you see this comment:
Again, this seems like a misconfiguration risk without a clear benefit.
Presigned Requests and the shared keys disabled flag
When shared keys are disabled on a storage account, you might expect presigned keys to be blocked as well. However, "user delegated SAS tokens" will continue to work even in this case (docs here).
There is just one big caveat. The Azure AD identity used for authentication to the storage account must have the
Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey
privilege (docs here).Performance of the Presigned Request implementation when using AD auth
The current presigned request implementation for AD auth will not perform well if a large volume of requests are needed. Each pre-signing requires a distinct delegate key which is an API call to Azure. This could be mitigated, though not easily, by obtaining a long-lived delegate key and renewing it as necessary. I opted not to include that, assuming large volumes of pre-signed requests are unlikely.
Lack of
context
in the stow interfacesThe methods of the stow interfaces generally do not include a
context.Context
in their argument signature. The newer azure SDK, however, requires acontext.Context
for nearly every operation. I chose to usecontext.Background
every time this came up, butcontext.TODO
would have been just as appropriate.Testing
A big challenge for this particular change is testing. To make it a bit easier for a reviewer to kick the tires, a terraform module was added under the
tests/
directory. In there is a readme for using the module to stand up and teardown a storage account.Unfortunately, even with the fixture and existing stow tests, there is quite a bit of real-world testing this SDK swap will need.
Update
I've put together some instructions for testing this change end to end within a local
flyte
build using the master branch offlyteorg/flyte
. See this gist.A few things to note if following that guide:
For some reason, when Flyte first initializes the configured blob container in the Azure Storage Account, you will get errors (appears to be a race condition). The container is successfully created though. I haven't looked into this.