Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduces OSSM plugin for KfDef #515

Closed
wants to merge 41 commits into from

Conversation

bartoszmajsak
Copy link
Contributor

@bartoszmajsak bartoszmajsak commented Sep 5, 2023

⚠️ Important notes

The operator bundle has not been officially released yet. If you wish to test it, please use for the time being:

operator-sdk run bundle quay.io/maistra-dev/opendatahub-operator-bundle:v0.0.10-plugin --namespace openshift-operators --timeout 5m0s

It will need opendatahub-io/odh-manifests#932 to be merged alongside to use service-mesh-integration branch.

Description

This PR introduces a KfDef plugin aimed at enhancing the configuration and deployment functionalities of OpenShift Service Mesh. The development was inspired by analysis of how other platforms like GCP and AWS manage integrations. We've designed these improvements to be easily integrated into the upcoming version of the opendatahub-operator, while minimizing changes to the existing codebase. All new features are contained within dedicated packages.

The plugin addresses several limitations we encountered when using the conventional manifest and init-job solution, specifically:

  • Autoconfiguration: Simplifies the process of obtaining cluster-specific data for auto-configuring authorization settings.
  • Reversible Changes: Enables flexible and easy reversion of applied configurations.
    • Utilizes owner references wherever applicable.
    • Enables custom cleanup logic hooks implemented in Golang, such as undoing patches.
  • Ease of Integration: Allows a straightforward addition to the KfDef file. An example is provided below.

This PR includes #501 fix.

Manifests used for the plugin can be found here https://github.com/maistra/odh-manifests/tree/ossm_plugin_templates. PR opendatahub-io/odh-manifests#932

KfDef example

apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: odh-mesh
spec:
  plugins:
    - kind: KfOssmPlugin
      spec:
        mesh:
          name: minimal
          namespace: istio-system
          installationMode: minimal # (1)
          certificate:
            generate: true # (2)
        auth:
          name: authorino
          namespace: auth-provider
          authorino:
            label: authorino/topic=odh
  applications:
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: odh-common
      name: odh-common
    - kustomizeConfig:
        overlays:
          - dev # (3)
        repoRef:
          name: manifests
          path: odh-dashboard
      name: odh-dashboard
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: odh-notebook-controller
      name: odh-notebook-controller
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: odh-project-controller
      name: odh-project-controller
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: notebook-images
      name: notebook-images
  repos:
    - name: manifests
      uri: https://github.com/maistra/odh-manifests/tarball/kf_ossm_plugin
  version: kf_ossm_plugin

(1) - installationMode: currently it's either minimal which deploys our own, very basic control plane, or assumes there's one already which we should use (installationMode: pre-installed)

(2) - This will generate a self-signed certificate used by Mesh internally. This will need further enhancements.

(3) - notice there's no need to explicitly add service-mesh overlay. If an application has one defined it will be detected and enabled automatically

How Has This Been Tested?

  • manual end-to-end testing
  • automated integration tests
    • new target test-ossm has been provided

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Comment on lines +92 to 94
// +optional
// +kubebuilder:pruning:PreserveUnknownFields
Spec *runtime.RawExtension `json:"spec,omitempty"`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this option we are not able to use custom structs for configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the plugin being defined as the platform is not participating in Apply and Delete phases by default, we need to hook into those in order to configure Service Mesh. GCP integration is done similarly.

Copy link
Contributor Author

@bartoszmajsak bartoszmajsak Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a copy of pkg/utils/k8utils.go, but I wanted to keep it isolated from the main codebase. It has additional func for applying a patch file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the central place of kf plugin integration.

bartoszmajsak and others added 21 commits September 6, 2023 10:22
* add data science migration to init

Co-authored-by: Bartosz Majsak <bartosz.majsak@gmail.com>
* chore: removes unused test setup

* feat(test): sets up testing for ossm plugin

* feat(make): adds test-ossm target
* feat: introduces OssmResourceTracker resource

It is a cluster-scoped resource for tracking objects created by Ossm plugin.
Primarily used as owner reference for resources created across namespaces
so that they can be garbage collected by Kubernetes when they are not needed anymore.

* feat: uses OssmResourceTracker to cleanup owned resources

- adds cleanup hook for Delete phase of KfDef operator
- moves k8s_utils code to ossm package to keep it in one place for future extraction of the logic

* chore: renames plugin to OssmInstaller

* chore: moves cleanup logic to single file

* feat: handles token volume removal in SMCP

* chore: adds plugin name as extra key-value to logger

* feat: removes extAuthzProvider on deletion

* chore: minor naming fixes
embed templates to OSSM plugin

Co-authored-by: Bartosz Majsak <bartosz.majsak@gmail.com>
they might have been removed in some merge (?)
* add prereq checks to init, add tests
---------

Co-authored-by: Bartosz Majsak <bartosz.majsak@gmail.com>
- easy builder to define what particular feature consists of
- enables the use of existing YAML manifests
- allows the definition of additional resources programmatically with a simple func registration
- cleanup functions are now part of each feature and are called one after the other
- each feature has its own resource tracker object
- ability to test smaller parts of the whole thing by using `Feature`s in integration tests
* chore: extracts smcp GVR to a var

and reuse it in smcp-related funcs

* fix: returns err on oauth creation failure

instead of nil by mistake

* feat: adds ability to enable feature based on predicates

in certain cases this information can only be determined at runtime, for example based on user-defined values of the plugin spec.

* feat: introduces control plane installation

coupled with wait condition

* fix: makes feature enabled by default

* chore: reworks smcp creation

It relies on owner reference so can be cleaned up as other resources

* fix: applies cleanups only if feature is enabled

Additionally moves OssmResourceTracker creation to .Apply func

* feat: reworks plugin integration to use Apply and Delete funcs of KfApp

This way we can cleanup the resources in case deploying service mesh manifests because creating those in Generate phase leaves them hanging due to failing Delete hook

* fix(reconcile): corrects KfDef status in case of any error

Additionally propagates the error through reconcile to keep trying, as it was ignored.

* fix: reworks SMCP component readiness check

* chore: improves KfApp func docs

* chore: reworks logging in verifiers

* chore: simplifies CRD existence check

* chore: adds tests for feature enablement

* fix: sets interval for SMCP polling to 5s

* fix: handles types which are derived for built-in ones

for example `type InstallationMode string` was previously failing when
converting the default value defined in a custom tag. Now there's a
check performed and if value can be convereted to a given type it will.

In all other cases it will return an error.

* fix: sets default control plane installation mode rely on existing
installation

Proposed installation modes are:

- minimal
- pre-installed
Without this setting certain environments are not working,
as service account issuer might be set to different endpoint than kubernetes.default.svc
… cluster

without this fix we could fetch OdhDashboardConfig from different ns and try to perform an update on selected app ns, which leads to error and blocks operator from performing entire OSSM setup.

Oh well...
otherwise we create route for ns.domain but expects hosts to be opendatahub.domain which obviously does not work. Doh!
Creates ossm.env in each overlays/service-mesh of defined apps so that it can be used e.g. to overwrite configmap values
Copy link
Contributor

@cam-garrison cam-garrison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great - nice fixes in recent commits. Have also tested this E2E in the past with success using the Feature implementation.

pkg/kfapp/ossm/OWNERS Show resolved Hide resolved
pkg/kfconfig/ossmplugin/OWNERS Show resolved Hide resolved
@openshift-ci
Copy link

openshift-ci bot commented Sep 19, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cam-garrison
Once this PR has been reviewed and has the lgtm label, please assign anishasthana for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Co-authored-by: Cameron Garrison <cgarriso@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm label Sep 19, 2023
@openshift-ci
Copy link

openshift-ci bot commented Sep 19, 2023

New changes are detected. LGTM label has been removed.

Co-authored-by: Cameron Garrison <cgarriso@redhat.com>
@openshift-ci
Copy link

openshift-ci bot commented Sep 19, 2023

@bartoszmajsak: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/opendatahub-manifests-e2e 35460f5 link false /test opendatahub-manifests-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@zdtsw
Copy link
Member

zdtsw commented Sep 22, 2023

sorry for late chime in @bartoszmajsak :
we are planning to sunset v1 operator in a couple of months.
not sure we want to add more new features to it.

@bartoszmajsak
Copy link
Contributor Author

sorry for late chime in @bartoszmajsak : we are planning to sunset v1 operator in a couple of months. not sure we want to add more new features to it.

@zdtsw that's a valid concern. However, the context here is broader than just sunsetting v1. For ODH users still on v1 who require service mesh, the suggested approach offers clear advantages over our current init-job method, as seen in the service-mesh-integration branch of the ODH Manifests repo.

Based on the comment from @LaVLaS I understood we need both. It would be great if he could provide more clarity on this matter.

@zdtsw
Copy link
Member

zdtsw commented Sep 22, 2023

sorry for late chime in @bartoszmajsak : we are planning to sunset v1 operator in a couple of months. not sure we want to add more new features to it.

@zdtsw that's a valid concern. However, the context here is broader than just sunsetting v1. For ODH users still on v1 who require service mesh, the suggested approach offers clear advantages over our current init-job method, as seen in the service-mesh-integration branch of the ODH Manifests repo.

Based on the comment from @LaVLaS I understood we need both. It would be great if he could provide more clarity on this matter.

ok, I see. I was not aware of this issue. but yeah, lets wait for more inputs.

btw, for the manifests, we are doing the transition to move away from odh-manifests repo, see opendatahub-io/architecture-decision-records#16. from linked converstation seems you are still updating with the old approach ?

@bartoszmajsak
Copy link
Contributor Author

sorry for late chime in @bartoszmajsak : we are planning to sunset v1 operator in a couple of months. not sure we want to add more new features to it.

@zdtsw that's a valid concern. However, the context here is broader than just sunsetting v1. For ODH users still on v1 who require service mesh, the suggested approach offers clear advantages over our current init-job method, as seen in the service-mesh-integration branch of the ODH Manifests repo.
Based on the comment from @LaVLaS I understood we need both. It would be great if he could provide more clarity on this matter.

ok, I see. I was not aware of this issue. but yeah, lets wait for more inputs.

btw, for the manifests, we are doing the transition to move away from odh-manifests repo, see opendatahub-io/architecture-decision-records#16. from linked converstation seems you are still updating with the old approach ?

I am aware of this (found out yesterday by accident ;)) and we are already working on the split on our end. But I see not everything is ported yet, e.g. notebook controllers?

# comment out below logic once we have all component manifests ready to get from source git repo
MANIFEST_RELEASE="master"
MANIFESTS_TARBALL_URL="${GITHUB_URL}/${MANIFEST_ORG}/odh-manifests/tarball/${MANIFEST_RELEASE}"
mkdir -p ./.odh-manifests-tmp/ ./odh-manifests/
wget -q -c ${MANIFESTS_TARBALL_URL} -O - | tar -zxv -C ./.odh-manifests-tmp/ --strip-components 1 > /dev/null
cp -r ./.odh-manifests-tmp/model-mesh/ ./odh-manifests
cp -r ./.odh-manifests-tmp/odh-model-controller/ ./odh-manifests
cp -r ./.odh-manifests-tmp/modelmesh-monitoring/ ./odh-manifests
cp -r ./.odh-manifests-tmp/odh-notebook-controller/ ./odh-manifests
ls -lat ./odh-manifests/
rm -rf ${MANIFEST_RELEASE}.tar.gz ./.odh-manifests-tmp/

@zdtsw
Copy link
Member

zdtsw commented Sep 22, 2023

sorry for late chime in @bartoszmajsak : we are planning to sunset v1 operator in a couple of months. not sure we want to add more new features to it.

@zdtsw that's a valid concern. However, the context here is broader than just sunsetting v1. For ODH users still on v1 who require service mesh, the suggested approach offers clear advantages over our current init-job method, as seen in the service-mesh-integration branch of the ODH Manifests repo.
Based on the comment from @LaVLaS I understood we need both. It would be great if he could provide more clarity on this matter.

ok, I see. I was not aware of this issue. but yeah, lets wait for more inputs.
btw, for the manifests, we are doing the transition to move away from odh-manifests repo, see opendatahub-io/architecture-decision-records#16. from linked converstation seems you are still updating with the old approach ?

I am aware of this (found out yesterday by accident ;)) and we are already working on the split on our end. But I see not everything is ported yet, e.g. notebook controllers?

# comment out below logic once we have all component manifests ready to get from source git repo
MANIFEST_RELEASE="master"
MANIFESTS_TARBALL_URL="${GITHUB_URL}/${MANIFEST_ORG}/odh-manifests/tarball/${MANIFEST_RELEASE}"
mkdir -p ./.odh-manifests-tmp/ ./odh-manifests/
wget -q -c ${MANIFESTS_TARBALL_URL} -O - | tar -zxv -C ./.odh-manifests-tmp/ --strip-components 1 > /dev/null
cp -r ./.odh-manifests-tmp/model-mesh/ ./odh-manifests
cp -r ./.odh-manifests-tmp/odh-model-controller/ ./odh-manifests
cp -r ./.odh-manifests-tmp/modelmesh-monitoring/ ./odh-manifests
cp -r ./.odh-manifests-tmp/odh-notebook-controller/ ./odh-manifests
ls -lat ./odh-manifests/
rm -rf ${MANIFEST_RELEASE}.tar.gz ./.odh-manifests-tmp/

good to know that!

notebook-controller actually is ready , i am waiting for "approval" on #543 to close it.
but for the model-mesh part that will take longer time ( i guess)

it is up to when servicemesh can finish "split" work.
But no matter in which case, we will need to add it in the get_all_manifests.sh either from your repo or from odh-manifests

@bartoszmajsak
Copy link
Contributor Author

@zdtsw is this split going to affect v1/KfDef operator as well?

@zdtsw
Copy link
Member

zdtsw commented Sep 22, 2023

@zdtsw is this split going to affect v1/KfDef operator as well?

yes, the PR for v1 is here #544
similar way to handle it, but different components in v1

@bartoszmajsak
Copy link
Contributor Author

Withdrawing in favor of Operator V2 support #605

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants