[backport]: changes from rhods_2.4 to rhods_2.5 (opendatahub-io#129)
* [cherry-pick]: split workbenches image into 2 params.env file

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Update opendatahub label

(cherry picked from commit 3e975f9)
(cherry picked from commit 9f8b649)

* Update Codeflare manifests path

(cherry picked from commit 014396c)
(cherry picked from commit 5f1c0d4)

* Move creation of default DSC

(cherry picked from commit ab33109)
(cherry picked from commit 00ddd6c)

* update(manifests): enable kserve, modelmesh and workbenches

- dashboard and modelmesh-monitoring still from odh-manifests

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Fix cherry-pick for dsci

* fix(mm): set the new logic for modelmesh

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Fix the KF deployment:

* fix(monitoring): do the switch for dev mode to not send alert

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
(cherry picked from commit 001cad1)

* refactor: reduce alert level for codeflare operator

* Update(manifests): for monitoring

- remove https:// for dashboard target
- add nwp from odh-deployer
- fix: wrong service name for operator, this is defined in CSV
- port: do not use https but 8080

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Fix manifests for monitoring

(cherry picked from commit 85883f102bc15f2343c0f6afe253a29a4ff3f64f)

* Revert changes to prometheus port

Changes to prometheus port makes the route inaccessible

* fix rebase

* fix(dsci): missing label on namespaces (opendatahub-io#98)

- add SM, which is in modelmesh-monitoring, into operator monitoring
- add the roles from modelmesh-monitoring into ours too
- apply 3 labels to both the monitoring and application namespaces (as v1 does)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): typo (opendatahub-io#101)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* update(monitoring)

- remove hardcoded app. namespace in segment manifests
- remove hardcoded monitoring. namespace in base manifests
- add placeholder to inject monitoring namespace in Servicemonitor

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* uplift: package version

- github.com/operator-framework/operator-lifecycle-manager/releases/tag/v0.26.0
- github.com/openshift/api to latest v0.0.0

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Remove odh csv

* fix(crd): do not set ownerreference on CRD (opendatahub-io#725)

- we already covered the case where a component is changed from Managed to Removed
- this covers the case where a component is Managed and the DSC CR is deleted
- if the owner reference is never set on the CRD, the CRD will not get deleted

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
(cherry picked from commit e9461e0)

* Fix DSCI Patch

* update(monitoring): metrics (opendatahub-io#107)

* update(monitoring):

- add a log line in the pod so QE can see that it is a dev-mode cluster
- add two metrics:
	they do not appear to be used in this config,
	but they are present in the v1 config, so they are added back
- move the workbench recording rules to the correct rule file
- remove operator-alerting.rules: it is not used in v1, so drop it to keep things simple

- fix: openshift-monitoring is using web as port name and our port

- add more comments to the config and comment out config that is not needed
- add egress for odh monitoring and add cluster monitoring NS for ingress

- keep rhods_aggregate_availability from the PrometheusRules along with 2 users
   reason: PSI does not get metrics from namespaces other than openshift-* or kube-* into the cluster-monitoring Prometheus,
possibly because cluster-monitoring prometheus-k8s only uses PrometheusRule and not ServiceMonitor?

-  from test results:
	if our monitoring ns is not set for cluster-monitoring, there are no targets on federation2 and no rhods_aggreated_in metrics

- fix(monitoring): removed duplicated alerts of dashboard in workbenches

- add UWM ns for operator ingress

- according to the docs, a custom Prometheus should not be deployed when UWM is enabled; this conflict might be why we cannot see metrics from odh monitoring in the cluster-monitoring Prometheus?

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Remove DSCI explicit naming

* Fix regression in Prometheus Deployment

* Remove os.exit for custom functions

* Delete legacy blackbox exporter

* fix(monitoring): add missing role and rolebinding for prometheus (opendatahub-io#112)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): missing add new files into kustomization (opendatahub-io#113)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* cleanup(monitoring): after previous 2 commits this is not needed/useful (opendatahub-io#114)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): do not set odh monitoring namespace when apply for  manifests in "monitoring/base" (opendatahub-io#115)

* fix(monitoring):  not set our monitoring when apply to monitoring/base folder
- hardcode our monitoring namespace for all needed manifests

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* revert: label changes made in upgrade PR

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): cannot load dashboard record rules (opendatahub-io#123)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): when DSC is removed, its entry in rule_files should be
cleaned up

- matching does not work with a bare * in the pattern; use (.*) instead
- add a leading (-) to differentiate rule_files entries from the actual rule definitions

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* cherry-pick: edson's rhods-12939 from odh + debug + timeout tuning

comment out ExponentialBackoffWithContext for now to test
do not add v2 into markedDeletion list

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(upgrade): modelmesh monitoring deployment need deletion as well

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix: add statefulset

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* cherrypick: upstream 748 fix no reconcile when no error return

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* RHODS-12956: removing CR update from the operator reconciliation loop to avoid infinite loop (opendatahub-io#128)

* chore

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

---------

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Co-authored-by: Vaishnavi Hire <vhire@redhat.com>
Co-authored-by: Dimitri Saridakis <dimitri.saridakis@gmail.com>
Co-authored-by: Edson Tirelli <ed.tirelli@gmail.com>
(cherry picked from commit 81ebc87)
(cherry picked from commit 7525f99)
zdtsw authored and VaishnaviHire committed Dec 11, 2023
1 parent e77e993 commit 8c15f45
Showing 15 changed files with 2,121 additions and 1,392 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -162,7 +162,7 @@ run: manifests generate fmt vet ## Run a controller from your host.
go run ./main.go

.PHONY: image-build
- image-build: unit-test ## Build image with the manager.
+ image-build: # unit-test ## Build image with the manager.
$(IMAGE_BUILDER) build --no-cache -f Dockerfiles/Dockerfile ${IMAGE_BUILD_FLAGS} -t $(IMG) .

.PHONY: image-push
1,844 changes: 1,844 additions & 0 deletions bundle/manifests/rhods-operator.clusterserviceversion.yaml

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion components/codeflare/codeflare.go
@@ -78,7 +78,7 @@ func (c *CodeFlare) ReconcileComponent(cli client.Client, owner metav1.Object, d
}

if found, err := deploy.OperatorExists(cli, dependentOperator); err != nil {
- return err
+ return fmt.Errorf("operator exists throws error %v", err)
} else if found {
return fmt.Errorf("operator %s is found. Please uninstall the operator before enabling %s component",
dependentOperator, ComponentName)
2 changes: 1 addition & 1 deletion components/component.go
@@ -108,7 +108,7 @@ func (c *Component) UpdatePrometheusConfig(_ client.Client, enable bool, compone
DeadManSnitchRules string `yaml:"deadmanssnitch-alerting.rules"`
CFRRules string `yaml:"codeflare-recording.rules"`
CRARules string `yaml:"codeflare-alerting.rules"`
- DashboardRRules string `yaml:"rhods-dashboard-recording.rule"`
+ DashboardRRules string `yaml:"rhods-dashboard-recording.rules"`
DashboardARules string `yaml:"rhods-dashboard-alerting.rules"`
DSPRRules string `yaml:"data-science-pipelines-operator-recording.rules"`
DSPARules string `yaml:"data-science-pipelines-operator-alerting.rules"`
2 changes: 1 addition & 1 deletion components/modelmeshserving/modelmeshserving.go
@@ -118,7 +118,7 @@ func (m *ModelMeshServing) ReconcileComponent(cli client.Client, owner metav1.Ob

// For odh-model-controller
if enabled {
- err := cluster.UpdatePodSecurityRolebinding(cli, dscispec.ApplicationsNamespace, "odh-model-controller")
+ err := cluster.UpdatePodSecurityRolebinding(cli, "odh-model-controller", dscispec.ApplicationsNamespace)
if err != nil {
return err
}
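
The fix above swaps two arguments that are both plain strings, a mistake the compiler cannot catch. One possible guard, sketched here under assumptions, is to give each argument its own named type. `ServiceAccountName`, `Namespace`, and `subjectFor` are hypothetical and do not exist in the repository, and the namespace value is only an example.

```go
package main

import "fmt"

// ServiceAccountName and Namespace are hypothetical named types used to
// make the two string arguments distinguishable to the compiler.
type ServiceAccountName string
type Namespace string

// subjectFor is a hypothetical helper that builds the Kubernetes user name
// of a service account: system:serviceaccount:<namespace>:<name>.
func subjectFor(sa ServiceAccountName, ns Namespace) string {
	return fmt.Sprintf("system:serviceaccount:%s:%s", ns, sa)
}

func main() {
	sa := ServiceAccountName("odh-model-controller")
	ns := Namespace("redhat-ods-applications") // example value, an assumption

	fmt.Println(subjectFor(sa, ns))
	// subjectFor(ns, sa) would fail to compile because the types differ.
	// Untyped string literals still convert implicitly, so the guard only
	// helps when callers pass typed values.
}
```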
32 changes: 3 additions & 29 deletions controllers/datasciencecluster/datasciencecluster_controller.go
@@ -21,11 +21,11 @@ import (
"context"
"errors"
"fmt"
"github.com/hashicorp/go-multierror"
"reflect"
"time"

"github.com/go-logr/logr"
"github.com/hashicorp/go-multierror"
ocbuildv1 "github.com/openshift/api/build/v1"
ocimgv1 "github.com/openshift/api/image/v1"
v1 "github.com/openshift/api/operator/v1"
@@ -92,8 +92,7 @@ func (r *DataScienceClusterReconciler) Reconcile(ctx context.Context, req ctrl.R
// Return and don't requeue
if upgrade.HasDeleteConfigMap(r.Client) {
if uninstallErr := upgrade.OperatorUninstall(r.Client, r.RestConfig); uninstallErr != nil {
- return ctrl.Result{}, fmt.Errorf("error while operator uninstall: %v",
- uninstallErr)
+ return ctrl.Result{}, fmt.Errorf("error while operator uninstall: %v", uninstallErr)
}
}

@@ -205,14 +204,6 @@ func (r *DataScienceClusterReconciler) Reconcile(ctx context.Context, req ctrl.R
}
}

- // Ensure all omitted components show up as explicitly disabled
- instance, err = r.updateComponents(ctx, instance)
- if err != nil {
- _ = r.reportError(err, instance, "error updating list of components in the CR")
-
- return ctrl.Result{}, err
- }
-
// Initialize error list, instead of returning errors after every component is deployed
var componentErrors *multierror.Error

@@ -263,6 +254,7 @@ func (r *DataScienceClusterReconciler) reconcileSubComponent(ctx context.Context
component components.ComponentInterface,
) (*dsc.DataScienceCluster, error) {
componentName := component.GetComponentName()

enabled := component.GetManagementState() == v1.Managed
// First set conditions to reflect a component is about to be reconciled
instance, err := r.updateStatus(ctx, instance, func(saved *dsc.DataScienceCluster) {
@@ -382,24 +374,6 @@ func (r *DataScienceClusterReconciler) updateStatus(ctx context.Context, origina
return saved, err
}

- func (r *DataScienceClusterReconciler) updateComponents(ctx context.Context, original *dsc.DataScienceCluster) (*dsc.DataScienceCluster, error) {
- saved := &dsc.DataScienceCluster{}
- err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
- err := r.Client.Get(ctx, client.ObjectKeyFromObject(original), saved)
- if err != nil {
- return err
- }
-
- // Try to update
- err = r.Client.Update(context.TODO(), saved)
- // Return err itself here (not wrapped inside another error)
- // so that RetryOnConflict can identify it correctly.
- return err
- })
-
- return saved, err
- }
-
func (r *DataScienceClusterReconciler) watchDataScienceClusterResources(a client.Object) (requests []reconcile.Request) {
instanceList := &dsc.DataScienceClusterList{}
err := r.Client.List(context.TODO(), instanceList)
12 changes: 12 additions & 0 deletions controllers/datasciencecluster/kubebuilder_rbac.go
@@ -101,6 +101,18 @@ package datasciencecluster
// +kubebuilder:rbac:groups="monitoring.coreos.com",resources=podmonitors,verbs=get;create;delete;update;watch;list;patch
// +kubebuilder:rbac:groups="monitoring.coreos.com",resources=prometheusrules,verbs=get;create;patch;delete;deletecollection
// +kubebuilder:rbac:groups="monitoring.coreos.com",resources=prometheuses,verbs=get;create;patch;delete;deletecollection
+ // +kubebuilder:rbac:groups="monitoring.coreos.com",resources=prometheuses/finalizers,verbs=get;create;patch;delete;deletecollection
+ // +kubebuilder:rbac:groups="monitoring.coreos.com",resources=prometheuses/status,verbs=get;create;patch;delete;deletecollection
+
+ // +kubebuilder:rbac:groups="monitoring.coreos.com",resources=alertmanagers,verbs=get;create;patch;delete;deletecollection
+ // +kubebuilder:rbac:groups="monitoring.coreos.com",resources=alertmanagers/finalizers,verbs=get;create;patch;delete;deletecollection
+ // +kubebuilder:rbac:groups="monitoring.coreos.com",resources=alertmanagers/status,verbs=get;create;patch;delete;deletecollection
+ // +kubebuilder:rbac:groups="monitoring.coreos.com",resources=alertmanagerconfigs,verbs=get;create;patch;delete;deletecollection
+ // +kubebuilder:rbac:groups="monitoring.coreos.com",resources=thanosrulers,verbs=get;create;patch;delete;deletecollection
+ // +kubebuilder:rbac:groups="monitoring.coreos.com",resources=thanosrulers/finalizers,verbs=get;create;patch;delete;deletecollection
+ // +kubebuilder:rbac:groups="monitoring.coreos.com",resources=thanosrulers/status,verbs=get;create;patch;delete;deletecollection
+ // +kubebuilder:rbac:groups="monitoring.coreos.com",resources=probes,verbs=get;create;patch;delete;deletecollection
+ // +kubebuilder:rbac:groups="monitoring.coreos.com",resources=prometheusrules,verbs=get;create;patch;delete;deletecollection

//+kubebuilder:rbac:groups=trustyai.opendatahub.io.trustyai.opendatahub.io,resources=trustyaiservices,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=trustyai.opendatahub.io.trustyai.opendatahub.io,resources=trustyaiservices/status,verbs=get;update;patch
25 changes: 9 additions & 16 deletions controllers/dscinitialization/dscinitialization_controller.go
@@ -187,6 +187,15 @@ func (r *DSCInitializationReconciler) Reconcile(ctx context.Context, req ctrl.Re

return ctrl.Result{}, nil
default:
+ // Check namespace is not exist, then create
+ namespace := instance.Spec.ApplicationsNamespace
+ r.Log.Info("Standard Reconciling workflow to create namespaces")
+ err = r.createOdhNamespace(ctx, instance, namespace)
+ if err != nil {
+ // no need to log error as it was already logged in createOdhNamespace
+ return reconcile.Result{}, err
+ }
+
// Start reconciling
if instance.Status.Conditions == nil {
reason := status.ReconcileInit
@@ -204,22 +213,6 @@ func (r *DSCInitializationReconciler) Reconcile(ctx context.Context, req ctrl.Re
}
}

- // Check namespace is not exist, then create
- namespace := instance.Spec.ApplicationsNamespace
- r.Log.Info("Standard Reconciling workflow to create namespaces")
- if err = r.createOdhNamespace(ctx, instance, namespace); err != nil {
- // no need to log error as it was already logged in createOdhNamespace
- return reconcile.Result{}, err
- }
-
- // Apply update from legacy operator
- // TODO: Update upgrade logic to get components through KfDef
- if err = upgrade.UpdateFromLegacyVersion(r.Client, platform); err != nil {
- r.Log.Error(err, "unable to update from legacy operator version")
-
- return reconcile.Result{}, err
- }
-
switch platform {
case deploy.SelfManagedRhods:
err := r.createUserGroup(ctx, instance, "rhods-admins")
11 changes: 10 additions & 1 deletion controllers/dscinitialization/monitoring.go
@@ -46,9 +46,18 @@ func (r *DSCInitializationReconciler) configureManagedMonitoring(ctx context.Con
}
}
if initial == "revertbackup" {
+ // TODO: implement with a better solution
+ // to have - before component name is to filter out the real rules file line
+ // e.g line of "workbenches-recording.rules: |"
err := common.MatchLineInFile(filepath.Join(prometheusConfigPath, "prometheus-configs.yaml"),
map[string]string{
- "*.rules: ": "",
+ "(.*)-(.*)workbenches(.*).rules": "",
+ "(.*)-(.*)rhods-dashboard(.*).rules": "",
+ "(.*)-(.*)codeflare(.*).rules": "",
+ "(.*)-(.*)data-science-pipelines-operator(.*).rules": "",
+ "(.*)-(.*)model-mesh(.*).rules": "",
+ "(.*)-(.*)odh-model-controller(.*).rules": "",
+ "(.*)-(.*)ray(.*).rules": "",
})
if err != nil {
r.Log.Error(err, "error to remove previous enabled component rules")
4 changes: 2 additions & 2 deletions controllers/dscinitialization/utils.go
@@ -265,9 +265,9 @@ func (r *DSCInitializationReconciler) reconcileDefaultNetworkPolicy(ctx context.
},
},
},
- { // OR logic for ROSA
+ { // OR logic
From: []netv1.NetworkPolicyPeer{
- { // need this to access dashboard
+ { // need this for access dashboard
NamespaceSelector: &metav1.LabelSelector{
MatchLabels: map[string]string{
"kubernetes.io/metadata.name": "openshift-ingress",
10 changes: 5 additions & 5 deletions go.mod
@@ -12,8 +12,8 @@ require (
github.com/openshift/addon-operator/apis v0.0.0-20230919043633-820afed15881
github.com/openshift/api v0.0.0-20230823114715-5fdd7511b790
github.com/openshift/custom-resource-status v1.1.2
- github.com/operator-framework/api v0.17.6
- github.com/operator-framework/operator-lifecycle-manager v0.18.3
+ github.com/operator-framework/api v0.18.0
+ github.com/operator-framework/operator-lifecycle-manager v0.26.0
github.com/pkg/errors v0.9.1
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.68.0
github.com/stretchr/testify v1.8.3
@@ -34,7 +34,7 @@ require (
github.com/blang/semver/v4 v4.0.0 // indirect
github.com/cespare/xxhash/v2 v2.2.0 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
- github.com/emicklei/go-restful/v3 v3.10.1 // indirect
+ github.com/emicklei/go-restful/v3 v3.10.2 // indirect
github.com/evanphx/json-patch v5.6.0+incompatible // indirect
github.com/evanphx/json-patch/v5 v5.6.0 // indirect
github.com/fsnotify/fsnotify v1.6.0 // indirect
@@ -53,7 +53,7 @@ require (
github.com/google/pprof v0.0.0-20210720184732-4bb14d4b1be1 // indirect
github.com/google/shlex v0.0.0-20191202100458-e7afc7fbc510 // indirect
github.com/google/uuid v1.3.1 // indirect
- github.com/hashicorp/errwrap v1.0.0 // indirect
+ github.com/hashicorp/errwrap v1.1.0 // indirect
github.com/imdario/mergo v0.3.13 // indirect
github.com/josharian/intern v1.0.0 // indirect
github.com/json-iterator/go v1.1.12 // indirect
@@ -71,7 +71,7 @@ require (
github.com/rhobs/obo-prometheus-operator/pkg/apis/monitoring v0.61.1-rhobs1 // indirect
github.com/rogpeppe/go-internal v1.11.0 // indirect
github.com/sergi/go-diff v1.2.0 // indirect
- github.com/sirupsen/logrus v1.9.0 // indirect
+ github.com/sirupsen/logrus v1.9.2 // indirect
github.com/spf13/pflag v1.0.5 // indirect
github.com/xlab/treeprint v1.2.0 // indirect
go.starlark.net v0.0.0-20200306205701-8dd3e2ee1dd5 // indirect