[backport]: changes from rhods_2.4 to rhods_2.5 (#129)
* [cherry-pick]: split workbenches image into 2 params.env file

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Update opendatahub label

(cherry picked from commit 3e975f9)
(cherry picked from commit 9f8b649)

* Update Codeflare manifests path

(cherry picked from commit 014396c)
(cherry picked from commit 5f1c0d4)

* Move creation of default DSC

(cherry picked from commit ab33109)
(cherry picked from commit 00ddd6c)

* update(manifests): enable kserve, modelmesh and workbenches

- dashboard and modelmesh-monitoring still from odh-manifests

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Fix cherry-pick for dsci

* fix(mm): set the new logic for modelmesh

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Fix the KF deployment:

* fix(monitoring): do the switch for dev mode to not send alert

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
(cherry picked from commit 001cad1)

* refactor: reduce alert level for codeflare operator

* Update(manifests): for monitoring

- remove https:// for dashboard target
- add nwp from odh-deployer
- fix: wrong service name for operator, this is defined in CSV
- port: do not use https but 8080

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Fix manifests for monitoring

(cherry picked from commit 85883f102bc15f2343c0f6afe253a29a4ff3f64f)

* Revert changes to prometheus port

Changes to prometheus port make the route inaccessible

* fix rebase

* fix(dsci): missing label on namespaces (#98)

- add SM which is in modelmesh-monitoring into operator monitoring
- add roles which are in modelmesh-monitoring into ours too
- apply 3 labels to both monitoring and application namespace (as v1 does)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): typo (#101)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* update(monitoring)

- remove hardcoded app. namespace in segment manifests
- remove hardcoded monitoring. namespace in base manifests
- add placeholder to inject monitoring namespace in Servicemonitor

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* uplift: package version

- github.com/operator-framework/operator-lifecycle-manager/releases/tag/v0.26.0
- github.com/openshift/api to latest v0.0.0

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Remove odh csv

* fix(crd): do not set ownerreference on CRD (#725)

- we already covered the case where a component is set from Managed to Removed
- this covers the case where a component is set as Managed and the
DSC CR is deleted
- so if we do not set it in the first place, it won't get deleted

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
(cherry picked from commit e9461e0)

* Fix DSCI Patch

* update(monitoring): metrics (#107)

* update(monitoring):

- add log in pod for QE to see it is dev mode cluster
- add two metrics:
	i do not think they are used in this config
	but they are presented in v1 config , so i add back
- move recording for workbench to correct rule file
- remove operator-alerting.rules it is not used in v1 to keep it simple

- fix: openshift-monitoring is using web as port name and our port

- add more comments for the config  and comments out not needed config
- add egress for odh monitoring and add cluster monitoring NS for ingress

- keep rhods_aggregate_availability in the PrometheusRules along with the 2 user metrics
   reason for this: PSI does not get metrics from non openshift-* or kube-* namespaces into the cluster-monitoring prometheus, as
cluster-monitoring prometheus-k8s only uses PrometheusRule, not ServiceMonitor?

-  from test result:
	if our monitoring ns not set cluster-monitoring, there is no targets on federation2 and no rhods_aggreated_in metrics

- fix(monitoring): removed duplicated alerts of dashboard in workbenches

- add UWM ns for operator ingress

- according to the docs, a custom Prometheus should not be used when UWM is enabled; this conflict might be why we cannot see metrics from odh monitoring in the cluster-monitoring prometheus?

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Remove DSCI explicit naming

* Fix regression in Prometheus Deployment

* Remove os.exit for custom functions

* Delete legacy blackbox exporter

* fix(monitoring): add missing role and rolebinding for prometheus (#112)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): missing add new files into kustomization (#113)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* cleanup(monitoring): after previous 2 commits this is not needed/useful (#114)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): do not set odh monitoring namespace when apply for  manifests in "monitoring/base" (#115)

* fix(monitoring):  not set our monitoring when apply to monitoring/base folder
- hardcode our monitoring namespace for all needed manifests

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* revert: label changes made in upgrade PR

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): cannot load dashboard record rules (#123)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): when DSC is removed, the entry in rule_files should be
cleaned up

- match does not work with * in the string, need to use (.*)
- add (-) in the front to differentiate the rule_file from the real rules

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* cherry-pick: edson's rhods-12939 from odh + debug + timeout tuning

comment out ExponentialBackoffWithContext for now to test
do not add v2 into markedDeletion list

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(upgrade): modelmesh monitoring deployment need deletion as well

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix: add statefulset

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* cherrypick: upstream 748 fix no reconcile when no error return

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* RHODS-12956: removing CR update from the operator reconciliation loop to avoid infinite loop (#128)

* chore

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

---------

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Co-authored-by: Vaishnavi Hire <vhire@redhat.com>
Co-authored-by: Dimitri Saridakis <dimitri.saridakis@gmail.com>
Co-authored-by: Edson Tirelli <ed.tirelli@gmail.com>
4 people authored Nov 21, 2023
1 parent d0e761a commit 81ebc87
Showing 30 changed files with 533 additions and 190 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -164,7 +164,7 @@ run: manifests generate fmt vet ## Run a controller from your host.
go run ./main.go

.PHONY: image-build
-image-build: unit-test ## Build image with the manager.
+image-build: # unit-test ## Build image with the manager.
$(IMAGE_BUILDER) build --no-cache -f Dockerfiles/Dockerfile ${IMAGE_BUILD_FLAGS} -t $(IMG) .

.PHONY: image-push
5 changes: 3 additions & 2 deletions bundle/manifests/rhods-operator.clusterserviceversion.yaml
@@ -73,9 +73,9 @@ metadata:
"metadata": {
"labels": {
"app.kubernetes.io/created-by": "opendatahub-operator",
-"app.kubernetes.io/instance": "default-feature",
+"app.kubernetes.io/instance": "default",
"app.kubernetes.io/managed-by": "kustomize",
-"app.kubernetes.io/name": "featuretracker",
+"app.kubernetes.io/name": "default-feature",
"app.kubernetes.io/part-of": "opendatahub-operator"
},
"name": "default-feature"
@@ -607,6 +607,7 @@ spec:
verbs:
- create
- delete
- get
- list
- update
- watch
2 changes: 1 addition & 1 deletion components/codeflare/codeflare.go
@@ -77,7 +77,7 @@ func (c *CodeFlare) ReconcileComponent(cli client.Client, owner metav1.Object, d
}

if found, err := deploy.OperatorExists(cli, dependentOperator); err != nil {
-return err
+return fmt.Errorf("operator exists throws error %v", err)
} else if found {
return fmt.Errorf("operator %s found. Please uninstall the operator before enabling %s component",
dependentOperator, ComponentName)
2 changes: 1 addition & 1 deletion components/component.go
@@ -106,7 +106,7 @@ func (c *Component) UpdatePrometheusConfig(cli client.Client, enable bool, compo
DeadManSnitchRules string `yaml:"deadmanssnitch-alerting.rules"`
CFRRules string `yaml:"codeflare-recording.rules"`
CRARules string `yaml:"codeflare-alerting.rules"`
-DashboardRRules string `yaml:"rhods-dashboard-recording.rule"`
+DashboardRRules string `yaml:"rhods-dashboard-recording.rules"`
DashboardARules string `yaml:"rhods-dashboard-alerting.rules"`
DSPRRules string `yaml:"data-science-pipelines-operator-recording.rules"`
DSPARules string `yaml:"data-science-pipelines-operator-alerting.rules"`
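
The one-character tag fix above matters because a struct tag that does not match the key in the data tends to fail silently: decoding succeeds and the field is simply left empty. The sketch below shows that behaviour with the standard library's `encoding/json` as a stand-in for the YAML decoder used in component.go. That substitution is an assumption made to keep the example self-contained, and the type and function names (`withTypo`, `fixed`, `decodeBoth`) are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// withTypo mirrors the old tag, which is missing the trailing "s".
type withTypo struct {
	DashboardRRules string `json:"rhods-dashboard-recording.rule"`
}

// fixed mirrors the corrected tag.
type fixed struct {
	DashboardRRules string `json:"rhods-dashboard-recording.rules"`
}

// decodeBoth decodes the same document into both structs and returns
// the value each one ended up with.
func decodeBoth(data []byte) (typo string, ok string, err error) {
	var a withTypo
	var b fixed
	// Keys without a matching field are ignored, so neither call
	// reports the mismatch as an error.
	if err = json.Unmarshal(data, &a); err != nil {
		return "", "", err
	}
	if err = json.Unmarshal(data, &b); err != nil {
		return "", "", err
	}
	return a.DashboardRRules, b.DashboardRRules, nil
}

func main() {
	data := []byte(`{"rhods-dashboard-recording.rules": "groups: []"}`)
	typo, ok, err := decodeBoth(data)
	fmt.Printf("err=%v typo=%q fixed=%q\n", err, typo, ok)
}
```

In this sketch only the struct with the corrected tag receives the rule content; the mistyped one decodes without error and stays empty.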
2 changes: 1 addition & 1 deletion components/modelmeshserving/modelmeshserving.go
@@ -117,7 +117,7 @@ func (m *ModelMeshServing) ReconcileComponent(cli client.Client, owner metav1.Ob

// For odh-model-controller
if enabled {
-err := cluster.UpdatePodSecurityRolebinding(cli, dscispec.ApplicationsNamespace, "odh-model-controller")
+err := cluster.UpdatePodSecurityRolebinding(cli, "odh-model-controller", dscispec.ApplicationsNamespace)
if err != nil {
return err
}
1 change: 1 addition & 0 deletions components/workbenches/workbenches.go
@@ -138,6 +138,7 @@ func (w *Workbenches) ReconcileComponent(cli client.Client, owner metav1.Object,
if enabled {
if dscispec.DevFlags.ManifestsUri == "" && len(w.DevFlags.Manifests) == 0 {
if platform == deploy.ManagedRhods || platform == deploy.SelfManagedRhods {
// for kf-notebook-controller image
if err := deploy.ApplyParams(notebookControllerPath, w.SetImageParamsMap(imageParamMap), false); err != nil {
return err
}
1 change: 1 addition & 0 deletions config/monitoring/alertmanager/alertmanager-configs.yaml
@@ -629,6 +629,7 @@ data:
smtp_require_tls: true
# The root route on which each incoming alert enters.
# TODO: check why need email_to
route:
group_by: ['alertname', 'cluster', 'service', 'job', 'email_to']
2 changes: 1 addition & 1 deletion config/monitoring/alertmanager/alertmanager-service.yaml
@@ -6,7 +6,7 @@ metadata:
labels:
name: alertmanager
name: alertmanager
-namespace: "redhat-ods-monitoring"
+namespace: redhat-ods-monitoring
spec:
ports:
- name: alertmanager
2 changes: 2 additions & 0 deletions config/monitoring/base/kustomization.yaml
@@ -4,3 +4,5 @@ resources:
- cluster-monitor-rolebinding.yaml
- rhods-prometheusrules.yaml
- rhods-servicemonitor.yaml
- rhods-prometheus-role.yaml
- rhods-prometheus-rolebinding.yaml
17 changes: 17 additions & 0 deletions config/monitoring/base/rhods-prometheus-role.yaml
@@ -0,0 +1,17 @@
# this is role for cluster-monitoring to read rhods prometheus service by cluster-monitoring service account
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: rhods-prometheus-cluster-monitoring-viewer
namespace: redhat-ods-monitoring
rules:
- verbs:
- get
- watch
- list
apiGroups:
- ''
resources:
- pods
- services
- endpoints
14 changes: 14 additions & 0 deletions config/monitoring/base/rhods-prometheus-rolebinding.yaml
@@ -0,0 +1,14 @@
# this is rolebingding to rhods-prometheus-cluster-monitoring-viewer for cluster-monitoring to read rhods prometheus service
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: rhods-prometheus-cluster-monitoring-viewer-binding
namespace: redhat-ods-monitoring
subjects:
- kind: ServiceAccount
name: prometheus-k8s
namespace: openshift-monitoring
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: rhods-prometheus-cluster-monitoring-viewer
14 changes: 14 additions & 0 deletions config/monitoring/base/rhods-prometheusrules.yaml
@@ -1,3 +1,8 @@
# rhods_aggregate_availability, rhods_total_users, rhods_actvie_users should not be needed
# they should be from traditional prometheus pod but from prometheus operator
# but to get PSI work with some, put it here
# TODO: revisit if when we decomision customized prometheus instance
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -6,19 +11,28 @@ metadata:
role: recording-rules
app: rhods
name: rhods-rules
namespace: redhat-ods-monitoring
spec:
groups:
- name: rhods-usage.rules
rules:
- record: cluster:usage:consumption:rhods:cpu:seconds:rate1h
expr: sum(rate(container_cpu_usage_seconds_total{container="",pod=~"jupyter-nb.*",namespace="rhods-notebooks"}[1h]))
labels:
instance: jupyter-notebooks
- record: cluster:usage:consumption:rhods:pod:up
expr: count(kube_pod_container_status_ready{namespace="rhods-notebooks", pod=~"jupyter-nb.*",container=~"jupyter-nb-.*"}==1)
labels:
instance: jupyter-notebooks
- record: cluster:usage:consumption:rhods:active_users
expr: count(kube_statefulset_replicas{namespace=~"rhods-notebooks", statefulset=~"jupyter-nb-.*"} ==1)
labels:
instance: jupyter-notebooks
- record: cluster:usage:consumption:rhods:cpu_requests_runtime
expr: sum(kube_pod_container_resource_requests{namespace="rhods-notebooks",resource="cpu", container=~"jupyter-nb-.*"} * on(pod) kube_pod_status_phase{phase="Running", namespace="rhods-notebooks"})
labels:
instance: jupyter-notebooks
- record: cluster:usage:consumption:rhods:cpu_limits_runtime
expr: sum(kube_pod_container_resource_limits{namespace="rhods-notebooks",resource="cpu", container=~"jupyter-nb-.*"} * on(pod) kube_pod_status_phase{phase="Running", namespace="rhods-notebooks"})
labels:
instance: jupyter-notebooks
77 changes: 66 additions & 11 deletions config/monitoring/base/rhods-servicemonitor.yaml
@@ -2,6 +2,7 @@ apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: rhods-monitor-federation
namespace: redhat-ods-monitoring
labels:
monitor-component: rhods-resources
team: rhods
@@ -27,32 +28,86 @@ spec:
interval: 30s
namespaceSelector:
matchNames:
- redhat-ods-monitoring
- <odh_monitoring_project>
selector:
matchLabels:
app: prometheus
---
# servicemonitoring for rhods operator
# this is not in use, we need to implement operator metrics in logic first
# apiVersion: monitoring.coreos.com/v1
# kind: ServiceMonitor
# metadata:
# labels:
# control-plane: controller-manager
# name: rhods-controller-manager-metrics-monitor
# namespace: redhat-ods-operator
# spec:
# endpoints:
# - path: /metrics
# port: metrics
# scheme: https
# bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
# tlsConfig:
# insecureSkipVerify: true
# params:
# 'match[]':
# - '{__name__= "redhat-ods-operator-controller-manager-metrics-service"}'
# namespaceSelector:
# matchNames:
# - redhat-ods-operator
# selector:
# matchLabels:
# control-plane: controller-manager

---
# servicemonitoring for openshift-monitoring scrap
# move from modelmesh-monitoring
# this one is duplicated as the old modelmesh-federated-metrics
# in order to keep metrics there if user set modelmesh to Removed
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: rhods-monitor-federation2
namespace: redhat-ods-monitoring
labels:
control-plane: controller-manager
name: rhods-controller-manager-metrics-monitor
namespace: redhat-ods-operator
monitor-component: rhods-resources
team: rhods
spec:
endpoints:
- path: /metrics
port: '8080'
scheme: https
- interval: 30s
params:
'match[]':
- '{__name__= "haproxy_backend_http_average_response_latency_milliseconds"}'
- '{__name__= "haproxy_backend_http_responses_total"}'
- '{__name__= "container_cpu_usage_seconds_total"}'
- '{__name__= "container_memory_working_set_bytes"}'
- '{__name__= "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate"}'
- '{__name__= "cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits"}'
- '{__name__= "cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests"}'
- '{__name__= "cluster:namespace:pod_memory:active:kube_pod_container_resource_requests"}'
- '{__name__= "cluster:namespace:pod_memory:active:kube_pod_container_resource_limits"}'
- '{__name__= "kube_persistentvolumeclaim_resource_requests_storage_bytes"}'
- '{__name__= "kubelet_volume_stats_used_bytes"}'
- '{__name__= "kubelet_volume_stats_capacity_bytes"}'
honorLabels: true
scrapeTimeout: 10s
bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
bearerTokenSecret:
key: ""
path: /federate
port: web
scheme: https
tlsConfig:
ca: {}
cert: {}
insecureSkipVerify: true
namespaceSelector:
matchNames:
- redhat-ods-operator
- openshift-monitoring
selector:
matchLabels:
control-plane: controller-manager


app.kubernetes.io/component: prometheus
app.kubernetes.io/instance: k8s
app.kubernetes.io/name: prometheus
app.kubernetes.io/part-of: openshift-monitoring
16 changes: 15 additions & 1 deletion config/monitoring/networkpolicy/monitoring/monitoring.yaml
@@ -2,6 +2,8 @@
# the services residing in redhat-ods-monitoring. namespaceSelector
# ensures that traffic from only the desired namespaces is allowed
# 9114 for blackbox or user_facing_endpoints* all down
# 9115 for blackbox health
# 10443 and 9091 for web
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
@@ -30,5 +32,17 @@ spec:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: openshift-monitoring
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: openshift-user-workload-monitoring
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: redhat-ods-operator
- namespaceSelector:
matchLabels:
opendatahub.io/generated-namespace: "true"
egress:
- {}
policyTypes:
- Ingress
- Ingress
- Egress
6 changes: 6 additions & 0 deletions config/monitoring/networkpolicy/operator/operator.yaml
@@ -16,5 +16,11 @@ spec:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: openshift-monitoring
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: openshift-user-workload-monitoring
- namespaceSelector:
matchLabels:
opendatahub.io/generated-namespace: "true"
policyTypes:
- Ingress