As of now OpenShift UI will not display the runbook URL field for our alerts
(docs and google will though), so let's make sure we provide a better entry point
by being verbose in what is actually visible to the user.
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Add alert for DataImportCron failing
DataImportCrons now have conditions (particularly UpToDate) that tell us if
things are going as planned. We can utilize those to alert whenever were not UpToDate for a while.
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Address CR review; don't List, increment when needed via corresponding instance
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Address review & bugfix: don't update metric if err occurs
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* upToDateCondition => prevUpToDateCondition so it's clear we're deciding if we should inc/dec based on that
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Don't store state in controller; change metric type to GaugeVec (bool metric per DIC)
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Add alert for incomplete storage profiles
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Run metric tests on both openshift and k8s
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Add functional test for storageprofile metrics
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Delete profile as a follow up to storage class getting deleted
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Address review, alter tests to cover List metric approach
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Address review; individually loop over metric decrement, shorten reconcile.Result{}
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Address review; deletion timestamp not possible when err/teardown in AfterEach
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Add degraded alert
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Add unusual restart count metric
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Add actual firing alerts (degraded/restartcount)
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Test newly added metrics
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Review: Rename metric to match conventions, func to check if test is eligible to run metric tests
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Get rid of similar funcs, reconcile more generally
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Deploy alerts infra as part of our installation
Conditionally deploy the infrastructure that is needed to fire alerts for our users
when bad things are happening to CDI.
Testing with `KUBEVIRT_DEPLOY_PROMETHEUS=true`
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Watch and unit test all prometheus related resources
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* add gateway for changing monitoring namespace (rbac purposes)
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* refactor test to check for exact alert name and firing state
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Align pattern of ensuring prometheus resource exists for all
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Remove potential noisy event
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Extract duplicate code to function
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
* Dont use empty value for prometheus label due to open issue
https://github.com/prometheus-operator/prometheus-operator/issues/4325
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>