Commit Graph

8 Commits

Author SHA1 Message Date
akalenyu
5dbbb10ee9
Adjust alert summary that is visible in UI to be more informative (#2149)
As of now OpenShift UI will not display the runbook URL field for our alerts
(docs and google will though), so let's make sure we provide a better entry point
by being verbose in what is actually visible to the user.

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
2022-02-14 16:41:50 +01:00
João Vilaça
2b7f786ffe
Add tool to generate metrics documentation (#2043)
Signed-off-by: João Vilaça <jvilaca@redhat.com>
2022-01-26 01:23:10 +01:00
Michael Henriksen
d56e0cca05
23 libs (#2077)
Signed-off-by: Michael Henriksen <mhenriks@redhat.com>
2022-01-07 16:56:25 +01:00
akalenyu
3bff27dd43
Add alert for DataImportCron not being up to date (#2063)
* Add alert for DataImportCron failing

DataImportCrons now have conditions (particularly UpToDate) that tell us if
things are going as planned. We can utilize those to alert whenever were not UpToDate for a while.

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Address CR review; don't List, increment when needed via corresponding instance

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Address review & bugfix: don't update metric if err occurs

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* upToDateCondition => prevUpToDateCondition so it's clear we're deciding if we should inc/dec based on that

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Don't store state in controller; change metric type to GaugeVec (bool metric per DIC)

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
2021-12-24 01:10:52 +01:00
Assaf Admi
639c6a1bd1
Add common labels into alert definitions (#2039)
We want to be able to list all kubevirt alerts so we added labels to
differentiate them.

Signed-off-by: assafad <aadmi@redhat.com>
2021-12-13 18:05:08 +01:00
akalenyu
38af724f1c
Add alert for incomplete storage profiles / delete profile when corresponding SC gone (#2027)
* Add alert for incomplete storage profiles

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Run metric tests on both openshift and k8s

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Add functional test for storageprofile metrics

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Delete profile as a follow up to storage class getting deleted

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Address review, alter tests to cover List metric approach

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Address review; individually loop over metric decrement, shorten reconcile.Result{}

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Address review; deletion timestamp not possible when err/teardown in AfterEach

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
2021-12-01 21:54:59 +01:00
akalenyu
fd332a3165
Degraded/unusual restartcount alerts (#2009)
* Add degraded alert

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Add unusual restart count metric

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Add actual firing alerts (degraded/restartcount)

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Test newly added metrics

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Review: Rename metric to match conventions, func to check if test is eligible to run metric tests

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Get rid of similar funcs, reconcile more generally

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
2021-11-18 01:05:01 +01:00
akalenyu
50c93e8b0e
Deploy alerts infra as part of our installation (#1979)
* Deploy alerts infra as part of our installation

Conditionally deploy the infrastructure that is needed to fire alerts for our users
when bad things are happening to CDI.

Testing with `KUBEVIRT_DEPLOY_PROMETHEUS=true`

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Watch and unit test all prometheus related resources

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* add gateway for changing monitoring namespace (rbac purposes)

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* refactor test to check for exact alert name and firing state

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Align pattern of ensuring prometheus resource exists for all

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Remove potential noisy event

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Extract duplicate code to function

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>

* Dont use empty value for prometheus label due to open issue

https://github.com/prometheus-operator/prometheus-operator/issues/4325

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
2021-10-26 21:26:07 +02:00