Would like to define Warning and Critical threshold ranges within a single Event or SmartAlert

If a customer wants both a Warning and a Critical threshold for the same underlying metric, they must create either 2 Smart Alerts or 2 Events. For example, create a Warning Event when CPU usage is between 80% and 90% and a Critical Event when CPU usage is greater than 90%. The same applies to Smart Alerts: for example, create 2 Smart Alerts, one with Warning when latency is between 500ms and 1000ms and one with Critical when latency is greater than 1000ms.

This works, but it can cause problems from an Alerting perspective. If the metric moves back and forth between the Warning and Critical ranges, multiple Events and Alerts can trigger. Ultimately, this can lead to multiple Tickets in a ticketing system. Here is an example:

CPU thresholds are Warning from 80% to 90% and Critical greater than 90%

CPU usage is 88% - An Event triggers and an Alert is sent
CPU increases to 92% - Another Event triggers and another Alert is sent. And, the original Event MAY close depending on the grace period
CPU decreases to 87% - If the grace period for the first Event has passed, a NEW Event and Alert open. And, depending on the grace period of the critical event, the 2nd Event closes.

It would be MUCH better if a Single Event and Alert could be updated to reflect the severity of the underlying conditions. This could be accomplished by having a single Event or SmartAlert definition that has 2 different severities and value ranges associated with it.
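
To make the difference concrete, here is a minimal Python sketch of the CPU example above. It is not Instana code: `RangeRule`, `count_events_two_rules` and `track_single_event` are hypothetical names, grace periods are assumed to have already elapsed (an event closes as soon as its range no longer matches), and the range bounds are assumed to be exclusive at the bottom and inclusive at the top.

```python
from dataclasses import dataclass


@dataclass
class RangeRule:
    """Hypothetical rule that matches while low < value <= high."""
    severity: str
    low: float
    high: float

    def matches(self, value: float) -> bool:
        return self.low < value <= self.high


def count_events_two_rules(samples, rules):
    """Today's workaround: every rule opens its own event when its range is entered."""
    opened = 0
    active = {rule.severity: False for rule in rules}
    for value in samples:
        for rule in rules:
            now = rule.matches(value)
            if now and not active[rule.severity]:
                opened += 1  # a new Event (and Alert) for this rule
            active[rule.severity] = now  # assumption: no grace period
    return opened


def track_single_event(samples, rules):
    """Requested behavior: one event stays open and only its severity changes."""
    opened = 0
    current = None
    history = []
    for value in samples:
        severity = next((r.severity for r in rules if r.matches(value)), None)
        if severity is not None and current is None:
            opened += 1  # the event opens once
        current = severity
        history.append(severity)
    return opened, history


RULES = [RangeRule("WARNING", 80, 90), RangeRule("CRITICAL", 90, float("inf"))]
SAMPLES = [88, 92, 87]  # the CPU readings from the example

print(count_events_two_rules(SAMPLES, RULES))
print(track_single_event(SAMPLES, RULES))
```

With two separate rules, each crossing between ranges opens another event, while the single-event model opens once and records the severity change.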

Idea priority

Low

Guest

Sep 4, 2024

The response states that this ask was not delivered and that the idea was relegated to a nice-to-have. Within version 2 81, there is an option to configure an event or smart alert as either critical or warning, but each one is independent of the other. So yes, the warning will remain open until the metric value drops below its threshold, and the same goes for the critical.
Second, yes, there will be multiple events opening and closing if the value bounces above and then below the thresholds. The grace period and "ignore for" settings, while novel, serve to flatten out the event cycles.
To conclude, this was not delivered and was shuffled to "nice to have". Thus, to get functionality that all other APM software has, we will need to use the workaround, which is two events or two smart alerts, and then contend with the chatter: trying to line up the critical close with the critical open and the warning open with the warning close, with the possibility of multiple opens and closes.
We are staring down over 2,000 alerts in our current APM. To convert them, we will need to create 4,000+ smart alerts manually, since smart alerts do not have a Terraform API. Or we could use 4,000 events, but since there is no placeholder for the affected server/instance, we would basically need to create the event set for each context. So instead of the base 2,000, we are looking at (5 metrics x 500 servers = 2,500 base) + (5 x 400 applications/services = 2,000 base), then x2 for critical/warning: 9,000. Yeah, the workaround adds far more than a "nice to have".
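
The event count in that last paragraph can be checked with a few lines of Python. The figures come straight from the comment; the constant and variable names are only illustrative.

```python
METRICS_PER_ENTITY = 5   # metrics monitored per server or application/service
SERVERS = 500
APP_SERVICES = 400
SEVERITIES = 2           # one event each for warning and critical

server_events = METRICS_PER_ENTITY * SERVERS        # base events for servers
service_events = METRICS_PER_ENTITY * APP_SERVICES  # base events for applications/services
total_events = (server_events + service_events) * SEVERITIES

print(server_events, service_events, total_events)
```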

Guest

May 20, 2024

Just for completeness, this is a follow-up of https://automation-management.ideas.ibm.com/ideas/INSTANA-I-1773. It was identified as a nice-to-have, but not an ITM blocker, because of the existing workaround of creating 2 alert configurations with different thresholds/severities.
In contrast to what is described in the Idea description, there is however one thing that is different. We don't allow defining a range (e.g. WARN from 80%-90%), and we don't intend to do it that way. Instead, we intend to define an escalation of different thresholds that all use the same threshold operator, to ensure there are no gaps or conflicts between the thresholds, which are what cause the problems described in the Idea description.
> And, the original Event MAY close depending on the grace period
This would therefore not happen, not even in the current (workaround) solution of 2 Custom Events, because the rule would not be defined as WARNING in the range of 80-90%, but as WARNING > 80%, which would still hold when the metric changed to 92%. Both defined Custom Events (WARN if > 80%; CRITICAL if > 90%) would be active at that time, as both conditions are met.
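
As a rough illustration of the same-operator escalation described here, the Python sketch below evaluates both rules with `>` only. It is not the Instana implementation: `active_severities` and `highest_severity` are hypothetical helpers, and the thresholds are assumed to be listed in ascending order.

```python
# Thresholds from the reply: WARN if > 80%, CRITICAL if > 90%.
# Assumption: the list is ordered from least to most severe.
THRESHOLDS = [("WARNING", 80), ("CRITICAL", 90)]


def active_severities(value, thresholds=THRESHOLDS):
    """Return every severity whose `value > limit` condition currently holds."""
    return [severity for severity, limit in thresholds if value > limit]


def highest_severity(value, thresholds=THRESHOLDS):
    """Escalation view: the most severe level whose threshold is exceeded, or None."""
    active = active_severities(value, thresholds)
    return active[-1] if active else None


for cpu in (88, 92, 87, 75):
    print(cpu, active_severities(cpu), highest_severity(cpu))
```

Because every rule uses the same operator, a reading of 92% satisfies both conditions, so the warning stays active instead of closing when the critical opens.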
