Self-recovery in case of OOM error to improve reliability of the Instana agent service

Background

The Instana agent ships with the following default Java heap settings:

JAVA_MIN_MEM=64m
export JAVA_MIN_MEM
JAVA_MAX_MEM=160m
export JAVA_MAX_MEM

Even with JAVA_MAX_MEM set to more than three times the nominal memory usage (e.g. 512m against an average below 140m, as observed in a production environment), the agent can still run into an out-of-memory (OOM) error under peak load or other circumstances. Recovering from this currently requires a manual systemctl restart of the service, which returns it to "normal" resource consumption.

It is worth noting that automatic restart depends on how the agent was started. An agent running in Kubernetes can be restarted according to its configuration. But when the agent is extracted and/or installed from a package, it is not able to recover from a hard crash such as an OOM.

Proposal

The proposal is for the agent to detect an OOM condition and recover from the crash on its own. The systemd service unit has auto-restart configured, but the agent does not seem to return a correct failure code in an OOM situation. It would be desirable for the agent to have a means of auto-recovery.
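To illustrate the restart policy discussed above, here is a minimal sketch of a systemd drop-in that restarts a service after a non-zero exit. The unit name instana-agent, the file name restart.conf and the 10 second delay are assumptions for illustration, not values taken from the agent package. Note that Restart=on-failure only helps if the process actually terminates with a failure status. On HotSpot-based JVMs the -XX:+ExitOnOutOfMemoryError option makes the JVM exit on the first OOM, but whether and how the agent accepts extra JVM options is not covered here.

```shell
#!/bin/sh
# Sketch only: the unit name "instana-agent", the file name "restart.conf"
# and the restart delay are assumptions, not values from the agent package.

# Write a systemd drop-in that restarts the service after a failed exit.
# $1 = drop-in directory, e.g. /etc/systemd/system/instana-agent.service.d
write_restart_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=10
EOF
}

# Possible usage (as root), followed by a reload so systemd picks it up:
#   write_restart_dropin /etc/systemd/system/instana-agent.service.d
#   systemctl daemon-reload
```

Restart=on-failure is used here rather than Restart=always so that a deliberate systemctl stop is still respected.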

The benefits of doing so would be:

avoid manual restart operations (toil)
avoid "out-of-band" automation scripts for recovery (i.e. clean self-recovery)
improve reliability of the service
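
For comparison, the kind of "out-of-band" automation that this idea aims to make unnecessary could look roughly like the sketch below. It counts OutOfMemoryError lines in a log file and restarts the service only when the count has grown since the last check. The function names, the state file, the RESTART_CMD variable and the unit name instana-agent are all hypothetical. The log file location is passed in as an argument because it depends on the installation.

```shell
#!/bin/sh
# Hypothetical watchdog sketch. Function names, the state file, RESTART_CMD
# and the unit name "instana-agent" are assumptions for illustration.

# Print the number of lines in log file $1 that mention a JVM OutOfMemoryError.
count_oom_lines() {
    [ -r "$1" ] || { echo 0; return 0; }
    # grep -c prints 0 but exits 1 when nothing matches; ignore that status.
    grep -c 'java.lang.OutOfMemoryError' "$1" || true
}

# Restart the service only if the OOM count grew since the previous check.
# $1 = log file, $2 = state file that stores the previous count.
recover_if_new_oom() {
    current=$(count_oom_lines "$1")
    previous=$(cat "$2" 2>/dev/null || echo 0)
    previous=${previous:-0}
    echo "$current" > "$2"
    if [ "$current" -gt "$previous" ]; then
        # RESTART_CMD can be overridden, e.g. for a dry run.
        ${RESTART_CMD:-systemctl restart instana-agent}
        return $?
    fi
    return 0
}

# Possible usage from cron or a systemd timer (paths are placeholders):
#   recover_if_new_oom /path/to/agent.log /var/tmp/agent-oom.count
```

Storing the previous count keeps an old OOM entry from triggering a restart on every run. After a log rotation the count drops, so the next check simply records the new baseline without restarting.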

Idea priority

High

Henning Treu (Admin) | Feb 23, 2023

Hi,
automatically restarting on OOM errors is a double-edged sword. It will improve reliability when the OOM occurs only in very rare cases caused by an abnormal system state.
On the other hand, an automatic restart will mask bugs in the agent (code, configuration, sensors), which will then go unnoticed.
I urge you to open a support ticket for every OOM you encounter with the agent. This gives us the ability to improve agent and sensor quality, and other customers will benefit as well.
Don't get me wrong, this is not meant to push quality control onto your side! We do test the agent and our sensors on a lot of different environments and system combinations. Still, our testing will never mirror all of our customers' setups, and those setups may introduce such issues.

Best regards
Henning Treu - Product Manager Agent & Application Perspectives
