Cloud Management and AIOps



Status Not under consideration
Workspace Instana
Categories Agent
Created by Guest
Created on Apr 6, 2022

Self-recovery in case of OOM error to improve reliability of the Instana agent service

Background

The Instana agent ships with the following default Java heap settings:

JAVA_MIN_MEM=64m
export JAVA_MIN_MEM
JAVA_MAX_MEM=160m
export JAVA_MAX_MEM


Even with JAVA_MAX_MEM set to more than three times the nominal memory usage (e.g. 512m against an average below 140m, as observed in a production environment), the agent can still run into an out-of-memory (OOM) error under peak load or in certain other circumstances. To recover from such an issue, a manual systemctl restart of the service restores it to "normal" resource consumption.


It is worth noting that automatic restart depends on how the agent was started. An agent running in Kubernetes can be restarted based on its configuration. But when the agent is extracted and/or installed from a package, it is not able to recover from a hard crash such as an OOM.


Proposal

The proposal is for the agent to detect an OOM condition and recover from the resulting crash by itself. The systemd service unit has auto-restart, but the agent does not seem to return a correct failure code in an OOM situation. An auto-recovery mechanism for the agent would be desirable.
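
For illustration, auto-restart in systemd is normally expressed through the Restart= setting of the service unit. The following is a minimal sketch, not the agent's shipped configuration: the unit name instana-agent.service, the drop-in file name restart.conf, and the 30-second delay are all assumptions. It also only helps if the agent's JVM really exits with a non-zero status on OOM, which is exactly what this idea reports as not happening.

```shell
#!/bin/sh
# Sketch only. ASSUMPTIONS (not confirmed by the source):
#   - the packaged agent runs as a systemd unit named instana-agent.service
#   - the agent's JVM actually exits with a non-zero status on OOM
UNIT_NAME="${UNIT_NAME:-instana-agent.service}"
DROPIN_DIR="${DROPIN_DIR:-/etc/systemd/system/${UNIT_NAME}.d}"

# Write a drop-in that asks systemd to restart the unit after an abnormal exit.
write_restart_dropin() {
    mkdir -p "$DROPIN_DIR" || return 1
    cat > "$DROPIN_DIR/restart.conf" <<'EOF'
[Service]
# Restart on non-zero exit status, fatal signal, or timeout.
Restart=on-failure
# Wait before restarting to avoid a tight crash loop.
RestartSec=30
EOF
}

# To apply (as root): write_restart_dropin && systemctl daemon-reload
```

On HotSpot JVMs, the -XX:+ExitOnOutOfMemoryError option makes the JVM exit on the first OutOfMemoryError, which would give systemd a failure to react to. Whether and how that option can be passed to the agent's JVM is not covered here.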


The benefit of doing so would be:

  • avoid manual restart operations (toil)

  • avoid the need for an out-of-band automation script to recover (i.e. clean self-recovery)

  • improve reliability of the service
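
For comparison, the kind of out-of-band recovery script that this proposal aims to make unnecessary could look roughly like the sketch below. It is illustrative only: the unit name instana-agent.service, the assumption that an OOM shows up in the unit's journal as "java.lang.OutOfMemoryError", and the five-minute window are not taken from the agent's documentation.

```shell
#!/bin/sh
# Sketch of an out-of-band watchdog. ASSUMPTIONS (not from the source):
#   - the unit is named instana-agent.service
#   - OOM events appear in the unit's journal as "java.lang.OutOfMemoryError"
UNIT_NAME="${UNIT_NAME:-instana-agent.service}"

# Succeeds if the log text on stdin mentions a Java OOM error.
log_has_oom() {
    grep -q 'java\.lang\.OutOfMemoryError'
}

# Check the last 5 minutes of the journal; restart the unit if an OOM was logged.
check_and_restart() {
    if journalctl -u "$UNIT_NAME" --since "5min ago" --no-pager | log_has_oom; then
        systemctl restart "$UNIT_NAME"
    fi
}
```

Such a script would have to be scheduled separately (for example from cron or a systemd timer, calling check_and_restart) and maintained outside the agent, which is the toil described above.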

Idea priority High
  • Admin
    Henning Treu
    Feb 23, 2023

    Hi,

    Automatic restarts on OOM errors are a double-edged sword. They will improve reliability when the OOM happens in very rare cases caused by an abnormal system state.

    On the other hand, an automatic restart will mask bugs in the agent (code, configuration, sensors), which will then go unnoticed.

    I urge you to open a support ticket for any OOM you encounter with the agent. This gives us the ability to improve agent and sensor quality, and other customers will benefit as well.

    Don't get me wrong, this is not meant to shift quality control to your side! We do test the agent and our sensors on many different environments and system combinations. Still, our testing will never mirror all of our customers' setups, which may introduce those issues.


    Best regards

    Henning Treu - Product Manager Agent & Application Perspectives