Cloud Management and AIOps


This is an IBM Automation portal for Cloud Management and AIOps products. To view all of your ideas submitted to IBM, create and manage groups of Ideas, or create an idea explicitly set to be either visible by all (public) or visible only to you and IBM (private), use the IBM Unified Ideas Portal (https://ideas.ibm.com).

Shape the future of IBM!

We invite you to shape the future of IBM, including product roadmaps, by submitting ideas that matter to you the most. Here's how it works:

Search existing ideas

Start by searching and reviewing ideas and requests to enhance a product or service. Take a look at ideas others have posted, and add a comment, vote, or subscribe to updates on them if they matter to you. If you can't find what you are looking for, post a new idea.

Post your ideas
  1. Post an idea.

  2. Get feedback from the IBM team and other customers to refine your idea.

  3. Follow the idea through the IBM Ideas process.

Specific links you will want to bookmark for future use

Welcome to the IBM Ideas Portal (https://www.ibm.com/ideas) - Use this site to find additional information and details about the IBM Ideas process and statuses.

IBM Unified Ideas Portal (https://ideas.ibm.com) - Use this site to view all of your ideas, create new ideas for any IBM product, or search for ideas across all of IBM.

ideasibm@us.ibm.com - Use this email to suggest enhancements to the Ideas process or request help from IBM for submitting your Ideas.

Status: Not under consideration
Workspace: IBM Turbonomic ARM
Created by: Guest
Created on: Sep 15, 2022

Capture Restarts for PODs across time

Give customers visibility into how their pods are performing, including any restarts that happen due to a lack of memory. This would help the customer understand how their application pods are behaving, and it could also be used in a POV to show how we help customers avoid pod restarts by correctly sizing memory limits.
Idea priority: Medium
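
For illustration only, a minimal sketch of the raw signal this idea asks to capture, assuming the official kubernetes Python client and a configured kubeconfig (the namespace and the print output are placeholders): Kubernetes already exposes per-container restart counts and the reason for the last termination (for example "OOMKilled"), the same information shown by kubectl get pods.

    # Sketch: read per-container restart counts and last-termination reasons.
    # Assumes the `kubernetes` Python client and a reachable cluster.
    from kubernetes import client, config

    config.load_kube_config()              # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    for pod in v1.list_namespaced_pod("default").items:
        for cs in (pod.status.container_statuses or []):
            last = cs.last_state.terminated           # populated after a container restart
            reason = last.reason if last else None    # e.g. "OOMKilled", "Error"
            print(pod.metadata.name, cs.name, cs.restart_count, reason)

Capturing this signal continuously, rather than at a single point in time, is what the idea proposes.
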
  • Guest
    Sep 16, 2022

    Thank you, Eva, for your fast response.

    I was asking about this because of the following use case:

    I had a customer who implemented Instana and Turbo (integrated), and there was an issue with their application where one pod calls a service on another pod that, at a certain point, returned a 503 (Service Unavailable) error. From what I understood from the Instana engineer working on that case, Instana was not able to capture the reason the service was unavailable (we suspect the pod restarted). I was wondering why we don't capture that kind of information, such as pod restarts over time, since it would give good visibility into pod behavior over time.

    I understand that pod restarts can happen for other reasons, but that kind of information should still provide good visibility.

    Are you looking to just show restarts on the same scheduled pod (like an OOM kill event or CrashLoopBackOff) in the same way we can see restart counts from kubectl get pods? (Yes, I think that would be great.)

    Given that this pod could go away at some point, such as during a redeployment, where would you want to persist this information over time? (I am not sure whether that would be valuable or not.)

    How would you want to handle a restart based on a new ReplicaSet or a change in configuration? That will schedule a whole new pod, rather than have the existing one restart, so what would be the purpose of tracking that? (I think that should reset the data we have; as you mentioned, it is a new configuration, and as a developer I would need to understand whether my new configuration works as intended. How are we handling that currently in Turbo?)


    I think the idea is to give the customer visibility into the environment without needing to log in to the k8s cluster to get that simple information.

    If you would like to discuss this further, let me know.

  • Admin
    Eva Tuczai
    Sep 15, 2022

    OOMs are a challenge. What if we consider a Mem Limit resize-up action that specifically addresses an OOM risk? You are correct that we can even detect OOMs (it's a condition reported on the Pod), and this resize-up action could be given more "priority" rather than waiting for enough data points for the percentile analysis to kick in. Note that sometimes OOMs are not reported as such and the container is simply terminated with no descriptive error message. We explored this 2 years ago and could revisit addressing OOMs in a specific way, similar to what we do for CPU Limits when CPU throttling is detected.


    It would be good to get some clarification on restarts. Are you looking to just show restarts on the same scheduled pod (like an OOM kill event or CrashLoopBackOff) in the same way we can see restart counts from kubectl get pods? Given that this pod could go away at some point, such as during a redeployment, where would you want to persist this information over time? How would you want to handle a restart based on a new ReplicaSet or a change in configuration? That will schedule a whole new pod, rather than have the existing one restart, so what would be the purpose of tracking that?
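
As an illustration of the OOM-driven resize-up described above, here is a minimal sketch, again assuming the official kubernetes Python client; the oom_resize_candidates helper and the 1.5x bump factor are illustrative assumptions, not Turbonomic's actual analysis:

    # Sketch: flag containers whose last termination was OOMKilled and suggest a
    # higher memory limit right away, instead of waiting for percentile analysis.
    from decimal import Decimal
    from kubernetes import client, config
    from kubernetes.utils import parse_quantity   # converts "256Mi" etc. to Decimal bytes

    config.load_kube_config()
    v1 = client.CoreV1Api()

    def oom_resize_candidates(namespace, bump_factor=Decimal("1.5")):
        for pod in v1.list_namespaced_pod(namespace).items:
            limits = {c.name: ((c.resources and c.resources.limits) or {}).get("memory")
                      for c in pod.spec.containers}
            for cs in (pod.status.container_statuses or []):
                last = cs.last_state.terminated
                if last and last.reason == "OOMKilled" and limits.get(cs.name):
                    current = parse_quantity(limits[cs.name])
                    yield (pod.metadata.name, cs.name,
                           limits[cs.name], int(current * bump_factor))
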
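
Separately, on the question of where restart history would live once a pod goes away, one minimal sketch (same assumed Python client; the polling loop and in-memory dict are placeholders for whatever store would actually be used) is to sample restart counts periodically and key them by the pod's owning controller. For Deployment-managed pods the owner is the current ReplicaSet, so a rollout or configuration change naturally starts a fresh series, while OOMKilled or CrashLoopBackOff restarts of the same pod accumulate on the existing one, which matches the "reset on new configuration" behavior discussed in this thread.

    # Sketch: sample restart counts on an interval, keyed by owning controller,
    # so a rollout starts a new series while in-place restarts extend the old one.
    import time
    from collections import defaultdict
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # history[(namespace, owner, container)] -> list of (timestamp, pod_name, restart_count)
    history = defaultdict(list)

    def sample(namespace):
        for pod in v1.list_namespaced_pod(namespace).items:
            refs = pod.metadata.owner_references or []
            owner = refs[0].name if refs else pod.metadata.name
            for cs in (pod.status.container_statuses or []):
                history[(namespace, owner, cs.name)].append(
                    (time.time(), pod.metadata.name, cs.restart_count))

    while True:
        sample("default")      # a watch on pod events would be less chatty than polling
        time.sleep(60)
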