Cloud Management and AIOps


This is an IBM Automation portal for Cloud Management and AIOps products. To view all of your ideas submitted to IBM, create and manage groups of Ideas, or create an idea explicitly set to be either visible by all (public) or visible only to you and IBM (private), use the IBM Unified Ideas Portal (https://ideas.ibm.com).

Shape the future of IBM!

We invite you to shape the future of IBM, including product roadmaps, by submitting ideas that matter to you the most. Here's how it works:

Search existing ideas

Start by searching and reviewing ideas and requests to enhance a product or service. Take a look at ideas others have posted, and add a comment, vote, or subscribe to updates on them if they matter to you. If you can't find what you are looking for, post a new idea.

Post your ideas
  1. Post an idea.

  2. Get feedback from the IBM team and other customers to refine your idea.

  3. Follow the idea through the IBM Ideas process.

Specific links you will want to bookmark for future use

Welcome to the IBM Ideas Portal (https://www.ibm.com/ideas) - Use this site to find additional information and details about the IBM Ideas process and statuses.

IBM Unified Ideas Portal (https://ideas.ibm.com) - Use this site to view all of your ideas, create new ideas for any IBM product, or search for ideas across all of IBM.

ideasibm@us.ibm.com - Use this email to suggest enhancements to the Ideas process or request help from IBM for submitting your Ideas.

Status: Not under consideration
Workspace: IBM Turbonomic ARM
Created by: Guest
Created on: Sep 15, 2022

Capture Restarts for PODs across time

Give customers visibility into how their pods are performing, including any restarts that happen due to a lack of memory. This would help the customer understand how their application pods are behaving, and it could also be used in a POV to show how we help customers avoid pod restarts by correctly sizing memory limits.
Idea priority: Medium
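
For illustration only, a minimal sketch of the raw signal this idea asks to capture, assuming the official kubernetes Python client and a configured kubeconfig (the namespace and the print output are placeholders): Kubernetes already exposes per-container restart counts and the reason for the last termination (for example "OOMKilled"), the same information shown by kubectl get pods.

    # Sketch: read per-container restart counts and last-termination reasons.
    # Assumes the `kubernetes` Python client and a reachable cluster.
    from kubernetes import client, config

    config.load_kube_config()              # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    for pod in v1.list_namespaced_pod("default").items:
        for cs in (pod.status.container_statuses or []):
            last = cs.last_state.terminated           # populated after a container restart
            reason = last.reason if last else None    # e.g. "OOMKilled", "Error"
            print(pod.metadata.name, cs.name, cs.restart_count, reason)

Capturing this signal continuously, rather than at a single point in time, is what the idea proposes.
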
  • Guest
    Sep 16, 2022

    Thank you, Eva, for your fast response.

    I was asking about this because of the following use case:

    I had a customer who implemented Instana and Turbo (integrated), and there was an issue with their application where one pod calls a service on another pod that, at a certain point, returned a 503 (Service Unavailable) error. From what I understood from the Instana engineer working on that case, Instana was not able to capture the reason the service was unavailable (we suspect the pod restarted). I was wondering why we don't capture that kind of information, such as pod restarts over time, since it would give good visibility into pod behavior over time.

    I understand that pod restarts can happen for other reasons, but that kind of information should still provide good visibility.

    Are you looking to just show restarts on the same scheduled pod (like an OOM kill event or CrashLoopBackOff) in the same way we can see restart counts from kubectl get pods? (Yes, I think that would be great.)

    Given that this pod could go away at some point, such as during a redeployment, where would you want to persist this information over time? (I am not sure whether that would be valuable or not.)

    How would you want to handle a restart based on a new ReplicaSet or a change in configuration? That will schedule a whole new pod, rather than have the existing one restart, so what would be the purpose of tracking that? (I think that should reset the data we have; as you mentioned, it is a new configuration, and as a developer I would need to understand whether my new configuration works as intended. How are we handling that currently in Turbo?)


    I think the idea is to give the customer visibility into the environment without needing to log in to the k8s cluster to get that simple information.

    If you would like to discuss this further, let me know.

  • Admin
    Eva Tuczai
    Sep 15, 2022

    OOMs are a challenge. What if we consider a Mem Limit resize-up action that specifically addresses an OOM risk? You are correct that we can even detect OOMs (it's a condition reported on the Pod), and this resize-up action could be given more "priority" rather than waiting for enough data points for the percentile analysis to kick in. Note that sometimes OOMs are not reported as such and the container is simply terminated with no descriptive error message. We explored this 2 years ago and could revisit addressing OOMs in a specific way, similar to what we do for CPU Limits when CPU throttling is detected.


    It would be good to get some clarification on restarts. Are you looking to just show restarts on the same scheduled pod (like an OOM kill event or CrashLoopBackOff) in the same way we can see restart counts from kubectl get pods? Given that this pod could go away at some point, such as during a redeployment, where would you want to persist this information over time? How would you want to handle a restart based on a new ReplicaSet or a change in configuration? That will schedule a whole new pod, rather than have the existing one restart, so what would be the purpose of tracking that?
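
As an illustration of the OOM-driven resize-up described above, here is a minimal sketch, again assuming the official kubernetes Python client; the oom_resize_candidates helper and the 1.5x bump factor are illustrative assumptions, not Turbonomic's actual analysis:

    # Sketch: flag containers whose last termination was OOMKilled and suggest a
    # higher memory limit right away, instead of waiting for percentile analysis.
    from decimal import Decimal
    from kubernetes import client, config
    from kubernetes.utils import parse_quantity   # converts "256Mi" etc. to Decimal bytes

    config.load_kube_config()
    v1 = client.CoreV1Api()

    def oom_resize_candidates(namespace, bump_factor=Decimal("1.5")):
        for pod in v1.list_namespaced_pod(namespace).items:
            limits = {c.name: ((c.resources and c.resources.limits) or {}).get("memory")
                      for c in pod.spec.containers}
            for cs in (pod.status.container_statuses or []):
                last = cs.last_state.terminated
                if last and last.reason == "OOMKilled" and limits.get(cs.name):
                    current = parse_quantity(limits[cs.name])
                    yield (pod.metadata.name, cs.name,
                           limits[cs.name], int(current * bump_factor))
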
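
Separately, on the question of where restart history would live once a pod goes away, one minimal sketch (same assumed Python client; the polling loop and in-memory dict are placeholders for whatever store would actually be used) is to sample restart counts periodically and key them by the pod's owning controller. For Deployment-managed pods the owner is the current ReplicaSet, so a rollout or configuration change naturally starts a fresh series, while OOMKilled or CrashLoopBackOff restarts of the same pod accumulate on the existing one, which matches the "reset on new configuration" behavior discussed in this thread.

    # Sketch: sample restart counts on an interval, keyed by owning controller,
    # so a rollout starts a new series while in-place restarts extend the old one.
    import time
    from collections import defaultdict
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # history[(namespace, owner, container)] -> list of (timestamp, pod_name, restart_count)
    history = defaultdict(list)

    def sample(namespace):
        for pod in v1.list_namespaced_pod(namespace).items:
            refs = pod.metadata.owner_references or []
            owner = refs[0].name if refs else pod.metadata.name
            for cs in (pod.status.container_statuses or []):
                history[(namespace, owner, cs.name)].append(
                    (time.time(), pod.metadata.name, cs.restart_count))

    while True:
        sample("default")      # a watch on pod events would be less chatty than polling
        time.sleep(60)
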