How to prevent system lock-ups: when a process eats up all the memory

A couple of days ago, I had the painful experience of watching a process slowly but surely eat up all the available memory, fill up the swap space, and leave me agonizing in front of a computer that would not respond, too busy as it was with slowly dying. If I take the trouble to relate the experience, it is because I believe a proper Linux distribution should take pro-active measures to prevent this kind of occurrence, and should allow the experienced user to recover from a potentially disastrous mistake.

There I was, two days ago, minding my own business... err... I mean minding my own code, trying to debug a very complex Drupal contrib module (namely Views 2). As I was trying to understand the internals of the module, I used dpm() to dump complex variables (objects and deeply-nested arrays) onto the screen.

At some point, I wanted to understand where a particular object was created, so within the object constructor, I placed some debug code like this:

<?php
class views_handler_filter extends views_handler {
  function init(&$view, $options) {  // I also tried in __construct() {}
    $bt = debug_backtrace();
    dpm($bt, 'backtrace');
    // ...
  }
}
?>

Unfortunately, loading the page simply led to a White Screen of Death, the server's PHP memory allowance being exhausted. Apparently, one cannot safely place a backtrace in a PHP object constructor (???).
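In hindsight, the constructor itself was probably not the problem: by default debug_backtrace() records the arguments (and object) of every stack frame, and in Views those arguments typically include the enormous, partly self-referential $view object, which dpm()/Krumo then has to render again for each frame. Here is a minimal sketch of a less dangerous dump, assuming a PHP version recent enough to support the $options and $limit parameters of debug_backtrace():

<?php
// Sketch only: drop the argument lists (which is where the huge $view
// objects live) and keep just the first ten stack frames.
// DEBUG_BACKTRACE_IGNORE_ARGS needs PHP 5.3.6+, the $limit argument PHP 5.4+.
$bt = debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS, 10);
dpm($bt, 'backtrace');
?>

Even then, a dump that runs once for every handler on a complex Views page adds up quickly, so writing a short summary to a log file rather than to the page is probably the safer habit.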

I am stupid!

There! I said it. So no need to call me names as I narrate the next part, the time when the real troubles started.

I saw that my browser wouldn't display the backtrace with my usual tools, dpm() and Krumo. So instead I tried a simple print_r($bt), which printed something unreadable in my browser, because the nested arrays were not rendered readably. I stopped the browser from loading the page and instead selected, from the Konqueror menu, "Open with ... kate".

Then Kate went berserk. The backtrace dump apparently weighed many gigabytes. I didn't react quickly enough, and much too soon my system became unresponsive. I could see in KRunner that Kate was eating more and more memory; the system was busy swapping, and I just couldn't bring up a console, nor even use the already opened KRunner, to kill Kate and stop this mad dash to consume all the available memory. The mouse was more or less responsive, but whatever I clicked on took ages to come up or react, to the point that I could no longer tell what I had actually clicked on last. I had closed (or tried to close) Kate, but it was still trying to load the huge dump.

The ordeal lasted about an hour, during which I was completely powerless. The system was simply swapping, swapping, swapping more and more memory, and I just couldn't manage to kill the mad process.

Then, all of a sudden, Kate died or exited, either because it finally acted on my much earlier request to quit, or because it had run out of memory. I don't know. Nor do I know whether the whole backtrace was ever loaded into it.

My system hadn't crashed; the kernel was still running fine. But the memory of every application had been swapped out, so using anything required its memory to be swapped back into main memory first. However, applications like Konqueror never recovered from the ordeal: I couldn't open a new browser window or do much of anything. So finally, I saved what was left to save and rebooted the computer.

I'd be interested to know how this experience could be recreated safely in a testing environment.
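One contained way to do it, sketched below on the assumption that a per-process cap is enough to keep the test harmless, is a tiny PHP command-line script that allocates memory in a loop while an explicit memory_limit makes PHP abort it long before the machine starts swapping (the file name memory_hog.php is just an illustration):

<?php
// memory_hog.php -- contained reproduction sketch.
// Run it from the command line with a modest, explicit memory limit, e.g.:
//   php -d memory_limit=256M memory_hog.php
// PHP kills the script with an "Allowed memory size exhausted" fatal error
// once the limit is hit, instead of dragging the whole system into swap.
$chunks = array();
while (TRUE) {
  // Allocate roughly one megabyte per iteration and keep it referenced.
  $chunks[] = str_repeat('x', 1024 * 1024);
  echo count($chunks), " MB allocated\n";
}
?>

Raising or removing the memory_limit turns this into exactly the kind of runaway process described above, so that variant is best tried only inside a virtual machine, or with an operating-system-level limit (such as a ulimit set in the shell) in place.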

More importantly, what strategies could distributions put in place to safeguard against such occurrences? Can this be prevented? Can't the system notice that an application is requesting too much memory?

I think the most important thing is to make sure that, as long as the kernel itself does not die, a moderately experienced user is able to recover fairly easily from such a mishap. What is the point of having the mouse pointer niced up so that it is always responsive, if whatever we click on is not responsive, because it is not niced higher than the mad process? Is there a way to have foreground applications automatically niced up so that they respond to user demand?

Another scenario I have encountered many times in the past, back when I had a single-core CPU, is that one process would take 100% of CPU resources, thereby locking the system completely.

The bottom line is that, as long as the Linux kernel is still operating fine, a complete lock-up of the system should never occur, no matter how badly some application misbehaves or what stupid operation a user like me might have attempted.

What can be done to prevent such things? How should the system be configured? What is possible (at least theoretically), and what is not? What solutions already exist??

Issues related to this page:

Project: Linux Distribution
Summary: How to prevent system lock-ups: when a process …
Status: active
Priority: normal
Category: task
Last updated: 8 years 43 weeks
Assigned to: