I was taught C at Epitech. A single segfault, no matter how insidious, was a valid reason to render a whole project NULL. We often had evaluations run with LD_LIBRARY_PATH=malloc_that_fails.so or just piping /dev/urandom to stdin...
Needless to say, calling exit() when a call to malloc() failed wasn't an acceptable recovery routine.
> Needless to say, calling exit() when a call to malloc() failed wasn't an acceptable recovery routine.
What do you do when malloc fails? A bit of graceful shutdown and logging seems like it would be in order, but otherwise how do you keep rolling if mallocs start failing? It seems to me like that would indicate something has gone unusually wrong and full recovery is futile.
I grew up using the Amiga, when having memory allocation fail was routine (a standard Amiga 500, for example, came with 512KB of RAM and was rarely expanded to more than 1MB, so you would run out of memory).
What you do when malloc() fails depends entirely on your application: if a desktop application on the Amiga shut down just because a memory allocation failed, nobody would use it. The expectation was that you'd gracefully clean up, fail whatever operation needed the memory, and if possible inform the user so they could free up memory before trying again.
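A minimal sketch of that pattern in plain C (load_document() and its parameters are made up for illustration): the allocation failure cancels only the operation that needed the memory, and the user is told about it.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical operation: on allocation failure it reports back and
     * fails this one operation instead of terminating the application. */
    static int load_document(const char *path, char **out, size_t size)
    {
        char *buf = malloc(size);
        if (buf == NULL) {
            fprintf(stderr, "Not enough memory to open %s; "
                            "close something and try again.\n", path);
            return -1;              /* operation failed, program lives on */
        }
        /* ... actually read the file into buf here ... */
        *out = buf;
        return 0;
    }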
This expectation in "modern" OSes that malloc never fails unless the world is falling apart really annoys me. For example, it leads to systems that lean on swap to the point where they slow down or become hopelessly unresponsive in cases where the proper response would have been to inform the user, and the user experience is horrendous. Swap is/was a kludge to handle high memory prices; having the option is great, but most of the time when one of my systems dips into swap, it indicates a problem I'd want to be informed about.
But on modern systems, most software handles it so badly that turning swap off is often not even a viable choice.
Of course there are plenty of situations where the above isn't the proper response, e.g. where you can't ask the user. But even for many servers, the proper response would not be to fall over and die if you can reasonably dial back your resource usage and fail in more graceful ways.
E.g. an app server does a better job if it at least provides the option to dynamically scale back the number of connections it handles rather than failing to provide service at all - degrading service or slowing down is often vastly better than having a service fail entirely.
Isn't fork the real offender, which requires Linux to overcommit by default? Disabling swap shouldn't affect that, right? Just makes your problem happen later, in a somewhat non-deterministic way.
Without fork, what reason is there not to disable swap? I can only think of an anonymous mmap where you want to use the OS VM as a cache system. But that's easily solved by providing a backing file, isn't it?
Saying that fork forces overcommit is strange. Fork is just one of the things that allocates memory. If you don't want overcommit fork should simply fail with ENOMEM if there isn't enough memory to back a copy of all the writable memory in the process.
I meant that the practical considerations of fork mean overcommitment is needed in many cases where it otherwise wouldn't be needed. If you fork a 2GB process but the child only uses 1MB, you don't want to commit another 2GB for no reason.
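For what it's worth, here's a minimal sketch of what strict accounting looks like from the caller's side: with overcommit disabled, the failure surfaces as an ordinary error return from fork(), because the kernel has to be able to back a full copy of the parent's writable pages even if the child never touches them.

    #include <errno.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == -1) {
            if (errno == ENOMEM)
                fprintf(stderr, "fork: not enough memory to commit a copy\n");
            return 1;               /* or: shed load, retry later, ... */
        }
        if (pid == 0)
            _exit(0);               /* child does its (small) job and exits */
        wait(NULL);                 /* parent reaps the child */
        return 0;
    }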
> Isn't fork the real offender, which requires Linux to overcommit by default?
Maybe I'm missing something, but how does fork require overcommitment? When you fork, you end up with COW pages, which share underlying memory. They don't guarantee that physical memory would be available if every page were touched and required a copy; they just share underlying physical memory. Strictly speaking, very little allocation has to happen for a process fork to occur.
If there's no overcommit, each of those COW pages needs some way of making sure it can actually be written to. Isn't that literally the point of overcommit? Giving processes more memory than they can actually use on the assumption they probably won't use it? And Windows takes the different approach of never handing out memory unless it can be serviced (via RAM or pagefile).
What am I missing? (I know you know way more about this than I do.)
When you fork a process, your application's contract with the kernel is such: existing pages will be made accessible in both the parent and the child; these pages may or may not be shared between both sides -- if they are, then the first modification to a shared page will cause an attempt to allocate a copy for that process; execution flow will continue from the same point on both sides. That's pretty much the extent of it (ignoring parentage issues for the process tree). The key thing here is the 'attempt' part -- nothing is guaranteed. The kernel has never committed to giving you new pages, just the old ones.
I don't personally see this as an overcommit, since the contract for fork() on Linux doesn't ever guarantee memory in the way that you'd expect it to. But in all honesty, it's probably just a matter of terminology at the end of the day, since the behavior (write to memory -> process gets eaten) is effectively the same.
Edit: Though I should note, all of the overcommit-like behavior only happens if you are using COW pages. If you do an actual copy on fork, you can fail with ENOMEM and handle that just like a non-overcommitting alloc. So even in the non-pedantic case, fork() really doesn't require overcommit, it's just vastly less performant if you don't use COW.
Oh. I was under the impression that if overcommit was disabled then forking a large process won't work if there's not enough RAM/swap available, regardless of usage.
So the out-of-memory failure won't happen when you malloc; it will happen when you write to a COW page. This somewhat invalidates the idea of a failing malloc.
The most common way to indicate that an error has occurred in a C function is to return non-zero. If this is done consistently, and return values are always checked, an error condition will eventually make its way up to main, where you can fail gracefully.
For example:
    #include <stdlib.h>

    int a(void)
    {
        /* ... do some work that can fail ... */
        int oops = 0;            /* set to non-zero on failure */
        if (oops) {
            return 1;
        }
        return 0;
    }

    int b(void)
    {
        if (a() != 0) {
            return 1;            /* propagate the error upward */
        }
        return 0;
    }

    int main(void)
    {
        if (b() != 0) {
            exit(EXIT_FAILURE);
        }
        return EXIT_SUCCESS;
    }
(This means that void functions should be avoided.)
It's not so simple in real life. I use this style of error handling for only one type of project: a library with unknown users. In that case, as I can't make assumptions about the existence of better error-handling systems, it gives the most flexible result. But at a price: I now have to document the error codes, and I had damned well better also provide APIs that allow my client to meaningfully recover from the error.
In most projects I have worked on, this type of error handling is completely inadequate. Think multithreaded applications. The code that needs to handle the error your code just generated isn't in the call stack. This happens very often in my experience, and I have found that the best solution is to post some kind of notification message rather than returning an error code. This creates a dependency on the notification system though, so it's not always the correct solution.
The thing I dislike most in your example is that you propagated the error from function a out of function b. My most robust code mostly uses void functions. Error codes are only used in cases where the user can actually take some meaningful action in response to the error, and frankly this is rarely the case. Instead I try as much as possible to handle errors correctly without propagation. It frees the users of my APIs from having to worry about errors, and in my opinion this should be a design goal of any API.
What's the point of propagating all errors way up to main if you're only going to exit anyway? I think we know how to indicate errors in C functions. What to do about specific errors, in this case allocation failures, is a more interesting question.
If malloc() fails now, it might succeed again later. So you can just go on doing everything else you were doing, then try the memory-hungry operation again in the future.
For example, this could be important in systems where you might be controlling physical hardware at the same time as reading commands from a network. It's probably a good idea to maintain control of the hardware even if you don't have enough memory right now to read network commands.
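Something like this sketch, where update_hardware() and read_network_commands() are hypothetical stand-ins for the real system: the hardware is serviced every cycle no matter what, and the buffer for network commands is simply retried on a later pass if it can't be allocated right now.

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical hooks into the rest of the system - not a real API. */
    void update_hardware(void);
    void read_network_commands(char *buf, size_t size);

    void control_loop(void)
    {
        char *cmd_buf = NULL;
        const size_t cmd_buf_size = 64 * 1024;

        for (;;) {
            update_hardware();                    /* must never be starved */

            if (cmd_buf == NULL)
                cmd_buf = malloc(cmd_buf_size);   /* may fail now, succeed later */

            if (cmd_buf != NULL)
                read_network_commands(cmd_buf, cmd_buf_size);
            /* else: no memory this cycle; commands wait, control continues */
        }
    }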
This is a pet peeve of mine with modern applications: So many of them just throw their metaphorical hands in the air and give up.
Prior to swap and excessive abuse of virtual memory this was not an option: If you gave up on running out of memory, your users gave up on your application. On the Amiga, for example, being told an operation failed due to lack of memory and given the chance to correct it by closing down something else was always the expected behaviour.
But try having allocations fail today, and watch the mayhem as any number of applications just fall over. So we rely on swap, which leaves systems prone to death spirals when the swap slows things down instead.
If embedded systems programmers wrote code the same way modern desktop applications developers did, we'd all be dead.
Doesn't it duplicate effort to put the responsibility of checking for available memory on individual applications?
I think, in most computing environments, it should be the operating system's responsibility to inform the user that free memory is running out. Applications should be able to assume that they can always get memory to work with.
I think swap is an extremely sensible solution, in that executing programs slowly is (in most environments, including the desktop) better than halting them completely. It provides an opportunity for the user to fix the situation without anything bad happening. Note that swap is optional anyway, so don't use it if you don't like it.
Comparing modern computing environments to the Amiga is laughable. It's not even comparable to modern embedded environments, because they serve(d) different purposes.
I'm a hobbyist C application/library developer who assumes memory allocation always works.
Most computing environments don't have a user to speak of, and the correct response of an application to an out of memory error could range from doing nothing to sounding an alarm.
As a user, I find it incredibly frustrating when my old but indispensable music software runs out of address space (I have plenty of RAM) and, instead of canceling the current operation (e.g. processing some large segment of audio), just dies with a string of cryptic error dialogs. The best thing for the user in this case is to hobble along without allocating more memory, not to fail catastrophically by assuming that memory allocation always works.
Swap is not a good solution because when there's enough swap to handle significant memory overuse, the system becomes unresponsive to user input since the latency and throughput to swap are significantly slower than RAM.
I think most computing environments do have a user, if you consider a "user" to be something that can be notified and can act on such notifications (e.g. to close applications).
Your music software's problem seems to be a bad algorithm - not that it doesn't check the return values of the `*alloc` functions. As you say, it should be able to process the audio in constant space. While I assume that I can always acquire memory, I do still care about the space-efficiency of my software.
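A sketch of what I mean by constant space (not the actual application's code, and the halving is just a placeholder for the real signal processing): one small, reusable buffer, regardless of how long the input is.

    #include <stdio.h>

    int process_audio(FILE *in, FILE *out)
    {
        short chunk[4096];                 /* one small, reusable buffer */
        size_t n;

        while ((n = fread(chunk, sizeof chunk[0], 4096, in)) > 0) {
            for (size_t i = 0; i < n; i++)
                chunk[i] /= 2;             /* stand-in for the real processing */
            if (fwrite(chunk, sizeof chunk[0], n, out) != n)
                return -1;                 /* write error */
        }
        return ferror(in) ? -1 : 0;
    }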
I must admit I've never seen my system depending on swap, so I don't know how bad it is. But if you have 1GB of memory already allocated in RAM, wouldn't it only be new processes that are slow?
Also, I'd again point out that if you don't like swap, you don't have to have one.
> if you have 1GB of memory already allocated in RAM, wouldn't it only be new processes that are slow?
No - the memory subsystem will swap out pages based on usage and some other parameters. A new application would most likely result in already-running applications' least-used pages being swapped out.
Conversely, if app developers wrote the same code that embedded systems programmers do, we'd never have any apps to use. It's just not worth the trade-off. Moreover, telling a user to free memory is a losing battle.
> If embedded systems programmers wrote code the same way modern desktop applications developers did, we'd all be dead.
If Boeing made passenger jets the way Boeing made fighters, we'd all be dead, too, but try telling a fighter pilot that they should do their job from a 777. It's two very different contexts.
Besides, some errors can't be recovered from. What do you do when your error logging code reports a failure? Keep trying to log the same error, or do you begin trying to log the error that you can't log errors anymore?
>What do you do when your error logging code reports a failure? Keep trying to log the same error, or do you begin trying to log the error that you can't log errors anymore?
First, you try to fix the problem with the logging system by running a reorganisation routine (delete old data, ...) or reinitializing the subsystem.
If that does not work AND if logging is a mandatory function of your system, you make sure to change into a safe state and indicate a fatal error state (blinking light, beeper, whatever). If the logging is such an important part of the system surrounding yours, it might take further actions on its own and reinit your system (maybe turn your system off and start another logging system).
There is no exit. You never give up.
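A rough sketch of that escalation in C, with made-up names standing in for whatever the real system provides:

    /* Hypothetical hooks - stand-ins for the real logging and safety code. */
    int  log_write(const char *msg);
    void log_delete_old_entries(void);
    void log_reinit(void);
    void enter_safe_state(void);
    void blink_fault_indicator(void);

    void log_or_escalate(const char *msg)
    {
        if (log_write(msg) == 0)
            return;                        /* normal path */

        log_delete_old_entries();          /* reorganise: free up space */
        log_reinit();                      /* reinit the subsystem */

        if (log_write(msg) == 0)
            return;

        enter_safe_state();                /* drive outputs to safe values */
        for (;;)
            blink_fault_indicator();       /* indicate fatal error; never exit() */
    }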
It's an even better idea to make the hardware fail safe, so you can let the program die and not worry too much about it. This does not apply in all cases (cars), but it does apply in many (trains, like a deadman switch for the computer). For a vivid example of why this is an important approach, read about the Therac-25.
To make sure you free() all your previous allocations on the way down. You can choose not to; it's "kinda" the same (can't remember the exact difference), but it's dirty and people with OCD just won't accept it.
(Disclosure: I got OCD too, this is not meant to make C development hostile to people with OCD.)
If your program is going to exit anyway, there's no need to call free() on allocated memory. The operating system will reclaim all of the memory it granted to your process when it exits. Remember that malloc lives in your process, not the kernel. When you ask for memory from malloc, it is actually doling out memory that you already own - the operating system granted it to you, when malloc requested it. And malloc requested it because you asked for more memory than it was currently managing.
If your intention is to continue running, then of course you want to call free() on your memory. And this certainly makes sense to do as you exit functions. But if you're, say, calling exit() in the middle of your program, for whatever reason, you don't need to worry about memory.
Other resources may be a problem, though. The operating system will reclaim things it controlled and granted to your process - memory, sockets, file descriptors and such. But you need to be careful about resources not controlled by the operating system in such a manner.
Some kernels may not get memory back by themselves and expect each application to give it back. We're lucky that the kernels we use everyday do, but we may one day have to write for a target OS where it's not the case. Just hoping "the kernel will save us" is a practice as bad as relying on undefined behaviors.
If you're coding correctly, you have exactly as many malloc()s as you have free()s, so when unwinding the stack to exit, your application is going to free() everything anyway.
Speaking of resources, what about leftover content in FIFOs or shared memory segments that you just locked?
And when you got OCD, you're only satisfied with this:
$ valgrind cat /var/log/*
...
==17473== HEAP SUMMARY:
==17473== in use at exit: 0 bytes in 0 blocks
==17473== total heap usage: 123 allocs, 123 frees, 2,479,696 bytes allocated
==17473==
==17473== All heap blocks were freed -- no leaks are possible
==17473==
==17473== For counts of detected and suppressed errors, rerun with: -v
==17473== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)
(cat was an easy choice and all I got on this box, but I've had bigger stuff already pass the "no leaks are possible" certification)
First, you sometimes do want to call exit deep within a program. That is the situation I am addressing, not normal operation. Of course you want to always free unused memory and prevent memory leaks. I am quite familiar with the importance of memory hygiene, and have even written documents admonishing students to use valgrind before coming to me for help: http://courses.cs.vt.edu/~cs3214/fall2010/projects/esh3-debu...
Second, please re-read my last sentence. I specifically addressed things that the kernel does not reclaim. This would also include System V shared memory segments and the like. You must manage these yourself, and it is always messy. Typically, rather than calling exit directly, you're going to invoke a routine that knows about all such resources, frees them, then calls exit. But you still don't need to unwind your stack back to main.
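A minimal sketch of such a routine, assuming a System V shared memory segment whose id was stashed in a global by whoever created it (the names here are made up): callers invoke die() instead of exit(), and no unwinding back to main() is needed.

    #include <stdlib.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    static int g_shm_id = -1;   /* set by whoever created the segment */

    void die(int status)
    {
        if (g_shm_id != -1)
            shmctl(g_shm_id, IPC_RMID, NULL);   /* the kernel won't do this for us */
        /* ordinary heap memory needs no free() here; it is reclaimed at exit */
        exit(status);
    }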
Third, the kernel always gets back all of its memory that was granted through conventional means. That's what operating systems do. I think you have a fundamental misunderstanding of what malloc is, and where it lives. Malloc is a user-level routine that lives inside of your process. When you call malloc, it is granting you memory that you already own. Malloc is just a convenience routine that sits between you and the operating system. When you say malloc(4), it does not go off and request 4 bytes from the operating system. It looks into large chunks of memory the operating system granted to it, and gives you some, updating its data structures along the way. If all of its chunks of memory are currently allocated, then it will go ask the operating system for memory - on a Unix machine, typically through a call to brk or mmap. But when it calls brk or mmap, it will request a large amount of memory, say a few megabytes. Then, from that large chunk, it will carve out 4 bytes for you to use.
(This is why off-by-one errors are so pernicious in C: the chances are very good that you actually do own that memory, so the kernel will happily allow you to access the value.)
Now, even if you are a good citizen and return all memory to malloc, the operating system still has to reclaim tons of memory from you. Which memory? Well, your stacks and such, but also all of that memory that malloc still controls. When you free memory back to malloc, malloc is very unlikely to then give it back to the operating system. So all programs, at exit, will have memory that they own that the kernel will need to reclaim.
They say memory is the second thing to go. ;) Unfortunately, the OS doesn't know how to remove files or modify database entries that also represent program state, or properly shut down connections to other machines. Proper unwinding is still necessary.
Sane cleanup. Close any open resources, especially interprocess visible resources. Resources entirely in your process, such as memory, will just get freed by the OS; a file might want to be flushed, a database properly closed. Likely, in the frame where you're out of memory, you won't have the context to know what should be done: that is most likely a decision of your caller, or their caller…
This is a brilliant example of why you'd want exceptions. Look at what you're doing for error handling, manually every time.
Exceptions do the exact same thing, except:
1) automatically
2) type-safely
3) allow you to give extra detail about the error
4) tools can actually follow the control flow and tell you what's happening and why
5) debuggers can break on all exceptions. Try breaking on "any error" in your code (I do know how C programmers "solve" that: single-stepping. Ugh)
In this case, they are better in every way.
This is almost as bad as the code in the Linux kernel and GNOME where they "don't ever use C++ objects!", and proceed to manually encode virtual method tables. And then you have 2 object types that "inherit" from each other (copy the virtual method table) and then proceed to overwrite the wrong indices with the overridden methods (and God forbid you forget to lock down alignment, resulting in different function pointers being overwritten on different architectures). Debugging that will cost you the rest of the week.
When it comes to bashing exceptions, it would be better to give people the real reason C++ programmers hate them: the major unsolvable problem you'll suddenly run into when using them. In C and C++, you can use exceptions XOR not use exceptions.
This sounds like it's not a big deal, until you consider libraries. You want to use old libraries? No exceptions for you! (unless you rewrite them). You want to use newer libraries? You don't get to not use exceptions anymore! You want to combine the two? That's actually possible, but if any exception-using library interacts with a non-exception library in the call stack: boom.
Exceptions are a great idea, but they don't offer a graceful upgrade path. Starting to use exceptions in C++ code is a major rewrite of the code. I guess if you follow the logic of the article that "would be a good thing", but given ... euhm ... reality ... I disagree. Try explaining "I'm adding exceptions to our code, rewriting 3 external libraries in the process" to your boss.
> This is a brilliant example of why you'd want exceptions. Look at what you're doing for error handling, manually every time.
You say that like safely handling exceptions is trivial. Exceptions are emphatically not "better in every way", they are a mixed bag. They offer many clear benefits (some that you have described here), but at the cost of making your code more difficult to reason about. You essentially end up with a lot of invisible goto's. Problems with exceptions tend to be much more subtle and hard to debug.
I'm not against them at all, and often I prefer them, but there are certainly downsides.
There are a lot of comparisons and branching going on when the program always checks return codes. Assuming zero-cost exceptions, there is only overhead in the failure case.
I also find it very disingenuous of the pro-exceptions post to claim that these mazes of ifs are easy to navigate. In his example that is sort of true. When you're using actual real data to make the comparison, it's easy to introduce very hard-to-trace bugs.
Once I had two things to check, one of them being time, and as you know that means 8 cases. You have to pick one to check first, and I picked the non-time-based check. That meant I suddenly wasn't checking all the cases anymore:
    if (currentTime() < topBound) {
        if (someOtherCondition) {
            if (currentTime() > lowerBound) {
                // at this point you of course do NOT know for sure that currentTime() < topBound. Whoops.
            }
        }
    }
(These look like they can be trivially merged. That's true if you look at just these lines; it becomes false if you look at the full set of conditions.)
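One way out of that trap, sketched here with the same hypothetical names as the fragment above: sample the clock once and compare the cached value, so every comparison sees the same instant.

    time_t now = currentTime();
    if (now < topBound && someOtherCondition && now > lowerBound) {
        /* here 'now' really is strictly between lowerBound and topBound */
    }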
I don't get the sense that there was any attempt to recover from errors - it sounds more like they were enforcing that error checking occurred, by replacing `malloc` with one that just returned `NULL` always. It sounds like the goal was to make sure that one didn't assume `malloc` would always succeed and just use the memory.
Indeed, recovery is basically futile in this case and your program is going to shut down pretty quickly either way. Maybe you'll get the chance to tell the user that you ran out of memory before you die, which seems polite.
In systems that overcommit memory (like Linux), malloc() can return non-NULL and then crash when you read or write that address because the system doesn't have enough real memory to back that virtual address.
Even on Linux set to overcommit, malloc() can still return NULL if you exhaust the virtual address space of your process, though I expect that's much less likely now on 64-bit platforms.
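A rough illustration of the difference, assuming Linux's default overcommit heuristics (and not something to run on a machine you care about): the small mallocs below can keep "succeeding" long after RAM plus swap are spoken for, because no pages are backed until they are written. The OOM kill, if it comes, happens in the memset, not in malloc; with strict accounting, malloc eventually fails instead.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const size_t block = 1 << 20;              /* 1 MiB per allocation */

        for (int i = 0; i < (1 << 20); i++) {      /* up to ~1 TiB requested */
            char *p = malloc(block);
            if (p == NULL) {                       /* strict accounting ends up here */
                printf("malloc failed after about %d MiB\n", i);
                return 0;
            }
            memset(p, 0xAA, block);                /* faulting pages in is the risky part */
        }
        return 0;
    }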
Yes, a better option is to make sure this error cannot happen, by making sure the program has enough memory to begin with. Fly-by-wire shouldn't need unbounded memory allocations at runtime.
There are some applications where you can try to recover by freeing something that isn't critical, or by waiting and trying again. Or you can gracefully fail whatever computation is going on right now, without aborting the entire program. But these are last resort things and will not always save you. If your fly-by-wire ever depends on such a last resort, it's broken by design :-)
From what I understand, in those sort of absolutely critical applications the standard is to design software that fails hard, fast, and safe. You don't want your fly-by-wire computer operating in an abnormal state for any amount of time, you want that system to click off and you want the other backup systems to come online immediately.
The computer in the Space Shuttle was actually 5 computers, 4 of them running in lockstep and able to vote out malfunctioning systems. The fifth ran an independent implementation of much of the same functionality. If there was a software fault with the 4 main computers, they wanted everything to fail as fast as possible so that they could switch to the 5th system.
Tangent: I was thinking about Toyota's software process failure and how they _invented_ industrial-level mistake-proofing yet did not apply it to their engine throttle code.
C is obviously the wrong language, but from a software perspective they should have at least tested the engine controller from an adversarial standpoint (salt water on the board, stuck sensors). That is the crappy thing about Harvard architecture CPUs (separate instruction and data memory): you can have while loops that NEVER crash controlling a machine that continues to wreak havoc. Sometimes you want a hard reset and a fast recovery.
I wasn't trying to nitpick. Correcting the example: yes, recovering from a malloc failure _could_ be a worthy goal, but on Linux, by the time your app is getting signaled about malloc failures, the OOM killer is already playing little bunny foo foo with your processes.
If your app can operate under different allocation regimes, then there should be side channels for dynamically adjusting memory usage at runtime. On Linux, a failed malloc is not that signal, and since _so many_ libraries and language-runtime semantics allocate memory, making sure you allocate no memory in your bad-malloc path is very difficult.
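One mitigation people sometimes reach for, sketched here with made-up names (and it only helps if the failure path itself really stays allocation-free): reserve a block at startup, and release it when allocation fails so the error path has some headroom.

    #include <stdio.h>
    #include <stdlib.h>

    static void *emergency_reserve;

    void init_emergency_reserve(void)
    {
        emergency_reserve = malloc(256 * 1024);    /* grabbed while memory is plentiful */
    }

    void handle_allocation_failure(void)
    {
        free(emergency_reserve);                   /* give the error path some room */
        emergency_reserve = NULL;
        fputs("out of memory: shedding load\n", stderr);
        /* ... cancel the failed operation, drop connections, or shut down cleanly ... */
    }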
Like eliteraspberrie said, the proper way to recover from an error is to unroll your stack back to your main function and return 1 there.
Error checking was enforced for EVERY syscall, be it malloc() or open(). Checking for errors was indeed required but not enough: proper and graceful shut down was required too.
Go ahead. Call exit(). On your HTTP/IRC/anything server. Just because you couldn't allocate memory for one more client. Now your service is down and the CTO is looking for blood =)
Yes, it's far-fetched, and as some said further down in the comments, you "can't" run out of memory on Linux, but outright killing a service is never good.
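A minimal sketch of the alternative, with a hypothetical per-client struct and handler: refuse the one client you can't afford and keep serving the rest, instead of exit()ing the whole service.

    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/socket.h>

    struct client { int fd; char *buf; };          /* hypothetical per-client state */
    void handle_client(struct client *c);          /* hypothetical: hand off to event loop */

    void accept_loop(int listen_fd)
    {
        for (;;) {
            int fd = accept(listen_fd, NULL, NULL);
            if (fd < 0)
                continue;

            struct client *c = malloc(sizeof *c);
            if (c == NULL) {
                close(fd);                         /* sorry, just this one client */
                continue;
            }
            c->fd = fd;
            c->buf = NULL;
            handle_client(c);
        }
    }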
If your server's running Linux, it's going to kill your process with no questions asked if you run out of memory. You're better off practicing crash-only error recovery and having robust clients that can handle a reconnect.
HTTP is stateless already, so crash and restart all you want!
The OOM killer is more likely to kill some other process and trash your server.
Thankfully that sort of behavior has been vastly reduced since the thing was introduced, but disabling overcommit for high-reliability applications is still a reasonable course of action.
The OOM killer might eventually kill something, after it thrashes the system for a few hours.
I had a server last week on which swap hadn't been configured. A compilation job took all the memory and the system started thrashing. Thankfully there was still one SSH session open, but I couldn't kill anything, sync, or shut down; fork failed with insufficient memory.
Left it thrashing overnight and had to power-kill it the next day.
I'd say letting that server bounce and having the watchdog/load balancer work to keep you at capacity is the best option there. You are going to need that infrastructure anyway and if you can't malloc enough for one more client, who is to say that the existing stuff is going to be able to continue either?
You should count on any particular instance bouncing when that happens, and design your system to survive that. You should also invest some effort to figure out why your system can get into that state. Consider if any particular instance should be attempting to accept more jobs than it has the ability to handle. I shouldn't be able to trigger an out of memory situation on your IRC server just by connecting as many irc clients as I can.