memory management design

IoTivity is supposed to work for indefinite (long) periods of time in constrained computing environments. I believe IoTivity 1.0.0 will often fail this design goal because of serious shortcomings in its memory management architecture.

I estimate that active IoTivity servers with constrained memory will almost never run as long as a year, and may often fail within weeks. Moreover, the failures will be invisible, so their cause will be opaque. These mystery failures will be resolved by restarting the server, and similar failures will recur on a similar time scale.

The memory failures result from these design deficiencies:

  • memory leaks
  • unhandled memory allocation failures
  • badly handled memory allocation failures
  • memory fragmentation

I have examined this issue extensively in the IoTivity code base. Moreover, I have forked IoTivity as part of my day job to provide an OIC server that overcomes these issues so it can run in *highly* constrained computing environments. Let's consider the deficiencies in order, then consider the solutions.

memory leaks

Last I looked, the Jenkins build process will still +1 a patch even if Valgrind finds a limited number of memory leaks (five?). It may have been necessary to tolerate a handful of memory leaks in order to meet our aggressive schedule for the initial IoTivity release, but now that we have released 1.0.0, eliminating leaks should be a high priority in software expected to run in constrained environments.

Keep in mind that there may be more memory leaks than Valgrind can find, since Valgrind can only report leaks on code paths that the test software actually exercises, and some of us know that the current test suites are far from comprehensive. For this reason, I believe it can be argued that we are better off applying our limited resources to broadening test coverage than to working on the leaks per se.

The leak problem is exacerbated by the profligate use of unconstrained mallocs (malloc, calloc, strdup, free, etc) in all areas of the code base. I found there are nearly 200 mallocs (of various sorts) in the stack (RI) and connectivity (CA) subtrees, which doesn't include the CoAP, DTLS, and security libraries. Testing all these will be a nightmare.

The most effective method of taming memory leaks is to first eliminate the need for most mallocs. This allows us to build comprehensive memory leak analyses, with the side benefit of reducing heap management services and memory copies, both of which are also problems in constrained environments. For instance, in my highly constrained OIC server, the RI and CA layers allocate only seven data buffers, most of which can be reused without involving heap management services, while providing nearly all the functionality of an IoTivity server.
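
To make the idea concrete, here is a minimal sketch of that kind of fixed-buffer scheme. The names, sizes, and counts are illustrative assumptions for this article, not taken from IoTivity or from my fork:

  /* Illustrative sketch only: a fixed pool of message buffers that replaces
   * per-request malloc/free.  Names, sizes, and counts are assumptions for
   * the example, not taken from IoTivity or the fork described above. */
  #include <stdbool.h>
  #include <stddef.h>

  #define MSG_BUF_SIZE  1024   /* large enough for any single message */
  #define MSG_BUF_COUNT 4      /* worst-case number of in-flight messages */

  static unsigned char msg_buf[MSG_BUF_COUNT][MSG_BUF_SIZE];
  static bool          msg_buf_in_use[MSG_BUF_COUNT];

  /* Returns a free buffer, or NULL when all are in use.  A NULL here means
   * back-pressure (too many in-flight messages), not heap exhaustion. */
  void *msg_buf_acquire(void)
  {
      for (size_t i = 0; i < MSG_BUF_COUNT; i++)
      {
          if (!msg_buf_in_use[i])
          {
              msg_buf_in_use[i] = true;
              return msg_buf[i];
          }
      }
      return NULL;
  }

  /* Marks a buffer returned by msg_buf_acquire() as reusable. */
  void msg_buf_release(void *p)
  {
      for (size_t i = 0; i < MSG_BUF_COUNT; i++)
      {
          if (p == msg_buf[i])
          {
              msg_buf_in_use[i] = false;
              return;
          }
      }
  }

Because the pool is statically sized, running out of buffers shows up as deterministic back-pressure at the point of acquisition rather than as an unpredictable malloc failure deep inside the stack, and the buffers can be reused without touching heap management services at all.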

Only after drastically reducing the number of mallocs should we try to eliminate memory leaks. Ironically, the process of reducing the number of mallocs will likely eliminate most or all leaks. Also, expanding test coverage can be done in parallel with reducing the mallocs, and when both are completed, memory leaks can be contained relatively easily.

unhandled memory allocation failures

IoTivity recognizes a memory allocation failure (unable to allocate from heap) by testing the return value of a malloc call. As far as I know (having examined almost every line of IoTivity), there is a test for failure after every malloc call. The problem is that often there is nothing to do with that knowledge except avoid an immediate segmentation fault. For example, near the top of CAReceivedPacketCallback() in camessagehandler.c, a failure in CAParsePDU() results in a lost message. CAParsePDU() can fail for various reasons, one of which is failure to allocate the returned coap_pdu_t buffer. (In a DEBUG server a log message will be emitted, but a RELEASE server will be silent.)
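
The following is a paraphrase of that pattern, with simplified, hypothetical names and types standing in for the real camessagehandler.c code; it shows how a parse failure caused purely by heap exhaustion turns into a silently dropped message:

  /* Paraphrase of the pattern described above.  All names and signatures are
   * simplified stand-ins, not the actual IoTivity code. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  typedef struct { unsigned char *buf; size_t len; } pdu_t;

  /* Stand-in for CAParsePDU(): the returned PDU is heap-allocated, so the
   * parse can fail purely because malloc() fails. */
  static pdu_t *parse_pdu(const unsigned char *data, size_t len)
  {
      pdu_t *pdu = malloc(sizeof *pdu);
      if (!pdu)
      {
          return NULL;
      }
      pdu->buf = malloc(len);
      if (!pdu->buf)
      {
          free(pdu);
          return NULL;
      }
      memcpy(pdu->buf, data, len);
      pdu->len = len;
      return pdu;
  }

  /* Stand-in for CAReceivedPacketCallback(). */
  static void received_packet_callback(const unsigned char *data, size_t len)
  {
      pdu_t *pdu = parse_pdu(data, len);
      if (!pdu)
      {
  #ifndef NDEBUG
          fprintf(stderr, "parse_pdu failed\n");  /* DEBUG builds log this */
  #endif
          return;  /* RELEASE builds drop the request without a trace */
      }
      /* ... dispatch the message ... */
      free(pdu->buf);
      free(pdu);
  }

The NULL test keeps the server from crashing, but the only outcome it can produce is losing the request.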

There are many places where this sort of thing happens. Once it happens, it is likely to happen again, so the server will probably be silently useless until something causes it to be restarted. Even then, the circumstances that caused the memory allocation failure are likely to arise again.

One might argue that not testing for the allocation failure would likely cause a recognizable failure (SIGSEGV?), resulting in quicker recognition that a problem had occurred. I don't recommend that, but it should be clear that our current approach to memory management is not working.

badly handled memory allocation failures

What should IoTivity do if a malloc fails? The full answer lies in a comprehensive plan for memory usage, but any answer is very difficult. Ideally, the failure should be placed in front of a human (or computer monitor that reports to a human) as soon and as clearly as possible. This is a hard problem because reporting a memory failure almost always requires using more memory, and if subsequent failures occur, the reporting is doomed. I will talk more about this in the solutions.
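
One common mitigation, sketched here purely as an illustration (it is not necessarily the approach I describe in the solutions), is to reserve the entire reporting path at startup so that announcing an out-of-memory condition never needs another allocation:

  /* Illustration only: the report text and the output channel are reserved
   * before any failure can happen, so reporting needs no further memory. */
  #include <stdio.h>

  static const char oom_report[] = "ALERT: heap allocation failed\n";

  void report_out_of_memory(void)
  {
      /* stderr is already open and unbuffered, and the string is static, so
       * on typical C libraries this path allocates nothing.  A real
       * constrained device might instead toggle an LED or write to a socket
       * that was opened at startup. */
      fputs(oom_report, stderr);
  }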

What IoTivity does now is worse than useless. Instead of focusing on reporting, many memory allocation failures result in piecewise degradation of the server. AddObserver() in ocobserver.c illustrates the range of responses. If obsNode can't be allocated, this part of the request being processed is ignored by skipping the meat of the function. Then, if the various components of obsNode can't be allocated, the progress is unwound and the memory failure is reported to the caller (FormOCEntityHandlerRequest). There, the type of failure is lost by turning the error into a generic failure (OC_STACK_ERROR), and an attempt is made to report the failure to the client that made the request. Good luck reporting the error when any subsequent malloc needed to send that message may also fail.
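
The sketch below paraphrases that unwind-and-degrade pattern with simplified, hypothetical names and types; it is not the actual ocobserver.c code, but it shows how much machinery exists only to limp past an allocation failure:

  /* Paraphrase of the pattern described above; hypothetical names and types,
   * not the actual IoTivity code. */
  #include <stdlib.h>
  #include <string.h>

  typedef enum { STACK_OK, STACK_NO_MEMORY } stack_result_t;

  typedef struct observer
  {
      char            *address;
      unsigned char   *token;
      struct observer *next;
  } observer_t;

  static observer_t *observer_list;

  static stack_result_t add_observer(const char *address,
                                     const unsigned char *token,
                                     size_t token_len)
  {
      stack_result_t result = STACK_OK;
      observer_t *node = calloc(1, sizeof *node);
      if (node)   /* if this calloc fails, the meat of the function is simply
                     skipped and the observe request silently vanishes */
      {
          size_t alen   = strlen(address) + 1;
          node->address = malloc(alen);
          node->token   = malloc(token_len);
          if (node->address && node->token)
          {
              memcpy(node->address, address, alen);
              memcpy(node->token, token, token_len);
              node->next    = observer_list;
              observer_list = node;
          }
          else
          {
              /* Unwind the partial work and push the failure upward, where
               * the caller collapses it into a generic error and tries to
               * notify the client -- a path that itself needs more memory. */
              free(node->address);
              free(node->token);
              free(node);
              result = STACK_NO_MEMORY;
          }
      }
      return result;
  }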

Note that large amounts of code and complexity are spent trying to handle this memory failure, with no guarantee that any positive result will transpire. A naive response to this situation is to commit to cleaning up all these data paths and make sure that every memory allocation failure is reported, without changing the underlying logic. I claim this is a fool's errand, for two major reasons. First, this approach burdens all of the code with large amounts of exception code paths that slow development, increase maintenance costs, and make an IoTivity server harder to run in highly constrained computing environments. Second, this approach doesn't solve the real problem, which is that an OIC server should run forever in the face of constrained RAM.

memory fragmentation

Suppose we completely deal with the previous three deficiencies. We eliminate all the memory leaks. We reliably and adequately handle every memory allocation failure. Does that eliminate the issue under discussion? Of course not. First, properly handling memory allocation failures doesn't keep them from happening, and they are typically catastrophic to the server when they do. Second, memory leaks aren't the only reason an application can run out of memory.

I have spent many hours poring over heap allocation maps trying to understand why an application fails when there is enough free memory to support it. This generally happens after an application has run a long time and the heap has become fragmented, so that a larger allocation fails because the free space is all held in numerous smaller chunks. The C programming model provides no way to relocate live allocations, so free chunks separated by them can never be coalesced into larger ones, and the allocation fails.
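
The pattern is easy to reproduce. The sketch below interleaves long-lived small allocations with short-lived large ones; the sizes and counts are invented for the illustration, and on a desktop allocator with a huge heap the final request will still succeed, but with a simple allocator on a small fixed heap the same pattern leaves plenty of total free space with no single chunk large enough:

  /* Illustration of the fragmentation pattern described above.  Sizes and
   * counts are invented; error checks are omitted for brevity. */
  #include <stdio.h>
  #include <stdlib.h>

  #define SLOTS 64

  int main(void)
  {
      void *small[SLOTS];
      void *big[SLOTS];

      /* Interleave long-lived small allocations (observers, resources)
       * with short-lived large ones (per-request buffers). */
      for (int i = 0; i < SLOTS; i++)
      {
          small[i] = malloc(16);
          big[i]   = malloc(1024);
      }

      /* Free all the large buffers.  With a simple sequential allocator,
       * each freed 1 KB chunk is now bracketed by live 16-byte blocks, so
       * the ~64 KB of free space cannot coalesce into larger chunks. */
      for (int i = 0; i < SLOTS; i++)
      {
          free(big[i]);
      }

      /* On a constrained, fragmented heap this request fails even though
       * far more than 4 KB is free, because no single free chunk is 4 KB. */
      void *large = malloc(4 * 1024);
      printf("4 KB request %s\n", large ? "succeeded" : "failed");

      free(large);
      for (int i = 0; i < SLOTS; i++)
      {
          free(small[i]);
      }
      return 0;
  }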

Memory fragmentation represents a form of entropy. At the time of allocation failure, a heap allocation map can look as though some mischievous hand has planned to destroy the application by partitioning the heap into small chunks. But no hand is apparent. What happens is that random allocations and frees leave some ongoing allocations in places that keep freed allocations from coalescing. This process happens faster as the number of allocations and the variation in their sizes increase.

Of course, heap allocation algorithms try to minimize fragmentation. There are dozens (hundreds?) of such algorithms in the literature and in software libraries. Each of them minimizes fragmentation by making assumptions about the pattern of allocations they will face. Some work best if the application allocates only power-of-two size buffers, some work best if allocations are always freed in the reverse order of allocation. (There are many other assumptions.) In any case, IoTivity will generally face allocation algorithms that are compromises, having shown good behavior in a wide variety of circumstances.

IoTivity represents a worst case for memory fragmentation. I found nearly 200 mallocs in the RI and CA layers alone, allocating everything from large buffers (>1K) down to short strings (1 byte). I suspect most of them are exercised for each request processed, and I know the largest and smallest buffers are allocated for each request. Some buffers are allocated for longer term usage (observe), and the application can allocate long term storage at any time (resource). In a highly constrained memory environment, these characteristics doom most IoTivity instances to eventual memory allocation failure.

summary

My purpose in this article is to make the IoTivity community aware of the memory management issue. I hope understanding this issue results in the application of resources to deal with the deficiencies.

While I describe four deficiencies, fixing them is not a matter of coming up with a solution for each one. Robust memory management is the result of architecture, not design or coding. We need to significantly re-architect IoTivity in order to deal with the deficiencies.

In memory management design II I describe the ways IoTivity can change to eliminate these deficiencies.

John Light
Intel OTC OIC development
