January 4, 2013

Signal handler safety, re-entering malloc

This is a story from real-world development. From signal(7):


   Async-signal-safe functions
       A  signal  handler  function must be very careful,
       since processing elsewhere may be interrupted at some
       arbitrary point in the execution of the program.
       POSIX has the concept of "safe function".  If a signal
       interrupts the execution of an  unsafe  function,
       and handler calls an unsafe function, then the behavior
       of the program is undefined.

After that a list of safe functions follows, and one notable things is that malloc and free are async-signal-unsafe!

I hit this issue while enabling tcmalloc's debugallocation for Chromium Debug builds. We have a StackDumpSignalHandler for tests, which prints a stack trace on various crashing signals for easier debugging. It's very useful, and worked fine for a pretty long while (which means that "but it works!" is not a valid argument for doing unsafe things).

Now when I enabled debugallocation, I noticed hangs triggered by the stack trace display. In one example, this stack trace:

@0  0x00000000019c6c85 in tcmalloc::Abort () at third_party/tcmalloc/chromium/src/base/abort.cc:15
@1  0x00000000019b39c1 in LogPrintf (severity=-4,
    pat=0x32aeb18 "memory allocation/deallocation mismatch at %p: allocated with %s being deallocated with %s", ap=0x7fff52c379e8)
    at third_party/tcmalloc/chromium/src/base/logging.h:210
@2  0x00000000019b3a8b in RAW_LOG (lvl=-4,
    pat=0x32aeb18 "memory allocation/deallocation mismatch at %p: allocated with %s being deallocated with %s")
    at third_party/tcmalloc/chromium/src/base/logging.h:230
@3  0x00000000019c3fb1 in MallocBlock::CheckLocked (this=0x7fd18f143400, type=-21308287)
    at ./third_party/tcmalloc/chromium/src/debugallocation.cc:461
@4  0x00000000019c3c42 in MallocBlock::CheckAndClear (this=0x7fd18f143400, type=-21308287)
    at ./third_party/tcmalloc/chromium/src/debugallocation.cc:401
@5  0x00000000019c436a in MallocBlock::Deallocate (this=0x7fd18f143400, type=-21308287)
    at ./third_party/tcmalloc/chromium/src/debugallocation.cc:557
@6  0x00000000019c1929 in DebugDeallocate (ptr=0x7fd18f143420, type=-21308287)
    at ./third_party/tcmalloc/chromium/src/debugallocation.cc:998
@7  0x00000000028d1482 in tc_delete (p=0x7fd18f143420) at ./third_party/tcmalloc/chromium/src/debugallocation.cc:1232
@8  0x000000000097dc04 in cc::ResourceProvider::deleteResourceInternal (this=0x7fd191827da0, it=...) at cc/resource_provider.cc:242
@9  0x000000000097daaf in cc::ResourceProvider::deleteResource (this=0x7fd191827da0, id=1) at cc/resource_provider.cc:230
@10 0x00000000006f9824 in (anonymous namespace)::ResourceProviderTest_Basic_Test::TestBody (this=0x7fd18dc5abf0)
    at cc/resource_provider_unittest.cc:328
@11 0x00000000008ec801 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0x7fd18dc5abf0,
    method=&virtual testing::Test::TestBody(), location=0x29463ab "the test body") at testing/gtest/src/gtest.cc:2071
@12 0x00000000008e9665 in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x7fd18dc5abf0,
    method=&virtual testing::Test::TestBody(), location=0x29463ab "the test body") at testing/gtest/src/gtest.cc:2123
@13 0x00000000008dee0d in testing::Test::Run (this=0x7fd18dc5abf0) at testing/gtest/src/gtest.cc:2143
@14 0x00000000008df3ea in testing::TestInfo::Run (this=0x7fd191823020) at testing/gtest/src/gtest.cc:2319
@15 0x00000000008df8dc in testing::TestCase::Run (this=0x7fd19181f0d0) at testing/gtest/src/gtest.cc:2426
@16 0x00000000008e3eea in testing::internal::UnitTestImpl::RunAllTests (this=0x7fd19829dd60) at testing/gtest/src/gtest.cc:4249

generates SIGSEGV (tcmalloc::Abort). This is just debugallocation having stricter checks about usage of dynamically allocated memory. Now the StackDumpSignalHandler kicks in, and internally calls malloc. But we're already inside malloc code as you can see on the above stack trace (see frame @7, bold font), and re-entering it tries to take locks that are already held, resulting in a hang.

The fix required several changes:
  • no dynamic memory, and that includes std::string and std::vector, which use it internally
  • no buffered stdio or iostreams, they are not async-signal-safe (that includes fflush)
  • custom code for number-to-string conversion that doesn't need dynamically allocated memory (snprintf is not on the list of safe functions as of POSIX.1-2008; it seems to work on a glibc-2.15-based system, but as said before this is not a good assumption to make); in this code I've named it itoa_r, and it supports both base-10 and base-16 conversions, and also negative numbers for base-10
  • warming up backtrace(3): now this is really tricky, and backtrace(3) itself is not whitelisted for being safe; in fact, on the very first call it does some memory allocations; for now I've just added a call to backtrace() from a context that is safe and happens before the signal handler may be executed; implementing backtrace(3) in a known-safe way would be another fun thing to do
Note that for the above, I've also added a unit test that triggers the deadlock scenario. This will hopefully catch cases where calling backtrace(3) leads to trouble.

For more info, feel free to read the articles below: