
Chapter 6 - Practical Considerations

Pthreads Programming
Bradford Nichols, Dick Buttlar and Jacqueline Proulx Farrell
 Copyright © 1996 O'Reilly & Associates, Inc.

Debugging
Well, you probably have an independent streak and, against our earlier advice, will write some multithreaded programs with bugs. Debugging multithreaded programs will provide you with some interesting new challenges. First of all, you'll investigate types of programming errors that result from thread synchronization problems, namely deadlocks and race conditions. Second, once you've seen a problem (for instance, some data corruption or a hang), you'll discover you may have a hard time duplicating it. Because the alignment of events among threads that run concurrently is largely left up to chance, errors, once found, may be unrepeatable. Finally, because threads are a new technology, many vendors have yet to upgrade their debuggers to operate well on threaded programs. 
All of this is to say, quite simply, that you'll need your wits about you when debugging a multithreaded application! Sound like fun? Read on! 
Deadlock
When one or more threads are spinning or have stopped permanently, chances are that you've encountered a deadlock. You'll have a good idea that you've run into a deadlock, because your program...will...just..., er, stop. 
The most common reason for a deadlock, and the easiest to solve, is forgetting to unlock a mutex. Deadlocks can also result from problems in the order in which threads obtain locks. You may need to perform a bit of detective work to resolve these. The rule is that all threads in your program must always pursue locks in the same order. If, en route to obtaining Lock B, a thread must first obtain Lock A, then no thread should try to obtain Lock B without first obtaining Lock A. 
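The ordering rule can be sketched as follows. This is a minimal illustration, not code from the ATM server: the names lock_a, lock_b, update_both, and run_lock_order_demo are all hypothetical. Every thread that needs both locks takes lock_a before lock_b, so no thread can ever hold lock_b while waiting for lock_a.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical locks and shared data, for illustration only. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static long shared_a = 0;
static long shared_b = 0;

/* Every thread that needs both locks takes lock_a first, then lock_b. */
static void *update_both(void *arg)
{
  int i;

  (void)arg;
  for (i = 0; i < 10000; i++) {
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    shared_a++;
    shared_b++;
    pthread_mutex_unlock(&lock_b);   /* release in reverse order */
    pthread_mutex_unlock(&lock_a);
  }
  return NULL;
}

/* Runs two threads over the same lock pair. Returns the final count,
 * or -1 if the two counters disagree. */
long run_lock_order_demo(void)
{
  pthread_t t1, t2;
  int rc;

  shared_a = shared_b = 0;
  rc = pthread_create(&t1, NULL, update_both, NULL);
  assert(rc == 0);
  rc = pthread_create(&t2, NULL, update_both, NULL);
  assert(rc == 0);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  return (shared_a == shared_b) ? shared_a : -1;
}
```

If one of the threads instead locked lock_b first, each thread could end up holding one lock while waiting for the other, and the program could hang on some runs but not others.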
You may also encounter another form of deadlock. A thread may suspend itself to wait on a condition variable that is never signaled by any other thread, thus falling into some sort of deep, undisturbed sleep. If you see a deadlock of this sort, you'd do well to look for an inconsistency in the way in which your threads interpret the condition. For instance, does one thread sleep on count == 0 and another thread signal the condition when count < 0? A condition is usually signaled when a variable reaches a certain value. If the variable can never reach that value, you can anticipate trouble. If you expect a flag to be set, double check to ensure that it actually is; if you expect a counter to reach zero, make sure that it actually does. 
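Here is a small sketch of a waiter and its signalers agreeing on the condition. The names count_mutex, count_zero, worker, and run_countdown_demo are hypothetical. The waiter sleeps while count != 0, and each worker signals only when its own decrement makes count == 0, so both sides test the same value under the same mutex.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 16   /* hypothetical limit for this sketch */

static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  count_zero  = PTHREAD_COND_INITIALIZER;
static int count = 0;

/* Decrement count; signal exactly when it reaches the awaited value. */
static void *worker(void *arg)
{
  (void)arg;
  pthread_mutex_lock(&count_mutex);
  count--;
  if (count == 0)                    /* same test the waiter uses */
    pthread_cond_signal(&count_zero);
  pthread_mutex_unlock(&count_mutex);
  return NULL;
}

/* Starts nworkers threads and waits until count reaches zero.
 * Returns the value of count seen by the waiter when it wakes. */
int run_countdown_demo(int nworkers)
{
  pthread_t tids[MAX_WORKERS];
  int i, rc, final;

  assert(nworkers > 0 && nworkers <= MAX_WORKERS);
  count = nworkers;                  /* set before any worker exists */
  for (i = 0; i < nworkers; i++) {
    rc = pthread_create(&tids[i], NULL, worker, NULL);
    assert(rc == 0);
  }

  pthread_mutex_lock(&count_mutex);
  while (count != 0)                 /* recheck after every wakeup */
    pthread_cond_wait(&count_zero, &count_mutex);
  final = count;
  pthread_mutex_unlock(&count_mutex);

  for (i = 0; i < nworkers; i++)
    pthread_join(tids[i], NULL);
  return final;
}
```

Had the workers signaled on count < 0 instead, the signal would never be sent and the waiter would sleep forever.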
Fortunately, the Pthreads library knows all about the mutexes and condition variables in use. If you are armed with a good thread-knowledgeable debugger, you can list which thread is waiting for which mutex or condition variable and make great strides toward pinpointing the culprits of the deadlock. Even without such a debugger, you can periodically tap into the Pthreads library's statistics by adding simple wrapper routines around the Pthreads calls in your program. 
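One way to build such wrapper routines is sketched below. Note the assumption: rather than reading anything out of the Pthreads library itself, these wrappers keep their own bookkeeping alongside each mutex. The traced_mutex_t type and the traced_mutex_* routines are hypothetical names, not part of Pthreads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical wrapper type: a mutex plus bookkeeping we maintain ourselves. */
typedef struct {
  pthread_mutex_t mutex;
  const char     *name;
  int             held;        /* 1 while some thread holds the mutex */
  pthread_t       owner;       /* meaningful only while held is 1 */
  unsigned long   lock_count;  /* total successful locks */
} traced_mutex_t;

int traced_mutex_init(traced_mutex_t *tm, const char *name)
{
  assert(tm != NULL && name != NULL);
  tm->name = name;
  tm->held = 0;
  tm->lock_count = 0;
  return pthread_mutex_init(&tm->mutex, NULL);
}

int traced_mutex_lock(traced_mutex_t *tm)
{
  int rc = pthread_mutex_lock(&tm->mutex);

  if (rc == 0) {               /* updated while we hold the mutex */
    tm->held = 1;
    tm->owner = pthread_self();
    tm->lock_count++;
  }
  return rc;
}

int traced_mutex_unlock(traced_mutex_t *tm)
{
  tm->held = 0;                /* cleared before the mutex is released */
  return pthread_mutex_unlock(&tm->mutex);
}

/* Unsynchronized snapshot, meant for inspecting a hung program;
 * the values it prints may be slightly stale. */
void traced_mutex_report(const traced_mutex_t *tm, FILE *out)
{
  fprintf(out, "%s: %s, locked %lu times\n",
          tm->name, tm->held ? "LOCKED" : "free", tm->lock_count);
}
```

Calling traced_mutex_report for each mutex from a signal-driven or periodic diagnostic routine would show which mutexes are still held when the program stops making progress.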
Race Conditions
A friend once told us the story of the time she and her husband set out to purchase a new car. On her way home from work one Friday, she stopped at a Dodge dealer in New Hampshire, saw the ideal minivan, and arranged a trade-in. On his way home from work the same day, her husband stopped at a Dodge dealer in Massachusetts, found the perfect vehicle, and also arranged a trade-in. We were at first surprised that they took a multithreaded approach to buying an automobile (where we would have chosen a more traditional monolithic approach) but that evening began to think about their predicament—proud owners of three Dodge Caravans. Then again, it might be just two Caravans—they did in fact trade one in. But they actually traded in the same minivan twice—to two different dealers in two different states. The pettiest of crimes (inadvertent fraud would surely qualify) becomes grossly magnified when state lines are crossed; maybe they would somehow wind up with no Caravans if the dealers claimed a breach of contract. At length, they did work things out (probably by using a pthread_join or something) and ended up with a single minivan—a brand new one!
In our imagination, the behavior of our threaded friends may have had several possible outcomes, some more inconvenient than others. At the end, they "got the right answer." We wonder if they would get the right answer each time they set out to buy a new car in this way.
This is a type of race condition. A race condition occurs when multiple threads share data and at least one of the threads accesses the data without going through a defined synchronization mechanism. (You'd think our friends would have used a mutex or, better, a telephone call, to synchronize their car hunting.) As a result, a thread that reads the data at the same time as the first thread may get a corrupted value—or not, depending upon the timing between the two threads. 
A race condition may be difficult to detect. It may lie around in your code like an accident waiting to happen. It may not surface consistently; it might occur once in every hundred (or thousand) executions. Even if it does arise, you may miss it if you're not looking very closely at your program's output. If you're lucky, a race condition will make a bad memory reference, cause a fatal signal, and crash your program. At least then you can begin the process of isolating the problem and identifying its cause. 
Unlike deadlocks, race conditions involve resources (such as files, buffers, and counters) that aren't managed by the Pthreads library. Often, a race condition involves a resource that you didn't realize was shared among your threads. For example, perhaps two threads called a nonreentrant routine from a system library, or executed some initialization code that you intended to be run only once. A subtler problem arises if a thread passes a pointer to its stack data as a parameter in a pthread_create call. Even though a thread's automatic data is supposed to be private, nothing prevents another thread from accessing it if you pass it its address! In other cases, you may be aware that a particular resource is shared by multiple threads, but you didn't get its synchronization right. For example, a thread might reference shared data after it has yielded the mutex that protects the data. 
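The stack-pointer hazard is easy to trip over in a creation loop. The sketch below shows one way to avoid it, using hypothetical names (worker, results, run_private_args_demo): each thread gets the address of its own slot, and the creator does not return until every thread that can see its stack has been joined.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_WORKERS 4

static int results[NUM_WORKERS];

static void *worker(void *arg)
{
  int id = *(int *)arg;        /* read our own private slot */

  results[id] = id * 10;
  return NULL;
}

/* Returns the sum of the values the workers stored. */
int run_private_args_demo(void)
{
  pthread_t tids[NUM_WORKERS];
  int ids[NUM_WORKERS];        /* one slot per thread, never reused */
  int i, rc, sum = 0;

  for (i = 0; i < NUM_WORKERS; i++) {
    ids[i] = i;
    /* Passing &i here instead would hand every thread the address of
     * a variable that this loop keeps changing. */
    rc = pthread_create(&tids[i], NULL, worker, &ids[i]);
    assert(rc == 0);
  }

  /* ids[] lives in this stack frame, so join before returning. */
  for (i = 0; i < NUM_WORKERS; i++)
    pthread_join(tids[i], NULL);

  for (i = 0; i < NUM_WORKERS; i++)
    sum += results[i];
  return sum;
}
```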
Event Ordering
Because problems like deadlocks and race conditions can be intermittent, rearing up only once every hundred or so program runs, debugging a multithreaded program requires keener detective skills and more patience than you'd bring to a more traditional debugging session.
The ordering of the events performed collectively by a program's threads at run time becomes supremely important in debugging a multithreaded program. Unsynchronized access to shared data often works if events on that data don't collide. For instance, if Thread A performs an unsynchronized access of a resource it shares with Thread B before Thread B accesses the resource, there's no chance for a race condition to develop. However, if Thread B happens to access the resource while Thread A is still busy with it, a race condition will result. Now, the race condition may not cause an error every time it occurs. Sometimes your threads may, almost accidentally, come out of the race condition with the right answers.
To make matters worse, various things in your program's run-time environment, unrelated to the program itself, can impact the ordering of the program's events. Introducing a debugger, for instance, can cause the events to occur in an order that's different from the sequence they'd follow when the program is run in a production environment. Similarly, you may discover bugs as you move your program from one platform to another. The new platform's scheduling policy, performance, and system load could be different enough so that some tasks complete faster than others, thus disrupting the usual ordering of events that had up to that point concealed the bug.
Less Is Better
Remember, the roots of your program's race conditions and deadlocks are in its threads' use of shared data. If your threads share little data, you'll have little opportunity to create bugs that cause either of these problems. Moreover, because there is less synchronization overhead, your threads will run faster. The reduction in a program's complexity, as well as potential improvements in its performance, may make it worth your while to look at ways of localizing data access to specific threads and minimizing the program's overall synchronization needs.
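As a rough illustration of localizing data, the sketch below has each thread accumulate into a private variable and touch the shared total only once. All names here (total, total_mutex, sum_slice, run_local_sum_demo) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_WORKERS      4
#define ITEMS_PER_WORKER 1000

static pthread_mutex_t total_mutex = PTHREAD_MUTEX_INITIALIZER;
static long total = 0;

/* Sums the integers in [first, first + ITEMS_PER_WORKER). */
static void *sum_slice(void *arg)
{
  long first = *(long *)arg;
  long local = 0;              /* private to this thread: no lock needed */
  long i;

  for (i = first; i < first + ITEMS_PER_WORKER; i++)
    local += i;

  pthread_mutex_lock(&total_mutex);   /* one lock per thread, not per item */
  total += local;
  pthread_mutex_unlock(&total_mutex);
  return NULL;
}

/* Returns the combined sum of 0 .. NUM_WORKERS*ITEMS_PER_WORKER - 1. */
long run_local_sum_demo(void)
{
  pthread_t tids[NUM_WORKERS];
  long starts[NUM_WORKERS];
  int i, rc;

  total = 0;
  for (i = 0; i < NUM_WORKERS; i++) {
    starts[i] = (long)i * ITEMS_PER_WORKER;
    rc = pthread_create(&tids[i], NULL, sum_slice, &starts[i]);
    assert(rc == 0);
  }
  for (i = 0; i < NUM_WORKERS; i++)
    pthread_join(tids[i], NULL);
  return total;
}
```

Locking around every addition would give the same answer, but with four thousand lock operations instead of four and far more opportunity to get one of them wrong.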
Trace Statements
Regardless of the capabilities of your debugger, you can insert trace statements in your code to monitor your program's activities. A trace statement usually takes the form of a printf or a write to a log file. 
If your debugger does not have built-in thread support, trace statements may be your only means of monitoring what your threads are doing at the time of a deadlock. 
It's handy to define trace statements as macros that can be conditionally compiled based on the definition of a DEBUG symbol, as shown in Example 6-1. 
Example 6-1: Trace Statement (trace.c)
#if DEBUG
#define DPRINTF(x)    printf x
#else
#define DPRINTF(x)
#endif
In our definition of DPRINTF, we allow a variable-sized argument list, as long as you surround the list with double parentheses: 
DPRINTF(("module com: start. count=%d, size=%d\n", count, size));
Where should you place trace statements? Trace statements are most useful when they are inserted at those places where deadlocks and race conditions usually occur. For the best payback, place them before and after each call to these functions: pthread_mutex_lock, pthread_mutex_unlock, pthread_cond_wait, and pthread_cond_signal. You can also use a trace statement at other points to track the ongoing status of your program: for instance, which modules have been executed and what the values of counters and key variables are. 
What sort of information should a trace statement print? You should include the name of the current routine at the very least, plus other information that is useful in that module's context. If the routine is called by only one thread, the routine name may suffice to identify the message. However, if the routine can be called by multiple threads, you must also include some sort of thread identification, particularly if you're logging to the monitor or a common log file. You may not be able to do this as neatly as you'd like. The Pthreads library doesn't provide thread IDs, only thread handles. You could, however, pull something together that works fairly well. You could print out the thread handle address, which does uniquely identify each thread, and cope with a bit of awkward reading in the trace output. Better yet, you could assign a meaningful string to each thread handle address, storing them in keys or in a global array of pointers to char.
There is one last problem to solve here. You may remember that the thread handle is returned in an output argument to the caller of pthread_create. This means that the created thread doesn't know the address in which its creator stored its handle. You'll need to provide some way for a thread to obtain this address so that it can include it in its trace messages. One approach might be to have each creating thread store the handles of the threads it creates in a global table. A thread that needs to find out the address of its own thread handle calls pthread_self to obtain a copy of its handle. It then searches this table, comparing each entry against its own handle with pthread_equal, to determine its unique handle address. 
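The global-table approach might look something like this. It is only a sketch: the table layout and the names register_thread and my_thread_name are hypothetical, and the table pairs each handle with a label rather than returning the handle's address.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 32   /* hypothetical table size */

static pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_t   handles[MAX_THREADS];
static const char *names[MAX_THREADS];
static int         num_threads = 0;

/* Called by the creator once pthread_create has returned the handle.
 * Returns 0 on success, -1 if the table is full. */
int register_thread(pthread_t handle, const char *name)
{
  int rc = -1;

  assert(name != NULL);
  pthread_mutex_lock(&table_mutex);
  if (num_threads < MAX_THREADS) {
    handles[num_threads] = handle;
    names[num_threads] = name;
    num_threads++;
    rc = 0;
  }
  pthread_mutex_unlock(&table_mutex);
  return rc;
}

/* Called by any thread to find the label registered for its own handle. */
const char *my_thread_name(void)
{
  pthread_t self = pthread_self();
  const char *name = "<unregistered>";
  int i;

  pthread_mutex_lock(&table_mutex);
  for (i = 0; i < num_threads; i++) {
    if (pthread_equal(handles[i], self)) {   /* handles are opaque */
      name = names[i];
      break;
    }
  }
  pthread_mutex_unlock(&table_mutex);
  return name;
}
```

A newly created thread may start running before its creator has registered it; until then, its trace messages would carry the "<unregistered>" label.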
Beware of synchronization issues when using trace statements, particularly if they write information to a common log file. If threads don't synchronize their writes, trace messages in the log file may be garbled or out of order. Moreover, if they do synchronize their writes by locking a mutex on the file, their execution will become linked at each trace, possibly masking race conditions that could occur during normal program execution. Furthermore, if you deploy the application with logging enabled, its performance will be abysmal! 
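If you do decide to serialize your trace output, one way to keep the shared lock short is to format each message privately and hold the mutex only while it is written. The trace routine and trace_mutex below are hypothetical names, and the 256-byte line limit is an arbitrary choice for this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

static pthread_mutex_t trace_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Formats the message into a private buffer, then holds the lock only
 * while the finished line is written. Returns the number of message
 * characters written (not counting the newline), or -1 on a
 * formatting error. */
int trace(FILE *log, const char *fmt, ...)
{
  char line[256];
  va_list ap;
  int n;

  assert(log != NULL && fmt != NULL);
  va_start(ap, fmt);
  n = vsnprintf(line, sizeof line, fmt, ap);
  va_end(ap);
  if (n < 0)
    return -1;
  if (n >= (int)sizeof line)
    n = (int)sizeof line - 1;        /* message was truncated */

  pthread_mutex_lock(&trace_mutex);
  fputs(line, log);                  /* message and newline stay together */
  fputc('\n', log);
  fflush(log);
  pthread_mutex_unlock(&trace_mutex);
  return n;
}
```

A call such as trace(stderr, "deposit: id=%d", id) keeps each line intact. It does not remove the masking effect described above: the threads still meet at trace_mutex on every message.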
Debugger Support for Threads
Not surprisingly, the Pthreads standard does not address debugging support for threads. Consequently, any thread-debugging capability you find in a debugger will be vendor-specific. Nevertheless, a good system will extend its standard system debugger to help thread programmers. 
In some cases a system's debugger will not work, or will become hopelessly confused, when it's used with a multithreaded program. Some of the issues for a debugger are formidable. When we set a breakpoint somewhere in a program, does it cause just the thread that hits it to stop, or all threads of the process? It should probably stop all threads. When we step through code, which thread runs? 
As an example, the ladebug debugger on Digital UNIX has features to support the debugging of multithreaded programs. The ladebug debugger has built-in features for identifying individual threads within a process and printing out their state. For instance, the where command allows you to specify which threads' call stacks you want to examine. The thread command allows you to set a particular thread as being a "current" thread, to which subsequent commands will apply. If we were to use ladebug to debug our ATM server, the session might look like this: 
% ladebug atm_svr
Welcome to the Ladebug Debugger Version 4.0-19
------------------
object file name: atm_svr
Reading symbolic information ...done
(ladebug) stop in main
[#1: stop in int main(int, char**) ]
(ladebug) run
Here, we didn't specify that any particular thread take the breakpoint. Consequently, when we run the program, the entire process is stopped when any thread reaches main.
[1] stopped at [main:127 0x1200022bc]
    127   atm_server_init(argc, argv);
(ladebug) show thread
Thread State      Substate        Policy     Priority Name
------ ---------- --------------- ---------- -------- -------------
>      1 running                    throughput 11       default thread
      -1 blocked    kernel          fifo       32       manager thread
      -2 ready                      idle        0       null thread for VP 0x0
(ladebug) where
>0  0x1200022bc in main(argc=1, argv=0x11ffff308) atm_svr.c:127
(ladebug) p $curthread
1
The show thread command tells us that three threads are running. Surprise! The Pthreads library is itself a threaded program and creates its own daemon threads. (Digital's implementation uses negative numbers to identify threads that are put there by the system.) Thread 1 is the only thread that is created by our application. This makes sense because we've only just gotten into main! The where command confirms that we're in the first line of main.
Next, we ask the debugger to stop in process_request. This breakpoint will apply to all threads—including those we're about to create. And so we continue: 
(ladebug) stop in process_request
[#2: stop in void* process_request(void*) ]
(ladebug) c
After a client issues a request, the server program hits the new breakpoint:
[2] stopped at [process_request:210 0x120002518]
    210   workorder_t *workorderp = (workorder_t *)input_orderp;
(ladebug) show thread
Thread State      Substate        Policy     Priority Name
------ ---------- --------------- ---------- -------- ------------
    1 blocked    kernel          throughput  11      default thread
   -1 blocked    kernel          fifo        32      manager thread
   -2 ready                      idle         0      null thread for VP 0x0
>   2 running                    throughput  11      <anonymous>
(ladebug) where
>0  0x120002518 in process_request(input_orderp=0x140011000) atm_svr.c:210
#1  0x3ff80823e94 in thdBase(0x0, 0x0, 0x0, 0x1, 0x45586732, 0x3)
DebugInformationStrippedFromFile101:???
Now, we can see the new thread our server just created to process the incoming request. The > in the output of show thread tells us that this is our current thread. When we subsequently issue the where command, this thread's start function, process_request, appears on the stack above the thread "base." 
You don't need to change the current thread in ladebug just to look at a thread's stack, but, just for illustration purposes, that's what we'll do here: 
(ladebug) thread 1
Thread State      Substate        Policy     Priority Name
------ ---------- --------------- ---------- ------- -------------
     1 blocked    kernel          throughput 11      default thread
(ladebug) p $curthread
1
(ladebug) where
>0  0x3ff82050f28 in /usr/shlib/libc.so
#1  0x120003a38 in server_comm_get_request(conn=0x140011100,
                               req_buf=0x140011104="") atm_com_svr.c:187
#2  0x120002308 in main(argc=1, argv=0x11ffff308) atm_svr.c:135
We use the thread command to change the current thread, and then we show its stack with the where command. The main thread is hanging out in a Standard C library (libc) routine (select, to be exact) in server_comm_get_request.
As long as we don't send it another request, Thread 1 isn't going to do much. Let's step through some of the processing of the request in Thread 2. Here we'll step to the beginning of the open_account procedure: 
(ladebug) thread 2
Thread State      Substate        Policy     Priority Name
------ ---------- --------------- ---------- -------- -------------
     2 running                    throughput 11       <anonymous>
(ladebug) s
stopped at [process_request:216 0x12000251c]
    216   sscanf(workorderp->req_buf, "%d", &trans_id);
(ladebug) s
stopped at [process_request:220 0x12000253c]
    220   switch(trans_id) {
(ladebug) s
stopped at [process_request:223 0x1200025dc]
    223   open_account(resp_buf);
(ladebug) s
stopped at [open_account:327 0x120002a20]
    327 void open_account(char *resp_buf)
(ladebug) c
.
.
.
Process has exited with status 0
(ladebug) quit
%
Digital UNIX has integrated many Pthreads features into its ladebug debugger. The ladebug debugger allows you to access even more detailed information by using the pthread command. The pthread command allows you to issue a subclass of thread-display commands that can show you the detailed states of mutexes, condition variables, and threads, plus various other types of information that can help you debug a threaded application. For example, you'd use the pthread command to see threads' cancellation states and types, which threads have which signals blocked, or what the last exception a thread handled was. 
The pthread help command shows us a full listing of available commands. 
Example: Debugging the ATM Server
Let's pretend we made some mistakes when writing our ATM server, and we've encountered deadlocks during some of our test runs. In this section, we'll illustrate how we'd investigate the problem using a thread-smart debugger. We'll use the Digital UNIX ladebug debugger just because it has good thread support. Reading this section will help you learn how to troubleshoot a deadlock or a race condition, even if you don't have this debugger. 
Debugging a deadlock caused by a missing unlock
A deadlock would occur if a worker thread's service routine failed to unlock the mutex after it modified an account, as shown in Example 6-2.
Example 6-2: A Broken Deposit Routine (atm_svr_broken.c)
void deposit(char *req_buf, char *resp_buf)
{
  int rtn;
  int temp, id, password, amount;
  account_t *accountp;
  /* Parse input string */
  sscanf(req_buf, "%d %d %d %d ", &temp, &id, &password, &amount);
  /* Check inputs */
  if ((id < 0) || (id >= MAX_NUM_ACCOUNTS)) {
    sprintf(resp_buf, "%d %s", TRANS_FAILURE, ERR_MSG_BAD_ACCOUNT);
    return;
  }
  pthread_mutex_lock(&global_data_mutex);
  /* Retrieve account from database */
  if ((rtn = retrieve_account( id, &accountp)) < 0) {
    sprintf(resp_buf, "%d %s", TRANS_FAILURE, atm_err_tbl[-rtn]);
  }
    .
    .
    .
    /* Finish processing deposit */
    /* pthread_mutex_unlock(&global_data_mutex); */
}
Consider the following series of transactions on our account database: 
 1. Read balance in account 3.
 2. Deposit $100 in account 3.
 3. Read balance in account 4.
 4. Deposit $25 in account 3.
Although the mutex unlock is missing, we can run the first three transactions without a problem. Because the read service routine's locking behavior is correct, its read of account 3 does not prevent the subsequent deposit to the same account. Remember too that each account has its own lock, so the read to account 4 does not reveal a problem. It's only when we again access account 3 that we stumble. 
The worker thread that handles our fourth transaction suspends in its pthread_mutex_lock call, waiting forever for the thread that performed the second transaction to unlock account 3. Because of the flaw in the deposit routine, this will never happen. Over time, the server will launch its maximum number of worker threads. Each will eventually be drawn into the black hole of account 3 (and any other account to which a previous thread has made a deposit). 
We could easily identify the problem by inspecting our sources, but let's use the strange behavior we've noticed in our server as a good reason to summon the debugger. 
% ladebug atm_svr_broken
Welcome to the Ladebug Debugger Version 4.0-19
------------------
object file name: atm_svr_broken
Reading symbolic information ...done
(ladebug)
First, we'll need to choose a useful breakpoint. This is often the most difficult part of troubleshooting. When in doubt, you should place breakpoints at the beginning and end of the thread start routine, if your program contains one. In the ATM server, this would be the process_request routine: 
(ladebug) stop in process_request
[#1: stop in void* process_request(void*) ]
(ladebug) stop at "atm_svr_broken.c":257
[#2: stop at "atm_svr_broken.c":257  ]
(ladebug) run
We'll get our debugging session moving by issuing some client requests. Our first request, a deposit, causes the debugger to stop the program at the breakpoint we placed at the beginning of process_request. Here, we'll take a look at the locked mutexes using the show mutex command: 
[1] stopped at [process_request:213 0x120002518]
    213   workorder_t *workorderp = (workorder_t *)input_orderp;
(ladebug) where
>0  0x120002518 in process_request(input_orderp=0x140011100) atm_svr_broken.c:213
#1  0x3ff80823e94 in thdBase(0x0, 0x0, 0x0, 0x1, 0x45586732, 0x3)
DebugInformationStrippedFromFile101:???
(ladebug) show mutex with state == locked
(ladebug)
The show mutex command shows that no mutex locks are being held by any thread at this point. Let's continue the program so that we reach the breakpoint at the end of process_request:
(ladebug) c
[2] stopped at [process_request:257 0x120002678]
    257   return(NULL);
(ladebug) show mutex with state == locked
Mutex 49 (normal) "mutex at 0x140001760" is locked
(ladebug)
Now we've hit the end of our process_request routine. This time, show mutex is telling us there's a mutex still locked. At this point, the error is evident. There are no other transactions in progress, so we know our thread has failed to unlock the mutex. 
If we disable the breakpoints and continue (or even if we step through the program), we find that subsequent commands to the same account hang. While one is hung, we can get the debugger's attention with CTRL-C, and see what's happening (see Example 6-3). 
Example 6-3: Watching Threads Hang in the ladebug Debugger
(ladebug) c
Thread received signal INT
stopped at [msg_receive_trap: ??? 0x3ff8100ea44]
(ladebug) show thread
Thread State      Substate        Policy     Priority Name
------ ---------- --------------- ---------- ------- -------------
     1 blocked    kernel          throughput 11      default thread
>   -1 blocked    kernel          fifo       32      manager thread
    -2 running                    idle        0      null thread for VP 0x0
     4 blocked    mutex wait      throughput 11       <anonymous>
(ladebug) where thread 1
Stack trace for thread 1
#0  0x3ff82050f28 in /usr/shlib/libc.so
#1  0x120003a08 in server_comm_get_request(conn=0x140011000,
                          req_buf=0x140011004="") atm_com_svr.c:187
#2  0x120002308 in main(argc=1, argv=0x140008030) atm_svr_broken.c:138
(ladebug) where thread 4
Stack trace for thread 4
#0  0x3ff8082bbf4 in /usr/shlib/libpthread.so
#1  0x3ff80829700 in hstTransferContext(0x1, 0x140005a78, 0x3ffc0439dc0, 0x4,
0x3ffc0438a00, 0x140011180) DebugInformationStrippedFromFile109:???
#2  0x3ff80813edc in dspDispatch(0x140009a10, 0x1400081a8, 0x140008030, 0x0,
0x140001760,
                          0x100000000) DebugInformationStrippedFromFile89:???
#3  0x3ff80817758 in pthread_mutex_block(0x1, 0x3ffc0433400, 0x3ffc0439dc0, 0x0,
0x140001760, 0x0) DebugInformationStrippedFromFile95:???
#4  0x3ff8082b9f0 in __pthread_mutex_lock(0x3ffc0433400, 0x3ffc0439dc0, 0x0,
0x140001760, 0x0, 0x120002bd4) DebugInformationStrippedFromFile111:???
#5  0x120002bd0 in deposit(req_buf=0x140011184="2 25 25 200",
resp_buf=0x140035a18="") atm_svr_broken.c:418
#6  0x1200025cc in process_request(input_orderp=0x140011180) atm_svr_broken.c:230
#7  0x3ff80823e94 in thdBase(0x0, 0x0, 0x0, 0x1, 0x45586732, 0x3)
DebugInformationStrippedFromFile101:???
(ladebug) quit
%
We see that there are two active application threads, one of which is the main thread. The where command tells us that the main thread (Thread 1) is in its normal hangout, waiting on select in server_comm_get_request. The where on Thread 4 shows us that it is our process_request thread (stack entry #6) and that it's waiting in the depths of pthread_mutex_lock (stack entry #4). It will stay there forever, because the thread that should have unlocked the mutex terminated some time ago! 
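Once the missing unlock has been found, the repair is to make sure that every path that takes the lock also releases it. The sketch below is a simplified stand-in, not the server's actual deposit routine: the balances array, the account_locks array, the value chosen for MAX_NUM_ACCOUNTS, and the routines accounts_init, deposit_amount, and read_balance are all hypothetical. It funnels every outcome after the lock through a single unlock.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_NUM_ACCOUNTS 8   /* hypothetical size for this sketch */

/* Hypothetical stand-ins for the server's account database and locks. */
static long balances[MAX_NUM_ACCOUNTS];
static pthread_mutex_t account_locks[MAX_NUM_ACCOUNTS];

void accounts_init(void)
{
  int i, rc;

  for (i = 0; i < MAX_NUM_ACCOUNTS; i++) {
    balances[i] = 0;
    rc = pthread_mutex_init(&account_locks[i], NULL);
    assert(rc == 0);
  }
}

/* Returns 0 on success, -1 on failure. Every path that takes the lock
 * passes through the single unlock below. */
int deposit_amount(int id, long amount)
{
  int status = 0;

  if (id < 0 || id >= MAX_NUM_ACCOUNTS)
    return -1;                 /* no lock taken yet, nothing to release */

  pthread_mutex_lock(&account_locks[id]);
  if (amount <= 0)
    status = -1;               /* failure still falls through to unlock */
  else
    balances[id] += amount;
  pthread_mutex_unlock(&account_locks[id]);
  return status;
}

long read_balance(int id)
{
  long value;

  assert(id >= 0 && id < MAX_NUM_ACCOUNTS);
  pthread_mutex_lock(&account_locks[id]);
  value = balances[id];
  pthread_mutex_unlock(&account_locks[id]);
  return value;
}
```

With this structure, the four-transaction sequence that hung the broken server (read 3, deposit to 3, read 4, deposit to 3) runs to completion.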
Debugging a race condition caused by a missing lock
In the debugging session in Example 6-3, we looked at the results of a forgotten pthread_mutex_unlock call: the missing unlock in Example 6-2 caused a deadlock. Our efforts to debug it were fairly straightforward. We placed breakpoints at the beginning and end of the thread-starting routine and examined the state of the mutexes at each. What if we had forgotten a pthread_mutex_lock call in one of our threads? What would be the symptoms of this problem, and how would we proceed to debug it? 
Our ATM server starts getting into trouble as its clients issue more and more requests for the same account. The more worker threads that are accessing this account at the same time, the more likely our server is to encounter a race condition on the account's data. More likely than not, we would discover such race conditions by running the server under a suitable test suite that simulates a heavy client load. It would be unfortunate if we had to wait for a race condition to surface from the disastrous effects our server might have on our customers' real-world data. Our tests would know what results we expect from all our threads combined and be able to compare the final state of account data against their expectations. 
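A test of this kind can be quite small. The sketch below is a hypothetical miniature, not part of the ATM server: several client threads hammer a single balance, and the test compares the final value with the total it knows to expect. The names client, balance, account_mutex, and run_load_test are assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_CLIENTS         8
#define DEPOSITS_PER_CLIENT 5000

static pthread_mutex_t account_mutex = PTHREAD_MUTEX_INITIALIZER;
static long balance = 0;

/* Each simulated client deposits 1 unit, DEPOSITS_PER_CLIENT times. */
static void *client(void *arg)
{
  int i;

  (void)arg;
  for (i = 0; i < DEPOSITS_PER_CLIENT; i++) {
    /* Removing this lock/unlock pair reintroduces the race; the test
     * would then fail on some runs and pass on others. */
    pthread_mutex_lock(&account_mutex);
    balance += 1;
    pthread_mutex_unlock(&account_mutex);
  }
  return NULL;
}

/* Returns the difference between the actual and expected final balance.
 * Zero means the account data came through the load intact. */
long run_load_test(void)
{
  pthread_t tids[NUM_CLIENTS];
  long expected = (long)NUM_CLIENTS * DEPOSITS_PER_CLIENT;
  int i, rc;

  balance = 0;
  for (i = 0; i < NUM_CLIENTS; i++) {
    rc = pthread_create(&tids[i], NULL, client, NULL);
    assert(rc == 0);
  }
  for (i = 0; i < NUM_CLIENTS; i++)
    pthread_join(tids[i], NULL);
  return balance - expected;
}
```

Because a race may stay hidden on any given run, a test like this is usually worth repeating many times and on more than one platform before you trust a passing result.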
As we proceed to debug a race condition, our first step will be to identify the data that is being corrupted. Once we've found the victim, we'll ask questions that are very much like those you'd ask during a good game of Clue: "Which threads knew the victim?" "When was their last contact with the victim?" and "Do they have an alibi?" Those threads that approached the account holding a mutex lock (and released the lock when leaving) have an alibi that's airtight. 
Assume that our test suite detected an account corruption problem in the ATM server. In the server, threads access accounts by calling the retrieve_account routine and release them by calling store_account. Before it calls retrieve_account, a thread should be holding the account's mutex; it should release it after it calls store_account.
In the case of the ATM server, it's easier to find the missing pthread_mutex_lock call by closely inspecting our code than by using the debugger. The retrieve_account routine is called from only three places: deposit, withdraw, and balance. These three routines themselves are called from only one place: process_request. Checking these four routines for correctly paired lock and unlock calls would quickly reveal the source of the error. 
When confronted with a race condition in a more complex application, you may find it easier to start with the debugger and then move on to code inspection. You might use the debugger to set a watchpoint on a piece of shared data or to set breakpoints at those program statements that change the data. While the program is stopped at a breakpoint, you can identify the active thread and determine whether or not it holds the lock required for the account it's accessing. 
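If your platform supports error-checking mutexes, the program can also help catch a thread with no alibi. This is an assumption about your system rather than something the debugger session above relies on: the PTHREAD_MUTEX_ERRORCHECK mutex type comes from later revisions of the POSIX standard. With such a mutex, pthread_mutex_unlock returns EPERM when the calling thread does not hold the lock. The names checked_mutex, checked_mutex_init, and release_if_held are hypothetical.

```c
#define _XOPEN_SOURCE 700    /* request pthread_mutexattr_settype */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t checked_mutex;

/* Initializes checked_mutex as an error-checking mutex.
 * Returns 0 on success or a Pthreads error number. */
int checked_mutex_init(void)
{
  pthread_mutexattr_t attr;
  int rc;

  rc = pthread_mutexattr_init(&attr);
  if (rc != 0)
    return rc;
  rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
  if (rc == 0)
    rc = pthread_mutex_init(&checked_mutex, &attr);
  pthread_mutexattr_destroy(&attr);
  return rc;
}

/* Returns 1 if the calling thread held the lock and has now released it,
 * or 0 if the thread did not hold it (a thread without an alibi). */
int release_if_held(void)
{
  int rc = pthread_mutex_unlock(&checked_mutex);

  assert(rc == 0 || rc == EPERM);
  return rc == 0;
}
```

A routine that stores shared data could call something like release_if_held where it would normally unlock, and log a trace message whenever the result is 0.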
