
Chapter 6 - Practical Considerations

Pthreads Programming
Bradford Nichols, Dick Buttlar and Jacqueline Proulx Farrell
 Copyright © 1996 O'Reilly & Associates, Inc.

Performance
If well-designed and well-written, a multithreaded program can outperform a similar nonthreaded application. However, if you make bad design decisions (forcing concurrency on a large set of strictly ordered tasks is a classic one) or poorly execute a good design, you may wind up with a program that fares worse than the one you started with. At the very end of Chapter 1, Why Threads?, we discussed which types of applications are good candidates for threading. Here, we'll look at the decisions you must make once you've selected the application and begun your design work.
The Costs of Sharing Too Much—Locking
There's an unspoken tradition in our neighborhood that's beyond belief, but we'll tell you about it anyway. Without exception, the parents raise their children so that they're mindful of the virtues of sharing, which will surely be a benefit to them as they grow older and socialize. On any given Saturday night, herds of kids wheel about the streets on bicycles, skateboards, roller blades, scooters, and the like. When a boy tires of his bike, he exchanges it for a girl's skateboard; when a girl tires of her roller blades, she trades them for a boy's scooter; and so it goes. What the tradition seems to be is that any kid will share his or her wheels with any other kid, as long as the borrower's Dad hauls the stupid thing from the middle of the street back to its owner at the end of the evening. Anyone who has seen the neighborhood Dads out on the streets at 10 p.m. on a weekend night will learn this piece of wisdom: sharing is nice, but it's often inefficient—and inelegant. 
Concurrency may give a multithreaded program its greatest performance advantage over other styles of programming. However, the more its threads share, the more its performance is pulled back to that of the rank and file. Shared data (and the associated locks) is both the greatest asset and the biggest curse in multithreaded programming. That threads in the same process have equal access to a common set of resources, including the process's address space, allows them to communicate with each other much faster than independent processes can. When they need to share a particular resource, they don't have to copy it from one process's memory to another, nor do they need to use System V shared memory functions. Normal memory accesses work fine. Unfortunately, as we've seen, sharing isn't entirely free. It's as if multithreading allows you to go a bit faster than traditional speed limits, but data sharing is the speed trap in the bushes. We must apply a lock to brake a bit while we pass through, but once we're through we can cruise once again.* Although we took a performance hit, we'll still reach our destination sooner than we would've otherwise. 
 *None of the authors (nor anyone else affiliated with the publication of this book) actually drives this way. The appearance of this metaphor in this book is not meant to favor any particular driving style over another.
Locks reveal the dependencies among the threads in our program: at each lock point, either threads share data, or one thread must wait for another to finish some task. The impact of each lock on our program's performance is twofold: 
 There's the time it takes for a thread to obtain an unowned lock. This has little impact on our program's concurrency, so it's usually acceptable. The few calls required to lock and unlock a lock are minimal overhead. 
 There's the time a thread spends while waiting for a lock that's already held by another thread. Because it keeps the thread from accomplishing its task, this delay may cause a significant loss of concurrency. The loss can become magnified if other threads depend on the results of the blocked thread. 
Applications are suitable for threading only if access to shared data is a small part of them. If you find that your threads regularly block on locks and spend a lot of time waiting for shared data to become free, something's wrong with your program's design. 
As a rule, you should ensure that, when your threads do hold locks, they hold them for the shortest possible time. This allows other threads to obtain the locks more quickly, avoiding the long waits that are the major hits to a program's concurrency. Examine each block of code framed by pthread_mutex_lock and pthread_mutex_unlock calls for instructions that don't require the special synchronization and could well be performed elsewhere. 
In the following series of examples, we'll show you some common errors in using locks and suggest ways that you can avoid similar problems in your code. In Example 6-4, let's look at some code with poor locking placement. 
Example 6-4: Code with Poor Locking Placement (badlocks.c)
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
int count = 0;
void r1(char *fname, int x, char **bufp)
{
   double temp;
   int fd;
   .
   .
   .
   pthread_mutex_lock(&count_lock);
   temp = sqrt(x);
   fd = open(fname, O_CREAT | O_RDWR, 0666);
   count++;
   *bufp = (char *)malloc(256);
   pthread_mutex_unlock(&count_lock);
   .
   .
   .
}
If count is the only piece of shared data used by this code, we can make the code considerably more efficient by rearranging the pthread_mutex_lock and pthread_mutex_unlock calls as shown in Example 6-5. 
Example 6-5: Code with Poor Locking Placement, Improved (goodlocks.c)
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
int count = 0;
void r1(char *fname, int x, char **bufp)
{
   double temp;
   int fd;
   .
   .
   .
   temp = sqrt(x);
   fd = open(fname, O_CREAT | O_RDWR, 0666);
   pthread_mutex_lock(&count_lock);
   count++;
   pthread_mutex_unlock(&count_lock);
   *bufp = (char *)malloc(256);
   .
   .
   .
}
Finding poor locking policies is not often this simple. In Example 6-6, we'll look at the more complex situation in which the code references the shared data (count) from within a loop. 
Example 6-6: Code with Poor Locking Placement in a Loop (badlocks.c)
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
int count = 0;
void r2(char *fname, int x, char **bufp)
{
   double temp;
   int i, reads;
   int start = 0, end = LOCAL_COUNT_MAX;
   int fd;
   pthread_mutex_lock(&count_lock);
   for (i = start; i < end; i++) {
       fd = open(fname, O_CREAT | O_RDWR, 0666);
       x = x + count;
       temp = sqrt(x);
       if (temp == THRESHOLD)
          count++;
       .
       .
       .
       /* Lengthy I/O operations */
       .
       .
       .
   }
   pthread_mutex_unlock(&count_lock);
}
When examining this code, we must first decide whether or not we should move the lock calls from outside the loop to the inside. If the loop spends most of its processing time performing operations on shared data, or if its total processing time is quite short, it's probably most efficient to keep the lock calls outside. This would leave the whole loop in the critical section. On the other hand, we'd move the lock calls inside if the loop has a lengthy processing time and doesn't reference shared data. We need to be mindful that the lock calls themselves take time. We don't really want to pay the cost of the lock calls each time we go through the loop unless, in doing so, we significantly reduce the time we spend blocking other threads. We'll assume that the code in Example 6-7 pays off in that way. 
Example 6-7: Code with Poor Locking Placement in a Loop, Improved (goodlocks.c)
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
int count = 0;
void r2(char *fname, int x, char **bufp)
{
   double temp;
   int i, reads;
   int start = 0, end = LOCAL_COUNT_MAX;
   int fd;
   for (i = start; i < end; i++) {
       fd = open(fname, O_CREAT | O_RDWR, 0666);
       pthread_mutex_lock(&count_lock);
       x = x + count;
       temp = sqrt(x);
       if (temp == THRESHOLD)
          count++;
       pthread_mutex_unlock(&count_lock);
       .
       .
       .
       /* Lengthy I/O operations */
       .
       .
       .
   }
}
Once you've arranged it so that threads hold locks for the shortest time possible, you should then focus on reducing the amount of data protected by any one lock (that is, reducing the lock's granularity). The smaller the unit of data a lock protects, the less likely it is that two threads will need to access it at the same time. For instance, if your program currently locks an entire database, consider locking individual records instead; if it currently locks records, try locking fields. 
For example, suppose we've set up locks like this: 
pthread_mutex_t data_lock;
struct record {
         int code;
         int field1;
         .
         .
         .
} data[DATA_SIZE];
Here, a single mutex, data_lock, protects the whole array. In the following code, we'll rearrange our record's structure so that each record contains its own lock. Now our threads can lock each record individually. 
struct record {
          pthread_mutex_t data_lock;
          int code;
          int field1;
          .
          .
          .
} data[DATA_SIZE];
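As a minimal sketch of how threads would use these per-record locks (init_data_locks and update_code are our own illustrative names, not code from this book's examples), each mutex in the array must be initialized once, and each access then locks only the record it touches:
/* Sketch only: init_data_locks() and update_code() are illustrative
 * names, not code from this book's examples. */
void init_data_locks(void)
{
   int i;

   for (i = 0; i < DATA_SIZE; i++)
       pthread_mutex_init(&data[i].data_lock, NULL);  /* one mutex per record */
}

void update_code(int i, int new_code)
{
   pthread_mutex_lock(&data[i].data_lock);    /* lock only this record */
   data[i].code = new_code;
   pthread_mutex_unlock(&data[i].data_lock);
}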
Be careful when following this course. As you tune your locks to finer and finer granularity, you must know when to stop. Eventually, you pass the point at which it's useful to break down the data a lock protects. In fact, at some point, your efforts may result in your threads performing more locking operations—and unnecessary ones at that. Performance tests and profiling can help you determine the granularity at which you should impose locking on your program's data. Good tests can clearly identify how often data is being accessed and what percentage of its execution time a program spends waiting for locks on the data. 
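If your profiler can't report lock-wait time directly, a rough way to estimate contention on a given mutex is to count how often it's already held when a thread asks for it. The sketch below is our own illustration (the counter names aren't from any example in this book); it assumes a standard implementation in which pthread_mutex_trylock returns a nonzero value (EBUSY) when the mutex is busy:
/* Rough contention counter -- an illustration, not code from this book.
 * If the trylock fails, another thread already holds the lock, so we
 * count a collision and fall back to a normal blocking wait. */
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
long count_lock_attempts = 0;     /* total times the lock was requested  */
long count_lock_collisions = 0;   /* times another thread already had it */

void lock_count_lock(void)
{
   int busy = (pthread_mutex_trylock(&count_lock) != 0);

   if (busy)
       pthread_mutex_lock(&count_lock);   /* wait for the lock normally */

   /* Both counters are updated while we hold count_lock,
    * so they need no protection of their own. */
   count_lock_attempts++;
   if (busy)
       count_lock_collisions++;
}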
Now that you've reduced the size of the code a thread executes while holding a lock, and reduced the size of the data each lock protects, you should consider whether some locks might in fact synchronize more efficiently if they were condition variables. Here's the rule of thumb: use locks to synchronize access to shared data, use condition variables to synchronize threads against events—those places in your program where one thread needs to wait for another to do something before proceeding. 
It's easy to get mixed up. The beginning threads programmer will often rough out a bit of code like that in Example 6-8. 
Example 6-8: Using a Mutex to Poll State (polling.c)
pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
int db_initialized;
.
.
.
pthread_mutex_lock(&db_lock);
while (!db_initialized) {
         pthread_mutex_unlock(&db_lock);
         sleep(1);
         pthread_mutex_lock(&db_lock);
}
pthread_mutex_unlock(&db_lock);
.
.
.
However, when we think a little harder about what we want this code to do, we realize that our threads are polling on the value of the db_initialized flag to determine when the database-initialization event has occurred. When this event occurs, our threads can proceed. When looked at in this light, it becomes clear that we should be using a condition variable instead of the mutex, as in Example 6-9. 
Example 6-9: Replacing a Mutex with a Condition Variable (polling.c)
pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t db_init_cv = PTHREAD_COND_INITIALIZER;
int db_initialized;
pthread_mutex_lock(&db_lock);
while (!db_initialized) {
         .
         .
         .
         pthread_cond_wait(&db_init_cv, &db_lock);
}
.
.
.
pthread_mutex_unlock(&db_lock);
Using the condition variable, we spare our threads the cycles they would otherwise spend repeatedly locking the mutex and checking the flag. Instead, we wake them only when the database has actually been initialized.
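Example 6-9 shows only the waiting side. As a sketch of the other half (the function name is ours, not from polling.c), the initializing thread would set the flag under the same mutex and then broadcast on the condition variable so that every waiter wakes up:
/* Sketch of the initializer's side -- our illustration, not code
 * from polling.c. */
void database_init_done(void)
{
   pthread_mutex_lock(&db_lock);
   db_initialized = 1;                    /* change the flag under the mutex */
   pthread_cond_broadcast(&db_init_cv);   /* wake every waiting thread       */
   pthread_mutex_unlock(&db_lock);
}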
After trying these methods to reduce lock contention, you might want to take a last look at the tasks you've delegated to the program's threads. Some tasks you've assigned to different threads may be linked so tightly that they can't be separated without introducing some strained and perhaps impossible locking requirements. If this is so, you might be able to increase the program's overall performance by joining the tasks and having them performed by a single thread. 
Thread Overhead
Although the cost of creating and synchronizing multiple threads is less than that of spawning and coordinating multiple processes, using threads does involve overhead nonetheless.
When a thread is created, the Pthreads library (and perhaps the system) must perform database searches and allocate new data structures, synchronizing the creation of this thread with other pthread_create calls that may be in progress at the same time. It must place the newly created thread into the system's scheduling queues. In a kernel thread-based implementation, this requires a system call. The result is that the operating system allocates resources for the thread that are similar to those it allocates for a process. 
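If you're curious about what this overhead amounts to on your platform, you can get a rough number by timing a batch of pthread_create and pthread_join calls. The following program is a sketch of ours (not from this book's examples) that assumes a standard Pthreads implementation and gettimeofday:
/* Rough measurement of thread creation cost -- a sketch, not code
 * from this book's examples. */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define NUM_THREADS 100

void *do_nothing(void *arg)
{
   return NULL;                          /* exit immediately */
}

int main(void)
{
   pthread_t tid;
   struct timeval start, stop;
   double usecs;
   int i;

   gettimeofday(&start, NULL);
   for (i = 0; i < NUM_THREADS; i++) {
       pthread_create(&tid, NULL, do_nothing, NULL);
       pthread_join(tid, NULL);          /* include the cleanup cost too */
   }
   gettimeofday(&stop, NULL);

   usecs = (stop.tv_sec - start.tv_sec) * 1000000.0 +
           (stop.tv_usec - start.tv_usec);
   printf("about %.0f microseconds per create/join pair\n",
          usecs / NUM_THREADS);
   return 0;
}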
You can minimize this overhead by avoiding the simplistic one-thread-per-task model. For instance, our initial version of the ATM server example was rather wasteful in that it created a thread for each client request and then let the thread exit when it completed the request. The version of the server we developed at the end of Chapter 3, Synchronizing Pthreads, was more efficient. When it started, it created a pool of worker threads and let them block on a condition variable. When a new request arrived for processing, the boss would signal on the condition variable, waking the workers. As they completed requests, the workers would return to sleep on the condition variable.
Reusing existing threads is an excellent way to avoid the overhead of thread creation. You may need to experiment a little to determine how many threads can run efficiently at the same time. Once you've settled on that number, create all of the threads at initialization time so that a thread's creation expense is not billed against the request the thread is meant to process.
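A stripped-down version of such a worker pool might look like the following sketch. This is our own simplified illustration, not the ATM server code from Chapter 3: the request is reduced to a single int, and process_request is left to the application.
/* Minimal boss/worker pool sketch -- a simplified illustration, not
 * the ATM server from Chapter 3. */
#include <pthread.h>

#define NUM_WORKERS 4
#define QUEUE_MAX   32

static int queue[QUEUE_MAX];
static int queue_len = 0;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;

extern void process_request(int req);   /* application-specific work */

static void *worker(void *arg)
{
   int req;

   for (;;) {
       pthread_mutex_lock(&queue_lock);
       while (queue_len == 0)                 /* sleep until signaled   */
           pthread_cond_wait(&work_ready, &queue_lock);
       req = queue[--queue_len];              /* take one request       */
       pthread_mutex_unlock(&queue_lock);

       process_request(req);                  /* work outside the lock  */
   }
   return NULL;
}

void pool_init(void)
{
   pthread_t tid;
   int i;

   /* Create all the workers once, at initialization time. */
   for (i = 0; i < NUM_WORKERS; i++)
       pthread_create(&tid, NULL, worker, NULL);
}

void pool_submit(int req)                /* called by the boss thread */
{
   pthread_mutex_lock(&queue_lock);
   if (queue_len < QUEUE_MAX)
       queue[queue_len++] = req;
   pthread_mutex_unlock(&queue_lock);
   pthread_cond_signal(&work_ready);     /* wake one sleeping worker  */
}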
Thread context switches
Once they've been created, threads must share often limited CPU resources. Even on a multiprocessing platform, the number of threads in your program may easily exceed the number of available CPUs. Regardless of whether you're using a user-space or kernel thread-based implementation, scheduling a new thread requires a context switch between threads. The running thread is interrupted and its registers and other private resources are saved. A new thread is selected from the scheduler's priority queues, and its registers and private context are restored.
Some context switches are voluntary. If a thread is waiting for an I/O call to complete or a lock to be freed, it's just as if the thread has asked the operating system to remove it from execution and give another thread a chance to run. Others are involuntary. Maybe the thread has exceeded its quantum, and to be fair, it must yield the CPU. Maybe a higher priority thread has become runnable and is being given the CPU. In a perfect world (the same one in which threads never wait for other threads to unlock a mutex), no thread would be suspended involuntarily. Be that as it may, we'll look toward reducing the number of involuntary context switches as a good way to avoid the overhead of unnecessary context switches and improve our program's performance. 
The most common cause of involuntary context switches among threads is the normal expiration of time quanta. If your platform's scheduler uses a round-robin scheduling policy, one good place to start reducing the number of context switches is by increasing the quantum value. Be careful, though. Because time quanta are meant to more fairly distribute CPU cycles among runnable threads, you may need to cope with some side effects on certain types of operations. For instance, if a user clicks on a box to request a quick operation, he or she may need to wait longer than before because a thread performing a slow operation has yet to use up its quantum. 
Some Pthreads implementations allow you to control their scheduling policy, allowing you to ensure a quicker response time for high priority threads that are performing important tasks. There's a trade-off here, too, of course. The overall application might run slightly slower than under the default policy, because the favorable treatment enforced for high priority threads is causing more involuntary context switches. 
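On implementations that support the POSIX scheduling options, such a high priority thread might be created roughly as follows. This is a sketch under the assumption that _POSIX_THREAD_PRIORITY_SCHEDULING is supported; important_task and the choice of SCHED_RR at maximum priority are our own placeholders:
/* Sketch: creating a high priority, round-robin-scheduled thread.
 * Assumes the implementation supports the POSIX scheduling options;
 * important_task() and the chosen priority are placeholders.        */
#include <pthread.h>
#include <sched.h>

extern void *important_task(void *arg);

int start_important_thread(pthread_t *tid)
{
   pthread_attr_t attr;
   struct sched_param param;

   pthread_attr_init(&attr);
   pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
   pthread_attr_setschedpolicy(&attr, SCHED_RR);       /* round-robin      */
   param.sched_priority = sched_get_priority_max(SCHED_RR);
   pthread_attr_setschedparam(&attr, &param);          /* highest priority */

   return pthread_create(tid, &attr, important_task, NULL);
}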
Finally, too many context switches may simply mean that your program has too many threads. Try running the program with fewer threads, and see if the program speeds up. Eventually, you should determine when the system reaches its saturation point and limit the number of concurrent threads accordingly. 
Synchronization Overhead
Each synchronization object (be it a mutex, condition variable, once block, or key) requires that the Pthreads library create and maintain some data structures and execute some code (possibly even a system call). Consequently, creating large numbers of such objects has its own cost. The cost can be magnified by the way in which you deploy the synchronization objects. For instance, if you create a lock for each record in a database, you increase the disk space required to store the database as well as the memory required to hold it while a thread is running. Nevertheless, the overhead could be worthwhile if the database must support different client requests simultaneously, and establishing fine-grained lock points at the record level allows it to do so efficiently. 
How Do Your Threads Spend Their Time?
Profiling a program is a good first step toward identifying its performance bottlenecks. To track the time a program's threads spend using the CPU or waiting for locks and I/O completion, we can use any profiling tool that supports threads. (On Digital UNIX, the standard profiling tools, prof or pixie, can provide per-thread profiling data.) 
By examining the profiling data, you'll get an idea of your threads' behavior. You should look for answers to the following questions: 
 Do the threads spend most of their time blocked, waiting for other threads to release locks?
This is a sign that the tasks the threads perform aren't really independent of each other or that locking is applied too coarsely to the shared data. 
 Are they runnable for most of their time but not actually running because other threads are monopolizing the available CPUs? 
In this case, the number of CPU-intensive threads is outstripping the number of CPUs in the system. (This can also happen to multiprocess applications.) Use the w and xload utilities to obtain the system's load factor: that is, the average number of processes and threads waiting to access the CPU. Use vmstat and iostat to determine the percentage of time the CPU is running in user space, is running kernel-mode code, or is idle. If the load factor is constantly high, or the amount of idle time is negligible, then you have too many processes or threads for your CPU.
 Are they spending most of their time waiting on the completion of I/O requests? 
In this case, most of your I/O may be directed at a single disk and that disk is becoming quickly saturated. Thereafter, requests sent to it will wind up queued in the driver or at the disk. To avoid this bottleneck, you must spread the data across other available disks. Use the iostat tool to list the I/O transaction rates to the devices on your system. If you cannot utilize additional disks, you may need to reorganize your application so that it requires fewer disk writes. 
Performance in the ATM Server Example
Let's return to our ATM server and look at its performance. We'll create a specialized client program that can send the server a stream of requests and measure its response time. The test client measures the total time the server takes to complete a fixed number of account transactions.
As shown in Figure 6-4, the ATM test parent program can start multiple test client processes to issue requests to the server across multiple connections. It can also specify how often a test client process accesses a specially designated "hot-spot" account. Finally, we can adjust the ATM server itself so that the work it performs to satisfy a client's request is more or less I/O intensive or CPU intensive. 
Figure 6-4: The ATM performance test setup
To find out exactly how useful threads are, we created two additional versions of the ATM server—a serial server (one that doesn't use threads at all) and a multiprocess server. 
We didn't optimize any of these programs; in fact, we often added code specifically to increase the amount of I/O or CPU work performed by the server. Our tests are meant to highlight common high-level aspects of multithreaded program performance and are not intended to be specific benchmarking results for the platform on which we ran them. Results will vary across different platforms.
We recorded the results we'll present in this section on a single-CPU Alpha-processor-based DEC 3000 M300 workstation with 32 megabytes of memory, running Version 3.2C of the Digital UNIX operating system. The programs we used were Pthreads Draft 4 versions of our ATM server programs. 
Performance depends on input workload: increasing clients and contention
The ATM is a classic server—it receives multiple concurrent requests. It performs I/O both to obtain the requests and to process them. As we'll show in the following test runs, the multithreaded version of the server generally outperforms the other versions. But even so, the tests show that the results depend heavily on the type of input the server receives and the characteristics of the work the server performs to service the requests. The input can vary, based on the number of clients that are simultaneously active and how often clients request access to the same account at the same time. The server's response to a client's request can involve different amounts of I/O and more or less CPU-intensive tasks. 
First, let's see whether our multithreaded server or our serial server fares better as the number of clients increases. During this test run, we'll increase the number of active clients from 1 to 15, while keeping the net amount of work the server performs constant. All the clients access their own accounts and never access the hot-spot account. We'll run the test on our uniprocessor under the following conditions:
Contention for accounts: None
Number of clients: Increasing
Total accounts accessed: 30
Total accesses to accounts: 240
Type of accesses: Deposits
Server work: 50/50 I/O and CPU
Figure 6-5 shows the results (in terms of the ratio of the execution time of the multithreaded server over that of the serial server). 
Figure 6-5: Multithreaded server with increasing clients
When we increase the number of clients, the results show: 
 When there's just one client, the serial server outperforms the multithreaded server. 
 When there's more than one client, each requesting transactions on different accounts, the multithreaded server bests the serial server. 
When there's only one client, the server has only one request to process at any given time. After it issues a request, each client waits for a response before making another. In this situation, the actions taken by the multithreaded server to create a new thread and synchronize access to data are pure overhead. Because this overhead is not offset by any gain from concurrency, the multithreaded server's performance when only one client is active is, at best, close to that of the serial server. We could eliminate some overhead if we used a thread pool, effectively moving thread creation from the server's transaction-processing path to its initialization routine. 
When there are multiple clients, the worker threads that are processing client requests can work concurrently. While one thread waits for the completion of an I/O operation to a database account, other threads can continue their tasks and issue I/O requests to other accounts. In this test run, we made sure that no two threads would access the same account. As a result, our threads suffer the overhead of locking, but they never block on a lock that's held by another thread. 
Now let's see what happens to our servers when we ask the clients to modify the same account. During this test run, we'll gradually increase the percentage of the total requests that each client makes to the hot-spot account. Here, too, we'll keep the net amount of work the server performs constant. We'll run the test on our uniprocessor under the following conditions: 
Contention for accounts: Increasing
Number of clients: 5
Total accounts accessed: 30
Total accesses to accounts: 240
Type of accesses: Deposits
Server work: 50/50 I/O and CPU
Figure 6-6 shows the results. 
Figure 6-6: Multithreaded server with increasing contention
When we increase the amount of contention, the results show that, when multiple clients are accessing a single hot-spot account, the serial server outperforms the multithreaded server. 
As the number of requests from different clients to the hot-spot account increases, the performance of our multithreaded server declines. When all requests from all clients are directed at the same account, the server loses all concurrency; each worker thread must wait to obtain the lock on the account, and it's almost always held by another thread. When there's this much contention among threads, it's clear that we're asking the threads to perform tasks that are not independent. They're related by the shared data of the single account.
The results of this test run demonstrate that multithreaded programs perform best when contention is the exception and not the rule. Consequently, when you're trying to determine whether or not an application would benefit from threads, look for tasks that can be performed independently, without interference from other tasks. Moreover, after you've designed the threads, minimize the amount of data they must share. 
Performance depends on a good locking strategy
Now we'll look at how different locking strategies affect the performance of our multithreaded ATM server. We'll test three different locking designs: 
 No locks at all (We'll disregard the inevitable race conditions.)
 One lock for the entire database 
 One lock for each account in the database 
As in our last test run, we'll gradually increase the percentage of the total requests that each client makes to the hot-spot account. However, in this test run, we'll track the extent to which a locking strategy impacts the server's performance. We'll compare the two versions of the server that use locks (one using a single lock on the whole database and one using a lock for each account) against an ideal version that uses no locks. We'll run the test on our uniprocessor under the following conditions: 
Contention for accounts: Increasing
Number of clients: 5
Total accounts accessed: 30
Total accesses to accounts: 240
Type of accesses: Deposits
Server work: 50/50 I/O and CPU
Figure 6-7 shows the results.
Figure 6-7: Multithreaded server locking designs
The results show that, when a lock is assigned to each account in the database, performance is better than when a single lock protects the entire database. 
When a single lock is used for the database, performance is uniformly bad, regardless of the amount of contention. Because all worker threads must obtain the one and only lock whenever they access any account in the database, they cannot concurrently access accounts. It matters little whether they're accessing different accounts or the same hot-spot account. 
When we use one lock per account, we see better performance because multiple threads can now independently access different accounts. When we reach the extreme of targeting all client requests to the hot-spot account, the single-lock and multilock versions perform about the same. Here, the hot-spot account lock is acting like the single global lock because it's the only one being used. 
The results of this test demonstrate that distributing a larger number of locks carefully across the data can have less performance impact than using a single lock for which all threads contend. Put another way, the less threads have to fight over any single lock, the better.
Performance depends on the type of work threads do
Now we'll look at the types of work threads perform. 
When we add threads to an application, it's to perform a set of computational tasks concurrently. Each task has a certain average time to complete and a certain mix of I/O and CPU-intensive activity. In our ATM server, the task being performed by the worker threads is a deposit to an account in a bank's database.
We've adapted our server so that we can supply startup arguments that increase either its I/O activity or CPU-intensive activity. We increase I/O activity by forcing threads to write changed accounts to disk multiple times; we increase CPU-intensive activity by causing them to spin in a simple counting loop. Using these arguments, we'll adjust the combination of CPU and I/O work a thread must perform to complete a deposit transaction. 
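As a rough sketch of the kind of knob we mean (our own simplified illustration, not the actual server code), the transaction-processing routine might take two counts that scale its I/O and CPU portions:
/* Simplified illustration of the workload knobs -- not the actual ATM
 * server code. io_reps forces extra writes of the account record;
 * cpu_reps burns CPU time in a counting loop.                        */
#include <unistd.h>

static volatile long spin;    /* keeps the loop from being optimized away */

void do_transaction_work(int fd, const char *record, int len,
                         int io_reps, long cpu_reps)
{
   int i;
   long j;

   for (i = 0; i < io_reps; i++) {      /* extra I/O-intensive work */
       lseek(fd, 0, SEEK_SET);
       write(fd, record, len);
   }

   for (j = 0; j < cpu_reps; j++)       /* extra CPU-intensive work */
       spin = j;
}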
In our test run, we'll move from a completely I/O-intensive workload to a completely CPU-intensive workload and record the results. We'll run the test on our uniprocessor under the following conditions: 
Contention for accounts: None
Number of clients: 5
Total accounts accessed: 30
Total accesses to accounts: 240
Type of accesses: Deposits
Server work: Varying I/O and CPU workloads (work per transaction 4x that of the other tests)
Figure 6-8 shows the results. 
Figure 6-8: Multithreaded server with varying I/O and CPU workloads
The results demonstrate: 
 In a uniprocessor configuration, the serial server outperforms the multithreaded server on a pure CPU-intensive workload. 
 On a mixed workload, the multithreaded server outperforms the serial server. 
As the server's work becomes completely CPU intensive, threads no longer provide a performance benefit. The single CPU becomes a bottleneck for the many threads waiting to perform CPU-bound tasks. Think of the CPU as a resource with a single lock for which all threads contend. 
Key performance issues between using threads and using processes
We'll now use our ATM server test program to highlight the ways in which performance differs between multithreaded and multiprocess versions of the same servers. Threads and processes are alike in many respects, although using processes results in more overhead than using threads. Processes are more expensive to create, and once created, they use more resources than threads to intercommunicate. 
If we replaced the multithreaded server in the previous tests with a multiprocess one, the basic curve of the test results would remain essentially the same. However, the point at which the performance of the multiprocess server would exceed that of the serial server would be further out than the point we charted for the multithreaded server. In fact, to justify using a multiprocess server, we'd need more clients, more contention at shared data, or less CPU-intensive work than we'd need to justify writing a multithreaded server. 
First, let's see how our multithreaded server and multiprocess server compare as the number of clients increases. As in the earlier test run, we'll increase the number of active clients from 1 to 15, while keeping constant the net amount of work each server performs. All the clients access their own accounts and never access the hot-spot account. We'll run the test on our uniprocessor under the following conditions: 
Contention for accounts: None
Number of clients: Increasing
Total accounts accessed: 30
Total accesses to accounts: 240
Type of accesses: Deposits
Server work: 50/50 I/O and CPU
Figure 6-9 shows the results. 
Figure 6-9: Multithreaded vs. multiprocess server performance with increasing clients
The results demonstrate that the multithreaded server outperforms the multiprocess server, regardless of the number of clients. 
The difference between the multithreaded and multiprocess servers is in the relative costs of creating threads vs. creating processes. Although both threads and processes must obtain locks to access shared data, they don't have to wait on locks because this test run eliminates contention for the data. 
Now let's introduce the contention, and see how our servers fare. As in the earlier test run, we'll ask the clients to modify the same account and gradually increase the percentage of the total requests that each client makes to this account. Here too we'll keep constant the net amount of work the server performs. We'll run the test on our uniprocessor under the following conditions: 
Contention for accounts: Increasing
Number of clients: 5
Total accounts accessed: 30
Total accesses to accounts: 240
Type of accesses: Deposits
Server work: Default (I/O intensive)
Figure 6-10 shows the results. 
Figure 6-10: Multithreaded vs. multiprocess server performance with increasing contention
The results show that the synchronization mechanisms used by the multithreaded server are more efficient than those used by the multiprocess server. 
Where the multithreaded server uses mutex locks to control access to shared data, the multiprocess server uses System V semaphores. When there is little contention among threads for account data, the multithreaded server operates more efficiently because the Pthreads mutex-locking calls operate within user space. On the other hand, the multiprocess server's semaphore-locking calls are system calls and involve the operating system's kernel. As client contention for the hot-spot account increases, the multiprocess server starts catching up to the multithreaded server. It no longer matters that the Pthreads synchronization primitives are lighter in weight than the multiprocess ones. Because worker threads and child processes alike are blocked waiting for account access, neither server is able to provide any concurrency. 
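To make the comparison concrete, here's a sketch (ours, not the multiprocess server's actual code) of what lock and unlock look like when built from a System V semaphore; note that each semop call enters the kernel, whereas an uncontended pthread_mutex_lock typically stays in user space:
/* A System V semaphore used as a mutex -- a sketch, not the multiprocess
 * server's actual code. Every semop() call is a system call.            */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

void sem_lock(int semid)
{
   struct sembuf op = { 0, -1, 0 };   /* decrement semaphore 0; block if it's 0 */

   semop(semid, &op, 1);
}

void sem_unlock(int semid)
{
   struct sembuf op = { 0, 1, 0 };    /* increment semaphore 0 */

   semop(semid, &op, 1);
}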
One last difference between multithreaded and multiprocess servers that would be worth examining is the ways in which they share data. Whereas threads exchange data by simply placing it in global variables in their process's address space, processes must use pipes or special shared memory segments controlled by the operating system. Because we did not design the threads in our ATM server to share data, we have no good way of testing the performance of the servers' data communication mechanisms. 
