Simple Queuing Theory

With the advent of the World Wide Web, everybody is thinking of their computers as servers. IBM recently re-branded their entire line to eServer (actually, the e- is supposed to be their e-business logo, but nobody seems to care about that little detail, except IBM’s legal department). So we are all supposed to write service programs that feed the net. The classical problems of response time versus system capacity and utilization are as important as ever, if not more so.

Let me start with a true war story. About twenty years ago I worked as a consultant to Lockheed, which was bidding to produce a computerized Directory Assistance System for (what was then) New York Telephone. Lockheed won the bid because its hardware was the cheapest of the competing bids. The system was configured as a distributed system with many machines serving requests (about 2 million a day). Each request took on average one-third of a second to process on a given CPU, so Lockheed figured that a CPU could serve 3 requests per second, and calculated the number of CPUs to install by dividing the peak number of requests per second by 3. They did that and won the bid. Only later did they realize that they actually needed almost twice that capacity: with a load of 3 transactions per second, you need a capacity of 5.8 transactions per second to handle the load without excessive queues building up. Some simple queuing theory could have told them that. This is where I came in: my job was to speed up the processing by a factor of two so they didn't have to install twice as many CPUs. One way of dealing with a massive screw-up!

This counterintuitive result still hits people today, so it pays to examine why a simple-minded approach to queuing just doesn't work. That is the purpose of this article.

We consider a system receiving requests, servicing them, and producing responses. The treatment is based on the assumption that the events causing requests to the system occur at random: the probability of any particular event happening is low and uniform, i.e. the probability of it happening in a given unit of time is the same as the probability of it happening in any other unit of time. This situation is typical of most (but not all) real-time systems. The number of arrivals of events during a given time interval can then be described by the Poisson distribution:

P(n) = (exp(-<n>) <n>^n) / n!

where P(n) denotes the probability of having n events per unit time if the average number of events per unit time is <n>. [n! = n(n-1)(n-2)...1] In the following we shall use angle brackets <x> to denote the average of a variable x. If we have N single values x1, x2, ..., xN, the average value is defined as

<x> = (x1 + x2 + ... + xN) / N

Probability theory shows that if the probability of one particular event occurring in a given time interval is constant (uniform distribution), then the probability of having n such events occur in a given time interval follows the Poisson distribution. The probability that times between events are less than a certain value then follows an exponential distribution:

P(t < T) = 1 - exp(-T/<t>)

where <t> is the mean interarrival time.
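
To see the connection between the two distributions, here is a minimal simulation sketch in Python (standard library only; the variable names are my own): it draws exponential interarrival times and checks that the counts per unit time follow the Poisson formula.

    import math
    import random

    RATE = 2.0        # <n>: average number of arrivals per unit time
    UNITS = 100_000   # number of unit-length intervals to simulate

    # Draw exponential interarrival times (mean <t> = 1/<n>) and count
    # how many arrivals fall into each unit-length interval.
    counts = [0] * UNITS
    t = random.expovariate(RATE)
    while t < UNITS:
        counts[int(t)] += 1
        t += random.expovariate(RATE)

    # Compare the observed frequencies with the Poisson formula P(n).
    for n in range(6):
        observed = counts.count(n) / UNITS
        poisson = math.exp(-RATE) * RATE**n / math.factorial(n)
        print(f"n={n}  observed={observed:.4f}  Poisson={poisson:.4f}")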

For illustration, consider a system where transactions occur at the average rate of 2 per second. The probability of having n arrivals per second is thus P(n) = (exp(-2) * 2^n) / n!, since <n> = 2 per second. Using this formula for n from 0 through 9 yields:

n    P(n)
0    0.1353
1    0.2707
2    0.2707
3    0.1804
4    0.0902
5    0.0361
6    0.0120
7    0.0034
8    0.0009
9    0.0002

It is possible, but very unlikely, to get more than 9 arrivals per second. On the other hand, there is a 14.3% chance that the rate will reach or exceed twice the average rate. That is, rather large fluctuations can occur. The degree of these fluctuations decreases with increasing <n>, as shown in the following table, which gives the probability that the number of arrivals reaches or exceeds twice its average value for several different mean values <n>:

<n>    P(n >= 2<n>)
1      0.265
2      0.143
3      0.085
4      0.051
5      0.030
6      0.019
7      0.012
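
These tail probabilities are easy to compute directly; the following sketch (Python, standard library only; the helper name poisson_tail is my own) reproduces the table up to rounding:

    import math

    def poisson_tail(mean, x):
        """P(n >= x) for a Poisson distribution: one minus the CDF up to x-1."""
        cdf = sum(math.exp(-mean) * mean**n / math.factorial(n) for n in range(x))
        return 1.0 - cdf

    for mean in range(1, 8):
        print(f"<n>={mean}  P(n>=2<n>)={poisson_tail(mean, 2 * mean):.3f}")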

Most systems of the type we are considering can be regarded as a facility with a certain utilization factor, which we shall denote by u. If the system can handle five transactions per unit time and during a certain interval it handled only three transactions per unit time, it was only 3/5 = 0.6 or 60% utilized. If the facility utilization is low, system response is swift and there is little or no queuing. The greater the utilization, the greater the delays become and the longer the queues that build up.

As we have just seen, the utilization factor can fluctuate markedly simply because the number of events per unit time (e.g. per second) fluctuates. If a system can handle four events per second and the transaction intensity is two events per second, the system will be overloaded 14.3% of the time. The overload will relax shortly, though, because of the system's inherent capacity. Experience has shown (and it has become a good engineering rule of thumb) that being overloaded less than 10% of the time is generally safe and can be absorbed by the system without noticeable penalties. In other words:

The system must have enough capacity that the traffic stays within it at least 90% of the time.

The quantity of interest here is the probability of having X or more arrivals in a time interval for which the mean arrival rate is <n>:

P(n >= X) = sum((exp(-<n>) <n>^n) / n!) from n = X to infinity

A sound design goal is then to ensure that

P(n>=X) < 0.10

i.e. to configure the system to handle X arrivals per second, given the average arrival rate <n>, so that the above inequality holds. The following table gives X and the utilization u as functions of <n>:

<n>     X        u = <n>/X
1       2.8      0.36
2       4.4      0.45
3       5.8      0.52
4       7.2      0.56
5       8.5      0.59
6       9.8      0.61
7       11.0     0.64
8       12.2     0.66
9       13.5     0.67
10      14.7     0.68
20      26.3     0.76
100     113.0    0.88
1000    1041.0   0.96

For large <n>, X approaches the value

X → <n> + 1.287 sqrt(<n>)

Also, u → 1 as <n> → infinity. But for most values found in practice, X exceeds <n> by a significant amount. This simple fact is often difficult to accept. As an example, if <n> = 10 transactions per second, we need a capacity of X = 14.7, i.e. 47% larger, in order to cope smoothly with the load. In Lockheed's case the numbers were <n> = 3 and thus X = 5.8, almost twice as high.
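
The X column can be reproduced numerically by searching for the capacity at which the tail probability drops below 0.10. The sketch below finds the bracketing integer arrival counts and interpolates linearly between them; the table's fractional values were presumably obtained by a similar interpolation, so the last decimal may differ slightly.

    import math

    def poisson_tail(mean, x):
        """P(n >= x): upper tail of the Poisson distribution at integer x."""
        cdf = sum(math.exp(-mean) * mean**n / math.factorial(n) for n in range(x))
        return 1.0 - cdf

    def required_capacity(mean):
        """Smallest capacity X with P(n >= X) < 0.10, interpolated between
        the two bracketing integers to give a fractional estimate."""
        x = int(mean)
        while poisson_tail(mean, x) >= 0.10:
            x += 1
        p_lo, p_hi = poisson_tail(mean, x - 1), poisson_tail(mean, x)
        return (x - 1) + (p_lo - 0.10) / (p_lo - p_hi)

    for mean in (1, 2, 3, 4, 10, 100):
        x = required_capacity(mean)
        approx = mean + 1.287 * math.sqrt(mean)   # large-<n> approximation
        print(f"<n>={mean}  X={x:.1f}  u={mean / x:.2f}  approx={approx:.1f}")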

At first sight there seems to be a paradox in the above considerations. If <n> = 10 per second, we need a capacity of X = 14.7 (47% more), but if my unit of time is changed to a "dec", which is 10 seconds (chosen arbitrarily), then the very same real traffic volume would be <n> = 100 (namely 10*10), and according to the table we now need X = 113, or only 13% overcapacity rather than 47%.

It seems that simply by arbitrarily varying the unit of time over which I measure the traffic intensity, I require different amounts of capacity to handle what is really the same number of events. How can that be? The solution lies in the realization that the design criterion we used was to configure the system so that delays do not become excessive. Now, delays are measured in the same units of time as <n> is calculated over, so increasing the length of the time interval over which we measure <n> amounts to increasing the delay we will tolerate as well. The implicit assumption behind the considerations about facility overloading is in fact that we do not tolerate delays much longer than the servicing time (which is 1/X). You do not want to wait in line for an hour for a transaction that only takes a minute; waiting a few minutes, however, would be quite acceptable.

Let us now consider a system having a single server with a queue of items awaiting service. Let ts be the service time for one item and Sd the standard deviation ("spread") of the service times. Further, let w be the number of items waiting in the queue at a given time and q the total number of items in the system (either waiting or being served). Let tw be the time an item waits before being served and tq the total time an item spends in the system. Then for each individual event we have:

tq = tw + ts

and thus also for the mean values:

<tq> = <tw> + <ts>.

On the average, for a steady state condition, we must have:

<w> = <n> <tw>

and <q> = <n> <tq>.

If in one second we serve <n> events (in a steady state the number of transactions served equals, on average, the number of arrivals) and the mean service time is <ts>, the part of that second spent servicing is <n> <ts>, which is precisely the fraction of time the facility is in use. Hence the utilization factor is just that:

u = <n> <ts>

Combining the above relations, we see that:

<q> = <n> <tw> + <n> <ts> = <w> + u.

The basic result of single-server queuing theory was developed by Khintchine and Pollaczek and can be expressed as:

<w> = u^2/(2(1-u)) (1 + 1/Rs)

where the parameter Rs is given by

Rs = (<ts>/Sd)^2

When the service times are exponentially distributed, the standard deviation Sd is equal to the mean:

Sd = <ts>

and Rs = 1. If the service times were constant (Sd = 0), we get Rs = infinity. In practice, the assumption of exponentially distributed service times is often a good one and we shall use Rs = 1. Therefore, the mean number of items waiting is:

<w> = u^2 / (1-u)

and then

<q> = <w> + u = u^2/(1-u) + u = u/(1-u)
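
The following sketch (Python; the function name kp_queue and its parameters are my own) wraps the Khintchine-Pollaczek result so the effect of service-time variability can be seen directly: rs=1 is the exponential case used in the text, and rs=infinity the constant-service case, which halves the waiting line.

    def kp_queue(u, rs=1.0):
        """Khintchine-Pollaczek: mean number waiting <w> and mean number in
        system <q>, for utilization u and service regularity Rs = (<ts>/Sd)^2."""
        w = u * u / (2.0 * (1.0 - u)) * (1.0 + 1.0 / rs)
        q = w + u
        return w, q

    # Exponential service (Rs = 1) versus constant service (Rs = infinity):
    for u in (0.5, 0.8, 0.9):
        _, q_exp = kp_queue(u, rs=1.0)
        _, q_const = kp_queue(u, rs=float("inf"))
        print(f"u={u}  <q> exponential={q_exp:.2f}  constant={q_const:.2f}")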

We shall comment upon the relation <w> = <n> <tw>. In a steady state, as many items leave the queue as enter it from the other end. That number is <n> per unit time. If the queue contains <w> items and <n> leave per second, it will take <w>/<n> seconds to empty the queue, and the latest arrival would then spend that many seconds in the queue. Hence the time an item waits before being served is:

<tw> = <w> / <n> from which we get <w> = <n> <tw>.

We can now write

<tw> = <w> / <n> = (<w> / u) <ts>, since u = <n> <ts>, and

thus

<tq> = <tw> + <ts> = (<w>/u + 1) <ts>.

Using <w> = u^2/(1-u), we finally get:

<tq> = (1 + u/(1-u)) <ts> = (1 + <q>) <ts>

We have thus expressed the queuing time <tq> in terms of the service time <ts> and the utilization u.

The following table shows this relationship between the load factor u, the number of items in the system <q>, and the response times <tq> in terms of the mean service time <ts>:

u       <q>      <tq> (in units of <ts>)
0.2     0.25     1.25
0.4     0.67     1.67
0.6     1.50     2.50
0.7     2.33     3.33
0.8     4.00     5.00
0.85    5.67     6.67
0.90    9.00     10.00
0.95    19.00    20.00

It is clear that when the utilization increases to beyond 80%, the queues grow alarmingly and the system is hardly usable at all.
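
The rows follow directly from <q> = u/(1-u) and <tq> = <ts>/(1-u), as this minimal sketch shows:

    for u in (0.2, 0.4, 0.6, 0.7, 0.8, 0.85, 0.90, 0.95):
        q = u / (1.0 - u)        # mean number of items in the system
        tq = 1.0 / (1.0 - u)     # response time in units of <ts>
        print(f"u={u:.2f}  <q>={q:.2f}  <tq>={tq:.2f} <ts>")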

When we have a stream of transactions that arrive at random, the times between arrivals are exponentially distributed. But suppose that there is a two-way switch at the arrival point. The switch operates so as to send alternate transactions on alternate output "lines" for service. The result is a more uniform distribution of interarrival times. The distributions that result in this way have been studied by the Danish mathematician A.K. Erlang and are called Erlang distributions. (Erlang worked for the Copenhagen Telephone Company, which explains the terminology of traffic, switches, and lines). The exponential distribution is an Erlang-1 distribution, while the two-way switch generates Erlang-2 distributions. If we had switched the transactions down a very large number of paths - instead of just the two as above - the resulting interarrival times would become very nearly constant. An Erlang-infinity distribution of interarrival times means that these are constant.

The converse is also true: if we concentrate onto one line the traffic from many separate lines, the spread in interarrival times will increase, leading to more queuing: there will be bursts of events piling up, as well as intervals without traffic where the server is idle and the facility is underused.

Switching down several output lines simply means having several servers handling transactions. Let us assume there are M identical servers. The mean number of items arriving per unit time is <n> as before. Thus <n>/M items go to each server. Again, let the mean service time be <ts>. The facility utilization therefore becomes:

u = <n> <ts> / M

As before: <q> = <n> <tw> + <n> <ts>

And so:

<q> = <w> + M u

It can be shown that the probability that ALL servers are busy at a given time is

B = P(q>=M) = (1 - Z(M,u)) / (1 - u Z(M,u))

where

Z(M,u) = (sum((M u)^i / i!) from i = 0 to M-1) / (sum((M u)^i / i!) from i = 0 to M)

For M = 1 the expression for B reduces to B = u. The table below explores B as a function of u and M:

u      B(M=1)  B(M=2)  B(M=4)  B(M=8)  B(M=16)  B(M=50)  B(M=100)
0.1    0.100   0.018   0.001   0.000   0.000    0.000    0.000
0.2    0.200   0.067   0.010   0.000   0.000    0.000    0.000
0.3    0.300   0.138   0.037   0.004   0.000    0.000    0.000
0.4    0.400   0.229   0.091   0.018   0.001    0.000    0.000
0.5    0.500   0.333   0.174   0.059   0.009    0.000    0.000
0.6    0.600   0.450   0.287   0.140   0.042    0.001    0.000
0.7    0.700   0.576   0.429   0.271   0.130    0.011    0.000
0.8    0.800   0.711   0.596   0.458   0.305    0.087    0.020
0.9    0.900   0.853   0.788   0.702   0.591    0.364    0.217
0.94   0.940   0.911   0.870   0.815   0.740    0.568    0.434
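
The formulas for Z(M,u) and B transcribe directly into code; this sketch (Python, standard library only; the function name is my own) reproduces the rows above:

    import math

    def erlang_wait_probability(m, u):
        """B = P(q >= M): probability that all m servers are busy,
        from B = (1 - Z)/(1 - u Z) with Z(M,u) as defined above."""
        a = m * u                               # offered load M*u
        terms = [a**i / math.factorial(i) for i in range(m + 1)]
        z = sum(terms[:m]) / sum(terms)         # Z(M,u)
        return (1.0 - z) / (1.0 - u * z)

    for u in (0.5, 0.8, 0.9):
        row = [erlang_wait_probability(m, u) for m in (1, 2, 4, 8, 16)]
        print(f"u={u}  " + "  ".join(f"{b:.3f}" for b in row))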

The mean number of items in the queue can be shown to be

<w> = B u /(1-u)

hence

<q> = <w> + M u = (B/(1-u) + M) u

and the response time becomes:

<tq> = <q> / <n> = (<q> <ts>) / (<n> <ts>) = <q> <ts> / (M u)

so that, finally:

<tq> = (1 + B/(M(1-u))) <ts>

For M = 2 we have (exactly):

<tq> = <ts> / (1-u^2) and B = 2u^2 / (1+u)

The following simple formula is a good approximation for all M (exact for M <= 2):

<tq> = <ts> / (1-u^M)
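
A short self-contained sketch (Python, with the Z and B formulas inlined) comparing the exact expression <tq> = (1 + B/(M(1-u))) <ts> with this approximation at u = 0.9:

    import math

    def response_time_exact(m, u):
        """Mean response time in units of <ts>: <tq> = 1 + B/(M(1-u))."""
        a = m * u
        terms = [a**i / math.factorial(i) for i in range(m + 1)]
        z = sum(terms[:m]) / sum(terms)
        b = (1.0 - z) / (1.0 - u * z)
        return 1.0 + b / (m * (1.0 - u))

    def response_time_approx(m, u):
        """The simple approximation <tq> = <ts>/(1 - u^M)."""
        return 1.0 / (1.0 - u**m)

    for m in (1, 2, 4, 16):
        print(f"M={m}  exact={response_time_exact(m, 0.9):.3f}  "
              f"approx={response_time_approx(m, 0.9):.3f}")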

We shall contrast the response times (in terms of the mean service time <ts>) for various values of M and note the significant improvements when M grows larger:

u      <tq> M=1   <tq> M=2   <tq> M=4   <tq> M=16
0.2    1.250      1.042      1.003      1.000
0.4    1.667      1.190      1.038      1.000
0.6    2.500      1.562      1.179      1.007
0.7    3.333      1.961      1.357      1.027
0.76   4.167      2.367      1.548      1.058
0.80   5.000      2.778      1.746      1.095
0.84   6.250      3.397      2.047      1.158
0.88   8.333      4.433      2.558      1.273
0.90   10.000     5.263      2.969      1.370
0.92   12.500     6.510      3.589      1.518
0.94   16.667     8.591      4.626      1.771
0.96   25.000     12.755     6.705      2.284
0.98   50.000     25.253     12.950     3.389

It is important not to confuse the true multi-server model with one where arrivals are just distributed cyclically to each server in turn. This sort of model is really just M single-server queues in parallel and does not give the improvement in response time to the same degree as the true multi-server situation with only one queue.
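
The difference is easy to quantify. With cyclic distribution, each server effectively has its own single-server queue at utilization u, so <tq> = <ts>/(1-u) no matter how many servers there are, while the shared queue gets the multi-server result. A small sketch (the value of B is taken from the table above):

    # M separate single-server queues: <tq> = <ts>/(1-u), independent of M.
    # One shared queue feeding M servers: <tq> = (1 + B/(M(1-u))) <ts>.
    u, m = 0.8, 4
    b = 0.596                             # B for M=4, u=0.8 (see table above)
    separate = 1.0 / (1.0 - u)            # 5.00 in units of <ts>
    shared = 1.0 + b / (m * (1.0 - u))    # about 1.75 in units of <ts>
    print(f"separate queues: {separate:.2f} <ts>   shared queue: {shared:.2f} <ts>")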

Since it is important, then, to have only one queue, we may ask: how long could that queue become? The average number of items in the system is not the measure we want; rather, it is <w>, the number of items in the queue. Using <w> = B u/(1-u) we get:

u      <w> M=1   <w> M=2   <w> M=4   <w> M=16
0.6    0.9       0.7       0.4       0.1
0.7    1.6       1.3       1.0       0.3
0.8    3.2       2.8       2.4       1.2
0.9    8.1       7.7       7.1       5.3
0.94   14.7      14.3      13.6      11.6

Under heavy load, <w> is not very sensitive to the number of servers. It is perhaps at first sight a bit surprising that the queue length does not depend on the arrival rate <n>, but only on the facility utilization u. How large should the queue then be configured? Characteristic of this type of queue are its wide fluctuations: the standard deviation of the queue length is large. Queuing theory yields the result that the standard deviation of the queue length is given by:

Sd(w) = sqrt(B u (1 + u - B u))/(1-u)

And the queue length itself is:

<w> = B u/(1-u)

Under heavy load B and u are both close to one, so:

<w> = 1 /(1-u)

and

Sd(w) = 1 /(1-u) = <w>

I.e. the standard deviation is of the same size as the queue length itself. The probability that a value lies more than three standard deviations above the mean is very low (about 0.001), so the maximum queue length could be set to

wmax = <w> + 3 Sd(w) = 4<w>

You could add one more standard deviation (one extra <w>) for added margin and use wmax = 5<w> with confidence.
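
As a closing sketch, the sizing rule translates directly into code (Python; the function name and margin_sd parameter are my own, and B must be supplied by the caller, e.g. B = u for a single server):

    import math

    def max_queue_length(b, u, margin_sd=3.0):
        """Suggested buffer size: <w> plus margin_sd standard deviations,
        where b is the probability that all servers are busy (b = u for M = 1)."""
        w = b * u / (1.0 - u)
        sd = math.sqrt(b * u * (1.0 + u - b * u)) / (1.0 - u)
        return math.ceil(w + margin_sd * sd)

    print(max_queue_length(0.9, 0.9))   # single server at 90% load -> 38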