Price / Performance Sort

Jim Gray, Microsoft Research, Chris Nyberg, Ordinal Technologies

Gray@Microsoft.com, Chris@Ordinal.com

April 1998

Abstract: NTsort is an external sort shipped with Windows NT 5.0. It has minimal functionality but excellent price/performance. In particular, running on mail-order hardware it can sort 1.5 GB for a penny -- a new price/performance record. This paper documents that result and proposes that the PennySort benchmark be revised to Price/Performance Sort: a simple GB/$ sort metric based on a two-pass sort.

Why does anyone care about sorting and sort performance? The prosaic reason is that sorting is a common task: it is frequently used in database systems, data analysis, and data mining. Another important reason is that sorting is a simple, balanced workload involving memory access, IO, and cpu, so it evaluates a computer system's overall performance. Being simple, sorting is easily ported from one system to another, and easily scaled to large SMP systems and to computer clusters.

The first public sort benchmark was defined in "A Measure of Transaction Processing Power," Datamation, April 1, 1985 [1]. That article defined DatamationSort to measure how fast you can sort a million records. The records are 100 bytes, with 10-byte keys in random order. The sort is external (disk-to-disk). The time includes starting the program, creating the target file, and doing the sort. Prices are list prices depreciated over 3 years (see http://research.microsoft.com/barc/SortBenchmark/ for the rules).

Since then there has been steady improvement in sort performance: sort speed and price/performance have each approximately doubled every year -- a thousand-fold improvement per decade. In part this is due to Moore's law (things get faster every year), but that is only 40% of the story. The other 60% came from better algorithms and from parallelism. The current champion, NOWsort, used 100 UltraSPARCs to sort 9GB in a minute.

The DatamationSort times were getting tiny (a few seconds), so it seemed better to have a fixed-time benchmark rather than a fixed-size sort. MinuteSort (how much can you sort in a minute?) replaced DatamationSort in 1994. MinuteSort is a "biggest bang" (price-is-no-object) test. PennySort is a "bang-for-the-buck" measure: how many 100-byte records can you sort for a penny? Inexpensive systems are allowed to run for a long time, while expensive systems must run the sort quickly.

Several MinuteSort benchmark results have been reported but, until now, there have been no PennySort results. In fact, the PennySort benchmark was originally the DollarSort benchmark. It was changed to PennySort when someone pointed out that a DollarSort would run for 50,000 seconds on a 2,000$ system; PennySort brought this time down to 500 seconds. However, that was the wrong answer: if innovation continues and Moore's Law continues to operate, in less than a decade PennySort will be back up to 50,000 seconds.

Consequently, this paper both reports the first PennySort result and proposes to redefine the benchmark as Price/Performance Sort: the GB/$ of a two-pass sort.

 

What We Did: Daytona PennySort

We have three commercial sorts running on Windows NT 4.0.3 at our lab:

  1. NTsort, included in NT 5.0,
  2. NitroSort v1.0A from Opnek Research, Hackettstown, NJ (299$ from http://www.nitrosort.com/nitrsort.html), and
  3. Postman's Sort v3.21 from Robert Ramey Software Development, Santa Barbara, CA (149$ from http://www.rrsd.com/).

Other commercial sorts include Syncsort (http://www.syncsort.com/infosnt.htm) and CoSort (http://www.iri.com/98new/win32.html). These sorts are as expensive as the PennySort machine, and so are more appropriate for MinuteSort benchmarks.

All three sorts are single-threaded, so they run best on uni-processors. We originally estimated that a 2k$ system should be able to sort 700MB for a penny. The system would have the fastest CPU and memory that 1.4k$ can buy, a 2 GB data source-and-target disk, and a 1 GB scratch disk. It would have 64 MB of DRAM to produce thirty 30MB data runs on the scratch disk using 256KB writes, and then merge them into the target using 256KB reads and writes. We budgeted 500$ for NT Workstation and the sort software.

We were aware of the substantial price differential between SCSI and IDE disk drives. Traditionally, the SCSI premium was justified because IDE did PIO rather than DMA. Programmed Input-Output (PIO) moves each byte through the CPU registers rather than having the host-bus adapter stream the data via Direct Memory Access (DMA). Recently, IDE drives have adopted UltraDMA IDE: the IDE adapter does not interrupt the processor during a disk transfer. As a result, the cpu overhead of an IO drops from 6 instructions per byte transferred under PIO to 0.2 instructions per byte under DMA. This is a dramatic saving: without DMA, the system runs at 80% processor utilization when transferring 8MBps; with DMA, the utilization drops to 2%. UltraDMA IDE drives therefore compete with SCSI for small disk arrays. SCSI still has the advantage for strings of disks: IDE does not allow multiple outstanding commands on one string.
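The utilization figures are consistent with the instruction counts. A back-of-the-envelope check (the instructions-per-byte numbers are from the text; the ~60 MIPS effective rate for the P2-266 is our assumption, not a measured figure):

```python
# CPU fraction consumed moving 8 MB/s through the disk interface.
RATE = 8e6          # bytes/second transferred
MIPS = 60e6         # assumed effective instructions/second on the P2-266

pio_load = 6.0 * RATE / MIPS   # PIO: 6 instructions/byte -> 0.80 (80% busy)
dma_load = 0.2 * RATE / MIPS   # DMA: 0.2 instructions/byte -> ~0.03 (2-3% busy)
```

The 40x ratio between the two loads matches the measured drop in utilization regardless of the assumed MIPS value.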

We shopped on the web and found a mail-order house that would sell us an inexpensive system, described in Table 1. The actual system we bought was more expensive because it included a CDROM, sound, and overnight freight. Those extras are not essential to the sort and so are subtracted from the price. These are OEM prices; the OEM is not allowed to sell the items individually at this price (for example, the best individual price of NT Workstation is 140$).

Table 1: Price breakdown of PennySort machine

Qty | Part                                                     | Price ($)
----|----------------------------------------------------------|----------
 1  | Assembly                                                 |    25
 1  | Barebone (P2L97 + P2-266 + fan), 1-year warranty         |   495
 1  | 64 MB SDRAM (10 ns)                                      |    94
 1  | Mini tower case w/ 235 W power supply & fan              |    47
 1  | Floppy drive                                             |    18
 2  | Fujitsu MPS3032UA 3.1 GB U/ATA, 10 ms, 128 KB, 5400 rpm  |   278
 1  | Intel 8465 EtherExpress PCI Pro/100                      |    48
 1  | Virage 3D w/ 2 MB EDO                                    |    33
 1  | NT Workstation 4.0.3                                     |    69
 1  | Shipping (3-day)                                         |    35
    | Total:                                                   |  1142

Given this price, the time allowed for a PennySort -- one penny's worth of the system's 3-year depreciated price -- is 828 seconds.
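The 828-second budget follows directly from the 3-year depreciation rule. A sketch of the arithmetic (assuming 365-day years):

```python
# A penny's worth of time on a system depreciated over 3 years.
SECONDS_PER_3_YEARS = 3 * 365 * 24 * 3600   # 94,608,000 seconds

def penny_seconds(system_price_dollars):
    """Seconds of system time that one penny ($0.01) buys."""
    return 0.01 * SECONDS_PER_3_YEARS / system_price_dollars

print(round(penny_seconds(1142)))   # the 1,142$ machine: 828 seconds
```

Note how the budget scales: a machine twice as expensive gets half the time.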

 

We first ran the UltraDMA IDE disks without installing the DMA software (the PiiXide driver, so called because it runs on the Pentium II). The non-DMA system ran 80% CPU-saturated when reading or writing the disks at 8 MBps. With the PiiXide driver installed, the CPU utilization dropped 40x, to 2%. The disk transfer rate is limited by the rate at which bytes move past the disk read/write heads: outer bands transfer at almost 9 MBps, inner bands at 6 MBps, and the average rate across the surfaces is 7.8MBps.

Both disks had a small (400MB) FAT partition to store NT programs and data, and a large (3GB) NTFS partition to store the input and output data and the sort temporary file. The data flow is shown in Figure 2: the data moves to or from disk 4 times in a 2-pass sort. During both phases, the input and output can be overlapped, so in theory the sort can fully utilize the bandwidth of both disks.

Figure 2: Data flow of a two-pass sort: (1) input is sorted into runs stored on the temporary file. Then (2) runs are merged to form the sorted output file

An August 1, 1997 PC Week article by John Shumate, "4 Programs Make NT 'Sort' of Fast" [18], reviewed several sort programs and sparked our interest. Shumate pointed out that NTsort produced no output and no error if the file was larger than memory. He also noted NTsort's other shortcomings: no GUI, no API, and extremely limited function. Consequently, he recommended the other sort programs, notably CoSort, Optec, Nitro, and Postman.

We repaired NTsort to be a simple two-pass sort: if the input fits in available memory, NTsort does one pass. If not, NTsort allocates memory appropriate for a two-pass sort -- essentially the square root of the input file size times the transfer size, about 20 MB in the case of a 1 GB sort. NTsort then reads a block of data, does a key-prefix QuickSort of it, and writes the sorted run to a temporary file. When all of the input has been converted to temporary run files, NTsort merges the runs into the output using a key tournament tree. NTsort does not overlap or pipeline the phase-one IO. If the input or output is a file, NTsort uses unbuffered IO.
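The structure of such a sort can be sketched in a few lines. This is a hedged simplification, not NTsort itself: records here are newline-terminated lines rather than 100-byte records, the key-prefix QuickSort is replaced by Python's built-in sort, and the key tournament tree by heapq's k-way merge:

```python
import heapq
import math
import os
import tempfile

def two_pass_sort(infile, outfile, memory_bytes):
    """Pass 1: read memory-sized chunks, sort each, write runs to temp files.
    Pass 2: merge all runs into the output with a k-way (tournament) merge."""
    runs = []
    with open(infile, "r") as f:
        while True:
            chunk = f.readlines(memory_bytes)   # roughly memory_bytes of lines
            if not chunk:
                break
            chunk.sort()                        # in-memory sort of one run
            tmp = tempfile.NamedTemporaryFile("w", delete=False)
            tmp.writelines(chunk)
            tmp.close()
            runs.append(tmp.name)
    with open(outfile, "w") as out:
        files = [open(r) for r in runs]
        out.writelines(heapq.merge(*files))     # k-way merge of sorted runs
        for f in files:
            f.close()
    for r in runs:
        os.remove(r)

def two_pass_memory(input_bytes, transfer_bytes=256 * 1024):
    """Memory for a two-pass sort: sqrt(file size x transfer size).
    About 16 MB for a 1 GB file with 256KB transfers (the paper rounds to ~20 MB)."""
    return math.sqrt(input_bytes * transfer_bytes)
```

The memory formula is what makes the sort two-pass: with sqrt(N x T) bytes of memory, the number of runs times the per-run merge buffer of T bytes again fits in memory, so one merge pass suffices.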

The complete documentation for NTsort is given in the text box above. Minimal is one word that comes to mind. Anyone wanting a full-function sort should look elsewhere. Both PostmanSort and NitroSort are inexpensive, well documented, and have a full-function API and command interface. NitroSort has a very nice GUI as well. Nevertheless, NTsort is "free" and it can run the PennySort benchmark to show off NT's IO performance.


We generated a 15 M record (1.5GB) file using the SortGen program, then sorted it with NTsort (using the "C" locale), PostmanSort, and NitroSort. The results are shown in the table below. NitroSort limits itself to 4 MB of memory by default; for a 1.5GB sort that implies a three-pass sort and much longer run times, so we gave NitroSort a hint telling it to use 21 MB of memory, after which it ran in reasonable time. Both PostmanSort and NitroSort use NT file system buffered IO (rather than direct IO), which explains their much larger kernel times. On the other hand, they use much less user time -- 2x to 4x less than NTsort -- showing that they have much better algorithms. Since the sort is IO bound, NTsort's higher cpu cost is not reflected in the elapsed time.

Product         | Elapsed (s) | Kernel (s) | User (s) | Total cpu (s) | Cost (pennies)
----------------|-------------|------------|----------|---------------|---------------
NTsort          |         820 |          7 |      398 |           405 |           0.99
Postman         |         821 |         68 |      169 |           237 |           1.12
Nitro (no hint) |        2642 |         89 |      107 |           196 |           4.03
Nitro (hint)    |        1291 |         78 |      117 |           195 |           1.97

If NTsort properly overlapped IO and computation, it should run in about 450 seconds on this hardware -- almost twice as fast.
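The ~450-second estimate is close to the raw disk-bandwidth lower bound. A rough calculation, assuming the 7.8 MBps average rate measured earlier and perfect overlap across both drives (real runs add seeks and imperfect overlap, which is why the estimate is above this bound):

```python
# IO lower bound for a fully overlapped two-pass sort of the 1.5 GB file.
file_bytes = 1.5e9
transfers  = 4        # read input, write runs, read runs, write output
disk_rate  = 7.8e6    # average bytes/second per drive (measured above)
disks      = 2        # both drives active in each phase

lower_bound = file_bytes * transfers / (disk_rate * disks)
print(round(lower_bound))   # ~385 seconds of pure disk transfer
```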

The point of this exercise was to show how inexpensively commodity hardware can perform IO-intensive tasks. The next obvious step is to port NTsort, or one of the other sorts, to a cluster and replicate the NOWsort work (they sorted 9GB on 100 nodes in 60 seconds). A cluster of 100 PennySort machines should cost about 10 times less (each node would only process 250 MB of data) and run about three times faster.


Price/Performance Sort

PennySort seems like a poor benchmark definition. The idea of a fixed-price benchmark is not workable when price/performance doubles every year: you can do 1,000 times more a decade later for the same price. In particular, the PennySort of the year 2008 is likely to sort 1.5 TB, running for 800,000 seconds (over nine days) on a one-dollar computer!

The goal is to have a simple and portable I/O benchmark that measures the system's price/performance. The simple way out of this seems to be to use the two-pass minute-sort benchmark, but just aim for minimum price measured in $/GB sorted. To be specific:

  1. Sort the largest file you can in a minute.
  2. Compute the system price per minute in the standard way (3-year depreciation).
  3. Compute the GB/$ by dividing item 1 by item 2.

The virtue of picking a minute is that it gives direct comparability with MinuteSort performance and price performance.
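As a sanity check, the proposed metric applied to the NTsort run above (1.5 GB in 820 seconds on the 1,142$ machine) lands in the neighborhood of the 144 GB/$ entry in the table below; the exact value depends on rounding and on whether GB is taken as decimal or binary:

```python
# The proposed Price/Performance metric: GB sorted per dollar of system time.
SECONDS_PER_3_YEARS = 3 * 365 * 24 * 3600   # standard 3-year depreciation

def gb_per_dollar(gb_sorted, elapsed_seconds, system_price_dollars):
    """GB sorted per dollar of depreciated system time."""
    dollars_used = system_price_dollars * elapsed_seconds / SECONDS_PER_3_YEARS
    return gb_sorted / dollars_used

print(round(gb_per_dollar(1.5, 820, 1142)))   # ~152 GB/$ for the NTsort run
```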

The current Price/Performance results are:

Year | MB/sec | GB/$    | Time (sec) | System, reference           | Price (M$) | cpus
-----|--------|---------|------------|-----------------------------|------------|-----
1985 |   0.02 |   0.053 |       6000 | M6800, Bitton et al. [7,8]  |     0.03   |    1
1986 |   0.03 |   0.009 |       3600 | Tandem, Tsukerman [19,20]   |     0.3    |    3
1987 |   3.85 |   0.052 |         26 | Cray YMP, Weinberger [21]   |     7      |    1
1990 |   0.31 |   0.148 |        320 | Kitsuregawa [12]            |     0.2    |    1
1991 |  14.29 |   0.541 |          7 | IBM 3090, DFsort/Saber      |     2.5    |    1
1993 |   1.20 |   0.114 |         83 | Sequent, Graefe [11]        |     1      |   32
1994 |   1.72 |   0.163 |         58 | IPSC/Wisc, DeWitt [10]      |     1      |   32
1994 |  11.11 |   5.256 |          9 | Alpha, Nyberg [16]          |     0.2    |    1
1995 |  28.57 |   2.703 |        3.5 | SGI/Ordinal, Nyberg [16]    |     1      |   16
1995 |  19.61 |  37.101 |        5.1 | IBM, Agarwal [2]            |     0.05   |    1
1996 | 100.00 |  15.768 |         60 | NOW, Arpaci-Dusseau [3]     |     0.6    |   32
1997 | 155.00 |   7.332 |         60 | NOW 100, Arpaci-Dusseau [3] |     2      |  100
1997 |  86.21 |   6.274 |         58 | SGI/Ordinal, Nyberg [17]    |     1.3    |   14
1998 |   1.83 | 144.220 |        820 | NTsort                      |     0.0012 |    1


References

[1] Anon et al., "A Measure of Transaction Processing Power," Datamation, V. 31(7), April 1985, pp. 112-118. Also in Readings in Database Systems, M.J. Stonebraker ed., Morgan Kaufmann, San Mateo, 1989.

[2] R. C. Agarwal. "A Super Scalar Sort Algorithm for RISC Processors." ACM SIGMOD '96, pp. 240--246, June 1996.

[3] Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H., Culler, D.E., Hellerstein, J.M., Patterson, D.A., "High-Performance Sorting on Networks of Workstations," ACM SIGMOD '97, Tucson, Arizona, May 1997.

[4] Baer, J.L., Lin, Y.B., "Improving Quicksort Performance with Codeword Data Structure", IEEE Trans. on Software Engineering, 15(5). May 1989. pp. 622-631.

[5] Baugsto, B.A.W., Greipsland, J.F., "Parallel Sorting Methods for Large Data Volumes on a Hypercube Database Computer", Proc. 6th Int. Workshop on Database Machines, Deauville France, Springer Verlag Lecture Notes No. 368, June 1989, pp.: 126-141.

[6] Baugsto, B.A.W., Greipsland, J.F., Kamerbeek, J. "Sorting Large Data Files on POMA," Proc. CONPAR-90VAPP IV, Springer Verlag Lecture Notes No. 357, Sept. 1990, pp.: 536-547.

[7] Bitton, D., Design, Analysis and Implementation of Parallel External Sorting Algorithms, Ph.D. Thesis, U. Wisconsin, Madison, WI, 1981

[8] Beck, M., Bitton, D., Wilkenson, W.K., "Sorting Large Files on a Backend Multiprocessor", IEEE Transactions on Computers, V. 37(7), pp. 769-778, July 1988.

[9] Conner, W.M., Offset Value Coding, IBM Technical Disclosure Bulletin, V 20(7), Dec. 1977, pp. 2832-2837

[10] DeWitt, D.J., Naughton, J.F., Schneider, D.A. "Parallel Sorting on a Shared-Nothing Architecture Using Probabilistic Splitting", Proc. First Int Conf. on Parallel and Distributed Info Systems, IEEE Press, Jan 1992, pp. 280-291

[11] Graefe, G, S.S. Thakkar, "Tuning a Parallel Sort Algorithm on a Shared-Memory Multiprocessor", Software Practice and Experience, 22(7), July 1992, pp. 495.

[12] Kitsuregawa, M., Yang, W., Fushimi, S. "Evaluation of an 18-stage Pipeline Hardware Sorter", Proc. 6th Int. Workshop on Database Machines, Deauville France, Springer Verlag Lecture Notes No. 368, June 1989, pp. 142-155.

[13] Knuth, D.E., The Art of Computer Programming, V. 3: Sorting and Searching, Addison Wesley, Reading, MA, 1973.

[14] Lorie, R.A., and Young, H. C., "A Low Communications Sort Algorithm for a Parallel Database Machine," Proc. Fifteenth VLDB, Amsterdam, 1989, pp. 125-134.

[15] Lorin, H. Sorting, Addison Wesley, Englewood Cliffs, NJ, 1974.

[16] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. "AlphaSort: A RISC Machine Sort." ACM SIGMOD '94, May 1994.

[17] Nyberg, C. J., " Nsort: a Parallel Sorting Program for NUMA and SMP Machines", Nov.1997

[18] Shumate, J., "4 Programs Make NT Sort of Fast," PC Week, August 1, 1997.

[19] Salzberg, B., et al., "FastSort– An External Sort Using Parallel Processing", Proc. SIGMOD 1990, pp. 88-101.

[20] Tsukerman, A., "FastSort– An External Sort Using Parallel Processing" Tandem Systems Review, V 3(4), Dec. 1986, pp. 57-72.

[21] Weinberger, P.J., Private communication 1986.

[22] Yamane, Y., Take, R., "Parallel Partition Sort for Database Machines," Database Machines and Knowledge Based Machines, Kitsuregawa and Tanaka eds., pp. 117-130, Kluwer Academic Publishers, 1988.