
Sorting a million 32-bit integers in 2MB of RAM using Python

October 22, 2008 — originally posted on Neopythonic


Someone jokingly asked me how I would sort a million 32-bit integers in 2 megabytes of RAM, using Python. Taking up the challenge, I learned something about buffered I/O.

Obviously this is a joke question -- the data alone would take up 4 megabytes, assuming binary encoding! But there's a possible interpretation: given a file containing a million 32-bit integers, how would you sort them with minimal memory usage? This would have to be some kind of merge sort, where small chunks of the data are sorted in memory and written to a temporary file, and then the temporary files are merged into the eventual output file.

Here is my solution. I'll annotate it in a minute.

NOTE: All my examples use Python 3.0. The main difference in this case is the use of file.buffer to access the binary stream underlying the text stream file.

#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4

def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x

iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))

a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)

On my Google desktop (a 3-year-old PC running a Googlified Linux, rating about 34000 Python 3.0 pystones) this took about 5.4 seconds to run, with an input file containing exactly 1,000,000 32-bit random integers. That's not so bad, given that a straightforward in-memory sort of the same input took about 2 seconds:

#!/usr/bin/env python3.0
import sys, array
a = array.array('i', sys.stdin.buffer.read())
a = list(a)
a.sort()
a = array.array('i', a)
a.tofile(sys.stdout.buffer)

Back to the merge-sort solution. The first three lines are obvious:

#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4

The first line says we're using Python 3.0. The second line imports the modules we're going to need. The third line makes the code break on those 64-bit systems where the 'i' typecode doesn't represent a 32-bit int; I am making no attempt to write this code portably.

Then we define a little helper that is a generator which reads integers from a file and yields them one at a time:
def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x

This is where the performance tuning of the algorithm takes place: it reads up to 1000 integers (4000 bytes) at a time, and yields them one by one. I had originally written this without buffering -- it would just read 4 bytes from the file, convert them to an integer, and yield the result. But that ran about four times slower! Note that we can't use a.fromfile(f, 1000), because the fromfile() method complains bitterly when there aren't enough values in the file, and I want the code to adapt automatically to however many integers are in the file. (It turns out we write about 10,000 integers to a typical temp file.)
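The fromfile() complaint can be seen in a few lines. (This sketch uses a tiny two-integer file for illustration; on modern Pythons the fromstring()/tostring() names have been replaced by frombytes()/tobytes(), but fromfile() behaves the same.)

```python
import array
import tempfile

# Write just two 32-bit ints to a scratch file.
f = tempfile.TemporaryFile()
array.array('i', [1, 2]).tofile(f)
f.seek(0)

a = array.array('i')
try:
    a.fromfile(f, 1000)   # ask for far more items than the file holds
except EOFError:
    pass                  # fromfile() raises instead of returning short...
# ...but the items that *were* available are still appended to the array.
print(list(a))            # [1, 2]
```

This is why the generator reads a fixed number of bytes with f.read() and lets the array constructor take whatever arrived, rather than asking fromfile() for an exact count.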

Next we have the input loop. This repeatedly reads a chunk of 10,000 integers from the input file, sorts them in memory, and writes them to a temporary file. We then add an iterator over that temporary file, using the above intsfromfile() function, to a list of iterators that we'll use in the subsequent merge phase.

iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))

Note that for an input containing a million values, this creates 100 temporary files each containing 10,000 values.

Finally we merge all these files (each of which is sorted) together. The heapq module has a really nice function for this purpose: heapq.merge(iter1, iter2, ...) returns an iterator that yields the input values in order, assuming each input itself yields its values in order (as is the case here).
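A toy illustration of heapq.merge's contract, independent of this program (the runs here are made up for the example):

```python
import heapq

# Three runs, each already sorted -- the precondition heapq.merge relies on.
runs = [[1, 4, 7], [2, 5, 8], [0, 3, 6, 9]]

# merge() yields the union of the runs in globally sorted order,
# holding only one "head" value per run at a time.
merged = list(heapq.merge(*runs))
print(merged)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The output loop below does exactly this, with 100 file-backed iterators in place of the toy lists.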

a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)

This is another place where buffered I/O turned out to be essential: Writing each individual value to a file as soon as it is available slows the algorithm down about twofold. Thus, by simply adding input and output buffering, we gained a tenfold speed-up!
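The output-buffering pattern can be isolated into a small sketch (write_buffered, the bufsize parameter, and the BytesIO target are mine, not the post's; the post writes straight to sys.stdout.buffer):

```python
import array
import io

def write_buffered(values, out, bufsize=1000):
    # Accumulate values in an array and flush in large chunks, so the
    # underlying stream sees one write() per bufsize values instead of
    # one write() per value.
    a = array.array('i')
    for x in values:
        a.append(x)
        if len(a) >= bufsize:
            a.tofile(out)
            del a[:]
    if a:                 # flush the final partial chunk
        a.tofile(out)

buf = io.BytesIO()
write_buffered(range(2500), buf)
# Same bytes as writing the whole array in one go, but with bounded memory.
assert buf.getvalue() == array.array('i', range(2500)).tobytes()
```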

And that's the main moral of the story.

Another lesson is praise for the heapq module, which contains the iterator merge functionality needed in the output phase. Also, let's not forget the utility of the array module for managing binary data.

And finally, let this example remind you that Python 3.0 is not so different from Python 2.5!

Comments (25)

Guido van Rossum · October 23, 2008

@olivier: that's a cool way of doing it, but not for Python, where an int takes 12 bytes and the list takes another 4 bytes per value. Given some other overhead you'd end up having to make a lot of passes over the full input (the same number as would be feasible with my version -- I used 100 but it could likely be less), which would probably be slower. It's interesting that a heap features in both algorithms. I'm not convinced that it's the fastest way to sort though, despite the O(N log N) complexity. Python's built-in sort is wicked.

union · October 23, 2008

For nice code listings you can use google's own client side code highlighting lib:

google-code-prettify

My post on how to set it up:

Code highlighting on blogger

Guido van Rossum · October 23, 2008

@union: I followed your recipe, and I verified that all the elements are indeed in the page, but it doesn't work. What am I missing?

Boris Bluntschli · October 23, 2008

I think if you change the first two lines to

<link href="http://google-code-prettify.googlecode.com/svn/trunk/src/prettify.css" rel="stylesheet" type="text/css"/>
<script src="http://google-code-prettify.googlecode.com/svn/trunk/src/prettify.js" type="text/javascript">

it should work. You're currently linking to the SVN repository browser, not the files themselves.

union · October 23, 2008

Yes, Boris is correct; I included the wrong URLs (but somehow managed to put the right ones in my template).

I have corrected my original post, or you can just use the tags Boris posted in his comment.

LetsEatLunch · October 23, 2008

Beastly...

Luis · October 24, 2008

Thanks Guido!

By the way, when are you replying to the other questions? :-)

Joel Hockey · October 26, 2008

I had a go at implementing this using a different approach which runs much faster on my computer. I read the file multiple times and select values within a given range that will (hopefully) fit in the 1Mb memory limit. This is then sorted and appended to the output. I choose the ranges by partitioning the possible 32-bit values (-2^31, 2^31-1) into 8 sections.

The timing numbers I get are:
in-memory sort: 2.09s
guido merge sort: 10.90s
my select range sort: 2.92s

#!/usr/bin/env python3.0
import array

outf = open('joel-sorted.dat', 'wb')

INT_MIN, INT_MAX = -(2**31), 2**31-1
size = int(2**32 / 8)
cutoffs = range(INT_MIN, INT_MAX, size)

# read file through and only collect values in our range
for i in range(len(cutoffs)):
  inf = open('nums.dat', 'rb')
  min, max = cutoffs[i], cutoffs[i]+size
  a = array.array('i')
  while True:
    temp = array.array('i', inf.read(500000))
    if not temp:
      break
    for j in temp:
      if j >= min and j < max:
        a.append(j)

  # sort and append to output
  array.array('i', sorted(a)).tofile(outf)

Ralph Corderoy · October 27, 2008

Hi Joel Hockey, splitting 2**32 into 8 gives a range of 2**29. All 1,000,000 integers, roughly 2**20, could easily fit into the first slice so you'd be sorting the whole file in one go; thereby exceeding the memory limitation. You'd need to split 2**32 into many more chunks than 8 to avoid exceeding memory, and that will probably slow you down a lot. Cheers, Ralph.

Unknown · October 27, 2008

How about using NumPy to read them in with frombuffer()? Something like in the multiprocess test here.

Joel Hockey · October 27, 2008

Hi Ralph. Yes it is possible for this approach to exceed the 2M memory limit if the numbers are skewed. I should have mentioned that it makes the assumption that the numbers are uniformly random.

If the 1M 32-bit ints are sufficiently uniform though, then this approach uses about 1Mb of memory. The range of possible values for a 32-bit signed int is [-2147483648, 2147483648) (lower number is included, upper number is excluded). I break this range up into 8 smaller ranges - [-2147483648, -1610612736), [-1610612736, -1073741824), ... [1610612736, 2147483648). Out of 1 million uniformly random 32-bit numbers, you would expect 125,000 to fall into each range. In fact, as long as there are not more than 250,000 numbers in each sub-range, the approach will not exceed the memory limit.

I chose 8 sub-ranges to be conservative. You could choose 3 or 4, or even possibly 2, but then you would need to sort the numbers in place and be a lot trickier, which is likely to make the code slower. The performance for 8 was not much worse than 3, and not much different from a full in-memory sort. My calculation for the memory used is:
* 500,000 bytes holding input numbers which can be reclaimed before sub-range sorting.
* 500,000 (approx) bytes holding numbers in given sub-range (125,000 32-bit ints)
* 500,000 (approx) bytes to hold sorted numbers in sub-range.

Total mem: approx 1Mb

Like I said, this does assume that the numbers are uniformly random. I did consider doing a first pass of the numbers to work out the optimal bounds for the sub-ranges, but this crude approach was enough to satisfy my curiosity.

Hopefully Jon Bentley will show up soon and wow us all with some way to do it that is way faster and uses less memory.

Ted Hosmann · October 29, 2008

GvR - this is the post that put your new blog on my radar. Subscribe to feed, Check.

I am looking forward to reading many more posts - keep it up.

Also, thanks for Python.

Jurjen · November 03, 2008

This comment has been removed by the author.

Jurjen · November 03, 2008

Hmm. Still, I am not convinced that the disk buffers for these 100 files don't actually just contain all the numbers, so you seem to be cheating a bit: it is a "sort in disk buffer space".
It would be really hard, I think, to actually force the program to use little memory.
Still, heapq.merge is cool!

Stoune · November 18, 2008

This is a task more for Donald Knuth.
I think Python's influence is not so big.

Travis Oliphant · December 04, 2008

For what it's worth, NumPy and memory-mapped arrays can solve this problem quite simply and quickly (how much memory is used depends on the OS):

import numpy as np

np.memmap("file_with_integers", dtype=int).sort(kind='merge')

Pranav Prakash · January 09, 2009

For a beginner like me, this code and post were surely educative. I have one more point for "why Python".

Anonymous · February 03, 2009

private void SortFile(string unsortedFileName, string sortedFileName)
{
    // Read the file into an array of ints
    byte[] file = System.IO.File.ReadAllBytes(unsortedFileName);
    int[] numbers = new int[file.Length / 4];
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = BitConverter.ToInt32(file, i * 4);
    }

    // Sort the array
    Array.Sort(numbers);

    // Write the results back to file
    byte[] tempArray = new byte[4];
    for (int j = 0; j < numbers.Length; j++)
    {
        tempArray = BitConverter.GetBytes(numbers[j]);
        file[4 * j] = tempArray[0];
        file[4 * j + 1] = tempArray[1];
        file[4 * j + 2] = tempArray[2];
        file[4 * j + 3] = tempArray[3];
    }

    System.IO.File.WriteAllBytes(sortedFileName, file);
}

Wallacoloo · February 23, 2010

@Linn: By reading all 1,000,000 4-byte integers in at once, you're using at least 4,000,000 bytes of memory. But you're only given 2MB. Plus, it looks like you're using another 4,000,000 bytes in your "file" array.

FutureNerd · August 03, 2010

The question is not a joke! But since it seems like the kind of thing Microsoft would use in a job-interview puzzle (pointless obsessive byte-squeezing optimization!), I'll be a little cryptic:

With a main working array that's only 2,000,000 bytes (Python or any language will have some fixed overhead, I think it's fair not to count that), you can write the program so that the only I/O it does is to read the unsorted input file in one pass, then write the sorted output file in one pass. No temporary files, seeks, additional buffers, or virtual arrays are needed.

2e6 bytes is a small percentage more than is needed. Given a fixed percentage slack, the run time scales like n log n.

In Python you can avoid the memory overhead that Guido mentions, by using the array module.

Anand · July 11, 2011

http://anandtechblog.blogspot.com/2010/12/how-to-sort-4gb-files-google-question.html

Shriram Kunchanapalli · April 26, 2012

@Anand: What's the run time complexity of your solution?

Anonymous · October 25, 2014

I use python for some specific purposes in my research (infectious disease modeling). So I haven't worked with problems like this before. I'm trying to learn more about python, so here I am.

I seem to be misunderstanding the code - to me it doesn't look like it solves the problem. Can someone explain to me what my mistake is?

It looks to me like the file is read in piece by piece. Each piece is sorted, and then put into a temporary file. The temporary files are eventually merged together into a longer file. So it seems that the final file is going to be a bunch of ordered sections that are not themselves in order.

I feel like it's a problem to do with me not understanding the heapq.merge command completely.

Guido van Rossum · October 25, 2014

Joel: the magic is all in heapq.merge(). It does the "merge" phase of a merge sort. Think of what happens at the very beginning. There are 100 files, each with 10,000 sorted numbers. We want the smallest number of all, which must be the first number in one of the files (because they are sorted). It finds and yields this value, and advances that file's iterator. Now we want the next smallest number. Again, this must be at the head of one of the file iterators (but not necessarily the same one). And so on, until all files have been exhausted.

Anonymous · October 26, 2014

Thanks for the quick reply.

I felt like there had to be something in that step I didn't understand.
