October 22, 2008 — originally posted on Neopythonic
Someone jokingly asked me how I would sort a million 32-bit integers in 2 megabytes of RAM, using Python. Taking up the challenge, I learned something about buffered I/O.
Obviously this is a joke question -- the data alone would take up 4 megabytes, assuming binary encoding! But there's a possible interpretation: given a file containing a million 32-bit integers, how would you sort them with minimal memory usage? This would have to be some kind of merge sort, where small chunks of the data are sorted in memory and written to a temporary file, and then the temporary files are merged into the eventual output area.
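Before walking through the solution, it helps to have such an input file to play with. Here's a minimal sketch of generating one (the filename "input.dat" and this snippet are my own, not part of the original post):

```python
import array, random

# Write 1,000,000 random 32-bit signed integers in native binary form.
# The filename "input.dat" is just an example.
ints = array.array('i', (random.randint(-2**31, 2**31 - 1)
                         for _ in range(1000000)))
with open('input.dat', 'wb') as f:
    ints.tofile(f)
```

The sort script then reads this file on standard input and writes the sorted result to standard output.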
Here is my solution. I'll annotate it in a minute.
NOTE: All my examples use Python 3.0. The main difference in this case is the use of file.buffer to access the binary stream underlying the text stream file.
#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4

def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x

iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))

a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)

On my Google desktop (a 3-year-old PC running a Googlified Linux, rating about 34000 Python 3.0 pystones) this took about 5.4 seconds to run, with an input file containing exactly 1,000,000 32-bit random integers. That's not so bad, given that a straightforward in-memory sort of the same input took about 2 seconds:
#!/usr/bin/env python3.0
import sys, array
a = array.array('i', sys.stdin.buffer.read())
a = list(a)
a.sort()
a = array.array('i', a)
a.tofile(sys.stdout.buffer)

Back to the merge-sort solution. The first three lines are obvious:
#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4

The first line says we're using Python 3.0. The second line imports the modules we're going to need. The third line makes it break on those 64-bit systems where the 'i' typecode doesn't represent a 32-bit int; I am making no attempt to write this code portably.
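(As an aside -- my own sketch, not part of the original post -- if you did care about portability, you could pick a 4-byte typecode dynamically instead of asserting:)

```python
import array

# Scan candidate signed-integer typecodes for one that is 4 bytes wide;
# on most platforms this picks 'i', but on some it would pick 'l'.
typecode = next(tc for tc in 'ilq' if array.array(tc).itemsize == 4)
```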
def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x

This is where the performance tuning of the algorithm takes place: it reads up to 1000 integers at a time, and yields them one by one. I had originally written this without buffering -- it would just read 4 bytes from the file, convert them to an integer, and yield the result. But that ran about 4 times as slow! Note that we can't use
a.fromfile(f, 1000) because the fromfile() method complains bitterly when there aren't enough values in the file, and I want the code to adapt automatically to however many integers are left in the file. (It turns out we write about 10,000 integers to a typical temp file.)

Next we have the input loop. This repeatedly reads a chunk of 10,000 integers from the input file, sorts them in memory, and writes them to a temporary file. We then add an iterator over that temporary file, using the above intsfromfile() function, to a list of iterators that we'll use in the subsequent merge phase.
iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))

Note that for an input containing a million values, this creates 100 temporary files each containing 10,000 values.
Finally we merge all these files (each of which is sorted) together. The heapq module has a really nice function for this purpose: heapq.merge(iter1, iter2, ...) returns an iterator that yields the input values in order, assuming each input itself yields its values in order (as is the case here).
a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)
This is another place where buffered I/O turned out to be essential: writing each individual value to a file as soon as it is available slows the algorithm down about twofold. Thus, by simply adding input and output buffering, we gained a tenfold speed-up!

Another lesson is praise for the heapq module, which contains the iterator merge functionality needed in the output phase. And let's not forget the utility of the array module for managing binary data.
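For comparison, the unbuffered reader described above might be reconstructed roughly like this (the post doesn't show that original code, so the details -- the name and the use of struct -- are my guess):

```python
import struct

def intsfromfile_unbuffered(f):
    # Read one 4-byte integer at a time and yield it; per the
    # measurements above, this ran about 4 times as slow as the
    # buffered version that reads 4000 bytes per call.
    while True:
        data = f.read(4)
        if not data:
            break
        yield struct.unpack('i', data)[0]
```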
And finally, let this example remind you that Python 3.0 is not so different from Python 2.5!
Guido van Rossum — October 23, 2008
@olivier: that's a cool way of doing it, but not for Python, where an int takes 12 bytes and the list takes another 4 bytes per value. Given some other overhead you'd end up having to make a lot of passes over the full input (the same number as would be feasible with my version -- I used 100 but it could likely be less), which would probably be slower. It's interesting that a heap features in both algorithms. I'm not convinced that it's the fastest way to sort though, despite the O(N log N) complexity. Python's built-in sort is wicked.
union — October 23, 2008
For nice code listings you can use google's own client side code highlighting lib:
Guido van Rossum — October 23, 2008
@union: I followed your recipe, and I verified that all the elements are indeed in the page, but it doesn't work. What am I missing?
Boris Bluntschli — October 23, 2008
I think if you change the first two lines to
union — October 23, 2008
Yes, Boris is correct; I included wrong URLs (but somehow managed to put the right ones in my template).
LetsEatLunch — October 23, 2008
Beastly...
Luis — October 24, 2008
Thanks Guido!
Joel Hockey — October 26, 2008
I had a go at implementing this using a different approach which runs much faster on my computer. I read the file multiple times and select values within a given range that will (hopefully) fit in the 1Mb memory limit. This is then sorted and appended to the output. I choose the ranges by partitioning the possible 32-bit values (-2^31, 2^31-1) into 8 sections.
Ralph Corderoy — October 27, 2008
Hi Joel Hockey, splitting 2**32 into 8 gives a range of 2**29. All 1,000,000 integers, roughly 2**20, could easily fit into the first slice, so you'd be sorting the whole file in one go, thereby exceeding the memory limitation. You'd need to split 2**32 into many more chunks than 8 to avoid exceeding memory, and that will probably slow you down a lot. Cheers, Ralph.
Unknown — October 27, 2008
How about using Numpy to read them in with frombuffer()? Something like in the multiprocess test here.
Joel Hockey — October 27, 2008
Hi Ralph. Yes, it is possible for this approach to exceed the 2M memory limit if the numbers are skewed. I should have mentioned that it makes the assumption that the numbers are uniformly random.
Ted Hosmann — October 29, 2008
GvR - this is the post that put your new blog on my radar. Subscribe to feed, check.
Jurjen — November 03, 2008
This comment has been removed by the author.
Jurjen — November 03, 2008
Hmm. Still, I am not convinced that the disk buffers for these 100 files don't actually just contain all the numbers, so that you seem to be cheating a bit: it is a "sort in disk buffer space".
Stoune — November 18, 2008
This is a task more for Donald Knuth.
Travis Oliphant — December 04, 2008
For what it's worth, NumPy and memory-mapped arrays can solve this problem quite simply and quickly (how much memory is used depends on the OS):
Pranav Prakash — January 09, 2009
For a beginner like me, this code and post were surely educative enough. I have one more point for "why Python".
Anonymous — February 03, 2009
private void SortFile(string unsortedFileName, string sortedFileName)
Wallacoloo — February 23, 2010
@Linn: By reading all 1000000 4-byte integers in at once, you're using at least 4000000 bytes of memory. But you're only given 2MB. Plus, it looks like you're using another 4000000 bytes in your "file" array.
FutureNerd — August 03, 2010
The question is not a joke! But since it seems like the kind of thing Microsoft would use in a job-interview puzzle (pointless obsessive byte-squeezing optimization!), I'll be a little cryptic:
Anand — July 11, 2011
http://anandtechblog.blogspot.com/2010/12/how-to-sort-4gb-files-google-question.html
Shriram Kunchanapalli — April 26, 2012
@Anand: What's the run time complexity of your solution?
Anonymous — October 25, 2014
I use python for some specific purposes in my research (infectious disease modeling). So I haven't worked with problems like this before. I'm trying to learn more about python, so here I am.
Guido van Rossum — October 25, 2014
Joel: the magic is all in heapq.merge(). It does the "merge" phase of a merge-sort. Think of what happens at the very beginning. There are 1000 files each with 1000 sorted numbers. We want the smallest number of all, which must be the first number in one of the files (because they are sorted). It finds and yields this value, and advances that file's iterator. Now we want the next smallest number. Again, this must be at the head of one of the file iterators (but not necessarily the same one). And so on, until all files have been exhausted.
Anonymous — October 26, 2014
Thanks for the quick reply.