Why sorting an array makes a Python loop faster

This morning I ran across a blog post that presented two Python scripts, one of which was much more efficient than the other. The blog post has since been deleted so I can’t link to it, but the scripts basically boil down to this (call them fast.py and slow.py):

# fast.py
import time
a = [i for i in range(1000000)]
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2-t1

# slow.py
import time
from random import shuffle
a = [i for i in range(1000000)]
shuffle(a)
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2-t1

As you can see, the two scripts compute exactly the same thing: each builds a list of the first one million integers and prints the time it takes to sum them. The only difference is that slow.py first randomizes the order of the integers. Surprising as it may seem, this randomization alone is enough to slow the program down significantly. On my machine running Python 2.7.3, fast.py is consistently a tenth of a second quicker than slow.py (a nontrivial speedup, given that fast.py takes only about a quarter of a second to run in total). Go ahead and try it out yourself. (I haven’t tested it on Python 3, but the results shouldn’t be too different.)
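For anyone who wants to try it on Python 3, here’s a minimal port of slow.py (delete the shuffle line to get fast.py). This is just a sketch that I haven’t benchmarked myself:

import time
from random import shuffle

a = list(range(1000000))  # range is lazy in Python 3, so build the list explicitly
shuffle(a)                # delete this line to get fast.py
total = 0                 # renamed to avoid shadowing the built-in sum()
t1 = time.time()
for i in a:
    total = total + i
t2 = time.time()
print(t2 - t1)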

So why does randomizing the list’s elements result in such a significant slowdown? The author of the original blog post chalked it up to “branch prediction”. If you’re not familiar with the term, check out this StackOverflow question, which explains the concept very well. (My suspicion is that the author of the original post encountered that question or a similar one and applied the idea to a Python snippet where it didn’t quite fit.)

I was doubtful that branch prediction could actually be the culprit here, though. There are no top-level conditional branches in the Python code, and it stands to reason that both scripts take precisely the same branches within the body of the loop: no part of the program is conditional on the integers themselves, and there is no data dependency between the list’s elements. I’m also skeptical that Python is anywhere near “close to the metal” enough for CPU-level branch prediction to be a factor in the performance analysis of a Python script. Python is way too high-level for that.
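One way to convince yourself is to look at the bytecode the loop compiles to: it is identical for both scripts, and none of its jumps depend on the values in the list. Here’s a quick sketch using the standard dis module (the exact opcodes vary by Python version):

import dis

def loop(a):
    total = 0
    for i in a:
        total = total + i
    return total

# On CPython 2.7, the only control flow here is the FOR_ITER at the
# top of the loop and the jump back to it; neither branch depends on
# the values stored in a.
dis.dis(loop)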

So if it’s not branch prediction, just why is slow.py so slow? After a little bit of research, I think I’ve figured it out, but not before making a few “false starts”. The answer requires a little familiarity with the internals of the Python virtual machine.

False start: lists and generators

My first thought was that Python was able to deal with the sorted list [i for i in range(1000000)] much more efficiently than the randomized list. That is, maybe it was substituting a generator like this one for the list:

def numbers():
    # lazily yields 0, 1, ..., 999999 in order
    i = 0
    while i < 1000000:
        yield i
        i += 1

That might be a bit more time-efficient, I thought. After all, if Python internally uses a generator in place of a bona fide list, then it doesn’t have to go to the trouble of keeping all of the integers in memory at once, which might save a lot of overhead. The randomized list in slow.py can’t be easily captured by a simple generator, so the VM wouldn’t be able to make any such optimization.

Unfortunately, this wasn’t a useful observation. If you squeeze an a.sort() in between the shuffle and the loop in slow.py, it becomes as fast as fast.py again. Clearly there is something about having these numbers sorted that makes the program faster.
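Concretely, that experiment is just slow.py with one extra line:

import time
from random import shuffle
a = [i for i in range(1000000)]
shuffle(a)
a.sort()  # undo the shuffle; this single line recovers fast.py's speed
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2-t1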

False start: lists vs. arrays

My next thought was that the data structure might be giving us caching issues. a is a “list”, which (understandably) led me to believe that a was implemented under the hood as a linked list. If the shuffle operation rearranges the linked list’s nodes in place, then fast.py could take advantage of cache locality by allocating all of its nodes adjacently, while slow.py would suffer lots of cache misses, because each node would reference another node that is unlikely to be on the same cache line.

… Unfortunately, that didn’t get me anywhere, either. Python’s list objects aren’t linked lists but are arrays in the truest sense of the word. In particular, this is the C structure definition of a Python list object:

typedef struct {
  PyObject_VAR_HEAD
  PyObject **ob_item;
  Py_ssize_t allocated;
} PyListObject;

… in other words, ob_item is a pointer to an array of pointers to PyObjects, and allocated is the number of slots we’ve allocated for that array. So that’s no help in solving this problem, either (though it did soothe a few of my uncertainties about the algorithmic complexity of some list operations in Python: yes, list appends are amortized O(1); yes, accessing an arbitrary element of a list is O(1); and so on). Now I’m just trying to figure out why Guido chose to call them “lists” rather than “arrays”, which is what they are.
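If you want to check that constant-time indexing for yourself, here’s a quick sketch with the standard timeit module. The absolute numbers are machine-dependent, but the two lookups should cost about the same:

import timeit

setup = "a = [i for i in range(1000000)]"
# indexing is a single pointer lookup into ob_item,
# no matter where in the array the index falls
print timeit.timeit("a[0]", setup=setup, number=1000000)
print timeit.timeit("a[999999]", setup=setup, number=1000000)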

The solution: integral objects

An array’s elements are all adjacent in memory, so the data structure isn’t giving us cache locality problems. It turns out that cache locality is the problem that makes slow.py so slow, but it comes from an unexpected place. In Python, integers are heap-allocated objects rather than simple values. In particular, this is what an integer object looks like in the virtual machine:

typedef struct {
  PyObject_HEAD
  long ob_ival;
} PyIntObject;

The only “interesting” field of this structure is ob_ival, the value of the integer as a C long. If it seems wasteful to you to heap-allocate an entire object just for an integer, you’re probably right, and a lot of languages make optimizations to avoid doing exactly this. For example, Matz’s Ruby Interpreter generally stores objects as pointers but makes an exception for integers, since they’re used so frequently. In short, MRI crams a Fixnum into the same space as an object reference and tags the least significant bit to signal that it’s an integer rather than a pointer (this works because malloc returns addresses aligned to a power of two, so a genuine pointer’s low bit is always zero). At that point, extracting the integer’s value is just a right shift: no heap allocation or indirection needed. slow.py and fast.py would have been equally speedy (and it’s likely they both would have been faster) if CPython had made a similar optimization.
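To make the trick concrete, here’s a toy model of that tagging scheme, written in Python purely for illustration (MRI does this in C on raw machine words):

def tag_fixnum(n):
    # shift the value left and set the low bit; a pointer returned
    # by malloc is aligned, so its low bit is always 0 and can't collide
    return (n << 1) | 1

def is_fixnum(word):
    return word & 1 == 1

def untag_fixnum(word):
    # a plain right shift recovers the value: no heap access required
    return word >> 1

word = tag_fixnum(42)
assert is_fixnum(word)
assert untag_fixnum(word) == 42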

So how does CPython deal with integers, then? What is the behavior of the interpreter that’s giving us so much trouble? The Python interpreter allocates integers in “blocks” of about 40 at a time. When Python needs to create a new integer object, it carves out the next available space in the current integer “block” and stores the integer there. Since our code allocates one million integers in a row, adjacent integers will (for the most part) be placed adjacently in memory. Therefore, iterating through the first million integers in order exhibits good cache locality, and iterating through the first million integers in a random order will result in frequent cache misses.
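In CPython you can get a rough picture of this allocation pattern from id(), which returns the object’s memory address in that implementation. This is only a heuristic probe (run it in a fresh interpreter, before the integer free list has been churned), but consecutive fresh integers should mostly sit one constant stride apart:

# CPython-specific probe: id() is the object's memory address.
# Start above 256 to dodge the small-int cache, which preallocates
# the integers -5 through 256 once at interpreter startup.
a = [i + 1000 for i in range(20)]
addrs = [id(x) for x in a]
print [addrs[k+1] - addrs[k] for k in range(len(addrs) - 1)]
# adjacent allocations should mostly differ by one constant stride
# (the size of a PyIntObject), jumping whenever a new block starts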

So the answer to the question “why does sorting this array make this code faster” is that it doesn’t, really. The non-shuffled array is faster to iterate through not because it’s sorted, but because we access the integer objects in the same order that they were allocated (and they do have to be allocated).

Edit, September 4: In the original article, I included a bit about how many CPU instructions it takes to print out an integer in Python. The figure was a bit misleading, so I removed it. Don’t go looking for it; it’s gone now.