This morning I ran across a blog post that presented two Python scripts, one of which was much more efficient than the other. The blog post has since been deleted so I can’t link to it, but the scripts basically boil down to this:

```python
# fast.py
import time

a = [i for i in range(1000000)]
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2 - t1
```

```python
# slow.py
import time
from random import shuffle

a = [i for i in range(1000000)]
shuffle(a)
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2 - t1
```

As you can see, the two scripts have nearly identical behavior. They both generate a list of the first one million integers and print the time it takes to compute their sum. The only difference is that `slow.py` first randomizes the order of the integers. Although it may seem surprising, this randomization is enough to slow the program significantly. On my machine running Python 2.7.3, `fast.py` is consistently a tenth of a second faster than `slow.py` (a nontrivial speedup, given that `fast.py` takes only about a quarter of a second total to run). Go ahead and try it out yourself. (I haven’t tested it on Python 3, but the results shouldn’t be too different.)

So why does randomizing the list’s elements result in such a significant slowdown? The author of the original blog post chalked it up to “branch prediction”. If you’re not familiar with the term, check out this StackOverflow question, which explains the concept very well. (My suspicion is that the author of this blog post encountered this page or a similar one and applied the idea to a Python snippet where it didn’t quite fit.)

I was doubtful that branch prediction could actually be the culprit here. There are no top-level conditional branches in the Python code, and it stands to reason that both scripts take precisely the same branches within the body of the loop. No part of the program is conditional on the integers themselves, and there is no data dependency between the list’s elements. Besides, I’m skeptical that Python is anywhere near “close to the metal” enough for CPU-level branch prediction to be a factor in the performance analysis of a Python script; it is far too high-level for that.

So if it’s not branch prediction, why is `slow.py` so *slow*? After a little bit of research, I think I’ve figured it out, but not before making a few “false starts”. The answer requires a little familiarity with the internals of the Python virtual machine.

### False start: lists and generators

My first thought was that Python was able to deal with the sorted list `[i for i in range(1000000)]` much more efficiently than the randomized list. That is, maybe it was replacing the list with a generator like this one:

```python
def numbers():
    i = 0
    while i < 1000000:
        yield i
        i += 1
```

That might be a bit more time-efficient, I thought. After all, if Python internally uses a generator in place of a bona fide list, then it doesn’t have to go to the trouble of keeping all of the integers in memory at once, which might save a lot of overhead. The randomized list in `slow.py` can’t be captured by such a simple generator, so the VM wouldn’t be able to make any such optimization.

Unfortunately, this wasn’t a useful observation. If you squeeze an `a.sort()` in between the `shuffle` and the loop in `slow.py`, it becomes as fast as `fast.py` again. Clearly there is something about having these numbers sorted that makes the program faster.
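To make that experiment concrete, here is my reconstruction of the modified `slow.py` (written for Python 3, with `print()` as a function and the accumulator renamed `total` to avoid shadowing the built-in):

```python
import random
import time

a = [i for i in range(1000000)]
random.shuffle(a)
a.sort()  # put the integers back in allocation order before looping

total = 0
t1 = time.time()
for i in a:
    total = total + i
t2 = time.time()
print(t2 - t1)
```

With the `a.sort()` line present, the timing should land close to `fast.py`; comment it out and you get `slow.py` back.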

### False start: lists vs. arrays

My next thought was that it might be the data structure giving us caching issues. `a` is a “list”, which (understandably) led me to believe that `a` was implemented under the hood as a linked list. If the `shuffle` operation statefully randomizes the nodes of the linked list, then `fast.py` may be able to take advantage of cache locality by allocating all of its linked-list nodes adjacently, while `slow.py` will get a lot of cache misses because each linked-list node will reference another node that is unlikely to be on the same cache line.

… Unfortunately, that didn’t get me anywhere, either. Python’s list objects aren’t linked lists but are arrays in the truest sense of the word. In particular, this is the C structure definition of a Python list object:

```c
typedef struct {
    PyObject_VAR_HEAD
    PyObject **ob_item;
    Py_ssize_t allocated;
} PyListObject;
```

… in other words, `ob_item` is a pointer to an array of pointers to `PyObject`s, and `allocated` is the size of the array we’ve allocated. So that’s not helpful in solving this problem either (though it did soothe a few of my uncertainties about the algorithmic complexity of some list operations in Python: yes, list appends are O(1); yes, accessing an arbitrary element of a list is O(1); and so on). Now I’m just trying to figure out why Guido chose to call them “lists” rather than “arrays”, which is what they are.
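The array-backed, over-allocating implementation is easy to observe from Python itself. A quick sketch (CPython-specific, since `sys.getsizeof` reports the list object plus its backing pointer array):

```python
import sys

a = []
sizes = []
for i in range(32):
    a.append(i)
    sizes.append(sys.getsizeof(a))

# The reported size stays flat between appends and then jumps:
# CPython over-allocates the backing pointer array so that a
# sequence of appends is amortized O(1).
print(sizes)
```

The plateaus in the output are appends that fit in the existing array; the jumps are reallocations to a larger one.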

### The solution: integral objects

An array’s elements are all adjacent in memory, so the data structure isn’t giving us cache-locality problems. It turns out that cache locality *is* the problem that makes `slow.py` so slow, but it comes from an unexpected place: in Python, integers are heap-allocated objects rather than simple values. In particular, this is what an integer object looks like in the virtual machine:

```c
typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;
```

The only “interesting” member of this structure is `ob_ival`, which holds the value of the integer as a C `long`. If it seems wasteful to you to make an entire heap-allocated object just for an integer, you’re probably right. A lot of languages make optimizations to avoid doing exactly this. For example, Matz’s Ruby Interpreter generally stores objects as pointers, but makes an exception for integers since they’re used so frequently. In short, MRI crams a Fixnum into the same space as an object reference and tags the least significant bit to signal that it’s an integer rather than a pointer (on all modern systems, `malloc` returns memory addresses aligned to some power of two, so the low bit of a real pointer is always zero). At that point, you just have to do a right shift to get the value of the integer, with no heap allocation or indirection needed. `slow.py` and `fast.py` would have been equally speedy (and it’s likely they both would have been faster) if CPython had made a similar optimization.
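The tagging trick itself is just bit arithmetic. Here is a purely illustrative sketch of the scheme in Python (the function names are mine; real Fixnum tagging lives in C inside the interpreter and operates on machine words):

```python
def tag(value):
    # Shift left one bit and set the low bit; a real pointer
    # returned by malloc is aligned, so its low bit is always zero.
    return (value << 1) | 1

def is_tagged_int(word):
    # The low bit distinguishes a tagged integer from a pointer.
    return word & 1 == 1

def untag(word):
    # An arithmetic right shift recovers the original value.
    return word >> 1

word = tag(42)
assert is_tagged_int(word)
assert untag(word) == 42
```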

So how *does* CPython deal with integers, then? What is the behavior of the interpreter that’s giving us so much trouble? The Python interpreter allocates integers in “blocks” of about 40 at a time. When Python needs to create a new integer object, it carves out the next available space in the current integer “block” and stores the integer there. Since our code allocates one million integers in a row, adjacent integers will (for the most part) be placed adjacently in memory. Therefore, iterating through the first million integers *in order* exhibits good cache locality, and iterating through the first million integers in a random order will result in frequent cache misses.
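You can get a rough look at this from Python itself. In CPython, `id()` happens to return an object’s memory address, so the following sketch (CPython-specific, and starting above 256 to skip the interpreter’s cache of small integers) shows that consecutively allocated integers tend to sit at nearby addresses:

```python
# CPython-specific: id() returns the object's memory address.
a = [i for i in range(1000, 1020)]

addresses = [id(x) for x in a]
gaps = [addresses[i + 1] - addresses[i] for i in range(len(addresses) - 1)]

# Integers created back-to-back are carved out of the same block,
# so the gaps between neighbors tend to be small and fairly uniform.
print(gaps)
```

The exact gaps depend on the allocator's state, but freshly created runs of integers usually land a fixed stride apart.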

So, the answer to the question “why does sorting this array make this code faster” is that *it doesn’t*. The non-shuffled array is faster to iterate through since we access the integer objects in the same order that they were allocated (and they do have to be allocated).

*Edit, September 4: In the original article, I included a bit about how many CPU instructions it takes to print out an integer in Python. The figure was a bit misleading so I removed it. Don’t go looking for it, it’s gone now.*

I tested the code on Python 3 as well (there was an indentation error with `sum = sum + i`, but I fixed it, along with changing `print` to a function). I’m running Linux Mint on a 3.5 GHz Phenom II quad-core processor.

```
-> % python slow.py
0.186490058899
-> % python fast.py
0.107305049896
-> % python3 slow.py
0.22517633438110352
-> % python3 fast.py
0.12798285484313965
```

You seem to be right about the

thanks for the interesting article, kinda sucks how much slower python3 is than python2!!

Hey, thanks for testing this out. Kinda crazy how the disparity between the two scripts isn’t any better in 3, but it actually seems to be a little bit worse!

I think an interesting followup to this article would be a comparison of the same algorithms and how they behave differently between Python 2 and Python 3, though that may be out of scope for a measly blog post. I guess I’ll just do my own research…

This may be because in Python 3, `int` and `long` are unified:

```c
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
```

The struct is more general and needs more processing.

In Python 3, `range()` is lazy (it returns a range object rather than a list), which should make the difference larger.

Practically, timeit (%timeit in IPython) and numpy.random.randint may be helpful.

Running python3.3 on windows 7 I get 0.13 s for the sorted version and 0.22 s for the unsorted version.

With lists and numpy.sum(a) to compute the sum I get 0.083 s for the sorted and 0.22 s for the unsorted version.

Replacing the lists with numpy arrays and using the loop to find the sum I get 0.29 s for both the sorted and the unsorted version.

Using numpy arrays and numpy.sum(a) I get 0.0010 for both the sorted and unsorted version.

I find it interesting that for the unsorted numpy array looping is as slow as with a list.

Looks like you are right. If we alter the slow version with `a = map(lambda x: int(str(x)), a)` (the `str` hack forces allocation of a new integer object), then the slow version is again as fast as the fast one.

A direct way to confirm your finding is to construct the list as follows:

```python
a = [random.randrange(1000000) for i in range(1000000)]
```

Using that without the shuffle is fast; with the shuffle it is slow. So it is really about allocation order, not the values of the integers.

And there I thought you were going to tell us that the difference was that the sorted version accumulates the total sum as slowly as possible, and thus stays in plain-int land for as long as possible before having to flip over to the more expensive long type. But the difference in cost is not significant compared to the cost of having the CPU wait on memory?

Nope! In fact, the time it takes to convert to a long is all but irrelevant in the tests I ran. To confirm this for yourself, construct the array and sum it in reverse order. That will convert it to a long much sooner, and doing that had no observable effect on the running time of the script for me.
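My reconstruction of that sanity check looks something like this (the helper is mine, written for Python 3, where ints never overflow; on Python 2 the running total crosses into `long` territory much sooner when summing in reverse):

```python
import time

a = [i for i in range(1000000)]

def timed_sum(seq):
    # Sum with a plain for loop and report the elapsed wall time.
    t1 = time.time()
    total = 0
    for i in seq:
        total = total + i
    t2 = time.time()
    return total, t2 - t1

forward_total, forward_time = timed_sum(a)
# reversed(a) visits the same objects in the opposite order, so the
# big values (and any int-to-long promotion) come first.
reverse_total, reverse_time = timed_sum(list(reversed(a)))
print(forward_time, reverse_time)
```

In my runs the two directions came out about the same, which is the point: the promotion cost is lost in the noise.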

I assume that you were running this on a 64 bit machine where the sum stays as an int.

I wasn’t.

There were a few problems with your methodology that made me wonder if you were really testing the for-loop. I factored these out, and I’ll be jiggered, you’re right. Here’s the code I used to test it: https://gist.github.com/eykd/6453378

Thanks for this. It was noted on Reddit that for loops have quite a bit of overhead that you can defer to compiled code if you use the “sum” built-in function. Of course, that exacerbated the difference in performance between the two scripts.
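A sketch of that comparison (variable names are mine; absolute timings will vary by machine):

```python
import random
import timeit

a = list(range(1000000))
b = a[:]
random.shuffle(b)

# sum() iterates in compiled C code, so the per-element bytecode
# dispatch of a Python for loop disappears and the memory-access
# pattern becomes an even larger share of the running time.
t_in_order = timeit.timeit(lambda: sum(a), number=10)
t_shuffled = timeit.timeit(lambda: sum(b), number=10)
print(t_in_order, t_shuffled)
```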

Well, that made me more curious, so I refactored the gist to check both python for-loops and list comprehension for-loops, which run much faster. There appears to be no significant difference between sorted and unsorted in a list comprehension.

Results from running this on my Mac, OS X 10.6.8, 2.7GHz Intel Core i7 yadda yadda:

```
Sorted for loop:        18.2923810482
Unsorted for loop:      56.756278038
Sorted comprehension:   124.576835871
Unsorted comprehension: 126.68524313
```

I have no idea what that means, but it sure is interesting.

That looks interesting — can you include the code that you ran to get those results?

It’s in the gist: https://gist.github.com/eykd/6453378

It’s in the gist, again https://gist.github.com/eykd/6453378

If anyone is still reading this thread:

As has been suggested above, there is a difference between STARTING with a sorted sequence (as this example does) and actually sorting a list that wasn’t in sequence to begin with. Repeating this, by starting with a random sequence and THEN ordering it gives the opposite result (output and modified script):

```
Unsorted, Sequential List : 0.1790
Shuffled, Sequential List : 0.2650
Unsorted, Random List     : 0.1820
Sorted, Random List       : 0.2640
---------------------------------------------------------------
Final results: UNSORTED list loops FASTER than re-ordered list
               both sorting & shuffling RE-ORDER the list & SLOW looping
---------------------------------------------------------------
```

```python
# Script to test whether looping is faster through an unsorted or a sorted list
# Final results: UNSORTED list loops FASTER than re-ordered list
#                both sorting & shuffling RE-ORDER the list & SLOW looping

import time
import random

# The original test is WRONG
# - the original list is an UNSORTED list of sequential integers
#   (see below: this is NOT the same as a SORTED list of random integers)

a = [i for i in range(1000000)]
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print('%-30s: %.4f' % ('Unsorted, Sequential List', t2 - t1))

a = [i for i in range(1000000)]
random.shuffle(a)
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print('%-30s: %.4f' % ('Shuffled, Sequential List', t2 - t1))

# The CORRECTED test starts with an UNSORTED list of RANDOM integers
# - then the list is sorted and looping is timed again
# The unsorted list (of a random sequence) processes faster

a = []
for i in range(0, 1000000):
    a += [random.randint(0, 1000000)]
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print('%-30s: %.4f' % ('Unsorted, Random List', t2 - t1))

a.sort()
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print('%-30s: %.4f' % ('Sorted, Random List', t2 - t1))

print('''
---------------------------------------------------------------
Final results: UNSORTED list loops FASTER than re-ordered list
               both sorting & shuffling RE-ORDER the list & SLOW looping
---------------------------------------------------------------
''')
```