Why sorting an array makes a Python loop faster

This morning I ran across a blog post that presented two Python scripts, one of which was much more efficient than the other. The blog post has since been deleted, so I can’t link to it, but the scripts basically boil down to the following two programs (call them fast.py and slow.py):

# fast.py
import time

a = [i for i in range(1000000)]
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2 - t1

# slow.py
import time
from random import shuffle

a = [i for i in range(1000000)]
shuffle(a)
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2 - t1

As you can see, the two scripts do essentially the same thing. They both generate a list of the first one million integers and print the time it takes to compute their sum. The only difference is that slow.py first randomizes the order of the integers. Surprising as it may seem, this randomization alone is enough to slow the program down significantly. On my machine running Python 2.7.3, fast.py is consistently a tenth of a second quicker than slow.py (a nontrivial speedup, given that fast.py takes only about a quarter of a second total). Go ahead and try it out yourself. (I haven’t tested it on Python 3, but the results shouldn’t be too different.)
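If you want to reproduce the measurement with a bit less timer noise, here is a rough sketch of my own (not from the original post) that pushes the same loop through the standard timeit module. The repetition count of 10 is arbitrary, and I’ve renamed sum to total so as not to shadow the built-in:

import timeit

setup_fast = "a = [i for i in range(1000000)]"
setup_slow = setup_fast + "; from random import shuffle; shuffle(a)"

stmt = """
total = 0
for i in a:
    total = total + i
"""

# each call runs the loop 10 times and reports the total elapsed seconds
print(timeit.timeit(stmt, setup=setup_fast, number=10))
print(timeit.timeit(stmt, setup=setup_slow, number=10))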

So why does randomizing the list’s elements result in such a significant slowdown? The author of the original blog post chalked it up to “branch prediction”. If you’re not familiar with the term, check out this StackOverflow question, which explains the concept very well. (My suspicion is that the original author ran across that page or a similar one and applied the idea to a Python snippet where it didn’t quite fit.)

I was doubtful that branch prediction was actually the culprit here. There are no top-level conditional branches in the Python code, and it stands to reason that both scripts take precisely the same branches within the body of the loop: no part of the program is conditional on the integers themselves, and there is no data dependency between the list’s elements. On top of that, I’m skeptical that Python is anywhere near “close to the metal” enough for CPU-level branch prediction to be the deciding factor in the performance of a Python script; the interpreter adds far too much overhead of its own.
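If you want to convince yourself of that, the dis module makes it easy to look at the bytecode the loop compiles to. This is my own quick sketch, not something from the original post:

import dis

def loop(a):
    total = 0
    for i in a:
        total = total + i
    return total

# The disassembly shows that the loop body is essentially FOR_ITER, a few
# LOAD_FAST/STORE_FAST instructions, BINARY_ADD, and a jump back to the top;
# none of it branches on the values of the integers, and it is identical
# whether the list is shuffled or not.
dis.dis(loop)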

So if it’s not branch prediction, just why is slow.py so slow? After a little bit of research, I think I’ve figured it out, but not before making a few “false starts”. The answer requires a little familiarity with the internals of the Python virtual machine.

False start: lists and generators

My first thought was that Python was able to deal with the sorted list [i for i in range(1000000)] much more efficiently than with the randomized one. That is, maybe it was substituting a generator like this one for the list:

def numbers():
    i = 0
    while i < 1000000:
        yield i
        i += 1

That might be a bit more time-efficient, I thought. After all, if Python internally uses a generator in place of a bona fide list, it doesn’t have to go to the trouble of keeping all of the integers in memory at once, which might save a lot of overhead. The randomized list in slow.py can’t be easily captured by a simple generator, so the VM wouldn’t be able to make any such optimization.

Unfortunately, this wasn’t a useful observation. If you squeeze an a.sort() in between the shuffle and the loop in slow.py, it becomes as fast as fast.py again. Clearly there is something about having these numbers sorted that makes the program faster.
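Concretely, the modified slow.py looks something like this (my reconstruction; the variable names match the original scripts):

# slow.py, with a sort added after the shuffle
import time
from random import shuffle

a = [i for i in range(1000000)]
shuffle(a)
a.sort()  # undoing the shuffle restores the original speed

sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2 - t1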

False start: lists vs. arrays

My next thought was that the data structure might be giving us caching issues. a is a “list”, which (understandably) led me to believe that a was implemented under the hood as a linked list. If the shuffle operation randomized the links between the nodes in place, then fast.py might benefit from cache locality, since all of its linked-list nodes would have been allocated adjacently, while slow.py would suffer a lot of cache misses, since each node would point to another node unlikely to be on the same cache line.

… Unfortunately, that didn’t get me anywhere, either. Python’s list objects aren’t linked lists but are arrays in the truest sense of the word. In particular, this is the C structure definition of a Python list object:

typedef struct {
  PyObject_VAR_HEAD
  PyObject **ob_item;
  Py_ssize_t allocated;
} PyListObject;

… in other words, ob_item is a pointer to an array of pointers to PyObjects, and allocated is the size of the array we’ve allocated. So that’s no help in solving this problem either (though it did soothe a few of my uncertainties about the algorithmic complexity of some list operations in Python: YES, list appends are amortized O(1); YES, accessing an arbitrary element of a list is O(1); etc.). Now I’m just left wondering why Guido chose to call them “lists” rather than “arrays”, which is what they are.
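As an aside, you can watch that allocated field at work without reading any more C. On CPython, sys.getsizeof reports the list header plus allocated (not len(a)) pointer-sized slots, so the reported size grows in occasional jumps as the array over-allocates. A little sketch of my own:

import sys

a = []
last = sys.getsizeof(a)
for i in range(32):
    a.append(i)
    size = sys.getsizeof(a)
    if size != last:  # a jump means ob_item was reallocated with spare room
        print(len(a), size)
        last = size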

The solution: integral objects

An array’s elements are all adjacent in memory, so the data structure isn’t giving us cache locality problems. It turns out that cache locality is the problem that makes slow.py so slow, but it comes from an unexpected place. In Python, integers are heap-allocated objects rather than simple values. In particular, this is what an integer object looks like in the virtual machine:

typedef struct {
  PyObject_HEAD
  long ob_ival;
} PyIntObject;

The only “interesting” member of this structure is ob_ival, the value of the integer as a C long. If it seems wasteful to you to heap-allocate an entire object just to hold an integer, you’re probably right, and a lot of language implementations optimize to avoid doing exactly this. For example, Matz’s Ruby Interpreter (MRI) generally stores objects as pointers, but makes an exception for integers since they’re used so frequently. In short, MRI crams a Fixnum into the same space as an object reference and tags the least significant bit to signal that it’s an integer rather than a pointer (on all modern systems, malloc returns memory addresses aligned to some power of 2, so a genuine pointer’s low bit is always zero). At that point, recovering the value takes nothing more than a right shift: no heap allocation or pointer chasing required. slow.py and fast.py would have been equally speedy (and both would likely have been faster) if CPython had made a similar optimization.
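To make that tagging trick concrete, here is a toy model of it in Python (purely illustrative on my part; MRI of course does this in C on raw machine words):

def tag_fixnum(n):
    # shift the value left one bit and set the low bit to mark
    # "this word is an integer, not a pointer"
    return (n << 1) | 1

def is_fixnum(word):
    # real pointers returned by malloc are aligned, so their low bit is 0
    return word & 1 == 1

def untag_fixnum(word):
    # a single arithmetic right shift recovers the value; no heap access
    return word >> 1

word = tag_fixnum(42)
assert is_fixnum(word)
assert untag_fixnum(word) == 42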

So how does CPython deal with integers, then? What is it about the interpreter’s behavior that is giving us so much trouble? The Python interpreter allocates integers in “blocks” of about 40 at a time. When Python needs to create a new integer object, it carves out the next available slot in the current block and stores the integer there. Since our code allocates one million integers in a row, consecutive integers will (for the most part) end up adjacent in memory. As a result, iterating through the first million integers in allocation order exhibits good cache locality, while iterating through them in a random order produces frequent cache misses.
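You can even catch the allocator in the act from within Python. This is a quick check of my own: it assumes CPython, where id() happens to return an object’s memory address, and it skips the first thousand values because CPython keeps a preallocated cache of small integers (roughly -5 through 256):

a = [i for i in range(1000000)]

# look at the addresses of twenty integers that were allocated back to back
addrs = [id(x) for x in a[1000:1020]]
print([addrs[i + 1] - addrs[i] for i in range(len(addrs) - 1)])
# Most of the deltas should come out to one constant step (the size of a
# single integer object, possibly negative depending on how the free list is
# threaded), with an occasional larger jump at a block boundary.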

So, the answer to the question “why does sorting this array make this code faster?” is that it doesn’t. The unshuffled array is faster to iterate through because we access the integer objects in the same order in which they were allocated (and allocated they must be, one heap object per integer).

Edit, September 4: In the original article, I included a bit about how many CPU instructions it takes to print out an integer in Python. The figure was a bit misleading so I removed it. Don’t go looking for it, it’s gone now.

20 responses to “Why sorting an array makes a Python loop faster”

  1. I tested the code on Python 3 (there was an indentation error with `sum = sum + i`, but I fixed it, along with changing print to a function). I’m running Linux Mint on a 3.5 GHz Phenom II quad-core processor.

    -> % python slow.py
    0.186490058899

    -> % python fast.py
    0.107305049896

    -> % python3 slow.py
    0.22517633438110352

    -> % python3 fast.py
    0.12798285484313965

    You seem to be right about the
    Thanks for the interesting article. Kinda sucks how much slower Python 3 is than Python 2!

    • Hey, thanks for testing this out. Kinda crazy how the disparity between the two scripts isn’t any better in 3, but it actually seems to be a little bit worse!

      • I think an interesting followup to this article would be a comparison of the same algorithms and how they behave differently between Python 2 and Python 3, although that may be out of scope for a measly blog post. I guess I’ll just do my own research…

      • This may be because in Python 3, int and long are unified:

        struct _longobject {
            PyObject_VAR_HEAD
            digit ob_digit[1];
        };

        The struct is more general and needs more processing.

      • In python3 range() really is a generator, which should make the difference larger.

  2. Practically, timeit (%timeit in IPython) and numpy.random.randint may be helpful.

  3. Joes Staal

    Running Python 3.3 on Windows 7 I get 0.13 s for the sorted version and 0.22 s for the unsorted version.
    With lists and numpy.sum(a) to compute the sum I get 0.083 s for the sorted and 0.22 s for the unsorted version.
    Replacing the lists with numpy arrays and using the loop to find the sum I get 0.29 s for both the sorted and the unsorted version.
    Using numpy arrays and numpy.sum(a) I get 0.0010 for both the sorted and unsorted version.
    I find it interesting that for the unsorted numpy array looping is as slow as with a list.

  4. Looks like you are right. If we alter the slow version with `a = map(lambda x: int(str(x)), a)` (the str hack forces allocation of new integers), then the slow version again becomes as fast as the fast one.

  5. A direct way to confirm your finding is to construct the list as follows:

    a = [random.randrange(1000000) for i in range(1000000)]

    Using that without the shuffle is fast, and with the shuffle it is slow. So it really is about allocation order, not the values of the integers.

  6. And there I was thinking you were going to tell us that the difference was that the sorted version accumulates the total sum as slowly as possible and thus stays in plain-int land for as long as possible before having to flip over to the more expensive long type. But I suppose that difference in cost is not significant compared to the cost of having the CPU wait on memory?

  7. David Eyk

    There were a few problems with your methodology that made me wonder if you were really testing the for-loop. I factored these out, and I’ll be jiggered, you’re right. Here’s the code I used to test it: https://gist.github.com/eykd/6453378

    • Thanks for this. It was noted on Reddit that for loops have quite a bit of overhead that you can defer to compiled code if you use the “sum” built-in function. Of course, that exacerbated the difference in performance between the two scripts.

      • David Eyk

        Well, that made me more curious, so I refactored the gist to check both python for-loops and list comprehension for-loops, which run much faster. There appears to be no significant difference between sorted and unsorted in a list comprehension.

        Results from running this on my Mac, OS X 10.6.8, 2.7GHz Intel Core i7 yadda yadda:

        Sorted for loop:
        18.2923810482

        Unsorted for loop:
        56.756278038

        Sorted comprehension:
        124.576835871

        Unsorted comprehension:
        126.68524313

        I have no idea what that means, but it sure is interesting.

      • That looks interesting — can you include the code that you ran to get those results?

  8. If anyone is still reading this thread:

    As has been suggested above, there is a difference between STARTING with a sorted sequence (as this example does) and actually sorting a list that wasn’t in sequence to begin with. Repeating the experiment, but starting with a random sequence and THEN ordering it, gives the opposite result (output and modified script below):

    Unsorted, Sequential List : 0.1790
    Shuffled, Sequential List : 0.2650
    Unsorted, Random List : 0.1820
    Sorted, Random List : 0.2640

    —————————————————————————
    Final results: UNSORTED list loops FASTER than re-ordered list
    both sorting & shuffling RE-ORDER the list & SLOW looping
    —————————————————————————

    # Script to test whether looping is faster through an unsorted or a sorted list
    # Final results: UNSORTED list loops FASTER than re-ordered list
    # both sorting & shuffling RE-ORDER the list & SLOW looping

    import time
    import random

    # The original test is WRONG
    # – the original list is an UNSORTED list of sequential integers
    #   (see below: this is NOT the same as a SORTED list of random integers)
    a = [i for i in range(1000000)]
    sum = 0
    t1 = time.time()
    for i in a:
        sum = sum + i
    t2 = time.time()
    print('%-30s: %.4f' % ('Unsorted, Sequential List', t2-t1))

    a = [i for i in range(1000000)]
    random.shuffle(a)
    sum = 0
    t1 = time.time()
    for i in a:
        sum = sum + i
    t2 = time.time()
    print('%-30s: %.4f' % ('Shuffled, Sequential List', t2-t1))

    # The CORRECTED test starts with an UNSORTED list of RANDOM integers
    # – then the list is sorted and looping is timed again
    # The unsorted list (of a random sequence) processes faster
    a = []
    for i in range(0, 1000000):
        a += [random.randint(0, 1000000)]
    sum = 0
    t1 = time.time()
    for i in a:
        sum = sum + i
    t2 = time.time()
    print('%-30s: %.4f' % ('Unsorted, Random List', t2-t1))

    a.sort()
    sum = 0
    t1 = time.time()
    for i in a:
        sum = sum + i
    t2 = time.time()
    print('%-30s: %.4f' % ('Sorted, Random List', t2-t1))

    print('''
    —————————————————————————
    Final results: UNSORTED list loops FASTER than re-ordered list
    both sorting & shuffling RE-ORDER the list & SLOW looping
    —————————————————————————
    ''')
