Why sorting an array makes a Python loop faster

This morning I ran across a blog post that presented two Python scripts, one of which was much more efficient than the other. The blog post has since been deleted so I can’t link to it, but the scripts basically boil down to this:

import time
a = [i for i in range(1000000)]
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2-t1
import time
from random import shuffle
a = [i for i in range(1000000)]
shuffle(a)
sum = 0
t1 = time.time()
for i in a:
    sum = sum + i
t2 = time.time()
print t2-t1

As you can see, the two scripts have entirely identical behavior. They both generate a list of the first one million integers and print the time it takes to compute their sum. The only difference is that slow.py first randomizes the order of the integers. Although it may seem surprising, it seems that this randomization is enough to slow the program significantly. On my machine running Python 2.7.3, fast.py is consistently a tenth of a second quicker than slow.py (this is a nontrivial speedup given that fast.py takes about a quarter of a second total to run). Go ahead and try it out yourself. (I haven’t tested it on Python 3, but the results shouldn’t be too different.)

So why does randomizing the list’s elements result in such a significant slowdown? The author of the original blog post chalked it up to “branch prediction”. If you’re not familiar with the term, check out this StackOverflow question, which explains the concept very well. (My suspicion is that the author of this blog post encountered this page or a similar one and applied the idea to a Python snippet where it didn’t quite fit.)

Of course, I was doubtful that branch prediction could actually be the culprit here. There are no top-level conditional branches in the Python code, and it stands to reason that both scripts take precisely the same branches within the body of the loop. No part of the program is conditional on the integers themselves, and there is no data dependency between each of the list’s elements. Of course, I’m still uncertain that Python is anywhere near “close to the metal” enough that CPU-level branch prediction should even be a factor in the performance analysis of a Python script. Python is way too high-level for that.

So if it’s not branch prediction, just why is slow.py so slow? After a little bit of research, I think I’ve figured it out, but not before making a few “false starts”. The answer requires a little familiarity with the internals of the Python virtual machine.

False start: lists and generators

My first thought was that Python was able to deal with the sorted list [i for i in range(1000000)] much more efficiently than the randomized list. That is, maybe it was substituting the list for a generator like this one:

def numbers():
    i = 0
    while i < 1000000:
        yield i
        i += 1

That might be a bit more time-efficient, I thought. After all, if Python internally uses a generator in place of a bonafide list, then it doesn’t have to go to the trouble of keeping all of the integers in memory at once, which might save us a lot of overhead. The randomized list in slow.py can’t be easily captured by a simple generator, so the VM wouldn’t be able to make any such optimization.

Unfortunately, this wasn’t a useful observation. If you squeeze an a.sort() in between the shuffle and the loop in slow.py, it becomes as fast as fast.py again. Clearly there is something about having these numbers sorted that makes the program faster.

False start: lists vs. arrays

My next thought that it might be the data structure giving us caching issues. a is a “list”, which (understandably) led me to believe that a was implemented under-the-hood as a linked list. If the shuffle operation statefully randomizes the nodes of the linked list, then fast.py may be able to take advantage of cache locality by allocating all of its linked-list nodes adjacently, while slow.py will get a lot of cache misses because each linked list node will reference another node that is not likely to be on the same cache line.

… Unfortunately, that didn’t get me anywhere, either. Python’s list objects aren’t linked lists but are arrays in the truest sense of the word. In particular, this is the C structure definition of a Python list object:

typedef struct {
  PyObject_VAR_HEAD
  PyObject **ob_item;
  Py_ssize_t allocated;
} PyListObject;

… in other words, ob_item is a pointer to an array of pointers to PyObjects, and allocated is the size of the array we’ve allocated. So that’s not helpful in solving this problem, either (though it did soothe a few of my uncertainties about the algorithmic complexity of some list operations in Python: YES, list appends are O(1), YES, accessing an arbitrary element of a list is O(1), etc.). Now, I’m just trying to figure out why Guido chose to call them “lists” rather than “arrays”, which is what they are.

The solution: integral objects

An array’s elements are all adjacent in memory, so the data structure isn’t giving us cache locality problems. It turns out that cache locality is the problem that makes slow.py so slow, but it comes from an unexpected place. In Python, integers are heap-allocated objects rather than simple values. In particular, this is what an integer object looks like in the virtual machine:

typedef struct {
  PyObject_HEAD
  long ob_ival;
} PyIntObject;

The only “interesting” element of this structure is ob_ival (which is the value of the integer as a C long). If it seems wasteful to you to make an entire heap-allocated object just for an integer, you’re probably right. A lot of languages make optimizations to prevent doing exactly this. For example, Matz’s Ruby Interpreter generally stores objects as pointers, but makes an exception for integers since they’re used so frequently. In short, the MRI crams Fixnums into the same space as an object reference and tags the least significant bit to let you know it’s an integer rather than a pointer (on all modern systems, malloc always returns memory addresses that are aligned to some power of 2). At that point, you just have to do a right shift to get the value of the integer — no heap allocation or redirection needed. slow.py and fast.py would have been equally speedy (and it’s likely they both would have been faster) if CPython had made a similar optimization.

So how does CPython deal with integers, then? What is the behavior of the interpreter that’s giving us so much trouble? The Python interpreter allocates integers in “blocks” of about 40 at a time. When Python needs to create a new integer object, it carves out the next available space in the current integer “block” and stores the integer there. Since our code allocates one million integers in a row, adjacent integers will (for the most part) be placed adjacently in memory. Therefore, iterating through the first million integers in order exhibits good cache locality, and iterating through the first million integers in a random order will result in frequent cache misses.

So, the answer to the question “why does sorting this array make this code faster” is that it doesn’t. The non-shuffled array is faster to iterate through since we access the integer objects in the same order that they were allocated (and they do have to be allocated).

Edit, September 4: In the original article, I included a bit about how many CPU instructions it takes to print out an integer in Python. The figure was a bit misleading so I removed it. Don’t go looking for it, it’s gone now.

About these ads

19 responses to “Why sorting an array makes a Python loop faster

  1. I tested the code on python3 (there was some indentation error with `sum = sum + i`, but i fixed it; along with print-as-function). im running Linux Mint on a 3.5 GHz Phenom II quad core processor.

    -> % python slow.py
    0.186490058899

    -> % python fast.py
    0.107305049896

    -> % python3 slow.py
    0.22517633438110352

    -> % python3 fast.py
    0.12798285484313965

    You seem to be right about the
    thanks for the interesting article, kinda sucks how much slower python3 is than python2!!

    • Hey, thanks for testing this out. Kinda crazy how the disparity between the two scripts isn’t any better in 3, but it actually seems to be a little bit worse!

      • I think an interesting followup to this article is a comparison of the same algorithms and how they act differently between python2 and python3. Although it may be out of the scope for this a measly blog post. I guess ill just do my own research…

      • this may be because in python3 the {int, long} are uniformed.

        struct _longobject {
        PyObject_VAR_HEAD
        digit ob_digit[1];
        };

        the struct is more general, and need more processing.

      • In python3 range() really is a generator, which should make the difference larger.

  2. Practically, timeit (%timeit in IPython) and numpy.random.randint may be helpful.

  3. Joes Staal

    Running python3.3 on windows 7 I get 0.13 s for the sorted version and 0.22 s for the unsorted version.
    With lists and numpy.sum(a) to compute the sum I get 0.083 s for the sorted and 0.22 s for the unsorted version.
    Replacing the lists with numpy arrays and using the loop to find the sum I get 0.29 s for both the sorted and the unsorted version.
    Using numpy arrays and numpy.sum(a) I get 0.0010 for both the sorted and unsorted version.
    I find it interesting that for the unsorted numpy array looping is as slow as with a list.

  4. Looks like you are right. If we alter slow version with `a = map(lambda x: int(str(x)), a)` (str hack to force allocating of new integer), than slow-version will be again as fast as fast one.

  5. A direct way to confirm your finding is to construct the list as follows:

    a = [random.randrange(1000000) for i in range(1000000)]

    Using that without the shuffle is fast, with the shuffle is slow. So it is really about allocation order, not value of the integers.

  6. And there I thought you were going to tell us was that the difference was that the sorted version accumulates the total sum as slowly as possible as thus says in plain-int land for as long as possible before having to flip over to the more expensive long type. But the difference in cost is not significant compared to the cost of having the CPU wait on memory?

  7. David Eyk

    There were a few problems with your methodology that made me wonder if you were really testing the for-loop. I factored these out, and I’ll be jiggered, you’re right. Here’s the code I used to test it: https://gist.github.com/eykd/6453378

    • Thanks for this. It was noted on Reddit that for loops have quite a bit of overhead that you can defer to compiled code if you use the “sum” built-in function. Of course, that exacerbated the difference in performance between the two scripts.

      • David Eyk

        Well, that made me more curious, so I refactored the gist to check both python for-loops and list comprehension for-loops, which run much faster. There appears to be no significant difference between sorted and unsorted in a list comprehension.

        Results from running this on my Mac, OS X 10.6.8, 2.7GHz Intel Core i7 yadda yadda:

        Sorted for loop:
        18.2923810482

        Unsorted for loop:
        56.756278038

        Sorted comprehension:
        124.576835871

        Unsorted comprehension:
        126.68524313

        I have no idea what that means, but it sure is interesting.

      • That looks interesting — can you include the code that you ran to get those results?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: