3.1.1. Numpy, Python, and Scientific Computing

This page covers two things you’ll want to think about in the background as we develop code:

  1. Memory (your computer’s)

  2. SPEED

Main takeaways

  1. Numpy handles numeric operations on large arrays quickly, so we should learn it.

  2. Don’t overly stress about having “fast” code. Your time matters more than your computer’s runtime.

  3. Age-old wisdom: Premature optimization (trying to maximize code speed too early) is the root of all evil.

The issue of memory

MY MAIN POINT HERE: The files we deal with will get larger as the class moves forward. Typically, a file on your hard drive takes up roughly twice as much space once loaded into memory (RAM).

Warning

Be mindful of that and don’t load a file that will exceed your computer’s available RAM!

Example: If a file is 4GB on disk and your computer only has 6GB of RAM, loading it (roughly 8GB in memory) will max out your computer.

Large file sizes will require us to develop workarounds, both in terms of how files are loaded and used, and how they are shared. We will deal with these later. If you run into memory constraints before we cover this more formally, raise the issue with me before/during/after class, with your TA or me during office hours, or on the GitHub discussion repo.
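If you want a quick sanity check before loading a big file, a sketch like the one below can help. It assumes the third-party psutil package is installed (pip install psutil), and the file path is a made-up placeholder:

import os
import psutil  # third-party package for system info; assumed installed

path = "big_data_file.csv"  # hypothetical file - replace with your own

file_gb = os.path.getsize(path) / 1e9                # size on disk, in GB
avail_gb = psutil.virtual_memory().available / 1e9   # RAM currently free, in GB

# rough rule of thumb from above: budget ~2x the on-disk size
if 2 * file_gb > avail_gb:
    print(f"Careful: ~{2 * file_gb:.1f}GB needed, but only {avail_gb:.1f}GB free")
else:
    print("Probably safe to load")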

The issue of speed

Python’s biggest strength is the manner in which we write its code. We write plain statements like

for state in states:
    print(capital(state))

This makes it easy and fast to write and debug code.
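If you want to actually run that snippet, here is a minimal sketch: the states list and capital function above are placeholders, so this fakes them with a small hypothetical dictionary.

# a tiny stand-in dataset: a few states and their capitals
capitals = {"Pennsylvania": "Harrisburg", "New York": "Albany", "Ohio": "Columbus"}
states = list(capitals.keys())

def capital(state):
    # hypothetical helper: look up the capital of a state
    return capitals[state]

for state in states:
    print(capital(state))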

We don’t explicitly write many of the particulars the computer needs to know to execute code (e.g., what type is this variable? which operations should be used for that type? how much memory should be allocated for this process?).

This is why Python is a “high-level” language - it is written at a level far removed from the computer’s language (binary).

So Python is flexible, but that flexibility can make it slow.


Your time is more valuable than your computer's

Tip

A GREAT RULE OF THUMB: total user time = programming time + executing time

  • Python tends to minimize programming time relative to other languages.

  • Executing time is, to an approximation, free: Start executing slow code and go see a movie, sleep through the night, or whatever! There is nothing better than having fun and being able to simultaneously claim “you’re working”.

  • Usually, your code is slow because of only one or a few lines of code. If the code must be sped up, we can identify the culprit (see the profiling sketch below) and apply one of a few fixes.
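If you ever do need to hunt for the culprit, Python's built-in cProfile module is one simple way to do it. A minimal sketch, using a made-up toy function:

import cProfile
import random

def simulate():
    # toy workload (hypothetical): sum of squared random draws
    total = 0
    for _ in range(100000):
        total += random.uniform(0, 1) ** 2
    return total

# prints time spent inside each function call, which usually
# points straight at the one or two slow spots
cProfile.run("simulate()")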

Illustrating why Python _can_ be slow

Here’s an example that shows why Python can be slow:

import numpy as np   # this is how we load numpy, and "np" is
import random        # just a convenient abbrev convention

# a "+" sign isn't just a plus sign
a, b = 3, 14
print(a + b)
a, b = "py ", "is good"
print(a + b)
17
py is good

The “+” operator saw numbers and added, but then it saw strings and concatenated.

Python handled that like a pro!

Q: How did it do that?

A: On each line, the Python interpreter checked the objects that “+” was operating on and applied the correct method!
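You can actually watch this dispatching happen: a + b is (roughly) a call to the type-specific __add__ method, so what "+" does depends on the types it finds. A quick sketch:

a, b = 3, 14
print(type(a), a.__add__(b))        # int's __add__ does arithmetic: 17

a, b = "py ", "is good"
print(type(a), a.__add__(b))        # str's __add__ concatenates: 'py is good'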

However, checking objects before every operation involves considerable work:

  • If you want to add 10 numbers, that’s 9 addition operations, plus each time the code checks both objects involved, which is another 18 operations. So it took 27 steps to add 10 numbers.

  • Languages like FORTRAN and C are compiled, and users have to declare that all 10 numbers are numbers, so C code just plows ahead and is done in 9 steps.

  • It is even worse than that, because in C, the next number in an array is always the same distance from the prior number (in terms of its physical location in memory). Meanwhile, a Python list only stores pointers to objects scattered around memory, so the program can’t assume the next number is a fixed distance away. These extra “look ups” can cause huge delays, as the sketch below illustrates.
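To make the "same distance in memory" idea concrete, here is a small sketch comparing a numpy array (fixed-size elements stored back-to-back) with a plain Python list (a collection of pointers to separate Python objects):

import sys
import numpy as np

arr = np.arange(1000, dtype=np.int64)   # 1,000 64-bit integers, stored contiguously
lst = list(range(1000))                 # 1,000 full-blown Python int objects

print(arr.itemsize, arr.strides)   # every element is 8 bytes, and the next one is exactly 8 bytes away
print(arr.nbytes)                  # 8000 bytes of actual data

# the list only holds pointers; each int lives elsewhere as its own object
print(sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst))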

So here is a timing example: let’s pull random numbers and square them:

%%time

y = 0     
for i in range(1000000):
    x = random.uniform(0, 1)
    y += x**2
Wall time: 624 ms

Oye, that seems slow!

We haven’t earned that high five yet, Ghostrider.

3.1.1.1. The speed solution

Vectorization: send commands that operate on entire arrays to the computer, where they run as pre-compiled, efficient native machine code.

Let’s redo the code above:

%%time

x = np.random.uniform(0, 1, 1000000)
y = np.sum(x**2)
Wall time: 18.4 ms

Bam! That is muuuuuch faster. I think we earned that five…

3.1.1.2. Why did that work?

  • In the slow version, we effectively sent the computer 3 million commands: draw a number, square it, then add it to the running total, repeated a million times.

  • In the fast version, we packed those 3 million operations into just 3 array commands: draw an array of numbers, square all the elements of the array, sum the array. The array functions are optimized to use native machine code and don’t have to check each element for type.

Verdict: arrays and vectorization are great*! And in Python, we implement them using numpy.
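The same loop-free style goes beyond sums. Here is a sketch of a few other common numpy patterns (boolean masks and np.where), all of which run in compiled code with no Python-level loop:

import numpy as np

x = np.random.uniform(0, 1, 1000000)

frac_above_half = (x > 0.5).mean()    # share of draws above 0.5
big_draws = x[x > 0.9]                # keep only the draws above 0.9
y = np.where(x > 0.5, x**2, 0.0)      # square the big draws, zero out the rest

print(frac_above_half, big_draws.size, y.sum())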

3.1.1.3. Does this talk of vectorization confuse you?

That’s ok! This isn’t a computer science class! Here are the takeaways of this page:

3.1.1.4. The takeaways:

Main takeaways

  1. Numpy handles numeric operations on large arrays quickly, so we should learn it.

  2. Don’t overly stress about having “fast” code. Your time matters more than your computer’s runtime.

  3. Age-old wisdom: Premature optimization (trying to maximize code speed too early) is the root of all evil.

Ok, with that covered…

So let’s learn some numpy!