Intro to Numpy

Numpy

We start this module of class by covering numpy, because pandas is built on top of numpy.

import numpy as np # by convention, numpy is always imported with the alias np
import random

Open a folder on your computer with your participation files and open Jupyter to that location.

Outline

Beware of memory needs and don't overload your computer
If your code is slow, find the offending line(s), and try to optimize those
If you draw random numbers, set a seed so that your results can be replicated and so they don't change every time you rerun the code
Numpy and basic commands

Python and Scientific Computing

The issue of memory

MY MAIN POINT HERE: The files we deal with will get larger as the class moves forward. As a rough rule of thumb, a file on your hard drive takes up about twice as much space once it is loaded into memory (RAM). Be mindful of that and don't load a file that will exceed your computer's available RAM!
Large file sizes will require us to develop workarounds, both in terms of how they are loaded and used, and how they are shared. We will deal with these later. If you run into issues with memory constraints before we cover it more formally, raise the issue with me before, during, or after class, with your TA or me during office hours, or on the GitHub discussion repo.
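One way to keep an eye on memory is to ask a numpy array how many bytes its data occupies. This is a minimal sketch: the array size is an arbitrary choice for illustration, and nbytes counts only the array's own data, not everything else in your session.

```python
import numpy as np

# One million 64-bit floats (the default dtype of np.zeros): 8 bytes each.
x = np.zeros(1_000_000)

bytes_used = x.nbytes               # total bytes held by the array's data
megabytes_used = bytes_used / 1e6   # convert to megabytes for readability

print(f"{megabytes_used:.1f} MB")
```

Scaling that number up to the size of the data you plan to load gives a rough sense of whether it will fit in RAM.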

The issue of speed

Python's biggest strength is the manner in which we write its code. We write plain statements like for i in range(10): print(i). This makes it easy and fast to write and debug code. We don't explicitly write many particulars that the computer needs to know to execute code (e.g. what type is this variable, what operations should I call for that type, how much memory do I allocate for this process?). This is why Python is a "high level" language - it is written at a level far from the computer's language (binary).

The main cost of this is that Python code can be sloooooooooow to execute. But remember that total time = programming time + executing time:

Python, as I mentioned above, tends to minimize programming time relative to other languages.
Executing time is, to an approximation, free: Start executing slow code and go see a movie, sleep through the night, or whatever! There is nothing better than having fun and being able to simultaneously claim "you're working".
Usually, your code is slow because of one or only a few lines of code. If the code must be sped up, we can identify the culprit, and apply one of a few fixes.
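To find the culprit, one simple approach is to time each step separately. The sketch below uses the standard library's time.perf_counter; the helper names time_it and draw_numbers are made up for this illustration, and the two steps are stand-ins for whatever your own code does.

```python
import random
import time

def time_it(func):
    """Run func once and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

def draw_numbers():
    # Step 1 (hypothetical): draw 100,000 random numbers one at a time.
    return [random.uniform(0, 1) for _ in range(100_000)]

draws, seconds_drawing = time_it(draw_numbers)

# Step 2 (hypothetical): square and sum them in pure Python.
total, seconds_summing = time_it(lambda: sum(x**2 for x in draws))

print(f"drawing: {seconds_drawing:.4f} s, summing: {seconds_summing:.4f} s")
```

Whichever step dominates the total is the one worth optimizing first.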

An example that shows why Python is slow is the following:

# a "+" sign isn't just a plus sign
a,b=3, 14
print(a+b)
a,b="py ","is good"
print(a+b)

17
py is good

Python handled that like a pro! The "+" operator saw numbers and added, but then it saw strings and concatenated.

Q: How did it do that? A: The interpreter in each line checked the objects the "+" was operating on and applied the correct method!
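Roughly speaking, "a + b" asks the type of the left-hand object how to add. The sketch below spells that lookup out by hand; it simplifies what the interpreter really does (which has extra fallback rules), so treat it as an illustration rather than the full mechanism.

```python
# With numbers, the "+" ends up calling the integer addition method.
a, b = 3, 14
int_sum = type(a).__add__(a, b)     # same as int.__add__(3, 14)
print(type(a).__name__, int_sum)

# With strings, the same "+" ends up calling string concatenation.
a, b = "py ", "is good"
str_sum = type(a).__add__(a, b)     # same as str.__add__("py ", "is good")
print(type(a).__name__, str_sum)
```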

However, checking objects before every operation involves considerable work:

If you want to add 10 numbers, that's 9 addition operations, plus each time the code checks both objects involved, which is another 18 operations. So it took 27 steps to add 10 numbers.
Languages like FORTRAN and C are compiled, and users have to declare that all 10 numbers are numbers, so C code just plows ahead and is done in 9 steps.
It is even worse than that, because in C, the next number in an array is always the same distance from the prior number (in terms of its physical location in memory). Meanwhile, in Python, the program can't quite assume the next number is the same distance away. These "look ups" can cause huge delays.
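The points above about types and memory distance can be seen by comparing a numpy array with a plain list. This is a small sketch; the array length and the contents of the list are arbitrary choices for illustration.

```python
import numpy as np

# A numpy array stores one type, packed at a fixed spacing in memory.
x = np.arange(10, dtype=np.float64)
print(x.dtype)                    # one type for every element
print(x.itemsize)                 # bytes per element
print(x.strides)                  # bytes to step to reach the next element
print(x.flags["C_CONTIGUOUS"])    # True when elements sit back to back

# A Python list can hold anything, so each element must be inspected.
mixed = [3, "py", 14.0]
mixed_types = [type(item).__name__ for item in mixed]
print(mixed_types)
```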

So here is an example: Let's pull random numbers and square them:

%%time

y = 0     
for i in range(1000000):
    x = random.uniform(0, 1)
    y += x**2

Wall time: 1.08 s

Oye, that's not good.

We haven't earned that high five yet, Ghostrider.

The speed solution

Vectorization. Express your commands as operations on whole arrays, so the computer runs them with pre-compiled, efficient native machine code.

Let's redo the code above:

%%time

x = np.random.uniform(0, 1, 1000000)
y = np.sum(x**2)

Wall time: 46.8 ms

Bam! That is muuuuuch faster. I think we earned that five...

Back to our regularly scheduled programming:

Why did that work?

In the slow version, we effectively sent the computer 3 million commands: draw a number, square it, then add to prior.
In the fast version, we sent those 3 million commands embedded into just 3 array commands: draw an array of numbers, square all the elements of the array, sum the array. The array functions are optimized to use native machine code and don't have to check each element for type.
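To confirm that the two versions compute the same thing, here is a minimal sketch that runs both on the same draws. The seed of 100 and the array length of 1000 are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(100)                  # fix the draws so reruns match
x = np.random.uniform(0, 1, 1000)

# Slow style: one Python-level command per element.
loop_total = 0.0
for value in x:
    loop_total += value**2

# Fast style: one array command to square, one to sum.
vectorized_total = np.sum(x**2)

# The totals agree up to floating point rounding.
print(np.isclose(loop_total, vectorized_total))
```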

Verdict: arrays and vectorization are great*! And in Python, we implement them using numpy.

So let's learn some numpy!

Numpy - A veeeery quick guide

(HINT: Jupyter's "Help" menu has a link to Numpy documentation.)

You can access an element of a numpy array just like a list: x=np.arange(1,5,1); x[1]. If the array is a matrix, x[row,col] works.
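Here is a short sketch of both kinds of indexing. The matrix m and its values are made up for this illustration.

```python
import numpy as np

x = np.arange(1, 5, 1)     # the array 1, 2, 3, 4
second = x[1]              # indexing starts at 0, so this is 2

m = np.array([[1, 2, 3],
              [4, 5, 6]])
entry = m[1, 2]            # row 1, column 2, which is 6
first_row = m[0, :]        # a colon selects everything along that axis
last_col = m[:, -1]        # negative indexes count from the end

print(second, entry, first_row, last_col)
```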

As usual, check out Whirlwind for a more comprehensive dive into splitting, slicing, and more operations.

Common methods:

YOUR TURN: I suggest copying (by literally typing, not pasting!) these commands into your personal cheat sheet. During class, I'm just going to quickly read these off, so return to this later.

np.array([user defined list, or lists of lists]) creates an array or matrix
np.ones(how many) and np.ones([rows,cols]), same but all elements are 1
np.zeros(how many) and np.zeros([rows,cols]), same but all elements are 0
np.arange(start,end,stepsize) creates an array; note that the array will not include any elements >=end
np.linspace(from,to,# of elements), creates array covering the range specified
np.eye(#) creates an identity matrix of size #x#
np.concatenate([x, y]) combines arrays x and y
np.nan is a NaN object (e.g. like a missing element in a data table)
np.ceil(#), np.floor(#) if #=3.4, ceil will return 4, and floor will return 3.
np.max(x), np.min(x), np.average(x), np.median(x) and many more statistical operations work as you would expect
np.reshape(x,[rows,cols]) works as it looks
np.random.<dist> can draw random numbers from many distributions
- use tab autocompletion to see all the options (type np.random. and then hit TAB)
- YOU MUST NEVER EVER EVER EVER EVER DRAW RANDOM NUMBERS WITHOUT SETTING A SEED!!!
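A few of the commands listed above in action. This is a quick sketch with arbitrary inputs, meant as a starting point for your cheat sheet rather than a complete tour.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])                # a 2x2 matrix from a list of lists
ones = np.ones([2, 3])                        # 2 rows, 3 columns, all 1s
evens = np.arange(0, 10, 2)                   # 0, 2, 4, 6, 8 (10 is excluded)
grid = np.linspace(0, 1, 5)                   # 5 evenly spaced points from 0 to 1
identity = np.eye(2)                          # 2x2 identity matrix
joined = np.concatenate([evens, evens])       # one array of length 10
reshaped = np.reshape(np.arange(6), [2, 3])   # 6 elements as 2 rows, 3 columns
up, down = np.ceil(3.4), np.floor(3.4)        # 4.0 and 3.0
middle = np.median(evens)                     # 4.0

print(evens, grid, up, down, middle)
```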

ASIDE: ALWAYS SET A SEED!

Why have I emphasized a "seed" so much? Well, suppose you run analysis, and in it, you draw random numbers.

vector = np.random.uniform(0,1,10)
print(vector)

[0.0776388  0.79816138 0.89095822 0.19309575 0.22132198 0.99545593
 0.97664861 0.29155159 0.89413791 0.03145377]

Now we run some analysis that uses that vector...

But then later, someone else tries to replicate your results, so they run

vector = np.random.uniform(0,1,10)
print(vector)

[0.55146202 0.78933851 0.51876759 0.63832625 0.1738114  0.99936084
 0.91071885 0.3878562  0.84329425 0.23836685]

UH-OH. The rest of their analysis will not match yours. Your results can't be replicated.

Possibly more annoying: Every time YOU rerun it, the results will change!

HOW TO SET A SEED:

np.random.seed(100) # the seed is 100
vector = np.random.uniform(0,1,10)
np.random.seed(100)
vector2 = np.random.uniform(0,1,10)
print(vector==vector2)

[ True  True  True  True  True  True  True  True  True  True]

Now, using 100 as the seed, others can replicate your code.
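As an aside, newer versions of numpy (1.17 and later) also offer a generator object that carries its own seed. This is not what we use above; it is sketched here only as an alternative you may run into, and the names rng_a and rng_b are made up for this illustration.

```python
import numpy as np

# Two generators built from the same seed produce the same draws.
rng_a = np.random.default_rng(100)
rng_b = np.random.default_rng(100)

vector_a = rng_a.uniform(0, 1, 10)
vector_b = rng_b.uniform(0, 1, 10)

print(np.array_equal(vector_a, vector_b))
```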

Numpy with pandas

These operations all work on pandas objects. That's the tweet.
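For instance, here is a small sketch applying a few numpy functions to a pandas Series. It assumes pandas is installed, and the Series values are made up for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

roots = np.sqrt(s)      # element-wise, and the result is still a Series
total = np.sum(s)       # reduces the Series to a single number
largest = np.max(s)

print(roots.tolist(), total, largest)
```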

The dark side of vectors and numpy

You can't vectorize every operation :(
Vectorization can be prohibitive, memory-wise: When you run an array operation, Python creates the entire array and puts it into memory, then operates on it. A vector of length 1,000,000,000,000 is huge and requires substantial memory to create. By contrast, you can execute for i in range(1_000_000_000_000): pass without running out of memory (though it will take a long time to finish), because Python never creates that vector, it just iterates over the numbers. This is because range(#) is a "lazy" object: it produces its numbers one at a time instead of storing them all.
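The memory gap can be estimated without actually building the giant array. This is a back-of-the-envelope sketch that assumes 8 bytes per element (a 64-bit number); the exact size reported for the range object depends on your Python build.

```python
import sys

n = 10**12

# A range object stores only its start, stop, and step, however large n is.
lazy_numbers = range(n)
range_bytes = sys.getsizeof(lazy_numbers)

# An array of n 64-bit numbers would need 8 bytes per element.
array_bytes = n * 8

print(range_bytes, "bytes for the range")
print(array_bytes / 1e12, "terabytes for the array")
```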

Before next class

YOUR TURN: Find "01a-numpy-practice.ipynb" worksheet inside the lecture repo, then answer all the questions and place the file inside your participation repo.
Find the last lecture in the GitHub repo, and copy the "Golden Rules" into your navigating_github.md file. Then, at the bottom of that table, add a new row: the first column should say "Randoms" and the second column should say "NEVER DRAW RANDOM NUMBERS WITHOUT A SEED".