import numpy as np # by convention, numpy is always imported with the alias np
import random
Open a folder on your computer with your participation files and open Jupyter to that location.
Outline
- Beware of memory needs and don't overload your computer
- If your code is slow, find the offending line(s), and try to optimize those
- If you draw random numbers, set a seed so that your results can be replicated and don't change every time you rerun the code
- Numpy and basic commands
The issue of memory
- MY MAIN POINT HERE: The files we deal with will get larger as the class moves forward. Typically, a file on the computer hard drive takes up twice as much space in memory (RAM). Be mindful of that and don't load a file that will exceed your computer's available RAM!
- Large file sizes will require us to develop workarounds, both in terms of how they are loaded and used, and how they are shared. We will deal with these later. If you run into issues with memory constraints before we cover it more formally, raise the issue with me before/during/after class, with your TA or me during office hours, or on the GitHub discussion repo.
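As a rough illustration of the rule of thumb above, the sketch below compares a file's size on disk with a doubled estimate of its footprint in memory. The file name demo.csv and the 2x multiplier are assumptions for illustration only; actual memory use depends on the file type and how it is loaded.

```python
import os
import tempfile

import numpy as np

# Hypothetical example file: write 100,000 random numbers to a temporary CSV.
rng = np.random.default_rng(0)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.csv")
np.savetxt(path, rng.uniform(0, 1, 100_000))

size_on_disk = os.path.getsize(path)   # bytes on the hard drive
estimated_ram = 2 * size_on_disk       # rule of thumb from above, not a guarantee

print(f"On disk: {size_on_disk / 1e6:.1f} MB")
print(f"Rough RAM estimate: {estimated_ram / 1e6:.1f} MB")
```

Compare the estimate against the RAM your computer has free before loading a large file.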
The issue of speed
Python's biggest strength is the manner in which we write its code. We write plain statements like for i in range(10): print(i). This makes it easy and fast to write and debug code. We don't explicitly write many particulars that the computer needs to know to execute code (e.g. what type is this variable, what operations should I call for that type, how much memory do I allocate for this process?). This is why Python is a "high level" language - it is written at a level far from the computer's language (binary).
The main cost of this is that Python code can be sloooooooooow to execute. But remember that total time = programming time + executing time:
- Python, as I mentioned above, tends to minimize programming time relative to other languages.
- Executing time is, to an approximation, free: Start executing slow code and go see a movie, sleep through the night, or whatever! There is nothing better than having fun and being able to simultaneously claim "you're working".
- Usually, your code is slow because of one or only a few lines of code. If the code must be sped up, we can identify the culprit and apply one of a few fixes.
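One simple way to find the culprit is to time each step separately. The sketch below is only an illustration: the helper timed and the three-step "analysis" are made up for this example, and time.perf_counter from the standard library does the timing.

```python
import time

import numpy as np

def timed(label, func):
    """Run func(), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} seconds")
    return result, elapsed

np.random.seed(0)  # always set a seed before drawing random numbers

# Hypothetical three-step analysis; each step is timed on its own.
data, t_draw = timed("draw", lambda: np.random.uniform(0, 1, 100_000))
squares, t_square = timed("square (slow loop)", lambda: [v**2 for v in data])
total, t_sum = timed("sum", lambda: sum(squares))
```

Whichever step prints the largest time is the one worth optimizing first.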
An example that shows why Python is slow is the following:
# a "+" sign isn't just a plus sign
a,b=3, 14
print(a+b)
a,b="py ","is good"
print(a+b)
Python handled that like a pro! The "+" operator saw numbers and added, but then it saw strings and concatenated.
Q: How did it do that? A: The interpreter in each line checked the objects the "+" was operating on and applied the correct method!
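A simplified picture of that check: a + b looks at the type of the objects and calls that type's own "add" method. The sketch below makes the lookup explicit. It leaves out some details of what the interpreter really does, so treat it as an illustration rather than the full mechanism.

```python
a, b = 3, 14
print(type(a), type(b))        # both are int
print(type(a).__add__(a, b))   # same result as a + b

a, b = "py ", "is good"
print(type(a), type(b))        # both are str
print(type(a).__add__(a, b))   # same result as a + b
```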
However, checking objects before every operation involves considerable work:
- If you want to add 10 numbers, that's 9 addition operations, plus each time the code checks both objects involved, which is another 18 operations. So it took 27 steps to add 10 numbers.
- Languages like FORTRAN and C are compiled, and users have to declare that all 10 numbers are numbers, so C code just plows ahead and is done in 9 steps.
- It is even worse than that, because in C, the next number in an array is always the same distance from the prior number (in terms of its physical location in memory). Meanwhile, in Python, the program can't quite assume the next number is the same distance away. These "look ups" can cause huge delays.
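To make the contrast concrete, here is a small sketch comparing a Python list, which can mix types, with a numpy array, which stores one type at a fixed spacing in memory. The variable names are just for illustration.

```python
import numpy as np

py_list = [1, 2.5, "three"]          # a list can mix types, so each element must be checked
print([type(v).__name__ for v in py_list])

arr = np.arange(10, dtype=np.int64)  # every element is a 64-bit integer
print(arr.dtype)      # one type for the whole array
print(arr.itemsize)   # bytes per element
print(arr.strides)    # bytes to step from one element to the next
```

Because every element of the array has the same type and size, compiled code can step through it without checking each element.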
So here is an example: Let's pull random numbers and square them:
%%time
y = 0
for i in range(1000000):
    x = random.uniform(0, 1)  # draw one number
    y += x**2                 # square it and add it to the running total
%%time
x = np.random.uniform(0, 1, 1000000)
y = np.sum(x**2)
Bam! That is muuuuuch faster. I think we earned that five...
Back to our regularly scheduled programming:
Why did that work?
- In the slow version, we effectively sent the computer 3 million commands: draw a number, square it, then add it to the running total, repeated 1 million times.
- In the fast version, we sent those 3 million commands embedded into just 3 array commands: draw an array of numbers, square all the elements of the array, sum the array. The array functions are optimized to use native machine code and don't have to check each element for type.
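The sketch below spells out the three array commands one at a time and checks that they give the same answer as the loop. It uses a smaller array (1,000 draws) and a seed, both chosen just for this illustration.

```python
import numpy as np

np.random.seed(0)                   # set a seed so the draws are reproducible
x = np.random.uniform(0, 1, 1000)   # command 1: draw an array of numbers

# Loop version: square and add one element at a time
y_loop = 0.0
for value in x:
    y_loop += value**2

# Vectorized version
squared = x**2                      # command 2: square every element
y_vec = np.sum(squared)             # command 3: sum the array

print(np.isclose(y_loop, y_vec))    # the two versions agree up to rounding
```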
Verdict: arrays and vectorization are great*! And in Python, we implement them using numpy.
So let's learn some numpy!
Numpy - A veeeery quick guide
(HINT: Jupyter's "Help" menu has a link to Numpy documentation.)
You can access an element of a numpy array just like a list: x=np.arange(1,5,1); x[1]. If the array is a matrix, x[row,col] works.
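A quick sketch of both kinds of indexing. The matrix m is made up for this example.

```python
import numpy as np

x = np.arange(1, 5, 1)   # array([1, 2, 3, 4])
print(x[1])              # second element, since indexing starts at 0 → 2

m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m[1, 0])           # row 1, column 0 → 4
print(m[0, 2])           # row 0, column 2 → 3
```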
As usual, check out Whirlwind for a more comprehensive dive into splitting, slicing, and more operations.
Common methods:
YOUR TURN: I suggest copying (by literally typing, not pasting!) these commands into your personal cheat sheet. During class, I'm just going to quickly read these off, so return to this later.
- np.array([user defined list, or lists of lists]) creates an array or matrix
- np.ones(how many) and np.ones([rows,cols]), same but all elements are 1
- np.zeros(how many) and np.zeros([rows,cols]), same but all elements are 0
- np.arange(start,end,stepsize), creates array, note that the array will not include any elements >= end
- np.linspace(from,to,# of elements), creates array covering the range specified
- np.eye(#) creates an identity matrix of size # x #
- np.concatenate([x, y]) combines arrays x and y
- np.nan is a NaN object (e.g. like a missing element in a data table)
- np.ceil(#), np.floor(#): if #=3.4, ceil will return 4, and floor will return 3.
- np.max(x), np.min(x), np.average(x), np.median(x) and many more statistical operations work as you would expect
- np.reshape(x,[rows,cols]) works as it looks
- np.random.<dist> can draw random numbers from many distributions
  - use tab autocompletion to see all the options (type np.random. and then hit TAB)
  - YOU MUST NEVER EVER EVER EVER EVER DRAW RANDOM NUMBERS WITHOUT SETTING A SEED!!!
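Here is a short sketch that tries several of these commands on small, made-up inputs so you can see what they return.

```python
import numpy as np

print(np.arange(0, 10, 2))        # [0 2 4 6 8], 10 itself is excluded
print(np.linspace(0, 1, 5))       # 5 evenly spaced values from 0 to 1, both ends included
print(np.eye(2))                  # 2x2 identity matrix

x = np.array([1, 2, 3])
y = np.ones(3)
z = np.concatenate([x, y])        # six elements: 1, 2, 3, 1, 1, 1
print(z)
print(np.reshape(z, [2, 3]))      # same values arranged as 2 rows by 3 columns

print(np.ceil(3.4), np.floor(3.4))          # 4.0 3.0
print(np.max(z), np.min(z), np.median(z))   # 3.0 1.0 1.0
```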
ASIDE: ALWAYS SET A SEED!
Why have I emphasized a "seed" so much? Well, suppose you run analysis, and in it, you draw random numbers.
vector = np.random.uniform(0,1,10)
print(vector)
Now we run some analysis that uses that vector...
But then later, someone else tries to replicate your results, so they run
vector = np.random.uniform(0,1,10)
print(vector)
UH-OH. The rest of their analysis will not match yours. Your results can't be replicated.
Possibly more annoying: Every time YOU rerun it, the results will change!
HOW TO SET A SEED:
np.random.seed(100) # the seed is 100
vector = np.random.uniform(0,1,10)
np.random.seed(100)
vector2 = np.random.uniform(0,1,10)
print(vector==vector2)
Now, using 100 as the seed, others can replicate your results.
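An aside on an alternative: newer versions of numpy also offer np.random.default_rng, which creates a seeded generator object instead of setting a global seed. The sketch below assumes a numpy version recent enough to include it. Note that its draws will not match the numbers from np.random.seed(100), even with the same seed value.

```python
import numpy as np

rng1 = np.random.default_rng(100)   # generator seeded with 100
vector = rng1.uniform(0, 1, 10)

rng2 = np.random.default_rng(100)   # a second generator with the same seed
vector2 = rng2.uniform(0, 1, 10)

print(np.array_equal(vector, vector2))   # True: same seed, same draws
```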
The dark side of vectors and numpy
- You can't vectorize every operation :(
- Can be prohibitive, memory-wise: When you run an array operation, Python creates the entire array and puts it into memory, then runs it. A vector of length 1,000,000,000,000 is huge and requires substantial memory to create. By contrast, you can execute for i in range(1_000_000_000_000): pass without running out of memory (though the loop itself would take a long time to finish), because Python never creates that vector, it just iterates over the numbers one at a time. This is because range(#) is a lazy object that produces each number on demand, much like a "generator", rather than storing them all.
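A small sketch of the difference, without actually running the trillion-step loop or building the array. The 8 bytes per element assumes numpy's default float64 type.

```python
import sys

n = 1_000_000_000_000

r = range(n)
print(len(r))             # the range knows its length without storing the numbers
print(sys.getsizeof(r))   # a few dozen bytes, regardless of n

# A float64 array needs 8 bytes per element, so a vector of length n would need:
bytes_needed = n * 8
print(f"{bytes_needed / 1e12:.0f} TB")   # 8 TB
```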
Before next class
- YOUR TURN: Find the "01a-numpy-practice.ipynb" worksheet inside the lecture repo, then answer all the questions and place the file inside your participation repo.
- Find the last lecture in the GitHub repo, and copy the "Golden Rules" into your navigating_github.md file. Then, at the bottom of that table, add a new row: the first column should say "Randoms" and the second column should say "NEVER DRAW RANDOM NUMBERS WITHOUT A SEED".