NumPy

NumPy, short for "Numerical Python," is a fundamental library for scientific computing in Python. It's a favorite among programmers because it makes complex tasks simple. Since Python is a high-level language, we don't need to worry about manually allocating memory. Low-level languages, on the other hand, require us to define memory allocation and processing, which offers more control but can slow down our programming. NumPy offers a balance: its core routines run as compiled code, so we get fast processing without the hassle of manual allocation.

Introduction to Ndarrays

In programming, an array describes a collection of elements, similar to a list. The word n-dimensional refers to the fact that ndarrays can have one or more dimensions. For now, we'll start by working with one-dimensional (1D) ndarrays.

import numpy as np

Next, we'll learn how to create a 1D ndarray by directly converting a list to an ndarray using the numpy.array() constructor. Here's an example of how we can create a 1D ndarray:

data_ndarray = np.array([5, 10, 15, 20])

Benefits of Ndarrays and NumPy

Ndarrays and the NumPy library simplify data manipulation and analysis. Let's dive into why they are more efficient than using standard Python.

In standard Python, we might use a list of lists to represent datasets. While this works for small datasets, it's not ideal for larger ones.

Consider an example with two columns of data, where each row has two numbers to be added. In standard Python, we could store the data using a list of lists structure and employ a for loop to iterate over it, extract the two values, sum them, and append the result to a new list called sums:
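A minimal sketch of that loop, assuming eight rows of made-up values stored in a list of lists named my_numbers (the name and the numbers are illustrative, not taken from a real dataset):

```python
# Eight rows of two numbers each; the values are made up for illustration.
my_numbers = [
    [6, 5],
    [1, 3],
    [5, 6],
    [1, 4],
    [3, 7],
    [5, 8],
    [3, 5],
    [8, 4],
]

sums = []
for row in my_numbers:
    # extract the two values from the row and add them
    row_sum = row[0] + row[1]
    # append the result to the new list
    sums.append(row_sum)

print(sums)
```

The loop body runs once per row, so the number of additions grows with the number of rows.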

Python compiles our code into bytecode, and during each iteration the interpreter executes instructions that tell our computer's processor to add one pair of numbers.

For our example, the computer would need eight processor cycles to process the eight rows of data.

The NumPy library, on the other hand, leverages a processor feature called Single Instruction Multiple Data (SIMD) for faster data processing. SIMD enables a processor to execute the same operation on multiple data points in a single cycle:

Consequently, in this simplified example, NumPy requires only two processor cycles, a quarter of what the loop needs. This technique of replacing for loops with operations applied to multiple data points at once is called vectorization, and ndarrays make it possible.

Two-Dimensional Ndarrays

Ndarrays can also be two-dimensional:

import csv
import numpy as np

# import nyc_taxis.csv as a list of lists
with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for element in row:
        converted_row.append(float(element))
    converted_taxi_list.append(converted_row)

# convert the list of lists to a 2D ndarray
taxi = np.array(converted_taxi_list)

Printing data

print(taxi)


[[  2016      1      1 ...  11.65  69.99      1]
 [  2016      1      1 ...      8   54.3      1]
 [  2016      1      1 ...      0   37.8      2]
 ...
 [  2016      6     30 ...      5  63.34      1]
 [  2016      6     30 ...   8.95  44.75      1]
 [  2016      6     30 ...      0  54.84      2]]

The ellipses (...) between rows and columns indicate that there is more data in our NumPy ndarray than can easily be printed.

Finding the Shape of an Array

It's often useful to know the dimensions (number of rows and columns) of an ndarray. When we can't easily print the entire ndarray, we can use the ndarray.shape attribute instead:

data_ndarray = np.array([[5, 10, 15], 
                         [20, 25, 30]])
print(data_ndarray.shape)

(2, 3)

This output, which is a tuple, gives us a couple of important pieces of information:

The first number tells us that there are two rows in data_ndarray.
The second number tells us that there are three columns in data_ndarray.

Remember that tuples are similar to Python lists, but they cannot be modified.

Selecting and Slicing Rows from Ndarrays

Selecting rows in ndarrays is quite similar to selecting data from lists of lists. However, when working with ndarrays, we have a more convenient way to select data using the following syntax:

# select all columns for a given set of rows
ndarray[row_index] 

# select particular columns for a given set of rows
ndarray[row_index, column_index]

Here, row_index specifies the location along the row axis, and column_index specifies the location along the column axis. These can be single index values, a list of index values, or slices.

Keep in mind that, as with lists, array slicing starts at the first specified index and goes up to, but does not include, the second specified index. So, if we want to select the elements at index 1, 2, and 3, we should use the slice [1:4].

Here's how we select a single element from a 2D ndarray:

Notice the difference: when working with a list of lists, we use two separate pairs of square brackets back-to-back, whereas with a NumPy ndarray, we use a single pair of brackets with comma-separated row and column locations.
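The difference can be sketched with a small example. The names data_list and data_ndarray and their values are assumptions chosen only for illustration:

```python
import numpy as np

# Small made-up dataset used only for illustration.
data_list = [[5, 10, 15],
             [20, 25, 30],
             [35, 40, 45]]
data_ndarray = np.array(data_list)

# List of lists: two pairs of brackets back-to-back (row first, then column).
from_list = data_list[1][2]

# ndarray: one pair of brackets with comma-separated row and column locations.
from_ndarray = data_ndarray[1, 2]

print(from_list, from_ndarray)

# Row slicing: start at index 1 and stop before index 3.
rows_1_and_2 = data_ndarray[1:3]
print(rows_1_and_2.shape)
```

Both selections return the value in the row at index 1 and the column at index 2.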

Selecting Columns and Custom Slicing Ndarrays

Now, let's see how we can easily select one or more columns of data using ndarrays compared to the same task for lists of lists:

With a list of lists, we typically need to use a for loop to extract specific column(s) and append them to a new list. But with ndarrays, we can achieve this more efficiently! We use single brackets with comma-separated row and column locations. We can use a colon (:) for the row locations, which gives us all of the rows. Also, when selecting multiple rows or columns that aren't consecutive, we place our indices within square brackets to create a list of the indices of interest to us.

If we want to select a partial 1D slice of a row or column, we can combine a single value for one dimension with a slice for the other dimension:

And if we're looking to select a 2D slice, we can use slices or lists for either or both dimensions:
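The selection patterns described above can be sketched on a small array. The array arr is hypothetical; each value encodes its own position (row * 10 + column) so the results are easy to check:

```python
import numpy as np

# Made-up 4x5 array; each value is row * 10 + column.
arr = np.array([[ 0,  1,  2,  3,  4],
                [10, 11, 12, 13, 14],
                [20, 21, 22, 23, 24],
                [30, 31, 32, 33, 34]])

# All rows (colon) for a single column: a 1D result.
one_column = arr[:, 2]

# All rows for non-consecutive columns: the indices go in a list.
some_columns = arr[:, [0, 3]]

# Partial 1D slices: a single value for one dimension, a slice for the other.
row_slice = arr[1, 2:5]
col_slice = arr[0:3, 4]

# 2D slice: slices for both dimensions.
block = arr[1:3, 0:2]

print(one_column)
print(some_columns)
print(row_slice)
print(col_slice)
print(block)
```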

From the provided taxi ndarray:

Select every row for the columns at indices 1, 4, and 7. Assign the result to columns_1_4_7.
Select the columns at indices 5 to 8 inclusive for the row at index 99. Assign the result to row_99_columns_5_to_8.
Select the rows at indices 100 to 200 inclusive for the column at index 14. Assign the result to rows_100_to_200_column_14.

columns_1_4_7 = taxi[:, [1, 4, 7]]
row_99_columns_5_to_8 = taxi[99, 5:9]
rows_100_to_200_column_14 = taxi[100:201, 14]

columns_1_4_7
ndarray(<class 'numpy.ndarray'>)
array([[ 1, 0, 21], [ 1, 0, 16.29], [ 1, 0, 12.7], ..., [ 6, 5, 17.48], [ 6, 5, 12.76], [ 6, 5, 17.54]])

row_99_columns_5_to_8
ndarray(<class 'numpy.ndarray'>)
array([ 2, 4, 20.91, 1744])

rows_100_to_200_column_14
ndarray(<class 'numpy.ndarray'>)
array([ 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 4, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2])

Vector Operations

NumPy ndarrays not only make selecting data much easier, they also allow us to perform vectorized operations more efficiently. Vectorized operations apply to multiple data points at once, making them faster than traditional loops.

Consider our previous example of adding two columns of data. With our data in a list of lists, we'd have to construct a for loop, add each pair of values from each row individually, and append the results to a new list. With ndarrays, we can instead convert the data and add the two columns directly:

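# a list of lists with two columns (values are illustrative)
my_numbers = [[6, 5], [1, 3], [5, 6], [1, 4], [3, 7], [5, 8], [3, 5], [8, 4]]

# convert the list of lists to an ndarray
my_numbers = np.array(my_numbers)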

# create 1D ndarrays by selecting each of the columns
col1 = my_numbers[:, 0]
col2 = my_numbers[:, 1]

# add the two ndarrays element-wise to get the sums
sums = col1 + col2

We could even do it in a single line of code:

sums = my_numbers[:, 0] + my_numbers[:, 1]

Some key takeaways from this code:

We used the syntax ndarray_name[:, c] to select each column, where c is the column index. The colon (:) selects all rows.
To add the two 1D ndarrays element-wise, col1 and col2, we simply use the addition operator (+) between them.

Adding two 1D ndarrays element-wise results in a 1D ndarray of the same shape (i.e., they have the same dimensions) as the originals. In this context, ndarrays can also be called vectors — a term from linear algebra. Adding two vectors together is known as vector addition.

Here are some of the basic arithmetic operations we can use with vectors:

vector_a + vector_b — addition
vector_a - vector_b — subtraction
vector_a * vector_b — multiplication
vector_a / vector_b — division

Keep in mind that when performing these operations on two 1D vectors, they must have the same shape since all of these operations are performed element-wise.

To find the minimum value of a 1D ndarray, we can use the handy vectorized ndarray.min() method, like this:

mph_min = trip_mph.min()
print(mph_min)

0.0

Surprisingly, the minimum value in our trip_mph ndarray is 0.0, indicating a trip that didn't travel any distance at all!

NumPy ndarrays come with a variety of useful methods for calculations, such as ndarray.min(), ndarray.max(), ndarray.mean(), and ndarray.sum().

You can explore the full list of ndarray methods in the NumPy ndarray documentation.

It's essential to familiarize yourself with the documentation because remembering the syntax for every variation of every data science library is quite a challenge! However, if you know what's possible and can read the documentation, you'll always be able to refresh your memory when needed.

When you see the syntax ndarray.method_name(), substitute ndarray with the name of your ndarray (in this case, trip_mph) like this:
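As a minimal sketch of that substitution, assuming a small hypothetical trip_mph array (the values below are made up, not taken from the taxi data):

```python
import numpy as np

# Hypothetical speeds in miles per hour, for illustration only.
trip_mph = np.array([37.1, 38.6, 0.0, 32.4, 25.2])

# ndarray.min() becomes trip_mph.min(), and so on for the other methods.
mph_min = trip_mph.min()
mph_max = trip_mph.max()
mph_mean = trip_mph.mean()
mph_sum = trip_mph.sum()

print(mph_min, mph_max, mph_mean, mph_sum)
```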

Functions vs Methods

Functions are standalone pieces of code that typically take an input, perform some processing, and return some output. For example, the len() function calculates the length of a list or the number of characters in a string.

In contrast, methods are special functions associated with a specific type of Python object. For example, the list.append() method adds an item to the end of a Python list. Since the append() method is defined for list objects but not for string objects, using this method on a string object results in an error:
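For illustration, a short sketch of that error. The variable names my_list and my_string are assumptions, and the try/except is there only so the script keeps running after the failure:

```python
my_list = [21, 14, 91]
my_list.append(5)  # works: append() is defined for list objects
print(my_list)

my_string = "hello"
try:
    my_string.append("!")  # fails: str objects have no append() method
except AttributeError as error:
    error_message = str(error)
    print(error_message)
```

Without the try/except, the second call would stop the program with an AttributeError.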

In NumPy, some operations are available as both methods and functions, which can be confusing. Here are a few examples:

| Calculation | Function Implementation | Method Implementation |
| --- | --- | --- |
| Calculate the minimum value of trip_mph | np.min(trip_mph) | trip_mph.min() |
| Calculate the maximum value of trip_mph | np.max(trip_mph) | trip_mph.max() |
| Calculate the mean value of trip_mph | np.mean(trip_mph) | trip_mph.mean() |
| Calculate the median value of trip_mph | np.median(trip_mph) | There is no ndarray median method |

The following will help you remember the correct terminology to use:

Function calls usually start with the library name or its alias (e.g., np.mean()).
Method calls begin with an object or variable name from a particular class (e.g., trip_mph.mean()).

In Python, whether to use a function or a method for an operation depends on the context and the object being used. In general, if a particular operation is directly related to a particular object or data type, it is better to use a method. On the other hand, if the operation is not directly related to a specific object or data type, it is better to use a function. That being said, there is no hard and fast rule about whether to use a function or a method, and it ultimately comes down to personal preference and readability of the code.

When working with a 2D ndarray, using the ndarray.max() method without any additional parameters returns a single value like we saw with a 1D ndarray, representing the overall maximum:

But what if we want to find the maximum value of each row? We can use the axis parameter and set it to 1 to find the maximum value for each row:

Similarly, we set axis to 0 to find the maximum value of each column:
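The three cases above (no axis, axis=1, and axis=0) can be sketched on a small array. The name ndarray_2d and its values are assumptions for illustration:

```python
import numpy as np

# Made-up 3x3 array for illustration.
ndarray_2d = np.array([[5, 10, 15],
                       [20, 3, 30],
                       [1, 40, 2]])

# No axis argument: a single value, the overall maximum.
overall_max = ndarray_2d.max()

# axis=1: one maximum per row.
row_max = ndarray_2d.max(axis=1)

# axis=0: one maximum per column.
col_max = ndarray_2d.max(axis=0)

print(overall_max)
print(row_max)
print(col_max)
```

The same axis parameter is available on other methods such as sum(), which the following example uses to total the fare components of each row.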

# extract the first 5 rows only
taxi_first_five = taxi[:5]
# select columns: fare_amount, fees_amount, tolls_amount, and tip_amount
fare_components = taxi_first_five[:, 9:13]


# sum the fare component columns
fare_sums = fare_components.sum(axis=1)

# select the total_amount column
fare_totals = taxi_first_five[:, 13]

# compare the summed columns to fare_totals
print(fare_totals)
print(fare_sums)

Output
[ 69.99   54.3   37.8  32.76   18.8]
[ 69.99   54.3   37.8  32.76   18.8]

Loading a File Directly with NumPy

NumPy contains many helpful functions that make loading data directly into an ndarray much easier. In the exercise below, we'll use NumPy's numpy.genfromtxt() function to load our taxi data directly into an ndarray.

Here's the simplified syntax for using the numpy.genfromtxt() function to load a text file:

import numpy as np

np.genfromtxt(filename, delimiter=None)

filename: a positional argument; path (usually defined as a string) to the text file.
delimiter: a named argument; the string used to separate each value in the text file. For CSV files, we use a comma, passed as a string (','), as the delimiter.

To read and load a file named data.csv into an ndarray variable called data, we'd use the following syntax:

data = np.genfromtxt('data.csv', delimiter=',')

Example

taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',')
taxi_shape = taxi.shape

Output
taxi
ndarray(<class 'numpy.ndarray'>)
array([[ nan, nan, nan, ..., nan, nan, nan], [ 2016, 1, 1, ..., 11.65, 69.99, 1], [ 2016, 1, 1, ..., 8, 54.3, 1], ..., [ 2016, 6, 30, ..., 5, 63.34, 1], [ 2016, 6, 30, ..., 8.95, 44.75, 1], [ 2016, 6, 30, ..., 0, 54.84, 2]])

taxi_shape
tuple(<class 'tuple'>)
(2014, 15)

We also noticed something interesting about the first row of taxi: it's full of something called nan (or NaN) values.

print(taxi[0])

[   nan    nan    nan    nan    nan    nan    nan    nan    nan    nan
    nan    nan    nan    nan    nan]

We can use the ndarray.dtype attribute to see the internal data type that NumPy chose when creating our taxi ndarray:

print(taxi.dtype)

float64

NumPy chose the float64 data type since it allows the values from our CSV file to be accurately represented. You can think of NumPy's float64 type as identical to Python's float type, with the 64 referring to the number of bits used to store the value.

Going back to how we opened the file with the csv module earlier, notice that before we converted all the numerical string values to floats, we removed the header row that contained the column names. When we loaded the same file using numpy.genfromtxt(), we didn't take any special steps to deal with that header row. If you're thinking that's why we're seeing NaN values now, you're absolutely right!

NumPy determined that the data stored in the CSV file would be best represented using the float64 data type. When it tried to convert the column headers (stored as strings) in the first row into float64 values, it didn't know how to do that so it converted them to NaN values instead.

NaN stands for Not a Number and indicates that the underlying value cannot be represented as a number. It's similar to Python's None constant and it is often used to represent missing values in datasets. In our case, the NaN values appear because the first row of our CSV file contains column names, which NumPy can't convert to float64 values. The solution is simple enough: we need to remove that problematic row!

To remove the header row from our ndarray, we can use a slice, just like with a list of lists:

taxi = taxi[1:]

Alternatively, we can avoid getting NaN values by skipping the header row(s) when loading the data. We do this by passing an additional argument, skip_header=1, to our call to the numpy.genfromtxt() function. The skip_header argument accepts an integer — the number of rows from the start of the file to skip. Remember that this integer measures the total number of rows to skip and doesn't use index values. To skip the first row, use a value of 1, not 0.
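Both approaches can be compared in a short sketch. To keep it self-contained, it reads from a tiny in-memory stand-in for a CSV file (the column names and values in csv_text are made up) instead of nyc_taxis.csv:

```python
import io
import numpy as np

# Tiny in-memory stand-in for a CSV file with a header row (made-up data).
csv_text = (
    "pickup_year,pickup_month,fare_amount\n"
    "2016,1,52.0\n"
    "2016,1,45.0\n"
    "2016,6,9.5\n"
)

# Option 1: load everything, then slice off the NaN header row.
taxi_with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',')
taxi_header_removed = taxi_with_header[1:]

# Option 2: skip the first row of the file while loading.
taxi_header_skipped = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                                    skip_header=1)

print(taxi_header_removed.shape, taxi_header_skipped.shape)
print(np.array_equal(taxi_header_removed, taxi_header_skipped))
```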

Based on their shapes and their first five rows, there is no difference between taxi_header_removed and taxi_header_skipped! For us, this means we can use either method for dealing with NaN values caused by the header row. That said, it's best practice to handle the header row while loading in the data rather than after.

The Boolean (or bool) type is a built-in Python type that can have one of two unique values:

True
False

We often create Boolean arrays with Python comparison operators. The comparison operators in Python are:

== — equal to
> — greater than
>= — greater than or equal to
< — less than
<= — less than or equal to
!= — not equal to

Boolean comparisons are expressions that evaluate to either True or False in Python. These expressions typically involve comparing two values using comparison operators.

Now, let's look at what happens when we perform a vectorized Boolean operation between an ndarray and a single value:

print(np.array([2, 4, 6, 8]) < 5)

[ True  True False False]

Each value in the array is compared to 5. If the value is less than 5, True is returned. Otherwise, False is returned.

Example

Use vectorized Boolean operations to do the following:
- Evaluate whether the elements in array a are less than 3. Assign the result to a_bool.
- Evaluate whether the elements in array b are equal to "blue". Assign the result to b_bool.
- Evaluate whether the elements in array c are greater than 100. Assign the result to c_bool.

a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])
a_bool = a < 3
b_bool = b == "blue"
c_bool = c > 100

a_bool
ndarray(<class 'numpy.ndarray'>)
array([ True, True, False, False, False])

b_bool
ndarray(<class 'numpy.ndarray'>)
array([ True, True, False, True])

c_bool
ndarray(<class 'numpy.ndarray'>)
array([False, True, False, True])

Boolean Indexing with 1D ndarrays

To use Boolean indexing, simply insert the Boolean array into the square brackets like we would with other selection techniques:
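A minimal sketch of this, reusing the array c and the comparison c > 100 from the earlier example (the name result is an assumption):

```python
import numpy as np

c = np.array([80.0, 103.4, 96.9, 200.3])

# Boolean array: True where the value is greater than 100.
c_bool = c > 100

# Insert the Boolean array into the square brackets to keep only the True positions.
result = c[c_bool]

print(c_bool)
print(result)
```

Only the values at positions where c_bool is True appear in the result.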

Example

Calculate the number of rides in the taxi ndarray that are from February:
- Create a Boolean array, february_bool, that evaluates whether the items in pickup_month are equal to 2.
- Use the february_bool Boolean array to index pickup_month. Assign the result to february.
- Use the .shape attribute to find the number of rides in february. Assign the result to february_rides.

pickup_month = taxi[:, 1]
february_bool = pickup_month == 2
print(type(february_bool))
february = pickup_month[february_bool]
february_rides = february.shape[0]

Boolean Indexing with 2D Ndarrays

When working with 2D ndarrays, we can combine Boolean indexing with other indexing methods for even more powerful data selection. Just remember that the Boolean array must have the same length as the dimension we're indexing.

Since a Boolean array doesn't store information about its origin, we can use it to index the entire array, even if it's created from just one column.
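As a sketch of these points, the example below builds a Boolean array from one column and uses it to index whole rows. The array arr and all variable names are assumptions for illustration:

```python
import numpy as np

# Made-up 4x3 array for illustration.
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]])

# Boolean array built from the first column; its length (4) matches the number of rows.
bool_rows = arr[:, 0] > 5

# Use it as the row index to select entire rows.
selected_rows = arr[bool_rows]

# Combine it with a column slice.
selected_cols = arr[bool_rows, 1:]

# A Boolean array of length 3 matches the number of columns instead.
bool_cols = np.array([True, False, True])
first_and_last = arr[:, bool_cols]

print(selected_rows)
print(selected_cols)
print(first_and_last)
```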

Assigning Values in Ndarrays

What if we want to modify the data, not just retrieve it? As luck would have it, we can assign values to ndarrays using the indexing techniques we've already learned! Here's the syntax (in pseudocode) to keep in mind:

ndarray[location_of_values] = new_value

Let's see this in action with a 1D array. We can change a value at a specific index location:

a = np.array(['red', 'blue', 'black', 'blue', 'purple'])
a[0] = 'orange'
print(a)

['orange' 'blue' 'black' 'blue' 'purple']

We can also update multiple values at once:

a[3:] = 'pink'
print(a)

['orange' 'blue' 'black' 'pink' 'pink']

Now, let's try with a 2D ndarray. Just like with a 1D ndarray, we can change a specific index location:

ones = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1]])
ones[1, 2] = 99
print(ones)

[[ 1  1  1  1  1]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]

We can also update an entire row...

ones[0] = 42
print(ones)

[[42 42 42 42 42]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]

...or an entire column:

ones[:, 2] = 0
print(ones)

[[42 42  0 42 42]
 [ 1  1  0  1  1]
 [ 1  1  0  1  1]]

Copy an Array

taxi_copy = taxi.copy()
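The ndarray.copy() method returns a new array with its own data, so assigning values to the copy leaves the original untouched. A minimal sketch, using a small made-up array in place of taxi (the names original and duplicate are assumptions):

```python
import numpy as np

original = np.array([1, 2, 3, 4])

# copy() creates an independent array.
duplicate = original.copy()
duplicate[0] = 99

print(original)
print(duplicate)
```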

Assignment Using Boolean Arrays

We saw how we can use standard indexing and slicing to assign values to our ndarrays. We can also use Boolean arrays for assignment, which is where they are especially useful. Here's an example:

a2 = np.array([1, 2, 3, 4, 5])

a2_bool = a2 > 2

a2[a2_bool] = 99

print(a2)

[ 1  2 99 99 99]

The Boolean array a2_bool acts like a filter, controlling which values are affected by the assignment operation. The other values remain unchanged.

In practice, we often take a "shortcut" and insert the definition of the Boolean array directly into the selection. This "shortcut" is the typical way to implement Boolean indexing.
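A minimal sketch of the shortcut form, using the same a2 array as before, with the comparison written directly inside the brackets:

```python
import numpy as np

a2 = np.array([1, 2, 3, 4, 5])

# The comparison a2 > 2 is evaluated first, then used as the index for assignment.
a2[a2 > 2] = 99

print(a2)
```

This produces the same result as creating a2_bool on a separate line.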

Let's move on to 2D Boolean arrays. We'll start by exploring an example together:
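A sketch of such an example, assuming a small 3x3 array named b (the values are made up for illustration):

```python
import numpy as np

b = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# b > 4 has the same 3x3 shape as b; every True position receives the new value.
b[b > 4] = 99

print(b)
```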

The b > 4 Boolean operation above generates a 2D Boolean array, which then determines the values the assignment affects.

But wait, there's more! We can also use a 1D Boolean array to modify a 2D array:

The Boolean operation c[:, 1] > 2 selects the second column of c for comparison and creates a 1D Boolean array based on whether each value in that column is greater than 2. We then use the resulting Boolean array as the row index for assignment and 1 as the column index. This ensures we select the proper rows and columns before we assign the value of 99 to them (c[c[:, 1] > 2, 1] = 99). This way, only the second column can have its values changed, and all other columns remain untouched.
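The assignment described above can be sketched as follows, assuming a small 3x3 array named c with made-up values:

```python
import numpy as np

c = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Rows where the second column is greater than 2, restricted to column index 1.
c[c[:, 1] > 2, 1] = 99

print(c)
```

Only the second column changes, and only in the rows where its original value was greater than 2.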

Challenge: Which Is the Busiest Airport?

Time to test your newfound skills with a thrilling Boolean indexing challenge! In the exercise below, you'll apply the techniques you learned in this lesson to find out which airport in our dataset is the busiest in terms of pickup location.

You'll use Boolean indexing to create three filtered arrays and then compare the number of rows in each array. Don't worry, we have some hints ready in case you need them, but give it a shot without them first! Remember, it's normal for these challenges to take a few tries – working with data is an iterative process!

Let's determine the busiest airport by checking the pickup_location_code column (column index 5) for these specific values:

2 == JFK Airport
3 == LaGuardia Airport
5 == Newark Airport

Happy coding and good luck!

Instructions

Find the number of trips with JFK Airport as the pickup location:
- Use Boolean indexing to select only the rows of taxi where the pickup_location_code corresponds to JFK.
- Assign the resulting filtered array of all JFK Airport pickups to jfk.
- Calculate the number of rows in the jfk array and assign the result to jfk_count.
Find the number of trips with LaGuardia Airport as the pickup location:
- Use Boolean indexing to select only the rows of taxi where the pickup_location_code corresponds to LaGuardia.
- Assign the resulting filtered array of all LaGuardia pickups to laguardia.
- Calculate the number of rows in the laguardia array and assign the result to laguardia_count.
Find the number of trips with Newark Airport as the pickup location:
- Select only the rows of taxi where the pickup_location_code corresponds to Newark, and assign the result to newark.
- Calculate the number of rows in the newark array and assign the result to newark_count.
After running your code, inspect the values for jfk_count, laguardia_count, and newark_count to determine which airport has the most pickups.
- Based on the number of pickups at each airport, assign either "jfk", "laguardia", or "newark" to the variable busiest_airport.

jfk = taxi[taxi[:, 5] == 2]
jfk_count = jfk.shape[0]

laguardia = taxi[taxi[:, 5] == 3]
laguardia_count = laguardia.shape[0]

newark = taxi[taxi[:, 5] == 5]
newark_count = newark.shape[0]

busiest_airport = "laguardia"
