Pandas
Even NumPy has its limitations:
It doesn't support column names, so we must frame questions as multi-dimensional array operations.
It only allows for one data type per ndarray, complicating the handling of mixed numeric and string data.
While it provides many low-level methods, it lacks pre-built methods for some common analysis patterns.
Fear not, for pandas 🐼 comes to the rescue! Pandas is an incredibly versatile and user-friendly Python library designed to make our data exploration and analysis journey both fun and efficient. It gets its name from the econometrics term "panel data." With pandas, we'll be able to easily manipulate, clean, and visualize data, all while enjoying the process. Pandas is not a replacement for NumPy, but rather an extension that builds upon its strengths. Since pandas' underlying code relies heavily on NumPy, our newly acquired skills will be invaluable as we explore this exciting new library.
Introducing the hero of our story: the pandas DataFrame! DataFrames are pandas' answer to NumPy's 2D ndarrays, but with some game-changing enhancements:
Axis values can have string labels, not just numeric ones.
DataFrames can contain columns with multiple data types, including: integer, float, and string.
Feast your eyes on the impressive structure of a pandas DataFrame:
As we explore the versatility of pandas, we'll use a dataset from Fortune magazine's 2017 Global 500 list. This list ranks the top 500 corporations worldwide by revenue. The dataset was initially compiled here, but we've tweaked it to make it more learner-friendly.

Our dataset is a CSV file called f500.csv. To help us understand the data better, here's a handy data dictionary for some of the columns in the CSV:
company: The company's name.
rank: Global 500 rank for the company.
revenues: Company's total revenue for the fiscal year, in millions of dollars (USD).
revenue_change: Percentage change in revenue between the current and prior fiscal year.
profits: Net income for the fiscal year, in millions of dollars (USD).
ceo: Company's Chief Executive Officer.
industry: The company's industry of operation.
sector: Sector in which the company operates.
previous_rank: Global 500 rank for the company for the prior year.
country: Country of the company's headquarters.
When working with pandas, we follow the conventional import method, similar to NumPy (import numpy as np):
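```python
import pandas as pd
```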
Importing CSV File
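We read the f500.csv file into a DataFrame called f500 like this:

```python
f500 = pd.read_csv("f500.csv", index_col=0)  # use the first column of the CSV as the index
f500.index.name = None                       # remove the index's name
```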
The index_col=0 parameter specifies that the first column of the CSV file should be used as the index for the DataFrame. The second line sets the name of the index to None: by default, a DataFrame's index can have a name, and this removes it, leaving the index unnamed.
Checking basic information
Use the DataFrame.head() method to select the first 6 rows. Assign the result to f500_top_6.
Use the DataFrame.tail() method to select the last 8 rows. Assign the result to f500_bottom_8.
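A quick sketch of what that looks like:

```python
f500_top_6 = f500.head(6)     # first 6 rows
f500_bottom_8 = f500.tail(8)  # last 8 rows
```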
Another feature that makes pandas a powerful tool for working with data is that DataFrames can contain more than one data type.
To learn about the types of each column, we can use the DataFrame.dtypes attribute, similar to NumPy's ndarray.dtype attribute. Let's explore an example using a selection of data stored in the variable f500_selection.
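For instance (here, f500_selection is assumed to be a small slice of f500 with just a few rows and columns):

```python
f500_selection.dtypes   # a Series mapping each column name to its dtype
```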
Here, we can see three different data types or dtypes.
You may recognize the float64 dtype from our work in NumPy. Pandas uses NumPy dtypes for numeric columns, including int64. Additionally, there's a type we haven't seen before: object. This is used for columns containing data that doesn't fit into any other dtypes, typically for columns with string values.
When we import data, pandas tries to guess the correct dtype for each column. In general, pandas does a great job, so we don't need to worry about specifying dtypes every time we work with data.
If we want an overview of all the dtypes used in our DataFrame, along with its shape and other information, we can use the DataFrame.info() method. Keep in mind that DataFrame.info() prints the information automatically, without our needing to call the print() function. In fact, it returns the Python None object, so we can't assign the information to a variable like we have been doing.
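```python
f500.info()   # prints shape, column dtypes, and memory usage; returns None
```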
Selecting a Column from a DataFrame by Label
In our previous exercise, we got some information about our data with the useful DataFrame.info() method. We learned the number of rows, number of columns, data types used for each column, and memory usage of our dataset. But, it's time to go deeper into the world of pandas and learn how to select specific data points!
Labels
Pandas labels are our friends. There are two types of labels in pandas: Row Labels and Column Labels. Unlike NumPy, where we needed to know the exact index location, pandas allows us to select data using these friendly labels. The secret weapon? The DataFrame.loc[] attribute! Check out its syntax below:
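```python
df.loc[row_label, column_label]   # rows first, then columns
```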
Notice that we use square brackets ([]), not parentheses (()), when selecting by label with .loc[].
By the way, you'll often see df used as shorthand for a generic DataFrame object in our examples and in the official pandas documentation. When you see it, it simply means that df is a pandas DataFrame object that was created using pandas.DataFrame().
Select a Single Column
Now, let's work through an example together using a slice of our data, stored as f500_selection:
Let's select a single column by specifying a single label:
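A sketch, assuming f500_selection kept the rank column from f500:

```python
rank_col = f500_selection.loc[:, "rank"]   # a Series with the same index as f500_selection
```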
Did you notice we used : to select all rows? And the new Series shares the same row labels (the index) as the original DataFrame!
For an even quicker way to select a single column, we can use this shortcut that doesn't need the .loc attribute or : to select all the rows:
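```python
rank_col = f500_selection["rank"]   # same result as f500_selection.loc[:, "rank"]
```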
Introduction to Series
Whenever we come across a 1D pandas object, it's a Series. And when we see a 2D pandas object, it's a DataFrame.
Think of a DataFrame as a team of Series objects working together, much like how pandas organizes the data behind the scenes.
As we continue our journey into data selection with pandas, keep an eye on which objects are DataFrames and which ones are Series. The functions, methods, attributes, and syntax available to us will vary depending on the type of pandas object we're working with.
Select a List of Specific Columns
To select specific columns, we use a list of labels with .loc[], or directly with double square brackets, like this:
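Again assuming the rank and revenues columns are present:

```python
f500_selection.loc[:, ["rank", "revenues"]]
# or, with the shorthand:
f500_selection[["rank", "revenues"]]
```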
Since the object returned is two-dimensional, we've got a DataFrame, not a Series. As an alternative to using df.loc[:, ["col1", "col2"]] to select a specific list of columns, we can use the shorthand syntax df[["col1", "col2"]] to achieve the same result.
This "double bracket syntax" is often a source of confusion for many learners. Here's a little trick that can help with that. Think of the "first set" of brackets as belonging to the indexing operation, and the "second set" of brackets as defining a list object. Anytime we're making a selection, we'll need that first set of brackets, but that second set will only be necessary if we need to use a list to make our selection. No need for a list? Well, then there's no need to use nested brackets!
Select a Slice of Columns
Now, let's see how to select specific columns using a slice object with labels:
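For example, assuming the columns appear in the same order as the data dictionary above (rank, revenues, revenue_change, profits, ...):

```python
f500_selection.loc[:, "rank":"profits"]   # every column from rank through profits, inclusive
```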
Once again, the object returned is a DataFrame. But notice we get all the columns from the first column label up to and including the last column label in our slice. This is different than most of the slicing we do in Python where the last element is not usually included – pandas does things a little differently! Also, keep in mind that there's no shortcut for selecting column slices; we need to use the .loc attribute and include : for selecting all the rows.
Here's a summary of the techniques we've learned for selecting columns:
| Select by Label | Explicit Syntax | Shorthand Syntax |
| --- | --- | --- |
| Single column | df.loc[:, "col1"] | df["col1"] |
| List of columns | df.loc[:, ["col1", "col7"]] | df[["col1", "col7"]] |
| Slice of columns | df.loc[:, "col1":"col4"] | (no shorthand) |
Now it's time to put these techniques into practice! Let's select specific columns from our f500 DataFrame and continue sharpening our pandas skills.
Selecting Rows from a DataFrame by Label
Now that we've mastered selecting columns using the labels of the Column Axis, let's tackle selecting rows using the labels of the Index Axis:
The syntax for selecting rows from a DataFrame is the same as for columns:
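```python
df.loc[row_label, column_label]   # same pattern as before: rows first, then columns
```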
This syntax works when we want to select specific rows and columns. However, if we want to select all columns for specific rows, we can simplify this syntax a little by using:
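```python
df.loc[row_label]   # pandas fills in the columns for us
```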
This works because when we use this syntax, pandas automatically treats it as df.loc[row_label, :] to select all the columns for us. We'll use this slightly shorter syntax when selecting specific rows for all columns.
Let's continue to work with our selection of data, stored in the variable f500_selection:
Select a Single Row
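A sketch, assuming "Walmart" is one of the row labels in f500_selection:

```python
walmart = f500_selection.loc["Walmart"]   # one row, all columns
```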
Notice that the returned object is a Series because it's one-dimensional. This Series stores integer, float, and string values. Pandas uses the object dtype to accommodate all these values, as none of the numeric types could cater to them all.
Select a List of Specific Rows
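Again assuming "Toyota Motor" and "Walmart" are row labels in f500_selection:

```python
toyota_walmart = f500_selection.loc[["Toyota Motor", "Walmart"]]
```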
Notice how the order of the rows in our selection has been preserved in the resulting DataFrame even though Walmart appears before Toyota Motor in our original f500_selection DataFrame.
Select a Slice of Rows
For row selection using slices, we can use the shortcut below. This is why we couldn't use this shortcut for columns - because it's reserved for use with rows!
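```python
df["row1":"row5"]   # shortcut: a slice of rows, no .loc needed
```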
Notice that we don't need to use the .loc attribute or nested brackets here. Also, the last element in the slice is included, unlike with regular Python slicing. Selecting rows using slices is very clean! Of course, we could use the explicit syntax to select a slice of rows (df.loc["row1":"row5"]), but now that we know a shortcut, we'll want to use it whenever we can!
Shortcut or No Shortcut?
Since they are very similar, it's very easy to confuse the shortcut syntax that doesn't use the .loc attribute with the explicit syntax that does. So how do we keep it all straight?! Well, there is no better way to learn than by practicing! But to set ourselves up for success, it helps to think about how often we'll perform one type of selection over the other.
For instance, are we more likely to select a single column or a single row when working with data? Which one should "get the shortcut?" With a little thought, it's clear that we'll select a single column more often than we'll select a single row, so selecting a column (df["col1"]) gets to use the shortcut syntax, not rows (df.loc["row1"]).
What about selection using a list? Again, it makes more sense for us to use the shortcut syntax for columns (df[["col1", "col5"]]) since we'll do that more often than selecting a list of specific rows (df.loc[["row1", "row5"]]).
But what about slices? This one favors rows since slices are continuous selections with no gaps; we'll want this functionality more often for rows (df["row1": "row5"]) than for columns since rows tend to be organized in some way, such as by date. The order of columns is rarely important, so the shortcut syntax for selection using a slice goes to the rows, not columns (df.loc[:, "col1":"col5"]).
It's important to note that these shortcuts only work if we're selecting either specific rows or specific columns, but not both. If we want to select specific rows and columns, we need to use the .loc attribute.
Here is a summary of the techniques we've learned for selecting rows:
| Select by Label | Explicit Syntax | Shorthand Syntax |
| --- | --- | --- |
| Single row | df.loc["row1"] | (no shorthand) |
| List of rows | df.loc[["row1", "row5"]] | (no shorthand) |
| Slice of rows | df.loc["row1":"row5"] | df["row1":"row5"] |
Series vs DataFrames
On the past couple of screens, we created both Series objects and DataFrame objects as we selected data from our f500 DataFrame. Take a minute to review these examples before we continue:
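For instance (a generic recap using the df placeholder):

```python
df["col1"]              # single column   -> Series (1D)
df.loc["row1"]          # single row      -> Series (1D)
df[["col1", "col7"]]    # list of columns -> DataFrame (2D)
df.loc["row1":"row5"]   # slice of rows   -> DataFrame (2D)
```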

Value Counts Method
Since Series and DataFrames are two distinct types of pandas objects, they each have their own special methods. Let's explore the powerful Series.value_counts() method to see how it counts the occurrences of each unique non-null value in a column. By default, this method will return the results from the most frequent value in the column to the least. Check out the official pandas documentation to learn how to use the sort and ascending parameters to control the sort order.
Series.value_counts() in Action
First, let's select just one column from the f500 DataFrame:
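```python
sectors = f500["sector"]
```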
Now, replace "Series" in Series.value_counts() with the name of our sectors Series from above, and like magic, we get:
value_counts() Method Meets DataFrame
What happens when we try using the value_counts() method directly on a DataFrame object instead of a Series object? Will it work? Let's find out!
First, we'll select the sector and industry columns of f500 to create another DataFrame called sectors_industries:
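```python
sectors_industries = f500[["sector", "industry"]]
```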
Now, let's see if the magic of value_counts() works on a DataFrame:
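```python
sectors_industries.value_counts()   # counts of each (sector, industry) combination
```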
It worked! However, our results from calling value_counts() on a DataFrame object look slightly different than the results we got when we called it on a Series object. This is because the DataFrame.value_counts() method returns a Series object that uses a MultiIndex instead of a single index like we got when using Series.value_counts().
In the example above, the counts returned are based on a combination of the unique non-null values found in both the sector and industry columns. For example, the combination of "Energy" in the sector column and "Petroleum Refining" in the industry column occurs 28 times in our data. However, the combination of "Energy" in the sector column and "Oil & Gas Equipment Services" in the industry column only occurs once.
Although the value_counts() method works on both pandas Series and DataFrame objects, not all methods support both types of objects. When in doubt, we should check the official pandas documentation for Series attributes and methods or for DataFrame attributes and methods to make sure the attribute or method exists for the type of object we're working with.
Selecting Items from a Series by Label
We practiced using the Series.value_counts() method. We found the counts of each unique value in the country column for the entire f500 DataFrame, revealing "USA" as the country with the most companies on the Global 500 list:
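```python
country_counts = f500["country"].value_counts()
```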
What if we want to select the count for a single item, like India, from this Series? Or maybe we want to select the counts for a list of items, like a list of North American countries – how would we do that?
Similar to DataFrames, we can use the Series.loc[] attribute to select items from a Series using single labels, a list of labels, or a slice object. With pandas Series objects being 1D, we can take advantage of the shorthand syntax to omit the .loc attribute and use bracket shortcuts for all three types of selections. Assuming we have a pandas Series object called s, here are the different ways we can select items from it:
| Select by Label | Explicit Syntax | Shorthand Syntax |
| --- | --- | --- |
| Single item | s.loc["item8"] | s["item8"] |
| List of items | s.loc[["item1", "item7"]] | s[["item1", "item7"]] |
| Slice of items | s.loc["item2":"item4"] | s["item2":"item4"] |
Keep in mind that when slicing in pandas, the last item in the slice is included in the results.
Example
Select the item at index label India from the country_counts Series. Assign the result to a new variable india.
Use the print() and type() functions to display the type of variable for india.
Select the items with index labels USA, Canada, and Mexico from the country_counts Series. Assign the result to a new variable north_america.
Use the print() and type() functions to display the type of variable for north_america.
Select the items with index labels from Japan to Spain (inclusive) from the country_counts Series. Assign the result to a new variable japan_to_spain.
Use the print() and type() functions to display the type of variable for japan_to_spain.
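A minimal sketch of one possible solution:

```python
india = country_counts.loc["India"]
print(type(india))

north_america = country_counts[["USA", "Canada", "Mexico"]]
print(type(north_america))

japan_to_spain = country_counts["Japan":"Spain"]
print(type(japan_to_spain))
```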
To recap, here's a combined summary of all the selection techniques we've learned so far:

| Select by Label | Explicit Syntax | Shorthand Syntax |
| --- | --- | --- |
| Single column from DataFrame | df.loc[:, "col1"] | df["col1"] |
| List of columns from DataFrame | df.loc[:, ["col1", "col7"]] | df[["col1", "col7"]] |
| Slice of columns from DataFrame | df.loc[:, "col1":"col4"] | (no shorthand) |
| Single row from DataFrame | df.loc["row1"] | (no shorthand) |
| List of rows from DataFrame | df.loc[["row1", "row5"]] | (no shorthand) |
| Slice of rows from DataFrame | df.loc["row1":"row5"] | df["row1":"row5"] |
| Single item from Series | s.loc["item8"] | s["item8"] |
| List of items from Series | s.loc[["item1", "item7"]] | s[["item1", "item7"]] |
| Slice of items from Series | s.loc["item2":"item4"] | s["item2":"item4"] |
The general syntax we use for all types of selections is:
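```python
df.loc[row_selection, column_selection]
```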
For example, if we want to select a slice of rows across a list of columns, the syntax we use looks like this:
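```python
df.loc["row1":"row5", ["col1", "col7"]]
```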
And if we want to select a single row across a slice of columns, the syntax we use looks like this:
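```python
df.loc["row1", "col1":"col4"]
```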
Deleting Columns from a DataFrame
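Putting the pieces described below together, the call looks something like this (assuming traffic is an existing DataFrame that contains those two columns):

```python
incidents = traffic.drop(['Hour (Coded)', 'Slowness in traffic (%)'], axis=1)
```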
traffic.drop(...): uses pandas' .drop() method to remove columns or rows from the DataFrame.
['Hour (Coded)', 'Slowness in traffic (%)']: the list of column names you want to drop (i.e., remove) from the DataFrame.
axis=1: specifies that you're dropping columns, not rows (axis=0 for rows, axis=1 for columns).
incidents = ...: assigns the new DataFrame (with the specified columns removed) to the variable incidents. The original traffic DataFrame remains unchanged unless you use inplace=True.