Pandas
Even NumPy has its limitations:
It doesn't support column names, so we must frame questions as multi-dimensional array operations.
It only allows for one data type per ndarray, complicating the handling of mixed numeric and string data.
While it provides many low-level methods, it lacks pre-built methods for some common analysis patterns.
Fear not, for pandas 🐼 comes to the rescue! Pandas is an incredibly versatile and user-friendly Python library designed to make our data exploration and analysis journey both fun and efficient. It gets its name from the econometrics term "panel data." With pandas, we'll be able to easily manipulate, clean, and visualize data, all while enjoying the process. Pandas is not a replacement for NumPy, but rather an extension that builds upon its strengths. Since pandas' underlying code relies heavily on NumPy, our newly acquired skills will be invaluable as we explore this exciting new library.
Introducing the hero of our story: the pandas DataFrame! DataFrames are pandas' answer to NumPy's 2D ndarrays, but with some game-changing enhancements:
Axis values can have string labels, not just numeric ones.
DataFrames can contain columns with multiple data types, including: integer, float, and string.
Feast your eyes on the impressive structure of a pandas DataFrame:
As we explore the versatility of pandas, we'll use a dataset from Fortune magazine's 2017 Global 500 list. This list ranks the top 500 corporations worldwide by revenue. The dataset was initially compiled here, but we've tweaked it to make it more learner-friendly.

Our dataset is a CSV file called f500.csv. To help us understand the data better, here's a handy data dictionary for some of the columns in the CSV:
company: The company's name.
rank: Global 500 rank for the company.
revenues: Company's total revenue for the fiscal year, in millions of dollars (USD).
revenue_change: Percentage change in revenue between the current and prior fiscal year.
profits: Net income for the fiscal year, in millions of dollars (USD).
ceo: Company's Chief Executive Officer.
industry: The company's industry of operation.
sector: Sector in which the company operates.
previous_rank: Global 500 rank for the company for the prior year.
country: Country of the company's headquarters.
When working with pandas, we follow the conventional import method, similar to NumPy (import numpy as np):
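```python
import pandas as pd
```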
Importing CSV File
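We read the f500.csv file into a DataFrame called f500 like this:

```python
f500 = pd.read_csv("f500.csv", index_col=0)  # use the first column of the CSV as the index
f500.index.name = None                       # remove the index's name
```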
The index_col=0 parameter specifies that the first column of the CSV file should be used as the index for the DataFrame. The second line sets the name of the index to None: by default, a DataFrame's index can have a name, and this removes it, leaving the index unnamed.
Checking basic information
Use the DataFrame.head() method to select the first 6 rows. Assign the result to f500_top_6.
Use the DataFrame.tail() method to select the last 8 rows. Assign the result to f500_bottom_8.
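A quick sketch of what that looks like:

```python
f500_top_6 = f500.head(6)     # first 6 rows
f500_bottom_8 = f500.tail(8)  # last 8 rows
```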
Another feature that makes pandas a powerful tool for working with data is that DataFrames can contain more than one data type.
To learn about the types of each column, we can use the DataFrame.dtypes attribute, similar to NumPy's ndarray.dtype attribute. Let's explore an example using a selection of data stored in the variable f500_selection.
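For instance (here, f500_selection is assumed to be a small slice of f500 with just a few rows and columns):

```python
f500_selection.dtypes   # a Series mapping each column name to its dtype
```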
Here, we can see three different data types or dtypes.
You may recognize the float64 dtype from our work in NumPy. Pandas uses NumPy dtypes for numeric columns, including int64. Additionally, there's a type we haven't seen before: object. This is used for columns containing data that doesn't fit into any other dtypes, typically for columns with string values.
When we import data, pandas tries to guess the correct dtype for each column. In general, pandas does a great job, so we don't need to worry about specifying dtypes every time we work with data.
If we want an overview of all the dtypes used in our DataFrame, along with its shape and other information, we can use the DataFrame.info() method. Keep in mind that DataFrame.info() prints the information automatically, without our needing to call the print() function. In fact, it returns the Python None object, so we can't assign the information to a variable like we have been doing.
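```python
f500.info()   # prints shape, column dtypes, and memory usage; returns None
```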
Selecting a Column from a DataFrame by Label
In our previous exercise, we got some information about our data with the useful DataFrame.info() method. We learned the number of rows, number of columns, data types used for each column, and memory usage of our dataset. But, it's time to go deeper into the world of pandas and learn how to select specific data points!
Labels
Pandas labels are our friends. There are two types of labels in pandas: Row Labels and Column Labels. Unlike NumPy, where we needed to know the exact index location, pandas allows us to select data using these friendly labels. The secret weapon? The DataFrame.loc[] attribute! Check out its syntax below:
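```python
df.loc[row_label, column_label]   # rows first, then columns
```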
Notice that we use square brackets ([]), not parentheses (()), when selecting by label with .loc[].
By the way, you'll often see df used as shorthand for a generic DataFrame object in our examples and in the official pandas documentation. When you see it, it simply means that df is a pandas DataFrame object that was created using pandas.DataFrame().
Select a Single Column
Now, let's work through an example together using a slice of our data, stored as f500_selection:
Let's select a single column by specifying a single label:
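A sketch, assuming f500_selection kept the rank column from f500:

```python
rank_col = f500_selection.loc[:, "rank"]   # a Series with the same index as f500_selection
```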
Did you notice we used : to select all rows? And the new Series shares the same row labels (the index) as the original DataFrame!
For an even quicker way to select a single column, we can use this shortcut that doesn't need the .loc attribute or : to select all the rows:
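```python
rank_col = f500_selection["rank"]   # same result as f500_selection.loc[:, "rank"]
```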
Introduction to Series
Whenever we come across a 1D pandas object, it's a Series. And when we see a 2D pandas object, it's a DataFrame.
Think of a DataFrame as a team of Series objects working together, much like how pandas organizes the data behind the scenes.
As we continue our journey into data selection with pandas, keep an eye on which objects are DataFrames and which ones are Series. The functions, methods, attributes, and syntax available to us will vary depending on the type of pandas object we're working with.
Select a List of Specific Columns
To select specific columns, we use a list of labels with .loc[], or directly with double square brackets, like this:
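Again assuming the rank and revenues columns are present:

```python
f500_selection.loc[:, ["rank", "revenues"]]
# or, with the shorthand:
f500_selection[["rank", "revenues"]]
```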
Since the object returned is two-dimensional, we've got a DataFrame, not a Series. As an alternative to using df.loc[:, ["col1", "col2"]] to select a specific list of columns, we can use the shorthand syntax df[["col1", "col2"]] to achieve the same result.
This "double bracket syntax" is often a source of confusion for many learners. Here's a little trick that can help with that. Think of the "first set" of brackets as belonging to the indexing operation, and the "second set" of brackets as defining a list object. Anytime we're making a selection, we'll need that first set of brackets, but that second set will only be necessary if we need to use a list to make our selection. No need for a list? Well, then there's no need to use nested brackets!
Select a Slice of Columns
Now, let's see how to select specific columns using a slice object with labels:
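For example, assuming the columns appear in the same order as the data dictionary above (rank, revenues, revenue_change, profits, ...):

```python
f500_selection.loc[:, "rank":"profits"]   # every column from rank through profits, inclusive
```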
Once again, the object returned is a DataFrame. But notice we get all the columns from the first column label up to and including the last column label in our slice. This is different than most of the slicing we do in Python where the last element is not usually included – pandas does things a little differently! Also, keep in mind that there's no shortcut for selecting column slices; we need to use the .loc attribute and include : for selecting all the rows.
Here's a summary of the techniques we've learned for selecting columns:
| Select by Label | Explicit Syntax | Shorthand Syntax |
| --- | --- | --- |
| Single column | df.loc[:, "col1"] | df["col1"] |
| List of columns | df.loc[:, ["col1", "col7"]] | df[["col1", "col7"]] |
| Slice of columns | df.loc[:, "col1":"col4"] | (no shorthand) |
Now it's time to put these techniques into practice! Let's select specific columns from our f500 DataFrame and continue sharpening our pandas skills.
Selecting Rows from a DataFrame by Label
Now that we've mastered selecting columns using the labels of the Column Axis, let's tackle selecting rows using the labels of the Index Axis:
The syntax for selecting rows from a DataFrame is the same as for columns:
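```python
df.loc[row_label, column_label]   # same pattern as before: rows first, then columns
```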
This syntax works when we want to select specific rows and columns. However, if we want to select all columns for specific rows, we can simplify this syntax a little by using:
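```python
df.loc[row_label]   # pandas fills in the columns for us
```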
This works because when we use this syntax, pandas automatically treats it as df.loc[row_label, :] to select all the columns for us. We'll use this slightly shorter syntax when selecting specific rows for all columns.
Let's continue to work with our selection of data, stored in the variable f500_selection:
Select a Single Row
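A sketch, assuming "Walmart" is one of the row labels in f500_selection:

```python
walmart = f500_selection.loc["Walmart"]   # one row, all columns
```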
Notice that the returned object is a Series because it's one-dimensional. This Series stores integer, float, and string values. Pandas uses the object dtype to accommodate all these values, as none of the numeric types could cater to them all.
Select a List of Specific Rows
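Again assuming "Toyota Motor" and "Walmart" are row labels in f500_selection:

```python
toyota_walmart = f500_selection.loc[["Toyota Motor", "Walmart"]]
```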
Notice how the order of the rows in our selection has been preserved in the resulting DataFrame even though Walmart appears before Toyota Motor in our original f500_selection DataFrame.
Select a Slice of Rows
For row selection using slices, we can use the shortcut below. This is why we couldn't use this shortcut for columns - because it's reserved for use with rows!
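```python
df["row1":"row5"]   # shortcut: a slice of rows, no .loc needed
```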
Notice that we don't need to use the .loc attribute or nested brackets here. Also, the last element in the slice is included, unlike with regular Python slicing. Selecting rows using slices is very clean! Of course, we could use the explicit syntax to select a slice of rows (df.loc["row1":"row5"]), but now that we know a shortcut, we'll want to use it whenever we can!
Shortcut or No Shortcut?
Since they are very similar, it's very easy to confuse the shortcut syntax that doesn't use the .loc attribute with the explicit syntax that does. So how do we keep it all straight?! Well, there is no better way to learn than by practicing! But to set ourselves up for success, it helps to think about how often we'll perform one type of selection over the other.
For instance, are we more likely to select a single column or a single row when working with data? Which one should "get the shortcut?" With a little thought, it's clear that we'll select a single column more often than we'll select a single row, so selecting a column (df["col1"]) gets to use the shortcut syntax, not rows (df.loc["row1"]).
What about selection using a list? Again, it makes more sense for us to use the shortcut syntax for columns (df[["col1", "col5"]]) since we'll do that more often than selecting a list of specific rows (df.loc[["row1", "row5"]]).
But what about slices? This one favors rows since slices are continuous selections with no gaps; we'll want this functionality more often for rows (df["row1": "row5"]) than for columns since rows tend to be organized in some way, such as by date. The order of columns is rarely important, so the shortcut syntax for selection using a slice goes to the rows, not columns (df.loc[:, "col1":"col5"]).
It's important to note that these shortcuts only work if we're selecting either specific rows or specific columns, but not both. If we want to select specific rows and columns, we need to use the .loc attribute.
Here is a summary of the techniques we've learned for selecting rows:
| Select by Label | Explicit Syntax | Shorthand Syntax |
| --- | --- | --- |
| Single row | df.loc["row1"] | (no shorthand) |
| List of rows | df.loc[["row1", "row5"]] | (no shorthand) |
| Slice of rows | df.loc["row1":"row5"] | df["row1":"row5"] |
Series vs DataFrames
On the past couple of screens, we created both Series objects and DataFrame objects as we selected data from our f500 DataFrame. Take a minute to review these examples before we continue:
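For instance (a generic recap using the df placeholder):

```python
df["col1"]              # single column   -> Series (1D)
df.loc["row1"]          # single row      -> Series (1D)
df[["col1", "col7"]]    # list of columns -> DataFrame (2D)
df.loc["row1":"row5"]   # slice of rows   -> DataFrame (2D)
```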

Value Counts Method
Since Series and DataFrames are two distinct types of pandas objects, they each have their own special methods. Let's explore the powerful Series.value_counts() method to see how it counts the occurrences of each unique non-null value in a column. By default, this method will return the results from the most frequent value in the column to the least. Check out the official pandas documentation to learn how to use the sort and ascending parameters to control the sort order.
Series.value_counts() in Action
First, let's select just one column from the f500 DataFrame:
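```python
sectors = f500["sector"]
```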
Now, replace "Series" in Series.value_counts() with the name of our sectors Series from above, and like magic, we get:
value_counts() Method Meets DataFrame
What happens when we try using the value_counts() method directly on a DataFrame object instead of a Series object? Will it work? Let's find out!
First, we'll select the sector and industry columns of f500 to create another DataFrame called sectors_industries:
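```python
sectors_industries = f500[["sector", "industry"]]
```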
Now, let's see if the magic of value_counts() works on a DataFrame:
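```python
sectors_industries.value_counts()   # counts of each (sector, industry) combination
```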
It worked! However, our results from calling value_counts() on a DataFrame object look slightly different than the results we got when we called it on a Series object. This is because the DataFrame.value_counts() method returns a Series object that uses a MultiIndex instead of a single index like we got when using Series.value_counts().
In the example above, the counts returned are based on a combination of the unique non-null values found in both the sector and industry columns. For example, the combination of "Energy" in the sector column and "Petroleum Refining" in the industry column occurs 28 times in our data. However, the combination of "Energy" in the sector column and "Oil & Gas Equipment Services" in the industry column only occurs once.
Although the value_counts() method works on both pandas Series and DataFrame objects, not all methods support both types of objects. When in doubt, we should check the official pandas documentation for Series attributes and methods or for DataFrame attributes and methods to make sure the attribute or method exists for the type of object we're working with.
Selecting Items from a Series by Label
We practiced using the Series.value_counts() method. We found the counts of each unique value in the country column for the entire f500 DataFrame, revealing "USA" as the country with the most companies on the Global 500 list:
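```python
country_counts = f500["country"].value_counts()
```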
What if we want to select the count for a single item, like India, from this Series? Or maybe we want to select the counts for a list of items, like a list of North American countries – how would we do that?
Similar to DataFrames, we can use the Series.loc[] attribute to select items from a Series using single labels, a list of labels, or a slice object. With pandas Series objects being 1D, we can take advantage of the shorthand syntax to omit the .loc attribute and use bracket shortcuts for all three types of selections. Assuming we have a pandas Series object called s, here are the different ways we can select items from it:
| Select by Label | Explicit Syntax | Shorthand Syntax |
| --- | --- | --- |
| Single item | s.loc["item8"] | s["item8"] |
| List of items | s.loc[["item1", "item7"]] | s[["item1", "item7"]] |
| Slice of items | s.loc["item2":"item4"] | s["item2":"item4"] |
Keep in mind that when slicing in pandas, the last item in the slice is included in the results.
Example
Select the item at index label India from the country_counts Series. Assign the result to a new variable india.
Use the print() and type() functions to display the type of variable for india.
Select the items with index labels USA, Canada, and Mexico from the country_counts Series. Assign the result to a new variable north_america.
Use the print() and type() functions to display the type of variable for north_america.
Select the items with index labels from Japan to Spain (inclusive) from the country_counts Series. Assign the result to a new variable japan_to_spain.
Use the print() and type() functions to display the type of variable for japan_to_spain.
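A minimal sketch of one possible solution:

```python
india = country_counts.loc["India"]
print(type(india))

north_america = country_counts[["USA", "Canada", "Mexico"]]
print(type(north_america))

japan_to_spain = country_counts["Japan":"Spain"]
print(type(japan_to_spain))
```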
To recap, here's a combined summary of all the selection techniques we've learned so far:

| Select by Label | Explicit Syntax | Shorthand Syntax |
| --- | --- | --- |
| Single column from DataFrame | df.loc[:, "col1"] | df["col1"] |
| List of columns from DataFrame | df.loc[:, ["col1", "col7"]] | df[["col1", "col7"]] |
| Slice of columns from DataFrame | df.loc[:, "col1":"col4"] | (no shorthand) |
| Single row from DataFrame | df.loc["row1"] | (no shorthand) |
| List of rows from DataFrame | df.loc[["row1", "row5"]] | (no shorthand) |
| Slice of rows from DataFrame | df.loc["row1":"row5"] | df["row1":"row5"] |
| Single item from Series | s.loc["item8"] | s["item8"] |
| List of items from Series | s.loc[["item1", "item7"]] | s[["item1", "item7"]] |
| Slice of items from Series | s.loc["item2":"item4"] | s["item2":"item4"] |
The general syntax we use for all types of selections is:
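```python
df.loc[row_selection, column_selection]
```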
For example, if we want to select a slice of rows across a list of columns, the syntax we use looks like this:
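```python
df.loc["row1":"row5", ["col1", "col7"]]
```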
And if we want to select a single row across a slice of columns, the syntax we use looks like this:
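```python
df.loc["row1", "col1":"col4"]
```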
Deleting Columns from a DataFrame
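Putting the pieces described below together, the call looks something like this (assuming traffic is an existing DataFrame that contains those two columns):

```python
incidents = traffic.drop(['Hour (Coded)', 'Slowness in traffic (%)'], axis=1)
```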
traffic.drop(...): uses pandas' .drop() method to remove columns or rows from the DataFrame.
['Hour (Coded)', 'Slowness in traffic (%)']: the list of column names you want to drop (i.e., remove) from the DataFrame.
axis=1: specifies that you're dropping columns, not rows (axis=0 for rows, axis=1 for columns).
incidents = ...: assigns the new DataFrame (with the specified columns removed) to the variable incidents. The original traffic DataFrame remains unchanged unless you use inplace=True.