Data Cleaning Basics

Reading CSV Files with Encodings

We've learned how to select, assign, and analyze data with pandas using pre-cleaned data. In reality, data is rarely in the format needed to perform analysis. Data scientists commonly spend over half their time cleaning data, so knowing how to clean "messy" data is an extremely important skill.

In this lesson, we'll learn the basics of data cleaning with pandas as we work with laptops.csv, a CSV file containing information about 1,300 laptop computers. The first five rows of the CSV file are shown below:

| | Manufacturer | Model Name | Category | Screen Size | Screen | CPU | RAM | Storage | GPU | Operating System | Operating System Version | Weight | Price (Euros) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | NaN | 1.37kg | 1339,69 |
| 1 | Apple | Macbook Air | Ultrabook | 13.3" | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | NaN | 1.34kg | 898,94 |
| 2 | HP | 250 G6 | Notebook | 15.6" | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | NaN | 1.86kg | 575,00 |
| 3 | Apple | MacBook Pro | Ultrabook | 15.4" | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | 16GB | 512GB SSD | AMD Radeon Pro 455 | macOS | NaN | 1.83kg | 2537,45 |
| 4 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | 8GB | 256GB SSD | Intel Iris Plus Graphics 650 | macOS | NaN | 1.37kg | 1803,60 |

We can start by reading the data into pandas. Let's look at what happens when we use the pandas.read_csv() function with only the filename as an argument:

laptops = pd.read_csv("laptops.csv")
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1247, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1262, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1452, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 4: invalid continuation byte
...
[truncated]

We get an error! Reading the traceback, we can see it references UTF-8, which is a type of encoding. Computers, at their lowest level, can only store binary (0s and 1s), and encodings are systems for mapping characters to and from binary. This error is telling us that pandas tried to decode the file's bytes as utf-8 and hit a byte sequence that isn't valid utf-8, so the file must be stored in some other encoding.


Four of the most popular encodings, which we can pass to the encoding parameter of pandas.read_csv(), are:

  • utf-8 - Universal Coded Character Set Transformation Format—8-bit, a dominant character encoding for the web.

  • latin1 - Also known as 'ISO-8859-1', a part of the ISO/IEC 8859 series.

  • Windows-1252 - A character encoding of the Windows family, also known as 'cp1252' or sometimes ANSI.

  • utf-16 - Similar to 'utf-8' but uses 16 bits to represent each character instead of 8.

Since the pandas.read_csv() function already tried to read in the laptops.csv file using the default encoding type (utf-8) and failed, we know the file's not encoded using that format!

Thankfully, the pandas.read_csv() function has an encoding argument we can use to specify an encoding:

Once we've read in the file with a working encoding, we can examine the dataframe's structure with the DataFrame.info() method:
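A minimal sketch of reading with an explicit encoding. In the lesson, the call would be pd.read_csv("laptops.csv", encoding="Latin-1"); since the file isn't available here, a tiny Latin-1 encoded byte stream (hypothetical data) stands in for it:

```python
import io
import pandas as pd

# Hypothetical stand-in for laptops.csv: the accented é is byte 0xE9 in
# Latin-1, which is exactly the kind of byte that breaks utf-8 decoding
raw = "Manufacturer,Model Name\nAcér,Aspire 3\n".encode("latin-1")

# With the default utf-8 this would raise UnicodeDecodeError;
# specifying the encoding fixes it
laptops = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(laptops)
```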

We can see that all columns are represented by the object dtype, indicating that they store string values, not numerical values. Also, one of the columns, Operating System Version, contains some null values.

The column labels also have a mix of upper and lowercase letters, as well as spaces and parentheses, which will make them harder to work with and read. One noticeable issue is that the " Storage" column name has a leading space in front of it. These quirks with column labels can sometimes be hard to spot, so removing extra whitespaces from all column names will avoid headaches in the long run.

We can access the column axis labels of a dataframe using the DataFrame.columns attribute. This returns an index object (a pandas Index, built on top of a NumPy ndarray) with the label (name) of each column:

Not only can we use the attribute to view the column labels, we can also assign new ones with it:

Example - Remove any whitespace from the start and end of each column name.

  1. Create an empty list named new_columns.

  2. Create a for loop to iterate over each column name by accessing the DataFrame.columns attribute.

  3. Inside the body of the for loop, use the str.strip() method to remove whitespace from the start and end of the string and append the updated column name to the new_columns list.

  4. Assign the updated column names to the DataFrame.columns attribute.
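The four steps above can be sketched as follows (using a tiny stand-in dataframe, since the real laptops data isn't loaded here; note the leading space in " Storage"):

```python
import pandas as pd

# Tiny stand-in for the laptops dataframe
laptops = pd.DataFrame({"Manufacturer": ["Apple"], " Storage": ["128GB SSD"]})

new_columns = []                     # 1. empty list
for col in laptops.columns:          # 2. iterate over the column labels
    new_columns.append(col.strip())  # 3. strip whitespace and append
laptops.columns = new_columns        # 4. assign the cleaned labels back

print(laptops.columns)
```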

Stripping whitespace was a good start, but we still need to standardize the column labels a bit more. Let's finish cleaning them up by:

  • Replacing spaces between words with underscores.

  • Removing any special characters, like parentheses.

  • Making all labels lowercase.

  • Shortening any long column names.

Since we need to perform these steps on each of our column labels, it makes sense for us to create a helper function that uses Python string methods to clean our column labels as described above. Then we can use a for loop to apply that function to each column label. Let's look at an example:
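Here is one possible version of such a helper function. The stand-in dataframe and the exact cleaning steps inside clean_col are illustrative, a sketch of the approach rather than the lesson's exact code:

```python
import pandas as pd

# Stand-in: only the column labels matter here
laptops = pd.DataFrame(columns=["Operating System Version", "Price (Euros)"])

def clean_col(col):
    """Clean a single column label using plain string methods."""
    col = col.strip()                            # drop surrounding whitespace
    col = col.replace("(", "").replace(")", "")  # remove parentheses
    col = col.replace(" ", "_")                  # spaces -> underscores
    return col.lower()                           # lowercase everything

new_columns = []
for c in laptops.columns:          # apply the function to each label
    new_columns.append(clean_col(c))
laptops.columns = new_columns      # assign the result back

print(laptops.columns)
```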

Our code example above:

  • Defined a function, clean_col, that cleans a single column label using Python string methods.

  • Used a loop to apply the function to each item in the column index object and assigned the result back to the DataFrame.columns attribute.

  • Printed the updated values of the DataFrame.columns attribute.

Let's use this technique to further clean the column labels in our dataframe, adding a few extra cleaning 'chores' along the way.

  1. Define a function, clean_col, which accepts a string argument, col, that:

    • Removes any whitespace from the start and end of the string.

    • Replaces the substring Operating System with the abbreviation os.

    • Replaces all spaces with underscores.

    • Removes parentheses from the string.

    • Makes the entire string lowercase.

    • Returns the modified string.

  2. Use a for loop to apply the function to each item in the DataFrame.columns attribute for the laptops dataframe. Assign the result back to the DataFrame.columns attribute.

Converting String Columns to Numeric

We observed earlier that all 13 columns have the object dtype, indicating they're storing strings. Let's look at the first few rows of three of our columns:

Of these three columns, we have three different types of text data:

  • category: Purely text data; it has no numeric values.

  • screen_size: Numeric data stored as text data because of the " character that represents "inches."

  • screen: A combination of text data (screen type) and numeric data (screen size).

Because the values in the screen_size column are stored as text data, we can't easily sort them numerically. For instance, if we wanted to select laptops with screens 15" or larger, we'd be unable to do so without using some clever tricks.

Let's address this problem by converting the screen_size column to purely numeric values. Whenever we convert text to numeric data, we can follow this data cleaning workflow:

The first step is to explore the data. One of the best ways to start exploring the data is to use the Series.unique() method to view all of the unique values in the column:
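For instance (using a short stand-in series whose values mirror the pattern in the real screen_size column):

```python
import pandas as pd

# Stand-in for laptops["screen_size"]
screen_size = pd.Series(['13.3"', '15.6"', '15.4"', '13.3"'])

# unique() returns each distinct value once
print(screen_size.unique())
```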

Our next step is to identify patterns and special cases that block us from converting the column to numeric. Looking at the results above, we can observe the following:

  • All values in this column follow a pattern: two digits, followed by a decimal (.), followed by a single digit, followed by a double quotation mark ("). We'll eventually need to remove that " so we can convert the column to numeric.

  • There are no special cases; every unique value in the column matches this pattern.

  • Because the int dtype won't be able to store these decimal values, we'll eventually need to convert the column to a float dtype.

Let's see if we can identify any patterns and special cases in the ram column next.

A note about Series.unique(): The Series.unique() method returns a numpy array, not a list or pandas series. This means that we can't use the Series methods we've learned so far, like Series.head(). If you want to convert the result to a list, you can use the tolist() method of the numpy array:

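For example (with stand-in values for the ram column):

```python
import pandas as pd

ram = pd.Series(["8GB", "16GB", "8GB"])  # stand-in for laptops["ram"]
unique_arr = ram.unique()                # a numpy.ndarray, not a Series
unique_list = unique_arr.tolist()        # convert to a plain Python list
print(type(unique_arr).__name__, unique_list)
```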

  1. Use the Series.unique() method to identify the unique values in the ram column of the laptops dataframe. Assign the result to unique_ram.

  2. Use the print() function to display unique_ram and observe any patterns that will help with converting it to a numeric column.

We identified a clear pattern in the ram column; all values were integers, followed by the characters GB (gigabyte) at the end of the string:

To be able to convert both the ram and screen_size columns to numeric dtypes, we'll have to first remove the non-digit characters, GB and ", respectively.

Thankfully, the pandas library contains dozens of vectorized string methods we can use to manipulate text data. Many of them perform the same operations as the Python string methods we've used already. Most pandas vectorized string methods are available using the Series.str accessor. This means we can access them by adding str between the series object name and the method name.

In our case, we can use the Series.str.replace() method, which is a vectorized version of the Python str.replace() method we used earlier when cleaning up column labels. Here's how we use it to clean up the screen_size column:
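A sketch, using stand-in values for the screen_size column:

```python
import pandas as pd

laptops = pd.DataFrame({"screen_size": ['13.3"', '15.6"', '15.4"']})  # stand-in

# Replace every '"' with the empty string, i.e. remove it
laptops["screen_size"] = laptops["screen_size"].str.replace('"', '')
print(laptops["screen_size"].unique())
```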

Although screen_size still has an object dtype, the unique string values it contains are clearly ready to be converted to numeric values. We'll handle that step on the following screen.

But first, let's remove the non-digit characters from the ram column like we've done for the screen_size column in the provided code.

  1. Use the Series.str.replace() method to remove the substring GB from the ram column.

  2. Use the print() function to display the changes to the unique values of the ram column.

  3. Confirm the dtype on the ram column is still object.

Now, we can convert the columns to a numeric dtype. This is also referred to as type casting or changing the data type.

To do this, we use the Series.astype() method. To convert the column to a numeric dtype, we can pass either int or float as the argument for the method. Since the int dtype can't handle decimal values, we'll convert the screen_size column to the float dtype:
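With stand-in values (the '"' characters already removed, as in the previous step):

```python
import pandas as pd

laptops = pd.DataFrame({"screen_size": ["13.3", "15.6"]})  # stand-in strings
laptops["screen_size"] = laptops["screen_size"].astype(float)
print(laptops["screen_size"].dtype)  # float64
```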

  1. Use the Series.astype() method to cast the ram column to an int dtype.

  2. Use the print() function to display the dtype of the ram column.

  3. Use print() and the DataFrame.dtypes attribute to confirm that the screen_size and ram columns have been cast to numeric dtypes.

The final step is to rename the columns. This step is optional, but it's useful when the non-digit characters we removed contained information that helps us understand the data.

In our case, we can use the DataFrame.rename() method to rename the column from screen_size to screen_size_inches.

Below, we specify the axis=1 parameter so pandas knows that we want to rename labels in the column axis as opposed to the index axis (axis=0):
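A sketch with a stand-in dataframe:

```python
import pandas as pd

laptops = pd.DataFrame({"screen_size": [13.3, 15.6]})  # stand-in, already numeric

# axis=1 means "rename labels along the column axis"
laptops.rename({"screen_size": "screen_size_inches"}, axis=1, inplace=True)
print(laptops.columns)
```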

  1. Because the GB characters contained useful information about the units (gigabytes) of the laptop's ram, use the DataFrame.rename() method to rename the column from ram to ram_gb.

  2. Use the Series.describe() method to return a series of descriptive statistics for the ram_gb column. Assign the result to ram_gb_desc.

  3. Use the print() function to display ram_gb_desc.

Extracting Values from Strings

Columns often contain useful information that's buried within text, so it's useful to be able to extract these values (substrings) from strings. For example, let's look at the first five values from the gpu (graphics processing unit) column to see if there's any useful information we can extract from it:

The information in this column tells us the chip manufacturer (e.g., Intel, AMD) followed by its model name/number. Being able to analyze the data by the manufacturer could be useful to us so let's extract it, with the idea that we'll store it in a new column, gpu_manufacturer.

The pandas library has a great vectorized string method for this situation: Series.str.split() method. We can use it to split the column on any character (or pattern), and store the results in a pandas series that contains a list of each element after splitting. By default, the method splits on a whitespace character (space) so that text is broken into individual words, like in this example:
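For example (with a short stand-in for laptops["gpu"].head()):

```python
import pandas as pd

# Stand-in for the first few values of laptops["gpu"]
gpu = pd.Series(["Intel Iris Plus Graphics 640",
                 "AMD Radeon Pro 455",
                 "Intel HD Graphics 620"])

# With no argument, split() breaks each row on whitespace
print(gpu.str.split())
```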

Notice how the method returns a series object containing a list of the words from the original gpu column. Now all we need to do is select the first element in each list to create our new gpu_manufacturer column.

The pandas library comes to the rescue with another vectorized string method we can leverage here! The Series.str accessor can be used with [] notation to directly index by position locations:
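Using the same stand-in series, selecting the first word of every row looks like this:

```python
import pandas as pd

gpu = pd.Series(["Intel Iris Plus Graphics 640", "AMD Radeon Pro 455"])  # stand-in

# str[0] selects element 0 from the list in each row
gpu_manufacturer = gpu.str.split().str[0]
print(gpu_manufacturer.tolist())  # ['Intel', 'AMD']
```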

Since we've been working on laptops["gpu"].head(), we're only seeing the first five rows of laptops["gpu"]. We could easily apply this technique to the entire dataframe by dropping the call to head(). Then, we could assign our results from str[0] to a new column, gpu_manufacturer.

Extract values from a column and store them in a new column.

  1. Extract the manufacturer name from the gpu column:

    • Use the Series.str.split() method to split the gpu column into a list of words. Assign the result to gpu_split.

    • Use the Series.str accessor with [] to select the first element of each list of words. Assign the results to a new column gpu_manufacturer of the laptops dataframe.

  2. Use the Series.value_counts() method to find the counts of each manufacturer in the gpu_manufacturer column. Assign the result to gpu_manufacturer_counts.

  3. Extract the manufacturer name from the cpu column and assign the results to a new column cpu_manufacturer of the laptops dataframe. Try to do it in one line of code; try not to use an intermediate "cpu_split" variable.

  4. Use the Series.value_counts() method to find the counts of each manufacturer in the cpu_manufacturer column. Assign the result to cpu_manufacturer_counts.

Correcting Bad Values

If our data has been scraped from a webpage or if there was manual data entry involved at some point, we may end up with inconsistent values in our dataset. This can make it difficult to analyze our data holistically. Let's look at an example from our os column:

We can see that there are two representations of the Apple operating system in our dataset: Mac OS and macOS. One way we can fix this is with the Series.map() method. While we could use the Series.str.replace() method to fix this particular issue, the Series.map() method is ideal when we want to change multiple values in a column at once, so let's take this opportunity to learn how this other method works.

The most common way to use Series.map() is with a mapping dictionary. Let's look at an example using a series of misspelled fruit that's being stored in a series called s:

To fix all the spelling mistakes at the same time, we create a dictionary called corrections and pass that dictionary as an argument to Series.map() to map the incorrect words (keys) onto the correct ones (values):
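A sketch with hypothetical misspelled values for s:

```python
import pandas as pd

s = pd.Series(["pair", "oranje", "bananna"])  # hypothetical misspelled fruit
corrections = {"pair": "pear", "oranje": "orange", "bananna": "banana"}

# Each key found in the series is replaced by its value
s_fixed = s.map(corrections)
print(s_fixed.tolist())  # ['pear', 'orange', 'banana']
```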

Notice that each string key was replaced by its corresponding string value. One important thing to remember with the Series.map() method is that if a value from the series doesn't exist as a key in the dictionary, it will convert that value to NaN. To see this "mistake" in action, let's see what happens when we call map() on s_fixed using the same corrections dictionary:
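Continuing the hypothetical fruit example, mapping the already-corrected series a second time shows the problem:

```python
import pandas as pd

corrections = {"pair": "pear", "oranje": "orange", "bananna": "banana"}
s_fixed = pd.Series(["pear", "orange", "banana"])  # already-corrected values

# None of these values appear as *keys* in corrections,
# so map() turns every one of them into NaN
s_broken = s_fixed.map(corrections)
print(s_broken.isnull().all())  # True
```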

Because none of the values in the s_fixed series matched any of the keys in our corrections dictionary, every value in s_fixed became NaN! This is a very common occurrence, especially when working in a Jupyter notebook environment where it's easy to accidentally re-run cells.

When using the map() method, make sure that each unique value in the series is represented as a key in the dictionary being passed to the map() method, otherwise you'll get NaN values in your resulting series. If there are values in the series you don't want to change, ensure you set their keys and values equal to each other so that "no changes are mapped" but each unique value appears as a key in the dictionary.

Let's use Series.map() to clean the values in the os column.

  1. Use the Series.unique() method on the os column to display a list of all the unique values it contains.

  2. Create a dictionary called mapping_dict where each key is a unique value from the previous step, and the corresponding value is its replacement.

    • Remember, we only want to change Mac OS to macOS; all other unique values should remain unchanged.

  3. Use the Series.map() method along with the mapping_dict dictionary from the previous step to correct the values in the os column.

  4. Use Series.value_counts() on the os column to display and confirm your changes.

Dropping Missing Values

In pandas, null values will be indicated by either NaN or None.

Recall that we can use the DataFrame.isnull() method to identify missing values in each column. The method returns a boolean dataframe, which we can then use the DataFrame.sum() method on to give us a count of the True values for each column:
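For instance (with a small stand-in frame in which only os_version contains nulls):

```python
import numpy as np
import pandas as pd

# Stand-in for the laptops dataframe
laptops = pd.DataFrame({
    "os": ["macOS", "No OS", "Windows"],
    "os_version": [np.nan, np.nan, "10"],
})

# isnull() gives a boolean frame; sum() counts the True values per column
print(laptops.isnull().sum())
```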

It's clear that we have only one column with null values, os_version, which has 170 missing values.

There are a few options for handling these missing values:

  • Remove all rows that contain missing values.

  • Remove all columns that contain missing values.

  • Fill each missing value with some other value.

  • Leave the missing values as they are.

The first two options are often used when preparing data for machine learning algorithms, which are unable to handle data with null values. We can use the DataFrame.dropna() method to remove (or drop) rows and/or columns with null values.

The DataFrame.dropna() method accepts an axis parameter, which indicates whether we want to drop along the index axis (axis=0) or the column axis (axis=1). Let's look at an example:

The default value for the axis parameter is 0, so df.dropna() is equivalent to df.dropna(axis=0):
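Here is a small illustrative dataframe (the index labels and null positions are chosen to match the description that follows: nulls in rows x and z, all falling in column C):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9.0, np.nan, 11.0, np.nan]},
    index=["w", "x", "y", "z"],
)

print(df.dropna())        # default axis=0: drops rows x and z
print(df.dropna(axis=1))  # drops column C instead
```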

The rows with index labels x and z contain null values, so those rows were dropped. Let's look at what happens when we pass axis=1 to specify the column axis instead:

Only the column with label C contains null values, so, in this case, just that one column was removed.

Let's practice using DataFrame.dropna() to remove rows and columns:

  1. Use DataFrame.dropna() to remove any rows from the laptops dataframe that have null values. Assign the result to laptops_no_null_rows.

  2. Use DataFrame.dropna() to remove any columns from the laptops dataframe that have null values. Assign the result to laptops_no_null_cols.

  3. Use the variable inspector to compare laptops_no_null_rows and laptops_no_null_cols. Do they have the same shape?

Filling Missing Values

While dropping rows or columns is the easiest approach to dealing with missing values, it may not always be the best approach. For example, removing a disproportionate amount of one manufacturer's laptops could impact our analysis.

With this in mind, it's a good idea to explore the missing values in the os_version column before we make a decision. As we've seen, the Series.value_counts() method is a great way to explore all of the unique values in a column. Let's use it again here, but this time we'll use a parameter we haven't seen before:

Because we set the dropna parameter to False, the result includes null (NaN) values. Analyzing the results, we can see that 10 is the most frequent value in the column, followed by our NaN missing values.

Since it's so closely related to the os_version column, let's also explore the os column. We'll only look at rows where the os_version is missing:
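One way to do this is with boolean indexing (sketched on stand-in data with the same shape of problem):

```python
import numpy as np
import pandas as pd

# Stand-in for the laptops dataframe
laptops = pd.DataFrame({
    "os": ["macOS", "No OS", "Windows", "No OS", "macOS"],
    "os_version": [np.nan, np.nan, "10", np.nan, "X"],
})

# os values for only the rows where os_version is missing
os_with_null_v = laptops.loc[laptops["os_version"].isnull(), "os"]
print(os_with_null_v.value_counts())
```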

From these results, we can conclude a couple of important things:

  • The most frequent value is No OS. This is important to note because if there is no operating system on the laptop, there shouldn't be a version defined in the os_version column.

  • Thirteen of the laptops that come with macOS do not specify the version. We can use our knowledge of macOS to confirm that os_version should be equal to X for these rows.

In both of these cases, we can fill in the missing values to make our data more complete. For the rest of the values, it's probably best to leave them as NaN so we don't remove important values.

We can use a boolean comparison and assignment to perform this replacement, like below:

For rows with No OS values in the os column, let's replace the missing value in the os_version column with the value Not Applicable.
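A sketch of that boolean comparison and assignment, on stand-in data:

```python
import numpy as np
import pandas as pd

# Stand-in for the laptops dataframe
laptops = pd.DataFrame({
    "os": ["No OS", "macOS", "No OS"],
    "os_version": [np.nan, np.nan, np.nan],
})

# Boolean comparison selects the "No OS" rows; assignment fills only those
laptops.loc[laptops["os"] == "No OS", "os_version"] = "Not Applicable"
print(laptops)
```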

Example - Clean a String Column

Now it's time to practice what we've learned so far! In this challenge, we'll clean the weight column. Let's look at a sample of the data in that column:

Your challenge is to convert the values in this column to numeric values. As a reminder, here's the data cleaning workflow you can use:

While it appears that the weight column may just need the kg characters removed from the end of each string, there is one special case: one of the values ends with kgs. You'll need to remove kgs as well as kg (removing kgs first, so no stray s is left behind).

In the last step of this challenge, we'll also ask you to use the DataFrame.to_csv() method to save the cleaned data to a CSV file. It's a good idea to save your dataframe as a CSV file when you finish cleaning it, in case you want to return to your analysis later.

We can use the following syntax to save a dataframe as a CSV file:

By default, pandas will save the index labels as a column in the CSV file. Our dataset has integer labels that don't contain any data, so we don't need to save the index.
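For example (with a stand-in dataframe; the /tmp path mirrors the exercise below and assumes a Unix-like environment):

```python
import pandas as pd

laptops = pd.DataFrame({"manufacturer": ["Apple"], "weight_kg": [1.37]})  # stand-in

# index=False keeps the integer index out of the output file
laptops.to_csv("/tmp/laptops_cleaned.csv", index=False)
print(open("/tmp/laptops_cleaned.csv").read())
```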

  1. Convert the values in the weight column to numeric values.

  2. Rename the weight column to weight_kg.

  3. Use the DataFrame.to_csv() method to save the laptops dataframe to a CSV file at /tmp/laptops_cleaned.csv without index labels.
