Data Cleaning Basics

Reading CSV Files with Encodings

We've learned how to select, assign, and analyze data with pandas using pre-cleaned data. In reality, data is rarely in the format needed to perform analysis. Data scientists commonly spend over half their time cleaning data, so knowing how to clean "messy" data is an extremely important skill.

In this lesson, we'll learn the basics of data cleaning with pandas as we work with laptops.csv, a CSV file containing information about 1,300 laptop computers. The first five rows of the CSV file are shown below:

| | Manufacturer | Model Name | Category | Screen Size | Screen | CPU | RAM | Storage | GPU | Operating System | Operating System Version | Weight | Price (Euros) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | NaN | 1.37kg | 1339,69 |
| 1 | Apple | Macbook Air | Ultrabook | 13.3" | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | NaN | 1.34kg | 898,94 |
| 2 | HP | 250 G6 | Notebook | 15.6" | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | NaN | 1.86kg | 575,00 |
| 3 | Apple | MacBook Pro | Ultrabook | 15.4" | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | 16GB | 512GB SSD | AMD Radeon Pro 455 | macOS | NaN | 1.83kg | 2537,45 |
| 4 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | 8GB | 256GB SSD | Intel Iris Plus Graphics 650 | macOS | NaN | 1.37kg | 1803,60 |

We can start by reading the data into pandas. Let's look at what happens when we use the pandas.read_csv() function with only the filename as an argument:

laptops = pd.read_csv("laptops.csv")
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1247, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1262, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1452, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 4: invalid continuation byte
...
[truncated]

We get an error! Reading the traceback, we can see it references UTF-8, which is a type of encoding. Computers, at their lowest level, can only store binary (0s and 1s), and encodings are systems for mapping characters to and from binary. This error is telling us that pandas tried to decode the file's bytes as utf-8 and hit a byte sequence that isn't valid utf-8, so the file must be stored in some other encoding.


Four of the most popular encodings, which we can pass to the encoding parameter of pandas.read_csv(), are:

  • utf-8 - Universal Coded Character Set Transformation Format—8-bit, a dominant character encoding for the web.

  • latin1 - Also known as 'ISO-8859-1', a part of the ISO/IEC 8859 series.

  • Windows-1252 - A character encoding of the Windows family, also known as 'cp1252' or sometimes ANSI.

  • utf-16 - Similar to 'utf-8' but uses 16 bits to represent each character instead of 8.

Since the pandas.read_csv() function already tried to read in the laptops.csv file using the default encoding type (utf-8) and failed, we know the file's not encoded using that format!

Thankfully, the pandas.read_csv() function has an encoding argument we can use to specify an encoding:

Once we've read in the file with a working encoding, we can examine the dataframe's structure with the DataFrame.info() method:
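A minimal sketch of reading with an explicit encoding. In the lesson, the call would be pd.read_csv("laptops.csv", encoding="Latin-1"); since the file isn't available here, a tiny Latin-1 encoded byte stream (hypothetical data) stands in for it:

```python
import io
import pandas as pd

# Hypothetical stand-in for laptops.csv: the accented é is byte 0xE9 in
# Latin-1, which is exactly the kind of byte that breaks utf-8 decoding
raw = "Manufacturer,Model Name\nAcér,Aspire 3\n".encode("latin-1")

# With the default utf-8 this would raise UnicodeDecodeError;
# specifying the encoding fixes it
laptops = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(laptops)
```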

We can see that all columns are represented by the object dtype, indicating that they store string values, not numerical values. Also, one of the columns, Operating System Version, contains some null values.

The column labels also have a mix of upper and lowercase letters, as well as spaces and parentheses, which will make them harder to work with and read. One noticeable issue is that the " Storage" column name has a leading space in front of it. These quirks with column labels can sometimes be hard to spot, so removing extra whitespaces from all column names will avoid headaches in the long run.

We can access the column axis labels of a dataframe using the DataFrame.columns attribute. This returns an index object (a pandas Index, built on top of a NumPy ndarray) with the label (name) of each column:

Not only can we use the attribute to view the column labels, we can also assign new ones with it:

Example - Remove any whitespace from the start and end of each column name.

  1. Create an empty list named new_columns.

  2. Create a for loop to iterate over each column name by accessing the DataFrame.columns attribute.

  3. Inside the body of the for loop, use the str.strip() method to remove whitespace from the start and end of the string and append the updated column name to the new_columns list.

  4. Assign the updated column names to the DataFrame.columns attribute.
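The four steps above can be sketched as follows (using a tiny stand-in dataframe, since the real laptops data isn't loaded here; note the leading space in " Storage"):

```python
import pandas as pd

# Tiny stand-in for the laptops dataframe
laptops = pd.DataFrame({"Manufacturer": ["Apple"], " Storage": ["128GB SSD"]})

new_columns = []                     # 1. empty list
for col in laptops.columns:          # 2. iterate over the column labels
    new_columns.append(col.strip())  # 3. strip whitespace and append
laptops.columns = new_columns        # 4. assign the cleaned labels back

print(laptops.columns)
```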

Stripping whitespace was a good start, but we still need to standardize the column labels a bit more. Let's finish cleaning them up by:

  • Replacing spaces between words with underscores.

  • Removing any special characters, like parentheses.

  • Making all labels lowercase.

  • Shortening any long column names.

Since we need to perform these steps on each of our column labels, it makes sense for us to create a helper function that uses Python string methods to clean our column labels as described above. Then we can use a for loop to apply that function to each column label. Let's look at an example:
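Here is one possible version of such a helper function. The stand-in dataframe and the exact cleaning steps inside clean_col are illustrative, a sketch of the approach rather than the lesson's exact code:

```python
import pandas as pd

# Stand-in: only the column labels matter here
laptops = pd.DataFrame(columns=["Operating System Version", "Price (Euros)"])

def clean_col(col):
    """Clean a single column label using plain string methods."""
    col = col.strip()                            # drop surrounding whitespace
    col = col.replace("(", "").replace(")", "")  # remove parentheses
    col = col.replace(" ", "_")                  # spaces -> underscores
    return col.lower()                           # lowercase everything

new_columns = []
for c in laptops.columns:          # apply the function to each label
    new_columns.append(clean_col(c))
laptops.columns = new_columns      # assign the result back

print(laptops.columns)
```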

Our code example above:

  • Defined a function, clean_col, that cleans a single column label using Python string methods.

  • Used a loop to apply the function to each item in the column index object and assigned the result back to the DataFrame.columns attribute.

  • Printed the updated values of the DataFrame.columns attribute.

Let's use this technique to further clean the column labels in our dataframe, adding a few extra cleaning 'chores' along the way.

  1. Define a function, clean_col, which accepts a string argument, col, that:

    • Removes any whitespace from the start and end of the string.

    • Replaces the substring Operating System with the abbreviation os.

    • Replaces all spaces with underscores.

    • Removes parentheses from the string.

    • Makes the entire string lowercase.

    • Returns the modified string.

  2. Use a for loop to apply the function to each item in the DataFrame.columns attribute for the laptops dataframe. Assign the result back to the DataFrame.columns attribute.

Converting String Columns to Numeric

We observed earlier that all 13 columns have the object dtype, indicating they're storing strings. Let's look at the first few rows of three of our columns:

Of these three columns, we have three different types of text data:

  • category: Purely text data; it has no numeric values.

  • screen_size: Numeric data stored as text data because of the " character that represents "inches."

  • screen: A combination of text data (screen type) and numeric data (screen size).

Because the values in the screen_size column are stored as text data, we can't easily sort them numerically. For instance, if we wanted to select laptops with screens 15" or larger, we'd be unable to do so without using some clever tricks.

Let's address this problem by converting the screen_size column to purely numeric values. Whenever we convert text to numeric data, we can follow this data cleaning workflow:

The first step is to explore the data. One of the best ways to start exploring the data is to use the Series.unique() method to view all of the unique values in the column:
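For instance (using a short stand-in series whose values mirror the pattern in the real screen_size column):

```python
import pandas as pd

# Stand-in for laptops["screen_size"]
screen_size = pd.Series(['13.3"', '15.6"', '15.4"', '13.3"'])

# unique() returns each distinct value once
print(screen_size.unique())
```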

Our next step is to identify patterns and special cases that block us from converting the column to numeric. Looking at the results above, we can observe the following:

  • All values in this column follow a pattern: two digits, followed by a decimal (.), followed by a single digit, followed by a double quotation mark ("). We'll eventually need to remove that " so we can convert the column to numeric.

  • There are no special cases; every unique value in the column matches this pattern.

  • Because the int dtype won't be able to store these decimal values, we'll eventually need to convert the column to a float dtype.

Let's see if we can identify any patterns and special cases in the ram column next.

A note about Series.unique(): The Series.unique() method returns a numpy array, not a list or pandas series. This means that we can't use the Series methods we've learned so far, like Series.head(). If you want to convert the result to a list, you can use the tolist() method of the numpy array:

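For example (with stand-in values for the ram column):

```python
import pandas as pd

ram = pd.Series(["8GB", "16GB", "8GB"])  # stand-in for laptops["ram"]
unique_arr = ram.unique()                # a numpy.ndarray, not a Series
unique_list = unique_arr.tolist()        # convert to a plain Python list
print(type(unique_arr).__name__, unique_list)
```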

  1. Use the Series.unique() method to identify the unique values in the ram column of the laptops dataframe. Assign the result to unique_ram.

  2. Use the print() function to display unique_ram and observe any patterns that will help with converting it to a numeric column.

We identified a clear pattern in the ram column; all values were integers, followed by the characters GB (gigabyte) at the end of the string:

To be able to convert both the ram and screen_size columns to numeric dtypes, we'll have to first remove the non-digit characters, GB and ", respectively.

Thankfully, the pandas library contains dozens of vectorized string methods we can use to manipulate text data. Many of them perform the same operations as the Python string methods we've used already. Most pandas vectorized string methods are available using the Series.str accessor. This means we can access them by adding str between the series object name and the method name.

In our case, we can use the Series.str.replace() method, which is a vectorized version of the Python str.replace() method we used earlier when cleaning up column labels. Here's how we use it to clean up the screen_size column:
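A sketch, using stand-in values for the screen_size column:

```python
import pandas as pd

laptops = pd.DataFrame({"screen_size": ['13.3"', '15.6"', '15.4"']})  # stand-in

# Replace every '"' with the empty string, i.e. remove it
laptops["screen_size"] = laptops["screen_size"].str.replace('"', '')
print(laptops["screen_size"].unique())
```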

Although screen_size still has an object dtype, the unique string values it contains are clearly ready to be converted to numeric values. We'll handle that step on the following screen.

But first, let's remove the non-digit characters from the ram column like we've done for the screen_size column in the provided code.

  1. Use the Series.str.replace() method to remove the substring GB from the ram column.

  2. Use the print() function to display the changes to the unique values of the ram column.

  3. Confirm the dtype on the ram column is still object.

Now, we can convert the columns to a numeric dtype. This is also referred to as type casting or changing the data type.

To do this, we use the Series.astype() method. To convert the column to a numeric dtype, we can pass either int or float as the argument for the method. Since the int dtype can't handle decimal values, we'll convert the screen_size column to the float dtype:
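With stand-in values (the '"' characters already removed, as in the previous step):

```python
import pandas as pd

laptops = pd.DataFrame({"screen_size": ["13.3", "15.6"]})  # stand-in strings
laptops["screen_size"] = laptops["screen_size"].astype(float)
print(laptops["screen_size"].dtype)  # float64
```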

  1. Use the Series.astype() method to cast the ram column to an int dtype.

  2. Use the print() function to display the dtype of the ram column.

  3. Use print() and the DataFrame.dtypes attribute to confirm that the screen_size and ram columns have been cast to numeric dtypes.

The final step is to rename the columns. This step is optional, but it's useful when the non-digit characters we removed contained information that helps us understand the data.

In our case, we can use the DataFrame.rename() method to rename the column from screen_size to screen_size_inches.

Below, we specify the axis=1 parameter so pandas knows that we want to rename labels in the column axis as opposed to the index axis (axis=0):
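A sketch with a stand-in dataframe:

```python
import pandas as pd

laptops = pd.DataFrame({"screen_size": [13.3, 15.6]})  # stand-in, already numeric

# axis=1 means "rename labels along the column axis"
laptops.rename({"screen_size": "screen_size_inches"}, axis=1, inplace=True)
print(laptops.columns)
```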

  1. Because the GB characters contained useful information about the units (gigabytes) of the laptop's ram, use the DataFrame.rename() method to rename the column from ram to ram_gb.

  2. Use the Series.describe() method to return a series of descriptive statistics for the ram_gb column. Assign the result to ram_gb_desc.

  3. Use the print() function to display ram_gb_desc.

Extracting Values from Strings

Columns often contain useful information that's buried within text, so it's useful to be able to extract these values (substrings) from strings. For example, let's look at the first five values from the gpu (graphics processing unit) column to see if there's any useful information we can extract from it:

The information in this column tells us the chip manufacturer (e.g., Intel, AMD) followed by its model name/number. Being able to analyze the data by the manufacturer could be useful to us so let's extract it, with the idea that we'll store it in a new column, gpu_manufacturer.

The pandas library has a great vectorized string method for this situation: Series.str.split() method. We can use it to split the column on any character (or pattern), and store the results in a pandas series that contains a list of each element after splitting. By default, the method splits on a whitespace character (space) so that text is broken into individual words, like in this example:
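For example (with a short stand-in for laptops["gpu"].head()):

```python
import pandas as pd

# Stand-in for the first few values of laptops["gpu"]
gpu = pd.Series(["Intel Iris Plus Graphics 640",
                 "AMD Radeon Pro 455",
                 "Intel HD Graphics 620"])

# With no argument, split() breaks each row on whitespace
print(gpu.str.split())
```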

Notice how the method returns a series object containing a list of the words from the original gpu column. Now all we need to do is select the first element in each list to create our new gpu_manufacturer column.

The pandas library comes to the rescue with another vectorized string method we can leverage here! The Series.str accessor can be used with [] notation to directly index by position locations:
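Using the same stand-in series, selecting the first word of every row looks like this:

```python
import pandas as pd

gpu = pd.Series(["Intel Iris Plus Graphics 640", "AMD Radeon Pro 455"])  # stand-in

# str[0] selects element 0 from the list in each row
gpu_manufacturer = gpu.str.split().str[0]
print(gpu_manufacturer.tolist())  # ['Intel', 'AMD']
```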

Since we've been working on laptops["gpu"].head(), we're only seeing the first five rows of laptops["gpu"]. We could easily apply this technique to the entire dataframe by dropping the call to head(). Then, we could assign our results from str[0] to a new column, gpu_manufacturer.

Extract values from a column and store them in a new column.

  1. Extract the manufacturer name from the gpu column:

    • Use the Series.str.split() method to split the gpu column into a list of words. Assign the result to gpu_split.

    • Use the Series.str accessor with [] to select the first element of each list of words. Assign the results to a new column gpu_manufacturer of the laptops dataframe.

  2. Use the Series.value_counts() method to find the counts of each manufacturer in the gpu_manufacturer column. Assign the result to gpu_manufacturer_counts.

  3. Extract the manufacturer name from the cpu column and assign the results to a new column cpu_manufacturer of the laptops dataframe. Try to do it in one line of code; try not to use an intermediate "cpu_split" variable.

  4. Use the Series.value_counts() method to find the counts of each manufacturer in the cpu_manufacturer column. Assign the result to cpu_manufacturer_counts.

Correcting Bad Values

If our data has been scraped from a webpage or if there was manual data entry involved at some point, we may end up with inconsistent values in our dataset. This can make it difficult to analyze our data holistically. Let's look at an example from our os column:

We can see that there are two representations of the Apple operating system in our dataset: Mac OS and macOS. One way we can fix this is with the Series.map() method. While we could use the Series.str.replace() method to fix this particular issue, the Series.map() method is ideal when we want to change multiple values in a column at once, so let's take this opportunity to learn how this other method works.

The most common way to use Series.map() is with a mapping dictionary. Let's look at an example using a series of misspelled fruit that's being stored in a series called s:

To fix all the spelling mistakes at the same time, we create a dictionary called corrections and pass that dictionary as an argument to Series.map() to map the incorrect words (keys) onto the correct ones (values):
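A sketch with hypothetical misspelled values for s:

```python
import pandas as pd

s = pd.Series(["pair", "oranje", "bananna"])  # hypothetical misspelled fruit
corrections = {"pair": "pear", "oranje": "orange", "bananna": "banana"}

# Each key found in the series is replaced by its value
s_fixed = s.map(corrections)
print(s_fixed.tolist())  # ['pear', 'orange', 'banana']
```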

Notice that each string key was replaced by its corresponding string value. One important thing to remember with the Series.map() method is that if a value from the series doesn't exist as a key in the dictionary, it will convert that value to NaN. To see this "mistake" in action, let's see what happens when we call map() on s_fixed using the same corrections dictionary:
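Continuing the hypothetical fruit example, mapping the already-corrected series a second time shows the problem:

```python
import pandas as pd

corrections = {"pair": "pear", "oranje": "orange", "bananna": "banana"}
s_fixed = pd.Series(["pear", "orange", "banana"])  # already-corrected values

# None of these values appear as *keys* in corrections,
# so map() turns every one of them into NaN
s_broken = s_fixed.map(corrections)
print(s_broken.isnull().all())  # True
```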

Because none of the values in the s_fixed series matched any of the keys in our corrections dictionary, every value in s_fixed became NaN! This is a very common occurrence, especially when working in a Jupyter notebook environment where it's easy to accidentally re-run cells.

When using the map() method, make sure that each unique value in the series is represented as a key in the dictionary being passed to the map() method, otherwise you'll get NaN values in your resulting series. If there are values in the series you don't want to change, ensure you set their keys and values equal to each other so that "no changes are mapped" but each unique value appears as a key in the dictionary.

Let's use Series.map() to clean the values in the os column.

  1. Use the Series.unique() method on the os column to display a list of all the unique values it contains.

  2. Create a dictionary called mapping_dict where each key is a unique value from the previous step, and the corresponding value is its replacement.

    • Remember, we only want to change Mac OS to macOS; all other unique values should remain unchanged.

  3. Use the Series.map() method along with the mapping_dict dictionary from the previous step to correct the values in the os column.

  4. Use Series.value_counts() on the os column to display and confirm your changes.

Dropping Missing Values

In pandas, null values will be indicated by either NaN or None.

Recall that we can use the DataFrame.isnull() method to identify missing values in each column. The method returns a boolean dataframe, which we can then use the DataFrame.sum() method on to give us a count of the True values for each column:
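For instance (with a small stand-in frame in which only os_version contains nulls):

```python
import numpy as np
import pandas as pd

# Stand-in for the laptops dataframe
laptops = pd.DataFrame({
    "os": ["macOS", "No OS", "Windows"],
    "os_version": [np.nan, np.nan, "10"],
})

# isnull() gives a boolean frame; sum() counts the True values per column
print(laptops.isnull().sum())
```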

It's clear that we have only one column with null values, os_version, which has 170 missing values.

There are a few options for handling these missing values:

  • Remove all rows that contain missing values.

  • Remove all columns that contain missing values.

  • Fill each missing value with some other value.

  • Leave the missing values as they are.

The first two options are often used when preparing data for machine learning algorithms, which are unable to handle data with null values. We can use the DataFrame.dropna() method to remove (or drop) rows and/or columns with null values.

The DataFrame.dropna() method accepts an axis parameter, which indicates whether we want to drop along the index axis (axis=0) or the column axis (axis=1). Let's look at an example:

The default value for the axis parameter is 0, so df.dropna() is equivalent to df.dropna(axis=0):
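Here is a small illustrative dataframe (the index labels and null positions are chosen to match the description that follows: nulls in rows x and z, all falling in column C):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9.0, np.nan, 11.0, np.nan]},
    index=["w", "x", "y", "z"],
)

print(df.dropna())        # default axis=0: drops rows x and z
print(df.dropna(axis=1))  # drops column C instead
```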

The rows with index labels x and z contain null values, so those rows were dropped. Let's look at what happens when we pass axis=1 to specify the column axis instead:

Only the column with label C contains null values, so, in this case, just that one column was removed.

Let's practice using DataFrame.dropna() to remove rows and columns:

  1. Use DataFrame.dropna() to remove any rows from the laptops dataframe that have null values. Assign the result to laptops_no_null_rows.

  2. Use DataFrame.dropna() to remove any columns from the laptops dataframe that have null values. Assign the result to laptops_no_null_cols.

  3. Use the variable inspector to compare laptops_no_null_rows and laptops_no_null_cols. Do they have the same shape?

Filling Missing Values

While dropping rows or columns is the easiest approach to dealing with missing values, it may not always be the best approach. For example, removing a disproportionate amount of one manufacturer's laptops could impact our analysis.

With this in mind, it's a good idea to explore the missing values in the os_version column before we make a decision. As we've seen, the Series.value_counts() method is a great way to explore all of the unique values in a column. Let's use it again here, but this time we'll use a parameter we haven't seen before:

Because we set the dropna parameter to False, the result includes null (NaN) values. Analyzing the results, we can see that 10 is the most frequent value in the column, followed by our NaN missing values.

Since it's so closely related to the os_version column, let's also explore the os column. We'll only look at rows where the os_version is missing:
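One way to do this is with boolean indexing (sketched on stand-in data with the same shape of problem):

```python
import numpy as np
import pandas as pd

# Stand-in for the laptops dataframe
laptops = pd.DataFrame({
    "os": ["macOS", "No OS", "Windows", "No OS", "macOS"],
    "os_version": [np.nan, np.nan, "10", np.nan, "X"],
})

# os values for only the rows where os_version is missing
os_with_null_v = laptops.loc[laptops["os_version"].isnull(), "os"]
print(os_with_null_v.value_counts())
```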

From these results, we can conclude a couple of important things:

  • The most frequent value is No OS. This is important to note because if there is no operating system on the laptop, there shouldn't be a version defined in the os_version column.

  • Thirteen of the laptops that come with macOS do not specify the version. We can use our knowledge of macOS to confirm that os_version should be equal to X for these rows.

In both of these cases, we can fill in the missing values to make our data more complete. For the rest of the values, it's probably best to leave them as NaN so we don't remove important values.

We can use a boolean comparison and assignment to perform this replacement, like below:

For rows with No OS values in the os column, let's replace the missing value in the os_version column with the value Not Applicable.
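A sketch of that boolean comparison and assignment, on stand-in data:

```python
import numpy as np
import pandas as pd

# Stand-in for the laptops dataframe
laptops = pd.DataFrame({
    "os": ["No OS", "macOS", "No OS"],
    "os_version": [np.nan, np.nan, np.nan],
})

# Boolean comparison selects the "No OS" rows; assignment fills only those
laptops.loc[laptops["os"] == "No OS", "os_version"] = "Not Applicable"
print(laptops)
```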

Example - Clean a String Column

Now it's time to practice what we've learned so far! In this challenge, we'll clean the weight column. Let's look at a sample of the data in that column:

Your challenge is to convert the values in this column to numeric values. As a reminder, here's the data cleaning workflow you can use:

While it appears that the weight column may just need the kg characters removed from the end of each string, there is one special case: one of the values ends with kgs. You'll need to remove kgs as well as kg (removing kgs first, so no stray s is left behind).

In the last step of this challenge, we'll also ask you to use the DataFrame.to_csv() method to save the cleaned data to a CSV file. It's a good idea to save your dataframe as a CSV file when you finish cleaning it, in case you want to return to your analysis later.

We can use the following syntax to save a dataframe as a CSV file:

By default, pandas will save the index labels as a column in the CSV file. Our dataset has integer labels that don't contain any data, so we don't need to save the index.
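For example (with a stand-in dataframe; the /tmp path mirrors the exercise below and assumes a Unix-like environment):

```python
import pandas as pd

laptops = pd.DataFrame({"manufacturer": ["Apple"], "weight_kg": [1.37]})  # stand-in

# index=False keeps the integer index out of the output file
laptops.to_csv("/tmp/laptops_cleaned.csv", index=False)
print(open("/tmp/laptops_cleaned.csv").read())
```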

  1. Convert the values in the weight column to numeric values.

  2. Rename the weight column to weight_kg.

  3. Use the DataFrame.to_csv() method to save the laptops dataframe to a CSV file at /tmp/laptops_cleaned.csv without index labels.
