Files

CSV files

from csv import reader

opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
print(opened_file)
print(read_file)
<_io.TextIOWrapper name='AppleStore.csv' mode='r' encoding='UTF-8'>
<_csv.reader object at 0x7fb6200e5ba0>

Convert read data to list

Now that we've read the file, we can transform it into a list of lists using the list() function:

from csv import reader
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
print(opened_file)
print(read_file)
apps_data = list(read_file)
print(len(apps_data))
print(apps_data[0])
print(apps_data[1:3])
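The snippets above never close the file. A minimal sketch of the same steps using a with block, which closes the file automatically when the block ends. The three-row sample file written here is made up for illustration and only stands in for AppleStore.csv; the path and column names are assumptions.

```python
import os
import tempfile
from csv import reader, writer

# Hypothetical sample rows standing in for AppleStore.csv (not real data)
sample_rows = [
    ["id", "track_name", "price"],
    ["1", "Example App", "0.0"],
    ["2", "Another App", "1.99"],
]

# Write the sample to a temporary CSV file so the example is self-contained
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample_apps.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as out_file:
    writer(out_file).writerows(sample_rows)

# Read it back; the file is closed as soon as the with block ends
with open(csv_path, newline="", encoding="utf-8") as opened_file:
    read_file = reader(opened_file)
    apps_data = list(read_file)  # list of lists, one inner list per row

print(len(apps_data))
print(apps_data[0])
print(apps_data[1:3])
```

Note that list(read_file) has to run inside the with block, because the reader pulls rows from the file lazily.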

Separating file header

The first row of the file is the header row, so we need to separate it from the data rows.

  1. We remove the first row from apps_data so that any later iteration covers only the data rows. We do that as follows:

    • Saving the header row to a separate variable named header

    • Saving apps_data[1:] back to apps_data. apps_data[1:] is a list slice that excludes the first row (the header row).
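The two steps above can be sketched as follows. The rows are invented placeholders standing in for the result of list(read_file).

```python
# Hypothetical rows standing in for list(read_file); not real data
apps_data = [
    ["id", "track_name", "price"],
    ["1", "Example App", "0.0"],
    ["2", "Another App", "1.99"],
]

header = apps_data[0]      # save the header row to its own variable
apps_data = apps_data[1:]  # keep every row after the first one

print(header)
print(len(apps_data))
```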

Import CSV File in Pandas

  1. Select the rank, revenues, and revenue_change columns in f500. Then, use the DataFrame.head() method to select the first five rows. Assign the result to f500_selection.

  2. Use the variable inspector to view f500_selection. Compare it to the first few lines of our raw CSV file, shown above.

    • Do you notice the relationship between the raw data in the CSV file and the f500_selection pandas dataframe?
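A minimal sketch of step 1, assuming f500 was read with index_col=0 and had its index name cleared. The CSV text below is an invented placeholder with the same column names; it is not the real dataset.

```python
import io

import pandas as pd

# Made-up rows that mimic the shape of the f500 CSV; values are not real data
raw_csv = io.StringIO(
    "company,rank,revenues,revenue_change,profits\n"
    "Company A,1,500,0.8,13\n"
    "Company B,2,400,-4.4,9\n"
    "Company C,3,300,6.1,1\n"
    "Company D,4,250,-2.0,2\n"
    "Company E,5,200,7.7,16\n"
    "Company F,6,150,1.5,5\n"
)

f500 = pd.read_csv(raw_csv, index_col=0)
f500.index.name = None

# Select three columns by name, then keep the first five rows
f500_selection = f500[["rank", "revenues", "revenue_change"]].head()
print(f500_selection)
```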

When you compared the first few rows and columns of the f500_selection dataframe to the raw values below, you may have noticed that the row labels (along the index axis) are actually the values from the first column in the CSV file, company:

If we check the documentation for the read_csv() function, we can see why. The index_col parameter is an optional argument that specifies which column to use to set the Row Labels for our dataframe. For example, when we used a value of 0 for this parameter, we specified that we wanted to use the first column (company) to set the row labels.

When we specify a column for the index_col parameter, the pandas.read_csv() function uses the values in that column to label each row. For this reason, we should only use columns that contain unique values (like the company column) when setting the index_col parameter because each row should have a unique label. This uniqueness ensures that each row can be uniquely identified and accessed by its index label. To be clear, pandas does allow indexes with duplicates, but having a unique index simplifies many operations and prevents issues with data retrieval.
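A small sketch of why unique labels matter, using invented CSV text rather than the real file. With a unique index, a label lookup returns one row; with a duplicated label, the same lookup returns several rows.

```python
import io

import pandas as pd

# Invented sample data; company names and numbers are placeholders
unique_text = "company,rank,revenues\nCompany A,1,500\nCompany B,2,400\n"
dup_text = "company,rank,revenues\nCompany A,1,500\nCompany A,2,400\n"

# index_col=0 uses the first column (company) for the row labels
f500 = pd.read_csv(io.StringIO(unique_text), index_col=0)
dup = pd.read_csv(io.StringIO(dup_text), index_col=0)

print(f500.index.is_unique)               # unique labels
print(f500.loc["Company B", "revenues"])  # a single value

print(dup.index.is_unique)                # duplicated labels
print(dup.loc["Company A"])               # a dataframe with two rows
```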

Naming DataFrame Axes

Let's look at what happens if we use the index_col parameter but don't set the index name to None using the code: f500.index.name = None.

Notice above the row labels, we now have the text company where we didn't before. This corresponds to the name of the first column (column index: 0) in the CSV file. Pandas used the column name to set the Index Name for the index axis.

Also, notice how the dataframe no longer has a company column; instead, it's used to set the index for the dataframe. We know the company column is no longer a standard column in our f500 dataframe above because we specifically selected the columns rank, revenues, and revenue_change, but not company.

In pandas, both the index and column axes can have names assigned to them.

Notice how both the Column Axis and Index Axis now have names: Company Metrics and Company Names, respectively. You can think of these names as "labels for your labels." Some people find these names make their dataframes harder to read, while others feel it makes them easier to interpret. In the end, the choice comes down to personal preference, and it can change depending on the situation.
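One way to set these names is to assign to the name attribute of each axis, as in this sketch. The CSV text is an invented placeholder; the axis names match the ones mentioned above.

```python
import io

import pandas as pd

# Invented sample data standing in for the f500 CSV
csv_text = "company,rank,revenues\nCompany A,1,500\nCompany B,2,400\n"
f500 = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With index_col set, pandas takes the index name from the column header
original_name = f500.index.name
print(original_name)

# Name both axes explicitly
f500.index.name = "Company Names"
f500.columns.name = "Company Metrics"
print(f500)
```

Setting either attribute back to None removes the name again.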

Not using an index column

There are two differences with this approach:

  • The company column is now included as a regular column, instead of being used for the index.

  • The index labels are now integers starting from 0.

This is the more conventional way to read in a dataframe.
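A short sketch of both differences, again on invented CSV text: leaving out index_col keeps company as a regular column and gives the rows integer labels.

```python
import io

import pandas as pd

# Invented sample data standing in for the f500 CSV
csv_text = (
    "company,rank,revenues\n"
    "Company A,1,500\n"
    "Company B,2,400\n"
    "Company C,3,300\n"
)

# No index_col argument: pandas builds a default integer index
f500 = pd.read_csv(io.StringIO(csv_text))

print(f500.columns.tolist())  # company is listed with the other columns
print(list(f500.index))       # integer labels starting from 0
```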

Reading CSV Files with Encodings

We can start by reading the data into pandas. Let's look at what happens when we use the pandas.read_csv() function with only the filename as an argument:

We get an error! Reading the traceback, we can see it references UTF-8, which is a type of encoding. Computers, at their lowest levels, can only understand binary (0 and 1), and encodings are systems for representing characters in binary. This error is telling us that the encoding it used (utf-8) failed to decode the file's bytes back into characters.

Thankfully, the pandas.read_csv() function has an encoding argument we can use to specify an encoding:

The top four most popular encodings, which we can use to set the encoding parameter of pandas.read_csv() above, are:

  • utf-8 - Universal Coded Character Set Transformation Format—8-bit, a dominant character encoding for the web.

  • latin1 - Also known as 'ISO-8859-1', a part of the ISO/IEC 8859 series.

  • Windows-1252 - A character encoding of the Windows family, also known as 'cp1252' or sometimes ANSI.

  • utf-16 - Similar to 'utf-8' but built from 16-bit units instead of 8-bit ones; each character takes one or two units.

Since the pandas.read_csv() function already tried to read in the laptops.csv file using the default encoding type (utf-8) and failed, we know the file's not encoded using that format!
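The same kind of failure can be reproduced without pandas. In this sketch, a made-up string containing an accented character is encoded as latin1; decoding those bytes as utf-8 raises UnicodeDecodeError, while decoding them as latin1 works.

```python
text = "Caf\u00e9"  # made-up sample text containing an accented character

# Encode the text the way a latin1 file would store it
latin1_bytes = text.encode("latin1")

try:
    latin1_bytes.decode("utf-8")  # wrong encoding for these bytes
    decoded_ok = True
except UnicodeDecodeError as error:
    decoded_ok = False
    print(error)

# Decoding with the encoding that was used to write the bytes succeeds
print(latin1_bytes.decode("latin1"))
```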

Example encoded read csv

  1. Import the pandas library using its common alias.

  2. Use the pandas.read_csv() function to read the laptops.csv file into a dataframe laptops.

    • Specify the encoding using the string "latin1".

  3. Use the DataFrame.info() method to display information about the laptops dataframe.

    • Specify the print() function to see the results.
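A sketch of the three steps. Since laptops.csv is not available here, the example first writes a tiny latin1-encoded stand-in file; its path, column names, and values are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Write a made-up, latin1-encoded stand-in for laptops.csv
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "laptops_sample.csv")
with open(csv_path, "w", encoding="latin1", newline="") as out_file:
    out_file.write("Manufacturer,Model Name\nExampleCo,Caf\u00e9 Book\n")

# Read the file, telling pandas which encoding to use
laptops = pd.read_csv(csv_path, encoding="latin1")

# info() writes its summary to standard output and returns None
laptops.info()
```

Because DataFrame.info() prints its own summary, wrapping it in print() also shows a trailing None.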

Opening CSV File with different separator

  • The data points are separated by ;, so you'll need to use sep=';' to read in the file properly.
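A quick sketch of the effect of sep, using invented semicolon-separated text. With the default comma separator, each line is read as a single column; with sep=';' the columns are split as intended.

```python
import io

import pandas as pd

# Invented semicolon-separated data
csv_text = "name;price\nwidget;1.5\ngadget;2.0\n"

# Default separator (comma): every line lands in one column
wrong = pd.read_csv(io.StringIO(csv_text))

# Explicit separator: two proper columns
right = pd.read_csv(io.StringIO(csv_text), sep=";")

print(wrong.shape)
print(right.shape)
print(right)
```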
