Files

CSV files

from csv import reader

opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
print(opened_file)
print(read_file)

<_io.TextIOWrapper name='AppleStore.csv' mode='r' encoding='UTF-8'>
<_csv.reader object at 0x7fb6200e5ba0>

Convert read data to list

Now that we've read the file, we can transform it into a list of lists using the list() function:

from csv import reader
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
print(opened_file)
print(read_file)
apps_data = list(read_file)
print(len(apps_data))
print(apps_data[0])
print(apps_data[1:3])

<_io.TextIOWrapper name='AppleStore.csv' mode='r' encoding='UTF-8'>
<_csv.reader object at 0x7fb63c1680b0>
7198
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']]

Separating file header

The first row of the file holds the column names (the header), so we need to separate it from the data rows.

We remove the first row from apps_data by doing the following:
- Saving the header row to a separate variable named header
- Saving apps_data[1:] back to apps_data; apps_data[1:] is a list slice that excludes the first row (the header row)

header = apps_data[0]
apps_data = apps_data[1:]
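
As an alternative sketch of the same steps, the header can be pulled off the reader with next() before the remaining rows are converted to a list, and a with block closes the file automatically. The in-memory sample below is made up for illustration and only stands in for AppleStore.csv.

```python
import io
from csv import reader

# Hypothetical sample standing in for AppleStore.csv.
# For a real file, use open('AppleStore.csv', newline='', encoding='utf-8') instead.
sample = io.StringIO(
    "id,track_name,price\n"
    "284882215,Facebook,0.0\n"
    "389801252,Instagram,0.0\n"
)

with sample as opened_file:
    read_file = reader(opened_file)
    header = next(read_file)      # first row: the column names
    apps_data = list(read_file)   # the remaining rows only

print(header)
print(len(apps_data))
```

Because the header is consumed before list() is called, apps_data never contains it and no slicing is needed.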

Import CSV File in Pandas

import pandas as pd
f500 = pd.read_csv('f500.csv', index_col=0)
f500.index.name = None

Select the rank, revenues, and revenue_change columns in f500. Then, use the DataFrame.head() method to select the first five rows. Assign the result to f500_selection.
Use the variable inspector to view f500_selection. Compare it to the first few lines of our raw CSV file, shown below.
- Do you notice the relationship between the raw data in the CSV file and the f500_selection pandas dataframe?

import numpy as np
import pandas as pd

# read the dataset into a pandas dataframe
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None

# replace 0 values in the "previous_rank" column with NaN
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

# select three columns, then keep the first five rows
f500_selection = f500[["rank", "revenues", "revenue_change"]].head()


                          rank  revenues  revenue_change
Walmart                      1    485873             0.8
State Grid                   2    315199            -4.4
Sinopec Group                3    267518            -9.1
China National Petroleum     4    262573           -12.3
Toyota Motor                 5    254694             7.7

When you compared the first few rows and columns of the f500_selection dataframe to the raw values below, you may have noticed that the row labels (along the index axis) are actually the values from the first column in the CSV file, company:

company,rank,revenues,revenue_change
Walmart,1,485873,0.8
State Grid,2,315199,-4.4
Sinopec Group,3,267518,-9.1
China National Petroleum,4,262573,-12.3
Toyota Motor,5,254694,7.7

If we check the documentation for the read_csv() function, we can see why. The index_col parameter is an optional argument that specifies which column to use to set the Row Labels for our dataframe. For example, when we used a value of 0 for this parameter, we specified that we wanted to use the first column (company) to set the row labels.

When we specify a column for the index_col parameter, the pandas.read_csv() function uses the values in that column to label each row. For this reason, we should only use columns that contain unique values (like the company column) when setting the index_col parameter, so that each row can be identified and accessed by its own index label. To be clear, pandas does allow indexes with duplicates, but a unique index simplifies many operations and prevents issues with data retrieval.
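
A minimal sketch of checking that property, assuming a few rows modeled on the f500 sample above (the variable names f500_sample and csv_text are illustrative, not from the course):

```python
import io
import pandas as pd

# Hypothetical rows modeled on the first lines of f500.csv
csv_text = io.StringIO(
    "company,rank,revenues\n"
    "Walmart,1,485873\n"
    "State Grid,2,315199\n"
    "Sinopec Group,3,267518\n"
)
f500_sample = pd.read_csv(csv_text, index_col=0)

# Index.is_unique reports whether every row label is distinct
print(f500_sample.index.is_unique)

# With unique labels, a label-based lookup returns a single value
print(f500_sample.loc["State Grid", "rank"])
```

If the index contained duplicate labels, the same .loc lookup would return several rows instead of one value.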

Naming DataFrame Axes

Let's look at what happens if we use the index_col parameter but don't set the index name to None using the code: f500.index.name = None.

f500 = pd.read_csv("f500.csv", index_col=0)
print(f500[['rank', 'revenues', 'revenue_change']].head())

Output:
                          rank  revenues  revenue_change
company                                                    
Walmart                      1    485873             0.8
State Grid                   2    315199            -4.4
Sinopec Group                3    267518            -9.1
China National Petroleum     4    262573           -12.3
Toyota Motor                 5    254694             7.7

Notice above the row labels, we now have the text company where we didn't before. This corresponds to the name of the first column (column index: 0) in the CSV file. Pandas used the column name to set the Index Name for the index axis.

Also, notice how the dataframe no longer has a company column; instead, it's used to set the index for the dataframe. We know the company column is no longer a standard column in our f500 dataframe above because we specifically selected the columns rank, revenues, and revenue_change, but not company, and the company names still appear as row labels.

In pandas, both the index and column axes can have names assigned to them.

f500.index.name = "Company Names"
f500.columns.name = "Company Metrics"

print(f500[['rank', 'revenues', 'revenue_change']].head())

Output:

Company Metrics           rank  revenues  revenue_change
Company Names                                            
Walmart                      1    485873             0.8
State Grid                   2    315199            -4.4
Sinopec Group                3    267518            -9.1
China National Petroleum     4    262573           -12.3
Toyota Motor                 5    254694             7.7

Notice how both the Column Axis and Index Axis now have names: Company Metrics and Company Names, respectively. You can think of these names as "labels for your labels." Some people find these names make their dataframes harder to read, while others feel it makes them easier to interpret. In the end, the choice comes down to personal preference, and it can change depending on the situation.
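
As a sketch of an alternative to assigning the name attributes directly, DataFrame.rename_axis() returns a copy with the axis names set and leaves the original dataframe untouched. The sample rows and the variable name named are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical rows modeled on the first lines of f500.csv
csv_text = io.StringIO(
    "company,rank,revenues\n"
    "Walmart,1,485873\n"
    "State Grid,2,315199\n"
)
f500_sample = pd.read_csv(csv_text, index_col=0)

# rename_axis returns a new dataframe; f500_sample keeps its original names
named = (
    f500_sample
    .rename_axis("Company Names")                     # index axis
    .rename_axis("Company Metrics", axis="columns")   # column axis
)

print(named.index.name)
print(named.columns.name)
print(f500_sample.index.name)
```

This can be convenient when you want named axes for one display without changing the dataframe used elsewhere.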

Not using an index column

f500 = pd.read_csv("f500.csv")
print(f500[['company', 'rank', 'revenues']].head())

Output:
                    company  rank  revenues
0                   Walmart     1    485873
1                State Grid     2    315199
2             Sinopec Group     3    267518
3  China National Petroleum     4    262573
4              Toyota Motor     5    254694

There are two differences with this approach:

- The company column is now included as a regular column, instead of being used for the index.
- The index labels are now integers starting from 0.

This is the more conventional way to read in a dataframe.
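
The two layouts can be converted into each other after reading. The sketch below assumes a small made-up sample and uses DataFrame.set_index() and DataFrame.reset_index(); the names by_company and round_trip are illustrative.

```python
import io
import pandas as pd

# Hypothetical rows modeled on the first lines of f500.csv
csv_text = io.StringIO(
    "company,rank,revenues\n"
    "Walmart,1,485873\n"
    "State Grid,2,315199\n"
    "Sinopec Group,3,267518\n"
)
f500_sample = pd.read_csv(csv_text)   # default integer index

# Move the company column into the index (like index_col=0 at read time)
by_company = f500_sample.set_index("company")

# Move it back out into a regular column with a fresh integer index
round_trip = by_company.reset_index()

print(by_company.head())
print(round_trip.head())
```

Reading with the default integer index and calling set_index() later keeps the choice of row labels open until you know which column you need.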

Reading CSV Files with Encodings

We can start by reading the data into pandas. Let's look at what happens when we use the pandas.read_csv() function with only the filename as an argument:

laptops = pd.read_csv("laptops.csv")

Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1247, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1262, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1452, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 4: invalid continuation byte
...
[truncated]

We get an error! Reading the traceback, we can see it references UTF-8, which is a type of encoding. Computers, at their lowest levels, can only understand binary (0 and 1), and encodings are systems for representing characters in binary. This error is telling us that pandas could not decode the file's bytes into characters using the encoding it assumed (utf-8): the byte 0xe9 is not valid at that position in UTF-8.

Thankfully, the pandas.read_csv() function has an encoding argument we can use to specify an encoding:

df = pd.read_csv("filename.csv", encoding="encoding_type")

Four of the most common encodings, which we can use to set the encoding parameter of pandas.read_csv() above, are:

utf-8 - Universal Coded Character Set Transformation Format, 8-bit; the dominant character encoding for the web. It uses one to four bytes per character.
latin1 - Also known as 'ISO-8859-1', a part of the ISO/IEC 8859 series. It uses exactly one byte per character.
Windows-1252 - A character encoding of the Windows family, also known as 'cp1252' or sometimes ANSI.
utf-16 - Like 'utf-8', an encoding of Unicode, but built from 16-bit units, so each character takes two or four bytes.

Since the pandas.read_csv() function already tried to read in the laptops.csv file using the default encoding type (utf-8) and failed, we know the file's not encoded using that format!
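
One way to find a working encoding is to try candidates in order and stop at the first that does not raise. The sketch below is an illustration under stated assumptions: the bytes are made up (they are not the real laptops.csv), and they are passed to pandas through io.BytesIO rather than a file on disk.

```python
import io
import pandas as pd

# Hypothetical data: "é" encoded as Latin-1 is the single byte 0xe9,
# which is not valid UTF-8 when followed by a space
raw_bytes = "Model,Price\nCafé Book,999\n".encode("latin1")

for encoding in ["utf-8", "latin1"]:
    try:
        df = pd.read_csv(io.BytesIO(raw_bytes), encoding=encoding)
    except UnicodeDecodeError as err:
        print(f"{encoding} failed: {err.reason}")
    else:
        print(f"{encoding} worked")
        break

print(df)
```

Keep in mind that latin1 assigns a character to every possible byte, so it never raises an error; a successful read only shows that the bytes could be decoded, and the resulting text should still be checked by eye.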

Example encoded read csv

Import the pandas library using its common alias.
Use the pandas.read_csv() function to read the laptops.csv file into a dataframe laptops.
- Specify the encoding using the string "latin1".
Use the DataFrame.info() method to display information about the laptops dataframe.
- DataFrame.info() prints its report directly and returns None, so there's no need to wrap it in print().

import pandas as pd
laptops = pd.read_csv("laptops.csv", encoding="latin1")
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7    Storage                  1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB

Opening CSV File with different separator

import pandas as pd
traffic = pd.read_csv('traffic_sao_paulo.csv', sep=';')
print(traffic.head())
print(traffic.tail())
traffic.info()

The data points are separated by ;, so you'll need to use sep=';' to read in the file properly.
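
To see what sep changes, the sketch below reads the same semicolon-separated text twice, once with the default comma separator and once with sep=';'. The column names and values are invented for illustration and are not taken from traffic_sao_paulo.csv.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated sample
csv_text = "Hour;Slowness\n7:00;4\n7:30;6\n"

# Default separator (","): each line is read as one single field
wrong = pd.read_csv(io.StringIO(csv_text))

# Correct separator: each line splits into two columns
traffic_sample = pd.read_csv(io.StringIO(csv_text), sep=";")

print(wrong.shape)
print(traffic_sample.shape)
print(list(traffic_sample.columns))
```

A dataframe with a single column whose name contains the separator character is a common sign that sep needs to be set.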
