Reading Data
Reading CSV Files
A comma-separated values (CSV) file is a text file that typically has the following structure:
The first line often represents the header or the names of fields.
The other lines are records.
Values are separated by a delimiter. Frequently used delimiters are:
, (comma), hence the name of the format;
; (semicolon);
\t (the tab character);
but other symbols can also serve as delimiters.
More information about the format and its ambiguity can be found here.
A CSV file can be loaded into a DataFrame object with the pandas read_csv() function (API).
import pandas as pd

df = pd.read_csv('file.csv')
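To try this without a file on disk, the sketch below reads CSV text from memory. The sample data and its column names are hypothetical and only stand in for the contents of 'file.csv'.

```python
import io

import pandas as pd

# Hypothetical sample data: a header line followed by two records.
csv_text = (
    "name,capacity_mw\n"
    "Alpha,100.5\n"
    "Beta,42.0\n"
)

# read_csv() accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())  # field names taken from the first line
print(df.shape)             # (number of records, number of fields)
```

The first line becomes the column names, and every following line becomes one row of the DataFrame.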
Reading Non-Standard CSV Files
Many CSV files cannot be read correctly with read_csv()
without specifying additional parameters.
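As an illustration of what can go wrong, the following sketch reads semicolon-separated text with the default settings and then again with an explicit delimiter. The sample data is hypothetical.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data.
raw = "city;temp_c\nBerlin;21.5\nParis;23.0\n"

# Default settings split on commas, so each line ends up in a single column.
wrong = pd.read_csv(io.StringIO(raw))
print(wrong.shape)

# Passing the actual delimiter splits the lines into two columns.
right = pd.read_csv(io.StringIO(raw), sep=';')
print(right.shape)
```

Note that the first call does not raise an error. It returns a DataFrame with one column, so it is worth checking the shape and column names after loading.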
Some useful additional parameters that help to deal with non-standard CSV files:
header: row number for the header; can be set to None if there is no header,
names: list of column names that will be assigned to the columns of the DataFrame,
sep: the delimiter, that is, the symbol that marks the end of one column and the beginning of the next,
decimal: the symbol used for decimals (useful for reading numerical data with a decimal comma, which is the standard convention in many European countries).
Example of using the additional arguments:
import pandas as pd

column_names = [
    'country',
    'name',
    'capacity_mw',
    'latitude',
    'longitude',
    'primary_fuel',
    'owner'
]
data = pd.read_csv(
    '/datasets/gpp_modified.csv',
    sep='|',
    header=None,
    names=column_names,
    decimal=','
)
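The same arguments can be tried on in-memory text. In the sketch below, the two records are invented for illustration and are not taken from gpp_modified.csv. The data has no header line, uses | as the delimiter, and writes numbers with a decimal comma.

```python
import io

import pandas as pd

# Hypothetical records: no header, '|' as delimiter, decimal commas.
raw = (
    "DEU|Plant A|120,5|52,52|13,40|Gas|Owner X\n"
    "FRA|Plant B|900,0|48,85|2,35|Nuclear|Owner Y\n"
)

column_names = [
    'country', 'name', 'capacity_mw', 'latitude',
    'longitude', 'primary_fuel', 'owner'
]

data = pd.read_csv(
    io.StringIO(raw),
    sep='|',             # columns are separated by '|'
    header=None,         # the first line is a record, not a header
    names=column_names,  # assign the column names explicitly
    decimal=','          # parse '120,5' as the number 120.5
)

print(data.dtypes)
```

With decimal=',' the numeric columns should be parsed as floating-point numbers rather than left as text.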
Reading Excel Files
An Excel file can be loaded into a DataFrame with the read_excel() function (API). It is similar to read_csv()
but takes a different set of arguments. By default, it reads the first sheet.
import pandas as pd

df = pd.read_excel('product_reviews.xlsx')
Some useful additional parameters:
sheet_name: the index of the sheet (starting from 0) or the name of the sheet to read.
Reading a certain sheet by specifying its name:
import pandas as pd

df = pd.read_excel('product_reviews.xlsx', sheet_name='reviewers')
Reading a certain sheet by specifying its index number:
import pandas as pd

df = pd.read_excel('product_reviews.xlsx', sheet_name=2)
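The sheet selection can be checked end to end with a small round trip. The sketch below writes a temporary workbook with two sheets and reads it back. The sheet contents and the file name example.xlsx are hypothetical, and the code assumes that an Excel engine such as openpyxl is installed alongside pandas.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data for two sheets.
reviews = pd.DataFrame({'product': ['A', 'B'], 'rating': [5, 3]})
reviewers = pd.DataFrame({'reviewer': ['Ann', 'Bob']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.xlsx')

    # Write both DataFrames into one workbook (assumes openpyxl is available).
    with pd.ExcelWriter(path) as writer:
        reviews.to_excel(writer, sheet_name='reviews', index=False)
        reviewers.to_excel(writer, sheet_name='reviewers', index=False)

    first = pd.read_excel(path)                            # first sheet by default
    by_name = pd.read_excel(path, sheet_name='reviewers')  # sheet selected by name
    by_index = pd.read_excel(path, sheet_name=1)           # second sheet (index 1)

print(first.columns.tolist())
print(by_name.columns.tolist())
print(by_index.columns.tolist())
```

Because sheet indexes start from 0, sheet_name=1 selects the second sheet, which here is the same sheet as 'reviewers'.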