Data Types
Categorical and Quantitative Variables
There are two basic types of variables in statistics: categorical and quantitative. Categorical variables take their values from a limited set, while quantitative variables take numeric values from within a range. It is important to tell them apart because some data analysis methods apply to only one of the two types, and others are limited depending on the data type.
Other data types exist as well, but they stem from how software stores data rather than from data analysis itself. Here are some examples:
- Logical (Boolean), meaning they indicate whether a statement is true or false. If a statement is true the variable takes the value 1. If a statement is false, it is 0.
- Strings
- Dates and timestamps
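As a minimal sketch of how these kinds of variables might look in a pandas DataFrame, the example below builds a small table and checks each column's type. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: one column per kind of variable discussed above.
df = pd.DataFrame({
    "city": ["Lisbon", "Oslo", "Lisbon"],      # categorical (stored as strings)
    "price": [12.5, 8.0, 10.25],               # quantitative
    "is_paid": [True, False, True],            # logical (Boolean)
    "ordered_at": pd.to_datetime(["2021-01-05", "2021-01-06", "2021-02-01"]),  # dates
})

# Show the type pandas assigned to each column.
print(df.dtypes)

# Type checks from pandas.api.types.
is_quantitative = pd.api.types.is_numeric_dtype(df["price"])
is_logical = pd.api.types.is_bool_dtype(df["is_paid"])
is_date = pd.api.types.is_datetime64_any_dtype(df["ordered_at"])

# A categorical variable takes values from a limited set.
n_categories = df["city"].nunique()

# True counts as 1 and False as 0, so summing counts the true values.
paid_count = int(df["is_paid"].sum())
```

Summing a logical column works because true values count as 1 and false values as 0, as described above.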
Converting Column Values to a Different Type
Converting column values to a different type (int for integers, for instance), replacing the current column:
df['column'] = df['column'].astype('int')
Converting column values to a different type (int for integers, for instance), keeping the current column and storing the result in a new one:
df['new_column'] = df['column'].astype('int')
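Both variants can be tried on a small table. This is only a sketch: the DataFrame and its column names (quantity, rating, rating_int) are made up for the example.

```python
import pandas as pd

# Hypothetical data: quantities stored as strings, ratings stored as floats.
df = pd.DataFrame({"quantity": ["3", "10", "7"], "rating": [4.0, 5.0, 3.0]})

# Replace the column: strings containing only digits are cast to integers.
df["quantity"] = df["quantity"].astype("int")

# Keep the original float column and store the integer version next to it.
df["rating_int"] = df["rating"].astype("int")

# Arithmetic now works on the converted column.
total_quantity = df["quantity"].sum()
```

Note that astype('int') fails if a value cannot be interpreted as an integer, for example a string with letters or a missing value.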
Turning String Values into Numbers
We use a standard pandas function to convert string values into numbers: to_numeric() (API). It turns column values into the float64 (floating point number) or int64 (integer) type, depending on the input values.
The to_numeric() function has an errors parameter. This parameter determines what to_numeric() does when it encounters an invalid value:
- errors='raise' (default): an exception is raised when an invalid value is encountered, halting the conversion.
- errors='coerce': invalid values are replaced with NaN.
- errors='ignore': invalid values are left unchanged. Note that this option is deprecated in recent pandas releases.
pd.to_numeric(df['column'])
pd.to_numeric(df['column'], errors='raise')
pd.to_numeric(df['column'], errors='coerce')
pd.to_numeric(df['column'], errors='ignore')
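The difference between the default behavior and errors='coerce' can be seen on a Series that contains one invalid value. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical raw values; "unknown" cannot be parsed as a number.
raw = pd.Series(["1", "2.5", "unknown", "40"])

# errors='coerce' replaces the invalid value with NaN and converts the rest.
numbers = pd.to_numeric(raw, errors="coerce")
n_invalid = int(numbers.isna().sum())

# The default (errors='raise') stops with a ValueError instead.
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True
```

Counting the NaN values after coercion is a quick way to see how many entries failed to convert.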
Converting String to Times and Dates
Python has a special data type we use when working with dates and times: datetime.
In order to convert strings into dates and times, we use the pandas to_datetime() function (API). Its arguments include the column containing the strings and the date format, passed as a string.
We set the date format using a special designation system:
- %d: day of the month (01 to 31)
- %m: month (01 to 12)
- %Y: four-digit year (for example, 1994)
- %y: two-digit year (for example, 94)
- T: literal separator between the date and the time in ISO 8601 strings; a trailing Z in such strings marks UTC
- %H: hour in a 24-hour format
- %I: hour in a 12-hour format
- %M: minutes (00 to 59)
- %S: seconds (00 to 59)
df['column'] = pd.to_datetime(df['column'], format='%d.%m.%Y %H:%M:%S')
See the documentation for more information on the date and time format codes.
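As a short illustration of the format codes, the sketch below parses two hypothetical string columns: one in day.month.year form and one that uses the literal T separator. The column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical strings in two different layouts.
df = pd.DataFrame({
    "raw": ["05.01.2021 14:30:00", "17.11.2021 08:05:09"],
    "raw_iso": ["2021-01-05T14:30:00", "2021-11-17T08:05:09"],
})

# Day first, dots between date parts, a space before the time.
df["parsed"] = pd.to_datetime(df["raw"], format="%d.%m.%Y %H:%M:%S")

# Year first, with the literal T between date and time.
df["parsed_iso"] = pd.to_datetime(df["raw_iso"], format="%Y-%m-%dT%H:%M:%S")

first = df["parsed"].iloc[0]
```

Both columns describe the same moments, so the parsed results match once the correct format is supplied for each.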
Converting Unix Time to Times and Dates
There are various ways to represent dates and times, but the Unix time format deserves special attention. This format gives us the number of seconds or milliseconds (sometimes other fractions of a second are used) that have passed since 00:00:00 on January 1, 1970. Unix time corresponds to Coordinated Universal Time, or UTC.
The to_datetime() function works with the Unix time format as well. The first argument is the column with Unix times. The second argument is unit; it tells pandas which unit the values are expressed in ('s' for seconds, 'ms' for milliseconds) so that they can be converted to the usual format.
Converting Unix time given in seconds:
df['date'] = pd.to_datetime(df['timestamp'], unit='s')
Converting Unix time given in milliseconds:
df['date'] = pd.to_datetime(df['timestamp'], unit='ms')
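The effect of the unit argument can be checked on a few known timestamps. In this sketch the column names (ts_s, ts_ms, from_s, from_ms) are hypothetical; the same moments are stored once in seconds and once in milliseconds.

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds: the epoch, one day later, and 10**9 seconds.
df = pd.DataFrame({"ts_s": [0, 86_400, 1_000_000_000]})

# The same moments expressed in milliseconds.
df["ts_ms"] = df["ts_s"] * 1000

# The unit must match how the numbers were recorded.
df["from_s"] = pd.to_datetime(df["ts_s"], unit="s")
df["from_ms"] = pd.to_datetime(df["ts_ms"], unit="ms")
```

Passing the wrong unit does not raise an error; it silently produces dates that are off by a factor of 1000, so it is worth checking a value or two by eye.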
Extracting Time Components
We often have to study statistics by month, day, or year. To do so, we can place the time in the DatetimeIndex class and apply the month, day, or year attribute to it:
pd.DatetimeIndex(df['column']).year
pd.DatetimeIndex(df['column']).month
pd.DatetimeIndex(df['column']).day
pd.DatetimeIndex(df['column']).hour
pd.DatetimeIndex(df['column']).minute
pd.DatetimeIndex(df['column']).second
For columns with datetime-like values, you can also access these properties via the .dt accessor (API):
df['time'].dt.year
df['time'].dt.month
df['time'].dt.day
df['time'].dt.hour
df['time'].dt.minute
df['time'].dt.second
The complete list of components can be found in the documentation. For example, we can find the day of the week with the dt.weekday attribute, which numbers the days from 0 (Monday) to 6 (Sunday).
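A typical use of these components is grouping by month. The sketch below uses a hypothetical table with a time column and an amount column, extracts the month and weekday, and sums the amounts per month.

```python
import pandas as pd

# Hypothetical events with an amount attached to each one.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2021-01-04 09:15:00",
        "2021-01-20 18:40:00",
        "2021-02-01 12:00:00",
    ]),
    "amount": [10, 20, 5],
})

# Components are attributes of the .dt accessor, so no parentheses are needed.
df["month"] = df["time"].dt.month
df["weekday"] = df["time"].dt.weekday  # Monday is 0, Sunday is 6

# The DatetimeIndex route gives the same components.
years = list(pd.DatetimeIndex(df["time"]).year)

# Statistics by month: total amount per calendar month.
monthly_total = df.groupby("month")["amount"].sum()
```

Storing the extracted component in its own column keeps the grouping step readable.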
Rounding Time Components
To round times, use the dt.round() method. It takes a string that indicates whether to round to the day, hour, minute, or second:
- D: day
- H: hour
- min or T: minute
- S: second
Recent pandas releases prefer the lowercase aliases h, min, and s over H, T, and S.
Rounding time:
df['datetime'] = df['datetime'].dt.round('1H')   # round to hour
df['datetime'] = df['datetime'].dt.round('1D')   # round to day
df['datetime'] = df['datetime'].dt.round('5T')   # round to 5 minutes
df['datetime'] = df['datetime'].dt.round('10S')  # round to 10 seconds
df['datetime'] = df['datetime'].dt.floor('1H')   # always round down
df['datetime'] = df['datetime'].dt.ceil('1H')    # always round up
df['datetime'] = df['datetime'].dt.round('3H')   # round to 3 hours
If you want to round up, use the dt.ceil() (ceiling) method. To round down, use dt.floor().
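To compare the three methods side by side, the sketch below applies them to two hypothetical timestamps. It assumes the lowercase aliases h and min, which recent pandas releases prefer over H and T.

```python
import pandas as pd

# Hypothetical timestamps, chosen so that none falls exactly halfway between two marks.
times = pd.Series(pd.to_datetime(["2021-03-04 10:37:45", "2021-03-04 23:12:10"]))

rounded = times.dt.round("h")      # nearest hour
floored = times.dt.floor("h")      # always down to the start of the hour
ceiled = times.dt.ceil("h")        # always up to the next hour
to_5_min = times.dt.round("5min")  # nearest multiple of 5 minutes
```

Rounding up can cross into the next day: the second timestamp, 23:12:10, becomes midnight of March 5 under dt.ceil().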
Shifting Time
data['shifted_dt'] = data['datetime'] + pd.Timedelta(hours=10)  # add 10 hours
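A shift can also go backwards or combine several units. The following sketch uses a hypothetical data table; the earlier_dt column and the chosen offsets are illustrative only.

```python
import pandas as pd

# Hypothetical timestamps to shift.
data = pd.DataFrame({
    "datetime": pd.to_datetime(["2021-01-01 20:00:00", "2021-06-15 03:30:00"]),
})

# Adding a Timedelta moves every value forward; the date rolls over when needed.
data["shifted_dt"] = data["datetime"] + pd.Timedelta(hours=10)

# Subtracting moves values back; several units can be combined in one Timedelta.
data["earlier_dt"] = data["datetime"] - pd.Timedelta(days=1, minutes=30)
```

This kind of shift is one way to move timestamps from UTC to a fixed offset, although it does not account for daylight saving time.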