Knowledge Base

Data Types

Categorical and Quantitative Variables

There are two basic types of variables in statistics: categorical and quantitative. Categorical variables take their values from a limited set, while quantitative variables take numeric values from within a range. It is important to tell them apart because some data analysis methods are restricted by data type, or apply to only one of the two types.

There are also other types of variables, but these stem more from how software stores data than from data analysis itself. Here are some examples of other data types:

  • Logical (Boolean): indicates whether a statement is true or false. The variable takes the value 1 if the statement is true and 0 if it is false.
  • Strings
  • Dates and timestamps
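As a rough illustration of how these variable types can appear in pandas, the sketch below builds a small table and inspects its column types. The data and column names are made up for this example, and marking a column with the `category` dtype is one possible way to represent a categorical variable, not something the text above prescribes.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    'city': ['Paris', 'Rome', 'Paris'],   # categorical: values from a limited set
    'price': [10.5, 7.0, 12.25],          # quantitative: numeric values
    'is_member': [True, False, True],     # logical (Boolean)
    'signup': pd.to_datetime(['2021-01-05', '2021-02-11', '2021-03-20']),  # dates
})

# Explicitly mark the categorical column (an assumption of this sketch)
df['city'] = df['city'].astype('category')

# Show the type pandas assigned to each column
print(df.dtypes)

# Booleans behave like 1 and 0 in arithmetic, so summing counts the True values
members = df['is_member'].sum()
print(members)
```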

Converting Column Values to a Different Type

Converting column values to a different type (int for integers, for instance), replacing the current column:

df['column'] = df['column'].astype('int')

Converting column values to a different type (int for integers, for instance) and storing the result in a new column, leaving the current column unchanged:

df['new_column'] = df['column'].astype('int')
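The following minimal sketch shows the second variant end to end. The DataFrame contents are hypothetical; the only call taken from the text above is `astype('int')`.

```python
import pandas as pd

# Hypothetical column that stores whole numbers as text
df = pd.DataFrame({'column': ['1', '2', '3']})

# Keep the original column and write the converted values to a new one
df['new_column'] = df['column'].astype('int')

# The new column is numeric, so arithmetic now works as expected
total = df['new_column'].sum()
print(df.dtypes)
print(total)
```

Note that `astype('int')` raises an error if any value cannot be interpreted as an integer, which is where `to_numeric()` becomes useful.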

Turning String Values into Numbers

We use a standard pandas function to convert string values into numbers: to_numeric(). It converts column values to the float64 (floating-point number) or int64 (integer) type, depending on the input values.

The to_numeric() method has an errors parameter. This parameter determines what to_numeric will do when it encounters an invalid value:

  • errors='raise' (default): an exception is raised when an invalid value is encountered, halting the conversion.
  • errors='coerce': invalid values are replaced with NaN.
  • errors='ignore': if any value is invalid, the input is returned unchanged. This option is deprecated as of pandas 2.2; catching the exception explicitly is the suggested replacement.

pd.to_numeric(df['column'])
pd.to_numeric(df['column'], errors='raise')
pd.to_numeric(df['column'], errors='coerce')
pd.to_numeric(df['column'], errors='ignore')
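To make the difference between the modes concrete, here is a small sketch with one deliberately invalid value. The data is hypothetical; the calls are the `to_numeric()` variants listed above.

```python
import pandas as pd

# Hypothetical column with one value that is not a number
df = pd.DataFrame({'column': ['10', '2.5', 'n/a']})

# errors='coerce': the invalid value becomes NaN, the rest are converted
converted = pd.to_numeric(df['column'], errors='coerce')
print(converted)

# errors='raise' (the default): the invalid value stops the conversion
try:
    pd.to_numeric(df['column'])
    raised = False
except ValueError as error:
    raised = True
    print('Conversion failed:', error)
```

After coercion, `converted.isna()` can be used to locate the rows that failed to parse.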

Converting Strings to Dates and Times

Python has a special data type we use when working with dates and times: datetime.

To convert strings into dates and times, we use the pandas to_datetime() function. Its arguments include the column that contains the strings and the date format, given as a string.

We set the date format using a special designation system:

  • %d: day of the month (01 to 31)
  • %m: month (01 to 12)
  • %Y: four-digit year (for example, 1994)
  • %y: two-digit year (for example, 94)
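  • T: literal separator between the date and the time in ISO 8601 strings; a trailing Z in such strings marks the time as UTC. Neither is a % code, so they are written in the format as plain characters.
  • %H: hour in a 24-hour format (00 to 23)
  • %I: hour in a 12-hour format (01 to 12)
  • %M: minutes (00 to 59)
  • %S: seconds (00 to 59)

df['column'] = pd.to_datetime(df['column'], format='%d.%m.%Y %H:%M:%S')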

For more information on date and time format codes, see the strftime() and strptime() section of the Python datetime documentation.
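The sketch below applies the format codes to two hypothetical columns: one using the day.month.year layout from the example above, and one using an ISO 8601 layout with a literal T between the date and the time. Column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical strings in two different layouts
df = pd.DataFrame({
    'dotted': ['25.12.2020 18:30:00', '01.02.2021 07:05:09'],
    'iso': ['2020-12-25T18:30:00', '2021-02-01T07:05:09'],
})

# Day first, dots as separators, a space before the time
df['dotted'] = pd.to_datetime(df['dotted'], format='%d.%m.%Y %H:%M:%S')

# Year first, with the literal character T between date and time
df['iso'] = pd.to_datetime(df['iso'], format='%Y-%m-%dT%H:%M:%S')

first = df['dotted'].iloc[0]
print(df.dtypes)
print(first)
```

Both columns end up holding the same moments in time, even though the source strings were laid out differently.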

Converting Unix Time to Times and Dates

There are various ways to represent dates and times, but Unix time deserves special attention. This format gives the number of seconds that have passed since 00:00:00 on January 1, 1970; many systems store the value in milliseconds or other fractions of a second instead. Unix time is counted in Coordinated Universal Time, or UTC.

The to_datetime() function works with Unix time as well. The first argument is the column with Unix times. The second argument, unit, states the unit in which the values are stored so that they can be converted to ordinary dates and times.

Converting Unix time given in seconds:

df['date'] = pd.to_datetime(df['timestamp'], unit='s')

Converting Unix time given in milliseconds:

df['date'] = pd.to_datetime(df['timestamp'], unit='ms')
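A short sketch of both units follows. The timestamp values are hypothetical; the millisecond column is simply the second column multiplied by 1000, so both conversions should give the same dates.

```python
import pandas as pd

# Hypothetical Unix times stored in seconds
df = pd.DataFrame({'timestamp': [0, 86400, 1600000000]})

# The same moments stored in milliseconds
df['timestamp_ms'] = df['timestamp'] * 1000

# The unit argument tells pandas how to interpret the numbers
df['date'] = pd.to_datetime(df['timestamp'], unit='s')
df['date_from_ms'] = pd.to_datetime(df['timestamp_ms'], unit='ms')

print(df[['timestamp', 'date']])
```

Passing the wrong unit does not raise an error for values like these; it silently produces dates that are off by a factor of 1000, so it is worth checking a few converted values by eye.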

Extracting Time Components

We often have to study statistics by month, day, or year. To do so, we can wrap the datetime column in a DatetimeIndex and read its year, month, day, or other time attribute:

pd.DatetimeIndex(df['column']).year
pd.DatetimeIndex(df['column']).month
pd.DatetimeIndex(df['column']).day
pd.DatetimeIndex(df['column']).hour
pd.DatetimeIndex(df['column']).minute
pd.DatetimeIndex(df['column']).second

For columns with datetime-like values, you can also access these properties via the .dt accessor:

df['time'].dt.year
df['time'].dt.month
df['time'].dt.day
df['time'].dt.hour
df['time'].dt.minute
df['time'].dt.second

The complete list of components can be found in the pandas documentation for the .dt accessor. For example, the dt.weekday attribute gives the day of the week as a number, from 0 for Monday to 6 for Sunday.
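The sketch below pulls several components out of a hypothetical datetime column and stores them as new columns, which is a common first step before grouping by month or weekday. The new column names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical datetime column
df = pd.DataFrame({
    'time': pd.to_datetime(['2021-03-15 14:45:30', '2021-12-31 23:59:59'])
})

# Each component is an attribute of the .dt accessor, not a method call
df['year'] = df['time'].dt.year
df['month'] = df['time'].dt.month
df['hour'] = df['time'].dt.hour
df['weekday'] = df['time'].dt.weekday  # Monday is 0, Sunday is 6

# The DatetimeIndex route gives the same values
months_via_index = pd.DatetimeIndex(df['time']).month

print(df)
```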

Rounding Time Components

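To round times, use the dt.round() method. It takes a string naming the unit to round to (day, hour, minute, or second), optionally preceded by a multiplier, as in '5T' for 5 minutes:

  • D: day
  • H: hour
  • min or T: minute
  • S: second

In pandas 2.2 and later, the aliases H, T, and S are deprecated in favor of h, min, and s.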

Rounding time:

df['datetime'] = df['datetime'].dt.round('1H') # round to hour
df['datetime'] = df['datetime'].dt.round('1D') # round to day
df['datetime'] = df['datetime'].dt.round('5T') # round to 5 minutes
df['datetime'] = df['datetime'].dt.round('10S') # round to 10 seconds
df['datetime'] = df['datetime'].dt.floor('1H') # always round down
df['datetime'] = df['datetime'].dt.ceil('1H') # always round up
df['datetime'] = df['datetime'].dt.round('3H') # round to 3 hours

If you want to round up, use the dt.ceil() (ceiling) method. To round down, use dt.floor().
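The following sketch compares round, floor, and ceil on a single hypothetical timestamp. It expresses every interval in minutes with the min alias (so '60min' stands for one hour); that choice is an assumption of this example, made to sidestep the alias differences between pandas versions. Results are written to separate, made-up column names so that the original value stays available.

```python
import pandas as pd

# Hypothetical datetime column with a single value
df = pd.DataFrame({'datetime': pd.to_datetime(['2021-03-15 14:46:10'])})

# Round to the nearest 5 minutes
df['nearest_5min'] = df['datetime'].dt.round('5min')

# Round to the nearest hour, written as 60 minutes
df['nearest_hour'] = df['datetime'].dt.round('60min')

# Always round down or always round up to the hour
df['hour_floor'] = df['datetime'].dt.floor('60min')
df['hour_ceil'] = df['datetime'].dt.ceil('60min')

print(df.T)
```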

Shifting Time

To shift the values of a datetime column by a fixed interval, add a pd.Timedelta to the column:

data['shifted_dt'] = data['datetime'] + pd.Timedelta(hours=10) # Add 10 hours
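A runnable sketch of shifting in both directions is shown below. The DataFrame contents and the earlier_dt column are hypothetical; subtracting a Timedelta works the same way as adding one.

```python
import pandas as pd

# Hypothetical datetime column
data = pd.DataFrame({'datetime': pd.to_datetime(['2021-03-15 20:00:00'])})

# Shift forward by 10 hours; the date rolls over to the next day
data['shifted_dt'] = data['datetime'] + pd.Timedelta(hours=10)

# Shift backward by one day and 30 minutes
data['earlier_dt'] = data['datetime'] - pd.Timedelta(days=1, minutes=30)

print(data.T)
```

This kind of shift is often used to move UTC timestamps to a fixed local offset.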