Knowledge Base

Imbalanced Classification

Balance and Imbalance of the Classes

Classes are imbalanced when their ratio is far from 1:1. Class balance is observed when the classes occur in approximately equal numbers.

Accuracy doesn't account for class imbalance: a model that always predicts the frequent class can still score high while being useless on the rare class.

Class weight adjustment

If we need to indicate that observations of a certain class are more important, we assign a higher weight to that class.

Logistic regression, decision tree, and random forest in the sklearn library have the class_weight argument. By default, it is None, meaning all classes have the same weight:

class "0" weight = 1.0

class "1" weight = 1.0

If we specify class_weight='balanced', the algorithm calculates how many times more often class "0" occurs than class "1". Let's denote this number as N. The new class weights become:

class "0" weight = 1.0

class "1" weight = N

The rare class will have a higher weight.
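
If you're curious about the exact numbers, sklearn exposes the computation as compute_class_weight() in sklearn.utils.class_weight. A minimal sketch, assuming a toy target where class "0" occurs nine times as often as class "1":

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy target: nine "0"s for every "1"
target = np.array([0] * 9 + [1])

# 'balanced' assigns each class the weight n_samples / (n_classes * class_count)
weights = compute_class_weight('balanced', classes=np.unique(target), y=target)
print(weights)  # approx. [0.56 5.0]; the rare class weight is 9 times higher

Note that sklearn scales the pair differently from the 1.0 and N scheme above, but the ratio between the two weights is still N.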

Model training with class weight adjustment

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced', random_state=12345)
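
To see the effect end to end, here is a minimal, self-contained sketch; the synthetic dataset from make_classification and the 90/10 imbalance are illustrative assumptions, not part of the lesson:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic data with roughly 90% of observations in class "0"
features, target = make_classification(
    n_samples=2000, weights=[0.9], random_state=12345)
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, random_state=12345)

# train with and without class weight adjustment and compare F1 scores
for class_weight in [None, 'balanced']:
    model = LogisticRegression(class_weight=class_weight, random_state=12345)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    print(class_weight, f1_score(target_valid, predictions))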

Upsampling

Upsampling is performed in several steps:

  • Split the training sample by class.
  • Determine the class with fewer observations. Call it the rare class.
  • Duplicate the rare class observations several times.
  • Create a new training sample based on the data obtained.
  • Shuffle the data.

Observations can be duplicated with Python's list multiplication syntax: multiplying a list by an integer repeats its elements that many times:

answers = [0, 1, 0]
print(answers)
answers_x3 = answers * 3
print(answers_x3)

[0, 1, 0]
[0, 1, 0, 0, 1, 0, 0, 1, 0]

Use the pd.concat() function to concatenate tables. It takes a list of tables as input and returns the combined table:

pd.concat([table1, table2])
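
A small illustration (the tables and their values are made up):

import pandas as pd

table1 = pd.DataFrame({'answer': [0, 1]})
table2 = pd.DataFrame({'answer': [0, 0]})

# the rows of table2 are stacked under the rows of table1
print(pd.concat([table1, table2]))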

To shuffle observations randomly, use the shuffle() function from the sklearn.utils module:

from sklearn.utils import shuffle

features, target = shuffle(features, target, random_state=54321)

The upsample() function:

import pandas as pd
from sklearn.utils import shuffle


def upsample(features, target, repeat):
    # split the training sample by class
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # duplicate the rare class ("1") observations repeat times
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    # shuffle so the duplicated observations are mixed in
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=54321)

    return features_upsampled, target_upsampled
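
A usage sketch (features_train and target_train are hypothetical names for an already prepared training set; repeat=10 is an arbitrary value you would pick based on the actual class ratio):

features_upsampled, target_upsampled = upsample(features_train, target_train, 10)

# the rare class should now appear about 10 times more often than before
print(target_upsampled.value_counts())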

Downsampling

Downsampling is performed in several steps:

  • Split the training sample by class.
  • Determine the class with more observations. Let's call it the majority class.
  • Randomly drop a portion of the majority class observations.
  • Create a new training sample based on the data obtained.
  • Shuffle the data.

To randomly discard a portion of a table's rows, use the sample() method. Its frac argument (short for fraction) specifies which share of the rows to keep; the method returns that fraction of the table, chosen at random:

features_sample = features_train.sample(frac=0.1, random_state=54321)
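
A quick check of what frac does (the toy table is made up):

import pandas as pd

table = pd.DataFrame({'a': range(10)})
print(len(table.sample(frac=0.3, random_state=54321)))  # 3, i.e. 30% of 10 rows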

The downsample() function:

import pandas as pd
from sklearn.utils import shuffle


def downsample(features, target, fraction):
    # split the training sample by class
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # keep only a random fraction of the majority class ("0") observations
    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=54321)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=54321)] + [target_ones])

    # shuffle so the remaining observations are mixed in
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=54321)

    return features_downsampled, target_downsampled
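
A usage sketch (again, features_train and target_train are hypothetical names; fraction=0.1 assumes roughly a 10:1 class ratio):

features_downsampled, target_downsampled = downsample(
    features_train, target_train, 0.1)

# the majority class is reduced to about 10% of its original size
print(target_downsampled.value_counts())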