Estimating Distribution with Location and Dispersion

Estimating Location

We can use a measure of location, such as the mean or the median, to estimate approximately where a dataset lies on the numerical axis. More formally, the mean is called the algebraic measure of location and the median the structural measure of location.

In the formula below, the Greek letter mu, $\mu$, stands for the arithmetic mean of the data, $x_{i}$ are the individual values, and $n$ is the number of values.

\mu = \frac{\sum \left (x_{i} \right )}{n}
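As a minimal sketch of the two measures of location, the snippet below computes the mean by hand (sum divided by count) and compares it with numpy's mean() and median() functions. The values in x are made up purely for illustration and are not taken from any dataset discussed here.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
x = np.array([2, 4, 4, 5, 10])

# Arithmetic mean: the sum of the values divided by how many there are.
mean_manual = x.sum() / len(x)

# NumPy equivalents for the mean and the median.
mean = np.mean(x)
median = np.median(x)

print(mean_manual, mean, median)
```

With data like this, the mean and the median need not coincide, which is one reason to look at both.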

Estimating Dispersion

To really understand the data, we need more than a measure of location. We also need to know how the data is scattered, or dispersed, around that location.

Variance is one way to measure this dispersion.

To calculate variance, take the average squared distance of the values from the mean.

\sigma^{2} = \frac{\sum \left ( \mu - x_{i} \right )^{2}}{n}

To calculate variance in Python, we can use the var() function from the numpy library. By default it divides by n, matching the formula above.


import numpy as np

# Example data, assumed here so the snippet runs on its own
x = np.array([1, 2, 3, 4, 5, 6])
variance = np.var(x)
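To connect the formula with the library call, here is a small sketch that computes the variance by hand and checks it against np.var(). The data in x is assumed for illustration only. The ddof=1 call is shown as an aside: it divides by n - 1 instead of n, which gives the sample variance rather than the population variance used in the formula above.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
x = np.array([1, 2, 3, 4, 5, 6])

mu = x.mean()

# Average squared distance from the mean (divides by n, as in the formula).
variance_manual = ((mu - x) ** 2).sum() / len(x)

# np.var uses ddof=0 by default, so it should agree with the manual result.
variance_numpy = np.var(x)

# ddof=1 divides by n - 1 instead of n (the sample variance).
sample_variance = np.var(x, ddof=1)

print(variance_manual, variance_numpy, sample_variance)
```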

The variance is expressed in the square of the variable’s original units. But what if we want the original units of measurement?

Standard deviation is the value we get after we take the square root of the variance.

\sigma = \sqrt{\frac{\sum \left ( \mu - x_{i} \right )^{2}}{n}}

The formula looks a little complicated, but luckily in Python we can use the std() function from the numpy library.


import numpy as np

# Example data, assumed here so the snippet runs on its own
x = np.array([1, 2, 3, 4, 5, 6])
standard_deviation = np.std(x)

And if you already know the variance, you can use numpy's sqrt() function to get the standard deviation.


import numpy as np

variance = 2.9166666666666665
standard_deviation = np.sqrt(variance)

From here we can use the rule of three standard deviations, or the three-sigma rule. This rule states that, for approximately normally distributed data, almost all values (approximately 99.7%) are found within three standard deviations of the mean:

(\mu - 3\sigma, \mu + 3\sigma)

This rule not only helps you find the interval where most of the values you are interested in fall, but also helps you find values outside that interval (outliers).
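The three-sigma interval can be sketched in a few lines of numpy. Everything about the data here is an assumption made for illustration: the values are synthetic draws from a normal distribution (with made-up mean and scale), and one extreme value is appended by hand so that there is something to flag.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 1000 draws from a normal
# distribution, plus one extreme value appended by hand.
rng = np.random.default_rng(seed=0)
x = np.append(rng.normal(loc=50.0, scale=5.0, size=1000), 120.0)

mu = np.mean(x)
sigma = np.std(x)

# Bounds of the three-sigma interval (mu - 3*sigma, mu + 3*sigma).
lower, upper = mu - 3 * sigma, mu + 3 * sigma

# Boolean mask of values strictly inside the interval.
inside = (x > lower) & (x < upper)

# Fraction of values inside the interval, and the values outside it.
share_inside = inside.mean()
outliers = x[~inside]

print(lower, upper, share_inside, outliers)
```

Note that extreme values also inflate the mean and standard deviation used to build the interval, so this is a rough screening step rather than a definitive outlier test.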