
Algorithm Analysis

Computational complexity

An algorithm's running time isn't measured in seconds. Instead, it is measured by the number of elementary operations the algorithm performs. The time an algorithm takes on a particular computer is usually called the real running time. An algorithm's running time is also influenced by its arguments.

Denote the length of the list as $n$. The running time is a function of $n$, written $T(n)$. The asymptotic running time of an algorithm shows how $T(n)$ grows as $n$ increases.

When $T(n)$ is a polynomial, the asymptotic running time equals the term with the highest power, without its coefficient (for example, $n^2$ instead of $5n^2$). As $n$ reaches greater values, the other terms become unimportant.

  • If $T(n) = 4n + 3$, the asymptotic running time is $T(n) \sim n$. The algorithm has linear complexity. The tilde symbol ($\sim$) means that the asymptotic running time is $n$.
  • If $T(n) = 5n^2 + 3n - 1$, the asymptotic running time is $T(n) \sim n^2$. The algorithm has quadratic complexity.
  • If $T(n) = 10n^3 - 2n^2 + 5$, then $T(n) \sim n^3$. The algorithm has cubic complexity.
  • If $T(n) = 10$, then $T(n) \sim 1$. The algorithm has constant complexity, that is, it doesn't depend on $n$.
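To make this concrete, here is a small illustrative sketch (not part of the original text; the function names are made up) that counts elementary operations for a linear and a quadratic algorithm and shows how the counts grow as $n$ increases:

```python
def count_linear(n):
    """Simulates an algorithm that touches each of n elements once: T(n) ~ n."""
    operations = 0
    for _ in range(n):
        operations += 1          # one elementary operation per element
    return operations


def count_quadratic(n):
    """Simulates an algorithm with a nested loop over n elements: T(n) ~ n^2."""
    operations = 0
    for _ in range(n):
        for _ in range(n):
            operations += 1      # one elementary operation per pair of elements
    return operations


for n in (10, 100, 1000):
    print(n, count_linear(n), count_quadratic(n))
# When n grows by a factor of 10, the linear count grows 10x,
# while the quadratic count grows 100x; only the highest power matters.
```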

Linear regression model training time

The linear regression training objective is represented as follows:

$w = \arg\min_w \text{MSE}(Xw, y)$

Weights are calculated by this formula:

$w = (X^\top X)^{-1} X^\top y$
  • Define the number of observations in the training set as $n$, and the number of features as $p$
  • The size of matrix $X$ will be $n \times p$, and the size of vector $y$ will be $n$
  • The computational complexity will be defined as $T(n, p)$ because it depends on two parameters: $n$ and $p$

To calculate the training complexity, add up the costs of each operation in the formula (computing $X^\top X$, inverting it, multiplying the inverse by $X^\top$, and then multiplying by $y$):

$T(n, p) \sim np^2 + p^3 + np^2 + np$

There are usually fewer features than observations, meaning $p < n$. Multiplying both sides by $p^2$ results in $p^3 < np^2$. Taking only the term with the highest power, we get: $T(n, p) \sim np^2$.
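As a minimal sketch of this direct solution (assuming NumPy and synthetic data; the sizes here are illustrative, not from the original), the intermediate products below correspond to the terms summed above:

```python
import numpy as np

# Minimal sketch of the direct (normal equation) solution on synthetic data.
n, p = 1000, 5                     # observations and features (illustrative)
X = np.random.rand(n, p)           # n x p feature matrix
y = np.random.rand(n)              # target vector of length n

XtX = X.T @ X                      # ~ n * p^2 operations
XtX_inv = np.linalg.inv(XtX)       # ~ p^3 operations
w = XtX_inv @ X.T @ y              # ~ n * p^2 + n * p operations

print(w)                           # weight vector of length p
```

In practice, `np.linalg.solve(XtX, X.T @ y)` is usually preferred over computing an explicit inverse for numerical stability; the asymptotic cost stays the same.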

Iterative methods

The following formula is employed as a direct method in linear regression model training:

$w = (X^\top X)^{-1} X^\top y$

Direct methods find an exact solution by following a given formula or algorithm. Their computational complexity is known in advance and doesn't depend on the values in the data.

Iterative methods, or iterative algorithms, repeat similar steps over and over, with the solution becoming more accurate at each iteration. If there's no need for high accuracy, just a few iterations will do.

The computational complexity of iterative methods depends on the number of steps they take, which may be affected by the amount of data.

Bisection method

The bisection method takes a continuous function and a segment $[a, b]$ as input. The values $f(a)$ and $f(b)$ must have different signs.

When these two conditions are fulfilled:

  1. The function is continuous
  2. The values at the ends of the segment have different signs

then the root of the equation is located somewhere on the given segment.

At each iteration, the bisection method:

  • Checks whether either value $f(a)$ or $f(b)$ equals zero. If it does, the solution has been found
  • Finds the middle of the segment: $c = \frac{a + b}{2}$
  • Compares the sign of $f(c)$ with the signs of $f(a)$ and $f(b)$
    • If $f(c)$ and $f(a)$ have different signs, the root is located on the segment $[a, c]$. The algorithm analyzes this segment on its next iteration
    • If $f(c)$ and $f(b)$ have different signs, the root is located on the segment $[c, b]$. The algorithm analyzes this segment on its next iteration
    • The signs of $f(a)$ and $f(b)$ are different, so there are no other options

The solution's accuracy is usually chosen beforehand, for example, $e$ (margin of error) $= 0.000001$. At each iteration, the segment containing the root is halved. Once the segment's length is less than $e$, the algorithm can be stopped. This condition is called the stopping criterion.
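Here is a minimal sketch of the bisection method as described above (the test function and segment bounds are illustrative assumptions, not from the original):

```python
def bisect(f, a, b, e=0.000001):
    """Finds a root of f on [a, b], assuming f(a) and f(b) have different signs."""
    if f(a) == 0:
        return a
    if f(b) == 0:
        return b
    while b - a > e:                 # stopping criterion: segment shorter than e
        c = (a + b) / 2              # middle of the segment
        if f(c) == 0:
            return c
        if f(a) * f(c) < 0:          # f(a) and f(c) have different signs
            b = c                    # the root is on [a, c]
        else:                        # otherwise f(c) and f(b) have different signs
            a = c                    # the root is on [c, b]
    return (a + b) / 2


# Example: the root of x^2 - 2 on [0, 2] is sqrt(2) ≈ 1.414214
print(bisect(lambda x: x * x - 2, 0, 2))
```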

Comparing methods

Gradient descent:

  • Works faster on large datasets for linear regression with the MSE loss function
  • Is also suitable for linear regression with other loss functions (not all of them have a direct formula for the solution)
  • Can be used for training neural networks, which also lack direct formulas
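For illustration, here is a hedged sketch of gradient descent for linear regression with the MSE loss (assuming NumPy; the synthetic data, learning rate, and iteration count are illustrative assumptions, not from the original text):

```python
import numpy as np

# Sketch of gradient descent for linear regression with the MSE loss.
n, p = 1000, 5
X = np.random.rand(n, p)                   # synthetic features (illustrative)
true_w = np.arange(1, p + 1)               # hypothetical "true" weights
y = X @ true_w                             # synthetic targets

w = np.zeros(p)                            # start from zero weights
learning_rate = 0.1
for _ in range(1000):
    gradient = 2 / n * X.T @ (X @ w - y)   # gradient of MSE(Xw, y) with respect to w
    w -= learning_rate * gradient          # one iterative step toward the minimum

print(w)    # approaches true_w; each iteration costs roughly n * p operations
```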