Hello. This statistics post is an introduction to sample variances, sample covariances and sample correlations. These topics are typically found in an introductory course on probability and statistics.
No expected values will be shown here, as the focus is on sample data rather than population data. (With that being said, be mindful of the math notation.)
Review Of The Sample Mean
Recall that the formula for the sample mean of known data point values $x_1, x_2, \ldots, x_n$ is $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
The sample mean is the equally weighted average of the data points and can be thought of as the center of the data. We use the sample mean to estimate (or make an informed guess about) the population mean $\mu$. Remember that the sample data is a subset of the larger population (e.g. a sample of 100,000 Canadians from Canada).
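As a quick illustration of the sample mean described above, here is a minimal Python sketch. The function name `sample_mean` and the height values are made up for this example and are not from any real data set.

```python
def sample_mean(xs):
    """Equally weighted average of the observed data points."""
    return sum(xs) / len(xs)


# Hypothetical sample of five observed heights in centimetres.
heights_cm = [160.0, 172.0, 168.0, 181.0, 169.0]

x_bar = sample_mean(heights_cm)
print(x_bar)  # 170.0
```

Each data point contributes with the same weight of 1/n, which is why the result sits at the "center" of the five values.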
Means, medians, and modes are measures of central tendency, which describe the typical or central values of the sample data. Variances, covariances and correlations are measures of variation. These measures of variation are useful in quantifying how random a random variable is.
When people say, for example, that the stock market is random, it is a vague statement, as it does not specify any quantity associated with the randomness of financial stocks (such as the range of values a financial stock can take on).
The Sample Variance
In the realm of probability and statistics, the variance can be thought of as a measure of how far the values of a (random) variable are spread out from the mean.
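The population variance is given by
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
where $N$ is the population size. We take each value $x_i$, subtract the mean $\mu$ from it and square the result. We then take the sum of these squared differences and divide by $N$.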
This population variance is estimated by the sample variance (with known values). The sample variance is given by
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
In the sample variance, we divide by $n - 1$, which makes the sample variance an unbiased estimator of the population variance. In the long run, as the sample size $n$ gets larger, the sample variance would eventually reach (converge to) the population variance in theory.
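The unbiasedness claim can be checked exactly on a tiny example. The sketch below assumes a made-up population (the six faces of a fair die) and independent draws with replacement. It enumerates every possible sample of size 3 and averages the variance estimate over all of them, once with divisor $n - 1$ and once with divisor $n$. The helper name `var_with_divisor` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Made-up population: the faces of a fair six-sided die.
population = [1, 2, 3, 4, 5, 6]
N = len(population)
mu = Fraction(sum(population), N)
pop_var = sum((x - mu) ** 2 for x in population) / N  # divide by N


def var_with_divisor(xs, divisor):
    """Sum of squared deviations from the sample mean, over a chosen divisor."""
    x_bar = Fraction(sum(xs), len(xs))
    return sum((x - x_bar) ** 2 for x in xs) / divisor


n = 3
# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

avg_unbiased = sum(var_with_divisor(s, n - 1) for s in samples) / len(samples)
avg_biased = sum(var_with_divisor(s, n) for s in samples) / len(samples)

print(pop_var)       # 35/12
print(avg_unbiased)  # 35/12, matches the population variance on average
print(avg_biased)    # 35/18, too small on average
```

Exact fractions are used instead of floating point so that the equality holds exactly rather than approximately. Any single sample can still land above or below the population variance; "unbiased" only describes the average over all samples.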
Because of the exponent of 2, the variance is a non-negative value: it can be 0 or greater. If the variance is 0, there is no randomness or variation in the random variable (every value equals the mean).
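To make the two divisors concrete, here is a small Python sketch. The function names and the data values are assumptions chosen for illustration; `statistics.variance` is the standard library's sample variance and is included only as a cross-check.

```python
import statistics


def population_variance(xs):
    """Treat xs as the whole population: divide by N."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


def sample_variance(xs):
    """Treat xs as a sample: divide by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


# Made-up values with mean 5 and squared deviations summing to 32.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(population_variance(data))         # 4.0
print(sample_variance(data))             # 32 / 7, about 4.571
print(statistics.variance(data))         # library sample variance, same divisor n - 1
print(sample_variance([3.0, 3.0, 3.0]))  # 0.0, identical values have no variation
```

The sample variance is always a little larger than the divide-by-N version computed on the same data, and the gap shrinks as n grows.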
A Slight Note on Notation
If we have an uppercase $X_i$ then it is random and its value is unknown or unrealized. Once $X_i$ is known or realized it is no longer random. It is now a known quantity and we use the lowercase $x_i$. Similarly, $S^2$ is for the sample variance with unknown values, while the sample variance $s^2$ goes with known values and is no longer random.
The standard deviation is the square root of the variance and is used as a measure of how spread out the values of the sample data are. If one knows about z-scores, a z-score is the number of standard deviations a value lies from the mean. The variance being non-negative (0 or positive) ensures that the number inside the square root is never negative.
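The standard deviation for a population is $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$.
The sample standard deviation (of known values) is $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$.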
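A short sketch of the sample standard deviation, together with a z-score, which expresses a value's distance from the mean in units of standard deviations. The function names `sample_std` and `z_score` and the data are illustrative assumptions.

```python
import math


def sample_std(xs):
    """Square root of the sample variance (divisor n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))


def z_score(value, mean, std):
    """Number of standard deviations that value lies from mean."""
    return (value - mean) / std


# Made-up values with sample mean 5.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
x_bar = sum(data) / len(data)

s = sample_std(data)
z = z_score(9.0, x_bar, s)

print(s)  # sqrt(32 / 7), about 2.138
print(z)  # about 1.871, so 9 is a bit under two standard deviations above the mean
```

Unlike the variance, the standard deviation is in the same units as the data, which is what makes the z-score comparison meaningful.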
The covariance is a variability measure of how two random variables change together. If the covariance is positive for random variables $X$ and $Y$ (as an example), then as $X$ increases in numeric value, $Y$ tends to increase as well. For the negative covariance case, as $X$ increases in numeric value, $Y$ tends to decrease in value.
The sample covariance (with known values) is:
$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
where $\bar{x}$ is the sample mean associated with X and $\bar{y}$ is the sample mean associated with Y.
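If $y_i = x_i$ for every $i$, then we have $s_{xx} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})$ or $\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, which is in the same form as the sample variance $s^2$ from earlier. Do keep in mind that $\bar{x}$ goes with the $x_i$s and $\bar{y}$ goes with the $y_i$s.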
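The sample covariance, and the fact that the covariance of a variable with itself is its sample variance, can be sketched in a few lines of Python. The function names and the paired data are made up for illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)


def sample_covariance(xs, ys):
    """Average product of paired deviations, with divisor n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Made-up paired observations (x_i, y_i).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

print(sample_covariance(x, y))                # 1.5, positive: y tends to rise with x
print(sample_covariance(x, x))                # 2.5, the sample variance of x
print(sample_covariance(x, [-v for v in y]))  # -1.5, flipping y flips the sign
```

Note that each deviation uses its own mean: the x deviations are taken from the mean of x and the y deviations from the mean of y.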
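Correlation is not much different from covariance. Correlation is a variability measure which measures the strength and direction of the linear relationship between two random variables (or sets of data). The sample correlation formula is as follows:
$$r_{xy} = \frac{s_{xy}}{\sqrt{s_x^2 \, s_y^2}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
The numerator is the sample covariance and the denominator is the square root of the sample variance of the $x_i$s multiplied by the sample variance of the $y_i$s. Correlations can be viewed as scaled versions of covariances.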
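Here is a minimal sketch of the sample correlation in Python, using made-up data. Because the same $n - 1$ divisor appears in the covariance and in both variances, it cancels, so the function below works directly with sums of deviations. The function name `sample_correlation` is an assumption for this example.

```python
import math


def sample_correlation(xs, ys):
    """Sample covariance scaled by the two sample standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)  # the n - 1 divisors cancel


# Made-up paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = sample_correlation(x, y)
r_rescaled = sample_correlation(x, [100.0 * v for v in y])
r_line = sample_correlation(x, [2.0 * v + 1.0 for v in x])
r_neg_line = sample_correlation(x, [-2.0 * v + 1.0 for v in x])

print(r)           # about 0.775
print(r_rescaled)  # about 0.775 again: changing the units of y does not change r
print(r_line)      # 1.0 up to rounding, points lie exactly on a rising line
print(r_neg_line)  # -1.0 up to rounding, points lie exactly on a falling line
```

The rescaling line is the point of "scaled covariance": multiplying y by 100 multiplies the covariance by 100, but the correlation stays put.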
Correlations are between -1 and 1, inclusive.
A positive correlation between random variables $X$ and $Y$ means that as $X$ increases, $Y$ tends to increase as well. With the negative correlation case, as $X$ increases, $Y$ tends to decrease.
Correlations close to zero suggest that the two variables have little or no linear relationship with each other.
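One caveat is worth a small sketch: a correlation near zero only speaks to a straight-line relationship. In the made-up data below, y is completely determined by x (it is x squared), yet the sample correlation is zero because the positive and negative deviations cancel out. The function name is an assumption for this example.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation from sums of deviations (the n - 1 divisors cancel)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: x is symmetric around 0 and y depends on x exactly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 2 for v in x]

r = sample_correlation(x, y)
print(r)  # zero, even though y is a deterministic function of x
```

Plotting the data alongside the correlation is a simple way to catch cases like this.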
A correlation close to -1 suggests a very strong relationship where as $X$ increases, $Y$ decreases. A correlation close to +1 suggests a very strong relationship where as $X$ increases, $Y$ increases.
Correlation measures of about 0.5 or -0.5 suggest a moderate correlation, and values closer to 0 suggest a weak association between two random variables.
The table image below is one aid to help associate correlation measures with relationship strengths between two (random) variables. (Other tables would have other definitions of moderate/strong/weak correlation strengths.)
Correlation Does Not Imply Causation
The most important thing you should remember when it comes to correlations is that “Correlation does not mean causation!”. Correlations measure the relationship and dependence between two variables based on a sample from the population. The sample size may not be “large” relative to the population size. Also, there may be other variables which could affect the dependent variable $Y$. The take-home message is that just because $X$ appears to cause $Y$, it doesn’t mean it actually does (as we don’t have all the data/information).
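The "other variables" point can be illustrated with a small simulation. Everything below is simulated under stated assumptions: a hidden variable z drives both x and y, and x has no effect on y at all. The variable names, noise levels and seed are arbitrary choices for this sketch.

```python
import math
import random


def sample_correlation(xs, ys):
    """Sample correlation from sums of deviations (the n - 1 divisors cancel)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)  # fixed seed so the run is repeatable
n = 5000

# Hidden common cause z; x and y each equal z plus their own independent noise.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

r_xy = sample_correlation(x, y)

# Subtract the common cause: only the independent noise terms remain.
x_noise = [xi - zi for xi, zi in zip(x, z)]
y_noise = [yi - zi for yi, zi in zip(y, z)]
r_noise = sample_correlation(x_noise, y_noise)

print(r_xy)     # strongly positive, although x never influences y
print(r_noise)  # close to zero once the shared driver z is removed
```

In real data the hidden variable is usually not observed, so it cannot simply be subtracted out; the simulation only shows how a strong correlation can arise without any causal link between the two measured variables.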