Statistics
Statistics is about organizing data in a way that can allow further meaning and inferences.
One of the means for organizing data is frequency distribution
When using pictograms, beware of the possibility of giving a false impression with the data.
When describing the data, it is good to have the measures of central tendency and measures of dispersion
Statistical hypothesis testing
Null hypothesis - general statement that nothing is happening.
Errors
Type 1 - False positive. The mistaken rejection of an actually true null hypothesis. “An innocent person is convicted.”
Type 2 - False negative. The mistaken acceptance of an actually false null hypothesis. “A guilty person is not convicted.”
A crossover error rate (CER) is the point at which both types are equal. A lower CER means something is more accurate.
Frequency distribution
Frequency distribution is where the you take each possible value of data and enumerate the total number of times that data has appeared (the frequency)
Where the number of different of data points becomes difficult (eg the income of every individual in Canada), consider grouping data.
Frequency distribution lends itself well to statistics
Grouping data
When making a frequency distribution, it can sometimes make the information clearer by grouping data. An example for this is “income per individual.” Instead of having thousands of individual points, you can group them like so:
Income | Frequency |
120k-159k | 45 |
90-119k | 345 |
60-89k | 44453 |
30-59k | 3345 |
0-29k | 600 |
Rules of thumb when creating classes:
- Keep number of classes reasonable to what you’re trying to convey (8-15)
- Classes should be of the same size
- Classes should be easy to handle
One way to handle this is to take the range from highest to lowest, then divide by the number of classes. While you can do something like “120k+” plus in the above table, you lose the ability to do arithmetic on the data. It might also hide outliers and anomalies.
Law of Large Numbers
Don’t mistake probability for reality
Measures of central tendency
MCT can be best thought of as averages. A useful tool in Statistics
Mean | weighted center, summing the values then dividing by size of the set |
Median | the numerical center, halving the set length |
Mode | the most frequent value in a set |
In a set of [1, 2, 2, 4, 6], the mean is 3, the mode is 2, the median is 2.
If the number of values in a set is even, the median of the set is the average of the two middle-most numbers.
range = highest value - lowest value (obvious, but worth stating)
Measures of dispersion
Dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed
On symmetrical distributions, you’ll want to find the standard deviation (how close numbers are to the mean).
Where distributions are asymmetric or have outlier values, consider using the semi-interquartile range: Q = 1/2(Q3 - Q1)
Percentile
The percentile of a set is the number of n percent of the set.
Bar graphs
Bar graphs are a way of visualizing data that can make numbers more obvious, for example showing frequency distribution.
A histogram is similar to a bar graph, but the variables are broken into intervals.