docs.daveops.net

Snippets for yer computer needs

Statistics

Statistics is about organizing data in a way that can allow further meaning and inferences.

One of the means for organizing data is frequency distribution

When using pictograms, beware of the possibility of giving a false impression with the data.

When describing the data, it is good to have the measures of central tendency and measures of dispersion

Statistical hypothesis testing

Null hypothesis - general statement that nothing is happening.

Errors

Type 1 - False positive. The mistaken rejection of an actually true null hypothesis. “An innocent person is convicted.”

Type 2 - False negative. The mistaken acceptance of an actually false null hypothesis. “A guilty person is not convicted.”

A crossover error rate (CER) is the point at which both types are equal. A lower CER means something is more accurate.

Frequency distribution

Frequency distribution is where the you take each possible value of data and enumerate the total number of times that data has appeared (the frequency)

Where the number of different of data points becomes difficult (eg the income of every individual in Canada), consider grouping data.

Frequency distribution lends itself well to statistics

Grouping data

When making a frequency distribution, it can sometimes make the information clearer by grouping data. An example for this is “income per individual.” Instead of having thousands of individual points, you can group them like so:

Income Frequency
120k-159k 45
90-119k 345
60-89k 44453
30-59k 3345
0-29k 600

Rules of thumb when creating classes:

One way to handle this is to take the range from highest to lowest, then divide by the number of classes. While you can do something like “120k+” plus in the above table, you lose the ability to do arithmetic on the data. It might also hide outliers and anomalies.

Law of Large Numbers

Don’t mistake probability for reality

Measures of central tendency

MCT can be best thought of as averages. A useful tool in Statistics

Mean weighted center, summing the values then dividing by size of the set
Median the numerical center, halving the set length
Mode the most frequent value in a set

In a set of [1, 2, 2, 4, 6], the mean is 3, the mode is 2, the median is 2.

If the number of values in a set is even, the median of the set is the average of the two middle-most numbers.

range = highest value - lowest value (obvious, but worth stating)

Measures of dispersion

Dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed

On symmetrical distributions, you’ll want to find the standard deviation (how close numbers are to the mean).

Where distributions are asymmetric or have outlier values, consider using the semi-interquartile range: Q = 1/2(Q3 - Q1)

Percentile

The percentile of a set is the number of n percent of the set.

Bar graphs

Bar graphs are a way of visualizing data that can make numbers more obvious, for example showing frequency distribution.

A histogram is similar to a bar graph, but the variables are broken into intervals.