The famous statistician John W. Tukey pioneered a branch of data analysis called Exploratory Data Analysis.
In exploratory data analysis we get familiar with the data, ask questions, visualize the data in a number of forms, and look for relationships between variables, outliers, patterns and trends. One way to start is with measures of centre. We will find that interesting stories may arise from both the norm and the exceptions in data.
Basic statistics for exploring data: Measures of Centre
A measure of centre is a value at the centre or middle of a dataset. There are 3 measures of centre commonly used in exploratory data analysis: mean, median and mode.
Mean
The mean, also known as the average, is the measure of centre found by adding the data values and dividing the total by the number of values.
How to calculate the mean?
Mean = Sum of all data values/number of data values
Example
Let’s look at an example:
Salaries of 10 employees: $25,000, $26,936, $53,423, $109,876, $87,402, $92,653, $28,004, $17,656, $75,200, $21,000
Mean = ($25,000 + $26,936 + $53,423 + $109,876 + $87,402 + $92,653 + $28,004 + $17,656 + $75,200 + $21,000) / 10
= $537,150/10 = $53,715
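The same arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the variable names `salaries`, `total` and `mean` are chosen here for illustration, and the figures are the ten salaries listed above.

```python
# Salaries of the 10 employees from the example above, in dollars.
salaries = [25_000, 26_936, 53_423, 109_876, 87_402,
            92_653, 28_004, 17_656, 75_200, 21_000]

total = sum(salaries)          # sum of all data values
mean = total / len(salaries)   # divide by the number of data values

print(f"Total: ${total:,}")    # Total: $537,150
print(f"Mean:  ${mean:,.0f}")  # Mean:  $53,715
```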
Outliers
Let’s add another value to the salary dataset – a CEO’s salary.
$25,000, $26,936, $53,423, $109,876, $87,402, $92,653, $28,004, $17,656, $75,200, $21,000, $19,628,585
Mean = $20,165,735/11 ≈ $1,833,249
The CEO’s salary is an outlier: a value that lies far from the rest of the data, in this case markedly higher than the other salaries. In our example, the mean without the outlier is $53,715 and the mean with the outlier is $1,833,249. A single outlier (high or low) can dramatically alter the mean. Hence the mean is called a non-resistant measure of centre.
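To see how strongly one value pulls the mean, here is a small sketch using `mean` from Python's standard `statistics` module. The names `ceo_salary`, `mean_without` and `mean_with` are made up for this illustration.

```python
from statistics import mean

salaries = [25_000, 26_936, 53_423, 109_876, 87_402,
            92_653, 28_004, 17_656, 75_200, 21_000]
ceo_salary = 19_628_585  # the outlier from the example above

mean_without = mean(salaries)               # mean of the 10 employees
mean_with = mean(salaries + [ceo_salary])   # mean once the CEO is added

print(f"Mean without CEO: ${mean_without:,.0f}")  # Mean without CEO: $53,715
print(f"Mean with CEO:    ${mean_with:,.0f}")     # Mean with CEO:    $1,833,249
```

One extra value out of eleven makes the mean roughly 34 times larger, even though none of the original ten salaries changed.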
Median
The median is the middle value when the values in the dataset are arranged in ascending or descending order.
How to calculate the median?
If the number of values in the dataset is odd, the median is the value located at the exact middle of the sorted data. If the number of values is even, the median is the mean of the two middle values in the sorted data.
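The odd/even rule can be written out as a short function. This is a sketch for illustration only; `median_of` is a hypothetical name, and in practice Python's built-in `statistics.median` does the same job.

```python
def median_of(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)   # arrange the values in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the value at the exact middle of the sorted data.
        return ordered[middle]
    # Even count: the mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_of([3, 1, 2]))     # 2
print(median_of([4, 1, 3, 2]))  # 2.5
```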
Example
Salaries of 10 employees: $25,000, $26,936, $53,423, $109,876, $87,402, $92,653, $28,004, $17,656, $75,200, $21,000
Sorted: $17,656, $21,000, $25,000, $26,936, $28,004, $53,423, $75,200, $87,402, $92,653, $109,876
Median = ($28,004 + $53,423) / 2 = $40,713.50, or about $40,714
Now let’s add the CEO’s salary to the list.
$17,656, $21,000, $25,000, $26,936, $28,004, $53,423, $75,200, $87,402, $92,653, $109,876, $19,628,585
Median = $53,423
The median without the outlier is about $40,714 and the median with the outlier is $53,423. Adding the outlier moved the median by only one position in the sorted list, a far smaller change than the one we saw in the mean. Hence the median is called a resistant measure of centre.
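The contrast between the two measures can be checked side by side. The sketch below uses `mean` and `median` from Python's standard `statistics` module; the names `with_ceo`, `mean_shift` and `median_shift` are invented for this example.

```python
from statistics import mean, median

salaries = [25_000, 26_936, 53_423, 109_876, 87_402,
            92_653, 28_004, 17_656, 75_200, 21_000]
with_ceo = salaries + [19_628_585]  # add the CEO's salary (the outlier)

median_without = median(salaries)   # mean of the two middle values
median_with = median(with_ceo)      # the single middle value of 11

# How far each measure moves when the outlier is added.
mean_shift = mean(with_ceo) - mean(salaries)
median_shift = median_with - median_without

print(f"Median without CEO: ${median_without:,.2f}")  # $40,713.50
print(f"Median with CEO:    ${median_with:,.2f}")     # $53,423.00
print(f"Shift in mean:      ${mean_shift:,.0f}")      # $1,779,534
print(f"Shift in median:    ${median_shift:,.2f}")    # $12,709.50
```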
Mode
The mode of a dataset is its most frequently occurring value.
Here are some common types of modes in a dataset:
- Bimodal: Two values occur with the same greatest frequency.
- Multimodal: More than 2 values occur with the same greatest frequency.
- No mode: No value is repeated in the dataset.
The mode is the only measure of centre that can be used with nominal data, such as categories or labels.
Example
Let’s look at the same salary dataset again.
Salaries of 10 employees: $25,000, $26,936, $53,423, $109,876, $87,402, $92,653, $28,004, $17,656, $75,200, $21,000
None of the salaries repeats here, so there is no mode.
Ages of 10 employees: 22, 23, 27, 25, 27, 30, 29, 22, 32, 26 years
22 and 27 both appear twice in the dataset, and no other age repeats. So there are 2 modes, 22 and 27 years, and the dataset is bimodal.
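Both examples can be reproduced by counting how often each value occurs. The sketch below uses `Counter` from Python's standard `collections` module. The helper `modes` is a hypothetical function written for this post, and it follows the convention used above that a dataset with no repeated values has no mode.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)            # value -> number of occurrences
    highest = max(counts.values())
    if highest == 1:
        return []                       # no value repeats: no mode
    return [value for value, count in counts.items() if count == highest]

salaries = [25_000, 26_936, 53_423, 109_876, 87_402,
            92_653, 28_004, 17_656, 75_200, 21_000]
ages = [22, 23, 27, 25, 27, 30, 29, 22, 32, 26]

print(modes(salaries))  # []
print(modes(ages))      # [22, 27]
```

Because the helper only counts occurrences and never adds or sorts the values, it should work just as well on nominal data such as a list of text labels.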