Basic statistics for exploring data: Measures of Centre

The famous statistician John W. Tukey created a branch of data analysis called Exploratory Data Analysis.

In exploratory data analysis we get familiar with the data, ask questions, visualize the data in a number of forms, and look for relationships between variables, outliers, patterns and trends. One way to start is with measures of centre. We will find that interesting stories may arise from both the norm and the exceptions in data.

A measure of centre is a value at the centre or middle of a dataset. Three measures of centre are commonly used in exploratory data analysis: the mean, the median and the mode.

Mean

The mean, also known as the average, is the measure of centre found by adding the data values and dividing the total by the number of values.

How to calculate the mean?

Mean = (sum of all data values) / (number of data values)

Example

Let’s look at an example.

Salaries of 10 employees: $25,000, $26,936, $53,423, $109,876, $87,402, $92,653, $28,004, $17,656, $75,200, $21,000

Mean = ($25,000 + $26,936 + $53,423 + $109,876 + $87,402 + $92,653 + $28,004 + $17,656 + $75,200 + $21,000) / 10

= $537,150/10 = $53,715
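The same calculation can be sketched in Python. This is a minimal illustration, not part of the original post: the variable names are made up for the example, and `statistics.mean` from the standard library is used only to cross-check the manual formula.

```python
from statistics import mean

# Salaries of the 10 employees from the example, in dollars.
salaries = [25_000, 26_936, 53_423, 109_876, 87_402,
            92_653, 28_004, 17_656, 75_200, 21_000]

# Mean by the formula: sum of all values divided by the number of values.
total = sum(salaries)
manual_mean = total / len(salaries)

# The standard library computes the same quantity.
library_mean = mean(salaries)

print(f"Sum:  ${total:,}")
print(f"Mean: ${manual_mean:,.0f}")
```

Both approaches give $53,715 for this dataset.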

Outliers

Let’s add another value to the salary dataset – a CEO’s salary.

$25,000, $26,936, $53,423, $109,876, $87,402, $92,653, $28,004, $17,656, $75,200, $21,000, $19,628,585

Mean = $20,165,735 / 11 ≈ $1,833,249

The CEO’s salary is an outlier, meaning it is markedly higher than the other salaries in the dataset. In our example, the mean without the outlier is $53,715 and the mean with the outlier is $1,833,249. A single outlier (high or low) can dramatically alter the mean. Hence the mean is called a non-resistant measure of centre.
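To see the effect of the outlier on the mean, the comparison can be reproduced with a short Python sketch. The names `ceo_salary`, `mean_without` and `mean_with` are chosen for this illustration only.

```python
from statistics import mean

# Salaries of the 10 employees from the example, in dollars.
salaries = [25_000, 26_936, 53_423, 109_876, 87_402,
            92_653, 28_004, 17_656, 75_200, 21_000]
ceo_salary = 19_628_585  # the outlier added to the dataset

mean_without = mean(salaries)
mean_with = mean(salaries + [ceo_salary])

print(f"Mean without the outlier: ${mean_without:,.0f}")
print(f"Mean with the outlier:    ${mean_with:,.0f}")
```

One extra value out of eleven is enough to move the mean from about $54 thousand to about $1.8 million.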

Median

The median is the middle value when the values in the dataset are arranged in ascending or descending order.

How to calculate the median?

If the number of values in the dataset is odd, the median is the value located at the exact middle of the sorted data. If the number of values is even, the median is the mean of the two middle values in the sorted data.

Example

Salaries of 10 employees: $25,000, $26,936, $53,423, $109,876, $87,402, $92,653, $28,004, $17,656, $75,200, $21,000

Sorted: $17,656, $21,000, $25,000, $26,936, $28,004, $53,423, $75,200, $87,402, $92,653, $109,876

Median = ($28,004 + $53,423) / 2 = $40,713.50 ≈ $40,714

Now let’s add the CEO’s salary to the list.

$17,656, $21,000, $25,000, $26,936, $28,004, $53,423, $75,200, $87,402, $92,653, $109,876, $19,628,585

Median = $53,423 (the 6th of the 11 sorted values)

The median without the outlier is $40,714 and the median with the outlier is $53,423. The median did not change by a large amount with the addition of an outlier. Hence the median is called a resistant measure of centre.
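The median calculation, with and without the CEO’s salary, can be sketched in Python as follows. This is an illustration rather than the original author’s code; it relies on `statistics.median`, which sorts the data and averages the two middle values when the count is even.

```python
from statistics import median

# Salaries of the 10 employees from the example, in dollars.
salaries = [25_000, 26_936, 53_423, 109_876, 87_402,
            92_653, 28_004, 17_656, 75_200, 21_000]
ceo_salary = 19_628_585  # the outlier added to the dataset

# Even count (10 values): mean of the 5th and 6th sorted values.
middle_pair = sorted(salaries)[4:6]
median_without = median(salaries)

# Odd count (11 values): the 6th sorted value.
median_with = median(salaries + [ceo_salary])

print(f"Middle two values:          {middle_pair}")
print(f"Median without the outlier: ${median_without:,.2f}")
print(f"Median with the outlier:    ${median_with:,.2f}")
```

The exact median of the 10 salaries is $40,713.50, which rounds to the $40,714 quoted in the text.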

Mode

The mode of a dataset is its most frequently occurring value.

A dataset may have more than one mode, or none at all:

  • Bimodal: Two values occur with the same greatest frequency.
  • Multimodal: More than 2 values occur with the same greatest frequency.
  • No mode: No values are repeated in the dataset.

The mode is the only measure of centre that can be used with nominal (categorical) values, such as names or labels.

Example

Let’s look at the same salary dataset again.

Salaries of 10 employees: $25,000, $26,936, $53,423, $109,876, $87,402, $92,653, $28,004, $17,656, $75,200, $21,000

None of the salaries repeat here, so there is no mode.

Ages of 10 employees: 22, 23, 27, 25, 27, 30, 29, 22, 32, 26 years

22 and 27 each appear twice, more often than any other age. So the dataset is bimodal, with 2 modes: 22 and 27 years.
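Finding the mode can be sketched in Python with `collections.Counter`. The helper `modes` below is a hypothetical function written for this illustration; it follows the convention used in this post that a dataset in which no value repeats has no mode. The `departments` list is made-up data, included only to show that the mode also works with nominal values.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # nothing repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

# Data from the examples in the text.
salaries = [25_000, 26_936, 53_423, 109_876, 87_402,
            92_653, 28_004, 17_656, 75_200, 21_000]
ages = [22, 23, 27, 25, 27, 30, 29, 22, 32, 26]

# Hypothetical nominal data, not taken from the original post.
departments = ["Sales", "IT", "Sales", "HR", "IT", "Sales"]

print(modes(salaries))     # [] (no mode)
print(modes(ages))         # [22, 27] (bimodal)
print(modes(departments))  # ['Sales']
```

Note that the standard library’s `statistics.multimode` treats the no-repeat case differently: it returns every value, since all of them share the highest frequency of one.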
