Data Analysis and Visualization: 1a. Female corporate talent pipeline


In this first post on Data Analysis and Visualization, we start with a simple dataset. The problem statement is to analyse this dataset and communicate a message from it. We are answering the question: what does this data tell us?

In this post, we will start with exploratory analysis and then move on to explanatory analysis.

Check out this post to learn more about exploratory vs. explanatory analysis.

The end result of this exercise will be a visualization of this dataset that can be shared with an audience.

Tools used

  • RStudio for exploratory analysis
  • Tableau/Excel for explanatory analysis
  • Data format: XLSX/CSV


The dataset for our first exercise is about the female corporate talent pipeline.

You can download the data here.


Exploratory Analysis Using R

Though this is a simple dataset and we could analyse it in Excel, we will use R to get some practice.

Preparing Data for Analysis

Fire up RStudio and import the dataset using this command. In R versions before 4.0, read.csv() reads strings as factors by default; pass stringsAsFactors = FALSE to prevent this.

corp_talent <- read.csv("Female Corporate Talent Pipeline.csv",
                        stringsAsFactors = FALSE)
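To see what stringsAsFactors changes, here is a minimal sketch that reads a small inline CSV instead of the real file. The column names and values come from the structure output shown in this post; the exact layout of the CSV file itself is an assumption.

```r
# Inline CSV mimicking the first rows of the dataset (file layout is assumed)
csv_text <- "Year,Level,Female,Male
2012,Entry Level,42%,58%
2015,Entry Level,45%,55%
2012,Manager,33%,67%
2015,Manager,37%,63%"

# Same data read twice, once with each setting
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$Level)  # "character"
class(as_fct$Level)  # "factor"
```

Reading the columns as plain strings first keeps the later clean-up of the percentage columns simple, and we can still convert selected columns to factors by hand.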

Next, let's look at the variables in the data.

str(corp_talent)

This gives us:

'data.frame': 12 obs. of  4 variables:
 $ Year  : int  2012 2015 2012 2015 2012 2015 2012 2015 2012 2015 ...
 $ Level : chr  "Entry Level" "Entry Level" "Manager" "Manager" ...
 $ Female: chr  "42%" "45%" "33%" "37%" ...
 $ Male  : chr  "58%" "55%" "67%" "63%" ...

Our data has 4 variables: Year, which is 2012 or 2015; Level, which describes the role; and Female and Male, which give the percentage of employees of each gender at that level.

Since Year and Level are variables with pre-defined values, we will convert them to factors.

corp_talent$Year <- as.factor(corp_talent$Year)
corp_talent$Level <- as.factor(corp_talent$Level)

The Female and Male variables hold percentages stored as text. Let's convert them to numeric values with decimals.

corp_talent$Female <- as.numeric(sub("%", "e-2", corp_talent$Female))
corp_talent$Male <- as.numeric(sub("%", "e-2", corp_talent$Male))
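The conversion above works because replacing "%" with "e-2" turns each value into scientific notation, which as.numeric() understands. A small sketch on a standalone vector (the vector name pct is only for illustration):

```r
# Percentages as text, taken from the Female column shown earlier
pct <- c("42%", "45%", "33%", "37%")

# "42%" becomes "42e-2", which is 42 * 10^-2
sci <- sub("%", "e-2", pct)
print(sci)

# as.numeric() parses scientific notation directly
frac <- as.numeric(sci)
print(frac)

# An equivalent, more explicit alternative: strip the sign, then divide by 100
frac_alt <- as.numeric(sub("%", "", pct, fixed = TRUE)) / 100
```

Both approaches should give the same fractions; the second one may be easier to read for someone new to the trick.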

In our data, the columns Female and Male are in fact values of a single variable, Gender. This is where the tidyr package comes in handy. We will gather the Female and Male columns into a key column called Gender. The values in the Female and Male columns will go into a new column called Percent.

library(tidyr)

corp_talent <- gather(corp_talent, "Female", "Male",
                      key = "Gender", value = "Percent")
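In current tidyr releases, gather() is superseded by pivot_longer(). As a hedged alternative, the same reshaping can be sketched on a two-row toy data frame (wide and long are illustrative names, not part of the original analysis):

```r
library(tidyr)

# Two Entry Level rows, already converted to fractions
wide <- data.frame(
  Year   = c(2012, 2015),
  Level  = c("Entry Level", "Entry Level"),
  Female = c(0.42, 0.45),
  Male   = c(0.58, 0.55)
)

# Column names go to Gender, cell values go to Percent
long <- pivot_longer(wide, cols = c(Female, Male),
                     names_to = "Gender", values_to = "Percent")
print(long)
```

One difference to be aware of: pivot_longer() emits the Female and Male rows for each original row together, whereas gather() lists all Female rows first and then all Male rows. The plots in this post do not depend on row order.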

If we look at the structure of our data frame now with str(corp_talent):

'data.frame': 24 obs. of  4 variables:
 $ Year   : Factor w/ 2 levels "2012","2015": 1 2 1 2 1 2 1 2 1 2 ...
 $ Level  : Factor w/ 6 levels "C-Suite","Entry Level",..: 2 2 3 3 4 4 6 6 5 5 ...
 $ Gender : chr  "Female" "Female" "Female" "Female" ...
 $ Percent: num  0.42 0.45 0.33 0.37 0.28 0.32 0.23 0.27 0.2 0.23 ...

Great. Now we can start plotting this data.

Plotting for Data Analysis

Let’s start with a simple plot of Male vs. Female employees for 2012 and 2015.

library(ggplot2)

ggplot(data = corp_talent, aes(fill = Gender, x = Year, y = Percent)) +
  geom_bar(position = "fill", stat = "identity")

Here we map Year to the x-axis and Percent to the y-axis.

fill = Gender: This splits the bars by Gender.

position = "fill": This gives us a 100% stacked bar.

stat = "identity": This uses the Percent values as they are, instead of counting rows.
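To make the 100% stacking concrete, here is a minimal base-R sketch of the arithmetic behind position = "fill": the Percent values are summed per Gender within each Year, and each bar is then rescaled so it adds up to 1. The toy data frame uses only the Entry Level and Manager rows shown earlier, so the resulting shares are illustrative and will not match the full chart.

```r
# Toy subset: Entry Level and Manager only (values from the str() output)
toy <- data.frame(
  Year    = rep(c("2012", "2015"), times = 4),
  Level   = rep(c("Entry Level", "Entry Level", "Manager", "Manager"), times = 2),
  Gender  = rep(c("Female", "Male"), each = 4),
  Percent = c(0.42, 0.45, 0.33, 0.37, 0.58, 0.55, 0.67, 0.63)
)

# Sum Percent for every Year x Gender combination (one bar segment each)
totals <- tapply(toy$Percent, list(toy$Year, toy$Gender), sum)

# Rescale each row (each bar) so its segments add up to 1
shares <- prop.table(totals, margin = 1)
print(round(shares, 3))
```

Because every level contributes equally to the sum, the share shown by the bar is an unweighted average across levels, not a share of total headcount.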


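So female talent has increased from 2012 to 2015, but women still make up only about 27% of the talent pipeline. Note that this chart stacks the percentages of all six levels, so the figure is an unweighted average across levels rather than a share of total headcount.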

Next, let's see the male vs. female split across levels for 2012. To do this, we subset the data to 2012 and use Level on the x-axis.

ggplot(data = subset(corp_talent, Year=="2012"), aes(fill = Gender, x=Level, y= Percent)) + geom_bar(position = "fill", stat = "identity")


Here we can see that some levels have less female talent than others. By default the levels are sorted alphabetically, so let's order them on the x-axis from Entry Level to C-Suite. We can do that using:

corp_talent$Level <- factor(corp_talent$Level, levels = c("Entry Level", "Manager",
"Senior Manager / Director",
"Vice President",
"Senior Vice President",
"C-Suite"))

Now let’s plot again

ggplot(data = subset(corp_talent, Year=="2012"), aes(fill = Gender, x=Level, y= Percent)) + geom_bar(position = "fill", stat = "identity")
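The reordering step used before this plot can be illustrated in isolation. The sketch below uses a three-value vector (lvl is an illustrative name) to show how factor() orders levels by default and how the levels argument overrides that order, which is what ggplot2 follows on the x-axis.

```r
# Three of the six role names, in arbitrary order
lvl <- c("Manager", "Entry Level", "C-Suite")

# Default: levels are sorted alphabetically
f_default <- factor(lvl)
print(levels(f_default))

# Explicit levels: the order we give is the order used for plotting
f_ordered <- factor(lvl, levels = c("Entry Level", "Manager", "C-Suite"))
print(levels(f_ordered))
```

Only the level order changes; the underlying values stay the same, so no data is altered by this step.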

There is a pattern. The percentage of female talent decreases as seniority increases.

Let’s now plot the 2012 and 2015 charts side-by-side to see if there is a shift in 2015. To do this, we use grid.arrange() from the gridExtra package. It takes ggplot objects as input and arranges them in multiple rows or columns. In our case, we want the plots to appear in 2 columns, so we specify ncol = 2.

library(gridExtra)

plot_2012 <- ggplot(data = subset(corp_talent, Year == "2012"), aes(fill = Gender, x = Level, y = Percent)) +
  geom_bar(position = "fill", stat = "identity")

plot_2015 <- ggplot(data = subset(corp_talent, Year == "2015"), aes(fill = Gender, x = Level, y = Percent)) +
  geom_bar(position = "fill", stat = "identity")

grid.arrange(plot_2012, plot_2015, ncol = 2)
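Since the two plots differ only in the year, the repetition can be reduced with a small helper. The following is a self-contained sketch, not the code used for the charts in this post: plot_year() is a hypothetical helper, the toy data frame contains only the Entry Level and Manager rows shown earlier, and the title is added so the two panels can be told apart.

```r
library(ggplot2)
library(gridExtra)

# Toy subset: Entry Level and Manager only (values from the str() output)
toy <- data.frame(
  Year    = rep(c("2012", "2015"), times = 4),
  Level   = rep(c("Entry Level", "Entry Level", "Manager", "Manager"), times = 2),
  Gender  = rep(c("Female", "Male"), each = 4),
  Percent = c(0.42, 0.45, 0.33, 0.37, 0.58, 0.55, 0.67, 0.63)
)

# Hypothetical helper: one 100% stacked bar chart for a single year
plot_year <- function(data, year) {
  ggplot(data[data$Year == year, ],
         aes(fill = Gender, x = Level, y = Percent)) +
    geom_bar(position = "fill", stat = "identity") +
    ggtitle(year)
}

# Draw both charts in two columns; the arranged layout is also returned
combined <- grid.arrange(plot_year(toy, "2012"), plot_year(toy, "2015"), ncol = 2)
```

With more than two years, the same helper could be applied to each year in turn, which keeps the plotting code in one place.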


So here are our findings from this analysis:

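  1. Female and male talent are present in almost equal proportions at entry level (42% vs. 58% in 2012, 45% vs. 55% in 2015). With seniority though, the percentage of female talent decreases steadily.
  2. Between 2012 and 2015, the share of female talent rose across all levels, but this growth is slow: about 3 to 4 percentage points at the levels visible in the structure output above.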

Now we are ready to move on to explanatory analysis.


