
Net stacked distribution graphs are a nice way of comparing data on a Likert scale (i.e. when respondents are asked whether they "Strongly Disagree", "Disagree", etc. with a statement). They strip out the neutral responses and center the remaining responses around the middle of the graph so you can quickly compare agreement and disagreement on different issues. Here we'll learn how to do this in ggplot2 -- it takes a dose of deviousness.
Jason Becker provides some code for doing this -- I've taken his basic idea and made it more readable and flexible, including being able to map multiple questions at the same time.
net_stacked(), code below, takes a single argument, x, which is a data.frame where each column is an ordered factor containing Likert-style responses. The factor levels must be ordered from the "most negative" possible response (e.g. "Strongly Disagree") to "most positive" (e.g. "Strongly Agree"). If there is an odd number of possible responses/levels, such as in a 5 or 7 point Likert scale, net_stacked chops out the central level (assumed to be "Neutral", "Neither Agree nor Disagree", or similar).
All the columns of the data.frame need to have the same levels. The function can actually accept a list where the factor elements have different lengths, as well. NAs are omitted from each column before plotting.
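To make the level-splitting rule concrete, here is a small sketch of how the levels fall into negative, neutral, and positive groups. The helper name split_levels is made up for illustration and is not part of the function below; it only mirrors the rule described above.

```r
## Hypothetical helper (not part of net_stacked itself): split a vector of
## ordered levels into negative / neutral / positive groups.
split_levels <- function(all_levels) {
  n <- length(all_levels)
  ## An odd number of levels means the middle one is treated as neutral
  neutral <- if (n %% 2 == 1) all_levels[ceiling(n / 2)] else NULL
  negatives <- all_levels[seq_len(floor(n / 2))]
  positives <- setdiff(all_levels, c(negatives, neutral))
  list(negatives = negatives, neutral = neutral, positives = positives)
}

five_point <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")
four_point <- c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree")

split_levels(five_point)   # middle level ends up in $neutral
split_levels(four_point)   # no neutral level; two negatives, two positives
```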
How do we actually accomplish this effect in ggplot2? Here's the full text of the function:
net_stacked <- function(x) {

  ## x: a data.frame or list, where each column is an ordered factor with the same levels
  ## lower levels are presumed to be "negative" responses; middle value presumed to be neutral
  ## returns a ggplot2 object of a net stacked distribution plot

  ## Test that all elements of x have the same levels, are ordered, etc.
  all_levels <- levels(x[[1]])
  n <- length(all_levels)
  levelscheck <- all(sapply(x, function(y)
    all(c(is.ordered(y), levels(y) == all_levels))
  ))
  if(!levelscheck)
    stop("All levels of x must be ordered factors with the same levels")

  ## Reverse order of columns (to make ggplot2 output look right after coord_flip)
  x <- x[length(x):1]

  ## Identify middle and "negative" levels
  if(n %% 2 == 1)
    neutral <- all_levels[ceiling(n/2)]
  else
    neutral <- NULL

  negatives <- all_levels[1:floor(n/2)]
  positives <- setdiff(all_levels, c(negatives, neutral))

  ## remove neutral, summarize as proportion
  listall <- lapply(names(x), function(y) {
    column <- (na.omit(x[[y]]))
    out <- data.frame(Question = y, prop.table(table(column)))
    names(out) <- c("Question", "Response", "Freq")

    if(!is.null(neutral))
      out <- out[out$Response != neutral,]

    out
  })

  dfall <- do.call(rbind, listall)

  ## split by positive/negative
  pos <- dfall[dfall$Response %in% positives,]
  neg <- dfall[dfall$Response %in% negatives,]

  ## Negate the frequencies of negative responses, reverse order
  neg$Freq <- -neg$Freq
  neg$Response <- ordered(neg$Response, levels = rev(levels(neg$Response)))

  stackedchart <- ggplot() +
    aes(Question, Freq, fill = Response, order = Response) +
    geom_bar(data = neg, stat = "identity") +
    geom_bar(data = pos, stat = "identity") +
    geom_hline(yintercept=0) +
    scale_y_continuous(name = "",
                       labels = paste0(seq(-100, 100, 20), "%"),
                       limits = c(-1, 1),
                       breaks = seq(-1, 1, .2)) +
    scale_fill_discrete(limits = c(negatives, positives)) +
    coord_flip()

  stackedchart
}
Once we have the function, here's the code for the image above:
require(ggplot2)

## generate fake likert data
set.seed(200)
response_scale <- c("Strongly Disagree", "Disagree", "Neither Agree or Disagree", "Agree", "Strongly Agree")
x <- replicate(5, ordered(sample(response_scale, 20, replace = TRUE), levels = response_scale), simplify = F)
x <- as.data.frame(x)
names(x) <- paste0("Q", 1:5)

## plot it as net stacked distribution
net_stacked(x)
This gives a warning, since ggplot2 really isn't sure why we're stacking negative numbers. But that is, in fact, what we're intending to do here: embrace the devious!
Jason Becker's post provides some colors to heuristically represent the intensity of feelings. We can add these, and any other customizations, to the ggplot object returned by our function in the usual ways.
Most of the function is simply preparing and summarizing the data in the form of proportions for each level of the ordered factor, applied to each column of the data.frame; but notice that we separate the "positive" (more-agreeing) and "negative" (more-disagreeing) levels into two separate objects:
## split by positive/negative
pos <- dfall[dfall$Response %in% positives,]
neg <- dfall[dfall$Response %in% negatives,]
And then we make the frequencies negative because we want them to actually show up on the negative side of 0 in our plot:
neg$Freq <- -neg$Freq
And we reorder the levels in reverse, because we want them oriented so that the "most neutral" responses are stacked first on top of zero in the negative direction, followed by progressively "more negative" responses:
neg$Response <- ordered(neg$Response, levels = rev(levels(neg$Response)))
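To see these preparation steps on their own, here is a small base R sketch that walks one question through them: summarize as proportions, drop the neutral level, split, negate, and reverse. The answers are made up, and the scale uses "Neutral" as its middle level purely for illustration.

```r
## Toy responses for a single question (made-up data).
response_scale <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")
answers <- ordered(c("Disagree", "Agree", "Agree", "Strongly Disagree", "Neutral",
                     "Strongly Agree", "Agree", "Disagree", "Neutral", "Agree"),
                   levels = response_scale)

## Proportion of each response, then drop the neutral level.
out <- as.data.frame(prop.table(table(Response = answers)))
out <- out[out$Response != "Neutral", ]

## Split into the two halves.
neg <- out[out$Response %in% c("Strongly Disagree", "Disagree"), ]
pos <- out[out$Response %in% c("Agree", "Strongly Agree"), ]

## Negate the negative side and reverse its level order.
neg$Freq <- -neg$Freq
neg$Response <- ordered(neg$Response, levels = rev(levels(neg$Response)))

neg
pos
```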
And here's where we bring that home -- in the plot command, we actually have two different layers. One represents the positive half, and one the negative half, which are drawing on these separate datasets. We need to separate them, otherwise ggplot2 will get confused stacking positives and negatives together.
geom_bar(data = neg, stat = "identity") +
geom_bar(data = pos, stat = "identity") +
This is the clever trick that Jason Becker does that makes this whole thing possible!
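Stripped of the data preparation, the two-layer idea can be sketched on a tiny hand-made data set. The proportions below are invented, and the negative side is assumed to be negated already; this is an illustration of the layering, not a replacement for the function.

```r
library(ggplot2)

## Made-up proportions for two questions; negative side already negated.
neg <- data.frame(Question = c("Q1", "Q1", "Q2", "Q2"),
                  Response = c("Disagree", "Strongly Disagree",
                               "Disagree", "Strongly Disagree"),
                  Freq = c(-0.20, -0.10, -0.15, -0.25))
pos <- data.frame(Question = c("Q1", "Q1", "Q2", "Q2"),
                  Response = c("Agree", "Strongly Agree",
                               "Agree", "Strongly Agree"),
                  Freq = c(0.30, 0.20, 0.25, 0.10))

## One bar layer per half, each drawing on its own data set.
p <- ggplot() +
  aes(Question, Freq, fill = Response) +
  geom_bar(data = neg, stat = "identity") +
  geom_bar(data = pos, stat = "identity") +
  geom_hline(yintercept = 0) +
  coord_flip()

p
```

Each layer stacks only its own half, so the negative bars grow leftwards from zero and the positive bars grow rightwards.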
Also, notice that in specifying the mapping, we explicitly tell ggplot2 to order the levels by Response (a column containing the text of each Likert-type response in an ordered factor):
aes(Question, Freq, fill = Response, order = Response)
This is important because the negative side won't be in the right order if we don't do this explicitly and AFTER reversing the order of the negative levels to fan out away from zero.
Then we flip to make the whole thing horizontal with coord_flip(). coord_flip() makes later columns in the data appear on top, which isn't what we want here, which is why earlier in the function I simply reverse the order of the elements in the input data:
x <- x[length(x):1]
Happy net stacked distribution visualizing!
Jason
September 14, 2012
Hi Ethan! Super excited to see this post. This is exactly why I put up my code -- so others could run with it. There are a few things you do here that I really like, and that I had actually already implemented in my own code but removed in an attempt to be more neutral to scale.
For starters, in my actual production code I also separate out the positive and negative responses. In my code, I have a parameter called "scaleName" that allows me to switch between all of the scales that are available in my survey data. This includes Strongly Disagree to Strongly Agree (scaleName=='sdsa'), Never -> Always (scaleName=='sdsa') and even simple yes/no (scaleName=='ny'). This is not ideal because it does require 1) knowing all possible scales and including some work in the function to treat them differently 2) including an additional parameter. However, because I use this work to analyze just a few surveys, the upfront work of including this as a parameter has made this very flexible in dealing with multiple scales. As a result, I do not need to require that the columns are ordered in any particular way, just that the titles match existing scales. So I have a long set of if elseif statements that look something like this:
if(scaleName=='sdsa'){
  scale <- c('Strongly Disagree','Disagree','Agree','Strongly Agree')
  pos <- c('Agree','Strongly Agree')
  neg <- c('Strongly Disagree','Disagree')
}
This is actually really helpful for producing negative values and including some scales in my function which do not have values that are negative (so that it can be used for general stacked charts instead of just net-stacked):
if(length(neg)>0){
  quest[,names(quest) %in% c(neg)] <- -(quest[,names(quest) %in% c(neg)])
}
(Recall that quest is what I call the dataframe and is equivalent to x in your code)
Another neat trick that I have instituted is having dynamic x-axis limits rather than always going from -100 to 100. I generally like to keep my scales representing the full logical range of data (0 - 100 for percentages, etc) so I might consider this a manipulation. However, after getting many charts with stubby centers, I found I was not really seeing sufficient variation by sticking to my -100 to 100 setup. So I added this:
pos_lims <- max(c(sum(subset(quest,select=c(which(quest[1,-1]>=0)+1))[1,]),
                  sum(subset(quest,select=c(which(quest[2,-1]>=0)+1))[2,])))
neg_lims <- max(abs(c(sum(subset(quest,select=c(which(quest[1,-1]<=0)+1))[1,]),
                      sum(subset(quest,select=c(which(quest[2,-1]<=0)+1))[2,]))))
x_axis_lims <- max(pos_lims,neg_lims)
This finds the value furthest from 0 in either direction across the data frame. (I have to admit, this code looks a bit like magic reading it back.) My comments are actually quite helpful:
# pos_lims and neg_lims subset each row of the data based on sign, then
# sums the values that remain (getting the total positive or negative
# percentage for each row). Then, the max of the rows is saved as a candidate
# for the magnitude of the axis.
To make this more generalizable (my production code always compares two bars at once), it would be fairly trivial to loop over all the rows (or use the apply functions, which I'm still trying to get the hang of).
I then pad the x_axis_lims value by some percent inside the limits attribute.
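As a rough sketch of that generalization, the same idea can be written with rowSums() so it works for any number of rows. It assumes the signed proportions sit in a plain numeric matrix with one row per question, which is a simplification of the quest data frame above, and the 5% padding is an arbitrary choice.

```r
## Made-up signed proportions: one row per question, negatives already negated.
quest <- rbind(Q1 = c(-0.10, -0.20, 0.30, 0.20),
               Q2 = c(-0.25, -0.15, 0.25, 0.10),
               Q3 = c(-0.05, -0.10, 0.40, 0.35))

## Total positive and total negative share for every row at once.
pos_totals <- rowSums(pmax(quest, 0))
neg_totals <- rowSums(pmin(quest, 0))

## The value furthest from 0 in either direction, across all rows.
x_axis_lims <- max(pos_totals, abs(neg_totals))

## Hypothetical padding of 5% on each side.
padding <- 0.05
limits <- c(-1, 1) * x_axis_lims * (1 + padding)
limits
```

The resulting vector could then be handed to the limits argument of scale_y_continuous().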
In my production code I also have the scale_fill_manual attribute added separately to the ggplot object. Rather than adding it after the fact at the point of rendering, though, I include it in my function, again set by scaleName. That said, I think the best organization is probably to have a separate function that makes it easy to select the color scheme you want and apply it, so that your final call could be something like:
colorNetStacked(net_stacked(x), 'blues')
My actual final return looks like this:
return(stackedchart + scale_fill_manual(limits=scale,values=colors) +
coord_flip())
Where colors is set by a line like:
colors <- brewer.pal(name='Blues',n=7)[3:7]
Seriously though, I am super excited you found my post and thought it was useful and improved what I presented!
Ethan Brown
September 14, 2012
Thanks for the feedback! This is easily the most enthusiastic and detailed comment I've ever received!
I think the differences in our versions of this function are largely a matter of philosophy and audience -- you're writing a function that works well for you to generate a large amount of visualizations; and you know a lot of the kinds of data that are coming in.
It probably solves the problems of your situation far better than mine, which is intended to be general and is primarily written to be clear, readable, and simple to an audience of unknown R users. This is the reason I chose to require that the columns be ordered factors -- that is, in fact, the way this kind of data should be stored in R, that's precisely why the data type exists. The ease of describing this requirement, to me, outweighs the pain of complying with it when I'm creating a function for other people to use, as opposed to an unknown user trying to figure out how to specify their scale (or the scale actually not being included in your options and having to figure out how to add it). I'm still not totally happy with this mechanism, because I think it might raise issues if the positive/negative responses aren't split down the middle (for instance if there were 2 negative, 3 positive, and no neutral).
I like your idea of the dynamic axis labels -- is this better than what ggplot2 produces by default if you specify no plot limits? Speaking of plot limits -- I believe the argument limits = percent for scale_y_continuous() no longer works in recent versions of ggplot2 -- I couldn't get that section of your code (either in the original post or in your new version) to work in ggplot2 0.9.2.
You definitely win in the colors department. I didn't know how to make a good general functionality for changing colors, so I just stuck with ggplot2's headache-inducing (for color-deficit eyes like mine) scheme.
Cheers!
Ethan
Jason
September 18, 2012
Hi Ethan:
The limits= percent requires the scales package. It's a convenient way to label percents without doing the paste work on your limits, but will require that your percents are stored between 0 and 1. It always surprises me how often data representing percents are stored as between 0 and 100.
As to whether my ranges are better than ggplot2's defaults -- I'm not really sure. I think for me the consistency of setting it myself (and control, even if it is a baked in formula) makes me prefer this method. I actually have several other visualization functions for non-Likert data where I use this scale-setting technique and I often find myself playing around with the padding. It's hard to balance my desire not to be manipulative in what I present against the risk of obscuring interesting information.
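For anyone trying the percent labelling discussed above, here is a minimal sketch. It assumes the formatter is passed to the labels argument of scale_y_continuous(), which is where the scales package's percent() is normally used, rather than to limits as written in the comments. The data are made up.

```r
library(ggplot2)
library(scales)

## Made-up proportions stored between 0 and 1, as percent() expects.
d <- data.frame(Question = c("Q1", "Q2"), Freq = c(0.45, 0.80))

## Assumption: percent goes to `labels`; `limits` stays a numeric range.
p <- ggplot(d, aes(Question, Freq)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = percent, limits = c(0, 1))

p
```

Values stored between 0 and 100 would need dividing by 100 first.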
Jose Iparraguirre
October 2, 2012
Hi Ethan
What if we have a different number of observations per question (say, 20 for Q1 but 25 for Q2) and we still want to produce a graph such as the one above?
Thanks,
José
Ethan Brown
October 3, 2012
Hi José,
My function above can handle this easily -- just pass in a list instead of a data.frame, where each element of the list is an ordered factor with the same levels (they don't actually have to be of the same length).
Here's an example of a list where each question has a different number of observations:
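A minimal sketch of such a list, reusing the response scale from the post. The question names and sample sizes are made up, and the final call only runs if net_stacked() from the post has already been defined.

```r
set.seed(200)
response_scale <- c("Strongly Disagree", "Disagree", "Neither Agree or Disagree",
                    "Agree", "Strongly Agree")

## Made-up sample sizes: 20 answers for Q1, 25 for Q2, 15 for Q3.
n_obs <- c(Q1 = 20, Q2 = 25, Q3 = 15)

## One ordered factor per question, all sharing the same levels.
x_list <- lapply(n_obs, function(n) {
  ordered(sample(response_scale, n, replace = TRUE), levels = response_scale)
})

lengths(x_list)   # the elements have different lengths

## With net_stacked() from the post defined, the list is passed in directly.
if (exists("net_stacked")) net_stacked(x_list)
```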
Hope this helps, and feel free to write back if this isn't clear or doesn't work for you!
Best,
Ethan
Jose Iparraguirre
October 4, 2012
Thanks, Ethan.
It works fine.
José
Andrew
November 29, 2012
I am having trouble getting my data into ordered factor levels. Do you have a guide to that? My data is a 5 point Likert Scale with the following possible values:
Strongly Disagree, Somewhat Disagree, Neutral , Somewhat Agree, Strongly Agree
Any help would be welcome, as this is a really useful function.
Ethan Brown
December 2, 2012
Not sure where you're getting caught, but I'm happy to help. Could you give more detail on what code you've tried and what's not working for you?
I'd also recommend checking out ?ordered, if you haven't already.
Best,
Ethan
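For the scale mentioned in the question above, a minimal sketch of turning plain text responses into an ordered factor might look like this. The raw answers are made up, and the trimws() step is only an assumption about a common cause of trouble (stray spaces in the text), not a diagnosis of the actual problem.

```r
## Made-up raw answers stored as plain text, including a stray space and an NA.
raw <- c("Somewhat Agree", "Neutral ", "Strongly Disagree", "Somewhat Agree",
         "Strongly Agree", "Somewhat Disagree", NA)

## Levels listed from most negative to most positive.
likert_levels <- c("Strongly Disagree", "Somewhat Disagree", "Neutral",
                   "Somewhat Agree", "Strongly Agree")

## trimws() removes surrounding whitespace so "Neutral " matches "Neutral".
q1 <- factor(trimws(raw), levels = likert_levels, ordered = TRUE)

is.ordered(q1)
levels(q1)
table(q1, useNA = "ifany")
```

Any response whose text does not exactly match one of the levels becomes NA, so the table() call is a quick way to spot misspelled or unexpected values.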