Beautiful Plots: The Lollipop
The lollipop chart is great at visualizing differences in variables along a single axis. In this post, we create an elegant lollipop chart in Matplotlib to show the differences in model performance.
As a data scientist, I am often looking for ways to explain results. It’s always fun, then, when I discover a type of data visualization that I was not familiar with. Even better when the visualization solves a potential need! That’s what happened when I came across the Lollipop chart – also called a Cleveland plot, or dot plot.
I was looking to represent model performance after having trained several classification models through k-fold cross-validation. Essentially, I trained classical ML models to detect tool wear on a CNC machine. The data set was highly imbalanced, with very few examples of worn tools. This led to a large divergence in the results across the different k-folds. I wanted to represent this difference, and the lollipop chart was just the tool!
Here’s what the original plot looks like from my thesis. (by the way, you can be one of the few people to read my thesis, here. lol)
Not bad. Not bad. But this series is titled Beautiful Plots, so I figured I’d beautify it some more… to get this:
I like the plot. It’s easy on the eye and draws the viewer’s attention to the important parts first. In the following sections I’ll highlight some of the important parts of the above lollipop chart, show you how I built it in Matplotlib, and detail some of the sources of inspiration I found when creating the chart. Cheers!
Anatomy of the Plot
I took much of my inspiration for the above lollipop chart from the UC Business Analytics R Programming Guide, and specifically, this plot:
Here are some of the key features that were needed to build my lollipop chart:
Scatter Points
I used the standard Matplotlib ax.scatter function to plot the dots. Here’s a code snippet:
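Since the original snippet didn’t survive the page conversion, here’s a minimal sketch of the call — the data and styling values (sizes, colors) are my guesses, not the exact thesis code:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical stand-in for the cross-validation results
df = pd.DataFrame({"auc_avg": [0.26, 0.31, 0.31, 0.35, 0.39]})

fig, ax = plt.subplots()
ax.scatter(
    x=df["auc_avg"],    # score on the x-axis
    y=df.index,         # one row per model on the y-axis
    s=150,              # dot size
    color="#324465",    # a dark blue from the palette
    edgecolor="white",  # white edge separates the dot from the line behind it
    zorder=3,           # draw dots on top of the horizontal lines
)
```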
- The x and y values are inputs from the data, in a Pandas dataframe.
- A simple white “edge” around each dot adds a nice definition between the dot and the horizontal line.
- I think the color scheme is important – it shouldn’t be too jarring on the eye. I used a blue color scheme which I found on this seaborn plot. Here’s the code snippet to get the hex values.
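I can’t reconstruct the exact seaborn call, but pulling hex values out of a colormap can be sketched with matplotlib alone — the colormap name here is a stand-in, so it won’t reproduce the exact values shown below:

```python
import numpy as np
from matplotlib import colormaps
from matplotlib.colors import to_hex

# sample six evenly spaced colors from a blue colormap and convert them to hex
cmap = colormaps["Blues"]  # stand-in; the post's palette came from a seaborn plot
hex_colors = [to_hex(cmap(x)) for x in np.linspace(0.35, 0.95, 6)]
print(hex_colors)
```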
['#90c1c6', '#72a5b4', '#58849f', '#446485', '#324465', '#1f253f']
Horizontal Line
The grey horizontal line was implemented using the Matplotlib ax.hlines function.
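A minimal sketch of the hlines call, with hypothetical min/max scores standing in for the real data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# one grey line per model, spanning from its minimum score to its maximum score
ax.hlines(
    y=[0, 1, 2],              # one y position per model
    xmin=[0.22, 0.25, 0.21],  # hypothetical minimum scores
    xmax=[0.28, 0.36, 0.41],  # hypothetical maximum scores
    color="lightgrey",
    linewidth=3,
    zorder=0,                 # keep the lines at the back of the chart
)
```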
- The grey line should be at the “back” of the chart, so set the zorder to 0.
Leading Line
I like how the narrow “leading line” draws the viewer’s eye to the model label. Some white-space between the dots and the leading line is a nice aesthetic. To get that I had to forgo grid lines. Instead, each leading line is a line plot item.
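Each leading line can be drawn as its own thin line plot, stopping just short of the minimum-score dot to leave that white-space — a sketch with hypothetical values:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
min_scores = [0.22, 0.25, 0.21]  # hypothetical minimum score per model
gap = 0.02                       # white-space between the line and the first dot

for i, x_min in enumerate(min_scores):
    # thin grey line from the y-axis to just short of the minimum-score dot
    ax.plot([0, x_min - gap], [i, i], color="lightgrey", linewidth=0.5, zorder=0)
```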
Score Values
Placing the score, either the average, minimum, or maximum, at the dot makes it easy for the viewer to read the result. Generally, this is a must-do for any data visualization. Don’t make the reader go on a scavenger hunt trying to find what value the dot, or bar, or line, etc. corresponds to!
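One way to do this is with ax.annotate, offsetting the text a few points above each dot — a sketch with hypothetical scores:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
avg_scores = [0.26, 0.31, 0.39]  # hypothetical average scores

for i, score in enumerate(avg_scores):
    ax.scatter(score, i, color="#324465")
    # print the score a few points above its dot
    ax.annotate(f"{score:.2f}", (score, i), xytext=(0, 8),
                textcoords="offset points", ha="center", fontsize=8)
```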
Title
I found the title and chart description harder to get right than I would have thought! I wound up using Python’s textwrap module, which is in the standard library. You learn something new every day!
For example, here is the description for the chart:
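The description lives in a plain Python string — reconstructed here from the wrapped output shown further down; the variable name plt_desc comes from the post, while the width value is a guess:

```python
import textwrap

plt_desc = (
    "The top performing models in the feature engineering approach, as sorted by "
    "the precision-recall area-under-curve (PR-AUC) score. The average PR-AUC "
    "score for the k-folds-cross-validation is shown, along with the minimum and "
    "maximum scores in the cross-validation. The baseline of a naive/random "
    "classifier is demonstrated by a dotted line."
)

# fill() wraps at word boundaries so no line exceeds the given width
print(textwrap.fill(plt_desc, width=90))
```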
Feeding the plt_desc string into the textwrap.fill function produces a single string with \n newline markers inserted at word boundaries, so that no line exceeds the requested width:
'The top performing models in the feature engineering approach, as sorted by the precision-
\nrecall area-under-curve (PR-AUC) score. The average PR-AUC score for the k-folds-cross-
\nvalidation is shown, along with the minimum and maximum scores in the cross-validation.
\nThe baseline of a naive/random classifier is demonstrated by a dotted line.'
Putting it All Together
We have everything we need to make the lollipop chart. First, we’ll import the packages we need.
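The import list didn’t survive the conversion; it was likely along these lines (the exact set is my guess):

```python
import textwrap                  # wrap the long chart description

import matplotlib.pyplot as plt  # plotting
import numpy as np               # numeric helpers
import pandas as pd              # load and sort the cross-validation results
```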
We’ll load the cross-validation results from a csv.
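Something like the following — here I inline the csv contents (taken from the table below) so the snippet runs standalone; in the post this would be pd.read_csv on a file:

```python
import io
import pandas as pd

# stand-in for reading the results csv; the rows come from the table below
csv_data = io.StringIO(
    "clf_name,auc_max,auc_min,auc_avg,auc_std\n"
    "random_forest_classifier,0.543597,0.25877,0.405869,0.116469\n"
    "knn_classifier,0.455862,0.315555,0.387766,0.0573539\n"
    "xgboost_classifier,0.394797,0.307394,0.348822,0.0358267\n"
    "gaussian_nb_classifier,0.412911,0.21463,0.309264,0.0811983\n"
    "ridge_classifier,0.364039,0.250909,0.309224,0.0462515\n"
)
df = pd.read_csv(csv_data)
df.head()
```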
| | clf_name | auc_max | auc_min | auc_avg | auc_std |
|---|---|---|---|---|---|
| 0 | random_forest_classifier | 0.543597 | 0.25877 | 0.405869 | 0.116469 |
| 1 | knn_classifier | 0.455862 | 0.315555 | 0.387766 | 0.0573539 |
| 2 | xgboost_classifier | 0.394797 | 0.307394 | 0.348822 | 0.0358267 |
| 3 | gaussian_nb_classifier | 0.412911 | 0.21463 | 0.309264 | 0.0811983 |
| 4 | ridge_classifier | 0.364039 | 0.250909 | 0.309224 | 0.0462515 |
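Between the two tables the frame gets sorted ascending by the average score, so the weakest model sits first and the best model lands at the top of the chart — my guess at the call, on a minimal stand-in frame:

```python
import pandas as pd

# minimal stand-in frame with average scores from the table above
df = pd.DataFrame({
    "clf_name": ["random_forest_classifier", "knn_classifier", "sgd_classifier"],
    "auc_avg": [0.405869, 0.387766, 0.263574],
})

# ascending sort: weakest model first, so the best ends up at the top of the chart
df = df.sort_values(by="auc_avg").reset_index(drop=True)
print(df)
```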
| | clf_name | auc_max | auc_min | auc_avg | auc_std |
|---|---|---|---|---|---|
| 0 | sgd_classifier | 0.284995 | 0.22221 | 0.263574 | 0.0292549 |
| 1 | ridge_classifier | 0.364039 | 0.250909 | 0.309224 | 0.0462515 |
| 2 | gaussian_nb_classifier | 0.412911 | 0.21463 | 0.309264 | 0.0811983 |
| 3 | xgboost_classifier | 0.394797 | 0.307394 | 0.348822 | 0.0358267 |
| 4 | knn_classifier | 0.455862 | 0.315555 | 0.387766 | 0.0573539 |
… and plot the chart! Hopefully there are enough comments in the code block below to help you if you’re stuck.
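The original code block didn’t survive the page conversion, so here is a condensed, self-contained sketch that pulls the pieces above together — the layout numbers, sizes, and title text are my reconstruction, not the exact thesis code:

```python
import textwrap

import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# cross-validation results, already sorted ascending by average PR-AUC
df = pd.DataFrame({
    "clf_name": ["sgd_classifier", "ridge_classifier", "gaussian_nb_classifier",
                 "xgboost_classifier", "knn_classifier"],
    "auc_min": [0.22221, 0.250909, 0.21463, 0.307394, 0.315555],
    "auc_avg": [0.263574, 0.309224, 0.309264, 0.348822, 0.387766],
    "auc_max": [0.284995, 0.364039, 0.412911, 0.394797, 0.455862],
})

fig, ax = plt.subplots(figsize=(8, 5))
y = df.index

# grey min-to-max bars, kept at the back of the chart
ax.hlines(y=y, xmin=df["auc_min"], xmax=df["auc_max"],
          color="lightgrey", linewidth=3, zorder=0)

# thin leading lines from the y-axis toward each minimum-score dot
for yi, x_min in zip(y, df["auc_min"]):
    ax.plot([0, x_min - 0.02], [yi, yi],
            color="lightgrey", linewidth=0.5, zorder=0)

# min, max, and average dots with white edges for definition
ax.scatter(df["auc_min"], y, s=100, color="#90c1c6", edgecolor="white", zorder=2)
ax.scatter(df["auc_max"], y, s=100, color="#58849f", edgecolor="white", zorder=2)
ax.scatter(df["auc_avg"], y, s=160, color="#324465", edgecolor="white", zorder=3)

# label every average dot so the reader never hunts for values
for yi, score in zip(y, df["auc_avg"]):
    ax.annotate(f"{score:.2f}", (score, yi), xytext=(0, 10),
                textcoords="offset points", ha="center", fontsize=8)

# model names on the y-axis, clean spines, wrapped description as the x-label
ax.set_yticks(y)
ax.set_yticklabels(df["clf_name"])
for spine in ["top", "right", "left"]:
    ax.spines[spine].set_visible(False)
ax.set_xlim(0, 0.6)
ax.set_title("Top Performing Models, Feature Engineering Approach", loc="left")
plt_desc = ("The average PR-AUC score for the k-folds-cross-validation is shown, "
            "along with the minimum and maximum scores in the cross-validation.")
ax.set_xlabel(textwrap.fill(plt_desc, width=70), fontsize=8)

fig.tight_layout()
fig.savefig("lollipop.png", dpi=150)
```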
More Inspiration
I’ve found a couple of other good examples of lollipop charts, with code, that you might find interesting too. Let me know if you find other good examples (tweet or DM me at @timothyvh) and I’ll add them to the list.
- This lollipop chart is from Graipher on StackExchange.
- Pierre Haessig has a great blog post where he creates dot plots to visualize French power system data over time. The Jupyter Notebooks are on his github, here.
Conclusion
There you have it! I hope you learned something, and feel free to share and modify the images/code however you like. I’m not sure where I’ll go next with the series, so if you have any ideas, let me know on Twitter!
And with that, I’ll leave you with this: