plt.scatter() – Old School Style (plus a best fit line)
A scatterplot is easy to recognize: a bunch of dots without any lines (with the occasional exception of a best fit line running through them). However, that description doesn’t really get at how scatterplots differ from a conventional line plot, and particularly when you would use one as opposed to the other.
A note about terminology: Matplotlib is often imported with “import matplotlib.pyplot as plt”, and we’ll be using ‘plt’ as shorthand for the rest of the article.
Differences between plt.scatter and other Matplotlib plots
The first major way in which a scatterplot differs from Matplotlib line plots is that you will far more commonly have two data points which share the same location on the x axis. On a line plot, that would create a vertical line, which more often than not means that the visualization contains a mistake. On a scatterplot, it’s perfectly normal. More broadly, the dots on a scatterplot do not have an implied order. On a line graph, each point (whether it is marked by a dot or is just part of the line) is connected to one other point on each side. This is easiest to understand in the context of a time series. Each point may represent a given day, for example, in which case it would be connected to the day before and the day after. Or maybe the x axis represents distance along a race track, and the y axis represents your speed at a certain measurement point. All of the data points have a clear ordering, which makes it easy to draw a line from one to the next.
In a scatterplot, there is no such clear order. For example, the data could be the heights and weights of a group of people. There’s no clear sense in which one person “comes before” another. In fact, there may be data points that lie exactly on top of each other! These sorts of plots are valuable for data that is less structured and, well, more “scattered”.
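To make the ordering point concrete, here is a minimal sketch that draws the same unordered data with plt.plot and with plt.scatter. The height and weight values are randomly generated for illustration only; they are not real measurements.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data (an assumption for this sketch): heights in cm and
# weights in kg for 50 people, in no particular order.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=50)
weight = 0.9 * (height - 100) + rng.normal(0, 8, size=50)

fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# plot() connects points in array order, which is meaningless here,
# so the line zigzags back and forth across the axes.
ax_line.plot(height, weight)
ax_line.set_title("Line plot: implies an order")

# scatter() draws each point independently, with no implied order.
ax_scatter.scatter(height, weight)
ax_scatter.set_title("Scatterplot: no implied order")

for ax in (ax_line, ax_scatter):
    ax.set_xlabel("Height (cm)")
ax_line.set_ylabel("Weight (kg)")
```

Call plt.show() or fig.savefig(...) afterwards to view the two panels side by side.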
The possibility that data points lie right on top of each other, or so close that their markers overlap, is a common source of frustration when visualizing. You don’t want the markers to be so small that the outliers are invisible (this creates the annoying problem of a graph that is heavily zoomed out, with no way to see which data point is so extreme). But if your markers are too big, there will be a dense, opaque cluster at the heart of the data. Your viewer won’t know whether there are 10 data points in a given area, or just one.
This is where a scatterplot can sometimes start to look like a density plot. The easiest way to do this is to turn down the ‘alpha’ parameter of plt.scatter in Matplotlib (there are similar controls in other libraries like Seaborn and Plotly). This makes each of the dots somewhat translucent, which makes it easier to tell when they’re overlapping. The downside is that points that are not overlapping will be that much harder to see, which is why it’s important to choose bold colors that will stand out even when faded somewhat. Here’s an example of FiveThirtyEight doing a good job.
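As a rough sketch of the effect, the snippet below plots the same random cloud twice, once fully opaque and once with a low alpha. The data, marker size, color, and the alpha value of 0.15 are all arbitrary choices for illustration; you will likely need to tune them for your own data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 2,000 points clustered around the origin.
rng = np.random.default_rng(1)
x = rng.normal(0, 1, size=2000)
y = rng.normal(0, 1, size=2000)

fig, (ax_opaque, ax_alpha) = plt.subplots(1, 2, figsize=(10, 4))

# Fully opaque markers: the center becomes one solid blob.
ax_opaque.scatter(x, y, s=40, color="navy")
ax_opaque.set_title("Default (opaque)")

# Translucent markers: overlapping points add up to darker regions,
# so the dense center reads differently from the sparse edges.
ax_alpha.scatter(x, y, s=40, color="navy", alpha=0.15)
ax_alpha.set_title("alpha=0.15")
```

Call plt.show() or fig.savefig(...) afterwards to compare the two panels.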
In the same way that a time series graph can benefit from a best fit line, there is value in having a density plot overlaid on your scatterplot, to make it clear where the real center of the data lies. It may also make sense to add a best fit line, if you’re trying to imply that there is a correlation between the measurements along the x axis and the ones along the y axis. When doing something like this, it’s valuable to show the statistics of how good the fit is, because people are notoriously bad at judging correlation (are you?). Even better, give your readers a subjective sense of how good the fit is, and what it means for them.
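One possible way to add a best fit line and report how good the fit is uses NumPy’s polyfit for a straight-line fit and corrcoef for the correlation. This is a sketch on synthetic data (the slope, intercept, and noise level are invented), and a straight line is only one of many models you might choose.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: a noisy linear relationship.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 3, size=100)

# Least-squares straight line: polyfit returns the highest power first.
slope, intercept = np.polyfit(x, y, deg=1)

# For a simple linear fit, R^2 is the squared Pearson correlation.
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6)

# Draw the fitted line across the observed x range and put the
# fit statistics in the legend so readers don't have to guess.
x_line = np.array([x.min(), x.max()])
ax.plot(
    x_line,
    slope * x_line + intercept,
    color="crimson",
    label=f"y = {slope:.2f}x + {intercept:.2f}, R^2 = {r_squared:.2f}",
)
ax.legend()
```

Putting the statistic in the legend is just one option; a sentence in the caption explaining what the value means for the reader works as well.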
Careful use of colors and data markers
It’s important not to overwhelm your viewers with the number of colors that you use, even though it may feel like a way to make your graph look more sophisticated. Color simply adds another dimension to your graph. A plot point can have an X dimension, a Y dimension, a size, and a color. The problem with color is that it cannot be used for fine-grained comparison. For example, how would you determine if something is twice as blue as another data point? Determining size is also fairly tricky, and it’s important to always scale by radius, not by volume. People are remarkably bad at intuitively judging when something is about twice the volume of another thing, often thinking that the difference is far smaller. Sticking to variation of radius will allow your users to more easily understand the subtle differences between your data points.
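One practical wrinkle when scaling by radius: the ‘s’ parameter of plt.scatter is the marker area in points squared, not a radius or diameter. So if you want the marker width to grow in proportion to your data, you need to square the scaled values before passing them in. The sketch below contrasts the two; the data values and the base_diameter constant are hypothetical choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data values that we want to encode as marker size.
values = np.array([1, 2, 3, 4])
x = np.arange(len(values))
y = np.zeros(len(values))

# Hypothetical scale: marker diameter in points for a value of 1.
base_diameter = 10

# scatter's `s` is marker AREA in points**2.
s_area = (base_diameter ** 2) * values      # area grows with the value
s_radius = (base_diameter * values) ** 2    # width grows with the value

fig, (ax_area, ax_radius) = plt.subplots(2, 1, figsize=(6, 4), sharex=True)
ax_area.scatter(x, y, s=s_area)
ax_area.set_title("Area proportional to value")
ax_radius.scatter(x, y, s=s_radius)
ax_radius.set_title("Radius proportional to value")
```

In the second panel the marker for 4 is four times as wide as the marker for 1, whereas in the first panel it is only twice as wide.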
Another way that you can make your graph interesting is to use different marker types. Often we think of data points as being round dots, but there’s no reason that it has to be that way. You could use Xs, triangles, squares, or anything else you can think of. This was more popular back when printing in color was more expensive, but it still has applications today. For example, you could put a big red X over some data points that you really want to call out, or simply save the cost of color ink by differentiating your series by marker type and printing in black and white.
One particular advantage of this approach is that a surprisingly large portion of people are colorblind. Too many graphs use red and green as the core colors, which is exactly the pair of colors that people most commonly have trouble differentiating. You can use both colors for most people, as long as you also use some difference in size or marker type, to enable colorblind people to understand as well.
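Here is a small sketch of that idea: two groups plotted in green and red, but also with different marker shapes so that the groups stay distinguishable without color. The group names, data, and the particular markers are invented for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical groups: each gets its own marker shape AND color,
# so the distinction survives colorblindness or grayscale printing.
styles = {
    "Group A": {"marker": "o", "color": "tab:green"},
    "Group B": {"marker": "^", "color": "tab:red"},
}

fig, ax = plt.subplots()
for i, (label, style) in enumerate(styles.items()):
    # Made-up data: each group is a cloud with a shifted center.
    x = rng.normal(loc=i, scale=1.0, size=40)
    y = rng.normal(loc=i, scale=1.0, size=40)
    ax.scatter(x, y, marker=style["marker"], color=style["color"], label=label)

ax.legend()
```

Other marker codes such as "x", "s" (square), and "D" (diamond) work the same way.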
A reminder about 3D Scatterplots
The principles of scatter plotting in two dimensions are not fundamentally different from those of making 3-D plots. You are simply adding another axis along which the data can sit. It becomes important to have good markers when you are plotting in three dimensions, because bad ones can easily ruin the effect of having real physical form, which can make a 3-D graph feel cheesy. For 3-D graphs, round data points, generally small ones, work best. Think of it as a data cloud, and if you don’t have enough points to give a sense of density, you might want to re-imagine your approach to visualization.
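For reference, a minimal 3-D scatter in Matplotlib might look like the sketch below, using small, round, slightly translucent markers to get the “data cloud” effect. The data is random, and the marker size and alpha are just starting points to adjust.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: a cloud of 500 points in three dimensions.
rng = np.random.default_rng(3)
x, y, z = rng.normal(size=(3, 500))

fig = plt.figure()
# projection="3d" asks Matplotlib for its built-in 3-D axes.
ax = fig.add_subplot(projection="3d")

# Small round markers with some transparency read as a cloud.
ax.scatter(x, y, z, s=8, marker="o", alpha=0.5)

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
```

Call plt.show() in an interactive session to rotate the view, which helps a great deal in judging depth.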
As a final trick: if you want to highlight one or two of the data points anywhere in the scatterplot (perhaps there’s an outlier, or some example that you’re expanding upon in the accompanying text), a clever way to do that is to plot the overall graph in a light color, and then plot the points of interest again right over it, in a different color. It may feel inelegant to double plot, but on a screen there’s no need to “save ink”!
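The double-plotting trick can be sketched as follows. The data is random, and the rule for picking points of interest (the two points farthest from the origin) is a stand-in for whatever selection makes sense in your own analysis.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 200 points around the origin.
rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = rng.normal(size=200)

# Hypothetical selection rule: the two points farthest from the origin.
distance = np.hypot(x, y)
highlight = np.argsort(distance)[-2:]

fig, ax = plt.subplots()

# First pass: everything in a light, unobtrusive color.
ax.scatter(x, y, color="lightgray")

# Second pass: re-plot only the points of interest on top, in a
# bold color. zorder makes sure they are drawn above the first pass.
ax.scatter(x[highlight], y[highlight], color="crimson", zorder=3)
```

Call plt.show() or fig.savefig(...) afterwards to view the result.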