I’ll start with a confession. I used to hate Matplotlib. It was one of the primary reasons that I had a brief, torrid affair with R a few years ago. I found Matplotlib awkward, particularly in contrast to ggplot, and was relieved to move beyond the confusing web of objects inheriting attributes from each other.

As I moved back to Python, I decided to focus on deeply understanding the philosophy behind the library, to see if I could come to peace with it. I’m happy to say that after many messy fights, we have come to respect each other and develop a close working relationship. I wanted to share some of that understanding with you. The scope of the library is massive, so I’m focusing on a common but tricky use case – time series plots.
To illustrate the examples, let’s dive into our motivating story:
Your robot runs for president
Your revolution started as just a joke. You were building a Twitter bot for fun, wondering whether it would say anything funny if left to its own devices. But over time, you realized it was sounding more and more like a presidential candidate. And you had a thought: “If the AI revolution is inevitable, 2016 is as good a year as any.”
Sure, your bot can’t currently play checkers, let alone manage its own campaign. But it already has more followers than the average senator, and you have plenty of time to flesh out its decision-making skills before inauguration day. Get the votes now, with some manual assistance, and you’ll be able to figure out the details of hard AI within a year or so, right?
Tracking Your Polls with a Matplotlib Time Series Graph
The first question to consider is how your robot candidate is doing in the polls. Of course, you conducted all of your polling on Twitter, and it’s pretty easy to pull down some results.
polls = pd.read_csv('data_polls.csv',index_col=0,date_parser=parse)
ax = polls.plot()
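If you want to try the loading step without the CSV on disk, here is a minimal sketch. The rows are invented stand-ins for data_polls.csv, and I'm using `parse_dates=True` rather than `date_parser`, since as far as I know `date_parser` has been deprecated in pandas 2.0 and later. Treat it as one way to get a date index, not the only one.

```python
import io

import pandas as pd

# Invented rows standing in for data_polls.csv (assumption: one date column
# followed by one 'approval' column, as in the generator at the end of the post).
csv_text = """date,approval
2015-01-31,0.05
2015-02-28,0.09
2015-03-31,0.11
"""

# parse_dates=True asks pandas to parse the index column as dates.
polls = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# A DatetimeIndex is what lets polls.plot() draw a proper time axis.
print(type(polls.index).__name__)
```

With a real file you would pass the filename in place of the `io.StringIO` object.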

That growth looks good, but you’re a rational person, and you know that it’s important to scale things appropriately before getting too excited. So let’s modify the plot’s y-axis limits:
ax = polls.plot()
ax.set_ylim(0,1)
Convert the Axis Label Text to Percentage
It’s also nice to have things in terms of actual percentages. Why this isn’t a standard default option is unclear to me. It’s one of the annoying aspects of the library. On the bright side, it makes for a useful introduction to some of the more advanced functionality, which will come in handy as you want to do truly customized things in the future. As you can see below, we’re defining a custom function to take the numbers in decimal form and convert them into percentages. If you wanted to, you could change the numbering to any string that a function can output. Want it in scientific notation? A thousands separator? Written out in Spanish? Just code it up!
ax = polls.plot()
ax.set_ylim(0,1)
def pct(x,pos): return "{}%".format(int(x*100))
ax.yaxis.set_major_formatter(plt.FuncFormatter(pct))
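One small wrinkle with the formatter above: `int()` truncates, and floating point can leave a value like 0.29 just under 29 after multiplying by 100. The sketch below is a variation that rounds instead. The function name `pct_label` and the plotted values are my own inventions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def pct_label(x, pos=None):
    """Turn a fraction such as 0.29 into the tick label '29%'."""
    # round() avoids the truncation that int(0.29 * 100) suffers from.
    return "{}%".format(int(round(x * 100)))


fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0.05, 0.17, 0.29])  # invented values
ax.set_ylim(0, 1)
ax.yaxis.set_major_formatter(FuncFormatter(pct_label))
```

If you don't need a custom function at all, `matplotlib.ticker.PercentFormatter(xmax=1)` is a built-in formatter that should cover the plain percentage case.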
Format Plot Background
Now, whenever I have a political question, my go-to reference is Nate Silver, whose brilliant work predicting candidates’ rise and fall sets the standard for predictive analysis across all domains. While making your graphs look like his is not guaranteed to produce similar predictive success, it certainly couldn’t hurt! Let’s change the background color to white (that’s the ‘#FFFFFF’), and add subtle gray (that’s the ‘#DDDDDD’) gridlines.
ax = polls.plot()
ax.set_ylim(0,1)
def pct(x,pos): return "{}%".format(int(x*100))
ax.yaxis.set_major_formatter(plt.FuncFormatter(pct))
ax.patch.set_facecolor('#FFFFFF')
ax.grid(True, which='major', color='#DDDDDD', linestyle='-')
Matplotlib axis Labels and Title Text
Finally, let’s remember that we’ll want to reference this graph in the future, without having to dig through all the code to remember what the numbers mean. Fortunately, adding annotations is quite easy. Let’s add an X label, Y label and plot title:
ax = polls.plot()
ax.set_ylim(0,1)
def pct(x,pos): return "{}%".format(int(x*100))
ax.yaxis.set_major_formatter(plt.FuncFormatter(pct))
ax.patch.set_facecolor('#FFFFFF')
ax.grid(True, which='major', color='#DDDDDD', linestyle='-')
ax.legend().set_visible(False)
ax.spines['bottom'].set_color('#CCCCCC')
ax.spines['bottom'].set_linewidth(1)
ax.spines['left'].set_color('#CCCCCC')
ax.spines['left'].set_linewidth(1)
ttl = ax.set_title("Approval Rating (Twitter Polls)",fontweight='bold',fontsize=15,position=[.5,1.05])
#ttl.set_position([.5, 1.05])
ax.set_ylabel("% of Responses",fontsize=12,fontstyle='italic')
ax.set_xlabel("Polling Date",fontsize=12,fontstyle='italic')
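Since the same background, grid and spine lines keep getting repeated from one example to the next, one option is to fold them into a helper. This is only a sketch: `style_axes` and its parameter names are hypothetical, not part of Matplotlib, and the plotted numbers are invented.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt


def style_axes(ax, background="#FFFFFF", grid_color="#DDDDDD", spine_color="#CCCCCC"):
    """Apply the white background, gray grid and light spines used in this post."""
    ax.patch.set_facecolor(background)
    ax.grid(True, which="major", color=grid_color, linestyle="-")
    for side in ("bottom", "left"):
        ax.spines[side].set_color(spine_color)
        ax.spines[side].set_linewidth(1)
    return ax


fig, ax = plt.subplots()
ax.plot([1, 2, 3], [0.10, 0.20, 0.35])  # invented values
style_axes(ax)
ax.set_title("Approval Rating (Twitter Polls)", fontweight="bold", fontsize=15)
ax.set_ylabel("% of Responses", fontsize=12, fontstyle="italic")
ax.set_xlabel("Polling Date", fontsize=12, fontstyle="italic")
```

The design choice here is simply to keep the look in one place, so a later change of color means editing one function instead of every example.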
Python Scatter Plots
Now you have proven out that your robot president is getting increasingly popular, but how are people finding out about it? Let’s check in to modern democracy’s answer to clever bumper stickers – the retweet.
Credit: https://www.etsy.com/listing/192986880/funny-political-bumper-sticker-dont
Credit: Well, I bet you can figure out where this came from
First, we’ll import the data and look at it quickly. As always, making an ugly version of the graph is pretty easy. Take that as either points for or against the library, but you shouldn’t feel any shame in doing a rough first draft. Just use the code below and you’ll see a whole tangled mess of lines in no time:
retweets = pd.read_csv('data_retweets.csv',index_col=0,date_parser=parse)
ax = retweets.plot()
Fortunately, this one is pretty easy to clean up. Just tell it that you want style='.', which means that you want dots instead of lines:
ax = retweets.plot(style='.')
Converting the Y Axis to Thousands
Clearly, there’s an upwards trend, but let’s zoom in to see what’s happening in the latest few months. While we’re at it, we’re also going to give the y axis a thousands separator. As with percent, I would have expected this to be default, but we’ll have to use a custom function. As you can see, I’ve used lambda to roll it all into one line, and the versatile Python format function to handle the placement of commas.
ax = retweets.plot(style='.')
ax.set_xlim(parse('1/1/2016'),ax.get_xlim()[1])
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
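If the lambda feels cramped, the same idea can be written as a named function, which is easier to reuse and to check on its own. The sketch below uses invented numbers, and the name `thousands` is mine.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def thousands(x, pos=None):
    """Format a tick value with a comma as the thousands separator."""
    return "{:,}".format(int(x))


fig, ax = plt.subplots()
ax.plot([0, 1, 2], [1200, 15000, 23000], ".")  # invented retweet counts
ax.yaxis.set_major_formatter(FuncFormatter(thousands))
```

Matplotlib also ships `matplotlib.ticker.StrMethodFormatter`, which takes a format string such as "{x:,.0f}" and should give much the same labels without a custom function.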
Machine learning on time series graphs
Now, because you’ve been such a good student up to this point, we’re going to try something pretty advanced. We want to answer this question: The full graph had a definite upwards trend. Is that trend still continuing in the last few months?
Let’s use machine learning to fit a linear regression and see whether the data is actually increasing over time. We’re going to have to use a new library called scikit-learn. This library has a huge number of machine learning algorithms, and the beauty of it is that the simple one that I’m going to show you right now will illustrate a vast majority of the principles that you’ll need for more advanced stuff. So congratulations – for all of your hard work you’re getting lessons in both visualization AND machine learning at the same time!
from sklearn import linear_model
ax = retweets.plot(style='.',markersize=6.5,markerfacecolor='#888888')
# Assumption: retweets_modified is never defined in the original post. A copy with a
# plain integer index (one step per day) fits the "per day" slope reported later.
retweets_modified = retweets.reset_index(drop=True)
lr = linear_model.LinearRegression()
lr.fit(X=retweets_modified.index.values.reshape(-1,1),y=retweets_modified.retweets.values.reshape(-1,1))
# Draw the fitted line on the same axes as the dots
prediction_df = pd.DataFrame(lr.predict(retweets_modified.index.values.reshape(-1,1)),index=retweets.index)
prediction_df.plot(linestyle='-',color='#0000CC',linewidth=1.5,ax=ax)
# Same background, grid and spine styling as the polls graph
ax.patch.set_facecolor('#FFFFFF')
ax.grid(True, which='major', color='#DDDDDD', linestyle='-')
ax.legend().set_visible(False)
ax.spines['bottom'].set_color('#CCCCCC')
ax.spines['bottom'].set_linewidth(1)
ax.spines['left'].set_color('#CCCCCC')
ax.spines['left'].set_linewidth(1)
# Zoom in on 2016 and add the thousands separator
ax.set_xlim(parse('1/1/2016'),ax.get_xlim()[1])
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
How does the fitting process work? The software looks for the single line that sits as close as possible to all of the dots at once. It does this by solving an equation that minimizes the total squared distance from the line to the data points. Though the assumption that the underlying model is linear is a big one, this approach has the advantage of running very quickly. Complex machine learning methods like neural networks require more trial and error, which is why they can take longer to run. In this case the trend is very clearly linear, and there are tests you can run to confirm this, although those are outside of our scope today.
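To make "solving an equation" a little less mysterious, here is a small sketch of the least-squares formulas for a single predictor, written with NumPy rather than scikit-learn. The toy data is invented, and this is only meant to illustrate the idea, not to reproduce what scikit-learn does internally.

```python
import numpy as np

# Invented toy data: y is roughly 3*x + 2 with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.8, 14.0])

# Closed-form least-squares estimates for one predictor:
# slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x.mean(), y.mean()
slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
intercept = y_mean - slope * x_mean

# np.polyfit with degree 1 solves the same least-squares problem,
# so it serves as a cross-check on the hand-written formulas.
check_slope, check_intercept = np.polyfit(x, y, 1)

print(slope, intercept)
```

Because the answer comes straight out of a formula, there is no trial and error involved, which is why the fit is so quick.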
Matplotlib’s Text Annotation Functions
One of the useful things you can do when you have a best fit line is to search for points that fall far outside of it. In this case it doesn’t require much calculation to see that there is one data point down there that is way outside of the rest of the pattern. What’s going on there? This is something that is worth some manual investigation.
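When the stray point is less obvious to the eye, the same hunt can be done numerically by measuring how far each point sits from the fitted line. The sketch below uses invented data with one planted outlier, and fits the line with `np.polyfit` purely to keep the example short.

```python
import numpy as np
import pandas as pd

# Invented daily counts on a rising trend, with one day forced far below it.
days = np.arange(30)
counts = 40.0 * days + 100.0
counts[20] = 100.0  # the planted outlier
series = pd.Series(counts, index=pd.date_range("2016-01-12", periods=30, freq="D"))

# Fit a straight line to day number vs. count, then measure each point's miss.
slope, intercept = np.polyfit(days, series.values, 1)
residuals = series - (slope * days + intercept)

# The point with the largest absolute residual is the one worth investigating.
outlier_date = residuals.abs().idxmax()
print(outlier_date, residuals[outlier_date])
```

A negative residual means the point sits below the line, which is the situation in the retweet graph.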
With a little bit of manual digging, you find that, like 1000 monkeys banging on keyboards and creating Shakespeare, the random word association of your politician robot has accidentally stumbled on a coherent policy position. That’s ruining all the fun! Nobody wants to share that with their friends.
It’s important that your graph is able to communicate all of your insights, and sometimes the most important lesson is a nonquantitative one like this. Fortunately, it’s pretty easy to add a text annotation and give some background.
from sklearn import linear_model
ax = retweets.plot(style='.',markersize=6.5,markerfacecolor='#888888')
# Assumption: retweets_modified is never defined in the original post. A copy with a
# plain integer index (one step per day) fits the "per day" slope shown in the title.
retweets_modified = retweets.reset_index(drop=True)
lr = linear_model.LinearRegression()
lr.fit(X=retweets_modified.index.values.reshape(-1,1),y=retweets_modified.retweets.values.reshape(-1,1))
prediction_df = pd.DataFrame(lr.predict(retweets_modified.index.values.reshape(-1,1)),index=retweets.index)
prediction_df.plot(linestyle='-',color='#0000CC',linewidth=1.5,ax=ax)
ax.patch.set_facecolor('#FFFFFF')
ax.grid(True, which='major', color='#DDDDDD', linestyle='-')
ax.legend().set_visible(False)
ax.spines['bottom'].set_color('#CCCCCC')
ax.spines['bottom'].set_linewidth(1)
ax.spines['left'].set_color('#CCCCCC')
ax.spines['left'].set_linewidth(1)
# Re-plot the outlier in red (.loc replaces .ix, which newer pandas no longer has)
highlight_x = parse('2/1/2016')
highlight_y = retweets.loc[highlight_x].values[0]
retweets.loc[[highlight_x],:].plot(style='.',ax=ax,markersize=8,markerfacecolor='#CC0000')
# xy is where the arrow points, xytext is where the label sits
an = ax.annotate('Accidentally tweeted\na policy position',
    xy=(highlight_x - pd.Timedelta(.25,'D'), highlight_y + 250),
    xytext=(highlight_x - pd.Timedelta(2.5,'D'), highlight_y + 3000),
    arrowprops={'facecolor':'black', 'width':1, 'headwidth':5,'headlength':5},
    )
ax.set_xlim(parse('1/1/2016'),ax.get_xlim()[1])
ax.set_ylim(0,ax.get_ylim()[1]*1.5)
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set_xlabel("Date",fontsize=12,fontstyle='italic')
ax.set_ylabel("Retweets",fontsize=12,fontstyle='italic')
ax.set_title("Robo-candidate Retweets increasing by an average of {:0.2f} per day".format(lr.coef_[0][0]),fontweight='bold',fontsize=15)
ax.legend((ax.lines[0],ax.lines[1]), ('Retweets per Day','Retweets per Day (Linear Regression)'), loc='upper left',fontsize=15)
Note that I’m also taking the opportunity to massage the fonts and bolding a bit, just to make the graph readable at a glance. One of the most important things that I’ve learned about presenting data is that people are going to need it rolled up in different ways at different times. Sometimes this whole plot is exactly what they want, and sometimes they just need the summary. To keep things together, I’ve put the summary (the average daily increase in retweets) right there in the title. Depending on the norms of your organization, this may or may not be the right way to do it, but remember that you can also have text annotations anywhere else on the graph, if you don’t want to put the information right there in the title.
You can see that the matplotlib legend text is created in the very last line, and we explicitly tell it which of the portions of the graph we want to have labeled, and what their label should be. Like so many other functions, the legend function has a default, which is to label every one of the lines on the graph. However, this would create a label for the point that you colored red and called out with an arrow. While that wouldn’t be terrible, you’ve already done a pretty good job of labeling that point so it would be unnecessary. Passing in the sections of a graph that you want labeled explicitly allows you to have full control over both what is in the legend, and how it is described. As with so many other things, you can use the default at first, but never forget that you can be more explicit if you need to.
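Stripped of everything else, the explicit legend looks like the sketch below. The three plotted series are invented stand-ins for the dots, the fitted line and the red highlighted point, and the variable names are mine.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# ax.plot returns a list of line objects; keep the ones we want to label.
data_line, = ax.plot([0, 1, 2, 3], [10, 45, 70, 120], ".", color="#888888")
fit_line, = ax.plot([0, 3], [8, 118], "-", color="#0000CC")
ax.plot([2], [70], ".", color="#CC0000")  # highlighted point, left out of the legend

# Passing handles and labels explicitly controls exactly what the legend shows.
legend = ax.legend(
    (data_line, fit_line),
    ("Retweets per Day", "Linear Regression"),
    loc="upper left",
)
labels = [text.get_text() for text in legend.get_texts()]
print(labels)
```

Only the two lines handed to `ax.legend` get an entry, so the red point stays out of the legend even though it is drawn on the same axes.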
Why time series data is key to predicting the future
Let’s step back and remember why we are doing all of this. Very frequently you want to use data to predict the future, and the simplest way to do that is simply to look at trends in the past. That is why data that is arranged around timestamps, as opposed to geography or other dimensions, is so powerful. Nothing is as useful for predicting the future as data that has time intrinsically built into it. Everything that we do as humans is trying to extrapolate from the recent past to the future, and these tools are the most explicit way to do that with large data sets, and the easiest to share conclusions from. This use case is a slightly silly one, but you can already see how it applies to all sorts of domains, from predicting the rise of a real human candidate, to understanding the fluctuation of visitors to your website. Being able to explain the story, particularly when you are describing extreme movements like these outliers, lets you cut right into the most important portion of what the data can tell you.
Outliers are a critical part of any data set. It’s tough because a majority of the time, they represent measurement error, or some other situation that should just be filtered out and not considered again. But sometimes, just sometimes, they are the key to revolution. Everything that has radically been transformed in the course of human history has started as an outlier, something that people used to the status quo couldn’t quite explain, so they ignored it. If you are looking for those outliers, and looking to see whether there is an explanation, a hidden disturbance in the force, glitch in the matrix, or whatever, you will be the first to take advantage of (or create!) the radical new world that they represent. Just remember the antibiotic power of penicillin was once an outlier, as was the speed of a computer transistor.
Now go out there and predict the future!
p.s. You may be asking: Where did all of this excellent data come from?
First, thank you for complimenting my data.
Second, it’s often useful to play around with fake data as you’re trying to learn a new library. Though there are huge advantages to exploring the unexpected quirks of real data, it can be easiest to start with a fairly simple generator:
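import random
import pandas as pd
from dateutil.parser import parse  # assumption: the post never shows where parse comes from
# Monthly approval: a steady climb plus a little random noise
polls = pd.DataFrame(index=pd.period_range('1/1/2015', '3/1/2016', freq='M'))
polls['approval'] = [x/100. + random.random()/10. for x in range(polls.index.size)]
polls.to_csv('data_polls.csv')
# Daily retweets: a random count inside a band that widens every day
retweets = pd.DataFrame(index=pd.period_range('1/1/2015', '3/1/2016', freq='D'))
retweets['retweets'] = [random.randint(int(x*100./4.),int(x*100./2.)) for x in range(retweets.index.size)]
retweets.loc[parse('2/1/2016'),'retweets'] = 100 #This is the outlier that we find later on
retweets.to_csv('data_retweets.csv')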