PreInvented Wheel

On the shoulders of giants


Easy Matplotlib Bar Chart

When you’re designing a new visualization, the first question to consider should always be: What is each ‘tool’ uniquely good for? Matplotlib’s bar charts, in contrast to line graphs and scatter plots, are useful for discrete categories that have amounts (often counts) associated with them. A line graph would indicate that there is a continuous connection between the categories, which makes sense for time series, but not for other types of binned independent variables. Take statistics for different departments or regions: drawing a line between “South” and “West” implies that there is some meaningful value between the two, which is misleading.


In a bar plot, each bar represents a bin of data. Often, it’s a count of items in that bin. An obvious example would be the number of sales made by a salesperson, or their success as a percentage relative to goal. Making a bar plot in Matplotlib is super simple, as the Python pandas package integrates nicely with Matplotlib.

Bar Chart Example

For an example, let’s imagine that you’re the CEO of an electric car company called Edison, which is trying to compete with Elon Musk’s latest Tesla car. Let’s start by looking at sales by city. We’ll start with the simplest possible chart, which we’ll beautify in stages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(104)  # fixed seed so the made-up numbers are reproducible
cities = ['New York', 'Boston', 'San Francisco', 'Seattle', 'Los Angeles', 'Chicago']
car_sales = pd.DataFrame({
    'sales': np.random.randint(1000, 10000, len(cities)),
    'goal': np.random.randint(4000, 8000, len(cities)),
    'sales_last_year': np.random.randint(1000, 10000, len(cities)),
}, index=cities)
ax = car_sales.sales.plot(kind='bar')

basic matplotlib bar chart

There are a few basic improvement opportunities that jump out immediately. It can be tempting to write these things off as ‘just aesthetic’, but they can be critical for allowing the reader to really understand what point the graphic is trying to convey.

  • Order the cities by volume
  • Tone down the bar colors a bit
  • Clean up the background
  • Add a matplotlib thousands separator to the y axis labels

 

car_sales_sorted = car_sales.sort_values('sales',ascending=False)
ax = car_sales_sorted.sales.plot(kind='bar',facecolor='#AA0000')
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.patch.set_facecolor('#FFFFFF')
ax.spines['bottom'].set_color('#CCCCCC')
ax.spines['bottom'].set_linewidth(1)
ax.spines['left'].set_color('#CCCCCC')
ax.spines['left'].set_linewidth(1)

 

Formatted Bar Plot
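
The thousands separator in the formatter above comes from Python’s own string formatting, not from anything special in Matplotlib. As a minimal sketch (the function name `thousands` is made up for this illustration), here is the same formatting pulled out into a named function that you could hand to `FuncFormatter` instead of the lambda:

```python
def thousands(x, pos=None):
    """Format a tick value with comma separators.

    The (x, pos) signature is what Matplotlib's FuncFormatter passes in;
    pos is unused here.
    """
    return "{:,}".format(int(x))


print(thousands(1234567))  # 1,234,567
print(thousands(8000.0))   # 8,000
```

A named function is a little easier to reuse across several charts than repeating the lambda each time.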

 

It looks like there’s an interesting story brewing about what’s going wrong in San Francisco. What’s the best way to highlight this?

One underutilized technique is changing the color of one of the bars to make it stand out. Very frequently the whole purpose of the chart is to show how one category differs from expectations. Maybe you are focused on that outlier, or you’re trying to show two different distributions and how one tails off quickly while the other one extends well to the right. There are a few different ways to implement it. The one used below takes advantage of Matplotlib’s nice ability to take a list of colors.

As long as everything stays in the right order, you can simply make a list on the spot by iterating through the rows using ‘iterrows’. For each row, it outputs the index for that row (which we’re not interested in here) and a Pandas Series object, which allows you to access the columns of the row. Here, we’re just doing a comparison, and outputting one of two strings (which happen to be ones that Matplotlib will recognize as color definitions).
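
To see the list-building step on its own, here is a small sketch that uses a made-up three-row DataFrame (the numbers and the `demo` name are hypothetical, not the article’s data). It only builds the color list and doesn’t draw anything:

```python
import pandas as pd

# Hypothetical numbers, purely for illustration
demo = pd.DataFrame(
    {'sales': [9000, 5000, 2000], 'goal': [7000, 6000, 4000]},
    index=['A', 'B', 'C'],
)

# iterrows() yields (index, row) pairs; we ignore the index and compare columns
colors = [
    '#AA0000' if row.sales > row.goal else '#000088'
    for name, row in demo.iterrows()
]
print(colors)  # ['#AA0000', '#000088', '#000088']
```

Because the list is built by walking the rows in order, it lines up with the bars only as long as you plot the same DataFrame in the same order.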

Another valuable approach is data labels. But don’t use them just the way that Excel does, as a redundant statement of the bar’s height. Think of other numbers besides those that are already represented on the graph itself. One example would be differences between consecutive bars, or over/under performance relative to goal, or the change since last month. You can put so much information into a properly formatted label that it’s a shame to waste the space being redundant. In this case, let’s include the performance relative to the city’s goal. It can be a little tricky to make sure that these text annotations don’t bang into anything. Fortunately, it should be simple enough in this case, particularly because we’re not including a legend.

Doing the data labels is a little tricky, and we’re relying on the fact that Matplotlib’s bar graph has each of the bars, in order, as children of the axis object. Each of the rectangles has four corners, of course, and we’re looking for the upper right. Then we shift the text just a bit from there, and insert. The text itself is generated with a similar list iteration strategy, and a call to Python’s excellent formatting function.

The method that we used for selecting rectangles, which assumes that they would be the first in the list of children, could break for more complex graphs, so be careful, but we’re following our usual strategy of dialing up the complexity only as we need it. Another option would be to test for the child being a rectangle, as opposed to an axis label or anything else, using isinstance(child, matplotlib.patches.Rectangle).
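
Here is a rough sketch of that type-based selection, on a tiny made-up chart rather than the car sales data. One thing to watch, as far as I can tell: the axes background (`ax.patch`) is itself a Rectangle, so filtering `get_children()` by type alone picks it up along with the bars. The `ax.patches` list, which holds only the patches added to the axes, avoids that.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

fig, ax = plt.subplots()
ax.bar(['A', 'B', 'C'], [3.0, 1.0, 2.0])  # hypothetical data

# Option 1: ax.patches contains only patches added to the axes (here, the bars)
bars = [p for p in ax.patches if isinstance(p, Rectangle)]

# Option 2: filter all children by type; this also catches the background patch
rects_in_children = [c for c in ax.get_children() if isinstance(c, Rectangle)]

bar_heights = [b.get_height() for b in bars]
print(len(bars), len(rects_in_children))
```

For a chart with several series or extra shapes drawn on it, you would still need to decide which rectangles belong to which series, so treat this as a starting point.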

Another concern is that the rectangles are not tightly tied to their labels, which should make us nervous about things getting out of order. There’s not much we can call in Matplotlib to make sure these are aligned, except to check explicitly. Again, we’ll rely on the fact that this is a pretty simple graph and leave that verification step out.
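
If you did want that explicit check, one possible version (a sketch only, assuming a single-series vertical bar chart, with a hypothetical helper name `bars_match_data`) compares each bar’s height to the value it is supposed to represent:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd


def bars_match_data(ax, series):
    """Return True if the i-th bar's height equals the i-th value of the series."""
    bars = list(ax.patches)
    if len(bars) != len(series):
        return False
    return all(bar.get_height() == value for bar, value in zip(bars, series))


# Hypothetical numbers, purely for illustration
sales = pd.Series([9000, 5000, 2000], index=['A', 'B', 'C'])
ax = sales.plot(kind='bar')

print(bars_match_data(ax, sales))                # bars are in data order
print(bars_match_data(ax, sales.sort_values()))  # a re-sorted series no longer matches
```

Running a check like this right before adding the labels would catch the case where the data was re-sorted after plotting.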

 

ax = car_sales_sorted.sales.plot(
    kind='bar',
    # Red if the city beat its goal, blue if it fell short
    color=['#AA0000' if row.sales > row.goal else '#000088' for name, row in car_sales_sorted.iterrows()],
)

percent_of_goal = ["{}%".format(int(100. * row.sales / row.goal)) for name, row in car_sales_sorted.iterrows()]
# The first children of the axes are the bar rectangles, in plotting order
for i, child in enumerate(ax.get_children()[:car_sales_sorted.index.size]):
    # Place the label slightly above the top edge of each bar
    ax.text(i, child.get_bbox().y1 + 200, percent_of_goal[i], horizontalalignment='center')

ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.patch.set_facecolor('#FFFFFF')
ax.spines['bottom'].set_color('#CCCCCC')
ax.spines['bottom'].set_linewidth(1)
ax.spines['left'].set_color('#CCCCCC')
ax.spines['left'].set_linewidth(1)
plt.show()

 

highlighted matplotlib bar chart

 

So this is a pretty sweet graph, but you could imagine wanting to have more plots. For example, what if you wanted to be able to see whether SF’s underperformance was unique to this year? There are several options. The simplest is to insert another bar representing last year’s sales right next to each city’s data:

 

ax = car_sales_sorted[['sales', 'sales_last_year']].plot(
    kind='bar',
)

percent_of_goal = ["{}%".format(int(100. * row.sales / row.goal)) for name, row in car_sales_sorted.iterrows()]
pairs = len(cities)
# The first `pairs` children are this year's bars, the next `pairs` are last year's
make_pairs = zip(ax.get_children()[:pairs], ax.get_children()[pairs:pairs * 2])
for i, (left, right) in enumerate(make_pairs):
    # Put the label above whichever bar of the pair is taller
    ax.text(i, max(left.get_bbox().y1, right.get_bbox().y1) + 200, percent_of_goal[i], horizontalalignment='center')

ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.patch.set_facecolor('#FFFFFF')
ax.spines['bottom'].set_color('#CCCCCC')
ax.spines['bottom'].set_linewidth(1)
ax.spines['left'].set_color('#CCCCCC')
ax.spines['left'].set_linewidth(1)

 

two bar matplotlib chart

 

This is fantastic! But sometimes you have labels that are too long to read sideways, and you want to have the bars face in the other direction. This requires just a few changes:

  • The plot ‘kind’ is now ‘barh’
  • Everywhere that you had an ‘x’, it should be a ‘y’, and vice-versa
  • The ax.text() arguments are always x,y, so you will want to swap the order that you’re passing in the arguments
  • I overwrote the original ticklabels with longer ones, just because I can!

 

ax = car_sales_sorted.sales.plot(
    kind='barh',
    color=['#AA0000' if row.sales > row.goal else '#000088' for name, row in car_sales_sorted.iterrows()],
)

percent_of_goal = ["{}%".format(int(100. * row.sales / row.goal)) for name, row in car_sales_sorted.iterrows()]
for i, child in enumerate(ax.get_children()[:car_sales_sorted.index.size]):
    # x and y are swapped relative to the vertical chart: the label sits to the right of the bar
    ax.text(child.get_bbox().x1 + 200, i, percent_of_goal[i], verticalalignment='center')

ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.patch.set_facecolor('#FFFFFF')
ax.spines['bottom'].set_color('#CCCCCC')
ax.spines['bottom'].set_linewidth(1)
ax.spines['left'].set_color('#CCCCCC')
ax.spines['left'].set_linewidth(1)
# Longer tick labels; they must be listed in the same order as the rows of car_sales_sorted
ax.yaxis.set_ticklabels([
    'Seattle: City of Fish',
    'New York: City of Taxis',
    'Los Angeles: City of Stars and Smog',
    'Boston: City of Sox',
    'Chicago: City of Bean',
    'San Francisco: City of Rent'
], fontstyle='italic')
plt.show()

 

Horizontal Bar Chart

 

The final step that you might want to take is to add some error bars to your graphs. You will instantly look more scientific, but it’s not just something that you can slap on to any existing chart (well, you could, but they would be utterly meaningless).

Let’s take a closer look at the original data that we decided to work with:

 

car_sales_sorted

 

Python Pandas dataframe (Initial)

 

 

This isn’t enough to create error bars around, because they require some sense of how variable the data was. Getting into the statistics of standard deviations and confidence intervals is beyond the scope of this visualization-focused article (you’re welcome), but I do want to leave you with a sense of how to do this from a Matplotlib perspective.

So, for simplicity’s sake, let’s pretend that you hired a terrible accountant. After you’ve sent out the graphs above to your local dealers, they come back to you and say “I’m not totally sure that this number is right. I haven’t been keeping careful track, but I have a sense of how far off this count might be (in either direction). That should be represented on the graph.”

You sigh and take down their guess at the error, and add it to the data set:

 

car_sales_error = pd.DataFrame({
    'error': np.random.randint(300, 2000, len(cities)),
}, index=cities)

# join aligns on the city index, so the sort order of car_sales_sorted is kept
car_sales_sorted_error = car_sales_sorted.join(car_sales_error)
car_sales_sorted_error

 

Expanded Python Pandas Data Frame

 

That’s the information that you need to draw your bars, which you can do below. In addition to the ‘yerr’ argument under ‘plot’ (which draws the error bars; told you it was super simple!), there are two changes that you might not expect:

  • The percentages had previously been floating just a bit above the bars, but are now set at the same height across the top
  • The color of the San Francisco bar is lightened to make the error range more visible. Remember, a higher number (or letters, which pick up after the number 9 and extend through F) means that more ‘light’ gets through, making the shape brighter.
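
To make the point about hex colors concrete, here is a small sketch that splits a ‘#RRGGBB’ string into its red, green and blue components (the helper name `hex_to_rgb` is made up for this illustration). It shows that the lighter blue used below has higher values in every channel than the original blue:

```python
def hex_to_rgb(color):
    """Split a '#RRGGBB' string into three integers between 0 and 255."""
    color = color.lstrip('#')
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb('#000088'))  # (0, 0, 136)   the original blue
print(hex_to_rgb('#4444AA'))  # (68, 68, 170) the lightened blue
```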

 

ax = car_sales_sorted_error.sales.plot(
    kind='bar',
    # A lighter blue than before, so the error bar stays visible against the bar
    color=['#AA0000' if row.sales > row.goal else '#4444AA' for name, row in car_sales_sorted.iterrows()],
    yerr=car_sales_sorted_error.error
)

percent_of_goal = ["{}%".format(int(100. * row.sales / row.goal)) for name, row in car_sales_sorted.iterrows()]
for i in range(len(cities)):
    # All labels sit at the same height, 90% of the way up the y axis
    ax.text(i, ax.get_ylim()[1] * .9, percent_of_goal[i], horizontalalignment='center')

ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.patch.set_facecolor('#FFFFFF')
ax.spines['bottom'].set_color('#CCCCCC')
ax.spines['bottom'].set_linewidth(1)
ax.spines['left'].set_color('#CCCCCC')
ax.spines['left'].set_linewidth(1)
plt.show()

 

Matplotlib error bars

 

The value of these error bars is that they tell you whether you should consider two values to be close enough to be essentially equal. In the real world, you will almost never see data that is perfectly equal between two categories. For example, think about how election polls are often presented as basically neck and neck, even though one candidate may be a fraction of a percent ahead of the other. The reason for that is that the people doing the poll understand that there’s going to be some error in the data’s ability to represent the real situation, so they consider the grey area to be a wash.

There are a whole host of other things that you could do with our charts, including annotations (as I covered in a time series guide), mouseover interactions, or even subplots. With all of these design decisions, it’s important to remember the user, and be willing to work with them in an iterative fashion to figure out what information is most important to them, considering how their unique workflow and background translate into the best visualization for the specific context.
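
As a rough illustration of that “grey area”, here is a sketch that checks whether two values are within each other’s error ranges. The function name and the numbers are hypothetical, and overlapping error bars are only a rule of thumb here, not a formal statistical test:

```python
def intervals_overlap(value_a, error_a, value_b, error_b):
    """Return True if [value - error, value + error] overlaps for the two inputs."""
    low_a, high_a = value_a - error_a, value_a + error_a
    low_b, high_b = value_b - error_b, value_b + error_b
    return low_a <= high_b and low_b <= high_a


# Hypothetical sales counts with the dealers' guessed errors
print(intervals_overlap(5200, 400, 5600, 300))  # True: too close to call
print(intervals_overlap(5200, 100, 5600, 100))  # False: a real difference
```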

As I always say: When in doubt, try it out!


© Copyright 2016 PreInvented Wheel · All Rights Reserved