Everyone has, at some point in their life, had to deal with the nuances of Excel. If you’re a small business, you may have had to rely on spreadsheets for every function, from general accounting such as salaries, benefits, and vacation to company expenditures and revenues. If you’ve been a student, you have probably used spreadsheets for data entry and charting, only to find out how inaccurate the results were or how difficult it was to become familiar with the functions. These days there are alternatives, so we can finally ditch Excel once and for all. According to the TIOBE Index for March 2016, Python ranked fifth among the top programming languages, just behind Java, C, C++, and C#. This guide will show you how to use the Pandas library to get Excel-like functionality.
Your guide through the libraries
With Python, you have a powerful object-oriented programming language that is useful not only for numerical calculations, as FORTRAN was many years ago, but also for analytics, much like R. In particular, there is a multitude of libraries available and, if you are actively piping or daisy-chaining commands together, you’ll soon realize how much computational power and speed you have. If Python were a Greek god, it would combine the traits of Zeus (strength), and its library, csvkit, would be Metis (wisdom). Before you can begin using csvkit, remember that it all starts with downloading your data from a web site, and typically we do this with curl. curl (or cURL) is a command-line tool and a library for transferring data using URL syntax.
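If you would rather stay inside Python for the download step, the standard library can play roughly the same role as curl. The following is a minimal sketch, not a replacement for curl itself; the function name `download` is made up for this illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, destination):
    """Fetch `url` and save the response body to `destination`.

    Roughly comparable to `curl -o destination url`. `download` is a
    hypothetical helper name used only for this sketch.
    """
    destination = Path(destination)
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, destination.open("wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

A call would look something like `download("https://example.com/data.csv", "data.csv")`, where the URL is a placeholder rather than a real data source.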
Using Python Pandas and CSVKit
Python’s many libraries are most useful for parsing and manipulating specific text formats. Specifically, there are seven categories of libraries: a general library named tablib, eight different office libraries, three Adobe Portable Document Format (PDF) libraries, two Markdown libraries, one YAML (YAML Ain’t Markup Language) library, one archive library, and one comma-separated values (CSV) library. In a CSV file, each line is a record and commas separate the fields within a line. The CSV library is typically referred to as csvkit, and it is a set of command-line utilities for converting to and working with CSV files.
CSV is not the easiest format to work with. Alternatives exist, and CSV data sets are often broken or incompatible with one another because, unlike Extensible Markup Language (XML) and HyperText Markup Language (HTML), CSV has no single standard syntax that everyone follows. The following are among the most common utilities within csvkit:
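To make the record-and-field structure concrete, here is a small sketch using Python’s built-in csv module. The sample data and the helper name `parse_records` are invented for illustration; note how a quoted field may itself contain a comma.

```python
import csv
import io

# Made-up sample data: a header line followed by two records.
SAMPLE = 'name,department,salary\n"Smith, Jane",Accounting,52000\nLee,Sales,48000\n'


def parse_records(text):
    """Split CSV text into records (lines), each a list of fields."""
    return list(csv.reader(io.StringIO(text)))


records = parse_records(SAMPLE)
header, rows = records[0], records[1:]
for row in rows:
    # Pair each field with its column name from the header line.
    print(dict(zip(header, row)))
```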
- in2csv: this command-line utility converts other tabular formats, such as Microsoft Excel (XLS and XLSX) files and fixed-width files, into CSV files. Many systems rely on Excel files, and organizations still use this application in their daily operations.
$ in2csv state_tax_data2015.xlsx > state_tax_data2015.csv
- csvlook: this utility is used for output and analysis tasks; it renders a CSV file as a readable table in the terminal. If the data has been sorted with csvsort, for example, you need a way to display the result, and csvlook does that. To further improve the display, you can use csvcut to change the column order and then the tee command to save the changes.
$ csvlook taxes/annual_converted.csv
These utilities can also be combined by piping the output of one into the next.
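For comparison, the same sort, reorder, and display steps can be approximated in Pandas. This is a rough sketch that assumes pandas is installed; the sample data, column names, and the helper `sort_and_select` are made up for illustration and do not come from the files named above.

```python
import io

import pandas as pd

# Made-up sample data standing in for a real CSV file.
SAMPLE_CSV = """state,year,rate
Ohio,2015,5.75
Texas,2015,6.25
Maine,2015,5.50
"""


def sort_and_select(csv_text, sort_column, columns):
    """Sort rows by one column, then keep `columns` in the given order.

    Loosely mirrors csvsort followed by csvcut; the name is hypothetical.
    """
    frame = pd.read_csv(io.StringIO(csv_text))
    return frame.sort_values(sort_column)[columns]


result = sort_and_select(SAMPLE_CSV, "rate", ["rate", "state"])
# Print a plain-text table, similar in spirit to csvlook.
print(result.to_string(index=False))
```

To read from a file on disk instead, `pd.read_csv` also accepts a path in place of the in-memory buffer used here.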
If you enjoy command-line control and getting into the weeds, the very heart and soul, of programming and data analysis, look no further, because Python and, specifically, its libraries offer more than you could ever imagine. Python and its libraries are programming with an attitude. Don’t leave your data to chance with inefficient and inaccurate spreadsheets. Take control of your data and make sure you optimize and streamline your data products. You’re in control, so why not act like it?