# A Primer on Spreadsheets and Visualization

May 23, 2016 by Coleman McCormick

This is part 1 in a series about spreadsheets, data, and visualization. In this part, we’ll cover the basics of spreadsheet applications and an introduction to selecting dataviz methods.

I’ve yet to run into a business that doesn’t use spreadsheets extensively for managing information. Over the years, Microsoft Excel has become the utility belt application of choice not only for working with data, but even for developing applications and automating business processes (for good or for ill).

Desktop spreadsheet applications have been around for decades, with Excel being the current giant of the industry. Long-time computer users might also remember VisiCalc, Quattro, or Lotus 1-2-3. People have been using them since the beginning of computing. As a result, they’ve become a de facto tool for doing anything and everything with data.

VisiCalc, Quattro, and Excel

For some spreadsheet users, formulas and functions for statistics and financial calculations are as deep as they get into working with tabular data. Many others take their spreadsheets to another level with multiple cross-referenced sheets, pivot tables, complex calculations, macros, and more. As useful as spreadsheets can be for manipulating data, it’s hard to look at thousands of rows and columns of numbers without extracting additional meaning from the data through visuals. Let’s explore the concept of charting as a basic method of data visualization.

## Making Visuals of Data

Converting tabular data into visuals is one of the best ways to surface the underlying meaning out of dense information. Effective communication of data through visualization is far from trivial. Every day we all see charts and graphs that butcher raw data or miscommunicate (or undercommunicate) the meaning of data. You can go to an entire graduate study program just to learn how to effectively summarize and communicate with data. It’s that complicated.

• Trends - do you want to show change over a time series?
• Distribution - do you want to see patterns, common occurrences, and outliers?
• Relative comparison - do your data fall into a discrete number of categories you want to see side-by-side?
• Composition - do you need to view a breakdown of percentages?

Take a look at this excellent flow chart for determining which style is best:

Chart selector diagram (Full version). Source: Andrew Abela.

Occasionally, you’ll have multiple of these objectives, in which case it’s sometimes best for clarity to have two separate charts, but also some chart styles are good at achieving multiple communication goals. Let’s look a little deeper into some of the common types and review some examples.

### Column and Bar Charts

When you need simple side-by-side item comparisons, this is a simple approach to showing quantities of discrete items. Bar charts (horizontal) are good when you’ve got more than 10 or 15 items to display because they can be stacked in vertical space. Column charts are good if you want to see relative comparison with time as a variable. My example shows cities in China ranked by total population. With a bar chart, I could even expand to show the top 50 while keeping it readable:

### Pie Charts

Pie charts are the most common way to display composition of a whole. Sections of a circle visualize relative percentages nicely, assuming you don’t have too many categories. More than 15 or so categories can make it difficult to properly label your pie chart and convey the total composition of items. This example shows the 115 active volcanoes in the United States aggregated by type:

### Line Graphs

Highlighting trends over time is best done with simple line graphs, and you’ll often see things like month-to-month revenue charts, rainfall by year, or website visits by day using a line format to show the pattern of movement over time periods. You can also include multiple variables along common time scales to compare relative magnitude between several lines.

### Histogram

Showing distribution is a job for a histogram, particularly useful with large amounts of data. This type of graph helps not only to show distribution but also frequency and where you might have outliers in your data. They work for well distributions of a single variable. My example here uses the previous dataset, this time with over 1500 volcanoes to see distribution of volcanoes by elevation:

### Scatter Plots

If each data point has two related variables you’re trying to connect visually, a scatter plot does this by showing the two values on separate axes, with the data scattered across the plane to visualize distribution in two dimensions. They’re great for large datasets with many records. You can see the correlation between the two variables spatially. In this example, we can look at how time spent studying for exams affects test scores in a classroom: