Every business decision taken today should be driven by data. The collection of data has become big business itself as companies try to make the best decisions possible based on what the data reveals rather than on assumption or guesswork.
Data is usually collected to confirm whether a hypothesis is correct. Someone already has a question in their mind that they want to answer. Whatever data they collect is meant to provide that answer. Charts are made, reports get written, data gets analyzed, and answers are disseminated across the organization confirming whether the hypothesis is correct. A decision might get made, and that’s usually where things end.
That process is fine, but it assumes the data you have is static and only useful for answering predefined questions.
But data that you have already collected can have more than one purpose. With the right data visualization tool, you can leverage your data into charts and analyses that can answer questions you didn’t even know you had.
What if by simply looking at your data you could find new trends, lines of inquiry, and business opportunities?
That’s where Exploratory Data Analysis (EDA) comes in. With EDA you can use charts, graphs, and other visualizations as a starting point for investigating almost every part of your data and how it all relates. You can see relationships that may not be apparent, trends that surprise, and make new predictions using the data you already have.
Here are 5 ways you can use exploratory data analysis to begin seeing what your own data can reveal:
1. Box Plots – Your starting point
Box plots are the first step in EDA for many data scientists. Developed by famed statistician John Tukey (who also pioneered EDA!), a box plot is a great way to get a visual sense of an entire range of data.
A box plot divides data into its quartiles. The “box” shows the data that falls between the first and third quartiles. The median is drawn as a line inside the box, and lines known as “whiskers” extend from the box to the most extreme values that are not outliers. If there are any outliers, those can be plotted as individual points.
Figure 1 – These box plots show the distribution of male and female customers. The data here shows us that sales are split pretty evenly.
With box plots you can answer how diverse or uniform your data might be. You can identify what is normal and what is extreme. Box plots help give a shape to your data that is broad without sacrificing the ability to look at any piece and ask more questions.
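To make the pieces of a box plot concrete, here is a minimal sketch of the numbers a charting tool works out before it draws one. The order values are made up for illustration, and the 1.5 × IQR cutoff for outliers is a common convention assumed here, not something every tool is guaranteed to use.

```python
import statistics

def box_plot_summary(values):
    """Return the numbers a box plot is drawn from."""
    data = sorted(values)
    # The three cut points that split the data into quartiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # height of the "box"
    # Assumed convention: anything beyond 1.5 * IQR from the box is an outlier.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # most extreme non-outlier, low end
        "whisker_high": inside[-1],  # most extreme non-outlier, high end
        "outliers": outliers,
    }

# Hypothetical order values, with one unusually large order.
order_values = [12, 15, 18, 20, 22, 25, 27, 30, 95]
summary = box_plot_summary(order_values)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The single large order lands outside the whiskers, which is exactly the kind of value worth a follow-up question.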
2. Measure your categories with Bar Charts
Bar charts let you look at every piece of data as it relates to one or more overall metrics. While the box plot gives you an outline of the data, the bar chart lets you see individual categories and how big those categories are. Bars of roughly equal height tell you your data is spread evenly across its categories, while uneven bars show you which categories dominate and which might be responsive (or not) in the future.
Figure 2 – A bar chart of weapons used in crimes in Baltimore. EDA might show you where you need to improve data collection.
With a bar chart, you can see how different things are between separate categories of data. That is good when you want to know what separates your variables. If you have a lot of categories you may want to compare a limited set of categories and see how things stack up.
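Behind every bar chart is a simple count per category. The sketch below uses made-up sales-channel records (the channel names and the missing entries are assumptions for illustration) and prints a rough text version of the bars. Counting the blanks as their own category is one way to spot where data collection needs work.

```python
from collections import Counter

# Hypothetical records: the channel each sale came through.
# None marks a record where the channel was never filled in.
records = ["online", "store", "online", None, "phone", "online", "store", None]

# Count each category, keeping missing values visible instead of dropping them.
counts = Counter(r if r is not None else "(missing)" for r in records)

# Print one text "bar" per category, largest first.
for category, n in counts.most_common():
    print(f"{category:<10} {'#' * n} {n}")
```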
3. See your data range with Histograms
Histograms use bars to visualize data as well. Many people may not even realize there is a difference between a histogram and a bar chart. They practically look the same from a distance.
Figure 3 – This histogram shows a range of public transportation ridership. This way you find ways to predict how often ridership will meet certain goals.
The key is that a histogram looks solely at quantitative variables while a bar chart looks at categorical variables. That’s why the bars in a histogram are typically grouped together without spacing in between the bars.
Those values are listed in order so you can see the overall range and skew of the data, while a bar chart’s discrete categories can be arranged in any order without changing what they mean. Since histograms let you view data in ranges, you can adjust the width of those ranges to show differing extremes.
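The grouping into ranges can be sketched in a few lines. The ridership numbers, the starting point, and the range widths below are all hypothetical; the point is only that the same data gives a different picture when the ranges get wider or narrower.

```python
def histogram(values, bin_width, start):
    """Count how many values fall into each equal-width range.

    Assumes every value is >= start. Returns (range_start, count) pairs
    in order, including empty ranges so gaps in the data stay visible.
    """
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return [(start + i * bin_width, c) for i, c in enumerate(counts)]

# Hypothetical daily ridership figures.
ridership = [120, 135, 150, 160, 165, 170, 180, 210, 240, 310]

narrow = histogram(ridership, bin_width=50, start=100)
wide = histogram(ridership, bin_width=100, start=100)

for label, bins, width in (("width 50", narrow, 50), ("width 100", wide, 100)):
    print(label)
    for lower, count in bins:
        print(f"  {lower}-{lower + width - 1}: {'#' * count}")
```

With the narrower ranges, the empty stretch before the busiest day stands out; with the wider ones it disappears into the bar beside it.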
4. Identify Patterns with Scatter Plots
Scatter plots let you see how closely your data may be correlated. If there is an apparent relationship between pieces of your data then there may be a single cause that could account for multiple variables.
Of course, learning that your data is not closely related can be important as well, and a scatter plot that shows no relationship can be just as important in nailing down why something happens. Randomness is a fact of life, after all. Knowing the difference between what is random and what can be predicted can remove uncertainty from decisions and keep you focused on actions likely to provide results.
Figure 4 – This scatter plot of crime volume sorted by date shows that some days do see a lot more reported crimes than other days.
From the scatter plot you can draw a straight line through the points that follows their overall trend. That’s known as a “line of best fit.” Once you can draw that, you can begin to extrapolate what is likely to happen as conditions change.
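One common way to get a line of best fit is ordinary least squares, sketched below along with the correlation coefficient that tells you how tightly the points follow the line. The data points are invented for illustration, and least squares is an assumed choice here; charting tools may offer other fitting methods.

```python
from math import sqrt
from statistics import fmean

def best_fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def correlation(xs, ys):
    """Pearson correlation: near 1 or -1 is a strong linear relationship, near 0 is none."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired measurements, e.g. week number vs. units sold (thousands).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = best_fit_line(xs, ys)
r = correlation(xs, ys)

# Extrapolate one step beyond the data we have.
next_x = 6
predicted = slope * next_x + intercept
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f}")
print(f"predicted value at x={next_x}: {predicted:.2f}")
```

A prediction like this is only as good as the relationship behind it, so it is worth checking the correlation before leaning on the line.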
5. Blow some Bubbles! (bubble charts, that is)
Bubble plots are a kind of scatter plot, but they let you examine a third variable by changing the size of each point. Maybe some pieces of data are closely related but one piece is much larger than the other. The bubble plot lets you see that difference in size between data points and ask if it has an effect on the relationship. That is something you may not see in a traditional scatter plot that only looks at two variables.
Bubble charts work great with maps. If you’re looking for the impact of something across various cities, the bubble chart lets you see how much of an impact there is compared to simply where things may be happening.
Figure 5 – This bubble chart shows you where crime has happened and how often. If you have the right data configuration you could overlay this data over a map of the city.
From there you can drill down and figure out what is working in those cities and why. A business with locations doing very well in one city could teach its best practices to other branches that have room to grow.
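The third variable in a bubble chart is carried by bubble size, and how that size is scaled matters. A minimal sketch, assuming you want each bubble's area (not its radius) to be proportional to the value so large values are not visually exaggerated. The branch names, coordinates, and sales figures are hypothetical.

```python
import math

def bubble_radii(sizes, max_radius=30.0):
    """Scale values to radii so that bubble AREA is proportional to the value.

    Area grows with the square of the radius, so the radius is scaled by
    the square root of each value's share of the largest value.
    """
    largest = max(sizes)
    return [max_radius * math.sqrt(s / largest) for s in sizes]

# Hypothetical branches: (name, x position, y position, monthly sales).
branches = [
    ("North", 2.0, 8.0, 100),
    ("Central", 5.0, 5.0, 400),
    ("South", 7.0, 1.0, 25),
]

radii = bubble_radii([sales for _, _, _, sales in branches])

for (name, x, y, sales), radius in zip(branches, radii):
    print(f"{name:<8} at ({x}, {y}) sales={sales} radius={radius:.1f}")
```

Here the Central branch sells four times what North does, so its bubble gets twice the radius and four times the area.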
These charts are just a start. The beauty of EDA is that it is meant to be a starting point so you can approach your data with the freedom of figuring out what charts can reveal about things you didn’t know about your data.
If you start with these five then you are on your way to finding out what you can do when you decide to let your data speak for itself. Doing that will remove uncertainty and doubt from huge chunks of your decision-making process.
Of course, using software that can automatically configure data into a chart at the press of a button makes the process that much easier. You can automate your EDA and make it a part of your business’ standard decision making. Just select your database. Run it through any number of chart styles and see what you can find.
Take some time to look at some data that you are familiar with. Start with these five charts, note what each visualization reveals, and see if you notice anything you missed when you looked at the data before. Then see if you can turn any of your notes into an action item. Those action items are the result of your first successful foray into EDA. Enjoy!