Once you know what predictive analytics solution you want to build, it’s all about the data. The reliability of predictions depends on the quality of the data used to discover variables and generate, train, and test predictive models.
Follow these guidelines to solve the most common data challenges and get the most predictive power from your data:
Find the Data You Need
Some of the data may be easy to obtain, while other elements will require more time and effort to assemble. Fortunately, there’s no need to have everything in hand from the beginning. Start by building a model with a few essential data elements, then collect user feedback while you gather additional data. Bring new data elements into your model in subsequent releases to deliver incremental performance enhancements or additional capabilities.
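This incremental approach can be sketched as a scoring model whose interface stays fixed while new data elements are folded in release by release. The feature names and weights below are hypothetical, chosen purely for illustration:

```python
# Sketch: a v1 model trained on a few essential fields, written so new
# fields can be added in later releases without changing the interface.
# Feature names and weights are hypothetical.
def predict_churn_score(record, weights):
    """Weighted sum over whichever features the current release supports."""
    return sum(weights[f] * record.get(f, 0.0) for f in weights)

# Release 1: essential data elements only
v1_weights = {"order_total": -0.01, "days_since_last_purchase": 0.05}

# Release 2: same code path, one newly collected data element added
v2_weights = dict(v1_weights, support_tickets=0.2)

customer = {"order_total": 120.0, "days_since_last_purchase": 30, "support_tickets": 2}
v1_score = predict_churn_score(customer, v1_weights)
v2_score = predict_churn_score(customer, v2_weights)
```

Because `predict_churn_score` ignores features it has no weight for, release 1 ships without waiting for the support-ticket data to be assembled.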
Choose the Right Database
You’ll need to store three types of data for your predictive projects: historical data for creating, training, and testing your models; new data your models will analyze to make predictions; and the predictions (outputs) from your models. One terabyte of storage for historical data is a good baseline for most small and medium-sized businesses’ predictive applications. Traditional relational databases like Oracle and MySQL usually work fine for this purpose, and they’re also adequate for generating predictions on new data.
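One way to picture the three categories is as three separate tables. This sketch uses SQLite for illustration only; the table and column names are assumptions, not a prescribed schema:

```python
import sqlite3

# Sketch: one table per data category. SQLite, table names, and columns
# are illustrative assumptions, not a prescribed schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE historical_data (event_date TEXT, features TEXT, outcome REAL);
CREATE TABLE new_data        (event_date TEXT, features TEXT);
CREATE TABLE predictions    (event_date TEXT, predicted REAL, model_version TEXT);
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

Keeping model outputs in their own `predictions` table also makes it easy to compare different model versions later.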
If your application is targeted at significantly larger businesses, or it requires collecting very detailed data, both your development team and your customers may need to store much more historical and/or new data, perhaps as much as 100 terabytes. In these cases, look at non-relational databases designed to scale horizontally. Learn how to choose a database for your predictive project.
Practice Database Hygiene
It’s usually best to keep the three categories of data in separate databases/tables. In some cases, however, it may be preferable to use the same database/table for both historical and new data. This can be done by creating filters to separate the data. For example, a six-month filter could be used to access historical data, with a last-day or last-week filter for new data.
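The shared-table approach with date filters can be sketched in a few lines. The row layout and the exact window sizes here are assumptions; a six-month window is approximated as 180 days:

```python
from datetime import date, timedelta

# Sketch: one shared table, with date filters standing in for the
# historical vs. new split. Row layout and window sizes are assumptions.
today = date(2024, 6, 30)
rows = [
    {"event_date": date(2024, 1, 15), "value": 10},   # older row
    {"event_date": date(2024, 6, 29), "value": 42},   # arrived in the last day
]

def historical(rows, today, months=6):
    cutoff = today - timedelta(days=months * 30)       # rough six-month window
    return [r for r in rows
            if cutoff <= r["event_date"] < today - timedelta(days=1)]

def new_data(rows, today, days=1):
    return [r for r in rows if r["event_date"] >= today - timedelta(days=days)]
```

In a real database these would be `WHERE` clauses on the date column rather than Python list comprehensions; the principle is the same.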
Within the category of historical data, you also want to separate the data for training your model from the data for testing it. By creating two sections, you ensure a fair evaluation because you’re measuring the accuracy of your model on data it has never seen before.
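A minimal version of that split shuffles the historical records once, deterministically, and carves off a held-out section. The 80/20 ratio and fixed seed are common conventions, not requirements:

```python
import random

# Sketch: split historical records into training and test sections so the
# model is evaluated on data it has never seen. Ratio and seed are
# illustrative conventions.
def train_test_split(records, test_fraction=0.2, seed=42):
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)          # deterministic shuffle
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
```

Fixing the seed means the same records land in the test section every run, so accuracy numbers are comparable between model versions.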
Cleanse Your Data
Predictive models are only as good as the data they learn from. Accuracy of predictions depends on clean data. As with other business intelligence projects, the task of cleansing data has traditionally been lengthy—taking up as much as 60 percent of the time in predictive projects. Fortunately, that’s changing. Machine learning is increasingly being used to detect and resolve two of the most common data problems: missing values and outliers.
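Those two fixes can be sketched with median imputation for missing values and a z-score cutoff for outliers. The cutoff of 2.0 and the sample values are assumptions for illustration; real pipelines tune the threshold to the data:

```python
import statistics

# Sketch: two common cleansing steps. The z-score cutoff of 2.0 and the
# sample values are illustrative assumptions.
def cleanse(values, z_cutoff=2.0):
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    filled = [median if v is None else v for v in values]   # impute missing values
    mean, stdev = statistics.mean(filled), statistics.stdev(filled)
    return [v for v in filled if abs(v - mean) <= z_cutoff * stdev]  # drop outliers

raw = [10, 12, None, 11, 13, 500]   # one missing value, one extreme outlier
clean = cleanse(raw)
```

Note the ordering: impute first, then screen for outliers, so the imputed value is judged by the same rule as the observed ones.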
Models learn to predict outcomes by identifying patterns in historical data. If you don’t correct bias in your data, the predictions your models make will be inaccurate, as they reinforce and perpetuate the bias. Watch out for the two common types of bias—data bias and selection bias—and take steps to avoid or correct them. Here’s how to recognize and avoid biased machine-learning models.
Validate and Establish a Performance Baseline
To validate that your model is working prior to launch, run it on your test data. This is historical data you’ve stored separately from the data you used to build the model. Since you already know the actual outcomes, you can measure how close the prediction gets to what happened in the real world. Learn how to assess the accuracy of your predictive model.
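For a classification model, that comparison boils down to counting how often the prediction matches the known outcome. The outcome and prediction lists below are hypothetical:

```python
# Sketch: measuring accuracy on the held-out test section, where actual
# outcomes are already known. Outcomes and predictions are hypothetical.
def accuracy(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

actual_outcomes = [1, 0, 1, 1, 0, 1, 0, 0]   # what really happened
model_output    = [1, 0, 1, 0, 0, 1, 1, 0]   # what the model predicted
score = accuracy(model_output, actual_outcomes)
```

For regression models you would compare numeric error (e.g., mean absolute error) instead, but the idea is the same: score predictions against outcomes the model never saw.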
You don’t need a model with 100 percent accuracy. To determine if you’re ready for your initial predictive software launch and to establish a performance baseline, compare your model’s accuracy with what your beta customers were achieving without it. If your customers were making mostly judgment-based “gut” decisions, the predictive analytics model simply needs to do better than that baseline.
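One common stand-in for a judgment-based baseline is the majority-class rule: always predict whichever outcome happens most often. The outcome data here is hypothetical:

```python
from collections import Counter

# Sketch: a naive majority-class baseline standing in for judgment-based
# decisions; the model only needs to beat this number. Data is hypothetical.
def baseline_accuracy(actual):
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)

actual = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # 6 positives, 4 negatives
baseline = baseline_accuracy(actual)      # always-predict-majority accuracy
```

If your model’s test accuracy clears this baseline, you have a defensible case that it adds value over the status quo, even well short of 100 percent.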