Predictive Analytics

2 Rookie Mistakes to Avoid When Training Predictive Analytics Models

By Sriram Parthasarathy

Predictive analytics works by learning the patterns that exist in your application’s historical data, then using those patterns to predict future outcomes. As you feed in data for different outcomes, you train the predictive analytics algorithm—and it’s important to regularly measure the accuracy of the predictions by comparing them to actual results.

>> Related: A Practical Way to Approach Predictive Analytics Accuracy <<

Watch out for these two rookie mistakes when training your predictive analytics models:

Mistake #1: Measuring Accuracy on the Same Data Used for Training

Say you are using customer churn data from 2017 and 2018 to train your predictive analytics model. If you then compare those predictions to the actual results you saw in 2017 and 2018, it’s akin to seeing a copy of an exam before the actual test. You’re going to ace that test—and so is your predictive analytics model.

To correctly train your model, you need to separate your data into two groups: training data and test data. Train the model using the training data, and test for the accuracy on only the test data (which the algorithm never saw during training).


In this customer churn example, you could train the model on the 2017 data and test its predictions against the 2018 data. Or, say you have 1,000 rows in your dataset: use 80 percent of the data (800 rows) to train and the remaining 20 percent (200 rows) to test.
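As a minimal sketch of that 80/20 split (the dataset here is hypothetical, generated just for illustration):

```python
import random

random.seed(42)

# Hypothetical dataset: 1,000 rows of customer records with a churn label.
rows = [{"id": i, "churned": random.random() < 0.2} for i in range(1000)]

random.shuffle(rows)                 # shuffle so the split isn't ordered by id
split = int(len(rows) * 0.8)         # 80/20 split point
train_data = rows[:split]            # 800 rows the model learns from
test_data = rows[split:]             # 200 held-out rows the model never sees

print(len(train_data), len(test_data))  # 800 200
```

Libraries like scikit-learn provide helpers (such as `train_test_split`) that do this shuffling and splitting for you, but the idea is the same: the model is never evaluated on rows it was trained on.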

Now, you have both the actual outcome for the 200 rows of test data and the predicted outcome for those same 200 rows, so you can compare the two to measure accuracy. If 160 of those 200 rows were correctly predicted, the accuracy of your predictive model is 160/200, or 80 percent.
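That comparison is just a row-by-row match of predicted versus actual labels. A minimal sketch, using made-up outcome lists that reproduce the 160-of-200 figure above:

```python
# Hypothetical outcomes for the 200 held-out test rows.
# The model happens to be right on the first 160 and wrong on the last 40.
actual    = [True] * 160 + [False] * 40
predicted = [True] * 160 + [True] * 40

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

print(accuracy)  # 0.8
```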

Mistake #2: Failing to Recognize a Data Imbalance

Data imbalance is an important part of any predictive analytics puzzle, and a standard accuracy assessment won't reveal it. An unrecognized imbalance can lead to a misleadingly biased predictive model.

Let’s take fraud detection as an example. Say 95 percent of your transactions are not fraudulent. If your algorithm marks 100 percent of transactions as “not fraudulent,” it will still be right 95 percent of the time. An algorithm that’s 95 percent accurate may seem pretty solid—but it’s still going to be wrong 5 percent of the time, and in the case of fraud, that 5 percent can break the bank.
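You can see the trap in a few lines. This sketch uses a hypothetical set of 1,000 transactions, 5 percent of them fraudulent, and a naive "model" that labels everything as not fraudulent; its accuracy looks strong while it catches zero fraud (recall is one common metric that exposes this):

```python
# Hypothetical labels: 950 legitimate transactions, 50 fraudulent (5%).
actual = [False] * 950 + [True] * 50   # True = fraudulent

# A naive "model" that marks every transaction as not fraudulent.
predicted = [False] * 1000

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
fraud_caught = sum(a and p for a, p in zip(actual, predicted))
recall = fraud_caught / sum(actual)    # share of actual fraud detected

print(accuracy)  # 0.95 -- looks solid
print(recall)    # 0.0  -- catches no fraud at all
```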

As you can see, an imbalance in your data can lead to a biased predictive model. For more information on recognizing and avoiding biased models, see our previous blog on the subject.


Originally published January 29, 2019; updated on July 31st, 2020

About the Author

Sriram Parthasarathy is the Senior Director of Predictive Analytics at Logi Analytics. Prior to working at Logi, Sriram was a practicing data scientist, implementing and advising companies in healthcare and financial services for their use of Predictive Analytics. Prior to that, Sriram was with MicroStrategy for over a decade, where he led and launched several product modules/offerings to the market.