Cleaning is a billion-dollar industry. We scour the aisles for Lysol wipes, mops, dishwasher detergent, soap and laundry aids in the constant battle to keep our homes clean. But much like that space behind the refrigerator – which we all know is full of dust but is simply too hard to reach – our data is accumulating dirt as well.
I ran into (and contributed to) this problem several years ago when developing a new data product for the mobile application space. My plan was to web-scrape data from the Apple App Store – across a range of countries – and then productize that information via an embedded web dashboard.
I knew that data quality would be an issue if I didn’t plan wisely, so I did. I started by identifying the data I wanted to collect: App Name, Developer, Rank, Price, Country and Category. Pretty straightforward, right?
I clearly defined the columns of data. I optimized various fields in Excel to ensure formatting on incoming data wouldn’t be an issue – this was a testing ground in advance of moving to a SQL database. I was sure I had thought of everything. Then reality set in, reminding me of the old adage about the best-laid plans.
So, what turned my pristine nuggets of golden data into coal-covered briquettes?
- Applications began changing their names. I remain furious with Booking.com. Years later I still have nightmares about the seemingly weekly app name changes the company made to show off its impressive, growing room inventory (e.g. Booking.com 100,000 rooms…102,000, 110,000). Other companies would add features to their title. Some would intermittently note in the title that the app was “On Sale”. If developers only had one application, correlation would be easy – but alas, most had a vast library, making it difficult to create an ongoing tracking system.
- Developers got acquired. Some developers would get acquired (or simply change their name), which required adjusting current and historical data so we could trend information. For example, when Sunrise Calendar was purchased by Microsoft we couldn’t simply say Sunrise went from X position to non-existent. Instead we needed to develop strategies for dealing with this by either keeping the historical name attached, reverting old data to show Microsoft or adding a new category that showed Acquired Developer Name.
- We added more data sources. The above challenges were just from one store and one country, as we started with just the Apple App Store in the US. Adding other countries required language translation, currency conversions, etc. It really exacerbated our problem and made it borderline unmanageable.
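One way to picture the acquisition strategies described above (keep the historical name attached, but also record the current owner in a separate field) is the minimal Python sketch below. The `ACQUIRED_BY` table, the field names `developer_current` and `acquired_developer`, and the row layout are all assumptions made for illustration, not the schema we actually used.

```python
from typing import Dict, Optional

# Hypothetical, hand-maintained lookup: scraped developer name -> acquirer.
ACQUIRED_BY: Dict[str, str] = {"Sunrise": "Microsoft"}


def annotate_developer(row: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of the row with ownership fields added.

    The scraped 'developer' value is left untouched so history is preserved;
    'developer_current' is the name to trend on, and 'acquired_developer'
    is filled in only when the developer appears in ACQUIRED_BY.
    """
    scraped = str(row["developer"])
    owner: Optional[str] = ACQUIRED_BY.get(scraped)
    return {
        **row,
        "developer_current": owner if owner is not None else scraped,
        "acquired_developer": scraped if owner is not None else None,
    }


if __name__ == "__main__":
    print(annotate_developer({"app": "Sunrise Calendar", "developer": "Sunrise"}))
```

Because the original column is never overwritten, the same rows can be trended either by the historical name or by the current owner, and the decision can be revisited later.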
Needless to say, my simple project got complicated very quickly. But the lessons allowed us as a company to grow up and deal with data integrity and data governance at a strategic level. These lessons included:
1. We decided on strategies for encouraging data hygiene at the data source. Automated ETL solutions for cleaning or harmonizing our data simply didn’t work – I saw them as a one-time band-aid that didn’t actually fix our problem. There were too many variables, too many unique elements in our objective and too many unforeseen changes we couldn’t plan for.
So, we focused on data collection. We implemented strategies at the point of ingest to address our dirty data. This included keeping better records of changes, adding new fields to track history and prioritizing which data needed to be cleaned and which was lower priority (we were collecting millions of rows per week so we had to be smart). This approach allowed our highest value data to be cleaned quickly so we could present it to the market in near real-time.
2. We learned to get creative. Once we decided to focus on the data source we were able to devise unique, long-lasting solutions. To address app name changes we began relying on consistent data markers such as the HTML path, which never changed. We relied on .JPG icon image locations to correlate international versions housed at various URLs. We designed automated reports to alert us to important events, such as a name we had never seen entering our database or a popular name suddenly disappearing (suggesting an unforeseen change). These solutions proved valuable in the long term and greatly reduced our back-end cleaning.
3. We recognized the responsibility fell on us. I desperately wanted to automate our data cleaning – especially after a particularly painful experience using Amazon Mechanical Turk – but the data was simply too varied. Our goals would sometimes change and we didn’t want to be locked into a rigid system. Ultimately, we learned that the best practice was good data hygiene on the front end to avoid dirty data on the back end.
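The ingest-time ideas in these lessons (a stable identifier, a history field and alerts for unfamiliar or vanished names) could be sketched roughly as follows. This is an illustration under assumptions rather than our production code: the `path` and `name` row fields, the `AppRecord` class and the `ingest` function are hypothetical, and the store path is simply assumed to stay constant across renames.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AppRecord:
    stable_key: str                                  # e.g. the store URL path
    names: List[str] = field(default_factory=list)   # every name seen, oldest first


def ingest(catalog: Dict[str, AppRecord], snapshot: List[dict]) -> Dict[str, List[str]]:
    """Merge one scrape into the catalog and return alerts for review."""
    alerts: Dict[str, List[str]] = {"new_apps": [], "renamed": [], "missing": []}
    seen = set()
    for row in snapshot:
        key, name = row["path"], row["name"]
        seen.add(key)
        record = catalog.get(key)
        if record is None:
            # First time this stable key appears: flag it for a human to check.
            catalog[key] = AppRecord(key, [name])
            alerts["new_apps"].append(name)
        elif record.names[-1] != name:
            # Same key, different title: record the rename instead of
            # treating it as a brand-new app.
            alerts["renamed"].append(f"{record.names[-1]} -> {name}")
            record.names.append(name)
    # Apps already in the catalog but absent from this scrape.
    for key, record in catalog.items():
        if key not in seen:
            alerts["missing"].append(record.names[-1])
    return alerts


if __name__ == "__main__":
    catalog: Dict[str, AppRecord] = {}
    ingest(catalog, [{"path": "/app/id1", "name": "Booking.com 100,000 rooms"}])
    print(ingest(catalog, [{"path": "/app/id1", "name": "Booking.com 102,000 rooms"}]))
```

Keying on the path rather than the title means a rename becomes one more entry in the history list, and the alert lists give a person a short queue to review instead of millions of rows.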
In the long term, I believe more companies will adhere to this clean-first philosophy and educate employees on their responsibility for data governance. Data is a competitive tool that companies increasingly rely on to remain nimble, but when it is dirty or used incorrectly it can lead to bad decisions.
My clean-up efforts were only effective because of an ongoing relationship with IT, and it didn’t happen overnight, but I do believe it was the right approach. We were prepared for ongoing data changes, nimble enough to act on changing market conditions and capable of making changes as we saw fit. As a result we were able to expand our data set to more countries and offer clients more value.
By addressing data collection as an organization, we had better control over the output because we designed the solution that worked for us. I increasingly believe other companies will do the same.