The number of objects and actions in our daily lives that generate a data footprint is exploding. As we become ever more connected, the growth of mobility data is expected to exceed 30 exabytes by 2020, the product of 8.2 billion personal wireless devices and 3.2 billion M2M and IoT connections.

For most of us, the idea of an exabyte is way beyond our comprehension (if we even know the term). To make it easier, let’s just call it Big Data. Thirty years ago, big data was a few years of time series. As we move through each technology threshold, we need to rethink how big data is defined. And with each generation, new horizons emerge for the use of big data.

How to use big data

First, big data is not information … yet. It's bits and bytes, it's signal and it's noise. It is not smart unless you make it smart. And not all of it is worth keeping. I was at a conference recently where someone told the crowd not to worry about figuring out what data you need ahead of time: just collect it all.

I disagree. Although the cost of storing and processing data keeps coming down, it’s still a factor, and the cost may not come down at the same rate as data expansion.

In fact, much of the data that is indiscriminately collected is likely to be bad: inaccurate, fraudulent or both. Rather than taking in everything, determine which data does not meet your criteria, filter it out, and keep only what you need.

Ideally you build infrastructure where you can add to the feature set of your data collection and storage without having to rebuild the system or change code. Then you can iterate your usage as your business needs change.

Example A: As an advertising technology company, we filter out over 40% of our daily incoming 50 billion events. In addition, we create small samplers where we can look at all the data from a short period of time to see whether there is new or existing metadata that is now relevant to our business strategy.
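The filter-then-sample approach above can be sketched as follows. This is a minimal illustration, not Dstillery's actual pipeline: the field names, criteria, and bot flag are hypothetical stand-ins for whatever rules a real system would apply.

```python
# Hypothetical event filter plus a "small sampler": field names and
# acceptance rules are illustrative, not any company's real logic.
REQUIRED_FIELDS = {"timestamp", "event_type", "device_id"}

def meets_criteria(event):
    """Reject events missing required fields or flagged as bot traffic."""
    return REQUIRED_FIELDS <= event.keys() and not event.get("is_bot", False)

def filter_events(events):
    """Drop everything that does not meet our criteria up front."""
    return [e for e in events if meets_criteria(e)]

def time_window_sample(events, start, end):
    """Keep *all* events from a narrow time window, so every field --
    including metadata we don't yet use -- can be inspected in full."""
    return [e for e in events if start <= e["timestamp"] < end]

events = [
    {"timestamp": 10, "event_type": "click", "device_id": "a1"},
    {"timestamp": 11, "event_type": "view", "device_id": "b2", "is_bot": True},
    {"timestamp": 12, "event_type": "view"},  # missing device_id
]
kept = filter_events(events)                 # only the first event survives
window = time_window_sample(events, 10, 12)  # full detail for a small window
```

The point of the sampler is that it is exhaustive over a tiny slice rather than sparse over everything, so previously ignored metadata is still visible when business needs change.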

Storing data just to store it can be toxic. Why? Because a lot of data qualifies as personally identifiable information (PII) and can be inadvertently captured if:

  • You are not careful with what you are collecting.
  • You do not pay attention to data retention laws or advisories of a particular country.
  • You are receiving sensitive information in an unencrypted form.
  • You do not create a privacy-by-design collection policy.

Example B: Prior to the iOS IDFA and the Google Advertising ID, the MAC address, the UDID or the Android Device ID were used as identifiers for collecting data from smartphones. The issue? These identifiers are hard-coded to the phone and were often collected unencrypted, and the customer had no ability to opt out or change the ID. Solutions built on this information instead of an agnostic, resettable ID created havoc for those companies.

Example C: A few years ago, when desktop publishers started providing mobile ad impressions, I had at least one provider sending me PII in the mobile web URL. This was a result of not scrubbing key/value pairs appropriately or not encrypting data that was unnecessary for bidding. I notified the provider and also made changes on my system to prevent storage of that type of data.
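A defensive scrub on the receiving side might look like the sketch below. The blocklist of sensitive keys is hypothetical; a real system would maintain it according to its privacy policy and the retention laws of each country it operates in.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical blocklist of query-string keys that may carry PII.
SENSITIVE_KEYS = {"email", "name", "phone", "ssn"}

def scrub_url(url):
    """Drop sensitive key/value pairs from a URL's query string
    before the request is logged or stored."""
    parts = urlsplit(url)
    clean = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SENSITIVE_KEYS]
    return urlunsplit(parts._replace(query=urlencode(clean)))

scrubbed = scrub_url("https://pub.example.com/ad?slot=top&email=jane%40x.com")
# sensitive pair removed; benign pairs kept
```

Scrubbing at ingestion, before anything hits disk, is what turns "privacy by design" from a policy statement into an enforced property of the system.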

How to leverage big data

Now that you are managing the collection and volume of the data, let’s explore how to make data smart. Data doesn’t give you answers by itself. First, you need to figure out what question you are trying to answer, and what data will help you answer it. The most time-consuming part of data science is setting up the data.

Setting up the data includes determining:

  • Which pieces of the data to use – is it location, an action, a description of content?
  • In an integration with another data set, whether to use a probabilistic algorithm or a deterministic variable. What key value will you use to integrate?
  • Whether there is bias in the way that the data is collected. Bias means that the data may not be representative of true behavior, and is often driven by fraudulent or inaccurate data.
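To make the deterministic side of that choice concrete, here is a minimal sketch of an exact-key integration between two hypothetical data sets sharing a device ID. The data and field names are invented for illustration; a probabilistic match would instead score likely identity links rather than require equality.

```python
# Two hypothetical data sets that share a deterministic key (device_id).
impressions = [
    {"device_id": "a1", "segment": "auto"},
    {"device_id": "b2", "segment": "travel"},
    {"device_id": "zz", "segment": "food"},   # no match in the other set
]
locations = {"a1": "NYC", "b2": "SF", "c3": "LA"}

def deterministic_join(rows, lookup, key="device_id"):
    """Attach a location only where the key matches exactly;
    unmatched rows are dropped rather than guessed at."""
    return [{**row, "location": lookup[row[key]]}
            for row in rows if row[key] in lookup]

joined = deterministic_join(impressions, locations)
```

The trade-off is coverage versus certainty: deterministic joins discard unmatched rows, while probabilistic approaches recover more coverage at the cost of possible mismatches.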

Once you set up the data, you may create insights directly or you may need to apply an algorithm to derive additional results. But either way, caveat emptor: just because you have results doesn't mean you have correct insights or metrics. If the results seem suspicious, check the data collection, the bias and the algorithm. You can further test by repeating the analysis on different data sets.

Example D: At some point a few years ago, our audience models started performing at unbelievably high rates of conversion. Given that we had not made changes to our models, we delved into the data to see if the results were true behavior or an anomaly. We discovered the issue of fraud and how the bad data could generate incorrect insights. Once we eliminated the fraudulent non-human data, our model results returned to normal.
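A simple guardrail for catching that kind of "too good to be true" result is to compare a metric against its historical baseline and flag large deviations for audit. The numbers and threshold below are illustrative, not the actual models or rates involved.

```python
from statistics import mean, stdev

def flag_anomaly(history, current, z_threshold=3.0):
    """Flag a metric that deviates too far from its historical baseline.
    A sudden, unbelievably good conversion rate is treated as suspect
    until the underlying data is audited (e.g., for non-human traffic)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# Illustrative daily conversion rates, not real campaign data.
history = [0.011, 0.012, 0.010, 0.013, 0.011]
flag_anomaly(history, 0.012)  # a normal day is not flagged
flag_anomaly(history, 0.09)   # a suspicious spike is flagged
```

The check does not say *why* the number moved; it only forces the question, which is exactly the common-sense test the fraud episode above illustrates.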

As our data world continues to explode, you can make the most of the data signal and avoid the noise by adhering to two principles:

  1. Start small with your data, using only what you need, and
  2. Embrace privacy by design.

With these two principles as a foundation, iterate your data until it is smart and make sure your insights pass the common sense test. We live and work in an era that grants us an amazing ability to harness the powers of big data and capture patterns and insights that are not apparent to the human eye. Now go for it!

About the Author: Lauren Moores is VP of strategy at Dstillery. She is their mobility and cross-channel data expert, bringing more than 20 years of experience in information and technology. She currently focuses on driving Dstillery’s growth by evangelizing the company’s technology, building relationships with clients and partners and connecting to thought leaders. Lauren serves on the CTIA Innovation Council and is frequently a pre-screener for the MMA Smarties Awards and ARF’s David Ogilvy Awards. Lauren holds a PhD in Economics from Brown University and is an adjunct professor at the Steinhardt School for Culture, Education and Human Development at New York University.