A recent WSJ article echoes an FTC report released last Wednesday warning of the possible consequences of bias in Big Data applications. The article identifies a number of valid concerns around privacy, equal opportunity, and accuracy. It also rightly hints at possible positive consequences.
For example, the article cites cases where people judged poor credit risks by conventional means may receive loans as a result of big-data techniques. That is good news for those people, and time will tell whether the lenders identified an underserved but viable market, or whether bias simply led them to make a poor investment. All models, including traditional analyses, will have error, and ultimately we want to reduce both false positives and false negatives.
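To make the false-positive/false-negative tradeoff concrete, here is a minimal sketch of a toy credit-scoring decision at two thresholds. All names, scores, and labels are invented for illustration; this is not from the article or the report.

```python
# Hypothetical illustration: counting false positives and false negatives
# for a toy credit-scoring model at two decision thresholds.
# A "false positive" here is an approved loan that later defaults;
# a "false negative" is a denied applicant who would have repaid.

def error_counts(scores, labels, threshold):
    """Approve when score >= threshold; labels: 1 = repaid, 0 = defaulted."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]   # model scores (invented)
labels = [1,   1,   0,   1,   1,   0,   0]      # actual outcomes (invented)

for t in (0.5, 0.65):
    fp, fn = error_counts(scores, labels, t)
    print(f"threshold={t}: false positives={fp}, false negatives={fn}")
```

Raising the threshold here trades false negatives for false positives: the stricter lender turns away more people who would have repaid, which is exactly the population a better-calibrated model might serve.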
So we know that our models will have error, and the theme of the article is that a significant part of that error comes in the form of bias. It is a poor assumption that an analysis of any single data set – social media is a popular case – represents the whole population. Do people of all ages, nationalities, races, and income levels use social media in the same proportion as the general population? Probably not.
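A quick way to see that kind of sampling bias is to compare group proportions in the data set against the population. The sketch below does this for age groups; every number is invented for illustration only.

```python
# Hypothetical illustration: does a social-media sample match the age
# distribution of the general population? All proportions are invented.

population = {"18-29": 0.20, "30-49": 0.33, "50-64": 0.25, "65+": 0.22}
sample     = {"18-29": 0.42, "30-49": 0.35, "50-64": 0.16, "65+": 0.07}

for group in population:
    # Ratio of 1.0 means the group appears in the sample at its
    # population rate; above/below 1.0 means over-/under-representation.
    ratio = sample[group] / population[group]
    note = "over-represented" if ratio > 1 else "under-represented"
    print(f"{group}: sample/population = {ratio:.2f} ({note})")
```

Any conclusion drawn from this sample leans heavily on the under-30 crowd, which is precisely the bias the next section tries to turn to advantage.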
So how can we make this work to advantage?
- Consider bias as a first-cut classification. A common application of big data techniques is to classify large numbers of people into specific, targeted subgroups. We get our first coarse-grained categorization for free.
- Use the bias to select additional, complementary data sets. If you understand the bias in your current data set, then you can strategically select additional data sets that give the best bang for the buck in an effort to analyze the broader population. Calibrate your aggregate model by combining complementary data sets.
- Monitor production models. As the article observes, blind trust in correlations can be dangerous. Still, correlations can represent opportunities to be exploited. The key to safe use without a solid understanding of root cause is to assume those opportunities are temporary. Monitor their performance, and sound the alarm as soon as the results begin to deviate from expectations.
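The last point, monitoring a deployed model for deviation from expectations, can be sketched in a few lines. The metric, baseline, and tolerance below are all invented placeholders; a real deployment would pick them from offline validation.

```python
# Hypothetical sketch: compare a deployed model's weekly default rate
# against the rate seen during validation, and raise an alarm when the
# observed rate drifts beyond a tolerance. Values are invented.

EXPECTED_DEFAULT_RATE = 0.05   # baseline from offline validation (assumed)
TOLERANCE = 0.02               # acceptable absolute deviation (assumed)

def check_drift(weekly_defaults, weekly_loans):
    """Return (observed rate, True if the rate has drifted out of tolerance)."""
    rate = weekly_defaults / weekly_loans
    drifted = abs(rate - EXPECTED_DEFAULT_RATE) > TOLERANCE
    return rate, drifted

rate, drifted = check_drift(weekly_defaults=18, weekly_loans=200)
print(f"observed default rate: {rate:.3f}, drift alarm: {drifted}")
```

Treating the exploited correlation as temporary means this check runs every week, not once: the moment the alarm fires, the correlation has earned renewed scrutiny rather than continued trust.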
George Box had it right: all models are wrong, but some are useful!
Here is a hypothetical example of complementary data sets, for illustration only.