IDG Contributor Network: AI: the challenge of data


In the last few years, AI has made breathtaking strides, driven by advances in machine learning and in deep learning in particular. Deep learning is a subfield of machine learning, the discipline concerned with giving computers the ability to learn without being explicitly programmed, and it has had some incredible successes.

Arguably, the modern era of deep learning can be traced back to the ImageNet challenge in 2012. ImageNet is a database of millions of images categorized using nouns such as “strawberry,” “lemon,” and “dog.” In that year's challenge, a convolutional neural network (CNN) achieved an error rate of about 16 percent; before that, the best algorithms had managed only about 25 percent.

One of the biggest challenges of deep learning is its need for training data. Large volumes of data are needed to train networks to do even the most rudimentary things, and that data must be relatively clean if the resulting networks are to have any meaningful predictive value. For many organizations, this makes machine learning impractical. The challenge is not just the mechanics of creating neural networks (although that is itself a hard task), but also organizing and structuring enough data to do something useful with it.

There is an abundance of data in the world: more than 180 zettabytes are predicted by 2025 (1 zettabyte is 10^21 bytes, a 1 followed by 21 zeros). Ninety-nine percent of the world's data has not yet been analyzed, and more than 80 percent of it is unstructured, so there are plenty of opportunities and hidden gems in the data we are collecting. Sadly, however, much of this data is not in any state to be analyzed.
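
To give a sense of that scale, here is a minimal sketch of the unit arithmetic in Python. It assumes decimal (SI) units, where a zettabyte is 10^21 bytes and a terabyte is 10^12 bytes. The function and constant names are illustrative only, and the 180 figure is simply the prediction quoted above.

```python
# Decimal (SI) units are assumed: 1 zettabyte = 10**21 bytes, 1 terabyte = 10**12 bytes.
BYTES_PER_ZETTABYTE = 10**21
BYTES_PER_TERABYTE = 10**12


def zettabytes_to_bytes(zb: int) -> int:
    """Convert a whole number of zettabytes to bytes."""
    return zb * BYTES_PER_ZETTABYTE


def zettabytes_to_terabytes(zb: int) -> int:
    """Convert a whole number of zettabytes to terabytes."""
    return zettabytes_to_bytes(zb) // BYTES_PER_TERABYTE


predicted_zb = 180  # the 2025 prediction quoted in the article
total_bytes = zettabytes_to_bytes(predicted_zb)
total_tb = zettabytes_to_terabytes(predicted_zb)

print(f"{predicted_zb} ZB = {total_bytes:,} bytes")
print(f"{predicted_zb} ZB = {total_tb:,} TB")
```

In other words, 180 zettabytes works out to 180 billion terabytes, which helps explain why so little of that data has been looked at so far.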