How can you build a data-driven organization and spur digital transformation without thinking through who should be responsible for your data? Let’s do that together.
Data engineers and data scientists each occupy critical roles. Data engineers manage the data infrastructure and are in charge of designing, building, and integrating data workflows, pipelines, and the ETL process. Their goal is to provide data for data scientists’ analysis. Data scientists are those who can turn data into insights by applying statistics, machine learning, and analytical approaches. Their goal is to answer critical business questions.
Data-driven organizations require reliable, clean data to function. Without it, your AI, machine learning, and analytics are worthless. Unreliable, erroneous, and incomplete data leads to answers that can’t be trusted—hence, “garbage in, garbage out.”
Therefore, the process of wrangling and cleaning data is crucial. Typically, this is seen as boring, annoying grunt work that people don’t want to do.
However, I think this negative view is at least partly based on a major underappreciation of the significance of such work. Data wrangling and cleaning is not simply about eliminating white spaces, replacing wrong characters, and normalizing dates. Stepping back, these tasks should be viewed in the context of two key objectives:
- Understanding the ecosystem of people, data, and tasks in an organization
- Communicating and documenting that knowledge in order to generate clean and reliable data
Yes, data wrangling and cleaning can take 80% of a data scientist’s time and energy. This does not mean that 80% is wasted. While these tasks can and should be optimized for efficiency, they are part of the vital knowledge work that should be elevated within a data-driven organization. But who should be doing it?
approaches of the 1980s and 1990s. In that world, skills such as knowledge elicitation and knowledge specification were taught and used. These are lost arts in industry today, particularly in the data science context. I believe that revisiting these approaches will be a key part of developing both the instructional curriculum and the tooling needed to support the knowledge scientist.
Organizations that recognize the central importance of clean and reliable data and elevate knowledge work will be at the forefront of digital transformation and will move faster along the path to becoming data-driven. Who are the knowledge scientists in your organization?
This article is published as part of the IDG Contributor Network.