I read David Linthicum’s post with great interest. A huge reason I decided my next job would be at a search company was this very problem. (That’s why I now work for LucidWorks, which produces search tools.) While working with clients, I realized that with big data and the cloud, an already tough problem, finding things, was getting worse. I had seen the meltdown coming as Hadoop became yet another data silo and, as a result, produced few actual insights.
Part of the problem is that the technology industry is driven by trends rather than by problems to solve. A few years ago, it was all about client/server under the guise of distributed computing à la Enterprise JavaBeans, followed by web services and then big data. Now it is all about machine learning. Many of these steps were important, and machine learning is an important tool for solving problems.
We lost indexing and search as big data emerged
But sadly, the most important problem-solving trend got lost in the shuffle: indexing and search.
The modern web began with search. The web would be a lot smaller if Yahoo and the search portals of the late 1990s had triumphed. The dot-com bomb happened, and yet Google was born from its ashes. Search also birthed big data and arguably the modern machine learning trend. Google, Facebook, and other companies needed more ways to handle their indexing jobs and their large amounts of data distributed at internet scale. Meanwhile, they needed better ways to find and organize data after they ran up against the limits of crowdsourcing and human intelligence.
What we need to do is redefine “integration” for the distributed and cloud computing era.
Data integration used to mean just that: grabbing all the data and dumping it into a big, fat, single area. First this was with databases, then data warehouses, and then Hadoop. Ironically, we moved further away from indexed technology when doing this.
Now, integration must mean that we can index and find the data where it lives, deduplicate it, and derive a result. To find a single source of truth, we need to capture timestamps and source IDs.
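To make the role of timestamps and source IDs concrete, here is a minimal sketch in Python. It assumes every indexed record carries a shared business key, the ID of the system it lives in, and a last-updated timestamp. The names `IndexedRecord` and `resolve_latest`, and the "most recent copy wins" rule, are illustrative assumptions, not part of any particular search product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IndexedRecord:
    key: str              # hypothetical business key shared across systems
    source_id: str        # where the record lives, e.g. "crm-cloud"
    updated_at: datetime  # when that source last changed the record
    payload: dict         # the fields captured in the index


def resolve_latest(records):
    """Collapse duplicates by key, keeping the most recently updated copy.

    Returns {key: (winning_record, sorted list of every source ID holding that key)}.
    Assumption: the newest timestamp is treated as the source of truth.
    """
    winners = {}
    sources = {}
    for rec in records:
        # Remember every system that holds a copy, so the data can still be
        # reached where it lives.
        sources.setdefault(rec.key, set()).add(rec.source_id)
        current = winners.get(rec.key)
        if current is None or rec.updated_at > current.updated_at:
            winners[rec.key] = rec
    return {key: (winners[key], sorted(sources[key])) for key in winners}
```

Given one copy of a customer record from an on-premises system and a newer copy from a cloud system, `resolve_latest` would return the cloud copy as the winner along with both source IDs. A real deployment might prefer a different rule, such as ranking sources by trust rather than by recency.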
To integrate, we need a single search solution that can reach our on-premises data and our cloud data. The worst thing we can do is deploy a search tool that only searches one source of data, serves only one use case, or can’t be used behind our firewall.
In the cloud era, we need to look to search as the glue that lets us find the data and analyze it together, no matter where it lives. We can’t just dump everything into one place; we need tools that let us get to exactly the right data where it lives.