Q&A: Hortonworks and IBM double down on Hadoop


Hortonworks and IBM recently announced an expanded partnership. The deal pairs IBM’s Data Science Experience (DSX) analytics toolkit and the Hortonworks Data Platform (HDP), with the goal of extending machine learning and data science tools to developers across the Hadoop ecosystem. IBM’s Big SQL, a SQL engine for Hadoop, will be leveraged as well.

InfoWorld Editor at Large Paul Krill recently met with Hortonworks CEO Rob Bearden and IBM Analytics general manager Rob Thomas at the DataWorks Summit conference in Silicon Valley to talk about the state of big data analytics, machine learning, and Hadoop’s standing among the expanding array of technologies available for large-scale data processing.

InfoWorld: What does IBM Data Science Experience bring to the Hortonworks Data Platform?

Thomas: We launched Data Science Experience last year and the idea was we saw a change coming in the data science market. Traditionally, organizations were either SPSS users or SAS users but the whole market was moving toward open languages. We built Data Science Experience on Jupyter. It’s focused on Python data scientists, R, Spark, Scala programmers. You can use whatever language you want.

It’s really an open platform for data science. We focus on the collaboration, how you get data scientists working as a team as part of doing that. Think about Hadoop. Hadoop has had an enormous run in the last five to six years in enterprises. There is a lot of data in Hadoop now. There is not super value for the client by just having data there. Sometimes, there is some cost savings. Where there is super value for the client is they actually start to change how they’re interacting with that data, how they’re building models, discovering what’s happening in there.

InfoWorld: IBM has a well-known experience with . Hortonworks has positioned  and Hadoop as its entrance into the machine learning space. Can you discuss the company’s future plans for machine learning, AI, and data science?

Bearden: It’s going to be through the DSX framework and the IBM platforms that come through that. Hadoop and HDP will continue to be the platform. We’ll leverage some of the other processing platforms collectively like Spark and there’s a tremendous amount of work that IBM’s done to advance Spark. We’ll continue to embody that inside of HDP through YARN but then on top of all of these large data sets, we’ll leverage DSX and the rest of the IBM tool suite. We expressed that DSX and the rest of the tool suite from IBM for machine learning, deep learning, and AI will be our strategic platforms going forward and we’re going to co-invest very deeply to make sure all the integration is done properly. That goes back to being able to bring all resources into a focused distribution so that we can not only innovate horizontally but integrate vertically.

InfoWorld: InfoWorld ran a story late last year claiming that , that other big data infrastructure including Spark, MongoDB, Cassandra, and Kafka were marching past it. InfoWorld asked Hortonworks CTO Scott Gnau a . What can you say about the continued vitality of Hadoop?

? What’s next on the roadmap for it?

Bearden: The notion of containers and having the ability to then take a container-based approach to applications and being able to do that as an extension through YARN is actually part of the roadmap today. We published that and we think that opens up new use cases and applications that can leverage Hadoop.

You go back to the ability to get to existing applications, whether it be fraud detection, money laundering, two of the typical ones that you look at in financial services. Rapid diagnostics in the healthcare world, being able to get to better processing for genomics… analyzing the genome for certain kinds of diseases and being able to take those existing algorithms or applications and moving them over to the data via a container approach. You can do that much cleaner with YARN.

InfoWorld: Is there anything else you want to mention?

Thomas: I’d mention just one more point around data governance. We started working with Hortonworks over the last, oh, 18 months around a project called Atlas. I’d say it’s just coming into form as we’ve both been working with a lot of clients and we view it as a key part of our joint strategy around how we’re going to approach data governance. You use data governance for compliance. You use data governance for insights. There’s a big compliance mandate with things like GDPR (General Data Protection Regulation) that’s happening right now in Europe. I think you’ll see more and more on this topic in the future from us.