Microsoft’s R tools bring data science to the masses


One of Microsoft’s more interesting recent acquisitions was Revolution Analytics, a company that built tools for working with big data problems using the open source R programming language. Mixing an open source model with commercial tools, Revolution Analytics offered a range of tools supporting academic and personal use, alongside software that took advantage of massive amounts of data, including data stored in Hadoop. Under Microsoft’s stewardship, the now-renamed R Server has become a bridge between on-premises and cloud data.

Two years on, Microsoft has announced a set of major updates to its R tools. The R programming language has become an important part of its data strategy, with support in Azure and SQL Server—and, more important, in its Azure Machine Learning service, where it can be used to preprocess data before delivering it to a machine learning pipeline. It’s also one of Microsoft’s key cross-platform server products, with versions for both Red Hat Linux and Suse Linux.

R is everywhere in Microsoft’s ecosystem

Outside of Microsoft, the open source R has become a key tool for data science, with a lot of support in academic environments. (It currently ranks fifth among all languages, according to the IEEE.) You don’t need to be a statistical expert to get started with R, because the Comprehensive R Archive Network (CRAN) now has more than 9,000 statistical modules and algorithms you can use with your data.

Microsoft’s vision for R is one that crosses the boundaries between desktop, on-premises servers, and the cloud. Locally, there’s the free Microsoft R Open, as well as R support in Microsoft’s (paid) flagship Visual Studio development environment. On-premises, there’s R Server, as well as R support in SQL Server, giving you access to statistical analysis tools alongside your data. Local big data services based on Hadoop and Spark are also supported, while in the cloud R runs alongside Microsoft’s HDInsight services.

Even so, you need a deep knowledge of statistical analytics to get the most from R. It’s been a long while since I took college-level statistics classes, so I found getting started with R complex because many of the underlying concepts require graduate-level understanding of complex statistical functions. The question isn’t so much whether you can write R code; it’s whether you can understand the results you’re getting.

That’s probably the biggest issue facing any organization that wants to work with big data: getting the skills needed to produce the analysis you want and, more important, to interpret the results you get. R certainly helps here, with built-in graphing tools that help you visualize key statistical measures.
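
To give a flavor of those built-in summaries and graphs, here is a minimal sketch in base R. It uses the mtcars sample data set that ships with R, which is my choice for illustration and not something drawn from Microsoft’s tools or demos. Writing the code is the easy part; knowing what a correlation coefficient or a skewed histogram is telling you is where the statistics background comes in.

```r
# Illustrative only: mtcars is a sample data set bundled with base R.
data(mtcars)

# Minimum, quartiles, median, mean, and maximum for fuel economy.
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Correlation between car weight and fuel economy.
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
print(wt_mpg_cor)

# Built-in graphics: a histogram, then a scatter plot with a fitted line.
hist(mtcars$mpg, main = "Fuel economy", xlab = "Miles per gallon")
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1,000 lbs)", ylab = "Miles per gallon")
abline(lm(mpg ~ wt, data = mtcars))
```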

Working with Microsoft R Server

The free Microsoft R Open can help your analytics team get up to speed with R before investing in any of the server products. It’s also a useful tool for quickly trying out new analytical algorithms and exploring the questions you want answered using your data. That approach works well as part of an overall analytics lifecycle, starting with data preparation, moving on to model development, and finally turning the model into tools that can be built into your business applications.
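
That lifecycle can be sketched in a few lines of plain R. This is a toy example under my own assumptions, not Microsoft’s workflow: it uses the bundled mtcars sample data, a simple linear regression stands in for a real model, and `predict_mpg` is a hypothetical helper name.

```r
# 1. Data preparation: keep the columns of interest, drop incomplete rows.
cars <- na.omit(mtcars[, c("mpg", "wt", "hp")])

# 2. Model development: fit a simple linear regression as a stand-in model.
model <- lm(mpg ~ wt + hp, data = cars)

# 3. Turn the model into a tool: wrap it in a function an application
#    could call. predict_mpg is a hypothetical name for illustration.
predict_mpg <- function(weight, horsepower) {
  new_car <- data.frame(wt = weight, hp = horsepower)
  unname(predict(model, newdata = new_car))
}

# Estimated fuel economy for a 3,000 lb car with 110 horsepower.
estimate <- predict_mpg(3.0, 110)
print(estimate)
```

Everything here runs in open source R, so code like this should also run in R Open; the server products come into play once the data outgrows a single machine.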

One interesting role for R is alongside GPU-based machine-learning tools. Here, R is employed to help train models before they’re used at scale. Microsoft is bundling its own machine learning algorithms with the latest R Server release, so you can test a model before uploading it to either a local big data instance or to the cloud. During a recent press event, Microsoft demonstrated this approach with astronomy images, training a machine-learning-based classifier on a local server with a library of galaxies before running the resulting model on cloud-hosted GPUs.