During last year’s Strata + Hadoop World conference, Microsoft embraced Linux in its push to lure enterprises onto its growing portfolio of cloud-based big data offerings. This year, Microsoft is banking on extensive support for R, the statistical computing language that’s popular among data scientists and developers for big data application integration and self-service data discovery.
Microsoft announced on March 29 a preview version of R Server for Azure HDInsight, Microsoft’s cloud-based distribution of the Hadoop big data processing platform. After last year’s acquisition of Revolution Analytics, the leading commercial sponsor for the open-source R language, Microsoft R Server has effectively replaced Revolution R Enterprise. “It runs the most comprehensive set of [machine-learning] algorithms and statistical functions in the cloud, leveraging Hadoop and Spark,” Joseph Sirosh, corporate vice president of Microsoft’s Data Group, said in a statement.
“By making it available as a workload running inside HDInsight, we remove obstacles for users to unlock the power of R, eliminating memory and processing constraints and extending analytics from the laptop to large multi-node Hadoop and Spark clusters,” Sirosh’s statement said.
For large-scale big data applications, Microsoft revealed that it had upgraded Spark for Azure HDInsight to Apache Spark version 1.6. Since Spark was open sourced in 2010, enterprises have been flocking to it because of its remarkably fast data processing engine, sophisticated analytics capabilities and comparative ease of use. Proponents assert that Spark can accelerate big data processing workloads by a factor of 10 to 100. In Microsoft’s latest cloud-based implementation, the company has been able to achieve a 10x improvement in streaming state management performance, according to Sirosh. The update also includes new machine learning algorithms and automatic memory management.
Azure Data Catalog is set to become generally available on March 30, Sirosh also confirmed. The service allows businesses to maintain a metadata repository, which in turn enables self-service data discovery, a crucial early step in helping developers, business analysts and data scientists unearth insights hidden in their organizations’ data. Data Catalog supports a wide range of enterprise data sources, including SQL Server, Oracle, Teradata and SAP HANA, and of course, Microsoft’s own Azure Data Lake and Storage Blobs. It features built-in support for Power BI Desktop, SQL Server Data Tools and Excel. It is available as part of the Cortana Analytics suite and as a standalone product in free and standard editions.