IBM’s Data Science for Apache Spark

The Silicon Review
13 June, 2016

Data science is the practice of extracting knowledge or insights from data. IBM’s data science offering promises quicker access to more data and faster analytics in near real time. IBM recently announced the Data Science Experience, a cloud-based development environment for Apache Spark. The new environment will include 250 curated data sets, open source tools and a shared workspace on IBM’s Bluemix cloud platform.

Big Blue has committed $300 million to Apache Spark, aiming to develop it as an operating system for analytics. Spark was originally developed at the University of California, Berkeley’s AMPLab and later donated to the Apache Software Foundation. The Data Science Experience is intended to enhance Apache Spark with faster computation, wider data access and greater platform flexibility.

Aiming to improve the efficiency of data science applications built on Apache Spark, IBM is already approaching various companies for their input.

“With Apache Spark, we see an opportunity to significantly transform the role of the data scientist by providing access to curated data sets, open source tools, and a collaborative platform to accelerate innovation,” said Bob Picciano, Senior Vice President, IBM Analytics, in the statement.

IBM’s data science offering is particularly useful to data scientists who work in the R programming language. R is an open source language widely used by data scientists for statistical computing. IBM is working to support R through SparkR, Spark SQL and Apache Spark ML.

The Spark framework is widely used across IBM’s software, including its Watson, Analytics, Systems, Cloud and Commerce offerings. The company open-sourced its SystemML machine learning technology to advance Spark’s machine learning capabilities. Together with Spark, the Data Science Experience is meant to help data scientists explore Big Data analytics.