Silicon 70 2020

Simplifying data and AI so you can innovate faster: Databricks


Machine learning is a form of data analysis that learns patterns from previously available data and uses them to predict outcomes or make decisions with little human intervention. Driven by rapid innovation and changing needs, machine learning today is not the machine learning of the past. The terminology has stayed the same for a decade, while the underlying models have evolved. The trend has shifted toward fewer manual requirements, more automation, and better results. With this, the use cases and profitability have also changed. More and more fields are turning to machine learning and AI to solve their problems or become more profitable.

Several new companies are emerging with ideas to push machine learning to its limits, sometimes using AI and machine learning in unison to achieve groundbreaking results. Today, basic feedback is no longer enough for a company to stay competitive; instead, it must continually review itself, its working fundamentals, and its plans for future innovation and expansion.

Databricks is one of the innovative companies that aims to leverage ML and AI to benefit businesses by unifying data science, engineering, and business. Interestingly, it was founded by the same team that created Apache Spark. It provides a Unified Analytics Platform for data science teams to collaborate with data engineering and lines of business to build data products. The platform is an ideal solution for businesses in the financial services, energy and utilities, advertising and marketing, enterprise technology software, public sector, telecom, healthcare and life science, manufacturing and industrial, retail, internet of things, and media and entertainment industries.

Overview of Databricks Benefits

Databricks is a solution built to unify operations and help you focus on solving actual business problems. It is designed natively for the cloud and built on top of its proprietary platform, which helps it run smoothly with Apache Spark. It lets you unify analytics on the reliable Apache Spark engine. Apache Spark is known as a revolutionary open-source processing engine that stands out in the industry for its ease of use, speed, and sophisticated analytics.

Databricks is also a fully managed platform that removes much of the complexity of machine learning and big data. It uses the unified Spark engine, which comes with higher-level libraries and support for streaming data, SQL queries, graph processing, and machine learning. These libraries boost developer productivity and can be combined seamlessly into complex workflows.

The platform also provides collaborative workspaces that allow stakeholders to create data pipelines in multiple languages (including Python, R, SQL, and Scala) and to prototype and train machine learning models. The interactive workspaces offer plenty of point-and-click visualizations as well as scriptable options including D3, ggplot, and matplotlib.

Databricks secures data at all levels thanks to its unified security model. The security options include fine-grained, role-based access controls, identity management, data encryption, support for compliance standards, and rigorous auditing.

Reliable data engineering

Data reliability is an important issue for data pipelines. Failed jobs can corrupt and duplicate data with partial writes. Multiple data pipelines reading and writing concurrently to your data lake can compromise data integrity. “Delta Lake” is an open-source storage layer for your existing data lake that uses versioned Apache Parquet™ files and a transaction log to keep track of all data commits and deliver reliability capabilities to Spark. ACID transactions ensure that multiple data pipelines can simultaneously read and write data reliably on the same table. Schema Enforcement ensures data types are correct and required columns are present, and Schema Evolution allows these requirements to change as data changes.
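To make the idea of a transaction log concrete, here is a minimal sketch in plain Python. It is not Delta Lake's implementation or API: the ToyTable class, its JSON files, and the "_log" directory are hypothetical names chosen for illustration. It assumes only what the paragraph above describes: data files become visible when a numbered commit is written to a log, writers cannot claim the same version twice, and rows that do not match the schema are rejected before anything is written.

```python
import json
import os
import tempfile
import uuid


class ToyTable:
    """Toy table: data files plus an ordered commit log (illustration only)."""

    def __init__(self, path, schema):
        self.path = path
        self.schema = schema  # column name -> required Python type
        self.log_dir = os.path.join(path, "_log")  # hypothetical log location
        os.makedirs(self.log_dir, exist_ok=True)

    def _validate(self, rows):
        # Schema enforcement: reject missing/extra columns and wrong types.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for name, expected in self.schema.items():
                if not isinstance(row[name], expected):
                    raise TypeError(f"column {name!r} expects {expected.__name__}")

    def version(self):
        # Latest committed version; -1 means no commits yet.
        return len(os.listdir(self.log_dir)) - 1

    def append(self, rows):
        self._validate(rows)
        next_version = self.version() + 1
        # Write the data first under a unique name; it is invisible until committed.
        data_name = f"part-{uuid.uuid4().hex}.json"
        with open(os.path.join(self.path, data_name), "w") as f:
            json.dump(rows, f)
        # Commit point: mode "x" raises FileExistsError if another writer
        # has already claimed this version number.
        commit_path = os.path.join(self.log_dir, f"{next_version:05d}.json")
        with open(commit_path, "x") as f:
            json.dump({"add": data_name}, f)
        return next_version

    def read(self, version=None):
        # Readers only see files named by committed log entries,
        # so a partial write from a failed job is never returned.
        latest = self.version() if version is None else version
        rows = []
        for v in range(latest + 1):
            with open(os.path.join(self.log_dir, f"{v:05d}.json")) as f:
                entry = json.load(f)
            with open(os.path.join(self.path, entry["add"])) as f:
                rows.extend(json.load(f))
        return rows


table = ToyTable(tempfile.mkdtemp(), {"id": int, "name": str})
table.append([{"id": 1, "name": "a"}])
table.append([{"id": 2, "name": "b"}])
print(table.read(version=0))  # rows as of the first commit
print(table.read())           # rows as of the latest commit
```

Because readers replay the log rather than listing the data directory, an older version of the table can be read simply by stopping the replay early, which is the role the versioned files play in the description above.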

Analytics for your data

With managed MLflow, you can automatically track experiments from any framework and log the parameters, results, and code version for each run. It also lets you securely share, discover, and visualize all experiments across workspaces, projects, or specific notebooks, spanning thousands of runs and multiple contributors. You can compare results with search, sort, filter, and advanced visualizations to find the best version of your model, and quickly go back to the exact version of the code that produced a specific run.
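The workflow described above (log parameters, results, and code version per run, then search, filter, and sort to find the best one) can be sketched in a few lines of plain Python. This is a stand-in for illustration, not the MLflow API: the ToyTracker class, its method names, and all the parameter and metric values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    run_id: int
    code_version: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


class ToyTracker:
    """In-memory experiment log (hypothetical, for illustration only)."""

    def __init__(self):
        self.runs = []

    def log_run(self, code_version, params, metrics):
        # Record everything needed to reproduce and compare this run.
        run = Run(len(self.runs), code_version, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def search(self, metric, min_value=None, **param_filters):
        # Filter by exact parameter values and an optional metric floor,
        # then sort so the highest metric value comes first.
        matches = [
            r for r in self.runs
            if metric in r.metrics
            and all(r.params.get(k) == v for k, v in param_filters.items())
            and (min_value is None or r.metrics[metric] >= min_value)
        ]
        return sorted(matches, key=lambda r: r.metrics[metric], reverse=True)


# Made-up runs: two logistic regression settings and one decision tree.
tracker = ToyTracker()
tracker.log_run("abc123", {"model": "logreg", "C": 0.1}, {"accuracy": 0.81})
tracker.log_run("abc123", {"model": "logreg", "C": 1.0}, {"accuracy": 0.86})
tracker.log_run("def456", {"model": "tree", "max_depth": 4}, {"accuracy": 0.84})

best = tracker.search("accuracy")[0]
print(best.run_id, best.code_version, best.params)
```

Storing the code version next to each run is what makes it possible to go back from the best result to the code that produced it.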

Collaborative data science

With Databricks, you can collaboratively write code in Python, R, Scala, and SQL, explore data with interactive visualizations, and discover new insights with Databricks notebooks. You can confidently and securely share code with co-authoring, commenting, automatic versioning, GitHub integrations, and role-based access controls. You can also keep track of all experiments and models in one place, capture knowledge, publish dashboards, and facilitate hand-offs with peers and stakeholders across the entire workflow, from raw data to insights.

Production machine learning

Databricks gives you one-click access to ready-to-use, optimized machine learning environments that include the most popular frameworks, such as scikit-learn, XGBoost, TensorFlow, and Keras. You can also migrate and customize ML environments with Conda. Simplified scaling on Databricks helps you go from small to big data, so that you are not limited by your PC's specifications.

The ML Runtime provides built-in AutoML capabilities, including hyperparameter tuning and model search, to help accelerate the data science workflow. For example, it speeds up training with built-in optimizations for the most commonly used algorithms and frameworks, including Logistic Regression, Tree-based Models, and GraphFrames.
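As a rough illustration of what hyperparameter tuning automates, the sketch below runs a plain grid search in standard Python. It makes no claim about how the ML Runtime does this internally: the model (a one-variable ridge regression), the function names, the data points, and the candidate values are all assumptions made for the example.

```python
from itertools import product


def fit_ridge_1d(xs, ys, lam):
    # Closed-form fit of y ~ w * x with an L2 penalty of strength lam.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    # Mean squared error of the fitted slope on a data set.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, valid, grid):
    # Try every combination in the grid and keep the one with the
    # lowest error on the held-out validation data.
    best = None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        w = fit_ridge_1d(train[0], train[1], **params)
        score = mse(w, valid[0], valid[1])
        if best is None or score < best[0]:
            best = (score, params, w)
    return best


# Made-up data for illustration.
train = ([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
valid = ([5, 6], [10.1, 11.9])
grid = {"lam": [0.0, 0.1, 1.0, 10.0]}

score, params, w = grid_search(train, valid, grid)
print(params, round(w, 3), round(score, 5))
```

The point of the sketch is the loop, not the model: each candidate setting is trained on one data set and scored on another, and the setting with the best held-out score is kept.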

Meet the Expert

Ali Ghodsi, Co-founder and Chief Executive Officer: Ali is responsible for the growth and international expansion of the company. He previously served as the VP of Engineering and Product Management before taking the role of CEO in January 2016. In addition to his work at Databricks, Ali serves as an adjunct professor at UC Berkeley and is on the board of UC Berkeley’s RISELab. Ali was one of the creators of the open-source project Apache Spark, and ideas from his academic research in the areas of resource management, scheduling, and data caching have been applied to Apache Mesos and Apache Hadoop. Ali received his MBA from Mid-Sweden University in 2003 and his PhD from KTH/Royal Institute of Technology in Sweden in 2006 in the area of Distributed Computing.