November Edition 2020

Qubole – Providing a secure, cloud-agnostic platform for Big Data Analytics built on Amazon Web Services, Microsoft Azure, and Google Cloud


A data lake is a system or repository that stores data in its raw format alongside transformed, trusted datasets, and provides both programmatic and SQL-based access to this data for diverse analytics tasks such as machine learning, data exploration, and interactive analytics. The data stored in a data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video). A data lake in which data is stored in an open format and accessed through open, standards-based interfaces is called an Open Data Lake. This open philosophy, aimed at preventing vendor lock-in, permeates every aspect of the system, including data storage, data management, data processing, operations, data access, governance, and security.

Qubole is the open data lake company that provides a simple and secure data lake platform for machine learning, streaming, and ad-hoc analytics. Qubole’s platform provides end-to-end data lake services such as cloud infrastructure management, data management, continuous data engineering, analytics, and machine learning with near-zero administration. It is trusted by leading brands such as Expedia, Disney, Oracle, Gannett, and Adobe to spur innovation and to transform their businesses for the era of big data.

Next-generation solutions

Ad-Hoc Analytics: Analytics teams must balance accessibility and performance tradeoffs for a growing set of users. They face infrastructure, data management, and data ingestion delays before they can provide valuable insights on specific issues, and they often have to build data pipelines manually every time they need a proof of concept for a new type of regular report. Qubole addresses these challenges with a self-service, scalable ad-hoc analytics solution with automated data pipelines. Analysts can get started with easy-to-use SQL interfaces that work the way they want, and can discover insights, query data, analyze results, and debug queries from a single pane of glass, the Qubole Workbench. They can also leverage built-in connectors or JDBC and ODBC drivers with popular BI tools like Looker and Tableau to visualize the data.

Streaming Analytics: Teams can build streaming data pipelines to capture the benefits of real-time data for machine learning and ad-hoc analytics. Qubole Pipelines Service is a stream processing service that addresses real-time ingestion, decision, machine learning, and reporting use cases. Stream processing pipelines are complex to build and take significant time, so the service provides a built-in accelerated development cycle. It also targets three further challenges: observability and data quality are hard to achieve at scale, high performance at low cost is difficult to attain, and data inconsistencies require constant clean-up of small files, which creates file management overhead. Small files are compacted using Qubole ACID capabilities without blocking read/write operations.
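To make the small-file problem concrete, here is a minimal sketch of how a compaction job might decide which files to merge. It is not Qubole's implementation: the function name `plan_compaction`, its parameters, and the greedy first-fit strategy are all assumptions chosen for illustration. It covers only the planning step; atomically swapping the merged file in without blocking readers and writers, which is what the ACID layer provides, is not shown.

```python
from typing import List, Tuple


def plan_compaction(
    files: List[Tuple[str, int]],
    target_bytes: int,
    small_threshold: int,
) -> List[List[str]]:
    """Group small files into batches whose combined size fits target_bytes.

    Hypothetical helper for illustration only. `files` is a list of
    (name, size_in_bytes) pairs. Files at or above small_threshold are
    left alone. Uses first-fit decreasing: largest small files are
    placed first, each into the first batch that still has room.
    """
    small = sorted(
        (f for f in files if f[1] < small_threshold),
        key=lambda f: f[1],
        reverse=True,
    )
    batches: List[List[str]] = []
    batch_sizes: List[int] = []
    for name, size in small:
        for i, total in enumerate(batch_sizes):
            if total + size <= target_bytes:
                batches[i].append(name)
                batch_sizes[i] += size
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([name])
            batch_sizes.append(size)
    # A batch with a single file gains nothing from being rewritten.
    return [b for b in batches if len(b) > 1]


if __name__ == "__main__":
    listing = [("a", 10), ("b", 20), ("c", 500), ("d", 30), ("e", 45)]
    for batch in plan_compaction(listing, target_bytes=64, small_threshold=100):
        print("merge:", batch)
```

Each returned batch would be rewritten as one larger file. The sizes in the demo are arbitrary; in practice the target is usually tuned to the block size of the underlying storage.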

Data Processing Engines: Qubole supports best-of-breed data processing engines and frameworks for end-to-end data processing. With Qubole’s platform-based approach, new open source big data engines and frameworks can be easily added to ensure platform longevity. Spark: Qubole runs the biggest Apache Spark clusters in the cloud and supports a broad variety of use cases, from ETL and machine learning to analytics. Qubole’s implementation of Spark is a performance-enhanced and cloud-optimized version of the open source Apache Spark framework. These enhancements bring all of Qubole’s cost and performance optimization features to Spark workloads, and greatly improve their performance through fast storage, distributed caching, advanced indexing, and metadata caching. Other enhancements include job isolation on multi-tenant clusters and Sparklens, an open source Spark profiler that provides insights into a Spark application.

The passionate leader behind the success of Qubole

Ashish Thusoo is the Co-founder and Chief Executive Officer of Qubole. In launching Qubole, he leveraged his experience as part of the original Facebook Data Service Team from 2007 to 2011. Mr. Thusoo authored many prominent data industry tools during his time at Facebook, including the Apache Hive Project. His goal was to bring massive speed and scale to the data platform while providing better self-service access to the data for business users. Mr. Thusoo built these same product principles of speed, scale, and accessibility into the foundation of Qubole.

“Our team has worked diligently to cultivate an ideology of compassion and innovation that aligns with who we are and what we seek to achieve.”