Apache Drill Interview Questions and Answers

Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from the ground up to support high-performance analysis on the semi-structured and rapidly evolving data coming from modern Big Data applications, while still providing the familiarity and ecosystem of ANSI SQL, the industry-standard query language. Drill provides plug-and-play integration with existing Apache Hive and Apache HBase deployments.

1) Why Drill?

Answer)The 40-year monopoly of the RDBMS is over. With the exponential growth of data in recent years, and the shift towards rapid application development, new data is increasingly being stored in non-relational datastores including Hadoop, NoSQL and cloud storage. Apache Drill enables analysts, business users, data scientists and developers to explore and analyze this data without sacrificing the flexibility and agility offered by these datastores. Drill processes the data in-situ without requiring users to define schemas or transform data.

2)What are some of Drill's key features?

Answer)Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores. Users can query the data using standard SQL and BI tools without having to create and manage schemas. Some of the key features are:

Schema-free JSON document model similar to MongoDB and Elasticsearch

Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs

Extremely user and developer friendly

Pluggable architecture enables connectivity to multiple datastores

3)How does Drill achieve performance?

Answer)Drill is built from the ground up to achieve high throughput and low latency. The following capabilities help accomplish that:

Distributed query optimization and execution: Drill is designed to scale from a single node (your laptop) to large clusters with thousands of servers.

Columnar execution: Drill is the world's only columnar execution engine that supports complex data and schema-free data. It uses a shredded, in-memory, columnar data representation.

Runtime compilation and code generation: Drill is the world's only query engine that compiles and re-compiles queries at runtime. This allows Drill to achieve high performance without knowing the structure of the data in advance. Drill leverages multiple compilers as well as ASM-based bytecode rewriting to optimize the code.

Vectorization: Drill takes advantage of the latest SIMD instructions available in modern processors.

Optimistic/pipelined execution: Drill is able to stream data in memory between operators. Drill minimizes the use of disks unless needed to complete the query.
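
To make the "shredded, columnar" idea above concrete, here is a toy sketch in Python. It is not Drill's implementation (Drill has its own in-memory vector format); the function name `shred` and the sample records are made up for illustration. It only shows how nested records can be split into one value list per field path, so that an aggregate touches a single column.

```python
from collections import defaultdict


def shred(records):
    """Turn row-oriented records into one value list per dotted column path."""
    # Every column gets one slot per record; missing fields stay None.
    columns = defaultdict(lambda: [None] * len(records))

    def walk(prefix, value, row):
        if isinstance(value, dict):
            # Descend into nested maps, extending the dotted path.
            for key, child in value.items():
                walk(f"{prefix}.{key}" if prefix else key, child, row)
        else:
            columns[prefix][row] = value

    for row, record in enumerate(records):
        walk("", record, row)
    return dict(columns)


# Hypothetical sample data: the second record has a different shape.
records = [
    {"user": {"id": 1, "name": "ann"}, "clicks": 3},
    {"user": {"id": 2}, "clicks": 5, "referrer": "ad"},
]
columns = shred(records)

# An aggregate reads only the column it needs.
total_clicks = sum(columns["clicks"])
```

Records with different shapes still fit: a field that is absent from a record simply leaves a gap in that column.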

4)What datastores does Drill support?

Answer)Drill is primarily focused on non-relational datastores, including Hadoop, NoSQL and cloud storage. The following datastores are currently supported:

Hadoop: All Hadoop distributions (HDFS API 2.3+), including Apache Hadoop, MapR, CDH and Amazon EMR

NoSQL: MongoDB, HBase

Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage, Swift

A new datastore can be added by developing a storage plugin. Drill's unique schema-free JSON data model enables it to query non-relational datastores in-situ (many of these systems store complex or schema-free data).

5)What clients are supported?

Answer)BI tools via the ODBC and JDBC drivers (eg, Tableau, Excel, MicroStrategy, Spotfire, QlikView, Business Objects)

Custom applications via the REST API

Java and C applications via the dedicated Java and C libraries
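
As a rough sketch of the "custom applications via the REST API" option, the Python snippet below builds a query request using only the standard library. It assumes the Drillbit web interface listens on `http://localhost:8047` and accepts queries as a JSON POST to `/query.json`; check both against the documentation for your Drill version. The helper name `build_query_request` is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_query_request(base_url, sql):
    """Build (but do not send) a POST request that submits one SQL query."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    "http://localhost:8047",  # assumed address of a local Drillbit web port
    "SELECT * FROM dfs2.root.`/home/john/log.json` LIMIT 10",
)

# To run it against a live Drillbit, uncomment:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```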

6)Is Drill a SQL-on-Hadoop engine?

Answer)Drill supports a variety of non-relational datastores in addition to Hadoop. Drill takes a different approach compared to traditional SQL-on-Hadoop technologies like Hive and Impala. For example, users can directly query self-describing data (eg, JSON, Parquet) without having to create and manage schemas.

The following table provides a more detailed comparison between Drill and traditional SQL-on-Hadoop technologies:

Aspect           | Drill                                                       | SQL-on-Hadoop (Hive, Impala, etc.)
Use case         | Self-service, in-situ, SQL-based analytics                  | Data warehouse offload
Data sources     | Hadoop, NoSQL, cloud storage (including multiple instances) | A single Hadoop cluster
Data model       | Schema-free JSON (like MongoDB)                             | Relational
User experience  | Point-and-query                                             | Ingest data → define schemas → query
Deployment model | Standalone service or co-located with Hadoop or NoSQL       | Co-located with Hadoop
Data management  | Self-service                                                | IT-driven
SQL              | ANSI SQL                                                    | SQL-like
1.0 availability | Q2 2015                                                     | Q2 2013 or earlier

7)Is Spark SQL similar to Drill?

Answer)No. Spark SQL is primarily designed to enable developers to incorporate SQL statements in Spark programs. Drill does not depend on Spark, and is targeted at business users, analysts, data scientists and developers.

8)Does Drill replace Hive?

Answer)Hive is a batch processing framework most suitable for long-running jobs. For data exploration and BI, Drill provides a much better experience than Hive.

In addition, Drill is not limited to Hadoop. For example, it can query NoSQL databases (eg, MongoDB, HBase) and cloud storage (eg, Amazon S3, Google Cloud Storage, Azure Blob Storage, Swift).

9)How does Drill support queries on self-describing data?

Answer)Drill's flexible JSON data model and on-the-fly schema discovery enable it to query self-describing data.

JSON data model: Traditional query engines have a relational data model, which is limited to flat records with a fixed structure. Drill is built from the ground up to support modern complex/semi-structured data commonly seen in non-relational datastores such as Hadoop, NoSQL and cloud storage. Drill's internal in-memory data representation is hierarchical and columnar, allowing it to perform efficient SQL processing on complex data without flattening into rows.

On-the-fly schema discovery (or late binding): Traditional query engines (eg, relational databases, Hive, Impala, Spark SQL) need to know the structure of the data before query execution. Drill, on the other hand, features a fundamentally different architecture, which enables execution to begin without knowing the structure of the data. The query is automatically compiled and re-compiled during the execution phase, based on the actual data flowing through the system. As a result, Drill can handle data with evolving schema or even no schema at all (eg, JSON files, MongoDB collections, HBase tables).
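
The late-binding idea can be illustrated with a small Python sketch. This is a toy model, not Drill's engine: the functions `batch_schema` and `run` and the sample batches are invented for illustration, and a counter stands in for Drill's runtime recompilation. The point is that the schema is read from each batch as it arrives, and the plan is refreshed only when that schema changes.

```python
def batch_schema(batch):
    """Infer {field: type name} from the records actually present in a batch."""
    schema = {}
    for record in batch:
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema


def run(batches):
    """Process batches in order, 're-planning' whenever the observed schema changes."""
    current, replans, seen = None, 0, []
    for batch in batches:
        schema = batch_schema(batch)
        if schema != current:
            # Stand-in for regenerating and recompiling the query code.
            current, replans = schema, replans + 1
        seen.append(sorted(current))
    return replans, seen


# Hypothetical data: the third batch introduces a new field.
batches = [
    [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}],
    [{"id": 3, "name": "cy"}],
    [{"id": 4, "name": "di", "tags": ["x"]}],
]
replans, seen = run(batches)
```

Here the first batch triggers the initial plan, the second reuses it, and the third forces a second plan because the `tags` field appears.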

10)I already have schemas defined in the Hive Metastore. Can I use them with Drill?

Answer)Absolutely. Drill has a storage plugin for Hive tables, so you can simply point Drill to the Hive Metastore and start performing low-latency queries on Hive tables. In fact, a single Drill cluster can query data from multiple Hive Metastores, and even perform joins across these datasets.

11)Is Drill "anti-schema" or "anti-DBA"?

Answer)Not at all. Drill actually takes advantage of schemas when available. For example, Drill leverages the schema information in Hive when querying Hive tables. However, when querying schema-free datastores like MongoDB, or raw files on S3 or Hadoop, schemas are not available, and Drill is still able to query that data.

Centralized schemas work well if the data structure is static, and the value of data is well understood and ready to be operationalized for regular reporting purposes. However, during data exploration, discovery and interactive analysis, requiring rigid modeling poses significant challenges. For example:

Complex data (eg, JSON) is hard to map to relational tables

Centralized schemas are hard to keep in sync when the data structure is changing rapidly

Non-repetitive/ad-hoc queries and data exploration needs may not justify modeling costs

Drill is all about flexibility. The flexible schema management capabilities in Drill allow users to explore raw data and then create models/structure with CREATE TABLE or CREATE VIEW statements, or with Hive Metastore.

12)What does a Drill query look like?

Answer)Drill uses a decentralized metadata model and relies on its storage plugins to provide metadata. There is a storage plugin associated with each data source that is supported by Drill.

The name of the table in a query tells Drill where to get the data:

SELECT * FROM dfs1.root.`/my/log/files/`;

SELECT * FROM dfs2.root.`/home/john/log.json`;

SELECT * FROM mongodb1.website.users;

SELECT * FROM hive1.logs.frontend;

SELECT * FROM hbase1.events.clicks;
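
The table names above follow a three-part pattern: storage plugin, then workspace (or database, for sources like Hive and MongoDB), then the table or file path. The Python sketch below splits a name into those parts. It is only an illustration of the naming convention: `parse_table_ref` is a hypothetical helper, and Drill's own SQL parser accepts more forms than this regular expression does.

```python
import re

# plugin.workspace.table, where the table part may be a backtick-quoted path.
TABLE_REF = re.compile(
    r"^(?P<plugin>\w+)\.(?P<workspace>\w+)\.(?P<table>`[^`]+`|\w+)$"
)


def parse_table_ref(ref):
    """Split a three-part table name into plugin, workspace and table."""
    match = TABLE_REF.match(ref)
    if match is None:
        raise ValueError(f"unsupported table reference: {ref!r}")
    parts = match.groupdict()
    parts["table"] = parts["table"].strip("`")  # drop the quoting backticks
    return parts


file_ref = parse_table_ref("dfs1.root.`/my/log/files/`")
mongo_ref = parse_table_ref("mongodb1.website.users")
```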

13)What SQL functionality does Drill support?

Answer)Drill supports standard SQL (aka ANSI SQL). In addition, it features several extensions that help with complex data, such as the KVGEN and FLATTEN functions.
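
To show what KVGEN and FLATTEN do, here is a toy Python model of their semantics. It is not Drill code: the functions `kvgen` and `flatten` and the sample row are made up for illustration. KVGEN turns a map with arbitrary keys into a list of key/value pairs, and FLATTEN emits one output row per list element. In Drill SQL the same combination is typically written as FLATTEN(KVGEN(column)).

```python
def kvgen(mapping):
    """Like KVGEN: turn a map into a list of {"key": ..., "value": ...} maps."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def flatten(rows, column):
    """Like FLATTEN: emit one output row per element of the list in `column`."""
    out = []
    for row in rows:
        for element in row[column]:
            # Copy the other fields and replace the list with one element.
            out.append({**row, column: element})
    return out


# Hypothetical input: the keys of "visits" are data, not fixed column names.
rows = [{"user": "ann", "visits": {"home": 3, "cart": 1}}]

with_pairs = [{**r, "visits": kvgen(r["visits"])} for r in rows]
flat = flatten(with_pairs, "visits")
```

After both steps, the single input row becomes two rows, one per page, which ordinary SQL filters and aggregates can then work on.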

14)Do I need to load data into Drill to start querying it?

Answer)No. Drill can query data 'in-situ'.
