Apache Beam Interview Questions and Answers

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.


1) What are the benefits of Apache Beam over Spark/Flink for batch processing?


Answer) There are a few things that Beam adds over many of the existing engines.

Unifying batch and streaming: Many systems can handle both batch and streaming, but they often do so via separate APIs. In Beam, batch and streaming are just two points on a spectrum of latency, completeness, and cost, so there is no learning/rewriting cliff when moving from batch to streaming. If you write a batch pipeline today but tomorrow your latency needs change, it's incredibly easy to adjust. You can see this kind of journey in the Mobile Gaming examples.

APIs that raise the level of abstraction: Beam's APIs focus on capturing the properties of your data and your logic, instead of letting details of the underlying runtime leak through. This is key for portability (see the next paragraph) and also gives runtimes a lot of flexibility in how they execute. Something like ParDo fusion (aka function composition) is a pretty basic optimization that the vast majority of runners already do. Other optimizations are still being implemented for some runners. For example, Beam's Source APIs are specifically built to avoid overspecifying the sharding within a pipeline. Instead, they give runners the right hooks to dynamically rebalance work across available machines. This can make a huge difference in performance by essentially eliminating straggler shards. In general, the more smarts we can build into the runners, the better off we'll be. Even the most careful hand tuning will fail as data, code, and environments shift.

Portability across runtimes: Because data shapes and runtime requirements are neatly separated, the same pipeline can be run in multiple ways. That means you don't end up rewriting code when you have to move from on-prem to the cloud, or from a tried-and-true system to something on the cutting edge. You can very easily compare options to find the mix of environment and performance that works best for your current needs. And that might be a mix of things -- processing sensitive data on-premises with an open source runner and processing other data on a managed service in the cloud.

Designing the Beam model to be a useful abstraction over many different engines is tricky. Beam is neither the intersection of the functionality of all the engines (too limited!) nor the union (too much of a kitchen sink!). Instead, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines.

Keyed State is a great example of functionality that existed in various engines and enabled interesting and common use cases, but wasn't originally expressible in Beam. We recently expanded the Beam model to include a version of this functionality according to Beam's design principles.
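
As an illustration (a minimal sketch in the Beam Python SDK, not part of the original answer), keyed state lets a DoFn keep a per-key value across elements; here, a per-key counter. The element shape, (key, value) tuples, and the class name are assumptions:

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class CountPerKey(beam.DoFn):
    # One counter cell per key, persisted by the runner across bundles.
    COUNT = ReadModifyWriteStateSpec('count', VarIntCoder())

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        key, _value = element  # stateful DoFns require keyed input
        current = (count.read() or 0) + 1
        count.write(current)
        yield key, current

# Usage sketch: keyed_pcoll | beam.ParDo(CountPerKey())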

And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam (née Dataflow) model.

This also means that the capabilities will not always be exactly the same across different Beam runners at a given point in time, which is why we use the capability matrix to clearly communicate the state of things.


2) Apache Beam: FlatMap vs Map?


Answer) These transforms in Beam behave exactly the same as their counterparts in Spark (and in Scala collections).

A Map transform maps a PCollection of N elements into another PCollection of N elements.

A FlatMap transform maps a PCollection of N elements into N collections of zero or more elements, which are then flattened into a single PCollection.

As a simple example, the following happens:

beam.Create([1, 2, 3]) | beam.Map(lambda x: [x, 'any'])

# The result is a collection of THREE lists: [[1, 'any'], [2, 'any'], [3, 'any']]

Whereas:

beam.Create([1, 2, 3]) | beam.FlatMap(lambda x: [x, 'any'])

# The lists output by the lambda are then flattened into a
# collection of SIX single elements: [1, 'any', 2, 'any', 3, 'any']
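
Taken together, a minimal runnable version of both snippets looks like this (the transform labels and print steps are additions for illustration; it runs on the DirectRunner):

import apache_beam as beam

with beam.Pipeline() as p:  # uses the DirectRunner by default
    nums = p | beam.Create([1, 2, 3])
    (nums | 'Map' >> beam.Map(lambda x: [x, 'any'])
          | 'PrintLists' >> beam.Map(print))        # emits three lists
    (nums | 'FlatMap' >> beam.FlatMap(lambda x: [x, 'any'])
          | 'PrintElements' >> beam.Map(print))     # emits six elements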


3) How do you express denormalization joins in Apache Beam that stretch over long periods of time?


Answer) Since a Producer may appear years before its Product, you can use external storage (e.g., Bigtable) to store your Producers, and write a ParDo over the Product stream that looks up the matching Producer and performs the join. To further optimize performance, you can take advantage of the stateful DoFn feature to batch the lookups.
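
A minimal sketch of that lookup ParDo, assuming a hypothetical ProducerStoreClient wrapper around the external store (swap in a real Bigtable/HBase client for your environment; the dict-shaped elements are also an assumption):

import apache_beam as beam

class EnrichWithProducer(beam.DoFn):
    def setup(self):
        # Hypothetical client for the external store; one per DoFn instance,
        # not per element.
        self.store = ProducerStoreClient()

    def process(self, product):
        producer = self.store.get(product['producer_id'])  # point lookup
        if producer is not None:
            yield {**product, 'producer': producer}
        # else: route to a dead-letter output or retry, as your needs dictate

# Usage sketch: products | beam.ParDo(EnrichWithProducer())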

You can still use windowing and CoGroupByKey to do the join in cases where the Product data is delivered close to the Producer data; the window here can be small, just large enough to handle out-of-order delivery.
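
A sketch of that windowed join, assuming keyed_producers and keyed_products are upstream PCollections of (producer_id, value) pairs and that a 10-minute window absorbs the out-of-order delivery (the names and window size are assumptions):

import apache_beam as beam
from apache_beam.transforms import window

producers = keyed_producers | 'WindowProducers' >> beam.WindowInto(window.FixedWindows(10 * 60))
products = keyed_products | 'WindowProducts' >> beam.WindowInto(window.FixedWindows(10 * 60))

# Result elements look like (producer_id, {'producers': [...], 'products': [...]}).
joined = ({'producers': producers, 'products': products}
          | beam.CoGroupByKey())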


4) Apache Airflow or Apache Beam for data processing and job scheduling?


Answer) Airflow can do nearly anything: it has a BashOperator and a PythonOperator, which means it can run any bash script or any Python script (see the minimal DAG sketched below).

It is a way to organize (set up complicated data pipeline DAGs), schedule, monitor, and trigger re-runs of data pipelines, all in an easy-to-view and easy-to-use UI.

Also, it is easy to set up, and everything is in familiar Python code.

Doing pipelines in an organized manner (i.e., using Airflow) means you don't waste time debugging a mess of data-processing (cron) scripts scattered all over the place.
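
For example, a minimal DAG might look like this (task names and schedule are illustrative, not from the original answer; the operator import paths follow Airflow 2.x):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform():
    print('running a Python step')

with DAG(dag_id='example_pipeline',
         start_date=datetime(2024, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:
    extract = BashOperator(task_id='extract', bash_command='echo extracting')
    process = PythonOperator(task_id='transform', python_callable=transform)
    extract >> process  # extract runs before transform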

Apache Beam is a wrapper around the many data processing frameworks (Spark, Flink, etc.) out there.

The intent is that you learn just Beam and can then run it on multiple backends (Beam runners).

If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.
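
A quick sketch of what that portability looks like in practice: the pipeline code stays the same and only the runner option changes (the runner names shown are the standard ones; Dataflow and Flink need additional options in real use):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap 'DirectRunner' for 'FlinkRunner', 'SparkRunner', or 'DataflowRunner'
# without touching the pipeline itself.
opts = PipelineOptions(runner='DirectRunner')
with beam.Pipeline(options=opts) as p:
    (p | beam.Create(['hello', 'beam'])
       | beam.Map(str.upper)
       | beam.Map(print))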


5) What are the use cases for Apache Beam and Apache NiFi? It seems both of them are data flow engines. If both have similar use cases, which of the two is better?


Answer) Apache Beam is an abstraction layer for stream processing systems like Apache Flink, Apache Spark (Streaming), Apache Apex, and Apache Storm. It lets you write your code against a standard API and then execute that code using any of the underlying platforms. So, theoretically, if you wrote your code against the Beam API, that code could run on Flink or Spark Streaming without any code changes.

Apache NiFi is a data flow tool that is focused on moving data between systems, all the way from very small edge devices with the use of MiNiFi, back to the larger data centers with NiFi. NiFi's focus is on capabilities like visual command and control, filtering of data, enrichment of data, data provenance, and security, just to name a few. With NiFi, you aren't writing code and deploying it as a job, you are building a living data flow through the UI that is taking effect with each action.

Stream-processing platforms are often focused on computations involving joins of streams and windowing operations, whereas a data flow tool is often complementary, used to manage the flow of data from the sources to the processing platforms.

There are actually several integration points between NiFi and stream processing systems: there are components for Flink, Spark, Storm, and Apex that can pull data from NiFi or push data back to it. Another common pattern is to use MiNiFi + NiFi to get data into Apache Kafka and then have the stream processing systems consume from Kafka, as sketched below.
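
A sketch of the Beam side of that Kafka pattern, using Beam's cross-language ReadFromKafka transform (the broker address and topic name are assumptions, and a Java expansion service must be available at runtime for this transform to work):

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

with beam.Pipeline() as p:
    (p | ReadFromKafka(
             consumer_config={'bootstrap.servers': 'localhost:9092'},
             topics=['nifi-output'])
       | beam.Map(lambda kv: kv[1])  # elements are (key, value) pairs; keep the value bytes
       | beam.Map(print))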

