Preview Apache Spark Interview Q&A

Introduction and Getting Started with Spark

Introduction

What are the key features of Apache Spark that you like?

Which kinds of data processing does Spark support?

What are the benefits of Spark over MapReduce?

What does the Spark engine do?

Spark Setup, Deployment, and Execution Modes

In which situations would you use client mode versus cluster mode?

Do you need to install Spark on all nodes of a YARN cluster when running Spark?

How to stop a running Spark application?

How to limit the number of retries on Spark job failure in YARN?

Is there any way to get the Spark application ID while running a job?

Where are the logs in Spark on YARN? How to view those logs?

How to prevent Spark Executors from getting Lost when using YARN client mode?

What are mount points in Databricks? Why do you use them?

How can I run Spark on a cluster?

Spark Core Concepts – RDDs, Transformations, and Actions

How do you define RDD?

Explain transformations and actions in the context of RDDs.

What does a lazily evaluated RDD mean?

What happens to RDD when one of the nodes on which it is distributed goes down?

What is the difference between map and flatMap and a good use case for each?

How to print the contents of an RDD?

How to read multiple text files into a single RDD?

Explain sortByKey() operation

How would you control the number of partitions of an RDD?

DataFrames and Spark SQL

What are DataFrames?

What are the advantages of DataFrames?

What is Spark SQL and how does it differ from Hive?

What are the various data sources available in SparkSQL?

Does SparkSQL support subquery?

How to change column types in Spark SQL DataFrame?

How to replace NULL values in a Spark DataFrame?

How to add a constant column in a Spark DataFrame?

How to add an index column in a Spark DataFrame?

How to concatenate columns in a Spark DataFrame?

Is there any way for Spark to create primary keys?

File Formats, Storage, and Data Partitioning

List the advantages of the Parquet file format in Apache Spark.

How does Spark partitioning work on files in HDFS?

How to compress Spark output written to HDFS in standalone mode?

What is the difference between partitioning and bucketing in Apache Spark?

How to deal with a 100 GB table joined with a 1 GB table?

Spark Performance Optimization and Troubleshooting

How to get good performance with Spark?

What are the various levels of persistence in Apache Spark?

What is the difference between the cache() and persist() methods of an RDD?

What is the coalesce transformation?

What is Shuffling?

What is speculative execution of tasks?

How to evaluate your Spark application?

Have you ever encountered Spark java.lang.OutOfMemoryError? How to fix it?

How do you deal with data skew in joins in Apache Spark?

Do you know the top five secrets of performance tuning Apache Spark?

Advanced Spark Concepts

What is the Catalyst Optimizer? Explain with an example.

What is Project Tungsten in Spark and how does it optimize execution?

What is WholeStageCodeGen in Spark SQL?

What is the difference between groupByKey and reduceByKey?

How can you minimize data transfers when working with Spark?

What is the advantage of broadcasting values across a Spark cluster?

What is a broadcast join and when should you use it?

What is the default level of parallelism in Spark?

Monitoring, Logs, and Spark UI

How to monitor and troubleshoot Spark jobs using the Spark UI?

What does “Stage Skipped” mean in the Spark web UI?

How do you disable INFO messages when running a Spark application?

What are the different stages of query execution in Spark SQL?

Spark Streaming and Real-Time Processing

What is Apache Spark Streaming?

How does the Spark Streaming API work?

What do you understand by receivers in Spark Streaming?

What is a DStream?

What is the significance of the sliding window operation?

What is a write-ahead log (journaling)?

What is Structured Streaming and how is it different from DStreams?

How do you implement watermarking in Structured Streaming?

How to handle late arriving data in Structured Streaming?

How does Spark integrate with Kafka for real-time streaming?

Explain the role of checkpointing & stateful operations in Structured Streaming

Machine Learning and Graph Processing

What does MLlib do?

What is GraphX?

What is PageRank?

Spark Architecture and Execution Flow

Define Spark architecture.

What is the DAGScheduler and how does it work?

What are workers in Spark?

What happens when a job is submitted? (Job execution process)

What is Spark Driver?

Please define executors in detail.

What is a stage with regard to Spark job execution?

What is checkpointing?

What is Data locality / placement?

Spark Integrations and Ecosystem Tools

Can you use Spark to access and analyse data stored in Cassandra?

Is it possible to run Apache Spark on Apache Mesos?

How to read an AWS S3 file in Spark?

Do I need Hadoop to run Spark?

How does Spark relate to Apache Hadoop?

Who is using Spark in production?

Scenario-Based and Troubleshooting Questions

Scenario Based Question (Memory Management)

Scenario Based Question (Cache)

Scenario Based Question (Cluster)

Scenario Based Question (Recovery)

I’ve got a big RDD (1 GB) in a YARN cluster and can’t use collect(). How to handle this?

Why does a job fail with “No space left on device,” but df says otherwise?

Certain data is used again and again. How to improve performance?

While processing a CSV file, the output is split into multiple files. How to get a single file?

How to remove parentheses from output?

What are possible reasons for TimeoutException?

Modern Spark Features – AQE, Delta Lake & More

What is Adaptive Query Execution (AQE) and how does it help?

How do you enable AQE and where is it useful?

What is Dynamic Partition Pruning?

What is Delta Lake and why is it important?

What are the differences between Parquet and Delta Lake?

Explain ACID transactions in Delta Lake

How does Delta Lake handle schema evolution?

How to implement Slowly Changing Dimensions (SCD) with Delta Lake

Miscellaneous and Community Insights

Does Spark require modified versions of Scala or Python?

Does my data need to fit in memory to use Spark?

How large a cluster can Spark scale to?

What is the role of Spark Accumulators?

Is it possible to have multiple SparkContext instances in a single JVM?

Which cluster managers can be used with Spark?

Name some sources from which Spark Streaming can process real-time data.

Name some companies that are already using Spark Streaming.

What is the difference between Apache Spark and Apache Storm?

What is the difference between Apache Spark and Apache Flink?

What are the ways to configure Spark properties, and how are they ordered?

How to find the moving average of a time series using Apache Spark?

Why is Spark good at low-latency iterative workloads?

We understand Spark Streaming uses micro-batching. Does this increase latency?

Spark Interview Question Set 1

Introduction

How to add an index column in a Spark DataFrame?

What are the differences between Apache Spark and Apache Storm?

How to limit the number of retries on Spark job failure in YARN?

Is there any way to get the Spark application ID while running a job?

How to stop a running Spark application?

In Spark standalone mode, how to compress Spark output written to HDFS?

Is there any way to get the current number of partitions of a DataFrame?

How to get good performance with Spark?

Why does a job fail with “No space left on device”, but df says otherwise?

Where are the logs in Spark on YARN? How to view those logs?

Spark Interview Question Set 2

How to prevent Spark Executors from getting Lost when using YARN client mode?

In which situations would you use client mode versus cluster mode?

How to print the contents of an RDD?

What is the difference between Apache Spark and Apache Flink?

How to remove the parentheses from output?

What are possible reasons for receiving TimeoutException: [n seconds]?

How to open/stream .zip files through Spark?

How to read multiline JSON in Apache Spark?

How to replace NULL values in a Spark DataFrame?

How does Spark partitioning work on files in HDFS?

Scenario Based Question (Memory Management)

Scenario Based Question (Cache)

Scenario Based Question (Cluster)

Scenario Based Question (Recovery)

Let’s say you have a 100 GB table and a small 1 GB table. How do you join them?

Spark Interview Question Set 3

How to read an AWS S3 file in Spark?

How to find the moving average of a time series using Apache Spark?

How to change column types in Spark SQL DataFrame?

I've got a big RDD (1 GB) in a YARN cluster and can't use collect(). How to handle this?

Is there any way for Spark to create primary keys?

How to add a constant column in a Spark DataFrame?

What does "Stage Skipped" mean in the Apache Spark web UI?

How to concatenate columns in an Apache Spark DataFrame?

While processing a CSV file, the output is split into multiple files. How to get a single file?

Explain sortByKey() operation.

Spark Interview Question Set 4

List the advantages of the Parquet file format in Apache Spark.

Do you need to install Spark on all nodes of a YARN cluster when running Spark?

What is PageRank?

What does MLlib do?

What is GraphX?

What do you understand by receivers in Spark Streaming?

Name some companies that are already using Spark Streaming.

Name some sources from which the Spark Streaming component can process real-time data.

What are the key features of Apache Spark that you like?

What are the various data sources available in SparkSQL?

Spark Interview Question Set 5

What is the difference between map and flatMap and a good use case for each?

How to read multiple text files into a single RDD?

Does SparkSQL support subquery?

Have you ever encountered Spark java.lang.OutOfMemoryError? How to fix this issue?

How do I skip a header from CSV files in Spark?

What happens to RDD when one of the nodes on which it is distributed goes down?

Certain data is used again and again. How to improve performance?

How does the Spark Streaming API work?

What is a write-ahead log (journaling)?

What are the advantages of DataFrames?

Spark Interview Question Set 6

What are DataFrames?

What is Spark Driver?

What are the benefits of Spark over MapReduce?

What does the Spark engine do?

Explain the difference between Spark SQL and Hive.

What are the various levels of persistence in Apache Spark?

Which one would you choose for a project: Hadoop MapReduce or Apache Spark?

What is a DStream?

What is the significance of the sliding window operation?

How can you minimize data transfers when working with Spark?

Spark Interview Question Set 7

Is it possible to run Apache Spark on Apache Mesos?

Can you use Spark to access and analyse data stored in Cassandra databases?

Explain transformations and actions in the context of RDDs.

What is Apache Spark Streaming?

How can you define Spark Accumulators?

What is a Broadcast Variable?

What is Data locality / placement?

Which cluster managers can be used with Spark?

What is speculative execution of tasks?

What is a stage with regard to Spark job execution?

Spark Interview Question Set 8

What is the DAGScheduler and how does it work?

Please define executors in detail.

Please explain how workers work when a new job is submitted to them.

What are the workers?

Define Spark architecture.

What is checkpointing?

What is the difference between groupByKey and reduceByKey?

What is Shuffling?

What is the difference between the cache() and persist() methods of an RDD?

What is the coalesce transformation?

Spark Interview Question Set 9

Data is spread across all the nodes of a cluster. How does Spark try to process this data?

How would you control the number of partitions of an RDD?

What does a lazily evaluated RDD mean?

How do you define RDD?

How do you evaluate your Spark application?

How do you disable INFO messages when running a Spark application?

What is the advantage of broadcasting values across a Spark cluster?

Is it possible to have multiple SparkContext instances in a single JVM?

What is the default level of parallelism in Spark?

What are the ways to configure Spark properties, and how are they ordered?

Spark Interview Question Set 10

Which kinds of data processing does Spark support?

Why is Spark good at low-latency iterative workloads?

We understand Spark Streaming uses micro-batching. Does this increase latency?

Does Spark require modified versions of Scala or Python?

Do I need Hadoop to run Spark?

How can I run Spark on a cluster?

Does my data need to fit in memory to use Spark?

How large a cluster can Spark scale to?

How does Spark relate to Apache Hadoop?

Who is using Spark in production?

What are mount points in Databricks? Why do you use them?

What is the difference between partitioning and bucketing in Apache Spark?

Do you know the top five secrets of performance tuning Apache Spark?