Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
1)Is Apache Flink only for (near) real-time processing use cases?
Answer)Flink is a very general system for data processing and data-driven applications with data streams as the core building block. These data streams can be streams of real-time data or stored streams of historical data. For example, in Flink’s view a file is a stored stream of bytes. Because of that, Flink supports real-time data processing and applications as well as batch processing applications.
Streams can be unbounded (have no end, events continuously keep coming) or be bounded (streams have a beginning and an end). For example, a Twitter feed or a stream of events from a message queue are generally unbounded streams, whereas a stream of bytes from a file is a bounded stream.
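The bounded/unbounded distinction can be sketched in plain Java. This uses the JDK's `java.util.stream` API purely as an analogy, not Flink's own API; the class and method names (`BoundedVsUnbounded`, `boundedSource`, `unboundedSource`, `toUpper`) are made up for illustration. The same transformation is applied to both kinds of stream, but only the bounded one can be consumed to its end.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BoundedVsUnbounded {

    // Bounded: backed by a fixed collection, so it has a beginning and an end
    // (comparable to reading the bytes of a file).
    static Stream<String> boundedSource() {
        return Arrays.asList("a", "b", "c").stream();
    }

    // Unbounded: the supplier can always produce one more event
    // (comparable to a message queue that keeps receiving events).
    static Stream<String> unboundedSource() {
        AtomicLong counter = new AtomicLong();
        return Stream.generate(() -> "event-" + counter.getAndIncrement());
    }

    // The same processing logic works on either kind of stream.
    static Stream<String> toUpper(Stream<String> in) {
        return in.map(String::toUpperCase);
    }

    public static void main(String[] args) {
        // A bounded stream can simply be read until it ends.
        List<String> all = toUpper(boundedSource()).collect(Collectors.toList());
        System.out.println(all); // [A, B, C]

        // An unbounded stream never ends on its own; here it is cut off
        // after three events only so that the demo terminates.
        List<String> firstThree =
                toUpper(unboundedSource()).limit(3).collect(Collectors.toList());
        System.out.println(firstThree); // [EVENT-0, EVENT-1, EVENT-2]
    }
}
```

Without the `limit(3)` call, collecting the unbounded stream would never return, which is why unbounded inputs need a processing model that produces results continuously.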
2)If everything is a stream, why are there a DataStream and a DataSet API in Flink?
Answer)Bounded streams are often more efficient to process than unbounded streams. Processing unbounded streams of events in (near) real-time requires the system to act on events immediately and to produce intermediate results (often with low latency). Processing bounded streams usually does not require low-latency results, because the data is already somewhat old (in relative terms). That allows Flink to process the data in a simpler and more efficient way.
The DataStream API captures the continuous processing of unbounded and bounded streams, with a model that supports low latency results and flexible reaction to events and time (including event time).
The DataSet API has techniques that often speed up the processing of bounded data streams. In the future, the community plans to combine these optimizations with the techniques in the DataStream API.
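The trade-off described in this answer, emitting intermediate results for every event versus producing one result once all input has been read, can be illustrated with a small plain-Java sketch. It does not use the DataStream or DataSet API; `IncrementalVsBatch`, `runningSums`, and `finalSum` are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalVsBatch {

    // Streaming style: update the result on every event and emit it right away,
    // so a consumer always sees an up-to-date intermediate value.
    static List<Long> runningSums(Iterable<Long> events) {
        List<Long> emitted = new ArrayList<>();
        long sum = 0;
        for (long e : events) {
            sum += e;
            emitted.add(sum); // one intermediate result per event
        }
        return emitted;
    }

    // Batch style: the input is known to end, so a single final result is enough
    // and no intermediate values have to be produced.
    static long finalSum(List<Long> events) {
        long sum = 0;
        for (long e : events) {
            sum += e;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(3L, 1L, 4L);
        System.out.println(runningSums(events)); // [3, 4, 8]
        System.out.println(finalSum(events));    // 8
    }
}
```

The batch variant does less work per event, and because it only has to be correct at the end, it leaves room for further optimization of how the input is read and processed.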
3)How does Flink relate to the Hadoop Stack?
Answer)Flink is independent of Apache Hadoop and runs without any Hadoop dependencies.
However, Flink integrates very well with many Hadoop components, for example, HDFS, YARN, or HBase. When running together with these components, Flink can use HDFS to read data, or write results and checkpoints/snapshots. Flink can be easily deployed via YARN and integrates with the YARN and HDFS Kerberos security modules.
4)What other stacks does Flink run in?
Answer)Users run Flink on Kubernetes, Mesos, Docker, or even as standalone services.
5)What are the prerequisites to use Flink?
Answer)You need Java 8 to run Flink jobs/applications.
The Scala API (optional) depends on Scala 2.11.
Highly-available setups with no single point of failure require Apache ZooKeeper.
For highly-available stream processing setups that can recover from failures, Flink requires some form of distributed storage for checkpoints (HDFS / S3 / NFS / SAN / GFS / Kosmos / Ceph / …).
6)What scale does Flink support?
Answer)Users are running Flink jobs both in very small setups (fewer than 5 nodes) and on thousands of nodes with terabytes of state.
7)Is Flink limited to in-memory data sets?
Answer)For the DataStream API, Flink supports larger-than-memory state by configuring the RocksDB state backend.
For the DataSet API, all operations (except delta-iterations) can scale beyond main memory.