Apache Flume Interview Questions and Answers

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.


1)Explain the core components of Flume.


Answer)The core components of Flume are –

Event-The single log entry or unit of data that is transported.

Source-This is the component through which data enters Flume workflows.

Sink-It is responsible for transporting data to the desired destination.

Channel-The conduit between the Source and the Sink; events are buffered here.

Agent-Any JVM that runs Flume.

Client-The component that transmits events to a source operating within the agent.
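
To make this concrete, here is a minimal sketch of a single-agent configuration in the Flume NG properties format, wiring a source, a channel, and a sink together; the agent name a1 and the component names r1, c1, and k1 are illustrative, not required names.

# events enter through the netcat source, buffer in the memory channel, and leave via the logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1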


2)Can I run two instances of the flume node on the same Unix machine?


Answer)Yes. Run the second instance with the -n option to give it a distinct physical node name.

flume node

flume node -n physicalnodename


3)I'm generating events from my application and sending them to a flume agent listening for Thrift/Avro RPCs, and my timestamps seem to be in the 1970s.


Answer)Flume expects event timestamps to be Unix time in milliseconds. If the data is being generated by an external application, that application must produce timestamps in milliseconds rather than seconds.

For example, 1305680461000 should result in 5/18/11 01:01:01 GMT, but 1305680461 will result in something like 1/16/70 2:41:20 GMT
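
A quick way to see the difference in plain Java (a standalone illustration, not Flume code; the class name is made up):

import java.time.Instant;

public class TimestampCheck {
    public static void main(String[] args) {
        long seconds = 1305680461L;                                  // Unix time in seconds
        System.out.println(Instant.ofEpochMilli(seconds));          // 1970-01-16T02:41:20.461Z (wrong)
        System.out.println(Instant.ofEpochMilli(seconds * 1000L));  // 2011-05-18T01:01:01Z (correct)
    }
}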


4)Can I control the level of HDFS replication / block size / other client HDFS properties?


Answer)Yes. HDFS block size and replication level are HDFS client parameters, so you should expect them to be set by the client. The parameters you are getting probably come from the hadoop-core.*.jar file (which usually contains hdfs-default.xml and friends). If you want to override the default parameters, set dfs.block.size and dfs.replication in your hdfs-site.xml or flume-site.xml file.
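
For example, the override could look like the following in hdfs-site.xml (or flume-site.xml); the values shown are illustrative only (a 128 MB block size and a replication factor of 2):

<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>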


5)Which is the most reliable channel in Flume to ensure that there is no data loss?


Answer)The FILE channel is the most reliable of the three channel types: JDBC, FILE, and MEMORY.
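
As a sketch, a file channel is configured by pointing it at durable directories on local disk; in the Flume NG properties format this might look like the following (the agent name, channel name, and paths are illustrative):

agent.channels = fc
agent.channels.fc.type = file
agent.channels.fc.checkpointDir = /var/lib/flume/checkpoint
agent.channels.fc.dataDirs = /var/lib/flume/data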


6)How can a multi-hop agent be set up in Flume?


Answer)The Avro RPC mechanism is used to set up multi-hop flows in Apache Flume: the Avro sink of one agent sends events to the Avro source of the next agent.
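
A rough sketch of a two-hop flow in the Flume NG properties format: the first agent's Avro sink points at the host and port where the second agent's Avro source is listening (the agent names, channel names, hostname, and port are illustrative).

# First hop: agent1 forwards events over Avro RPC
agent1.sinks.avroForward.type = avro
agent1.sinks.avroForward.hostname = collector-host
agent1.sinks.avroForward.port = 4545
agent1.sinks.avroForward.channel = ch1
# Second hop: agent2 receives them on its Avro source
agent2.sources.avroIn.type = avro
agent2.sources.avroIn.bind = 0.0.0.0
agent2.sources.avroIn.port = 4545
agent2.sources.avroIn.channels = ch1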


7)Does Apache Flume provide support for third-party plug-ins?


Answer)Yes. Apache Flume has a plug-in based architecture, which is why most data analysts use it: it can load data from external sources and transfer it to external destinations through custom plug-ins.


8)Is it possible to leverage real-time analysis on the big data collected by Flume directly? If yes, then explain how.


Answer)Yes. Data from Flume can be extracted, transformed, and loaded in real time into Apache Solr servers using the MorphlineSolrSink.
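
A rough configuration sketch, assuming the morphline Solr sink from the flume-ng-morphline-solr-sink module is available; the sink name, channel name, and morphline file path are illustrative:

agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.channel = ch1
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf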


9)What is a channel?


Answer)A channel stores events. Events are delivered to the channel by sources operating within the agent, and an event stays in the channel until a sink removes it for further transport.


10)What is an Interceptor?


Answer)An interceptor is applied to events between a source and its channels; it can modify or even drop events based on any criteria chosen by the developer.
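
Interceptors are attached to a source in the configuration; for example, the built-in timestamp and host interceptors can be chained like this (the agent and source names are illustrative):

agent.sources.src1.interceptors = i1 i2
agent.sources.src1.interceptors.i1.type = timestamp
agent.sources.src1.interceptors.i2.type = host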


11)Explain the replicating and multiplexing selectors in Flume.


Answer)Channel selectors are used to handle multiple channels. Based on the value of a Flume event header, an event can be written to just a single channel or to multiple channels. If a channel selector is not specified for the source, the replicating selector is used by default: it writes the same event to all the channels in the source's channel list. The multiplexing channel selector is used when the application has to send different events to different channels.
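
For example, a multiplexing selector that routes events by the value of a region header might be configured as follows (the agent, source, and channel names, and the header values, are illustrative):

agent.sources.src1.selector.type = multiplexing
agent.sources.src1.selector.header = region
agent.sources.src1.selector.mapping.US = ch1
agent.sources.src1.selector.mapping.EU = ch2
agent.sources.src1.selector.default = ch3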


12)Can Flume agents communicate with other agents?


Answer)No, each agent runs independently, which lets Flume scale horizontally with ease. As a result there is no single point of failure.


13)What are the complicated steps in Flume configuration?


Answer)Flume processes streaming data, so once it is started there is no fixed end to the process; it asynchronously flows data from the source to HDFS via the agent. First of all, the agent needs to know how the individual components are connected in order to load data, so the configuration is the trigger for loading streaming data. For example, consumerKey, consumerSecret, accessToken, and accessTokenSecret are the key parameters needed to download data from Twitter.


14)What is a Flume agent?


Answer)A Flume agent is a JVM process that hosts the Flume core components (source, channel, sink), through which events flow from an external source such as a web server to a destination such as HDFS. The agent is the heart of Apache Flume.


15)What is a Flume event?


Answer)A Flume event is a unit of data with a set of string attributes. An external source such as a web server sends events to the Flume source, and Flume has built-in functionality to understand the source format. Each log entry is considered an event. Every event has a header section and a body: the headers hold key-value metadata, and the body holds the actual payload.
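
As a sketch of what an event looks like in code, here is how one could be built with the Flume NG SDK's EventBuilder; the header values and body are illustrative, and the flume-ng-sdk dependency is assumed to be on the classpath:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventExample {
    public static void main(String[] args) {
        // headers: key-value metadata; timestamp is Unix time in milliseconds
        Map<String, String> headers = new HashMap<>();
        headers.put("timestamp", Long.toString(System.currentTimeMillis()));
        headers.put("host", "web-01");

        // body: the raw payload bytes, here a single log line
        Event event = EventBuilder.withBody(
                "GET /index.html 200".getBytes(StandardCharsets.UTF_8), headers);

        System.out.println(event.getHeaders());
        System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
    }
}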


16)Differentiate between FileSink and FileRollSink


Answer)The major difference between the HDFS File Sink and the File Roll Sink is that the HDFS File Sink writes the events into the Hadoop Distributed File System (HDFS), whereas the File Roll Sink stores the events on the local file system.


17)How to use the exec source?


Answer)Set the agent's source type property to exec, as shown below.

agents.sources.sourceid.type=exec
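
To make the source complete you also give it a command to run and attach it to a channel; for example (the command and channel name here are illustrative, continuing the agents/sourceid naming from above):

agents.sources.sourceid.command=tail -F /var/log/app.log
agents.sources.sourceid.channels=memoryChannel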


18)How to improve performance?


Answer)Batch the events: you can specify the number of events to be written per transaction by changing the batch size, which has a default value of 20.

agents.sources.sourceid.batchSize=2000


19)Why provide a higher value for batchSize?


Answer)When your input volume is large and you find that you cannot write to your channel fast enough, a bigger batch size will reduce the overall average transaction overhead per event. However, you should test different values before deciding on a batch size.


20)What is the problem with SpoolDir?


Answer)Whenever Flume restarts, due to an error or any other reason, it will create duplicate events for any files in the spooling directory that are re-transmitted because they had not been marked as finished.


21)How can I tell if I have a library loaded when flume runs?


Answer)From the command line, you can run flume classpath to see the jars and the order Flume is attempting to load them in.


22)How can I tell if a plugin has been loaded by a flume node?


Answer)You can look at the node's plugin status web page – http://<node>:35862/extension.jsp. Alternately, you can look at the logs.


23)Why does the master need to have plugins installed?


Answer)The master needs to have plugins installed in order to validate configs it is sending to nodes.


24)How can I tell if a plugin has been loaded by a flume master?


Answer)You can look at the master's plugin status web page – http://<master>:35871/masterext.jsp. Alternately, you can look at the logs.


25)How can I tell if my flume-site.xml configuration values are being read properly?


Answer)You can go to the node's or master's static config web page to see what configuration values are loaded.
Node: http://<node>:35862/staticconfig.jsp
Master: http://<master>:35871/masterstaticconfig.jsp


26)I'm having a hard time getting the LZO codec to work.


Answer)Flume by default reads $HADOOP_CONF_DIR/core-site.xml, which may have the io.compression.codecs setting set. You can mark the setting <final> so that flume does not attempt to override it.
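
The setting in core-site.xml might then look like the following; the codec list is illustrative (the LZO classes come from the separately installed hadoop-lzo package), so keep the list your cluster actually uses and simply add <final>true</final>:

<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
<final>true</final>
</property>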


27)I lose my configurations when I restart the master. What's happening?


Answer)The default path the master writes this information to is shown below. You may want to override it to a location that will persist across reboots, such as /var/lib/flume.
<property>
<name>flume.master.zk.logdir</name>
<value>/tmp/flume-${user.name}-zk</value>
<description>The base directory in which the ZBCS stores data.</description>
</property>


28)How can I get metrics from a node?


Answer)Flume nodes report metrics which you can use to debug and to see progress. You can look at a node's status web page by pointing your browser to port 35862. (http://<node>:35862).


29)How can I tell if data is arriving at the collector?


Answer)When events arrive at a collector, the source counters should be incremented on the node's metric page. For example, if you have a node called foo you should see the following fields have growing values when you refresh the page.
LogicalNodeManager.foo.source.CollectorSource.number of bytes
LogicalNodeManager.foo.source.CollectorSource.number of events


30)How can I tell if data is being written to HDFS?


Answer)Data doesn't "arrive" in HDFS until the file is closed or certain size thresholds are met. As events are written to HDFS, the sink counters on the collector's metric page should be incrementing. In particular, look for fields that match the following names:
*.Collector.GunzipDecorator.UnbatchingDecorator.AckChecksumChecker.InsistentAppend.append*
*.appendSuccesses are successful writes. If other values like appendRetries or appendGiveups are incremented, they indicate a problem with the attempts to write.


31)I am getting a lot of duplicated event data. Why is this happening and what can I do to make this go away?


Answer)tail/multiTail have been reported to restart file reads from the beginning of files if the modification rate reaches a certain level. This is a fundamental problem with a non-native implementation of tail. A workaround is to use the OS's tail mechanism in an exec source (exec("tail -n +0 -F filename")). Alternately, many people have modified their applications to push to a Flume agent with an open RPC port such as syslogTcp, thriftSource, or avroSource. In E2E mode, agents will attempt to retransmit data if no acks are received after flume.agent.logdir.retransmit milliseconds have expired (this is a flume-site.xml property). Acks do not return until after the collector's roll time, flume.collector.roll.millis, expires (this can be set in the flume-site.xml file or as an argument to a collector). Make sure that the retry time on the agents is at least 2x the roll time on the collector. If an agent in E2E mode goes down, it will attempt to recover on restart and resend data that did not receive acknowledgements, which may result in some duplicates.


32)I have encountered a "Could not increment version counter" error message.


Answer)This is a zookeeper issue that seems related to virtual machines or machines that change IP address while running. This should only occur in a development environment – the workaround here is to restart the master.


33)I have encountered an IllegalArgumentException related to checkArgument and EventImpl.


Answer)Here's an example stack trace:
2011-07-11 01:12:34,773 ERROR com.cloudera.flume.core.connector.DirectDriver: Driving src/sink failed! LazyOpenSource | LazyOpenDecorator because null
java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:75)
at com.cloudera.flume.core.EventImpl.<init>(EventImpl.java:97)
at com.cloudera.flume.core.EventImpl.<init>(EventImpl.java:87)
at com.cloudera.flume.core.EventImpl.<init>(EventImpl.java:71)
at com.cloudera.flume.handlers.syslog.SyslogWireExtractor.buildEvent(SyslogWireExtractor.java:120)
at com.cloudera.flume.handlers.syslog.SyslogWireExtractor.extract(SyslogWireExtractor.java:192)
at com.cloudera.flume.handlers.syslog.SyslogWireExtractor.extractEvent(SyslogWireExtractor.java:89)
at com.cloudera.flume.handlers.syslog.SyslogUdpSource.next(SyslogUdpSource.java:88)
at com.cloudera.flume.handlers.debug.LazyOpenSource.next(LazyOpenSource.java:57)
at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:89)
This indicates an attempt to create an event body that is larger than the maximum allowed body size (default 32k). You can increase the size of the max event by setting flume.event.max.size.bytes in your flume-site.xml file to a larger value. We are addressing this with issue FLUME-712.


34)I'm getting OutOfMemoryExceptions in my collectors or agents.


Answer)Add -XX:+HeapDumpOnOutOfMemoryError to the JVM_MEM_OPTS env variable or flume-env.sh file. This should dump heap upon these kinds of errors and allow you to determine what objects are consuming excessive memory by using the jhat java heap viewer program. There have been instances of queues that are unbounded. Several of these have been fixed in v0.9.5. There are also situations where queue sizes are too large for certain messages. For example, if batching is used, each event can take up more memory. The default queue size in thrift sources is 1000 items. With batching, individual events can become megabytes in size, which may cause memory exhaustion. For example, making batches of 1000 1000-byte messages with a queue of 1000 events could result in flume requiring 1GB of memory! In these cases, reduce the size of the thrift queue to bound the potential memory usage by setting flume.thrift.queuesize:

<property>

<name>flume.thrift.queuesize</name>

<value>500</value>

</property>

