Apache Accumulo Interview Questions and Answers

Apache Accumulo is a highly scalable structured store based on Google’s BigTable. Accumulo is written in Java and operates over the Hadoop Distributed File System (HDFS), which is part of the popular Apache Hadoop project. Accumulo supports efficient storage and retrieval of structured data, including queries for ranges, and provides support for using Accumulo tables as input and output for MapReduce jobs. Accumulo features automatic load-balancing and partitioning, data compression and fine-grained security labels.


1) How do I remove an Accumulo instance? I created an instance while initializing Accumulo by calling accumulo init, but now I want to remove that instance and create a new one. How can this be done?


Answer) Remove the directory specified by the instance.dfs.dir property in $ACCUMULO_HOME/conf/accumulo-site.xml from HDFS.

If you did not specify an instance.dfs.dir in accumulo-site.xml, the default is "/accumulo".

You should then be able to call accumulo init with success.


2) How are tablets mapped to DataNodes or HDFS blocks? If one tablet is split into multiple HDFS blocks (say, 8), would those blocks be stored on the same DataNode, on different DataNodes, or does it not matter?


Answer) Tablets are stored in blocks like all other files in HDFS. You will typically see all blocks for a single file on at least one DataNode (this isn't always the case, but it seems to mostly hold true when I've looked at block locations for larger files).


3) Would all data for a given row (say, RowC, or RowA or RowB) go into the same HDFS block or different HDFS blocks?


Answer) It depends on the block size configured for your tablets' files (dfs.block.size, or the Accumulo property table.file.blocksize if it is set). If the block size is the same as the tablet size, then obviously the row's data will be in the same HDFS block. If the block size is smaller than the tablet size, it's essentially down to chance whether the data ends up in the same block or not.
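For reference, here is a minimal sketch of setting that property through the Java API; the instance name, ZooKeeper hosts, credentials, table name, and the 128M value are all placeholders:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class SetBlockSize {
    public static void main(String[] args) throws Exception {
        // placeholder connection details -- adjust for your cluster
        ZooKeeperInstance inst = new ZooKeeperInstance("myInstance", "zk1:2181,zk2:2181");
        Connector conn = inst.getConnector("root", new PasswordToken("secret"));

        // table.file.blocksize controls the HDFS block size used for the table's files;
        // here it is set to 128 MB for a hypothetical table named "mytable"
        conn.tableOperations().setProperty("mytable", "table.file.blocksize", "128M");
    }
}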


4) When executing a MapReduce job, how many mappers would I get? (One per HDFS block? One per tablet? One per server?)


Answer) This depends on the ranges you give InputFormatBase.setRanges(Configuration, Collection<Range>).

If you scan the entire table (-inf -> +inf), you'll get a number of mappers equal to the number of tablets (subject to the caveat around disableAutoAdjustRanges). If you define specific ranges, the behavior depends on whether you've called InputFormatBase.disableAutoAdjustRanges(Configuration):

If you have called this method, you'll get one mapper per defined range. Importantly, if you have a range that starts in one tablet and ends in another, that entire range is still processed by a single mapper.

If you don't call this method and you have a range that spans multiple tablets, you'll get one mapper for each tablet the range covers (see the sketch below).
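For illustration, here is a minimal sketch of that configuration against the Job-based methods found in later Accumulo 1.x releases (the older Configuration-based calls named above behave the same way). The credentials, table name, and row boundaries are placeholders, and exact method signatures vary by Accumulo version:

import java.util.Collections;

import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Range;
import org.apache.hadoop.mapreduce.Job;

public class RangeSetup {
    public static void configure(Job job) throws Exception {
        // placeholder credentials and table name (ZooKeeper instance setup omitted)
        AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"));
        AccumuloInputFormat.setInputTableName(job, "mytable");

        // one explicit range; with auto-adjust disabled this yields exactly one mapper,
        // even if the range happens to span several tablets
        AccumuloInputFormat.setRanges(job, Collections.singleton(new Range("rowA", "rowC")));
        AccumuloInputFormat.setAutoAdjustRanges(job, false);
    }
}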


5) How do I create a Spark RDD from Accumulo?


Answer) Generally, with custom Hadoop InputFormats, the configuration is specified using a JobConf. There are static methods on AccumuloInputFormat that you can use to configure the JobConf. In this case, what you would want to do is roughly:


val jobConf = new JobConf() // create a job conf

// configure the job conf with our Accumulo properties
AccumuloInputFormat.setConnectorInfo(jobConf, principal, token)
AccumuloInputFormat.setScanAuthorizations(jobConf, authorizations)
val clientConfig = new ClientConfiguration().withInstance(instanceName).withZkHosts(zooKeepers)
AccumuloInputFormat.setZooKeeperInstance(jobConf, clientConfig)
AccumuloInputFormat.setInputTableName(jobConf, tableName)

// create an RDD using the jobConf
val rdd2 = sc.newAPIHadoopRDD(jobConf,
  classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
  classOf[org.apache.accumulo.core.data.Key],
  classOf[org.apache.accumulo.core.data.Value]
)


6) How to filter Scan on Accumulo using RegEx?


Answer) The Filter class lays the framework for the functionality you want. To create a custom filter, you need to extend Filter and implement the accept(Key k, Value v) method. If you are only looking to filter based on regular expressions, you can avoid writing your own filter by using RegExFilter.


Using a RegExFilter is straightforward. Here is an example:


// first connect to Accumulo
ZooKeeperInstance inst = new ZooKeeperInstance(instanceName, zooServers);
Connector connect = inst.getConnector(user, password);

// initialize a scanner
Scanner scan = connect.createScanner(myTableName, myAuthorizations);

// to use a filter, which is an iterator, you must create an IteratorSetting
// specifying which iterator class you are using
IteratorSetting iter = new IteratorSetting(15, "myFilter", RegExFilter.class);

// next set the regular expressions to match. Here, we want all key/value pairs in
// which the column family begins with "J"
String rowRegex = null;
String colfRegex = "J.*";
String colqRegex = null;
String valueRegex = null;
boolean orFields = false;
RegExFilter.setRegexs(iter, rowRegex, colfRegex, colqRegex, valueRegex, orFields);

// now add the iterator to the scanner, and you're all set
scan.addScanIterator(iter);

The first two parameters of the IteratorSetting constructor (the priority and the name) don't matter much in this case; they only come into play relative to other iterators configured on the same scan. Once you've added the above code, iterating through the scanner will only return key/value pairs that match the regex parameters.


7) Connecting to Accumulo inside a Mapper using Kerberos


Answer) The provided AccumuloInputFormat and AccumuloOutputFormat have a method to set the token in the job configuration: Accumulo*putFormat.setConnectorInfo(job, principal, token). You can also serialize the token to a file in HDFS using the AuthenticationTokenSerializer and use the version of setConnectorInfo that accepts a file name.

If a KerberosToken is passed in, the job will create a DelegationToken to use, and if a DelegationToken is passed in, it will just use that.

The provided AccumuloInputFormat manages its own scanner, so normally you shouldn't have to create one in your Mapper if you've set the configuration properly. However, if you're doing a secondary scan (for something like a join) inside your Mapper, you can inspect the provided AccumuloInputFormat's RecordReader source code for an example of how to retrieve the configuration and construct a Scanner.
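A rough sketch of the token setup described above, assuming Accumulo 1.7+ where KerberosToken and delegation tokens are available; the principal is a placeholder:

import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.client.security.tokens.KerberosToken;
import org.apache.hadoop.mapreduce.Job;

public class KerberosJobSetup {
    public static void configure(Job job) throws Exception {
        // passing a KerberosToken causes the job to obtain an Accumulo DelegationToken
        // on the client side and ship that to the tasks instead of Kerberos credentials
        AccumuloInputFormat.setConnectorInfo(job, "user@EXAMPLE.COM", new KerberosToken());
    }
}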


8) How do I get a count for a database query in Accumulo?


Answer) Accumulo is a lower-level application than a traditional RDBMS. It is based on Google's BigTable design and is not a relational database; it is more accurately described as a massively parallel sorted map than as a database.

It is designed to do different kinds of tasks than a relational database, and its focus is on big data.

To get a count of the size of an arbitrary query's result set in Accumulo, you can write a server-side Iterator that returns a count from each tablet server, and sum those partial counts on the client side to get a total. If you can anticipate your queries, you can also build an index that keeps track of counts during the ingest of your data.

Creating custom Iterators is an advanced activity. Typically, there are important trade-offs (time/space/consistency/convenience) to implementing something as seemingly simple as a count of a result set, so proceed with caution. I would recommend consulting the user mailing list for information and advice.
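For contrast, the naive client-side approach below simply iterates the scan and counts; every matching entry is streamed back to the client, which is exactly the cost a server-side counting iterator avoids. The Connector, table name, and empty authorizations are placeholders:

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class ClientSideCount {
    public static long count(Connector conn, String tableName) throws Exception {
        // scan the whole table (or set a Range/iterators to narrow the query)
        Scanner scan = conn.createScanner(tableName, new Authorizations());
        long total = 0;
        for (Entry<Key, Value> entry : scan) {
            total++;   // each key/value pair is pulled to the client just to be counted
        }
        return total;
    }
}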


9) How do you use “Range” to scan an entire table in Accumulo?


Answer) It may help to see this as a single line of code.

If you have a scanner, cleverly named 'scanner', you can use its setRange() method to set the range it will cover. A Range created with the no-argument constructor covers (-inf, +inf), so passing a newly created Range object to setRange gives your scanner a range of (-inf, +inf), which means it will scan the entire table.

The sample code looks like:

scanner.setRange(new Range());
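A slightly fuller sketch of the same idea, assuming an existing Connector named connect and a placeholder table name with empty authorizations:

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class FullTableScan {
    public static void printAll(Connector connect, String tableName) throws Exception {
        Scanner scanner = connect.createScanner(tableName, new Authorizations());
        scanner.setRange(new Range());   // a no-argument Range covers (-inf, +inf)
        for (Entry<Key, Value> entry : scanner) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}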


10) What CAP type does Apache Accumulo have?


Answer) Apache Accumulo is based on the Google BigTable paper and shares a lot of similarities with Apache HBase. All three of these systems are designed to be CP: nodes will simply go down rather than serve inconsistent data.


11) How do I set an environment variable in a YARN Spark job?


Answer) The problem is that CDH5 ships Spark 1.0.0, and the job was being run via YARN. In YARN mode, Spark 1.0.0 does not pay attention to the executor environment settings; instead it uses the environment variable SPARK_YARN_USER_ENV to control its environment. Ensuring that SPARK_YARN_USER_ENV contains ACCUMULO_CONF_DIR=/etc/accumulo/conf works, and makes ACCUMULO_CONF_DIR visible in the executors' environment where the job needs it.


12) Does Accumulo actually need all ZooKeeper servers listed?


Answer) ZooKeeper servers operate as a coordinated group, where the group as a whole determines the value of a field at any given time based on consensus among the servers. If you have a 5-node ZooKeeper ensemble running, all 5 server names are relevant; you should not treat them as 5 redundant 1-node instances. Accumulo, like other ZooKeeper clients, actually uses all of the servers listed.
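As an illustration, a client connection would normally list the whole quorum; the instance name, host names, and credentials below are placeholders:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class QuorumConnect {
    public static void main(String[] args) throws Exception {
        // all five ZooKeeper servers are listed, so the client can fail over
        // between them rather than depending on any single node
        ZooKeeperInstance inst = new ZooKeeperInstance(
                "myInstance", "zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181");
        Connector conn = inst.getConnector("user", new PasswordToken("secret"));
        System.out.println("Connected as: " + conn.whoami());
    }
}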

