Your absolute guide to managing Hadoop Logging Configurations

Hadoop has become an essential component of the infrastructure of many companies. There are different distributions maintained and managed by different companies such as Cloudera, Databricks, and AWS. The distribution managed by AWS is named EMR. This distribution is supposedly fully managed by AWS, although not everything is handled for you.

One of the things that has to be properly managed in order to get scalable performance from Hadoop and Spark is the logs written by YARN and Spark. In this post, I will walk you through the different configurations that have to be tweaked in YARN and Spark to ensure clear visibility through logs without bottlenecking the scalability of your setup.

YARN Logs

By default, YARN writes all application logs to the local disk of each worker instance.

Health Service

These logs keep adding up over time until the disk gets full. By default, YARN marks a node as unhealthy once its disk space utilization passes the 90% threshold. This threshold and other related settings can be configured through the yarn-site.xml file.

<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>98.5</value>
</property>

More parameters worth checking (an example snippet follows the list):

  • yarn.nodemanager.disk-health-checker.min-healthy-disks
  • yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb
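A hedged yarn-site.xml sketch covering both parameters; the values are illustrative rather than recommendations (here, at least 25% of the configured disks must be healthy and each disk must keep 1 GB free):

  <property>
    <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
    <value>0.25</value>
  </property>
  <property>
    <name>yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb</name>
    <value>1024</value>
  </property>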

Logs Aggregation

In order to clean up these logs while keeping disk usage within the required limits, YARN allows regular log aggregation and storage on HDFS.

By default, it is enabled on EMR:

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
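The HDFS destination of the aggregated logs is controlled by yarn.nodemanager.remote-app-log-dir (the upstream default is /tmp/logs). A hedged sketch if you want to point it at a dedicated directory; the path below is purely illustrative:

  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/yarn/app-logs</value>
  </property>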

Every time YARN performs the aggregation, it cleans the local disk once the data is stored on HDFS. The cleaning is controlled by two main attributes that can also be tweaked.

The cleaning of the local disk is controlled by:

  <property>
    <name>yarn.nodemanager.delete.debug-delay-sec</name>
    <value>0</value>
  </property>

While the cleaning of the aggregated data on HDFS is controlled by:

  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>172800</value>
  </property>
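How often YARN checks for aggregated logs that have outlived this retention is governed by the companion setting yarn.log-aggregation.retain-check-interval-seconds; a hedged sketch with an illustrative one-hour interval (if left unset or non-positive, YARN derives the check interval from the retention time):

  <property>
    <name>yarn.log-aggregation.retain-check-interval-seconds</name>
    <value>3600</value>
  </property>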

Since EMR AMI 4.3.0, YARN log aggregation and EMR log aggregation can work together. EMR basically introduced the patch described in https://issues.apache.org/jira/browse/YARN-3929 exclusively as part of EMR Hadoop.

With this patch, the local files are kept on the local file system so that EMR-based aggregation can collect the data.

  <property>
    <name>yarn.log-aggregation.enable-local-cleanup</name>
    <value>false</value>
  </property>

This property is not public and can only be set on EMR distributions. It defaults to false on EMR, which means the cleanup on local machines will not take place.

LogPusher needs these logs on the local machines in order to push them to S3, and LogPusher is also the one responsible for removing those local logs after they are copied over and after a certain retention period (4 hours for containers). This configuration can be very useful if you rely on ephemeral clusters, as any data left on HDFS would be lost if it is not moved to S3.

Accessing YARN Logs

YARN logs can be accessed in different ways depending on the configuration. If automatic aggregation isn't enabled, you can gather the logs from the Hadoop UI.

Otherwise, you can easily rely on the YARN CLI after passing the application ID.

yarn logs -applicationId application_1432041223735_0001 > appID_1432041223735_0001.log
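If you don't have the application ID at hand, you can list applications first and then fetch their logs; a minimal sketch, assuming the job has already finished and log aggregation is enabled:

yarn application -list -appStates FINISHED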

Spark Logs

In addition to the traditional YARN logs, Apache Spark writes job history files to the filesystem. Cleaning these files regularly is also necessary to avoid filling up the space. The Spark configs (spark-defaults) allow us to enable the history server to periodically clean up event logs from storage.

  • spark.hadoop.yarn.timeline-service.enabled true
  • spark.history.fs.cleaner.maxAge 1d
  • spark.history.fs.cleaner.interval 1d
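Note that the history server cleaner only runs when spark.history.fs.cleaner.enabled is set to true. A minimal spark-defaults sketch putting these together; the one-day retention values are illustrative:

spark.hadoop.yarn.timeline-service.enabled  true
spark.history.fs.cleaner.enabled            true
spark.history.fs.cleaner.interval           1d
spark.history.fs.cleaner.maxAge             1d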

YARN Cache

NodeManager maintains and manages several local caches of the files downloaded for any job. The resources are uniquely identified based on the remote URL originally used while copying each file. There are a lot of configs that control resource localization and caching in general, but the important note here is that a limit can also be set to automatically trigger cleaning of the cache before the disk fills up unnecessarily.

  • yarn.nodemanager.localizer.cache.target-size-mb = 4096 (4 GB)
  • yarn.nodemanager.localizer.cache.cleanup.interval-ms = 300000 (5 minutes)
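Expressed in yarn-site.xml, a hedged sketch using the same illustrative values:

  <property>
    <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
    <value>300000</value>
  </property>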

YARN and Spark have a lot of interesting configurations that drastically change the performance of the execution engine, sometimes in surprising ways. By investing some time to understand the ins and outs of the different pieces, you can easily save yourself a lot of headache and ensure smooth, consistent, and scalable performance.
