Just a short blog post about the HDFS directory item limit and NameNode memory limits.
I recently ran into an issue with Hive that produced the following error message:
The directory item limit of /tmp/hive/hive is exceeded: limit=1048576 items=1048576
The parameter which caused the error above is called dfs.namenode.fs-limits.max-directory-items. Its description reads:
Defines the maximum number of items that a directory may contain.
In other words, the parameter controls how many items (files and subdirectories) a single directory on HDFS may contain.
To check the current setting of the parameter you could run:
hdfs getconf -confKey dfs.namenode.fs-limits.max-directory-items
The parameter's default value is 1048576, which seems to be quite enough. But imagine you run thousands of jobs a day and each job writes a single logfile to a specific directory; then the limit will soon be reached.
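To see how close a directory is to the limit, you can count its direct children and compare that number with the configured value. The following is only a minimal sketch: the helper names count_direct_children and limit_usage_percent are made up for this post, and it assumes the usual `hdfs dfs -ls` output with a "Found N items" header line followed by one line per entry.

```shell
# Count the direct children in `hdfs dfs -ls` output read from stdin.
# The "Found N items" header line and empty lines are skipped.
count_direct_children() {
  awk '!/^Found [0-9]+ items$/ && NF { n++ } END { print n + 0 }'
}

# Print how much of the limit is used, as an integer percentage (rounded down).
# $1 = number of items in the directory, $2 = configured limit
limit_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Usage on a cluster (needs a working hdfs client, not run here):
#   items=$(hdfs dfs -ls /tmp/hive/hive | count_direct_children)
#   limit=$(hdfs getconf -confKey dfs.namenode.fs-limits.max-directory-items)
#   echo "$items of $limit items ($(limit_usage_percent "$items" "$limit")%)"
```

Listing a directory with close to a million entries takes a while, so this is something to run occasionally rather than in a tight loop.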
A quick fix is to delete the files that are no longer needed from the directory causing the error (e.g. /tmp/hive/hive).
I've also discovered a script on GitHub for cleaning up the /tmp directory on HDFS.
Please test carefully before running in production environments 😉
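If you prefer to do the cleanup by hand, one possible approach is to pick the entries older than a cutoff date out of the directory listing and review them before deleting anything. This is a rough sketch and not the GitHub script mentioned above: paths_older_than is a hypothetical helper, it assumes the standard `hdfs dfs -ls` column layout (date in column 6, path in column 8), and it does not handle paths containing spaces.

```shell
# Print the paths from `hdfs dfs -ls` output (stdin) whose modification
# date is before the cutoff. ISO dates compare correctly as plain strings.
# $1 = cutoff date as YYYY-MM-DD
paths_older_than() {
  awk -v cutoff="$1" 'NF >= 8 && $6 < cutoff { print $8 }'
}

# Usage on a cluster (not run here; `date -d` assumes GNU date):
#   cutoff=$(date -d '-7 days' +%F)
#   hdfs dfs -ls /tmp/hive/hive | paths_older_than "$cutoff" > candidates.txt
#   # review candidates.txt first, then delete, for example with:
#   #   xargs -n 100 hdfs dfs -rm -r -skipTrash < candidates.txt
```

Writing the candidates to a file first gives you a chance to check that no directory of a still-running session is on the list.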
To prevent this issue you could raise the limit. In older Hadoop releases, setting the parameter to "0" disabled the check; newer releases only accept values between 1 and 6400000. Either way, the parameter needs to be specified in hdfs-site.xml, which will lead to a restart of the whole HDFS stack.
Nevertheless, do not forget that the maximum number of files in HDFS depends on the amount of memory available to the NameNode.
As a rule of thumb, the NameNode should be allocated 1000 MB of memory per million blocks stored in HDFS.
So with a block size of 128 MB, a million blocks would be
128 MB * 1.000.000 blocks = 128.000.000 MB = 128 TB
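So with 1000 MB allocated to the NameNode you could manage a cluster holding about 128 TB of data, assuming every block is full. With a replication factor above 1, the raw disk space needed for that data is correspondingly larger.
Keep in mind that the 1000 MB are used just by the NameNode to hold the block metadata in memory. The node itself requires additional memory for the other services running on it and for the OS itself.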
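The rule of thumb can be turned into a small calculation. The sketch below is only an estimate under the same assumptions as the example above: 1000 MB per million blocks, 1 TB counted as 1.000.000 MB, and every block completely filled (many small files produce more blocks and therefore need more memory). The function name estimate_nn_heap_mb is made up for this post.

```shell
# Estimate the NameNode memory (in MB) needed for the block metadata.
# $1 = amount of data in TB (before replication), $2 = block size in MB
estimate_nn_heap_mb() {
  nn_data_mb=$(( $1 * 1000000 ))
  # Number of blocks, rounded up to whole blocks.
  nn_blocks=$(( (nn_data_mb + $2 - 1) / $2 ))
  # 1000 MB per million blocks, rounded up to whole MB.
  echo $(( (nn_blocks * 1000 + 999999) / 1000000 ))
}

# Example: 128 TB of data with a block size of 128 MB
#   estimate_nn_heap_mb 128 128
```

For the 128 TB example this gives the 1000 MB from above; doubling the data or halving the block size doubles the estimate.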
For reference, some links with examples of how to calculate NameNode memory: