Like almost all Mondays, today was a very challenging one. The first thing I noticed was that our primary namenode had run into some issues over the weekend and gone down, which meant the secondary namenode, namenode-02, was active. I checked namenode-01 and made sure it was okay before making it active again. After that, when I arrived at the office, I was made aware that a very critical range of our ETL jobs had been failing for over 12 hours.
Like anyone would when jobs fail, the first thing I did was look into the logs for those jobs. All of them had failed with this error:
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block
It is not hard to guess that HDFS is complaining about not being able to find some blocks of data it needs. So I navigated to Ambari’s HDFS page, but no missing blocks were being reported.
Therefore we can conclude that the data blocks are there, but for some reason the namenode is not able to access them when jobs are submitted to the cluster.
The second thing I noticed was that after the primary namenode was made active again, jobs started working fine and completing successfully. That hints that something was wrong with namenode-02. So I navigated to the web UIs of our two namenodes:
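If you would rather script this check than click through Ambari, here is a minimal sketch that reads the missing-block counter straight from a namenode. It rests on assumptions that are not from my cluster notes: that the namenode exposes its metrics at /jmx, that the FSNamesystem bean carries a MissingBlocks counter, and that the web UI listens on port 50070 (newer Hadoop releases use a different port). The function names are made up for illustration.

```python
import json
from urllib.request import urlopen

# Assumption: the namenode serves metrics at /jmx and the FSNamesystem bean
# includes a MissingBlocks counter.
FS_QUERY = "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"


def missing_blocks(jmx_payload):
    """Return the MissingBlocks counter from a parsed FSNamesystem payload."""
    return int(jmx_payload["beans"][0]["MissingBlocks"])


def fetch_missing_blocks(base_url):
    """Query one namenode, e.g. base_url='http://namenode-01:50070' (assumed port)."""
    with urlopen(base_url + FS_QUERY, timeout=10) as resp:
        return missing_blocks(json.load(resp))
```

Calling fetch_missing_blocks against both namenode-01 and namenode-02 shows whether the two disagree about the state of the filesystem.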
There it is! I know we have 33 datanodes in our cluster, but the secondary namenode shows only 30. So I restarted the node manager on the datanodes that were not listed for namenode-02 and refreshed the page:
Now all the datanodes are recognized by both namenodes and everyone lives happily ever after!
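The comparison I did by eye in the two web UIs can also be scripted. The sketch below is only an illustration under a few assumptions: that each namenode exposes a NameNodeInfo bean at /jmx, that its LiveNodes attribute is a JSON-encoded string keyed by datanode host (possibly with a ":port" suffix), and that the UI port is 50070. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: NameNodeInfo exposes LiveNodes as a JSON-encoded string.
INFO_QUERY = "/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"


def live_datanodes(jmx_payload):
    """Return the set of datanode hosts listed under LiveNodes."""
    live = json.loads(jmx_payload["beans"][0]["LiveNodes"])
    # Keys may be "host" or "host:port"; keep only the host part.
    return {key.split(":")[0] for key in live}


def missing_on_second(first_payload, second_payload):
    """Datanodes the first namenode sees but the second does not."""
    return sorted(live_datanodes(first_payload) - live_datanodes(second_payload))


def fetch_info(base_url):
    """Fetch the NameNodeInfo payload, e.g. from 'http://namenode-02:50070'."""
    with urlopen(base_url + INFO_QUERY, timeout=10) as resp:
        return json.load(resp)
```

With that in place, missing_on_second(fetch_info("http://namenode-01:50070"), fetch_info("http://namenode-02:50070")) would list the datanodes that namenode-02 has lost track of, which are the ones to restart.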
Note that you may check the namenodes’ web UIs and not see any missing datanodes. Even so, restarting the node managers on all datanodes will resolve your issue.