Hadoop failover (and hopefully failback)

1 minute read

We’ve decided to test using Linux Heartbeat together with Hadoop, to enable failover (and failback) capabilities.

Infrastructure: take 3 servers: A is the NN, B is the SNN and will later be used as the NN, and C is a datanode. The hadoop-site.xml files on both A and B use THE SAME LOCATION for their SNN checkpoint directory.

3 servers:
A - Hadoop NameNode - its fs.checkpoint.dir is configured to point at the checkpoint directory on server B
B - Hadoop SNN - its fs.checkpoint.dir is configured to be a local directory
C - Hadoop DN
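The layout above might look like the following hadoop-site.xml fragment on B (a sketch only: the hostname, port, and checkpoint path are illustrative assumptions, not taken from the actual cluster):

```xml
<!-- hadoop-site.xml on B (the SNN). -->
<!-- Hostname, port, and path below are illustrative assumptions. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://serverA:9000</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/usr/apps/hadoop/checkpoint</value>
</property>
```

A and B pointing at the same fs.checkpoint.dir location is what makes the checkpoint usable by whichever machine is acting as the NN.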

Scenario 1: Failover

  1. Run the regular (above) configuration.
  2. Insert some files
  3. Kill the NN on A.
  4. Stop the DN on C (this is only required because we don’t use the Heartbeat yet).
  5. (Create a /usr/apps/hadoop/name dir on B and update the hadoop-site files on B and C.)
  6. Start the NN on B with the flag: hadoop namenode -importCheckpoint
  7. Start DN on C.
  8. Check if all the relevant files exist.
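Steps 5–8 above could be scripted roughly like this (the paths and the $HADOOP_HOME layout are assumptions; this mirrors the manual steps rather than being a tested script):

```shell
# Run on B after the NN on A has died and the DN on C has been stopped.
# Assumes hadoop-site.xml on B and C already points at B as the new NN.

# Step 5: create the name directory the new NN will use.
mkdir -p /usr/apps/hadoop/name

# Step 6: start the NN on B, importing the latest checkpoint written
# by the SNN (-importCheckpoint loads the image from fs.checkpoint.dir).
$HADOOP_HOME/bin/hadoop namenode -importCheckpoint

# Step 7: back on C, start the DN so it registers with the new NN.
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode

# Step 8: sanity-check that the files survived the failover.
$HADOOP_HOME/bin/hadoop fs -ls /
```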

Status: Works

Scenario 2: Failback
Continue from the previous scenario: (NN and SNN on B, DN on C, nothing on A)

  1. Insert some files
  2. Kill the NN on B.
  3. Stop the DN on C (this is only required because we don't use the Heartbeat yet).
  4. (Update the hadoop-site files on B and C.)
  5. Start the NN on A with the flag: hadoop namenode -importCheckpoint
  6. Start DN on C.
  7. Check if all the relevant files exist.

Status: Fail (not back, just fail).
The NN on A doesn't fail back. It claims it already has a valid image file in its local location, but then detects errors and shuts down.
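A possible cause (an assumption, not verified here): -importCheckpoint only loads the checkpoint when the local name directory is empty, so the stale pre-failover image still sitting on A wins, and it no longer matches the cluster's current state. Moving the old name directory aside before the import might let the failback complete:

```shell
# On A, before restarting the NN (paths are assumptions matching the
# /usr/apps/hadoop/name dir used in the failover steps above).
# Move the stale pre-failover image out of the way so -importCheckpoint
# starts from the SNN checkpoint instead of the outdated local image.
mv /usr/apps/hadoop/name /usr/apps/hadoop/name.stale
mkdir -p /usr/apps/hadoop/name
$HADOOP_HOME/bin/hadoop namenode -importCheckpoint
```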