This second step is less obvious than the first and works around a subtlety in Hadoop's data transfer architecture. Traffic between the DataNodes and the NameNode occurs over a custom RPC protocol; the port for this protocol is specified in the URI supplied to the fs.default.name property. The NameNode also runs a Jetty web servlet engine on port 50070, which generates status pages detailing the NameNode's operation and also communicates with the SecondaryNameNode. The SecondaryNameNode performs an HTTP GET request against this servlet engine to retrieve the current FSImage (checkpoint) and EditLog from the NameNode, and uses HTTP POST to upload the new checkpoint back to the NameNode. The conf/hadoop-default.xml file sets dfs.http.address to 0.0.0.0:50070; the NameNode listens on this host mask and port (by default, all inbound interfaces on port 50070), and the SecondaryNameNode uses the same value as the address to connect to, special-casing 0.0.0.0 as localhost. Running the SecondaryNameNode on a different machine therefore requires telling that machine where to reach the NameNode.
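For example, a hadoop-site.xml entry like the following sketch, placed on the SecondaryNameNode's machine, would point it at the NameNode's HTTP server (namenode.example.com is a hypothetical hostname standing in for your actual NameNode):

  <!-- hadoop-site.xml on the SecondaryNameNode's machine only;
       namenode.example.com is a hypothetical NameNode hostname -->
  <property>
    <name>dfs.http.address</name>
    <value>namenode.example.com:50070</value>
  </property>

With this override in place, the SecondaryNameNode's HTTP GET and POST requests go to namenode.example.com:50070 rather than to localhost.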
Ordinarily this setting could simply be placed in the hadoop-site.xml file used by all daemons on all nodes. In an environment such as Amazon EC2, though, where a node is known by multiple addresses (one public IP and one private IP), it is preferable to have the SecondaryNameNode connect to the NameNode over the private (unmetered bandwidth) IP address, while you connect to the public IP address for status pages. Specifying dfs.http.address as anything other than 0.0.0.0 on the NameNode will cause it to bind to only one address instead of all available ones.
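One way to reconcile these two needs, sketched below with made-up addresses, is to leave the NameNode's own configuration at the wildcard value so it binds every interface, and override dfs.http.address only in the copy of hadoop-site.xml on the SecondaryNameNode's machine, pointing it at the NameNode's private IP (10.251.27.5 here is a hypothetical EC2 private address):

  <!-- NameNode's hadoop-site.xml: keep the wildcard so the NameNode
       binds all interfaces and the public status pages stay reachable -->
  <property>
    <name>dfs.http.address</name>
    <value>0.0.0.0:50070</value>
  </property>

  <!-- SecondaryNameNode's hadoop-site.xml only: connect over the
       private (unmetered) address; 10.251.27.5 is a made-up private IP -->
  <property>
    <name>dfs.http.address</name>
    <value>10.251.27.5:50070</value>
  </property>

The cost of this arrangement is that the two machines carry slightly different configuration files, so the override must be kept out of any hadoop-site.xml that is copied to the NameNode itself.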
In conclusion, larger deployments of HDFS will require a remote SecondaryNameNode, but doing so requires a subtle configuration tweak to ensure that the SecondaryNameNode can communicate with the remote NameNode.