Copying Data Between Two Clusters Using Distcp
The Distcp Command
The distributed copy command, distcp, is a general utility for copying large data sets between distributed filesystems within and across clusters. The distcp command submits a regular MapReduce job that performs a file-by-file copy.
$ hadoop distcp
- Do not run distcp as the hdfs user, which is blacklisted for MapReduce jobs by default.
- Do not use Hadoop shell commands (such as cp, copyFromLocal, put, get) for large copying jobs, or you may experience I/O bottlenecks.
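The first restriction above can be enforced with a small guard in a wrapper script. The following is a minimal sketch, not part of the distcp tooling: the function names distcp_user_allowed and safe_distcp are hypothetical, and it assumes the hadoop command is on the PATH.

```shell
# Hypothetical helper: succeed (exit status 0) unless the given user is "hdfs".
distcp_user_allowed() {
    [ "$1" != "hdfs" ]
}

# Hypothetical wrapper: launch distcp only when the current user is allowed.
safe_distcp() {
    local user
    user="$(id -un)"
    if ! distcp_user_allowed "$user"; then
        echo "refusing to run distcp as '$user'" >&2
        return 1
    fi
    # All arguments are passed through to distcp unchanged.
    hadoop distcp "$@"
}
```

Calling safe_distcp with the same arguments as hadoop distcp keeps the check in one place for scheduled copy jobs.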
Distcp Syntax and Examples
You can use distcp to copy files between compatible clusters in either direction. Depending on the protocols involved, you run the command from either the source or the destination cluster.
For example, when upgrading, say from CDH 4 to CDH 5, you should run distcp from the CDH 5 cluster in this manner:
$ hadoop distcp hftp://cdh4-namenode:50070/ hdfs://CDH5-nameservice/
$ hadoop distcp s3a://bucket/ hdfs://CDH5-nameservice/
You can also use a specific path, such as /hbase to move HBase data, for example:
$ hadoop distcp hftp://cdh4-namenode:50070/hbase hdfs://CDH5-nameservice/hbase
$ hadoop distcp s3a://bucket/file hdfs://CDH5-nameservice/bucket/file
HFTP Protocol
HFTP is a read-only Hadoop filesystem that reads data from a remote HDFS cluster over HTTP; despite its name, it is unrelated to FTP. When copying with distcp across different versions of CDH, use hftp:// for the source filesystem and hdfs:// for the destination filesystem, and run distcp from the destination cluster. The default port for HFTP is 50070 and the default port for HDFS is 8020.
Example of a source URI: hftp://namenode-location:50070/basePath
- hftp:// is the source protocol.
- namenode-location is the CDH 4 (source) NameNode hostname as defined by its configured fs.default.name.
- 50070 is the NameNode's HTTP server port, as defined by the configured dfs.http.address.
Example of a destination URI: hdfs://nameservice-id/basePath or hdfs://namenode-location
- hdfs:// is the destination protocol.
- nameservice-id or namenode-location is the CDH 5 (destination) NameNode hostname as defined by its configured fs.defaultFS.
- basePath in both examples refers to the directory you want to copy, if one is specifically needed.
- HFTP is a read-only protocol and can only be used for the source cluster, not the destination.
- HFTP cannot be used when copying with distcp from an insecure cluster to a secure cluster.
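The URI forms described above can be assembled with two small helpers. This is an illustrative sketch only: the function names hftp_source_uri and hdfs_dest_uri are hypothetical, and the port defaults to 50070 as stated in this section.

```shell
# Hypothetical helper: build an HFTP source URI.
# Arguments: NameNode host, base path (default "/"), HTTP port (default 50070).
hftp_source_uri() {
    local host="$1"
    local path="${2:-/}"
    local port="${3:-50070}"
    printf 'hftp://%s:%s%s\n' "$host" "$port" "$path"
}

# Hypothetical helper: build an HDFS destination URI.
# Arguments: nameservice ID or NameNode host, base path (default "/").
hdfs_dest_uri() {
    local authority="$1"
    local path="${2:-/}"
    printf 'hdfs://%s%s\n' "$authority" "$path"
}

# Example use, run from the destination cluster:
#   hadoop distcp "$(hftp_source_uri cdh4-namenode /hbase)" \
#                 "$(hdfs_dest_uri CDH5-nameservice /hbase)"
```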
S3 Protocol
Amazon S3 block and native filesystems are also supported with the s3a:// protocol.
Example of an Amazon S3 Block Filesystem URI: s3a://bucket_name/path/to/file
S3 credentials can be specified in a configuration file:
<property>
  <name>fs.s3a.access.key</name>
  <value>...</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>...</value>
</property>
or on the command line as follows:
hadoop distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... s3a://
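To avoid typing the keys inline, the two -D options can be generated from environment variables. The sketch below assumes the credentials are exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; that choice and the function name s3a_credential_opts are illustrative assumptions, not something distcp requires.

```shell
# Hypothetical helper: print the two s3a credential options, one per line,
# using values taken from the environment. Fails if either value is missing.
s3a_credential_opts() {
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set" >&2
        return 1
    fi
    printf '%s\n' \
        "-Dfs.s3a.access.key=${AWS_ACCESS_KEY_ID}" \
        "-Dfs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}"
}

# Example use (the substitution is left unquoted on purpose so that each
# option becomes a separate argument; this assumes the keys contain no spaces):
#   hadoop distcp $(s3a_credential_opts) s3a://bucket_name/path/to/file hdfs://CDH5-nameservice/path
```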
Enabling Fallback Configuration
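To enable the fallback configuration, which the table below requires when copying between a secure cluster and an insecure one, set the following property:
<property>
  <name>ipc.client.fallback-to-simple-auth-allowed</name>
  <value>true</value>
</property>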
Protocol Support for Distcp
The following table lists the protocols supported with the distcp command on different versions of CDH. "Secure" means that the cluster is configured to use Kerberos.
Source | Destination | Where to Issue distcp Command | Source Protocol | Source Config | Destination Protocol | Destination Config | Fallback Config Required |
---|---|---|---|---|---|---|---|
CDH 4 | CDH 4 | Destination | hftp | Secure | hdfs or webhdfs | Secure | |
CDH 4 | CDH 4 | Source or Destination | hdfs or webhdfs | Secure | hdfs or webhdfs | Secure | |
CDH 4 | CDH 4 | Source or Destination | hdfs or webhdfs | Insecure | hdfs or webhdfs | Insecure | |
CDH 4 | CDH 4 | Destination | hftp | Insecure | hdfs or webhdfs | Insecure | |
CDH 4 | CDH 5 | Destination | webhdfs or hftp | Secure | webhdfs or hdfs | Secure | |
CDH 4 | CDH 5.1.3+ | Destination | webhdfs | Insecure | webhdfs | Secure | Yes |
CDH 4 | CDH 5 | Destination | webhdfs or hftp | Insecure | webhdfs or hdfs | Insecure | |
CDH 4 | CDH 5 | Source | hdfs or webhdfs | Insecure | webhdfs | Insecure | |
CDH 5 | CDH 4 | Source or Destination | webhdfs | Secure | webhdfs | Secure | |
CDH 5 | CDH 4 | Source | hdfs | Secure | webhdfs | Secure | |
CDH 5.1.3+ | CDH 4 | Source | hdfs or webhdfs | Secure | webhdfs | Insecure | Yes |
CDH 5 | CDH 4 | Source or Destination | webhdfs | Insecure | webhdfs | Insecure | |
CDH 5 | CDH 4 | Destination | webhdfs | Insecure | hdfs | Insecure | |
CDH 5 | CDH 4 | Source | hdfs | Insecure | webhdfs | Insecure | |
CDH 5 | CDH 4 | Destination | hftp | Insecure | hdfs or webhdfs | Insecure | |
CDH 5 | CDH 5 | Source or Destination | hdfs or webhdfs | Secure | hdfs or webhdfs | Secure | |
CDH 5 | CDH 5 | Destination | hftp | Secure | hdfs or webhdfs | Secure | |
CDH 5.1.3+ | CDH 5 | Source | hdfs or webhdfs | Secure | hdfs or webhdfs | Insecure | Yes |
CDH 5 | CDH 5.1.3+ | Destination | hdfs or webhdfs | Insecure | hdfs or webhdfs | Secure | Yes |
CDH 5 | CDH 5 | Source or Destination | hdfs or webhdfs | Insecure | hdfs or webhdfs | Insecure | |
CDH 5 | CDH 5 | Destination | hftp | Insecure | hdfs or webhdfs | Insecure | |
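
In the table above, the Fallback Config Required column is "Yes" exactly when one cluster is secure and the other is not. A pre-flight script could encode that rule as follows. This is a sketch under that reading of the table; the function name needs_fallback is hypothetical, and it does not check the CDH 5.1.3+ version requirement listed for those rows.

```shell
# Hypothetical helper: exit status 0 when the fallback configuration is
# required, 1 when it is not, 2 when an argument is not Secure/Insecure.
# Arguments: source config, destination config.
needs_fallback() {
    case "$1:$2" in
        Secure:Insecure|Insecure:Secure) return 0 ;;
        Secure:Secure|Insecure:Insecure) return 1 ;;
        *)
            echo "expected Secure or Insecure, got '$1' and '$2'" >&2
            return 2
            ;;
    esac
}

# Example use:
#   if needs_fallback Insecure Secure; then
#       echo "set ipc.client.fallback-to-simple-auth-allowed=true"
#   fi
```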
Distcp between Secure Clusters in Distinct Kerberos Realms
This section explains how to copy data between two secure clusters in distinct Kerberos realms.
Specify the Destination Parameters in krb5.conf
[realms]
  HADOOP.QA.domain.COM = {
    kdc = kdc.domain.com:88
    admin_server = admin.test.com:749
    default_domain = domain.com
    supported_enctypes = arcfour-hmac:normal des-cbc-crc:normal des-cbc-md5:normal des:normal des:v4 des:norealm des:onlyrealm des:afs3
  }
[domain_realm]
  .domain.com = HADOOP.test.domain.COM
  domain.com = HADOOP.test.domain.COM
  test03.domain.com = HADOOP.QA.domain.COM
Configure HDFS RPC Protection and Acceptable Kerberos Principal Patterns
- Open the Cloudera Manager Admin Console.
- Go to the HDFS service.
- Click the Configuration tab.
- Select
- Select .
- Locate the Hadoop RPC Protection property and select authentication.
- Click Save Changes to commit the changes.
The following steps are not required if the two realms are already set up to trust each other, or have the same principal pattern. However, this isn't usually the case.
- Open the Cloudera Manager Admin Console.
- Go to the HDFS service.
- Click the Configuration tab.
- Select
- Select .
- Edit the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml property to add:
<property>
  <name>dfs.namenode.kerberos.principal.pattern</name>
  <value>*</value>
</property>
- Click Save Changes to commit the changes.
(If TLS/SSL is enabled) Specify Truststore Properties
<property>
  <name>ssl.client.truststore.location</name>
  <value>path_to_truststore</value>
</property>
<property>
  <name>ssl.client.truststore.password</name>
  <value>XXXXXX</value>
</property>
<property>
  <name>ssl.client.truststore.type</name>
  <value>jks</value>
</property>
Set HADOOP_CONF to the Destination Cluster
Set the HADOOP_CONF path to be the destination environment. If you are not using HFTP, set the HADOOP_CONF path to the source environment instead.
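The rule above can be expressed as a small selector. This sketch assumes that "HADOOP_CONF path" corresponds to the HADOOP_CONF_DIR environment variable read by the hadoop launcher scripts; the function name choose_conf_dir and the directory names in the example are hypothetical.

```shell
# Hypothetical helper: print the configuration directory to use.
# Arguments: source URI, source cluster conf dir, destination cluster conf dir.
# An hftp:// source selects the destination configuration; any other
# source scheme selects the source configuration.
choose_conf_dir() {
    local source_uri="$1"
    local source_conf="$2"
    local dest_conf="$3"
    case "$source_uri" in
        hftp://*) printf '%s\n' "$dest_conf" ;;
        *)        printf '%s\n' "$source_conf" ;;
    esac
}

# Example use (directory names are placeholders):
#   export HADOOP_CONF_DIR="$(choose_conf_dir \
#       hdfs://test01.domain.com:8020/user/alice \
#       /etc/hadoop/conf.source /etc/hadoop/conf.dest)"
```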
Launch Distcp
hadoop distcp hdfs://test01.domain.com:8020/user/alice hdfs://test02.domain.com:8020/user/alice
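If launching distcp fails, you can force Kerberos to use TCP instead of UDP by adding the following parameter to the krb5.conf file:
[libdefaults]
  udp_preference_limit = 1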