Apache HDFS
Hadoop uses Apache HDFS (the Hadoop Distributed File System) as its file system.

This job step supports copying a directory or file(s).
The above image depicts Copy Between HDFS
Job Step Properties
Copy Type is a dropdown that specifies the kind of copy to perform. Four copy types are supported: CopyBetweenHDFS, CopyToHDFS, CopyFromHDFS, and CopyWithinHDFS.
CopyBetweenHDFS
Source File/directory paths is a file or directory specification including the cluster's Name Node URL and port (for example, hdfs://myhdfs.company.com:8020/user/foo/bar).
Destination Directory is a file or directory specification including the cluster's Name Node URL and port. Note: The destination directory must not already exist.
Server Connection – This optional set of properties represents the connection properties to use to connect to the host via SSH. This is not required if the Hadoop CLI is installed on the Execution Agent Machine.
Preserve – This Boolean property, if enabled, allows the user to select which attributes of the original file are preserved on the new file after the copy.
Ignore Failures is an optional Boolean property that, when enabled, prevents a failing map from causing the map-reduce job to fail.
Log Directory is an optional property that specifies the directory where the map-reduce logs for each file are written. Note: The logs are not retained if the job is restarted.
Number of Maps is an optional property that specifies the number of maps used to copy the data. Note: Increasing the number does not necessarily improve performance. A value of -1 requests the default.
Update Type. Two (2) selections are available: Update (default) and Overwrite. Update copies files from the source that do not exist at the destination or whose contents differ. Overwrite overwrites destination files even if they already exist and have the same contents.
Delete. If enabled, deletes files that exist in the destination directory but not in the source.
Copy Strategy. Selects the copy strategy: either Uniformsize or Dynamic.
Bandwidth. Specify bandwidth in Mb/sec. A value of -1 means to use the default bandwidth.
Atomic. If enabled, an atomic commit is used.
SSL Configuration File. An SSL configuration file to use with an HSFTP source.
Advanced Properties. A set of advanced properties for distcp.
Output Messages is a Boolean property that causes messages generated by distcp execution to be written to the job’s log file.
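The properties above correspond closely to options of the Hadoop distcp tool. The following is a minimal sketch of how a subset of them might translate into a distcp command line. The function name build_distcp_args and the exact property-to-option mapping are assumptions made for illustration; the job step's actual implementation is not documented here. The options themselves (-p, -i, -log, -m, -update, -overwrite, -delete, -strategy, -bandwidth) are standard distcp options.

```python
def build_distcp_args(source, destination, *, preserve="", ignore_failures=False,
                      log_dir=None, num_maps=-1, update_type="Update",
                      delete=False, strategy=None, bandwidth=-1):
    """Return an argv list for `hadoop distcp` (hypothetical property mapping)."""
    args = ["hadoop", "distcp"]
    if preserve:
        args.append("-p" + preserve)        # e.g. "ugp" = user, group, permission
    if ignore_failures:
        args.append("-i")                   # ignore failures during the copy
    if log_dir:
        args += ["-log", log_dir]
    if num_maps != -1:                      # -1 requests the default
        args += ["-m", str(num_maps)]
    args.append("-overwrite" if update_type == "Overwrite" else "-update")
    if delete:
        args.append("-delete")
    if strategy:
        args += ["-strategy", strategy.lower()]   # "uniformsize" or "dynamic"
    if bandwidth != -1:                     # distcp documents this as MB/second per map
        args += ["-bandwidth", str(bandwidth)]
    return args + [source, destination]


example = build_distcp_args(
    "hdfs://myhdfs.company.com:8020/user/foo/bar",
    "hdfs://other.company.com:8020/user/foo/bar",   # hypothetical destination cluster
    num_maps=10, delete=True, strategy="Dynamic",
)
print(" ".join(example))
```

Building the command as a list of arguments rather than a single string avoids quoting problems when paths contain spaces.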
CopyToHDFS
Source File/directory paths is a file specification (including wildcards) or directory path.
Destination Directory is a file or directory specification including the cluster's Name Node URL and port. Note: The destination directory must not already exist.
Before. When specified, only files dated before this date are selected. The date format depends on the Execution Agent machine: on Windows, use the format of the machine's locale; on Linux/UNIX, use the YYYYMMDD format.
DateType. Used when Before or Since has been specified. Indicates which type of file date is used. Selections are: CreateDate, LastAccessDate, LastWriteDate.
Since. When specified, only files dated after this date are selected. The date format depends on the Execution Agent machine: on Windows, use the format of the machine's locale; on Linux/UNIX, use the YYYYMMDD format.
Permissions. Optional. Used to specify file permissions. Two methods are available: By Class and Custom. By Class allows you to specify individual permissions for each role (Owner, Group, Other). Custom allows you to specify a numeric value representing the permission.
Overwrite. This optional Boolean property indicates whether existing files are to be overwritten. By default, files are not overwritten (false).
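As an illustration of the Before and Since selection, the sketch below parses dates in the Linux/UNIX YYYYMMDD format and decides whether a file date qualifies. The function names are hypothetical, and treating both bounds as exclusive is an assumption based on the wording "prior to" and "after"; the job step's exact boundary behavior is not stated here.

```python
from datetime import date, datetime


def parse_yyyymmdd(text):
    """Parse a date given in the YYYYMMDD format, e.g. "20240131"."""
    return datetime.strptime(text, "%Y%m%d").date()


def is_selected(file_date, before=None, since=None):
    """Return True if file_date passes the optional Before/Since filters.

    Assumption: both bounds are exclusive.
    """
    if before is not None and not file_date < parse_yyyymmdd(before):
        return False
    if since is not None and not file_date > parse_yyyymmdd(since):
        return False
    return True


# A file dated 15 January 2024, checked against a window around it.
print(is_selected(date(2024, 1, 15), before="20240201", since="20240101"))
```

The file_date argument stands for whichever date the date type property selects (creation, last access, or last write).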
CopyFromHDFS
Source File/directory paths is a file or directory specification including the cluster's Name Node URL and port (for example, hdfs://myhdfs.company.com:8020/user/foo/bar).
Destination Directory is a file specification (including wildcards) or directory path.
Overwrite. This optional Boolean property indicates whether existing files are to be overwritten. By default, files are not overwritten (false).
CopyWithinHDFS
Source File/directory paths is a file or directory specification including the cluster's Name Node URL and port (for example, hdfs://myhdfs.company.com:8020/user/foo/bar).
Destination Directory is a file or directory specification including the cluster's Name Node URL and port. Note: The destination directory must not already exist.
Server Connection – This optional set of properties represents the connection properties to use to connect to the host via SSH. This is not required if the Hadoop CLI is installed on the Execution Agent Machine.
Overwrite. This optional Boolean property indicates whether existing files are to be overwritten. By default, files are not overwritten (false).
Name Node URL. The URL of the HDFS Name Node to run the command against.
Authentication – This set of properties is used when executing on the HDFS Name Node. These credentials will be used to authenticate with Kerberos if necessary.
The above image depicts Copy From HDFS
The above image depicts Copy To HDFS
The above image depicts Copy Within HDFS
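For the three simpler copy types, the closest Hadoop CLI equivalents are the hdfs dfs subcommands -put, -get, and -cp, each of which accepts -f to overwrite an existing destination. The sketch below shows one possible mapping. The function name build_copy_args and the mapping itself are assumptions for illustration, not a description of how the job step is implemented.

```python
def build_copy_args(copy_type, source, destination, overwrite=False):
    """Return an argv list for `hdfs dfs` (hypothetical mapping, for illustration)."""
    subcommand = {
        "CopyToHDFS": "-put",      # local file system -> HDFS
        "CopyFromHDFS": "-get",    # HDFS -> local file system
        "CopyWithinHDFS": "-cp",   # HDFS -> HDFS
    }[copy_type]
    args = ["hdfs", "dfs", subcommand]
    if overwrite:
        args.append("-f")          # overwrite the destination if it already exists
    return args + [source, destination]


print(build_copy_args("CopyToHDFS", "/tmp/report.csv", "/user/foo/reports", overwrite=True))
```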

This job step supports creating a directory.
Job Step Properties
Path is the full path of the directory to create.
Name Node URL. Optional. The URL of the HDFS Name Node to run the command against.
Authentication – Optional. This set of properties is used when executing on the HDFS Name Node. These credentials will be used to authenticate with Kerberos if necessary.
Permissions. Optional. Used to specify file permissions. Two methods are available: By Class and Custom. By Class allows you to specify individual permissions for each role (Owner, Group, Other). Custom allows you to specify a numeric value representing the permission.
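The two permission methods describe the same information in different forms. The sketch below converts By Class permissions into the numeric value that the Custom method expects. It assumes each class is written as a three-character string such as "rwx" or "r-x"; that representation and the function name are illustrative choices, not the product's input format.

```python
def class_permissions_to_octal(owner, group, other):
    """Convert per-class permission strings (e.g. "rwx", "r-x") to an octal string."""
    def digit(perm):
        # read = 4, write = 2, execute = 1; "-" contributes nothing
        return sum(value for ch, value in zip(perm, (4, 2, 1)) if ch != "-")
    return f"{digit(owner)}{digit(group)}{digit(other)}"


# Owner: read/write/execute, Group: read/execute, Other: read only.
print(class_permissions_to_octal("rwx", "r-x", "r--"))
```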

This job step supports deleting a directory or file(s).
Job Step Properties
Path is the full path of the file or directory to delete. Wildcard characters are supported.
Force. This Boolean property indicates whether non-empty directories should be deleted (that is, all files within the directory are deleted, followed by the directory itself).
Name Node URL. Optional. The URL of the HDFS Name Node to run the command against.
Authentication – Optional. This set of properties is used when executing on the HDFS Name Node. These credentials will be used to authenticate with Kerberos if necessary.
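In Hadoop CLI terms, the behavior described for Force resembles the recursive option of hdfs dfs -rm. The sketch below shows that correspondence; the function name and the mapping are assumptions. Note that the CLI's own -f option means something different (it suppresses the error when the path does not exist), so it is not used here.

```python
def build_delete_args(path, force=False):
    """Return an argv list for `hdfs dfs -rm` (hypothetical mapping)."""
    args = ["hdfs", "dfs", "-rm"]
    if force:
        args.append("-r")   # recursive: delete directory contents, then the directory
    return args + [path]


print(build_delete_args("/user/foo/old_logs", force=True))
```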

This job step retrieves content summary for a given path.
Job Step Properties
Directory Path is the full path of the directory for which to provide content summary statistics.
Name Node URL. Optional. The URL of the HDFS Name Node to run the command against.
Authentication – Optional. This set of properties is used when executing on the HDFS Name Node. These credentials will be used to authenticate with Kerberos if necessary.
Return Step Values
DirectoryCount – The number of directories within the path
FileCount – The number of files within the path
Length – The number of bytes used by the content
Quota – The name quota of this directory. (The name quota is a hard limit on the number of file and directory names in the tree rooted at that directory - https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html)
SpaceConsumed – The disk space consumed by the content of the directory
SpaceQuota – The space quota of this directory. (The space quota is a hard limit on the number of bytes used by files in the tree rooted at that directory - https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html)
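The return values above have the same names as the fields of the ContentSummary object in the WebHDFS REST API (operation GETCONTENTSUMMARY). The sketch below parses a response of that shape. Whether the job step uses WebHDFS internally is an assumption, and the sample values are illustrative only. In HDFS, a quota of -1 indicates that no quota is set.

```python
import json

# Illustrative response in the shape WebHDFS returns for GETCONTENTSUMMARY.
SAMPLE = """{"ContentSummary": {"directoryCount": 2, "fileCount": 1,
 "length": 24930, "quota": -1, "spaceConsumed": 24930, "spaceQuota": -1}}"""


def read_content_summary(response_text):
    """Map the WebHDFS ContentSummary fields to the return step value names."""
    cs = json.loads(response_text)["ContentSummary"]
    return {
        "DirectoryCount": cs["directoryCount"],
        "FileCount": cs["fileCount"],
        "Length": cs["length"],
        "Quota": cs["quota"],
        "SpaceConsumed": cs["spaceConsumed"],
        "SpaceQuota": cs["spaceQuota"],
    }


summary = read_content_summary(SAMPLE)
has_name_quota = summary["Quota"] != -1   # -1 means no name quota is set
print(summary, has_name_quota)
```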

This job step calculates and returns a checksum on the specified file.
Job Step Properties
File Path is the full path of the file on which to perform the checksum.
Name Node URL. Optional. The URL of the HDFS Name Node to run the command against.
Authentication – Optional. This set of properties is used when executing on the HDFS Name Node. These credentials will be used to authenticate with Kerberos if necessary.
Return Step Values
Algorithm – The name of the checksum algorithm
Bytes – The byte sequence of the checksum in hexadecimal
Length – The length of the bytes (not the length of the string)
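The distinction between Bytes and Length can be checked directly: each byte is written as two hexadecimal characters, so the hexadecimal string is twice as long as Length. The sketch below decodes a response shaped like the WebHDFS FileChecksum object (operation GETFILECHECKSUM). The use of WebHDFS is an assumption, and the algorithm name and checksum in the sample are made up and shortened for readability.

```python
import json

# Made-up, shortened sample in the shape of a WebHDFS FileChecksum response.
SAMPLE = '{"FileChecksum": {"algorithm": "EXAMPLE-CRC", "bytes": "0a1b2c3d", "length": 4}}'


def read_checksum(response_text):
    """Return (algorithm, raw_bytes) after verifying the reported length."""
    fc = json.loads(response_text)["FileChecksum"]
    raw = bytes.fromhex(fc["bytes"])          # two hex characters per byte
    if len(raw) != fc["length"]:
        raise ValueError("Length does not match the decoded byte count")
    return fc["algorithm"], raw


algorithm, raw = read_checksum(SAMPLE)
print(algorithm, len(raw), raw.hex())
```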

This job step returns the current user’s Home Directory.
Job Step Properties
Name Node URL. Optional. The URL of the HDFS Name Node to run the command against.
Authentication – Optional. This set of properties is used when executing on the HDFS Name Node. These credentials will be used to authenticate with Kerberos if necessary.
Return Step Value
Return Value is the Home Directory.
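For comparison, the WebHDFS REST API exposes the same information through the GETHOMEDIRECTORY operation, which returns a JSON object with a single Path field. The sketch below builds such a request URL and reads a sample response. It assumes the Name Node URL is the Name Node's HTTP address; the host name, port, user name, and function name are hypothetical, and it is not confirmed that the job step uses WebHDFS.

```python
import json
from urllib.parse import quote, urlencode


def webhdfs_url(name_node_url, path, op, **params):
    """Build a WebHDFS REST URL of the form <name node>/webhdfs/v1<path>?op=<OP>."""
    query = urlencode({"op": op, **params})
    return f"{name_node_url.rstrip('/')}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical Name Node HTTP address.
url = webhdfs_url("http://namenode.example.com:9870", "/", "GETHOMEDIRECTORY")

# Illustrative response body for GETHOMEDIRECTORY.
home_directory = json.loads('{"Path": "/user/alice"}')["Path"]
print(url, home_directory)
```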

This job step retrieves file/directory related information.
Job Step Properties
Path is the path of the file or directory. Wildcards are not allowed.
Name Node URL. Optional. The URL of the HDFS Name Node to run the command against.
Authentication – Optional. This set of properties is used when executing on the HDFS Name Node. These credentials will be used to authenticate with Kerberos if necessary.
Return Step Value
Numerous file attributes are returned.

This job step returns files and sub-directories within a specified directory path.
Job Step Properties
Path is the path of the directory. Wildcards are not allowed.
Name Node URL. Optional. The URL of the HDFS Name Node to run the command against.
Authentication – Optional. This set of properties is used when executing on the HDFS Name Node. These credentials will be used to authenticate with Kerberos if necessary.
Return Step Value
FileStatus. A collection of retrieved file-directory objects. The example above illustrates one usage. Using the ForEachItem job step, you can iterate through the returned objects.
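Outside the job designer, iterating over a collection of this kind looks like the sketch below, which processes a response shaped like the WebHDFS LISTSTATUS result (an array of FileStatus objects). The sample entries are made up, and the assumption that the returned objects carry the WebHDFS field names (pathSuffix, type, length, and so on) is not confirmed by this document.

```python
import json

# Made-up sample in the shape of a WebHDFS LISTSTATUS response.
SAMPLE = """{"FileStatuses": {"FileStatus": [
 {"pathSuffix": "a.patch", "type": "FILE", "length": 24930,
  "owner": "alice", "group": "supergroup", "permission": "644"},
 {"pathSuffix": "bar", "type": "DIRECTORY", "length": 0,
  "owner": "alice", "group": "supergroup", "permission": "711"}
]}}"""


def list_entries(response_text):
    """Return (name, type, length) for every entry in the listing."""
    entries = json.loads(response_text)["FileStatuses"]["FileStatus"]
    return [(e["pathSuffix"], e["type"], e["length"]) for e in entries]


# Iterate over the returned objects, keeping only regular files.
files = [name for name, kind, _ in list_entries(SAMPLE) if kind == "FILE"]
total_bytes = sum(length for _, _, length in list_entries(SAMPLE))
print(files, total_bytes)
```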

This job step performs a login/validation sequence for subsequent HDFS steps.
Job Step Properties
Name Node URL. The URL of the HDFS Name Node.
Authentication – This set of credentials is used to authenticate on the HDFS Name Node.

This job step moves (or renames) a file or directory.
Job Step Properties
Source File or Directory – Source file or directory for the Move operation.
Destination – Destination or Target of the Move operation as appropriate.
Name Node URL. The URL of the HDFS Name Node.
Authentication – This set of credentials is used to authenticate on the HDFS Name Node.

This job step sets the security permissions for a file or directory.
Job Step Properties
Source File or Directory – Source file or directory for the Set Security operation.
Name Node URL. Optional. The URL of the HDFS Name Node to run the command against.
Authentication – Optional. This set of properties is used when executing on the HDFS Name Node. These credentials will be used to authenticate with Kerberos if necessary.
Permissions. Optional. Used to specify file permissions. Two methods are available: By Class and Custom. By Class allows you to specify individual permissions for each role (Owner, Group, Other). Custom allows you to specify a numeric value representing the permission.
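The closest Hadoop CLI equivalent of this step is hdfs dfs -chmod. The sketch below validates a Custom numeric permission value and builds the corresponding command line. The function name is hypothetical, and accepting only three or four octal digits is an assumption about what a valid Custom value looks like.

```python
import re


def build_chmod_args(path, mode):
    """Return an argv list for `hdfs dfs -chmod` after validating the octal mode."""
    if not re.fullmatch(r"[0-7]{3,4}", mode):
        raise ValueError(f"not an octal permission value: {mode!r}")
    return ["hdfs", "dfs", "-chmod", mode, path]


print(build_chmod_args("/user/foo/bar", "754"))
```

Validating the value before building the command gives a clear error for input such as "789", which is not valid octal.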