This lesson is in the early stages of development (Alpha version)

Transferring files with remote computers

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • How do I transfer files to (and from) the cluster?

Objectives
  • Transfer files to and from a computing cluster.

Performing work on a remote computer is not very useful if we cannot get files to or from the cluster. There are several options for transferring data between computing resources using CLI and GUI utilities, a few of which we will cover.

Download Lesson Files From the Internet

One of the most straightforward ways to download files is to use either curl or wget. At least one of these is usually installed in most Linux shells, in the macOS Terminal, and in Git Bash. Any file that can be downloaded in your web browser through a direct link can be downloaded using curl or wget. This is a quick way to download datasets or source code. In both cases the basic syntax is the command name, then any options, then the URL of the file.

Try it out by downloading some material we’ll use later on (a wordlist from John Lawler at University of Michigan) from the following URL:

https://websites.umich.edu/~jlawler/wordlist

Download the wordlist

By default, wget saves a download under the last part of the URL (in this case, wordlist), and curl does the same when given -O; without -O or -o, curl prints the file to the terminal instead. Use wget -O or curl -o to save the file as wordlist.txt.

wget and curl Commands

[yourUsername@borah-login ~]$ wget -O wordlist.txt https://websites.umich.edu/~jlawler/wordlist
# or
[yourUsername@borah-login ~]$ curl -o wordlist.txt -L https://websites.umich.edu/~jlawler/wordlist

The -L option to curl tells it to follow URL redirects (which wget does by default).

After downloading the file, use ls to see it in your working directory:

[yourUsername@borah-login ~]$ ls
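Since a given system may have only one of the two tools, a small wrapper can pick whichever is available. The following is a minimal sketch, not part of the lesson: the function name fetch is hypothetical, and it only uses the -O, -o and -L options already shown above.

```shell
# Hypothetical helper (an assumption for illustration): download a URL to a
# chosen filename using whichever of wget or curl is installed.
fetch() {
    url=$1
    out=$2
    if command -v wget >/dev/null 2>&1; then
        # -O sets the output filename; wget follows redirects by default.
        wget -O "$out" "$url"
    elif command -v curl >/dev/null 2>&1; then
        # -o sets the output filename; -L tells curl to follow redirects.
        curl -L -o "$out" "$url"
    else
        echo "fetch: neither wget nor curl is installed" >&2
        return 1
    fi
}

# Example call (needs network access, so it is left commented out):
# fetch https://websites.umich.edu/~jlawler/wordlist wordlist.txt
```

Defining the function does not download anything; the transfer only happens when fetch is called with a URL and a filename.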

Using the OnDemand File Browser

The OnDemand “Files” tab provides a graphical interface to all your files on Borah.

[Figure: OnDemand dashboard highlighting the Files app]

Here you can edit, upload, and download files, among other operations.

Transferring Single Files and Folders With scp

To copy a single file between our local computer and the cluster, we can use scp (“secure copy”). The syntax can be a little complex for new users, but we’ll break it down. The scp command is a relative of the ssh command we used to access the system, and can use the same public-key authentication mechanism.

To upload to another computer, the template command is

[you@laptop:~]$ scp local_file yourUsername@borah-login.boisestate.edu:remote_destination

in which @ and : are field separators and remote_destination is a path relative to your remote home directory, a new filename if you wish to rename the file, or both. If you don’t have a specific folder in mind, you can omit remote_destination and the file will be copied to your home directory on the remote computer under its original name. If you include a remote_destination, scp interprets it the same way cp does when making local copies:

  • if it exists and is a folder, the file is copied inside the folder;
  • if it exists and is a file, that file is overwritten with the contents of local_file;
  • if it does not exist, it is taken as the destination filename for local_file.
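Because scp follows the same destination rules as cp, the three cases can be tried safely with local copies before touching the cluster. This is a sketch under stated assumptions: it runs in a throwaway directory, and the names existing_dir, existing_file and brand_new_name are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "new contents" > local_file
mkdir existing_dir
echo "old contents" > existing_file

# 1. Destination exists and is a folder: the file is copied inside it.
cp local_file existing_dir
# 2. Destination exists and is a file: its contents are overwritten.
cp local_file existing_file
# 3. Destination does not exist: it becomes the new filename.
cp local_file brand_new_name

ls existing_dir        # lists local_file
cat existing_file      # prints: new contents
cat brand_new_name     # prints: new contents
```

The second case is the one to watch for: like cp, scp does not ask before overwriting an existing file.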

Upload a file to your remote home directory like so:

[you@laptop:~]$ scp myfile yourUsername@borah-login.boisestate.edu:

Transferring a Directory

To transfer an entire directory, we add the -r flag for “recursive”: copy the item specified, and every item below it, and every item below those… until it reaches the bottom of the directory tree rooted at the folder name you provided.

[you@laptop:~]$ scp -r amdahl yourUsername@borah-login.boisestate.edu:

Caution

For a large directory – either in size or number of files – copying with -r can take a long time to complete.

When using scp, you may have noticed that a : always follows the remote computer name. A string after the : specifies the remote directory you wish to transfer the file or folder to, including a new name if you wish to rename the remote material. If you leave this field blank, scp defaults to your home directory and the name of the local material to be transferred.

On Linux computers, / is the separator in file or directory paths. A path starting with a / is called absolute, since there can be nothing above the root /. A path that does not start with / is called relative, since it is not anchored to the root.

If you want to upload a file to a location inside your home directory – which is often the case – then you don’t need a leading /. After the :, you can type the destination path relative to your home directory. If your home directory is the destination, you can leave the destination field blank, or type ~ – the shorthand for your home directory – for completeness.

With scp, a trailing slash on the target directory is optional, and has no effect.
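The rules for the field after the : can be summarised as a small decision table. The sketch below is illustrative only: describe_destination is a hypothetical function that restates the rules from this section, it does not call scp, and the example paths are invented.

```shell
# Hypothetical helper: describe how the text after the ":" in an scp
# target is interpreted, following the rules described in this section.
describe_destination() {
    case "$1" in
        ""|"~"|"~/") echo "home directory, original name" ;;
        /*)          echo "absolute path, anchored at the root /" ;;
        *)           echo "path relative to the home directory" ;;
    esac
}

describe_destination ""                  # home directory, original name
describe_destination "~"                 # home directory, original name
describe_destination "/tmp/results"      # absolute path, anchored at the root /
describe_destination "project/data.txt"  # path relative to the home directory
```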

Working with Windows

Transferring text files from a Windows system to a Unix system (Mac, Linux, BSD, Solaris, etc.) can cause problems, because Windows marks the end of each line slightly differently from Unix, with one extra character per line.

On a Unix system, every line in a file ends with a \n (newline). On Windows, every line in a file ends with a \r\n (carriage return + newline).

Most modern programming languages and software handle this correctly, but in rare instances you may run into an issue. The solution is to convert the file from Windows to Unix line endings with the dos2unix command.

You can identify if a file has Windows line endings with cat -A filename. A file with Windows line endings will have ^M$ at the end of every line. A file with Unix line endings will have $ at the end of a line.

To convert the file, just run dos2unix filename. (Conversely, to convert back to Windows format, you can run unix2dos filename.)
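To see the difference without a Windows machine, you can create a file with Windows line endings yourself. The sketch below is an illustration, not the lesson's method: it uses printf to write the \r\n endings, grep to count lines that end in a carriage return, and tr as a stand-in for dos2unix in case that command is not installed. The filenames are made up.

```shell
workdir=$(mktemp -d)

# Write a two-line file with Windows (CRLF) line endings.
printf 'first line\r\nsecond line\r\n' > "$workdir/windows.txt"

# Count the lines that end in a carriage return; a Unix file gives 0.
cr=$(printf '\r')
crlf_before=$(grep -c "${cr}\$" "$workdir/windows.txt")

# Strip the carriage returns. Note that tr removes every \r in the file,
# whereas dos2unix only converts the \r\n line endings.
tr -d '\r' < "$workdir/windows.txt" > "$workdir/unix.txt"
crlf_after=$(grep -c "${cr}\$" "$workdir/unix.txt")

echo "lines ending in CR before: $crlf_before, after: $crlf_after"
```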

Archiving Files

One of the biggest challenges we often face when transferring data between remote HPC systems is that of large numbers of files. There is an overhead to transferring each individual file and when we are transferring large numbers of files these overheads combine to slow down our transfers to a large degree.

The solution to this problem is to archive multiple files into smaller numbers of larger files before we transfer the data to improve our transfer efficiency. Sometimes we will combine archiving with compression to reduce the amount of data we have to transfer and so speed up the transfer. The most common archiving command you will use on a (Linux) HPC cluster is tar.

tar can be used to combine files and folders into a single archive file and, optionally, compress the result. Let’s look at an example archive, amdahl.tar.gz, which we download from GitHub below.

The .tar part marks the file as a tar archive, and the .gz part stands for gzip, the compression format (and tool) used to compress it.

To view the contents of a tarfile, without unpacking the file, we can use the -t flag. tar prints the “table of contents” with the -t flag, for the file specified with the -f flag followed by the filename. Note that you can concatenate the two flags: writing -t -f is interchangeable with writing -tf together. However, the argument following -f must be a filename, so writing -ft will not work.
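The equivalence of -t -f and -tf can be checked on a small archive built locally. This is a sketch with invented file names; it builds its test archive with -c (create), -z (gzip) and -C (change to a directory first), which are standard tar flags chosen here only for the demonstration.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/demo/sub"
echo "hello" > "$workdir/demo/a.txt"
echo "world" > "$workdir/demo/sub/b.txt"

# Build a small gzip-compressed archive to inspect.
# -C makes tar change into $workdir before adding the demo folder.
tar -czf "$workdir/demo.tar.gz" -C "$workdir" demo

# List the contents two equivalent ways, without unpacking anything.
separate=$(tar -t -f "$workdir/demo.tar.gz")
combined=$(tar -tf "$workdir/demo.tar.gz")

echo "$combined"
[ "$separate" = "$combined" ] && echo "both forms list the same contents"
```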

First download the example tarfile:

[yourUsername@borah-login ~]$ wget -O amdahl.tar.gz https://github.com/hpc-carpentry/amdahl/tarball/main
# or
[yourUsername@borah-login ~]$ curl -o amdahl.tar.gz -L https://github.com/hpc-carpentry/amdahl/tarball/main

Then list the contents:

[yourUsername@borah-login ~]$ tar -tf amdahl.tar.gz
hpc-carpentry-amdahl-46c9b4b/
hpc-carpentry-amdahl-46c9b4b/.github/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/python-publish.yml
hpc-carpentry-amdahl-46c9b4b/.gitignore
hpc-carpentry-amdahl-46c9b4b/LICENSE
hpc-carpentry-amdahl-46c9b4b/README.md
hpc-carpentry-amdahl-46c9b4b/amdahl/
hpc-carpentry-amdahl-46c9b4b/amdahl/__init__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/__main__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/amdahl.py
hpc-carpentry-amdahl-46c9b4b/requirements.txt
hpc-carpentry-amdahl-46c9b4b/setup.py

This example output shows a folder which contains a few files, where 46c9b4b is a 7-character abbreviated git commit hash that will change when the source material is updated.

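Now let’s unpack the archive. We’ll run tar with a few common flags:

  • -x to extract the archive
  • -v for verbose output, printing each file as it is processed
  • -z to decompress with gzip
  • -f followed by the name of the archive file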

Extract the Archive

Use tar with the -x, -v, -z and -f flags to unpack the source code tarball.

[yourUsername@borah-login ~]$ tar -xvzf amdahl.tar.gz
hpc-carpentry-amdahl-46c9b4b/
hpc-carpentry-amdahl-46c9b4b/.github/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/python-publish.yml
hpc-carpentry-amdahl-46c9b4b/.gitignore
hpc-carpentry-amdahl-46c9b4b/LICENSE
hpc-carpentry-amdahl-46c9b4b/README.md
hpc-carpentry-amdahl-46c9b4b/amdahl/
hpc-carpentry-amdahl-46c9b4b/amdahl/__init__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/__main__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/amdahl.py
hpc-carpentry-amdahl-46c9b4b/requirements.txt
hpc-carpentry-amdahl-46c9b4b/setup.py

Note that we did not need to type out -x -v -z -f, thanks to flag concatenation, though the command works identically either way – so long as the concatenated list ends with f, because the next string must specify the name of the file to extract.

The folder has an unfortunate name, so let’s change that to something more convenient.

[yourUsername@borah-login ~]$ mv hpc-carpentry-amdahl-46c9b4b amdahl

Check the size of the extracted directory and compare to the compressed file size, using du for “disk usage”.

[yourUsername@borah-login ~]$ du -sh amdahl.tar.gz
8.0K    amdahl.tar.gz
[yourUsername@borah-login ~]$ du -sh amdahl
48K     amdahl

Text files (including Python source code) compress nicely: the “tarball” is one-sixth the total size of the raw data!

If you want to reverse the process – compressing raw data instead of extracting it – set a c flag instead of x, set the archive filename, then provide a directory to compress:

[yourUsername@borah-login ~]$ tar -cvzf compressed_code.tar.gz amdahl
amdahl/
amdahl/.github/
amdahl/.github/workflows/
amdahl/.github/workflows/python-publish.yml
amdahl/.gitignore
amdahl/LICENSE
amdahl/README.md
amdahl/amdahl/
amdahl/amdahl/__init__.py
amdahl/amdahl/__main__.py
amdahl/amdahl/amdahl.py
amdahl/requirements.txt
amdahl/setup.py

If you give amdahl.tar.gz as the filename in the above command, tar will overwrite the existing tarball without asking for confirmation. The archive you downloaded, containing the hpc-carpentry-amdahl-46c9b4b folder, would be replaced by a new one containing the amdahl folder, so choose a different name if you want to keep the original.
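To tie archiving back to file transfer, the sketch below bundles a directory of many small files into one tarball, unpacks it elsewhere, and checks that nothing was lost. It is an illustration under assumptions: the directory and file names are invented, the transfer step itself is omitted, and -C (change to a directory before archiving or extracting) is a standard tar flag not otherwise used in this lesson.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/src/many_files" "$workdir/unpacked"

# Make 100 small text files, standing in for a directory that would be
# slow to copy one file at a time.
i=1
while [ "$i" -le 100 ]; do
    echo "sample data line $i" > "$workdir/src/many_files/file_$i.txt"
    i=$((i + 1))
done

# Bundle everything into a single compressed archive.
tar -czf "$workdir/many_files.tar.gz" -C "$workdir/src" many_files

# Unpack it somewhere else, as you would on the far side of a transfer.
tar -xzf "$workdir/many_files.tar.gz" -C "$workdir/unpacked"

# diff -r exits with status 0 only if the two directory trees are identical.
if diff -r "$workdir/src/many_files" "$workdir/unpacked/many_files" >/dev/null; then
    echo "round trip OK"
fi
count=$(ls "$workdir/unpacked/many_files" | wc -l)
echo "files recovered: $count"
```

In practice the single many_files.tar.gz would be the file handed to scp, so the per-file overhead is paid once rather than 100 times.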

Key Points

  • wget and curl -O download a file from the internet.

  • scp transfers files to and from your computer.

  • You can use the file browser in OnDemand to view and transfer files.