


A User's Guide to Linux Clusters

1. Introduction

This guide is little more than a set of notes that I have accumulated over the years as I was handed Linux clusters, and told to make applications work on them.  As a result, it is unashamedly practical and utilitarian.  It simply contains some disjointed factoids that I found to be useful, and I made little effort to explore why some things worked and others didn't.  There are plenty of voluminous manuals and user guides that provide that extra background for the proverbial "interested reader". 

Nevertheless, my first and most insistent piece of advice is to read, or at least browse through, whatever manuals and user guides are available for the products and tools you are using.  Most of them nowadays are well-enough written to allow you to find what you need relatively quickly without having to plough through too much turgid verbiage.

If you can find any information in here that is useful to you, or that spares you some grief as you try to make your applications work, I'll consider this a job well done.

Let's say that your cluster hardware is all assembled and wired up, and that each node has a full operating system installed.  What do you do next?   Well, roughly in order, you may need to do some or all of the following:

  2. Configure the cluster to make it navigable
  3. Install cluster-management tools
  4. Create a User Account
  5. Enable User to navigate the cluster
  6. Install any interconnect-specific drivers
  7. Install a compiler
  8. Install MPI
  9. Build or Install user applications
  10. Run applications over the cluster


2. Configure the cluster to make it navigable

On a cluster, it is essential that a user be able to either "ssh" or "rsh" from node to node without being prompted for a password.  This is not just to avoid the hassle of entering a password every few minutes, or because a cluster is an abstract "single system", but because MPI needs to launch client processes using either ssh or rsh, and those launches simply fail if a password is required.  Commercial parallel applications also use either rsh or ssh for launching remote (client) processes.  However, these applications usually let you choose between "rsh" and "ssh", so we will assume that being able to "ssh" from node to node without being prompted for a password will suffice.

The following instructions apply equally to a normal user and to root.  See here for more details.

  1. Log on to a node on your cluster and run "ssh-keygen -t rsa", and just hit "return" to accept the defaults for each question asked.  (Choose a different key type from "rsa" - e.g., "dsa" - if you like.)  When prompted for a passphrase, simply hit "return" again (i.e., select an empty passphrase) in order to "ssh" around the cluster without being prompted for passwords.  This is a key point!  Once "ssh-keygen" has completed, the files "id_rsa" (private key) and "id_rsa.pub" (public key) will have been generated in the directory "~/.ssh".  Run "ssh-keygen -t rsa" on every node that has a unique home directory (i.e., nodes that have not imported the home directory as a shared file system).
  2. Collect all the unique "id_rsa.pub" files generated (or whatever "id_*.pub" files were generated) in step 1, and concatenate them all into a single big "authorized_keys" file.  Copy this "authorized_keys" file to the "~/.ssh" directory on each node which has a unique home directory.
  3. The final step involves collecting the public host keys from each node in the cluster (these keys are stored as a single line in "/etc/ssh/ssh_host_rsa_key.pub", one on each node).  There are two ways of doing this:

    The first way is to record the line of data from "/etc/ssh/ssh_host_rsa_key.pub" on each node and prefix it with a string comprising the node name comma-separated from the IP address.  If there are multiple interfaces on the node, record a separate entry for each interface using the corresponding node name and IP address of the interface.  Concatenate all these data into a single big "known_hosts" file.  If this step is carried out by root, then the resulting "known_hosts" can be copied to "/etc/ssh/ssh_known_hosts" on each node in the cluster, making the information available to every user by default.  Alternatively, the "known_hosts" file should be copied to "~/.ssh/known_hosts" on each node that has a unique home directory, thereby making the information available only to the current user.

    The second way is to "ssh" directly into each node on the cluster.  The first time you do this you will be warned that the node is not a "known host", and asked if you want to proceed.  Answer "yes", at which point you'll be informed that the node is being permanently added to the "~/.ssh/known_hosts" file.  More precisely, what is actually added to the "~/.ssh/known_hosts" file is the public key of that node, prefixed with the string comprising the node name comma-separated from the IP address.  Again, if multiple interfaces exist on each node, you may want to repeat this procedure for each interface, and "ssh" into the node using the node name associated with the interface.  As before, if all this is done by root, the resulting "known_hosts" file can be copied to "/etc/ssh/ssh_known_hosts" on each node in the cluster, making the information available to every user by default.  Alternatively, the "known_hosts" file should be copied to "~/.ssh/known_hosts" on each node in the cluster that has a unique home directory (thereby making the information only available to the current user).  A condensed sketch of the whole key-setup procedure follows.
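
Tying the three steps together, here is a condensed sketch for a small two-node cluster with separate home directories.  The node names "node1" and "node2" are placeholders, "node1.pub" and "node2.pub" stand for the two "id_rsa.pub" files after being collected on to one node, and "ssh-keyscan" (a standard OpenSSH utility) is assumed to be available for gathering the host keys - it writes "known_hosts" entries keyed by whatever names you give it:


ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
(run once on each node that has its own home directory)

cat node1.pub node2.pub > authorized_keys
scp authorized_keys node1:.ssh/
scp authorized_keys node2:.ssh/

ssh-keyscan -t rsa node1 node2 > known_hosts
scp known_hosts node1:.ssh/
scp known_hosts node2:.ssh/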

A different set of incantations must be recited if you want to open the cluster in order to "rsh" around it without being prompted for a password.  Details on how to do this are available here.  Nowadays, however, use of rsh tends to be frowned upon, since rsh is completely insecure.  Moreover, if "ssh" works properly, there should be no need to use "rsh".  The MPICH message-passing library still uses "rsh" as its default "remote shell", but this can be changed during the MPICH configuration process by setting the environment variable RSHCOMMAND to ssh, and/or by using the "-rsh=ssh" option to the configure command (see section 8 below).


3. Install cluster-management tools

When configuring a cluster, ideally you should work only on the head node, then use some cluster-management tool to repeat the same work on every other node.  There are several public-domain tool-sets available to do this, including OSCAR (Open Source Cluster Application Resources), C3 ("Cluster Command and Control"), and NPACI Rocks.  Even if you don't want a full tool-set, it's handy to have a tool like pdsh ("parallel distributed shell"), so your shell commands can be applied across the cluster.  Each one of these has its own documentation, and there's no need for me to elaborate here.
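
For example, with pdsh installed, one command can be run across a whole set of nodes in one go (the node names below are just placeholders):


pdsh -w node[1-4] uptime
pdsh -w node[1-4] 'rpm -q gcc'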


4. Create a User Account

Here's how root can create a user account with username "fred", give "fred" a login-password, and give him ownership of his home directory:


mkdir /home/fred
useradd -s /bin/csh -d /home/fred fred
passwd fred
chown -R fred /home/fred

See the man pages for useradd, usermod, and userdel.


5. Enable User to Navigate the Cluster

As described above for root, in order to ssh around the cluster without a password, user fred himself needs to:


ssh-keygen -t rsa
(hit "return" for no passphrase...)
cd /home/fred/.ssh
cat id_rsa.pub >> authorized_keys

(or in general, add all the *.pub files from all nodes to authorized_keys, then copy this single authorized_keys file back to all nodes, unless /home is exported to all nodes as a shared filesystem).

Each user should also create their own ~/.rhosts file (only needed if "rsh" is to be used), adding the hostname of each node in the cluster to it, one node per line.

Ensure that the ~/.rhosts file, along with shell-specific run-command files like ~/.cshrc, are visible to each node (i.e., you will need to copy them if there's no shared filesystem).
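
If there is no shared /home, something along these lines (the node names are placeholders) copies the relevant files out from the head node:


scp ~/.ssh/authorized_keys ~/.ssh/known_hosts node2:.ssh/
scp ~/.rhosts ~/.cshrc node2:
(repeat for node3, node4, ...)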


6. Install any Interconnect-specific Drivers

If you are using Ethernet as your interconnect, no special driver needs to be installed, but other interconnects like Myrinet or InfiniBand need their own drivers to be installed.  This again is really a job for root, and needs to be done on each node.  See each product's installation or user guide for how to do this.

Some potentially useful factoids are:

To install an rpm file:


# rpm --prefix=<installation-directory> -Uvh *.rpm
				

Given a ".iso" file, here's how to get at its contents:


# mkdir -p /mnt/iso
# mount -r -o loop filename.iso /mnt/iso
# cd /mnt/iso
# ls
...
# umount /mnt/iso

To mount a CD-ROM drive:


# dmesg | grep CD
hda: TOSHIBA DVD-ROM SD-M1912, ATAPI CD/DVD-ROM drive
Uniform CD-ROM driver Revision: 3.20

# mount -r /dev/hda /mnt/
# df
....

Then "umount /mnt" when done.


7. Install a compiler

This assumes that gcc and g77 (usually installed along with the operating system) are not enough to satisfy your parallel application needs.  If they are enough, you can move right along to the MPI installation.

At this point, we're starting to move away from root-space to user-space.  Most real-world clusters probably have a compiler installed by root for all to use, but there's no reason why individual users can't have their own compilers installed in their own space just for themselves.  Just follow your particular compiler's installation guide.  Good luck with the licensing - usually the messiest part!

Once the compiler is installed on the head node, it is usually enough to then copy the installation directory to all other nodes - a formal install should not be needed.  Nodes other than the head node will probably only need to find the compiler's run-time libraries anyway.
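
For example, if the compiler was installed under /opt/compiler (a placeholder path), something like the following, run as root from the head node, replicates it on the other nodes ("scp -r" would do equally well):


# rsync -a /opt/compiler/ node2:/opt/compiler/
# rsync -a /opt/compiler/ node3:/opt/compiler/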

8. Install MPI

Installing an MPI kit is also usually a job for root, but can equally be done by individual users in their own space.  Sometimes MPI is built by a user (who has a compiler license...) and then installed by root (with directory write permissions).

If your MPI kit of choice is MPICH, or closely related to it, you will get a standard GNU source-code package that is built and installed by the three commands "./configure", "make", and "make install", in that order.  The key command is "configure".  Before starting, type "./configure --help" to see the full range of options.  In my experience, installing MPICH is trickier than installing either interconnect drivers or compilers, partly because MPICH needs to use both of these, but might not be able to find them.

Here's what I need to remind myself to do (a worked example follows the list):

  1. Define the CC (C-compiler) and FC (Fortran compiler) environment variables (along with maybe CFLAGS) before running configure.
  2. Define the RSHCOMMAND environment variable to be "ssh" (in case "rsh" is the default remote shell, and unless you really want to use rsh....)
  3. Make sure that the compiler (and attendant license information) you want is on your "path" or in your environment for the configure script to pick up.
  4. Remember to include "--enable-sharedlib" and "-rsh=ssh" as options to configure.
  5. If on a 32-bit system, pick up 32-bit libraries ("lib32"); if on a 64-bit (x86-64) system, pick up 64-bit libraries ("lib64").
  6. To clean up a messed-up build and start again, do a "make distclean" first.
  7. If there are multiple compilers on your system, you may need to build a different MPI installation for each one.  This is to ensure that command-line arguments and the Fortran interface will be handled properly.  (Different compilers append different numbers of trailing underscores to Fortran routine names, and this can cause confusion at link-time if you try too much mixing and matching...)
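
Putting those reminders together, a typical MPICH build session looks something like the following (sh-style syntax shown; use "setenv" if you work in csh).  The install prefix and the gcc/g77 compiler choice are just examples, and exact option names vary between MPICH versions, so check "./configure --help" first:


export CC=gcc
export FC=g77
export RSHCOMMAND=ssh
./configure --prefix=/opt/mpich --enable-sharedlib -rsh=ssh
make
make install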

9. Build or Install User Applications

To install a commercial or fully developed open-source application, follow the documentation that comes with it.  I'll just mention a few points here that in my experience are the most likely to cause you hassle:

  1. When compiling and especially when linking an MPI application, use the mpicc or mpif90 wrappers in the MPI "bin" directory (which should be on your PATH) rather than invoke the compiler directly.  This saves you the trouble of explicitly spelling out where to find, say, the mpif.h "include" file, and especially saves you the trouble of explicitly listing all the libraries needed at link-time, some of which may not be obvious at all!
  2. If the compiler reports an error when compiling a "standard" application, it is likely that this can be fixed by use of an appropriate compiler option (e.g., -r8) rather than by having to modify the source.  Not being able to find mpif.h (and the variables therein) is a common problem when compiling directly - easily fixed by adding "-I${MPI_ROOT}/include" to the compiler options, where MPI_ROOT is the path to your MPI installation.
  3. If mixing C and Fortran source code, some functions may be reported as "unresolved" or "not found" at link-time, because the C and Fortran compilers use different conventions to append trailing underscores to the function name.  Say that mpi_init_ is reported as "not found"; check how mpi_init is stored in the MPI libraries by going to your MPI "lib" directory, and running, e.g.,
    
    nm lib* | grep -i mpi_init | grep T
    					

    and see how mpi_init looks in there.  (The grep for "T" is to find where mpi_init is actually defined in a library's text (code) section, rather than the places where it is merely referenced from other routines.)

    Sometimes this trailing-underscore mismatch can be avoided by using C macros to rename functions to have an extra trailing underscore, and sometimes by using the appropriate Fortran compiler option (e.g., something like -nounderscore; see your compiler's man page for the exact spelling).  If the "unresolved" functions are MPI routines, then in the worst-case scenario you may need to re-build your MPI installation with different "underscore" options (set, e.g., in CFLAGS).

  4. It's usually possible to see exactly which libraries the mpicc and mpif90 wrappers are invoking by using something like a -v ("verbose") option to the wrappers themselves; see the example below.
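
For instance, a small MPI code can be built, and the underlying compile/link line inspected, like this (the file and program names are placeholders; "-show" is the MPICH wrappers' spelling of the "show me the real command" option - other MPIs use "-v" or something similar):


mpicc -O2 -o hello hello.c
mpif90 -O2 -o solver solver.f90
mpicc -show -o hello hello.c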

10. Running Applications over the Cluster

If all previous steps here were completed properly, running MPI jobs across the cluster should be straightforward.  In the case of LAM, make sure the LAM daemons are running (they are started with "lamboot") before submitting jobs; in the case of Myrinet, make sure either the GM or MX "mapper" is running (e.g., "ps auxww | grep mapper").  When starting jobs, give mpirun not just a "-np $NPROC" option, but also a "-machinefile <filename>" option (or equivalent), where <filename> contains a simple list of the node names on which you want to run the job.
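
For example (the machinefile and program names are placeholders):


mpirun -np 4 -machinefile ./machines ./my_mpi_app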

Processes are usually allocated in a "round-robin" fashion by proceeding down the list in the "machinefile", or down the list in share/machines.LINUX under your MPI installation.  So, for a "block" process allocation (processes 1-2 on node 1, processes 3-4 on node 2, etc.) just list each node twice on consecutive lines, as in:


node1
node1
node2
node2

For a "cyclic" process allocation, on the other hand, just list each node once, in the order in which you want processes allocated.

My most common experience of job failure on a cluster is a "connection refused" message or a hang, usually because I did not configure ssh correctly, so that MPI processes could not be launched across the cluster without being prompted for passwords!