Quick User's Guide
Disclaimer: This is a reference manual for RUPC users, describing the organization of the Beowulf cluster, the resources available and the Sun Grid Engine queuing system. We encourage users to spend half an hour reading this quick manual to get the most out of the cluster.
Cluster Access:
- To get access to the cluster, please contact Viktor Oudovenko [Physics Rm #E256] and provide the following information:
- Your full name
- Username [should be the same as on physsun machines]
- Machine name or IP address to be used for logins
- Room number and phone number
- Name of professor you work with
- Users can login to the cluster via frontend nodes (rupc03-rupc09)
- rupc04 is x86-64 Intel Xeon, used to compile and submit jobs to the Physics Xeon machines.
- rupc05 and rupc07 are x86-64 Opteron, used to compile and submit jobs to the Opteron machines in the CoRE#2 queue.
- rupc03, rupc06 and rupc08 are x86-64 Intel Xeon, used to compile and submit jobs to the Xeon machines in the CoRE queue.
- Access to the cluster is allowed only from computers that have been granted access and from all physsun machines.
- Secure Shell (SSH) is required to access the cluster and for logins inside the cluster.
- For passwordless logins within the cluster, execute this script line-by-line (a minimal key-based setup is also sketched after this list).
- Secure CoPy (SCP) is required to transfer data to, from and within the cluster.
- You are not allowed to run any interactive jobs on the compute nodes.
- All frontend nodes provide compilers, libraries, editors and so on.
- The number of open login sessions on a frontend machine is limited to 7 per user.
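A minimal sketch of a key-based SSH setup and an SCP transfer (the referenced script may differ; host and file names below are illustrative):
  # generate a key pair on a frontend node (default location, empty passphrase)
  ssh-keygen -t rsa
  # authorize the key for logins within the cluster (home directories are shared via the file server)
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys
  # copy data to the cluster with scp (illustrative paths)
  scp my_data.tar.gz rupc04:~/work_data/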
Cluster Structure (Hardware):
- RUPC consists of 4 computational subclusters: I) Physics, II) CoRE, III) CoRE2 and IV) IBM.
- Physics: AMD Opterons, Intel Xeons
- CoRE: Intel Xeons
- CoRE2: AMD Opterons
- IBM: IBM P575
- Main Servers
- Main File Server [home directories]
- NIS/license+Storage server [storage directories]
- Work File servers [directories for large work data]
- Backup Servers
- SGE Servers [queueing system]
- Web / Temperature Control Servers
Cluster Structure (Software):
- Operating System
  SuSE 13.1 and CentOS 7.0 (wp02 in the CoRE cluster)
- Compilers
  Compilers (C, C++, Fortran77, Fortran90), the parallel environment (MPICH2) and numerical libraries are available across the cluster.
  - Intel 14.0
The compiler commands are: ifort, icc, icpc
The compiler installation path is: /opt/intel/Compiler/14.0/
The MPICH2 wrappers: mpif77, mpif90, mpicc, mpiCC
The MPICH2 wrapper installation path is: /opt/mpich2/intel/14.0/
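For illustration, a serial and a parallel build with these tools (source and output file names are placeholders):
  # serial build with the Intel Fortran compiler
  ifort -O2 -o my_code.x my_code.f90
  # parallel build with the Intel-based MPICH2 wrapper
  mpif90 -O2 -o my_mpi_code.x my_mpi_code.f90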
- Libraries
  The most common libraries used in the cluster are:
  - Intel libraries
    - MKL: /opt/intel/mkl/lib/intel64/ [x86-64]
    - FFTW{2,3}: /opt/intel/mkl/lib/intel64/
  - Fast Fourier libraries
    - FFTW3: /opt/fftw/lib64/
  - GNU Scientific Library
    - GSL: rpm OS package
- Linking Tips
  - Linking example for the BLAS and LAPACK libraries using MKL:
    LAPACK_LIB = -mkl
    LIB = $(LAPACK_LIB)
- Makefile Examples for Frequently Used Codes
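The official template links are not reproduced here; below is a generic makefile sketch for an MPI Fortran code built with the Intel/MPICH2 toolchain and linked against MKL via -mkl (target and file names are placeholders, not one of the official examples):
  # makefile sketch: Intel compiler + MPICH2 wrapper + MKL
  # (recipe lines must be indented with a tab)
  F90        = mpif90
  FFLAGS     = -O2
  LAPACK_LIB = -mkl
  LIB        = $(LAPACK_LIB)

  my_code.x: my_code.o
  	$(F90) $(FFLAGS) -o $@ my_code.o $(LIB)

  my_code.o: my_code.f90
  	$(F90) $(FFLAGS) -c my_code.f90

  clean:
  	rm -f *.o my_code.x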
Queuing System
In order to run serial and/or parallel jobs on the cluster, you must prepare a job control file and submit it from any of the submit nodes (rupc03-09). Job control files are nothing more than shell scripts with additional information specifying the queue, the number of CPUs, etc. We strongly encourage you to use the job script templates given below.
Currently activated queues (2016/03/17):
Physics: (submit host: rupc04)
The Physics cluster (phys) is underutilized by most members, although its machines are not much slower than those in the CoRE clusters. Please take a moment to familiarize yourself with the Physics cluster. It contains many queues, with various amounts of memory and various numbers of cores per node.
CLUSTER QUEUE   CQLOAD  USED  AVAIL  TOTAL  aoACDS  cdsuE
-------------------------------------------------------------------------------
all.q 0.39 0 0 538 0 538
wo02m2 0.00 0 32 34 0 2
wo04m1 0.32 8 12 24 0 4
wo08m2 0.44 21 27 48 8 0
wo08m4 0.65 22 18 48 16 8
wo16m2 0.75 12 4 16 0 0
wo32m2 1.00 32 0 32 32 0
wp04m1 0.13 16 126 148 0 6
wp08m2 0.36 12 8 20 0 0
wp08m3 0.52 100 8 108 16 0
wp12m4 0.54 30 42 72 0 0
Opteron architecture:
wo32m2 -- 32 cores per node and 2GB of memory per core. Total number of cores is 32.
wo16m2 -- 16 cores per node and 2GB of memory per core. Total number of cores is 16.
wo08m4 -- 8 cores per node and 4GB of memory per core. Total number of cores is 32.
wo08m2 -- 8 cores per node and 2GB of memory per core. Total number of cores is 48.
wo04m1 -- 4 cores per node and 1GB of memory per core. Total number of cores is 24.
wo02m2 -- 2 cores per node and 2GB of memory per core. Total number of cores is 34.
Intel architecture:
wp12m4 -- 12 cores per node and 4GB of memory per core. Total number of cores is 72.
wp08m3 -- 8 cores per node and 3GB of memory per core. Total number of cores is 108.
wp08m2 -- 8 cores per node and 2GB of memory per core. Total number of cores is 20.
wp04m1 -- 4 cores per node and 1GB of memory per core. Total number of cores is 148.
Note on name convention:
Opteron-based queues start with "wo", while Intel-based queues start with "wp". The prefix is followed by two numbers separated by the letter m: the first gives the number of cores per node and the second the amount of memory (in GB) per core.
Restrictions: the wp08m1 and wp12m4 queues are devoted to the USPEX project. The remaining queues have no restrictions.
CoRE: (submit hosts: rupc06, rupc08, rupc09 (exclusively for wp02) )
Note: *_hp means "High Priority"; _hp queues share the same hardware as the corresponding non-_hp queues.
CLUSTER QUEUE   CQLOAD  USED  RES  AVAIL  TOTAL  aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q 0.65 0 0 0 2016 0 2016
wp02 0.92 512 0 0 512 32 0
wp02_hp 0.92 0 0 512 512 0 0
wp02m 0.98 16 0 0 16 0 0
wp04 0.22 108 0 140 280 8 32
wp06 0.22 14 0 18 32 0 0
wp08 1.00 336 0 0 336 0 0
wp08_hp 1.00 0 0 336 336 0 0
wp10 0.64 192 0 108 300 156 0
wp12 0.34 204 0 36 240 0 0
wp14 1.00 96 0 0 96 0 0
wp15 0.00 0 0 160 160 0 0
CoRE #2: (submit hosts: rupc05, rupc07)
CLUSTER QUEUE   CQLOAD  USED  RES  AVAIL  TOTAL  aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q 0.73 0 0 0 1138 0 1138
wp32m2 0.00 0 0 96 96 0 0
wp32m4 0.45 144 0 176 320 64 0
wp48m4 0.95 680 0 40 720 576 0
Note on name convention:
Queue names start with "w" (work) followed by "p" (processors). The "wp" prefix is followed by two numbers separated by the letter m (memory): the first gives the number of cores per node and the second the amount of memory (in GB) per core.
Example: wp48m4 means 48 cores per node with 4GB of memory per core.
Job Control Script Templates:
- All Physics queues, all CoRE queues and all CoRE#2 queues with a Gigabit connection can use the script mpich2_gb.sh for job submission, or mpich2_gb_s.sh, in which the main command line is simplified.
- Queues with an InfiniBand connection (CoRE: wp02) can use the submission script mvapich2_ib.sh.
- For serial jobs in all queues one can use the script serial.sh.
- Substitute JOB_NAME with a meaningful job name.
- NUMBER_OF_CPUS (the #$ -pe line) can be a single number (e.g. 8) or a range (e.g. 4-32).
- Also specify your e-mail address to be notified when the job starts/ends.
- Finally, adjust ./YOUR_EXE_FILE_OR_SCRIPT to whatever you run (a minimal job script sketch is given below).
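The sketch below shows the typical shape of such an SGE job script; the parallel environment name and the mpirun line are assumptions and may differ from the official templates above:
  #!/bin/bash
  #$ -N JOB_NAME                    # job name
  #$ -pe mpich2 8                   # parallel environment and NUMBER_OF_CPUS (PE name is an assumption)
  #$ -cwd                           # run in the submission directory
  #$ -j y                           # merge stdout and stderr
  #$ -m be                          # e-mail at the beginning and end of the job
  #$ -M your_address@example.edu    # your e-mail address (placeholder)
  mpirun -np $NSLOTS ./YOUR_EXE_FILE_OR_SCRIPT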
Note: Guidelines for the job control script:
Job Submission Requirements
- Add the following lines to the .bashrc file on the compute nodes.
export SMPD_OPTION_NO_DYNAMIC_HOSTS=1
export OMP_NUM_THREADS=1
export LD_LIBRARY_PATH=/opt/intel/mkl/lib/intel64/:/opt/intel/lib/intel64/
- Create a file .smpd on the compute node with the following content:
phrase=pass
and apply the following permissions: chmod 600 .smpd (a setup example is given below).
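For example, both files can be prepared as follows (since home directories live on the main file server, this can typically be done once from a login node):
  # append the required environment settings to .bashrc
  echo 'export SMPD_OPTION_NO_DYNAMIC_HOSTS=1' >> ~/.bashrc
  echo 'export OMP_NUM_THREADS=1' >> ~/.bashrc
  echo 'export LD_LIBRARY_PATH=/opt/intel/mkl/lib/intel64/:/opt/intel/lib/intel64/' >> ~/.bashrc
  # create the .smpd file and restrict its permissions
  echo 'phrase=pass' > ~/.smpd
  chmod 600 ~/.smpd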
Basic Commands to Manage Jobs
- To submit a job:
qsub your_job_script.sh
- To check the status of submitted job:
qstat
(where qw = queued, t = starting, r = running)
- To delete a job from a queue:
qdel your_job_number
- To see activated queues and their load:
qstat -g c
(alias qs)
- To check whether there are enough resources available to run your job:
qsub -w v job_script.sh
Tips and Tricks
- To send job to a particular host:
qsub -l hostname=sub04n17 job_script.sh
- To submit job for a period of time (example 1 hour):
qsub -l s_cpu=01:00:00 job_script.sh
- To submit job to nodes with 2Gb of memory:
qsub -l mem_total=2000M job_script.sh
- To submit job at specific time:
qsub -a yyyymmddhhmmss job_script.sh
(example: -a 20041230153234 == 2004 Dec 30, 15:32:34)
- There is also a nice and useful graphical utility (qmon) that allows you to submit new jobs. Qmon (Queue MONitor) also provides all possible information about the cluster.
- For complete information about SGE, please consult the User's and Administrator's guides, as well as the linked presentation and other online materials.
- To log in to a particular compute node in the Physics cluster, use the alias dXX, where XX is the node number (only from rupc04). To access a compute node in the CoRE cluster, use the command nXX (from all login servers except rupc04). Examples: d101 (Physics), n165 (CoRE), n138 (CoRE2), n202 (IBM).
Note about jobs output:
One can access the output of a running job by logging in to the node where the job is running. Usually it is in the directory /src or /tmp. The generic name of your job directory is the job number followed by your username (e.g. 12345username).
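For example, from a login server (node name is illustrative):
  n165                                          # log in to CoRE compute node 165 (use dXX for Physics nodes, from rupc04)
  ls -d /tmp/*$USER* /src/*$USER* 2>/dev/null   # locate your job directory (job number followed by username)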
Backup:
Snapshots of the home directory are taken every 5 hours. Hourly snapshots are available from any login node in the directory /snapshot/hour; daily snapshots are available from any login node in the directory /snapshot/day. Hourly snapshots are complete images of your home directory taken at 7am, 12pm, 5pm and 10pm. Daily snapshots are usually taken at 3am every night.
Daily snapshots of the work directory are available on any login server in the directories /mnt/swkXX/, where XX is the number of your work directory folder (to find your XX, list the /work directory with ls -l /work and look for your name).
Example: /mnt/wk19/haule/ means XX=19, i.e. the backup directory is /mnt/swk19.
Note:
Snapshot directories are read-only, i.e. you cannot modify files there; you can only copy deleted files back from them.
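For example, to recover a deleted file from the daily snapshot (the layout under /snapshot is assumed to mirror your home directory; the path is illustrative):
  cp /snapshot/day/$USER/projects/input.dat ~/projects/input.dat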
Storage Policy:
Home directory usage should not exceed 50 GB.
Work directory usage should not exceed 500 GB.
Storage directory usage should not exceed 100 GB [all data should be tarred and gzipped].
Cluster Usage Policy:
Common rules:
There are 4 subclusters which can be used by most users:
I) Physics (located at Physics Rm#284, common usage)
1) can be accessed from rupc04
2) can also be accessed from other rupc frontend nodes by typing "phys"
II) CoRE (located in the CoRE building)
1) can be accessed from rupc06
2) can also be accessed from other rupc frontend nodes by typing "core"
III) CoRE2 (located in the CoRE building)
1) can be accessed from rupc05 or rupc07
2) can also be accessed from other rupc frontend nodes by typing "core2"
IV) IBM (located at CoRE)
1) can be accessed from rupc06
2) can also be accessed from other rupc frontend nodes by typing "ibm"
Although most users have a very high limit on the number of cores they can use simultaneously, it is expected that users will be moderate and not use a disproportionate amount of computer time. If you need more computer time, you should get special permission, which will be granted if resources are available.
There are two groups of users allowed to run jobs on the cluster. Users are divided according to their group supervisor.
- Profs. K. Rabe and D. Vanderbilt : Jobs can be run on Physics and CoRE2 sub clusters. Policy (very important, must read)
- Profs. G. Kotliar and K. Haule : Jobs can be run on Physics, CoRE and IBM sub clusters. Policy (very important, must read)
And finally:
- All new cluster users will need to answer at least three questions about the cluster usage Policy before being granted cluster access.
- Before a user leaves Rutgers, (s)he should inform the cluster administrator of the departure date. If this is not done, access to the cluster will be blocked immediately once this information reaches the administrator by other means.
- After the departure date the user is given three months to archive data and move them from the active work directory into the storage directory.
- After six months the account will be closed unless the supervisor confirms that there are still actively running projects.
Notes:
- If a user exceeds the quota, he/she gets three automatic reminders to reduce the disk usage. After 3 reminders the account will be suspended.
- Please tar and gzip all data in the storage directory, i.e. the storage directory should contain only files with .tar.gz, .tgz or .tar extensions. The script "matreshka.sh" can help you gzip and tar your data (a manual example is given below).