Quick User Guide
Disclaimer: This is a reference manual for RUPC users, describing the organization of the Beowulf cluster, the resources available and the Sun Grid Engine queuing system. We encourage users to spend a few minutes reading this quick manual to fully benefit from the cluster. The quick user's guide for the previous system can be found here.
Cluster Access:
- To get access to the cluster, please contact Viktor Oudovenko [Physics Rm #E256] and provide the following information:
- Your full name
- Username [should be the same as Rutgers NetID]
- Machine name or IP address to be used for logins
- Room number and phone number
- Anticipated period of work
- Name of professor you work with
- Users can log in to the cluster via the frontend nodes (rupc02-rupc09)
- rupc04 is x86-64 Intel Xeon, used to compile and submit jobs to the Physics Xeon machines.
- rupc05 and rupc07 are x86-64 Opteron, used to compile and submit jobs to the Opteron machines in the CoRE2 cluster.
- rupc02, rupc03, rupc06, rupc08 and rupc09 are x86-64 Intel Xeon, used to compile and submit jobs to the Xeon machines in the CoRE cluster.
- Access to the cluster is allowed only from computers that have been granted access and from all physsun machines.
- Secure Shell (SSH) is required to access the cluster and for logins inside the cluster.
- For passwordless logins within the cluster, execute this script line by line (a generic sketch is shown after this list).
- Secure CoPy (SCP) is required to transfer data to, from and within the cluster.
- You are not allowed to run any interactive jobs on the compute nodes.
- All frontend nodes provide compilers, libraries, editors etc.
- The number of open login sessions on a frontend machine is limited to 7 per user.
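For reference, a passwordless-login setup typically looks like the sketch below. This is a generic OpenSSH example, not necessarily identical to the referenced script, and it assumes home directories are shared across nodes (they are served from the main file server); the hostname rupc04 and the file names are used only for illustration.
  # generate a key pair once (an empty passphrase gives passwordless logins inside the cluster)
  ssh-keygen -t rsa
  # append the public key to your own authorized_keys so ssh between cluster nodes needs no password
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys
  # transfer data to and from the cluster with scp
  scp input.dat username@rupc04:~/project/
  scp username@rupc04:~/project/output.dat .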
Cluster Structure (Hardware):
- RUPC consists of three computational subclusters: I) Physics, II) CoRE, and III) CoRE2.
- Physics: AMD Opterons, Intel Xeons
- CoRE: Intel Xeons
- CoRE2: Intel Xeons and AMD Opterons
- Main Servers
- Main File Server [home directories]
- NIS/license+Storage server [storage directories]
- Work File servers [directories for large work data]
- Backup Servers
- SGE Servers [queueing system]
- Web / Temperature Control Servers
Cluster Structure (Software):
- Operating Systems: CentOS 7.0, 7.4 and 7.5
- Modules
  The module command is the interface to the Modules package, which provides dynamic modification of the user's environment via modulefiles.
  - module avail -- list the modules available on the system.
  - module load module_name -- load a module.
  - module rm module_name -- remove (unload) a module.
  - module show module_name -- show information about a module.
- Compilers
  Compilers (C, C++, Fortran 77, Fortran 90), parallel environments (OpenMPI, MPICH3) and numerical libraries are available across the cluster:
  - Intel 18.0
  - GNU
  - PGI
- Libraries
  The most common libraries used in the cluster are:
  - Intel libraries
    - MKL /opt/intel/mkl/lib/intel64/ [x86-64]
    - FFTW{2,3} /opt/intel/mkl/lib/intel64/
  - Fast Fourier libraries
    - FFTW3 /opt/sw/ompi/intel/18.0/fftw-3.3.6-mpi/lib
  - HDF5 libraries
    - HDF5 /opt/sw/ompi/intel/18.0/hdf5/lib
  - ARPACK libraries
    - ARPACK /opt/sw/ompi/intel/18.0/hdf5/lib
  - GNU scientific library
    - GSL
  Loading the intel/ompi module automatically preloads the FFTW, HDF5 and ARPACK libraries (see the example after this list).
- Jupyter Notebook
  To get access to the Jupyter Notebook, please follow the setup instructions on this webpage.
- Makefile Examples for Frequently Used Codes
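As an illustration, a typical compile session might look like the sketch below. The MKL link flags shown are the standard Intel MKL library names for the sequential LP64 layer, and the source file name is a placeholder; check module avail and adjust paths and flags to your case.
  # load the Intel + OpenMPI environment (FFTW, HDF5 and ARPACK are preloaded with it)
  module load intel/ompi
  # compile an MPI Fortran code and link against MKL (sequential MKL layer shown as an example)
  mpif90 -O2 -o my_code my_code.f90 \
         -L/opt/intel/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm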
Queuing System
In order to run serial and/or parallel jobs on the cluster, you must
prepare a job control file and submit it from one of the submit nodes
listed for each subcluster below. Job control files are nothing more than shell scripts
with additional information specifying the queue, the number of CPUs,
etc. We strongly encourage you to use the job script templates given below.
Currently activated queues (2019/09/24):
- Physics: (submit host: rupc04)
CLUSTER QUEUE   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q             0.10      0      0    756      0    756
i04m1             0.00      0     92    136      0     44
i04m4             0.00      0     16    160      0    144
i08m2             0.00      0     12     20      0      8
i08m3             0.00      0     92    108      0     16
i08m3_2           0.31      0     72    128     32     24
i08m3c            0.00      0     88    120      0     32
i12m4             0.29      0     60     72      0     12
o32m2             0.01      0     32     32      0      0

The Physics cluster (phys) is underutilized by most members, although its machines are very close in performance to the servers in the CoRE clusters. Please take a moment to familiarize yourself with the Physics cluster. It contains many queues, which differ in the amount of memory and CPU power per node.
Opteron architecture:
- o32m2 -- 32 cores per node and 2 GB of memory per core.
- o16m2 -- 16 cores per node and 2 GB of memory per core.
Intel architecture:
- i12m4 -- 12 cores per node and 4 GB of memory per core; 72 cores in total.
- i08m3 -- 8 cores per node and 3 GB of memory per core; 108 cores in total.
- i08m3_2 -- 8 cores per node and 3 GB of memory per core; 128 cores in total.
- i08m3c -- 8 cores per node and 3 GB of memory per core; 120 cores in total.
- i08m2 -- 8 cores per node and 2 GB of memory per core; 20 cores in total.
- i04m1 -- 4 cores per node and 1 GB of memory per core; 136 cores in total.
- i04m4 -- 4 cores per node and 4 GB of memory per core; 160 cores in total.
Note on naming convention:
Opteron-based queues start with "o" while Intel-based queues start with "i".
The first letter is followed by two numbers separated by the letter "m": the first number gives the number of cores per node and the second the amount of memory (in GB) per core.
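For example, to direct a job to one of these queues you can name the queue explicitly at submission time (i12m4 is used here only as an illustration; the same can be done with a #$ -q line inside the job script):
  qsub -q i12m4 job_script.sh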
- CoRE: (submit hosts: rupc02, rupc06, rupc08, rupc09 (exclusively for i16m6 and i16m24))
CLUSTER QUEUE   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q             0.35      0      0      0   4060      0   4060
bnl               0.00      0      0    792    792      0      0
dmref             0.01      0      0    112    112      0      0
i12m4i            0.50    144      0    132    300    108     24
i12m4             0.65    156      0     84    240      0      0
i16m4             0.13     48      0    208    256      0      0
i16m6             0.56    348      0    164    512      0      0
i16m24            0.00      0      0     16     16      0      0
i24m5             0.78    628      0     68    696      0      0
i28m4             0.00      0      0    112    112      0      0
i36m5             0.00      0      0    828    828      0      0
jed               0.96    241      0     11    252      0      0
- CoRE #2: (submit hosts: rupc05, rupc07)
CLUSTER QUEUE   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q             0.93      0      0      0   1408      0   1408
wi28m4            1.00    140      0      0    140    140      0
wi36m5            0.00      0      0      0     72      0     72
wo32m4            0.99    448      0      0    448    448      0
wo48m4            0.97    720      0      0    720    624      0

Note on naming convention:
Queue names start with "w" (work) followed by "i" (Intel processors) or "o" (Opteron CPUs).
"wi(o)" is followed by two numbers separated by the letter "m" (memory): the first gives the number of cores per server and the second the amount of memory (in GB) per core.
Example: wo48m4 means 48 cores per node with 4 GB of memory per core, Opteron CPUs.
Job Control Script Templates:
- submit_ompi.sh -- a submission script one can use to submit jobs to all queues.
- Queues with an InfiniBand connection (CoRE: i36m5, bnl, jed, i12m4i) can use the submission script submit_ompi_ib.sh.
- For serial jobs in all queues one can use the script submit.sh.
Note: guidelines for the job control script (a minimal sketch is shown after this list):
- substitute JOB_NAME with a meaningful job name;
- NUMBER_OF_CPUS (the #$ -pe line) can be just a number (e.g. 8) or a range (e.g. 4-32);
- also specify your e-mail address (the #$ -M line) to be notified when the job starts/ends;
- finally, adjust ./YOUR_EXEC to whatever you have as your executable.
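For orientation only, a minimal OpenMPI job script might look like the sketch below; prefer the templates above. The parallel environment name "orte" and the queue name used here are assumptions and should be taken from the actual templates.
  #!/bin/bash
  #$ -N JOB_NAME                  # substitute a meaningful job name
  #$ -cwd                         # run the job in the submission directory
  #$ -q i12m4                     # target queue (example)
  #$ -pe orte 8                   # NUMBER_OF_CPUS; the pe name "orte" is an assumption
  #$ -M YOUR_EMAIL_ADDRESS        # e-mail address for notifications
  #$ -m be                        # send mail at the beginning and end of the job
  mpirun -np $NSLOTS ./YOUR_EXEC  # $NSLOTS is set by SGE to the number of granted slots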
Basic Commands to Manage Jobs
- To submit a job:
qsub your_job_script.sh
- To check the status of submitted jobs (job states: qw = queued, t = starting, r = running):
qstat
- To delete a job from a queue:
qdel your_job_number
- To see activated queues and their load:
qstat -g c
(alias qs)
- To check whether there are enough resources available to run your job:
qsub -w v job_script.sh
Note about job output:
One can access output from running jobs by logging in to the node where the job is running. Usually it is in the directory /src or /tmp. The generic name of your job directory is Job_numberUsername (e.g. 12345username); see the example below.
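For instance (the node name, job number and user name below are purely illustrative):
  n165                      # log in to the compute node running your job (see Tips and Tricks)
  cd /tmp/12345username     # or /src; the directory is Job_number followed by your username
  ls -l                     # inspect the output files of the running job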
Tips and Tricks
- To send a job to a particular host:
qsub -l hostname=n17 job_script.sh
- To submit a job with a CPU time limit (for example 1 hour):
qsub -l s_cpu=01:00:00 job_script.sh
- To submit a job to nodes with 2 GB of memory:
qsub -l mem_total=2000M job_script.sh
- To submit a job at a specific time:
qsub -a yyyymmddhhmmss job_script.sh
(example: -a 20041230153234 == 2004 Dec 30, 15:32:34)
- There is also a useful graphical utility, qmon (Queue MONitor), that allows you to submit new jobs and provides detailed information about the cluster.
- For complete information about SGE, please use the User's and Administrator's guides, as well as this presentation and other online resources.
- To log in to a particular compute node in the Physics cluster, use the alias dXX, where XX is the node number (only from rupc04). To access a compute node in the CoRE clusters, use the command nXX (from all login servers except rupc04). Examples: d101 (Physics), n165 (CoRE) or n138 (CoRE2).
Backup:
Snapshots of the home directory are taken every five hours. Hourly snapshots are available from any login node in the directory /snapshot/hour, and daily snapshots in /snapshot/day. Hourly snapshots are complete images of your home directory taken at 7am, 12pm, 5pm and 10pm; daily snapshots are usually taken at 3am every night.
Daily snapshots of the work directories are available on any login server in the directories /mnt/swkXX/, where XX is the number of your work directory folder (to find your XX, run ls -l /work and look for your name).
Example: /mnt/wk19/user/ means XX=19, i.e. the backup directory is /mnt/swk19.
Note:
Snapshot directories are read-only, i.e. you cannot modify files there; you can only copy deleted files back from them (see the example below).
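As an example, to recover an accidentally deleted file you would copy it back from a snapshot. The exact path below is an assumption (the snapshot layout is expected to mirror your home directory), so check what is available first:
  ls /snapshot/day                                          # see which snapshots exist
  cp /snapshot/day/username/project/input.dat ~/project/    # copy the deleted file back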
Storage Policy:
- Home directory usage should not exceed 50 GB.
- Work directory usage should not exceed 1.5 TB.
- Storage directory usage should not exceed 500 GB [all data should be tarred and gzipped].
Cluster Usage Policy:
Common rules: There are three subclusters which can be used by most users:
I) Physics (located in Physics Rm #284, common usage)
1) can be accessed from rupc04
2) can also be accessed from other rupc frontend nodes by typing "phys"
II) CoRE (located in the CoRE building)
1) can be accessed from rupc02, rupc06, rupc08, rupc09
2) can also be accessed from other rupc frontend nodes by typing "core"
III) CoRE2 (located in the CoRE building)
1) can be accessed from rupc05, rupc07
2) can also be accessed from other rupc frontend nodes by typing "core2"
Although most users have a very high limit on the number of cores they can use simultaneously, users are expected to be moderate and not consume a disproportionate amount of computer time. If you need more computer time, you should request special permission, which will be granted if resources are available.
There are three groups of users allowed to run jobs on the cluster. Users are grouped according to their supervisor.
- Profs. K. Rabe and D. Vanderbilt: jobs can be run on the Physics and CoRE2 subclusters. Policy (very important, must read)
- Profs. G. Kotliar and K. Haule: jobs can be run on the Physics and CoRE subclusters. Policy (very important, must read)
- Prof. J. Pixley: jobs can be run on the Physics and CoRE (queue "jed") subclusters. Policy (very important, must read)
And finally:
- All new cluster users will need to answer at least three questions about the cluster usage policy before being granted cluster access.
- Before a user leaves Rutgers, (s)he should inform the cluster administrator of the departure date. If this is not done, access to the cluster will be blocked immediately once this information reaches the administrator through other channels.
- After the departure date, the user is given three months to archive data and move them from the active work directory into the storage directory.
- After six months the account will be closed unless the supervisor confirms that there are actively running projects.
Notes:
- If a user exceeds the quota, he/she gets three automatic reminders to reduce the disk usage. After three reminders the account will be suspended.
- Please tar and gzip all data in the storage directory, i.e. the storage directory should contain only files with .tar.gz, .tgz or .tar extensions.
The script "matreshka.sh" can help you tar and gzip your data; a manual example is shown below.