Quick User Guide
Disclaimer: This is a reference manual for RUPC users, describing the organization of the Beowulf cluster, the resources available and the Sun Grid Engine queuing system. We encourage users to spend a few minutes reading this quick manual to fully benefit from the cluster. The quick user's guide for the previous system can be found here.
Cluster Access:
- To get access to the cluster, please contact Viktor Oudovenko [Physics Rm #E256] and provide the following information:
- Your full name
- Username [should be the same as Rutgers NetID]
- Machine name or IP address to be used for logins
- Room number and phone number
- Anticipated period of work
- Name of professor you work with
- Users can log in to the cluster via the frontend nodes (rupc02-rupc09)
- rupc04 is x86-64 Intel Xeon, used to compile and submit jobs to the Physics Xeon machines.
- rupc05 and rupc07 are x86-64 Opteron, used to compile and submit jobs to the Opteron machines in the CoRE2 cluster.
- rupc02, rupc03, rupc06, rupc08 and rupc09 are x86-64 Intel Xeon, used to compile and submit jobs to the Xeon machines in the CoRE cluster.
- Access to the cluster is allowed only from computers that have been granted access and from all physsun machines.
- Secure Shell (SSH) is required to access the cluster and for logins inside the cluster.
- For passwordless logins within the cluster, execute this script line by line (a generic sketch is shown after this list).
- Secure CoPy (SCP) is required to transfer data to, from and within the cluster.
- You are not allowed to run any interactive jobs on the compute nodes.
- All frontend nodes provide compilers, libraries, editors etc.
- The number of open login sessions on a frontend machine is limited to 7 per user.
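For reference, a passwordless-login setup typically looks like the sketch below. This is a generic OpenSSH example, not necessarily identical to the referenced script, and it assumes home directories are shared across nodes (they are served from the main file server); the hostname rupc04 and the file names are used only for illustration.
  # generate a key pair once (an empty passphrase gives passwordless logins inside the cluster)
  ssh-keygen -t rsa
  # append the public key to your own authorized_keys so ssh between cluster nodes needs no password
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys
  # transfer data to and from the cluster with scp
  scp input.dat username@rupc04:~/project/
  scp username@rupc04:~/project/output.dat .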
Cluster Structure (Hardware):
- RUPC consists of three computational subclusters: I) Physics, II) CoRE, and III) CoRE2.
- Physics: AMD Opterons, Intel Xeons
- CoRE: Intel Xeons
- CoRE2: Intel Xeons and AMD Opterons
- Main Servers
- Main File Server [home directories]
- NIS/license+Storage server [storage directories]
- Work File servers [directories for large work data]
- Backup Servers
- SGE Servers [queueing system]
- Web / Temperature Control Servers
Cluster Structure (Software):
- Operating Systems: CentOS 7.0, 7.4 and 7.5
- Modules
  The module command is the interface to the Modules package, which provides dynamic modification of the user's environment via modulefiles.
  - module avail -- list the modules available on the system.
  - module load module_name -- load a module.
  - module rm module_name -- remove (unload) a module.
  - module show module_name -- show information about a module.
- Compilers
  Compilers (C, C++, Fortran 77, Fortran 90), parallel environments (OpenMPI, MPICH3) and numerical libraries are available across the cluster:
  - Intel 18.0
  - GNU
  - PGI
- Libraries
  The most common libraries used in the cluster are:
  - Intel libraries
    - MKL /opt/intel/mkl/lib/intel64/ [x86-64]
    - FFTW{2,3} /opt/intel/mkl/lib/intel64/
  - Fast Fourier libraries
    - FFTW3 /opt/sw/ompi/intel/18.0/fftw-3.3.6-mpi/lib
  - HDF5 libraries
    - HDF5 /opt/sw/ompi/intel/18.0/hdf5/lib
  - ARPACK libraries
    - ARPACK /opt/sw/ompi/intel/18.0/hdf5/lib
  - GNU scientific library
    - GSL
  Loading the intel/ompi module automatically preloads the FFTW, HDF5 and ARPACK libraries (see the example after this list).
- Jupyter Notebook
  To get access to the Jupyter Notebook, please follow the setup instructions on this webpage.
- Makefile Examples for Frequently Used Codes
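As an illustration, a typical compile session might look like the sketch below. The MKL link flags shown are the standard Intel MKL library names for the sequential LP64 layer, and the source file name is a placeholder; check module avail and adjust paths and flags to your case.
  # load the Intel + OpenMPI environment (FFTW, HDF5 and ARPACK are preloaded with it)
  module load intel/ompi
  # compile an MPI Fortran code and link against MKL (sequential MKL layer shown as an example)
  mpif90 -O2 -o my_code my_code.f90 \
         -L/opt/intel/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm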
Queuing System
In order to run serial and/or parallel jobs on the cluster, you must
prepare a job control file and submit it from one of the submit nodes
listed for each subcluster below. Job control files are nothing more than shell scripts
with additional information specifying the queue, the number of CPUs,
etc. We strongly encourage you to use the job script templates given below.
Currently activated queues (2019/09/24):
- Physics: (submit host: rupc04)
CLUSTER QUEUE   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q             0.10      0      0    756      0    756
i04m1             0.00      0     92    136      0     44
i04m4             0.00      0     16    160      0    144
i08m2             0.00      0     12     20      0      8
i08m3             0.00      0     92    108      0     16
i08m3_2           0.31      0     72    128     32     24
i08m3c            0.00      0     88    120      0     32
i12m4             0.29      0     60     72      0     12
o32m2             0.01      0     32     32      0      0

The Physics cluster (phys) is underutilized by most members, although its machines are very close in performance to the servers in the CoRE clusters. Please take a moment to familiarize yourself with the Physics cluster. It contains many queues, which differ in the amount of memory and CPU power per node.
Opteron architecture:
- o32m2 -- 32 cores per node and 2 GB of memory per core.
- o16m2 -- 16 cores per node and 2 GB of memory per core.
Intel architecture:
- i12m4 -- 12 cores per node and 4 GB of memory per core; 72 cores in total.
- i08m3 -- 8 cores per node and 3 GB of memory per core; 108 cores in total.
- i08m3_2 -- 8 cores per node and 3 GB of memory per core; 128 cores in total.
- i08m3c -- 8 cores per node and 3 GB of memory per core; 120 cores in total.
- i08m2 -- 8 cores per node and 2 GB of memory per core; 20 cores in total.
- i04m1 -- 4 cores per node and 1 GB of memory per core; 136 cores in total.
- i04m4 -- 4 cores per node and 4 GB of memory per core; 160 cores in total.
Note on naming convention:
Opteron-based queues start with "o" while Intel-based queues start with "i".
The first letter is followed by two numbers separated by the letter "m": the first number gives the number of cores per node and the second the amount of memory (in GB) per core.
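For example, to direct a job to one of these queues you can name the queue explicitly at submission time (i12m4 is used here only as an illustration; the same can be done with a #$ -q line inside the job script):
  qsub -q i12m4 job_script.sh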
- CoRE: (submit hosts: rupc02, rupc06, rupc08, rupc09 (exclusively for i16m6 and i16m24))
CLUSTER QUEUE   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q             0.35      0      0      0   4060      0   4060
bnl               0.00      0      0    792    792      0      0
dmref             0.01      0      0    112    112      0      0
i12m4i            0.50    144      0    132    300    108     24
i12m4             0.65    156      0     84    240      0      0
i16m4             0.13     48      0    208    256      0      0
i16m6             0.56    348      0    164    512      0      0
i16m24            0.00      0      0     16     16      0      0
i24m5             0.78    628      0     68    696      0      0
i28m4             0.00      0      0    112    112      0      0
i36m5             0.00      0      0    828    828      0      0
jed               0.96    241      0     11    252      0      0
- CoRE #2: (submit hosts: rupc05, rupc07)
CLUSTER QUEUE   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q             0.93      0      0      0   1408      0   1408
wi28m4            1.00    140      0      0    140    140      0
wi36m5            0.00      0      0      0     72      0     72
wo32m4            0.99    448      0      0    448    448      0
wo48m4            0.97    720      0      0    720    624      0

Note on naming convention:
Queue names start with "w" (work) followed by "i" (Intel processors) or "o" (Opteron CPUs).
"wi(o)" is followed by two numbers separated by the letter "m" (memory): the first gives the number of cores per server and the second the amount of memory (in GB) per core.
Example: wo48m4 means 48 cores per node with 4 GB of memory per core, Opteron CPUs.
Job Control Script Templates:
- submit_ompi.sh -- a submission script one can use to submit jobs to all queues.
- Queues with an InfiniBand connection (CoRE: i36m5, bnl, jed, i12m4i) can use the submission script submit_ompi_ib.sh.
- For serial jobs in all queues one can use the script submit.sh.
Note: guidelines for the job control script (a minimal sketch is shown after this list):
- substitute JOB_NAME with a meaningful job name;
- NUMBER_OF_CPUS (the #$ -pe line) can be just a number (e.g. 8) or a range (e.g. 4-32);
- also specify your e-mail address (the #$ -M line) to be notified when the job starts/ends;
- finally, adjust ./YOUR_EXEC to whatever you have as your executable.
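For orientation only, a minimal OpenMPI job script might look like the sketch below; prefer the templates above. The parallel environment name "orte" and the queue name used here are assumptions and should be taken from the actual templates.
  #!/bin/bash
  #$ -N JOB_NAME                  # substitute a meaningful job name
  #$ -cwd                         # run the job in the submission directory
  #$ -q i12m4                     # target queue (example)
  #$ -pe orte 8                   # NUMBER_OF_CPUS; the pe name "orte" is an assumption
  #$ -M YOUR_EMAIL_ADDRESS        # e-mail address for notifications
  #$ -m be                        # send mail at the beginning and end of the job
  mpirun -np $NSLOTS ./YOUR_EXEC  # $NSLOTS is set by SGE to the number of granted slots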
Basic Commands to Manage Jobs
- To submit a job:
qsub your_job_script.sh
- To check the status of submitted jobs (job states: qw = queued, t = starting, r = running):
qstat
- To delete a job from a queue:
qdel your_job_number
- To see activated queues and their load:
qstat -g c
(alias qs)
- To check whether there are enough resources available to run your job:
qsub -w v job_script.sh
Note about job output:
One can access output from running jobs by logging in to the node where the job is running. Usually it is in the directory /src or /tmp. The generic name of your job directory is Job_numberUsername (e.g. 12345username); see the example below.
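For instance (the node name, job number and user name below are purely illustrative):
  n165                      # log in to the compute node running your job (see Tips and Tricks)
  cd /tmp/12345username     # or /src; the directory is Job_number followed by your username
  ls -l                     # inspect the output files of the running job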
Tips and Tricks
- To send a job to a particular host:
qsub -l hostname=n17 job_script.sh
- To submit a job with a CPU time limit (for example 1 hour):
qsub -l s_cpu=01:00:00 job_script.sh
- To submit a job to nodes with 2 GB of memory:
qsub -l mem_total=2000M job_script.sh
- To submit a job at a specific time:
qsub -a yyyymmddhhmmss job_script.sh
(example: -a 20041230153234 == 2004 Dec 30, 15:32:34)
- There is also a useful graphical utility, qmon (Queue MONitor), that allows you to submit new jobs and provides detailed information about the cluster.
- For complete information about SGE, please use the User's and Administrator's guides, as well as this presentation and other online resources.
- To log in to a particular compute node in the Physics cluster, use the alias dXX, where XX is the node number (only from rupc04). To access a compute node in the CoRE clusters, use the command nXX (from all login servers except rupc04). Examples: d101 (Physics), n165 (CoRE) or n138 (CoRE2).
Backup:
Snapshots of the home directory are taken every five hours. Hourly snapshots are available from any login node in the directory /snapshot/hour, and daily snapshots in /snapshot/day. Hourly snapshots are complete images of your home directory taken at 7am, 12pm, 5pm and 10pm; daily snapshots are usually taken at 3am every night.
Daily snapshots of the work directories are available on any login server in the directories /mnt/swkXX/, where XX is the number of your work directory folder (to find your XX, run ls -l /work and look for your name).
Example: /mnt/wk19/user/ means XX=19, i.e. the backup directory is /mnt/swk19.
Note:
Snapshot directories are read-only, i.e. you cannot modify files there; you can only copy deleted files back from them (see the example below).
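As an example, to recover an accidentally deleted file you would copy it back from a snapshot. The exact path below is an assumption (the snapshot layout is expected to mirror your home directory), so check what is available first:
  ls /snapshot/day                                          # see which snapshots exist
  cp /snapshot/day/username/project/input.dat ~/project/    # copy the deleted file back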
Storage Policy:
- Home directory usage should not exceed 50 GB.
- Work directory usage should not exceed 1.5 TB.
- Storage directory usage should not exceed 500 GB [all data should be tarred and gzipped].
Cluster Usage Policy:
Common rules: There are three subclusters which can be used by most users:
I) Physics (located in Physics Rm #284, common usage)
1) can be accessed from rupc04
2) can also be accessed from other rupc frontend nodes by typing "phys"
II) CoRE (located in the CoRE building)
1) can be accessed from rupc02, rupc06, rupc08, rupc09
2) can also be accessed from other rupc frontend nodes by typing "core"
III) CoRE2 (located in the CoRE building)
1) can be accessed from rupc05, rupc07
2) can also be accessed from other rupc frontend nodes by typing "core2"
Although most users have a very high limit on the number of cores they can use simultaneously, users are expected to be moderate and not consume a disproportionate amount of computer time. If you need more computer time, you should request special permission, which will be granted if resources are available.
There are three groups of users allowed to run jobs on the cluster. Users are grouped according to their supervisor.
- Profs. K. Rabe and D. Vanderbilt: jobs can be run on the Physics and CoRE2 subclusters. Policy (very important, must read)
- Profs. G. Kotliar and K. Haule: jobs can be run on the Physics and CoRE subclusters. Policy (very important, must read)
- Prof. J. Pixley: jobs can be run on the Physics and CoRE (queue "jed") subclusters. Policy (very important, must read)
And finally:
- All new cluster users will need to answer at least three questions about the cluster usage policy before being granted cluster access.
- Before a user leaves Rutgers, (s)he should inform the cluster administrator of the departure date. If this is not done, access to the cluster will be blocked immediately once this information reaches the administrator through other channels.
- After the departure date, the user is given three months to archive data and move them from the active work directory into the storage directory.
- After six months the account will be closed unless the supervisor confirms that there are actively running projects.
Notes:
- If a user exceeds the quota, he/she gets three automatic reminders to reduce the disk usage. After three reminders the account will be suspended.
- Please tar and gzip all data in the storage directory, i.e. the storage directory should contain only files with .tar.gz, .tgz or .tar extensions.
The script "matreshka.sh" can help you tar and gzip your data; a manual example is shown below.