Quick User's Guide
Disclaimer: This is a reference manual for RUPC users, describing the organization of the Beowulf cluster, the resources available and the Sun Grid Engine queuing system. We encourage users to spend half an hour reading this quick manual to get the most out of the cluster.
Cluster Access:
- To get access to the cluster, please contact Viktor Oudovenko [Physics Rm #E256] and provide the following information:
- Your full name
- Username [should be the same as on physsun machines]
- Machine name or IP address to be used for logins
- Room number and phone number
- Name of professor you work with
- Users can login to the cluster via frontend nodes (rupc03-rupc09)
- rupc04 is x86-64 Intel Xeon, used to compile and submit jobs to the Physics Xeon machines.
- rupc05 and rupc07 are x86-64 Opteron, used to compile and submit jobs to the Opteron machines in the CoRE#2 queue.
- rupc03, rupc06 and rupc08 are x86-64 Intel Xeon, used to compile and submit jobs to the Xeon machines in the CoRE queue.
- Access to the cluster is allowed only from computers that have been granted access and from all physsun machines.
- Secure Shell (SSH) is required to access the cluster and for logins inside the cluster.
- For passwordless logins within the cluster, execute this script line-by-line (a minimal key-based setup is also sketched after this list).
- Secure CoPy (SCP) is required to transfer data to, from and within the cluster.
- You are not allowed to run any interactive jobs on the compute nodes.
- All frontend nodes provide compilers, libraries, editors and so on.
- The number of open login sessions on a frontend machine is limited to 7 per user.
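A minimal sketch of a key-based SSH setup and an SCP transfer (the referenced script may differ; host and file names below are illustrative):
  # generate a key pair on a frontend node (default location, empty passphrase)
  ssh-keygen -t rsa
  # authorize the key for logins within the cluster (home directories are shared via the file server)
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys
  # copy data to the cluster with scp (illustrative paths)
  scp my_data.tar.gz rupc04:~/work_data/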
Cluster Structure (Hardware):
- RUPC consists of 4 computational subclusters: I) Physics, II) CoRE, III) CoRE2 and IV) IBM.
- Physics: AMD Opterons, Intel Xeons
- CoRE: Intel Xeons
- CoRE2: AMD Opterons
- IBM: IBM P575
- Main Servers
- Main File Server [home directories]
- NIS/license+Storage server [storage directories]
- Work File servers [directories for large work data]
- Backup Servers
- SGE Servers [queueing system]
- Web / Temperature Control Servers
Cluster Structure (Software):
- Operating System
  SuSE 13.1 and CentOS 7.0 (wp02 in the CoRE cluster)
- Compilers
  Compilers (C, C++, Fortran77, Fortran90), the parallel environment (MPICH2) and numerical libraries are available across the cluster.
  - Intel 14.0
The compiler commands are: ifort, icc, icpc
The compiler installation path is: /opt/intel/Compiler/14.0/
The MPICH2 wrappers: mpif77, mpif90, mpicc, mpiCC
The MPICH2 wrapper installation path is: /opt/mpich2/intel/14.0/
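For illustration, a serial and a parallel build with these tools (source and output file names are placeholders):
  # serial build with the Intel Fortran compiler
  ifort -O2 -o my_code.x my_code.f90
  # parallel build with the Intel-based MPICH2 wrapper
  mpif90 -O2 -o my_mpi_code.x my_mpi_code.f90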
- Libraries
  The most common libraries used in the cluster are:
  - Intel libraries
    - MKL: /opt/intel/mkl/lib/intel64/ [x86-64]
    - FFTW{2,3}: /opt/intel/mkl/lib/intel64/
  - Fast Fourier libraries
    - FFTW3: /opt/fftw/lib64/
  - GNU Scientific Library
    - GSL: rpm OS package
- Linking Tips
  - Linking example for the BLAS and LAPACK libraries using MKL:
    LAPACK_LIB = -mkl
    LIB = $(LAPACK_LIB)
- Makefile Examples for Frequently Used Codes
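The official template links are not reproduced here; below is a generic makefile sketch for an MPI Fortran code built with the Intel/MPICH2 toolchain and linked against MKL via -mkl (target and file names are placeholders, not one of the official examples):
  # makefile sketch: Intel compiler + MPICH2 wrapper + MKL
  # (recipe lines must be indented with a tab)
  F90        = mpif90
  FFLAGS     = -O2
  LAPACK_LIB = -mkl
  LIB        = $(LAPACK_LIB)

  my_code.x: my_code.o
  	$(F90) $(FFLAGS) -o $@ my_code.o $(LIB)

  my_code.o: my_code.f90
  	$(F90) $(FFLAGS) -c my_code.f90

  clean:
  	rm -f *.o my_code.x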
Queuing System
In order to run serial and/or parallel jobs on the cluster, you must prepare a job control file and submit it from any of the submit nodes (rupc03-09). Job control files are nothing more than shell scripts with additional information specifying the queue, the number of CPUs, etc. We strongly encourage you to use the job script templates given below.
Currently activated queues (2016/03/17):
Physics: (submit host: rupc04)
The Physics cluster (phys) is underutilized by most members, although its machines are not much slower than those in the CoRE clusters. Please take a moment to familiarize yourself with the Physics cluster. It contains many queues, with various amounts of memory and various numbers of cores per node.
CLUSTER QUEUE   CQLOAD  USED  AVAIL  TOTAL  aoACDS  cdsuE
-------------------------------------------------------------------------------
all.q 0.39 0 0 538 0 538
wo02m2 0.00 0 32 34 0 2
wo04m1 0.32 8 12 24 0 4
wo08m2 0.44 21 27 48 8 0
wo08m4 0.65 22 18 48 16 8
wo16m2 0.75 12 4 16 0 0
wo32m2 1.00 32 0 32 32 0
wp04m1 0.13 16 126 148 0 6
wp08m2 0.36 12 8 20 0 0
wp08m3 0.52 100 8 108 16 0
wp12m4 0.54 30 42 72 0 0
Opteron architecture:
wo32m2 -- 32 cores per node and 2GB of memory per core. Total number of cores is 32.
wo16m2 -- 16 cores per node and 2GB of memory per core. Total number of cores is 16.
wo08m4 -- 8 cores per node and 4GB of memory per core. Total number of cores is 32.
wo08m2 -- 8 cores per node and 2GB of memory per core. Total number of cores is 48.
wo04m1 -- 4 cores per node and 1GB of memory per core. Total number of cores is 24.
wo02m2 -- 2 cores per node and 2GB of memory per core. Total number of cores is 34.
Intel architecture:
wp12m4 -- 12 cores per node and 4GB of memory per core. Total number of cores is 72.
wp08m3 -- 8 cores per node and 3GB of memory per core. Total number of cores is 108.
wp08m2 -- 8 cores per node and 2GB of memory per core. Total number of cores is 20.
wp04m1 -- 4 cores per node and 1GB of memory per core. Total number of cores is 148.
Note on name convention:
Opteron-based queues start with "wo", while Intel-based queues start with "wp". The prefix is followed by two numbers separated by the letter m: the first gives the number of cores per node and the second the amount of memory (in GB) per core.
Restrictions: the wp08m1 and wp12m4 queues are devoted to the USPEX project. The remaining queues have no restrictions.
CoRE: (submit hosts: rupc06, rupc08, rupc09 (exclusively for wp02) )
Note: *_hp means "High Priority"; _hp queues share the same hardware as the corresponding non-_hp queues.
CLUSTER QUEUE   CQLOAD  USED  RES  AVAIL  TOTAL  aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q 0.65 0 0 0 2016 0 2016
wp02 0.92 512 0 0 512 32 0
wp02_hp 0.92 0 0 512 512 0 0
wp02m 0.98 16 0 0 16 0 0
wp04 0.22 108 0 140 280 8 32
wp06 0.22 14 0 18 32 0 0
wp08 1.00 336 0 0 336 0 0
wp08_hp 1.00 0 0 336 336 0 0
wp10 0.64 192 0 108 300 156 0
wp12 0.34 204 0 36 240 0 0
wp14 1.00 96 0 0 96 0 0
wp15 0.00 0 0 160 160 0 0
CoRE #2: (submit hosts: rupc05, rupc07)
CLUSTER QUEUE   CQLOAD  USED  RES  AVAIL  TOTAL  aoACDS  cdsuE
--------------------------------------------------------------------------------
all.q 0.73 0 0 0 1138 0 1138
wp32m2 0.00 0 0 96 96 0 0
wp32m4 0.45 144 0 176 320 64 0
wp48m4 0.95 680 0 40 720 576 0
Note on name convention:
Queue names start with "w" (work) followed by "p" (processors). The "wp" prefix is followed by two numbers separated by the letter m (memory): the first gives the number of cores per node and the second the amount of memory (in GB) per core.
Example: wp48m4 means 48 cores per node with 4GB of memory per core.
Job Control Script Templates:
- All Physics queues, all CoRE queues and all CoRE#2 queues with a Gigabit connection can use the script mpich2_gb.sh for job submission, or mpich2_gb_s.sh, in which the main command line is simplified.
- Queues with an InfiniBand connection (CoRE: wp02) can use the submission script mvapich2_ib.sh.
- For serial jobs in all queues one can use the script serial.sh.
- Substitute JOB_NAME with a meaningful job name.
- NUMBER_OF_CPUS (the #$ -pe line) can be a single number (e.g. 8) or a range (e.g. 4-32).
- Also specify your e-mail address to be notified when the job starts/ends.
- Finally, adjust ./YOUR_EXE_FILE_OR_SCRIPT to whatever you run (a minimal job script sketch is given below).
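The sketch below shows the typical shape of such an SGE job script; the parallel environment name and the mpirun line are assumptions and may differ from the official templates above:
  #!/bin/bash
  #$ -N JOB_NAME                    # job name
  #$ -pe mpich2 8                   # parallel environment and NUMBER_OF_CPUS (PE name is an assumption)
  #$ -cwd                           # run in the submission directory
  #$ -j y                           # merge stdout and stderr
  #$ -m be                          # e-mail at the beginning and end of the job
  #$ -M your_address@example.edu    # your e-mail address (placeholder)
  mpirun -np $NSLOTS ./YOUR_EXE_FILE_OR_SCRIPT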
Note: Guidelines for the job control script:
Job Submission Requirements
- Add the following lines to the .bashrc file on the compute nodes.
export SMPD_OPTION_NO_DYNAMIC_HOSTS=1
export OMP_NUM_THREADS=1
export LD_LIBRARY_PATH=/opt/intel/mkl/lib/intel64/:/opt/intel/lib/intel64/
- Create a file .smpd on the compute node with the following content:
phrase=pass
and apply the following permissions: chmod 600 .smpd (a setup example is given below).
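For example, both files can be prepared as follows (since home directories live on the main file server, this can typically be done once from a login node):
  # append the required environment settings to .bashrc
  echo 'export SMPD_OPTION_NO_DYNAMIC_HOSTS=1' >> ~/.bashrc
  echo 'export OMP_NUM_THREADS=1' >> ~/.bashrc
  echo 'export LD_LIBRARY_PATH=/opt/intel/mkl/lib/intel64/:/opt/intel/lib/intel64/' >> ~/.bashrc
  # create the .smpd file and restrict its permissions
  echo 'phrase=pass' > ~/.smpd
  chmod 600 ~/.smpd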
Basic Commands to Manage Jobs
- To submit a job:
qsub your_job_script.sh
- To check the status of submitted job:
qstat
(where qw = queued, t = starting, r = running)
- To delete a job from a queue:
qdel your_job_number
- To see activated queues and their load:
qstat -g c
(alias qs)
- To check whether there are enough resources available to run your job:
qsub -w v job_script.sh
Tips and Tricks
- To send job to a particular host:
qsub -l hostname=sub04n17 job_script.sh
- To submit job for a period of time (example 1 hour):
qsub -l s_cpu=01:00:00 job_script.sh
- To submit job to nodes with 2Gb of memory:
qsub -l mem_total=2000M job_script.sh
- To submit job at specific time:
qsub -a yyyymmddhhmmss job_script.sh
(example: -a 20041230153234 == 2004 Dec 30, 15:32:34)
- There is also a nice and useful graphical utility (qmon) that allows you to submit new jobs. Qmon (Queue MONitor) also provides all possible information about the cluster.
- For complete information about SGE, please consult the User's and Administrator's guides, as well as the linked presentation and other online materials.
- To log in to a particular compute node in the Physics cluster, use the alias dXX, where XX is the node number (only from rupc04). To access a compute node in the CoRE cluster, use the command nXX (from all login servers except rupc04). Examples: d101 (Physics), n165 (CoRE), n138 (CoRE2), n202 (IBM).
Note about jobs output:
One can access the output of a running job by logging in to the node where the job is running. Usually it is in the directory /src or /tmp. The generic name of your job directory is the job number followed by your username (e.g. 12345username).
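For example, from a login server (node name is illustrative):
  n165                                          # log in to CoRE compute node 165 (use dXX for Physics nodes, from rupc04)
  ls -d /tmp/*$USER* /src/*$USER* 2>/dev/null   # locate your job directory (job number followed by username)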
Backup:
Snapshots of the home directory are taken every 5 hours. Hourly snapshots are available from any login node in the directory /snapshot/hour; daily snapshots are available from any login node in the directory /snapshot/day. Hourly snapshots are complete images of your home directory taken at 7am, 12pm, 5pm and 10pm. Daily snapshots are usually taken at 3am every night.
Daily snapshots of the work directory are available on any login server in the directories /mnt/swkXX/, where XX is the number of your work directory folder (to find your XX, list the /work directory with ls -l /work and look for your name).
Example: /mnt/wk19/haule/ means XX=19, i.e. the backup directory is /mnt/swk19.
Note:
Snapshot directories are read-only, i.e. you cannot modify files there; you can only copy deleted files back from them.
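For example, to recover a deleted file from the daily snapshot (the layout under /snapshot is assumed to mirror your home directory; the path is illustrative):
  cp /snapshot/day/$USER/projects/input.dat ~/projects/input.dat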
Storage Policy:
Home directory usage should not exceed 50 GB.
Work directory usage should not exceed 500 GB.
Storage directory usage should not exceed 100 GB [all data should be tarred and gzipped].
Cluster Usage Policy:
Common rules:
There are 4 subclusters which can be used by most users:
I) Physics (located at Physics Rm#284, common usage)
1) can be accessed from rupc04
2) can also be accessed from other rupc frontend nodes by typing "phys"
II) CoRE (located in the CoRE building)
1) can be accessed from rupc06
2) can also be accessed from other rupc frontend nodes by typing "core"
III) CoRE2 (located in the CoRE building)
1) can be accessed from rupc05 or rupc07
2) can also be accessed from other rupc frontend nodes by typing "core2"
IV) IBM (located at CoRE)
1) can be accessed from rupc06
2) can also be accessed from other rupc frontend nodes by typing "ibm"
Although most users have a very high limit on the number of cores they can use simultaneously, it is expected that users will be moderate and not use a disproportionate amount of computer time. If you need more computer time, you should get special permission, which will be granted if resources are available.
There are two groups of users allowed to run jobs on the cluster. Users are divided according to their group supervisor.
- Profs. K. Rabe and D. Vanderbilt : Jobs can be run on Physics and CoRE2 sub clusters. Policy (very important, must read)
- Profs. G. Kotliar and K. Haule : Jobs can be run on Physics, CoRE and IBM sub clusters. Policy (very important, must read)
And finally:
- All new cluster users will need to answer at least three questions about the cluster usage Policy before being granted cluster access.
- Before a user leaves Rutgers, (s)he should inform the cluster administrator of the departure date. If this is not done, access to the cluster will be blocked immediately once this information reaches the administrator by other means.
- After the departure date the user is given three months to archive data and move them from the active work directory into the storage directory.
- After six months the account will be closed unless the supervisor confirms that there are still actively running projects.
Notes:
- If a user exceeds the quota, he/she gets three automatic reminders to reduce the disk usage. After 3 reminders the account will be suspended.
- Please tar and gzip all data in the storage directory, i.e. the storage directory should contain only files with .tar.gz, .tgz or .tar extensions. The script "matreshka.sh" can help you gzip and tar your data (a manual example is given below).