Quick User Guide
Disclaimer: This is a reference manual for RUPC users, describing the organization of the Beowulf cluster, the resources available, and the Sun Grid Engine (SGE) queuing system. We encourage users to spend a few minutes reading this quick manual to get the most out of the cluster.
Cluster Access:
- To gain access to the cluster, please contact Viktor Oudovenko [Physics Rm #E256] and provide the following information:
- Your full name
- Username [should match your Rutgers NetID]
- Machine name or IP address to be used for logins
- Room number and phone number
- Anticipated period of work
- Name of the professor you work with
- Users can log in to the cluster via the frontend node (rupc-01.rutgers.edu).
- Access to the cluster is allowed only from computers that have been granted access or from computers connected through the Rutgers VPN.
- Secure Shell (SSH) is required to access the cluster and for logins within the cluster.
- For passwordless logins within the cluster, execute the provided script line by line (a sketch of the typical commands appears after this list).
- Secure CoPy (SCP) is required to transfer data to and from the cluster and within the cluster.
- Interactive jobs are not permitted on the login node(s).
- The number of open login sessions on the frontend machine is limited to 7 per user.
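A minimal sketch of the usual passwordless-login setup and an SCP transfer, assuming home directories are shared across nodes via the main file server (key type, file names, and target directory are assumptions; the provided script may differ):
ssh-keygen -t ed25519                                   # generate a key pair, accepting the default location
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys     # authorize the key for logins between nodes
chmod 600 ~/.ssh/authorized_keys                        # keep the permissions strict
scp input.dat rupc-01.rutgers.edu:work/                 # copy a file to the cluster with SCP (hypothetical file and directory)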
Cluster Structure (Hardware):
- Main Servers
- Main File Server [home directories]
- LDAP server
- Work File servers [directories for large work data]
- Backup Servers
- SGE Servers [queueing system]
- Web / Temperature Control Servers
Cluster Structure (Software):
- Operating Systems
Rocky Linux 9.3
- Modules
Module: a command interface to the Modules package, which allows for the dynamic modification of the user's environment through modulefiles.
- module avail -- to see the modules available on the system.
- module load module_name -- to load a module.
- module rm module_name -- to remove a module.
- module show module_name -- to show information about a module.
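For example (the module name below is hypothetical; check the output of module avail for the exact names on this system):
module avail                    # list the modules available on the system
module load intel-oneapi        # load a compiler module (hypothetical name)
module show intel-oneapi        # see what the module sets in your environment
module rm intel-oneapi          # remove it again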
- Compilers
Compilers (C, C++, Fortran 77, Fortran 90), the parallel environment (OpenMPI), and numerical libraries are available across the cluster.
- Intel oneAPI 2024.0
- GNU
- Libraries
The most common libraries used on the cluster are MKL, FFTW, HDF5, GSL, ...
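As a sketch of combining a compiler and a library (module names, flags, and file names are assumptions; the exact MKL link line depends on the compiler you load):
module load fftw                                  # hypothetical module name
gcc -O2 -o fft_test fft_test.c -lfftw3 -lm        # serial C build linked against FFTW
mpif90 -O2 -o my_code my_code.f90                 # parallel Fortran build using the OpenMPI wrapper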
- Jupyter Notebook
To access Jupyter Notebook, please follow the setup instructions at this webpage.
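A common approach on clusters of this kind is to run the notebook on a submit node and forward its port over SSH; this is only an assumption here (including the node name and port), and the linked instructions take precedence:
jupyter notebook --no-browser --port=8888                        # on a submit node, not on the login node
ssh -N -L 8888:localhost:8888 -J rupc-01.rutgers.edu rupc-03     # on your workstation, then open http://localhost:8888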
- Makefile Examples for Frequently Used Codes
Queuing System
In order to run serial and/or parallel jobs on the cluster, you must prepare a job control file and submit it from any of the submit nodes (rupc03-09). Job control files are nothing more than shell scripts with additional information specifying the queue, the number of CPUs, etc. We strongly encourage you to use the job script templates given below.
Currently activated queues (2024/09/01):
- CMT (submit host: rupc-01)
CLUSTER QUEUE   CQLOAD   USED    RES  AVAIL  TOTAL  aoACDS  cdsuE
------------------------------------------------------------------
all.q             0.11      0      0      0   5968       0   5968
ari56m4           0.02     24      0    312    336       0      0
dki28m4           0.00      0      0    140    140       0      0
dki28m4d          0.00      0      0    112    112       0      0
dki36m5           0.00      0      0     72     72       0      0
dko48m4           0.00      0      0    768    864       0     96
dko64m4           0.00      0      0    192    192       0      0
gkbnl             0.35      0      0    612    648       0     36
gkbnlg            0.00      0      0    144    144       0      0
gki16m4           0.00      0      0    256    256       0      0
gki16m6           0.00      0      0    432    528       0     96
gki24m5           0.00      0      0    648    696       0     48
gki28m4           0.00      0      0    112    112       0      0
gki36m5           0.49    579      0    249    828       0      0
gko64m4           0.00      0      0    640    640       0      0
jpi36m5           0.00      0      0    252    252       0      0
jpo64m4           0.00      0      0    384    384       0      0
AR - Ananda Roy
DK - David (Vanderbilt) & Karin (Rabe)
GK - Gabriel (Kotliar) & Kristjan (Haule)
JP - Jedediah Pixley
- Job Control Script Templates:
- submit_ompi.sh -- a submission script one can use to submit jobs to all queues.
- For serial jobs in all queues, one can use the script submit.sh.
Note: Guidelines for the job control script:
- substitute JOB_NAME with a meaningful job name.
- NUMBER_OF_CPUS (the #$ -pe line) can be a single number (e.g. 8) or a range (e.g. 4-32).
- also specify your e-mail address to be notified when the job starts/ends.
- and finally, adjust ./YOUR_EXEC to whatever you have as your executable.
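A minimal sketch of such a job control script, assuming an OpenMPI parallel environment named mpi and using the placeholders from the guidelines above (the provided templates submit_ompi.sh and submit.sh remain the authoritative versions):
#!/bin/bash
#$ -N JOB_NAME                      # meaningful job name
#$ -cwd                             # run the job from the submission directory
#$ -q gki36m5                       # target queue (assumption; pick one from qstat -g c)
#$ -pe mpi NUMBER_OF_CPUS           # parallel environment and slot count ('mpi' is an assumption)
#$ -M your_address@example.edu      # e-mail address for notifications
#$ -m be                            # send mail at job begin and end
mpirun -np $NSLOTS ./YOUR_EXEC      # $NSLOTS is set by SGE to the number of granted slots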
- Basic Commands to Manage Jobs
- To submit a job:
qsub your_job_script.sh
- To check the status of submitted jobs:
qstat
- To delete a job from a queue:
qdel your_job_number
- To see activated queues and their load:
qstat -g c
(alias qs)
- To check whether there are enough resources available to run your job:
qsub -w v job_script.sh
(job states reported by qstat: qw = queued, t = starting, r = running)
- Note about job output:
One can access the output of running jobs by logging in to the node where the job is running. Usually it is in the directory /src or /tmp. The generic name of your job directory is Job_numberUsername (e.g. 12345username).
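For instance (node name, job number, and username are hypothetical):
ssh n17                                          # log in to the node running the job (find it with qstat)
ls -d /tmp/12345username /src/12345username      # the job directory sits under /tmp or /src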
Tips and Tricks
- To send a job to a particular host:
qsub -l hostname=n17 job_script.sh
- To submit a job for a limited period of time (example: 1 hour):
qsub -l s_cpu=01:00:00 job_script.sh
- To submit a job to nodes with 2 GB of memory:
qsub -l mem_total=2000M job_script.sh
- To submit a job at a specific time:
qsub -a [[CC]YY]MMDDhhmm[.SS] job_script.sh
(example: -a 200412301532.34 == 2004 Dec 30, 15:32:34)
- There is also a useful graphical utility, qmon, that allows you to submit new jobs. Qmon (Queue MONitor) also provides comprehensive information about the cluster.
- For complete information about SGE, please consult the User's and Administrator's guides, as well as the linked presentation and online resources.
- To access a compute node in the CoRE cluster, you can use the command: ssh nX (e.g. ssh n17).
Backup:
- Backup of home and work directories is done once a day.
Storage Policy:
- Home directory usage should not exceed 50 GB.
- Work directory usage should not exceed 2.0 TB.
Cluster Usage Policy:
There are four groups of users allowed to run jobs on the cluster. Users are assigned to groups based on their supervisor, and jobs should be submitted to the queues corresponding to that supervisor.
- All new cluster users must answer at least three questions about the cluster usage policy before being granted access.
- Before leaving Rutgers, users should inform the cluster administrator of their departure date. If they fail to do so, access will be blocked immediately once the administrator learns of the departure through other channels.
- After the departure date, users have three months to archive their data and move it from the active work area to the storage directory.
- If a user’s account is not confirmed to have actively running projects by their supervisor, it will be closed six months after the departure date.
Notes:
- If a user exceeds their quota, they will receive three automatic reminders to reduce disk usage. After the third reminder, the account will be suspended.
- Please ensure that all data in the storage directory is archived using tar and gzip. The storage directory should contain only files with .tar.gz, .tgz, or .tar extensions.
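For example (directory and archive names are hypothetical):
tar -czf my_project.tar.gz my_project/        # create a gzip-compressed archive
mv my_project.tar.gz ~/storage/               # move it to your storage directory (path is an assumption)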