Metropolis HPC FAQ | qcn.physics.uoc.gr

Registration and Access

How do I become a Metropolis user?
The Metropolis HPC facility is available primarily to CCQCN members and to researchers from the University of Crete. You may register at the registration page.

How much disk space do I have?
Default quota is 20 GB per user under /home and 100 GB under /scratch.

How do I log on to the Metropolis system?
Use any recent SSH client available to you. Connection methods other than Secure Shell are not supported. The node you should connect to is ccqcn-l.hpc.physics.uoc.gr.

How do I change my password?
Use the yppasswd command. Please note that the new password must be at least as strong as the randomly generated password you were originally issued, i.e. at least 10 characters, with mixed-case letters and some non-alphanumeric characters.
Example: at the command prompt, type "yppasswd" and press Enter. The system first asks for your current password and then for the new one:
ccqcn-l %> yppasswd
Changing NIS account information for user154 on ccqcn-l.hpc.physics.uoc.gr.
Please enter old password:
Changing NIS password for user154 on ccqcn-l.hpc.physics.uoc.gr.
Please enter new password:
Please retype new password:

The NIS password has been changed on ccqcn-l.hpc.physics.uoc.gr.

How do I transfer files to/from the Metropolis systems?
As with logging on, use any Secure FTP or Secure Copy enabled client, such as sftp, scp, the PuTTY tools (psftp, pscp) or FileZilla.
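As a minimal sketch, the two helper functions below wrap scp for copying to and from the login node. The function names push and pull and the METROPOLIS_USER variable are illustrative assumptions, not part of the cluster setup; user154 is simply the example account used elsewhere in this FAQ and should be replaced with your own username.

```shell
# Hypothetical helpers; only the host name comes from this FAQ.
METROPOLIS_HOST=ccqcn-l.hpc.physics.uoc.gr
METROPOLIS_USER=${METROPOLIS_USER:-user154}   # replace with your own username

# push LOCAL REMOTE: copy a local file or directory to the cluster
push() {
    scp -r "$1" "$METROPOLIS_USER@$METROPOLIS_HOST:$2"
}

# pull REMOTE LOCAL: copy a file or directory from the cluster
pull() {
    scp -r "$METROPOLIS_USER@$METROPOLIS_HOST:$1" "$2"
}

# Example calls (run from your own machine):
#   push input.tar.gz .          # into your home directory on the cluster
#   pull results ./results       # fetch a directory named results
```

Relative remote paths are resolved against your home directory on the cluster.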

How do I access the Metropolis system from outside the University network or a wireless network?
The Metropolis cluster accepts Secure Shell connections from anywhere, but the preferred method is to first set up a VPN connection and then use Secure Shell to connect to ccqcn-l.hpc.physics.uoc.gr.

Software and Applications

What compilers, software/applications are available?
Please consult the software page.

How do I submit a batch job?
In brief, to run an application, you have to:

load the appropriate module,
prepare a job script for the job scheduler and
submit the job to the batch processing queue.

The Metropolis cluster uses the SLURM Resource Manager. You can find more in the SLURM documentation.
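The three steps above can be sketched as follows. The module name, the program name and the resource values are placeholders chosen for illustration; consult the software page and the SLURM documentation for the values that apply to your application.

```shell
# Write a minimal batch script (all names and limits below are placeholders).
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example        # name shown by squeue
#SBATCH --ntasks=1                # a single task
#SBATCH --time=00:10:00           # wall-clock limit (hh:mm:ss)
#SBATCH --output=example-%j.out   # %j expands to the job ID

module load my_module             # hypothetical module name
srun ./my_program                 # hypothetical executable
EOF

# Submit it to the queue (only where SLURM is installed, i.e. on the cluster).
if command -v sbatch >/dev/null 2>&1; then
    sbatch job.sh
fi
```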

How do I submit a parallel job?
There are several ways to do that. The easiest and safest is to create a batch script, load the appropriate environment module inside the script and call the srun command with the appropriate flags inside the script. More in the SLURM documentation.
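A batch script for a parallel run might look like the sketch below. The node count, the tasks per node, the module name and the program name are all assumptions made for the example; they should be matched to the real hardware and to the MPI library your program was built with.

```shell
# Write a parallel batch script (all values below are illustrative).
cat > parallel_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=mpi-example
#SBATCH --nodes=2                   # assumed node count
#SBATCH --ntasks-per-node=8         # assumed; match it to the cores of a node
#SBATCH --time=01:00:00
#SBATCH --output=mpi-example-%j.out

module load my_mpi_module           # hypothetical module name
srun ./my_mpi_program               # srun starts one copy per task (16 here)
EOF

# Submit on the login node with:
#   sbatch parallel_job.sh
```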

What is hyper-threading and how does it affect my applications?
Hyper-Threading Technology (HT) lets each processor core run multiple threads at once, which uses processor resources more efficiently. This can boost the performance of scientific applications, but the gain is not guaranteed for every application.
Each modern physical processor has one or more physical processing cores. With Hyper-Threading, each core can run multiple threads at the same time. Recent E5 26xx processors from Intel support up to 2 threads per core. A number of applications benefit from this technology, while others have been observed to run slower.
In any case, users should experiment and choose whether to take advantage of HT or not.
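To see whether hyper-threading is active on the node you are logged in to, you can inspect the output of lscpu. The helper name threads_per_core is an assumption for this sketch, and it expects lscpu to print its labels in English.

```shell
# Print the "Thread(s) per core" value from lscpu-style output on stdin.
threads_per_core() {
    awk -F: '/^Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# A value of 2 means two hardware threads per core (HT on); 1 means HT is off.
if command -v lscpu >/dev/null 2>&1; then
    lscpu | threads_per_core
fi
```

For the experiment itself, SLURM's sbatch and srun accept a --hint=nomultithread option that asks for one task per physical core; whether it has any effect depends on how the cluster is configured.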

How do I visualize my data?
Until the visualization nodes are available, you should avoid processing your data on the login node. Please transfer your data to your own desktop/laptop and use the appropriate visualization software there.

Submitting and controlling Jobs

How do I submit, check the status, and/or delete a batch job?

Job submission: sbatch (or qsub) for batch scripts, or srun for single executables
Check job status: squeue
Delete job: scancel (or qdel)
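A typical submit, check and cancel cycle can be sketched as below. The helper name submit_and_show and the script name job.sh are hypothetical; sbatch --parsable, squeue and scancel are standard SLURM commands.

```shell
# submit_and_show SCRIPT: submit a batch script, list the job, print its ID.
submit_and_show() {
    jobid=$(sbatch --parsable "$1") || return 1
    jobid=${jobid%%;*}            # --parsable may append ";cluster"; keep the ID only
    squeue -j "$jobid" >&2        # show the job's queue entry on stderr
    echo "$jobid"
}

# Typical session on the login node:
#   id=$(submit_and_show job.sh)   # job.sh is a placeholder script name
#   squeue -u "$USER"              # list all of your jobs
#   scancel "$id"                  # cancel the job
```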

Why do I get "Job exceeds queue and/or server resource limits"?
When you request resources via a batch script or the srun command and these resources are unavailable, the scheduler will decline the job you submitted. You may retry submitting the job later; make sure that you do not ask for resources that exceed the physical resources of the processing nodes.

How do I specify when my job should enter the queue?

After submitting a job, I get "Out of memory: Kill process XXXXX (YYYYYY) score XX or sacrifice child" and job stops. Why?
The kernel on a processing node (rather than the job scheduler itself) terminates a job with this message when a memory allocation fails, that is, when there is no more memory available to allocate to your job. This often happens when running a parallel job with many individual tasks. Each task consumes memory, and usually the memory consumption increases as the task executes. If one or more tasks request more memory than is available on a particular node, the task will be terminated and the parallel job will probably fail. So, before running a job (especially a parallel one), please estimate the amount of memory it needs and limit the number of tasks on each node to match the physical resources available during execution.
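The estimate can be as simple as dividing the memory of a node by the peak memory of one task. The sketch below uses assumed example figures (64000 MB per node, 3000 MB per task); the real node memory is not stated in this FAQ, and the function name max_tasks is hypothetical.

```shell
# max_tasks NODE_MEM_MB TASK_MEM_MB: how many tasks fit into a node's memory
max_tasks() {
    echo $(( $1 / $2 ))           # integer division rounds down
}

# Assumed example values: 64000 MB per node, 3000 MB peak per task.
max_tasks 64000 3000              # prints 21
```

The result is an upper bound for the --ntasks-per-node value of a batch script; it is prudent to stay somewhat below it, since the operating system also needs memory. The command sinfo -N -o "%N %m %c" lists the memory (in MB) and the CPU count that SLURM reports for each node.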

How can I get an email notification when a job begins/finishes?
You can use the #SBATCH --mail-user=<email_address> directive inside a batch script, or the --mail-user=<email_address> command line argument of the srun command. The --mail-type=<type> option, where <type> may be BEGIN, END, FAIL, REQUEUE or ALL, can also be useful.
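For illustration, a batch script that combines both options might look like this. The address, the job name and the time limit are placeholders, and the job body is a stand-in for your own commands.

```shell
#!/bin/bash
#SBATCH --job-name=mail-example
#SBATCH --time=00:05:00
#SBATCH --mail-user=user@example.org   # placeholder address
#SBATCH --mail-type=END,FAIL           # mail when the job ends or fails

# Stand-in job body; replace it with your own commands.
msg="Job running on $(uname -n)"
echo "$msg"
```

Several mail types can be combined in one comma-separated list, as shown.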

How can I check the availability of free compute nodes?
You can get an overview of the status and partitions (queues) of the nodes with the sinfo command.

Troubleshooting

I cannot log in to Metropolis

There might be several reasons for that. Some of the most common are:

You use the wrong combination of username and password.
You have tried several times to log in with the wrong username/password. Your machine is then automatically locked out and you should contact technical support.
The host key for ccqcn-l.hpc.physics.uoc.gr has changed. In this case ssh will fail to connect with the following warning: "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!". You should replace the outdated ccqcn-l.hpc.physics.uoc.gr key stored on your machine in $HOME/.ssh/known_hosts (on Linux, Mac OS X and UNIX machines), or in the settings of the SSH client you use.
Your account has been locked. Please contact technical support.
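For the changed host key case, OpenSSH users can remove the outdated entry with ssh-keygen -R, as in the sketch below. It assumes the default known_hosts location and that you have confirmed with technical support that the key change is legitimate.

```shell
host=ccqcn-l.hpc.physics.uoc.gr

# Remove the stale key for the login node from your known_hosts file.
# Do this only after confirming that the host key really changed.
if command -v ssh-keygen >/dev/null 2>&1 && [ -f "$HOME/.ssh/known_hosts" ]; then
    ssh-keygen -R "$host"
fi
```

The next ssh connection will then ask you to accept the new key.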

Disk quota exceeded
Each user has a limited amount of disk storage. A user who exceeds this limit will not be able to store new data or update existing files until some space is freed.
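To find out what is taking up the space, standard tools such as du are enough. The helper name largest_dirs below is hypothetical, and the sketch only looks at non-hidden entries.

```shell
# largest_dirs DIR [N]: list the N largest entries directly under DIR (sizes in KB).
# Hidden files are skipped because the * glob does not match them.
largest_dirs() {
    du -sk "$1"/* 2>/dev/null | sort -n | tail -n "${2:-10}"
}

# Typical use on the login node:
#   du -sh "$HOME"             # total usage of your home directory
#   largest_dirs "$HOME" 5     # its five largest entries, biggest last
```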

X11 applications fail to start
For X11 to work over SSH, X11Forwarding must be enabled on both the server and the client side. The login node already has X11Forwarding enabled, but sometimes the client initiates an SSH connection without this option. In this case, applications that require X11 will fail to load. To solve this problem, call the ssh client with the -X or the -Y option. Moreover, on Windows machines (only), you should install an X server, such as Xming.
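A quick way to check the result is to look at the DISPLAY variable after logging in. The helper name x11_status is an assumption for this sketch; it only inspects DISPLAY and does not test the X server itself.

```shell
# Connect from your own machine with X11 forwarding (replace <username>):
#   ssh -X <username>@ccqcn-l.hpc.physics.uoc.gr

# x11_status: run on the login node to see whether forwarding looks active.
x11_status() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "X11 forwarding looks active: DISPLAY=$DISPLAY"
    else
        echo "DISPLAY is not set; reconnect with ssh -X or ssh -Y"
    fi
}
x11_status
```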

Jobs do not start
There are several reasons for that, two of the most common are:

Lack of available resources, such as insufficient free memory, processor cores or nodes for exclusive use: you should wait until enough resources are free, or resubmit your job with lower requirements
You have exceeded your CPU quota: you must wait until your previously submitted jobs terminate

Jobs fail, or terminate with errors
Several kinds of errors can occur when executing your job, such as memory-related problems or a wrong active environment. When you receive an error like "Job XXXXX exceeded YYYY KB memory limit, being killed", it means that your job with id XXXXX tried to use more memory than you requested and the job scheduler killed it, or that the node could not provide more memory to your job during its execution. This usually happens on very busy nodes. Unless you are absolutely certain that your job needs that much memory, you should adjust its memory limits.
Errors related to a wrong active environment mean that you have not loaded (or not requested at all!) the correct module environment for your application. For example, if you have a program compiled with OpenMPI and you do not load the correct OpenMPI environment, your program will most likely terminate with errors or with a core dump. Please make sure that you run your programs in the appropriate environment.
In any case, it is always a good idea to carefully inspect the error log file of your job and identify possible causes of the failure.

Cannot find compilers or applications
Please double-check that you have loaded the correct environment module.

Performance and optimization issues

To achieve maximum performance, you should at least compile your code with compiler options that take advantage of the CPU architecture of Metropolis. All available CPUs are of the same architecture, and the minimum (and safe!) compiler flags we suggest are:

GNU C, C++ or Fortran compiler: -march=core-avx-i -O2 -pipe
Intel compilers: -xCORE-AVX-I -O2
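As an illustration, the GNU flags above could be applied as in the sketch below. The file name my_program.c is a placeholder, and the compile step is skipped when gcc or the source file is missing.

```shell
# Flags suggested in this FAQ for the GNU compilers.
CFLAGS="-march=core-avx-i -O2 -pipe"

# my_program.c is a placeholder source file; compile only if it exists.
# $CFLAGS is left unquoted on purpose so that it splits into separate flags.
if command -v gcc >/dev/null 2>&1 && [ -f my_program.c ]; then
    gcc $CFLAGS -o my_program my_program.c
fi
```

A binary built with these flags may not run on older CPUs that lack the corresponding instructions, so it is best compiled and run on the cluster itself.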

Another performance issue can be extensive I/O. Generating a large number of files during execution can degrade performance. Unless you are absolutely sure of what you are doing, it is generally a good idea to limit the number of files you generate at execution time. Another reason for slow I/O could be that you are using your home directory as scratch space for generating large amounts of data instead of using /scratch. The /scratch directory is built on top of an advanced parallel filesystem and offers better performance.

Optimizing code is an iterative process. Please follow these steps to efficiently optimize your code:

Evaluate your code
Identify the areas where optimization techniques can be used
Apply the techniques
Evaluate performance
IF NOT sufficiently optimized → GO TO step 1

There are several ways to evaluate the performance of your code. The most efficient is to use a profiler such as gprof. To use a profiler, you must first compile your program with the switches that turn profiling on. When the instrumented program runs, it writes an additional output file containing profiling data, and the profiler then uses these data to generate a report on how the program ran.
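With the GNU toolchain, the workflow might look like the sketch below. The small C program is a hypothetical stand-in for your own code; -pg is the GCC switch that enables gprof instrumentation, and -fno-inline is added here only so that the example function stays visible in the report.

```shell
# Small stand-in program with one hot function (hypothetical example code).
cat > prof_demo.c <<'EOF'
#include <stdio.h>

/* Sum of i*i for i below n: the function the profile should highlight. */
double work(long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) s += (double)i * (double)i;
    return s;
}

int main(void) {
    double total = 0.0;
    for (int k = 0; k < 50; k++) total += work(2000000L);
    printf("%g\n", total);
    return 0;
}
EOF

if command -v gcc >/dev/null 2>&1 && command -v gprof >/dev/null 2>&1; then
    # Compile with -pg; running the program then writes gmon.out,
    # which gprof turns into a text report.
    gcc -pg -O2 -fno-inline -o prof_demo prof_demo.c &&
    ./prof_demo &&
    gprof ./prof_demo gmon.out > profile.txt ||
    echo "profiling step failed" >&2
fi
```

The flat profile at the top of profile.txt lists the time spent in each function and the number of calls.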

Two important things to pay attention to are how much time your program spent in each subroutine and how many times each subroutine was called. This information will give you a better understanding of how your code actually executes and will probably give you some hints on how to optimize it. For more information, please consult the manual page of the profiler you use.