Registration and Access
- How do I become a Metropolis user?
- How much disk space do I have?
- How do I log on to the Metropolis system?
- How do I change my password?
- How do I transfer files to/from the Metropolis systems?
- How do I access the Metropolis system from outside the University network or a wireless network?
Software and Applications
- What compilers, software/applications are available?
- How do I submit a batch job?
- How do I submit a parallel job?
- What is hyper-threading and how does it affect my applications?
- How do I visualize my data?
Submitting and controlling Jobs
- How do I submit, check the status, and/or delete a batch job?
- Why do I get "Job exceeds queue and/or server resource limits"?
- How do I specify when my job should enter the queue?
- After submitting a job, I get "Out of memory: Kill process XXXXX (YYYYYY) score XX or sacrifice child" and job stops. Why?
- How can I get an email notification when a job begins/finishes?
- How can I check the availability of free compute nodes?
Troubleshooting
- I cannot login to Metropolis
- Disk quota exceeded
- X11 applications fail to start
- Jobs do not start
- Jobs fail, or terminate with errors
- Cannot find compilers or applications
- Performance and optimization issues
Registration and Access
How do I become a Metropolis user?
The Metropolis HPC facility is primarily available to all CCQCN members and to researchers from the University of Crete. You may register at the registration page.
How much disk space do I have?
Default quota is 20 GB per user under /home and 100 GB under /scratch.
How do I log on to the Metropolis system?
Use any recent SSH client available to you. Connection methods other than Secure Shell are not supported. The node you should connect to is ccqcn-l.hpc.physics.uoc.gr.
How do I change my password?
Use the yppasswd command. Please note that the new password must be at least as strong as the randomly generated password you were originally issued, i.e. at least 10 characters, with mixed-case letters and some non-alphanumeric characters.
Example: at the command prompt, type "yppasswd" and press Enter. The system first asks you for your current password and then for the new one:
ccqcn-l %> yppasswd
Changing NIS account information for user154 on ccqcn-l.hpc.physics.uoc.gr.
Please enter old password:
Changing NIS password for user154 on ccqcn-l.hpc.physics.uoc.gr.
Please enter new password:
Please retype new password:
The NIS password has been changed on ccqcn-l.hpc.physics.uoc.gr.
How do I transfer files to/from the Metropolis systems?
As with logging on, you should use any Secure FTP or Secure Copy enabled client, such as sftp, scp, the PuTTY tools (psftp, pscp) or FileZilla.
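As a rough sketch of command-line transfers with scp, the two helper functions below wrap an upload and a download. The account name user154 and the file names are placeholders, and the helper names upload and download are made up for this example.

```shell
# Placeholder account; replace user154 with your own username.
HPC_USER="user154"
HPC_HOST="ccqcn-l.hpc.physics.uoc.gr"

# Copy a local file into an existing directory on the cluster.
# A relative remote path is resolved against your home directory there.
upload() { scp "$1" "${HPC_USER}@${HPC_HOST}:$2"; }

# Copy a file from the cluster into the current local directory.
download() { scp "${HPC_USER}@${HPC_HOST}:$1" .; }

# Usage, from a terminal on your own machine:
#   upload input.dat project/
#   download project/output.dat
```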
How do I access the Metropolis system from outside the University network or a wireless network?
The Metropolis cluster accepts Secure Shell connections from anywhere, but the preferred method is to first set up a VPN connection and then use Secure Shell to connect to ccqcn-l.hpc.physics.uoc.gr.
Software and Applications
What compilers, software/applications are available?
Please consult the software page.
How do I submit a batch job?
In brief, to run an application, you have to:
- load the appropriate module,
- prepare a job script for the job scheduler and
- submit the job to the batch processing queue.
The Metropolis cluster uses the SLURM Resource Manager. You can find more in the SLURM documentation.
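The three steps can be sketched as follows. This is only a minimal example: the module name gcc, the program ./my_program and all resource values are assumptions for illustration, not Metropolis defaults.

```shell
# Write a minimal serial job script to the current directory.
cat > serial_job.sh <<'EOF'
#!/bin/bash
# Name shown in the queue
#SBATCH --job-name=serial_test
# A single task
#SBATCH --ntasks=1
# Wall-clock limit (HH:MM:SS)
#SBATCH --time=00:10:00
# Output file; %j expands to the job ID
#SBATCH --output=serial_%j.out

# Hypothetical module name; pick one from the software page.
module load gcc
./my_program
EOF

# Submit it from the login node with:
#   sbatch serial_job.sh
```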
How do I submit a parallel job?
There are several ways to achieve that. The easiest and safest way is to create a batch script, load the appropriate environment module inside the script, and call the srun command with the appropriate flags inside the script. See the SLURM documentation for more.
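A possible batch script for an MPI program is sketched below. The module name openmpi, the program ./my_mpi_program and the node and task counts are placeholders chosen for the example.

```shell
# Write a minimal parallel job script to the current directory.
cat > mpi_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=mpi_test
# Two nodes with eight tasks each: 16 tasks in total.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:30:00
#SBATCH --output=mpi_%j.out

# Hypothetical module name; load the one your program was built with.
module load openmpi
# srun starts one copy of the program per requested task.
srun ./my_mpi_program
EOF

# Submit it from the login node with:
#   sbatch mpi_job.sh
```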
What is hyper-threading and how does it affect my applications?
Hyper-Threading Technology (HT) lets each processor core run multiple threads at once, which can use the core's resources more efficiently. This can boost the performance of scientific applications, but not of every application.
Every modern physical processor contains one or more physical cores. With Hyper-Threading, each core can run multiple threads at the same time. Recent Intel E5 26xx processors support up to 2 threads per core. A number of applications benefit from this technology, while others have been observed to run slower.
In any case, users should experiment and decide whether or not to take advantage of HT.
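One way to run such an experiment under SLURM is the --hint option of sbatch, assuming the cluster's configuration honours it. The helper name submit_ht_pair and the script name job.sh are made up for this sketch.

```shell
# Submit the same job script twice: once restricted to one thread per
# physical core, once allowed to use the extra hardware threads.
submit_ht_pair() {
    sbatch --hint=nomultithread --job-name=ht_off "$1"
    sbatch --hint=multithread   --job-name=ht_on  "$1"
}

# Usage on the login node:
#   submit_ht_pair job.sh
# When both jobs have finished, compare their elapsed times
# (this assumes job accounting is enabled):
#   sacct --name=ht_off,ht_on --format=JobName,Elapsed
```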
How do I visualize my data?
Until the visualization nodes are available, you should avoid processing your data on the logon node. Please transfer your data to your own desktop/laptop and use the appropriate visualization software there.
Submitting and controlling Jobs
How do I submit, check the status, and/or delete a batch job?
- Job submission: sbatch (or qsub) for batch scripts, or srun for single executables
- Check job status: squeue
- Delete job: scancel (or qdel)
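The commands above can be combined as in the following sketch. The helper names and the script name myjob.sh are invented for the example; scancel is the native SLURM command for removing a job.

```shell
# Submit a script and print the ID of the new job.
# --parsable makes sbatch print the job ID instead of a full sentence.
submit_job() { sbatch --parsable "$1"; }

# List your own queued and running jobs.
my_jobs() { squeue -u "$USER"; }

# Cancel a job by its ID.
cancel_job() { scancel "$1"; }

# Usage on the login node:
#   jobid=$(submit_job myjob.sh)
#   my_jobs
#   cancel_job "$jobid"
```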
Why do I get "Job exceeds queue and/or server resource limits"?
If the resources you request through a batch script or the srun command are unavailable, the scheduler declines the job you submitted. Make sure that you do not ask for resources that exceed the physical resources of the processing nodes, or try submitting the job again later.
How do I specify when my job should enter the queue?
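One standard SLURM mechanism for this is the --begin option of sbatch, which defers a job until the given time. The sketch below assumes this option is not restricted on Metropolis; the helper name submit_at and the script name myjob.sh are placeholders.

```shell
# Submit a job script that becomes eligible to run at the given time.
submit_at() { sbatch --begin="$1" "$2"; }

# Usage on the login node:
#   submit_at 22:00 myjob.sh                # at 22:00
#   submit_at now+1hour myjob.sh            # one hour from now
#   submit_at 2030-01-31T08:00:00 myjob.sh  # at an exact date and time
```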
After submitting a job, I get "Out of memory: Kill process XXXXX (YYYYYY) score XX or sacrifice child" and job stops. Why?
This message comes from the kernel on a processing node rather than from the job scheduler itself: when a node has no more memory to allocate to your job, the kernel terminates a process to recover some. This often happens when running a parallel job with many individual tasks. Each task consumes memory, and its consumption usually grows as the task executes. If one or more tasks request more memory than is available on a particular node, the task is terminated and the parallel job will probably fail. So, before running a job (especially a parallel one), please estimate the amount of memory it needs and limit the number of tasks on each node to match the physical resources available during execution.
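The estimate can be as simple as the arithmetic below. Both memory figures are assumptions made up for illustration; check the real node memory and measure your own program before relying on them.

```shell
# Assumed figures, in megabytes.
NODE_MEM_MB=64000   # usable memory on one node (assumption)
TASK_MEM_MB=6000    # estimated peak memory of one task (assumption)

# Integer division gives the largest number of tasks that fits on a node.
TASKS_PER_NODE=$(( NODE_MEM_MB / TASK_MEM_MB ))

# Print matching sbatch options; --mem-per-cpu is given in megabytes.
echo "--ntasks-per-node=${TASKS_PER_NODE} --mem-per-cpu=${TASK_MEM_MB}"
```

With these numbers the result is 10 tasks per node, which together need 60000 MB and so stay below the assumed 64000 MB.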
How can I get an email notification when a job begins/finishes?
You can use the #SBATCH --mail-user=<email_address> directive inside a batch script, or the --mail-user=<email_address> command line argument for the srun command. The --mail-type=<type> option, where <type> may be BEGIN, END, FAIL, REQUEUE or ALL, can also be useful.
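A batch script using both directives might look like the sketch below. The address user@example.org and the program ./my_program are placeholders.

```shell
# Write a job script that sends mail when the job starts, ends or fails.
cat > mail_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=mail_test
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
# Several mail types can be combined with commas.
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=user@example.org

./my_program
EOF

# Submit it from the login node with:
#   sbatch mail_job.sh
```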
How can I check the availability of free compute nodes?
You can get an overview of the status of the nodes and of the partitions (queues) with the sinfo command.
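Two sinfo invocations that may be handy are sketched below. The helper names are invented for the example; the format codes are standard sinfo ones.

```shell
# One line per partition: name, availability, time limit, and node
# counts in the form allocated/idle/other/total.
partition_summary() { sinfo -o "%P %a %l %F"; }

# Only the nodes that are currently idle.
idle_nodes() { sinfo --states=idle; }

# Usage on the login node:
#   partition_summary
#   idle_nodes
```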
Troubleshooting
I cannot login to Metropolis
There might be several reasons for that. Some of the most common are:
- You use the wrong combination of username and password
- You have tried several times to log in with the wrong username/password. In that case your machine is locked out automatically and you should contact technical support.
- The host key for ccqcn-l.hpc.physics.uoc.gr has changed. In this case ssh fails to connect with the warning "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!". You should probably replace the ccqcn-l.hpc.physics.uoc.gr key stored on your machine in $HOME/.ssh/known_hosts (on Linux, Mac OS X and UNIX machines), or in the settings of the SSH client you use.
- Your account has been locked. Please contact technical support.
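For the changed host key case, OpenSSH clients can remove the stale entry with ssh-keygen -R, as in this sketch. The helper name forget_host_key is made up here. It is prudent to confirm with technical support that the key really changed before accepting the new one.

```shell
HPC_HOST="ccqcn-l.hpc.physics.uoc.gr"

# Remove every stored key for the login node from ~/.ssh/known_hosts.
# The next ssh connection will ask you to accept the current key.
forget_host_key() { ssh-keygen -R "$HPC_HOST"; }

# Usage, on your own machine:
#   forget_host_key
```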
Disk quota exceeded
Each user has a limited amount of disk storage. If you exceed this limit, you will not be able to store new data or update existing files.
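To find out what is using the space, a small helper like the one below can list the largest items in a directory. The function name largest_entries is invented for this sketch, and hidden files are not included in the listing.

```shell
# Print the ten largest entries directly under a directory,
# largest first, with sizes in kilobytes.
largest_entries() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n 10
}

# Usage on the login node:
#   largest_entries "$HOME"
```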
X11 applications fail to start
For X11 to work over SSH, X11 forwarding must be enabled on both the server and the client side. The login node already has X11Forwarding enabled, but sometimes the client initiates an SSH connection without this option. In that case, applications that require X11 will fail to load. To solve this problem, call the ssh client with the -X or the -Y option. Moreover, on Windows machines (only), you should install an X server, such as Xming.
Jobs do not start
There are several reasons for that. Two of the most common are:
- Lack of available resources, such as insufficient free memory, processor cores or nodes for exclusive use: you should wait until enough resources are free, or resubmit your job with lower requirements
- You have exceeded your CPU quota: you must wait until your previously submitted jobs terminate
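To see which of these applies, squeue can report the reason the scheduler gives for each waiting job. The helper names below are invented for the sketch; the format codes are standard squeue ones.

```shell
# For each of your pending jobs print: job ID, name, state, reason.
pending_reasons() { squeue -u "$USER" -t PENDING -o "%i %j %T %r"; }

# Ask the scheduler for its estimated start times of your pending jobs.
start_estimates() { squeue -u "$USER" --start; }

# Usage on the login node:
#   pending_reasons
#   start_estimates
```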
Jobs fail, or terminate with errors
A job can fail for several reasons, such as memory related problems or a wrong active environment.
An error like "Job XXXXX exceeded YYYY KB memory limit, being killed" means that your job with ID XXXXX tried to use more memory than you requested and the job scheduler killed it, or that the node could not provide more memory to your job during its execution. This usually happens on very busy nodes. Unless you are absolutely certain that you need that much memory, you should adjust the memory limits of your job.
Errors related to a wrong active environment mean that you have not loaded (or not requested at all!) the correct environment module for your application. For example, if you have a program compiled with OpenMPI and you do not load the correct OpenMPI environment, your program will most likely terminate with errors or with a core dump. Please make sure that you run your programs in the appropriate environment.
In any case, it is always a good idea to carefully inspect the error log file of your job and identify possible causes for the failure.
Cannot find compilers or applications
Please double check that you have loaded the correct environment module.
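A quick way to check is sketched below. The helper name find_module and the search term openmpi are only examples; the module commands themselves are the standard ones.

```shell
# Search the list of available modules for a name, ignoring case.
# Classic Environment Modules prints the list on stderr, hence 2>&1.
find_module() { module avail 2>&1 | grep -i "$1"; }

# Usage on the login node:
#   find_module openmpi    # look for a module
#   module load <name>     # load the one you found
#   module list            # confirm what is currently loaded
```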
Performance and optimization issues
To achieve maximum performance, you should at least compile your code with optimization flags that take advantage of the CPU architecture of Metropolis. All available CPUs are of the same architecture, and the minimum (and safe!) compiler flags we suggest are:
- GNU C, C++ or Fortran compiler: -march=core-avx-i -O2 -pipe
- Intel compilers: -xCORE-AVX-I -O2
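As a sketch of how these flags might be used, the helpers below compile a single C source file. The helper names, the file name my_code.c and the compiler commands gcc and icc are assumptions; use whichever compiler commands your loaded module provides.

```shell
# The flags suggested above.
GNU_FLAGS="-march=core-avx-i -O2 -pipe"
INTEL_FLAGS="-xCORE-AVX-I -O2"

# Compile one C file into an executable named after it.
# The flag variables are left unquoted on purpose so that the shell
# splits them into separate options.
build_gnu()   { gcc $GNU_FLAGS   -o "${1%.c}" "$1"; }
build_intel() { icc $INTEL_FLAGS -o "${1%.c}" "$1"; }

# Usage, after loading a compiler module:
#   build_gnu my_code.c
```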
Beyond compiler flags, optimizing the code itself is an iterative process:
1. Evaluate your code
2. Identify the areas where optimization techniques can be used
3. Apply the techniques
4. Evaluate performance
5. If not sufficiently optimized, go back to step 1
When profiling, pay particular attention to how much time your program spends in each subroutine and how many times each subroutine is called. This information gives you a better understanding of how your code actually executes and will probably suggest where to optimize it. For more information, please consult the manual page of the profiler you use.
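As one possible illustration, the sketch below uses the GNU profiler gprof, which is an assumption: the text above does not name a profiler. The helper name profile_run and the file name my_code.c are placeholders.

```shell
# Build a C file with profiling instrumentation, run it once, and
# print the start of the gprof report (time and call count per function).
# Pass a file name in the current directory, without a path.
profile_run() {
    gcc -pg -O2 -o "${1%.c}.prof" "$1" || return 1
    # Running the program writes gmon.out into the current directory.
    "./${1%.c}.prof"
    gprof "./${1%.c}.prof" gmon.out | head -n 20
}

# Usage, preferably inside a batch job rather than on the logon node:
#   profile_run my_code.c
```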