On 7/28/22 14:56, Rob Sargent wrote:
On Jul 28, 2022, at 1:10 AM, Christian Meesters <meesters@uni-mainz.de> wrote:
Hi,
not quite. Under SLURM the job step starter (SLURM lingo) is "srun". You do not ssh from job host to job host; rather, you use "parallel" as a semaphore to avoid oversubscription of job steps launched with "srun". I summarized this approach here:
https://mogonwiki.zdv.uni-mainz.de/dokuwiki/start:working_on_mogon:workflow_organization:node_local_scheduling#running_on_several_hosts (uh-oh - I need to clean up that site, many outdated sections there, but this one should still be ok)
One advantage: you can safely utilize the resources of both (or more) hosts - the master host and all secondaries. How many resources you require depends on your application and the work it does. Be sure to consider I/O (e.g. stage in files to avoid random I/O from too many concurrent applications, etc.), if this is an issue for your application.
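For readers unfamiliar with the pattern, the approach described above can be sketched as a batch script like the following. This is a minimal illustration, not the script from the linked wiki page; the node/task counts, application name (my_app), and input file names are placeholders:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=4

# GNU parallel acts as the semaphore: at most $SLURM_NTASKS job steps
# run concurrently, so the allocation is never oversubscribed.
# srun (the SLURM job step starter) places each step on whichever of
# the allocated nodes has free resources - no ssh between hosts.
parallel --jobs "$SLURM_NTASKS" \
    srun --nodes=1 --ntasks=1 --cpus-per-task="$SLURM_CPUS_PER_TASK" --exclusive \
    ./my_app {} ::: input_*.dat
```

Note that on recent SLURM versions the per-step `--exclusive` flag has been superseded by `--exact` for this purpose; check the srun man page of your installation.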
Cheers
Christian
Christian,
My use of GNU parallel does not include ssh. Rather I simply fill the slurm node with --jobs=ncores
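A sketch of that single-node variant, for contrast (again illustrative only - the core count and my_app are placeholders, not taken from the thread):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32

# Fill the single allocated node: one task per core, no srun and no ssh.
# $SLURM_CPUS_PER_TASK plays the role of "ncores" here.
parallel --jobs "$SLURM_CPUS_PER_TASK" ./my_app {} ::: input_*.dat
```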
That would require an interactive job and
ncores_per_node/threads_per_application ssh connections, and you
would have to trigger the script manually. My solution is to use
parallel in a SLURM job context and avoid the synchronization
step by a human, whilst offering a potential multi-node job with
SMP applications. It's your choice, of course.
If I follow correctly, that is what I am doing. Here's my slurm job