Re: file permissions on joblog


From: Christian Meesters
Subject: Re: file permissions on joblog
Date: Thu, 28 Jul 2022 19:46:12 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0

This is not a SLURM job file, as it contains no '#SBATCH' directives. (Yes, they could also be given on the command line.)

It is also a bit peculiar that you think it is necessary to adjust permissions. This is usually done in so-called prolog scripts, which run prior to job start. If your cluster deviates from that, you should discuss it with your admins, as it makes your work cumbersome and error-prone. Also, it is not necessary to infer the number of CPUs on a node: the number of CPUs available to your particular job is exposed as environment variables (see the wiki link I gave). Please contact your administrators about these things.
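A minimal sketch of what that looks like (the variable names are standard SLURM, but which ones are actually set depends on how the job was requested; the counts here are placeholders):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

# SLURM exports the allocation to the job's environment,
# so there is no need to parse /proc/cpuinfo:
cores=${SLURM_CPUS_PER_TASK:-${SLURM_CPUS_ON_NODE:-1}}
echo "running with $cores cores"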

As for the job log: SLURM gathers stdout/stderr as specified by the sbatch -o and -e directives. These should be directed to a shared filesystem; anything local to the job might not be accessible after the job has finished. Whether /scratch is a global filesystem or a local one cannot be determined from the context.
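For illustration only (the paths are placeholders, not taken from the job above), directing both streams to a shared filesystem looks like this:

# stdout and stderr on a shared filesystem; %x = job name, %j = job id
#SBATCH --output=/path/to/shared/fs/%x_%j.out
#SBATCH --error=/path/to/shared/fs/%x_%j.err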

All in all, you should contact your local helpdesk; a number of these things might be due to the application or the cluster settings, not to parallel.



On 7/28/22 17:44, Rob Sargent wrote:
On 7/28/22 09:28, Christian Meesters wrote:


On 7/28/22 14:56, Rob Sargent wrote:
On Jul 28, 2022, at 1:10 AM, Christian Meesters <meesters@uni-mainz.de> wrote:
Hi,

not quite. Under SLURM the job step starter (SLURM lingo) is "srun". You do not ssh from job host to job host, but rather use "parallel" as a semaphore to avoid oversubscription of job steps started with "srun". I summarized this approach here:

https://mogonwiki.zdv.uni-mainz.de/dokuwiki/start:working_on_mogon:workflow_organization:node_local_scheduling#running_on_several_hosts (uh-oh - I need to clean up that site, many outdated sections there, but this one should still be ok)

One advantage: you can safely utilize the resources of both (or more) hosts, the master host and all secondaries. How many resources you require depends on your application and the work it does. Be sure to consider I/O (e.g. stage in files to avoid random I/O from too many concurrent applications, etc.), if this is an issue for your application.
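A rough sketch of that pattern (the program name, node count and task count are placeholders, not part of the original mail): parallel throttles the number of concurrent srun job steps to the size of the allocation, and srun places each step on whichever allocated node has free slots.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=32

# One srun job step per task slot: parallel never starts more than
# $SLURM_NTASKS steps at a time, so the allocation is not oversubscribed.
parallel --jobs $SLURM_NTASKS \
    srun --exclusive -N1 -n1 ./my_program {} ::: {1..750}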

Cheers

Christian
Christian,
My use of GNU parallel does not include ssh. Rather, I simply fill the SLURM node with --jobs=ncores.

That would require having an interactive job and ncores_per_node/threads_per_application ssh connections, and you would have to trigger the script manually. My solution is to use parallel in a SLURM-job context and avoid the manual synchronization step, whilst offering a potential multi-node job with SMP applications. It's your choice, of course.


If I follow correctly, that is what I am doing. Here's my SLURM job:
#!/bin/bash
LOGDIR=/scratch/general/pe-nfs1/u0138544/logs
chmod a+x $LOGDIR/*
days=$1; shift
tid=$1; shift

if [[ "$tid"x == "x" ]]
then
    JOBDIR=`mktemp --directory --tmpdir=$LOGDIR XXXXXX`
    tid=$(basename $JOBDIR)
else
    JOBDIR=$LOGDIR/$tid
    mkdir -p $JOBDIR
fi
. /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/sgsCP.sh

chmod -R a+rwx $JOBDIR
rnow=$(date +%s)
rsec=$(( $days * 24 * 3600 ))
endtime=$(( $rnow+$rsec ))

cores=`grep -c processor /proc/cpuinfo`
cores=$(( $cores / 2 ))

trap "chmod -R a+rw $JOBDIR" SIGCONT SIGTERM

parallel \
    --joblog $JOBDIR/${tid}.ll \
    --verbose \
    --jobs $cores \
    --delay 1 \
    /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/chaser-10Mt 83a9a2ad-fe16-4872-b629-b9ba70ed5bbb $endtime $JOBDIR ::: {1..750}
chmod a+rw $JOBDIR/${tid}.ll

If the complete job finishes nicely, then I can read/write the job log. The trap is there in case the SLURM job exceeds its time limit. But while things are running, I cannot look at the '.ll' file.
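(A possible workaround, sketched here only on the assumption that the default umask is what makes the running joblog unreadable: relax the umask, or pre-create and chmod the '.ll' file, before parallel starts, so the file is readable while it is still being written.)

umask 022                        # files created from here on get mode 644 (world-readable)
touch $JOBDIR/${tid}.ll          # pre-create the joblog ...
chmod a+rw $JOBDIR/${tid}.ll     # ... and relax its permissions before parallel starts
parallel --joblog $JOBDIR/${tid}.ll ...   # same invocation as above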
rjs
