help-bash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: bash -x doesn't always show "exec"


From: John Hanks
Subject: Re: bash -x doesn't always show "exec"
Date: Wed, 24 Mar 2021 08:09:20 -0500

Also it finally occurred to me to do the "oh, duh" test and compare a
"normal" session with a Slurm job session on the same physical node. The
"normal" session on the node does process the pipeline left to right like
the "normal" session on a login node, so there is something different here
between sessions via ssh to the node and using slurm to run a job on the
node. Offhand the only thing I can think of is that sessions in Slurm get
adopted into the Slurm managed cgroups, but can't see why that would affect
the order these get executed or displayed in bash -x output.

griznog

On Wed, Mar 24, 2021 at 6:03 AM John Hanks <griznog@gmail.com> wrote:

> I needed to increase the size of my readahead. The exec DOES show up
> eventually, just out of order in the Slurm job session. Which is still
> quite confusing and results in bats breaking. I'm still guessing but maybe
> some kind of race condition? I can't see how that happens, the next thing
> in a pipeline can't proceed until there is something to read from the pipe
> or the pipe goes away, can it? Given the line
>
> exec bats-exec-suite "${flags[@]}" "${filenames[@]}" |
> bats_test_count_validator | "bats-format-${formatter}"
> "${formatter_flags[@]}"
>
> Why in one session do things execute in right to left order and the other
> session in left to right order?
>
> Login "normal" session:
> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:
>  - [3,0,0][[ -d
> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:
>  - [3,0,1]filenames+=("$filename")
> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:
>  - [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>  - [3,0,0]exec bats-exec-suite -x
> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats
> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>  - [3,0,0]bats_test_count_validator
> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>  - [3,0,0]bats-format-pretty
> + 10:52:12 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L4:
> bats_test_count_validator():header_pattern='[0-9]+\.\.[0-9]+'
> + 10:52:12 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L5:
> bats_test_count_validator():IFS=
> + 10:52:12 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L5:
> bats_test_count_validator():read -r header
> + 10:52:13 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L7:
> bats_test_count_validator():printf '%s\n' 1..43
> + 10:52:13 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L10:
> bats_test_count_validator()[[ 1..43 =~ [0-9]+\.\.[0-9]+ ]]
>
> Slurm job session (broken behavior):
> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:
>  - [3,0,0][[ -d
> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:
>  - [3,0,1]filenames+=("$filename")
> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:
>  - [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>  - [3,0,0]bats-format-pretty
> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>  - [3,0,0]bats_test_count_validator
> + 10:51:58 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L4:
> bats_test_count_validator():header_pattern='[0-9]+\.\.[0-9]+'
> + 10:51:58 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L5:
> bats_test_count_validator():IFS=
> + 10:51:58 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L5:
> bats_test_count_validator():read -r header
> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>  - [3,0,0]exec bats-exec-suite -x
> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats
> + 10:51:59 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L7:
> bats_test_count_validator():printf '%s\n' 1..43
> + 10:51:59 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L10:
> bats_test_count_validator()[[ 1..43 =~ [0-9]+\.\.[0-9]+ ]]
>
> On Wed, Mar 24, 2021 at 5:46 AM John Hanks <griznog@gmail.com> wrote:
>
>> I tried dumping additional info and attaching with gdb to see differences
>> at a lower level. Still no difference in the env and set output that might
>> explain this, but attaching with gdb changed the behavior somehow. The exec
>> now appears in the "bash -x" output in both, but in the Slurm session the
>> downstream tests still fail to run so something is still differing between
>> these.
>>
>> As an aside, these machines are as identical as possible. Servers have
>> identical hardware, both PXE boot from the same golden image and run the OS
>> from in memory. Both mount the same filesystems via NFS.
>>
>> To the script I added:
>>
>> set > ~/set.$(hostname -s)
>> env | sort > ~/env.$(hostname -s)
>> echo "gdb -p $$ /bin/bash"
>> read
>> exec bats-exec-suite "${flags[@]}" "${filenames[@]}" |
>> bats_test_count_validator | "bats-format-${formatter}"
>> "${formatter_flags[@]}"
>>
>> From within slurm job:
>>
>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:
>>  - [3,0,0][[ -d
>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:
>>  - [3,0,1]filenames+=("$filename")
>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:
>>  - [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L233:
>>  - [3,0,0]set -o pipefail execfail
>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:
>>  - [3,0,0]set
>> ++ 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:
>>  - [3,1,0]hostname -s
>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:
>>  - [3,0,0]sort
>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:
>>  - [3,0,0]env
>> ++ 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:
>>  - [3,1,0]hostname -s
>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L236:
>>  - [3,0,0]echo 'gdb -p 129624 /bin/bash'
>> gdb -p 129624 /bin/bash
>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L237:
>>  - [3,0,0]read
>>
>> + 10:24:01 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>  - [3,0,0]exec bats-exec-suite -x
>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats
>> + 10:24:01 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>  - [3,0,0]bats_test_count_validator
>>
>> And similar flow through the normal login session:
>>
>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:
>>  - [3,0,0][[ -d
>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:
>>  - [3,0,1]filenames+=("$filename")
>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:
>>  - [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L233:
>>  - [3,0,0]set -o pipefail execfail
>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:
>>  - [3,0,0]set
>> ++ 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:
>>  - [3,1,0]hostname -s
>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:
>>  - [3,0,0]env
>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:
>>  - [3,0,0]sort
>> ++ 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:
>>  - [3,1,0]hostname -s
>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L236:
>>  - [3,0,0]echo 'gdb -p 6035 /bin/bash'
>> gdb -p 6035 /bin/bash
>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L237:
>>  - [3,0,0]read
>>
>> + 10:24:47 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>  - [3,0,0]exec bats-exec-suite -x
>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats
>> + 10:24:47 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>  - [3,0,0]bats_test_count_validator
>>
>>
>> Stepping through with gdb in the Slurm session:
>>
>> (gdb) n
>> Single stepping until exit from function __read_nocancel,
>> which has no line number information.
>> 0x0000000000483f30 in zread ()
>> (gdb) n
>> Single stepping until exit from function zread,
>> which has no line number information.
>> 0x0000000000484001 in zreadc ()
>> (gdb) n
>> Single stepping until exit from function zreadc,
>> which has no line number information.
>> 0x0000000000475611 in read_builtin ()
>> (gdb)
>> Single stepping until exit from function read_builtin,
>> which has no line number information.
>> 0x000000000042f35f in execute_builtin.isra.2 ()
>> (gdb)
>> Single stepping until exit from function execute_builtin.isra.2,
>> which has no line number information.
>> 0x0000000000431449 in execute_simple_command ()
>> (gdb)
>> Single stepping until exit from function execute_simple_command,
>> which has no line number information.
>> 0x0000000000432353 in execute_command_internal ()
>> (gdb)
>> Single stepping until exit from function execute_command_internal,
>> which has no line number information.
>> 0x0000000000433d3e in execute_command ()
>> (gdb)
>> Single stepping until exit from function execute_command,
>> which has no line number information.
>> 0x000000000041e375 in reader_loop ()
>> (gdb)
>> Single stepping until exit from function reader_loop,
>> which has no line number information.
>> [Detaching after fork from child process 132894]
>> [Detaching after fork from child process 132895]
>> [Detaching after fork from child process 132896]
>> [Detaching after fork from child process 132956]
>> 0x000000000041e2bf in reader_loop ()
>>
>>
>> Compared to stepping through in the normal login session:
>>
>> Single stepping until exit from function __read_nocancel,
>> which has no line number information.
>> 0x0000000000483f30 in zread ()
>> (gdb)
>> Single stepping until exit from function zread,
>> which has no line number information.
>> 0x0000000000484001 in zreadc ()
>> (gdb)
>> Single stepping until exit from function zreadc,
>> which has no line number information.
>> 0x0000000000475611 in read_builtin ()
>> (gdb)
>> Single stepping until exit from function read_builtin,
>> which has no line number information.
>> 0x000000000042f35f in execute_builtin.isra.2 ()
>> (gdb)
>> Single stepping until exit from function execute_builtin.isra.2,
>> which has no line number information.
>> 0x0000000000431449 in execute_simple_command ()
>> (gdb)
>> Single stepping until exit from function execute_simple_command,
>> which has no line number information.
>> 0x0000000000432353 in execute_command_internal ()
>> (gdb)
>> Single stepping until exit from function execute_command_internal,
>> which has no line number information.
>> 0x0000000000433d3e in execute_command ()
>> (gdb)
>> Single stepping until exit from function execute_command,
>> which has no line number information.
>> 0x000000000041e375 in reader_loop ()
>> (gdb)
>> Single stepping until exit from function reader_loop,
>> which has no line number information.
>> [Detaching after fork from child process 6967]
>> [Detaching after fork from child process 6968]
>> [Detaching after fork from child process 6969]
>> 0x000000000041c9de in main ()
>>
>> On Tue, Mar 23, 2021 at 6:00 PM John Hanks <griznog@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm troubleshooting some bash scripts and have an issue where when ran
>>> from a normal ssh session the scripts work fine, but when ran from with a
>>> Slurm managed job they fail. I traced this down to a point in the code
>>> where they fork directions at runtime. The script has at the end:
>>>
>>> set -o pipefail execfail
>>> exec bats-exec-suite "${flags[@]}" "${filenames[@]}" |
>>> bats_test_count_validator | "bats-format-${formatter}"
>>> "${formatter_flags[@]}"
>>>
>>> When executed on a vanilla ssh session connection it works and does this:
>>>
>>> + 22:38:48 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:
>>>  - [3,0,0][[ -d
>>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
>>> + 22:38:48 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:
>>>  - [3,0,1]filenames+=("$filename")
>>> + 22:38:48 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:
>>>  - [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
>>> + 22:38:48 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L233:
>>>  - [3,0,0]set -o pipefail execfail
>>> + 22:38:48 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:
>>>  - [3,0,0]exec bats-exec-suite -x
>>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats
>>> + 22:38:48 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:
>>>  - [3,0,0]bats-format-pretty
>>> + 22:38:48 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:
>>>  - [3,0,0]bats_test_count_validator
>>>
>>> But within a slurm job (whether interactive or a batch job, think with a
>>> pty or without one, either way same failure):
>>>
>>> + 22:37:17 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:
>>>  - [3,0,0][[ -d
>>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
>>> + 22:37:17 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:
>>>  - [3,0,1]filenames+=("$filename")
>>> + 22:37:17 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:
>>>  - [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
>>> + 22:37:17 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L233:
>>>  - [3,0,0]set -o pipefail execfail
>>> + 22:37:17 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:
>>>  - [3,0,0]bats-format-pretty
>>> + 22:37:17 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:
>>>  - [3,0,0]bats_test_count_validator
>>>
>>> I've looked at diffs of "env" and "set" between these sessions, can't
>>> find any significant difference between them. All files referenced are
>>> available. shopt settings are identical on both. I can't find any reason
>>> they behave differently at this point, but they do.
>>>
>>> Any answers, advice or pointers to where I should be looking are
>>> appreciated. I'm really confused as to how "bash -x" output can differ for
>>> these lines of code. The Google queries I've come up with return a lot of
>>> hits, but the relevant terms are just too generic to narrow things down to
>>> a useful set.
>>>
>>> Best,
>>>
>>> griznog
>>>
>>


reply via email to

[Prev in Thread] Current Thread [Next in Thread]