help-bash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: bash -x doesn't always show "exec"


From: John Hanks
Subject: Re: bash -x doesn't always show "exec"
Date: Wed, 24 Mar 2021 15:44:03 -0500

I think I've more or less satisfied my curiosity here. My (still shaky)
understanding of this is that there is no guaranteed order of execution in
a pipeline, processes are kicked off concurrently and order is somewhat
non-deterministic. The differences I see between a normal login shell and a
Slurm managed job environment can probably be explained by Slurm
changing the way I/O is buffered, causing the change in order of appearance
in "bash -x" output. That may not be completely factually correct, but it's
good enough for me to believe it and move on to the next problem I don't
understand.

Thanks,

griznog

On Wed, Mar 24, 2021 at 8:09 AM John Hanks <griznog@gmail.com> wrote:

> Also it finally occurred to me to do the "oh, duh" test and compare a
> "normal" session with a Slurm job session on the same physical node. The
> "normal" session on the node does process the pipeline left to right like
> the "normal" session on a login node, so there is something different here
> between sessions via ssh to the node and using slurm to run a job on the
> node. Offhand the only thing I can think of is that sessions in Slurm get
> adopted into the Slurm managed cgroups, but can't see why that would affect
> the order these get executed or displayed in bash -x output.
>
> griznog
>
> On Wed, Mar 24, 2021 at 6:03 AM John Hanks <griznog@gmail.com> wrote:
>
>> I needed to increase the size of my readahead. The exec DOES show up
>> eventually, just out of order in the Slurm job session. Which is still
>> quite confusing and results in bats breaking. I'm still guessing but maybe
>> some kind of race condition? I can't see how that happens, the next thing
>> in a pipeline can't proceed until there is something to read from the pipe
>> or the pipe goes away, can it? Given the line
>>
>> exec bats-exec-suite "${flags[@]}" "${filenames[@]}" |
>> bats_test_count_validator | "bats-format-${formatter}"
>> "${formatter_flags[@]}"
>>
>> Why in one session do things execute in right to left order and the other
>> session in left to right order?
>>
>> Login "normal" session:
>> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:
>>  - [3,0,0][[ -d
>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
>> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:
>>  - [3,0,1]filenames+=("$filename")
>> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:
>>  - [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
>> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>  - [3,0,0]exec bats-exec-suite -x
>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats
>> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>  - [3,0,0]bats_test_count_validator
>> + 10:52:12 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>  - [3,0,0]bats-format-pretty
>> + 10:52:12 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L4:
>> bats_test_count_validator():header_pattern='[0-9]+\.\.[0-9]+'
>> + 10:52:12 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L5:
>> bats_test_count_validator():IFS=
>> + 10:52:12 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L5:
>> bats_test_count_validator():read -r header
>> + 10:52:13 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L7:
>> bats_test_count_validator():printf '%s\n' 1..43
>> + 10:52:13 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L10:
>> bats_test_count_validator()[[ 1..43 =~ [0-9]+\.\.[0-9]+ ]]
>>
>> Slurm job session (broken behavior):
>> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:
>>  - [3,0,0][[ -d
>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
>> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:
>>  - [3,0,1]filenames+=("$filename")
>> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:
>>  - [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
>> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>  - [3,0,0]bats-format-pretty
>> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>  - [3,0,0]bats_test_count_validator
>> + 10:51:58 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L4:
>> bats_test_count_validator():header_pattern='[0-9]+\.\.[0-9]+'
>> + 10:51:58 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L5:
>> bats_test_count_validator():IFS=
>> + 10:51:58 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L5:
>> bats_test_count_validator():read -r header
>> + 10:51:58 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>  - [3,0,0]exec bats-exec-suite -x
>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats
>> + 10:51:59 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L7:
>> bats_test_count_validator():printf '%s\n' 1..43
>> + 10:51:59 /home/prd_rdip_mfish/local/lib/bats-core/validator.bash:L10:
>> bats_test_count_validator()[[ 1..43 =~ [0-9]+\.\.[0-9]+ ]]
>>
>> On Wed, Mar 24, 2021 at 5:46 AM John Hanks <griznog@gmail.com> wrote:
>>
>>> I tried dumping additional info and attaching with gdb to see
>>> differences at a lower level. Still no difference in the env and set output
>>> that might explain this, but attaching with gdb changed the behavior
>>> somehow. The exec now appears in the "bash -x" output in both, but in the
>>> Slurm session the downstream tests still fail to run so something is still
>>> differing between these.
>>>
>>> As an aside, these machines are as identical as possible. Servers have
>>> identical hardware, both PXE boot from the same golden image and run the OS
>>> from in memory. Both mount the same filesystems via NFS.
>>>
>>> To the script I added:
>>>
>>> set > ~/set.$(hostname -s)
>>> env | sort > ~/env.$(hostname -s)
>>> echo "gdb -p $$ /bin/bash"
>>> read
>>> exec bats-exec-suite "${flags[@]}" "${filenames[@]}" |
>>> bats_test_count_validator | "bats-format-${formatter}"
>>> "${formatter_flags[@]}"
>>>
>>> From within slurm job:
>>>
>>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:
>>>  - [3,0,0][[ -d
>>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
>>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:
>>>  - [3,0,1]filenames+=("$filename")
>>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:
>>>  - [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
>>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L233:
>>>  - [3,0,0]set -o pipefail execfail
>>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:
>>>  - [3,0,0]set
>>> ++ 10:21:45
>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:  -
>>> [3,1,0]hostname -s
>>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:
>>>  - [3,0,0]sort
>>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:
>>>  - [3,0,0]env
>>> ++ 10:21:45
>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:  -
>>> [3,1,0]hostname -s
>>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L236:
>>>  - [3,0,0]echo 'gdb -p 129624 /bin/bash'
>>> gdb -p 129624 /bin/bash
>>> + 10:21:45 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L237:
>>>  - [3,0,0]read
>>>
>>> + 10:24:01 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>>  - [3,0,0]exec bats-exec-suite -x
>>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats
>>> + 10:24:01 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>>  - [3,0,0]bats_test_count_validator
>>>
>>> And similar flow through the normal login session:
>>>
>>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:
>>>  - [3,0,0][[ -d
>>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
>>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:
>>>  - [3,0,1]filenames+=("$filename")
>>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:
>>>  - [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
>>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L233:
>>>  - [3,0,0]set -o pipefail execfail
>>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:
>>>  - [3,0,0]set
>>> ++ 10:24:09
>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:  -
>>> [3,1,0]hostname -s
>>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:
>>>  - [3,0,0]env
>>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:
>>>  - [3,0,0]sort
>>> ++ 10:24:09
>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L235:  -
>>> [3,1,0]hostname -s
>>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L236:
>>>  - [3,0,0]echo 'gdb -p 6035 /bin/bash'
>>> gdb -p 6035 /bin/bash
>>> + 10:24:09 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L237:
>>>  - [3,0,0]read
>>>
>>> + 10:24:47 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>>  - [3,0,0]exec bats-exec-suite -x
>>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats
>>> + 10:24:47 /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L238:
>>>  - [3,0,0]bats_test_count_validator
>>>
>>>
>>> Stepping through with gdb in the Slurm session:
>>>
>>> (gdb) n
>>> Single stepping until exit from function __read_nocancel,
>>> which has no line number information.
>>> 0x0000000000483f30 in zread ()
>>> (gdb) n
>>> Single stepping until exit from function zread,
>>> which has no line number information.
>>> 0x0000000000484001 in zreadc ()
>>> (gdb) n
>>> Single stepping until exit from function zreadc,
>>> which has no line number information.
>>> 0x0000000000475611 in read_builtin ()
>>> (gdb)
>>> Single stepping until exit from function read_builtin,
>>> which has no line number information.
>>> 0x000000000042f35f in execute_builtin.isra.2 ()
>>> (gdb)
>>> Single stepping until exit from function execute_builtin.isra.2,
>>> which has no line number information.
>>> 0x0000000000431449 in execute_simple_command ()
>>> (gdb)
>>> Single stepping until exit from function execute_simple_command,
>>> which has no line number information.
>>> 0x0000000000432353 in execute_command_internal ()
>>> (gdb)
>>> Single stepping until exit from function execute_command_internal,
>>> which has no line number information.
>>> 0x0000000000433d3e in execute_command ()
>>> (gdb)
>>> Single stepping until exit from function execute_command,
>>> which has no line number information.
>>> 0x000000000041e375 in reader_loop ()
>>> (gdb)
>>> Single stepping until exit from function reader_loop,
>>> which has no line number information.
>>> [Detaching after fork from child process 132894]
>>> [Detaching after fork from child process 132895]
>>> [Detaching after fork from child process 132896]
>>> [Detaching after fork from child process 132956]
>>> 0x000000000041e2bf in reader_loop ()
>>>
>>>
>>> Compared to stepping through in the normal login session:
>>>
>>> Single stepping until exit from function __read_nocancel,
>>> which has no line number information.
>>> 0x0000000000483f30 in zread ()
>>> (gdb)
>>> Single stepping until exit from function zread,
>>> which has no line number information.
>>> 0x0000000000484001 in zreadc ()
>>> (gdb)
>>> Single stepping until exit from function zreadc,
>>> which has no line number information.
>>> 0x0000000000475611 in read_builtin ()
>>> (gdb)
>>> Single stepping until exit from function read_builtin,
>>> which has no line number information.
>>> 0x000000000042f35f in execute_builtin.isra.2 ()
>>> (gdb)
>>> Single stepping until exit from function execute_builtin.isra.2,
>>> which has no line number information.
>>> 0x0000000000431449 in execute_simple_command ()
>>> (gdb)
>>> Single stepping until exit from function execute_simple_command,
>>> which has no line number information.
>>> 0x0000000000432353 in execute_command_internal ()
>>> (gdb)
>>> Single stepping until exit from function execute_command_internal,
>>> which has no line number information.
>>> 0x0000000000433d3e in execute_command ()
>>> (gdb)
>>> Single stepping until exit from function execute_command,
>>> which has no line number information.
>>> 0x000000000041e375 in reader_loop ()
>>> (gdb)
>>> Single stepping until exit from function reader_loop,
>>> which has no line number information.
>>> [Detaching after fork from child process 6967]
>>> [Detaching after fork from child process 6968]
>>> [Detaching after fork from child process 6969]
>>> 0x000000000041c9de in main ()
>>>
>>> On Tue, Mar 23, 2021 at 6:00 PM John Hanks <griznog@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm troubleshooting some bash scripts and have an issue where when ran
>>>> from a normal ssh session the scripts work fine, but when ran from with a
>>>> Slurm managed job they fail. I traced this down to a point in the code
>>>> where they fork directions at runtime. The script has at the end:
>>>>
>>>> set -o pipefail execfail
>>>> exec bats-exec-suite "${flags[@]}" "${filenames[@]}" |
>>>> bats_test_count_validator | "bats-format-${formatter}"
>>>> "${formatter_flags[@]}"
>>>>
>>>> When executed on a vanilla ssh session connection it works and does
>>>> this:
>>>>
>>>> + 22:38:48
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:  - [3,0,0][[
>>>> -d /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
>>>> + 22:38:48
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:  -
>>>> [3,0,1]filenames+=("$filename")
>>>> + 22:38:48
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:  -
>>>> [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
>>>> + 22:38:48
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L233:  - [3,0,0]set
>>>> -o pipefail execfail
>>>> + 22:38:48
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:  -
>>>> [3,0,0]exec bats-exec-suite -x
>>>> /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats
>>>> + 22:38:48
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:  -
>>>> [3,0,0]bats-format-pretty
>>>> + 22:38:48
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:  -
>>>> [3,0,0]bats_test_count_validator
>>>>
>>>> But within a slurm job (whether interactive or a batch job, think with
>>>> a pty or without one, either way same failure):
>>>>
>>>> + 22:37:17
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L213:  - [3,0,0][[
>>>> -d /hpc/projects/stv/mFish/UAT/dva_mfish/mfish/etl/tests/imports.bats ]]
>>>> + 22:37:17
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L226:  -
>>>> [3,0,1]filenames+=("$filename")
>>>> + 22:37:17
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L231:  -
>>>> [3,0,0]source /home/prd_rdip_mfish/local/lib/bats-core/validator.bash
>>>> + 22:37:17
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L233:  - [3,0,0]set
>>>> -o pipefail execfail
>>>> + 22:37:17
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:  -
>>>> [3,0,0]bats-format-pretty
>>>> + 22:37:17
>>>> /home/prd_rdip_mfish/local/libexec/bats-core/bats.debug:L234:  -
>>>> [3,0,0]bats_test_count_validator
>>>>
>>>> I've looked at diffs of "env" and "set" between these sessions, can't
>>>> find any significant difference between them. All files referenced are
>>>> available. shopt settings are identical on both. I can't find any reason
>>>> they behave differently at this point, but they do.
>>>>
>>>> Any answers, advice or pointers to where I should be looking are
>>>> appreciated. I'm really confused as to how "bash -x" output can differ for
>>>> these lines of code. The Google queries I've come up with return a lot of
>>>> hits, but the relevant terms are just too generic to narrow things down to
>>>> a useful set.
>>>>
>>>> Best,
>>>>
>>>> griznog
>>>>
>>>


reply via email to

[Prev in Thread] Current Thread [Next in Thread]