Hi,
I'm running into build behaviour that might be caused by a non-deterministic GNU linker?! As this is quite unlikely, I hope you can help me narrow this issue down so I can find the real root cause.
Thanks,
Roger
----------------------------------------------
What is the issue?
At the company I'm working for we have an automated (Solaris) build environment based on IBM's Clearmake. One of the steps in the build process is cross-compiling and linking for powerpc.
.. as our products are critical in the customers' production process, we do not want to bother our customers with unneeded patches. To arrange this we built a tool that checks the build's libraries/binaries for differences against the latest baseline.
... lately we have a lot of 'unexpected' patches popping up, which I'm analysing in more detail. What we see is that a rebuild of such an unexpected target resolves the issue?!
----------------------------------------------
So what is happening here?
I analysed such an unexpected patch (ppc binary) and see that there is a small delta in the objdump (-x) output.
In the binary which we qualify as 'ok' we see that the symbol is listed as undefined (UND):
UND.objdump:0000000000000000 O *UND* 000000000000000c OOXA_interface_info
While the wrong binary has the same symbol in the .sbss section:
sbss.objdump:0000000010019560 g O .sbss 000000000000000c OOXA_interface_info
And the library that implements the symbol:
lib.objdump:0000000000012c60 g O .bss 000000000000000c OOXA_interface_info
So this library looks fine, as it provides the symbol OOXA_interface_info. The good 'UND' binary looks fine too, as it should not have this symbol allocated (extern const struct).
Now why does the wrong 'sbss' file put this symbol in the .sbss data section? It looks as if the struct is really allocated here?!
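To check many binaries for this difference automatically, the symbol lines quoted above can be split into their fields. The following is only a minimal Python sketch. It assumes the whitespace-separated layout of the three lines shown (an optional grep-style "file:" prefix, then address, flags, section, size, name), and the names parse_symbol_line and is_allocated_here are hypothetical helpers, not part of any existing tool.

```python
def parse_symbol_line(line):
    """Parse one 'objdump -x' symbol line in the layout quoted above."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("not a symbol line: %r" % line)
    # Drop a grep-style "file:" prefix such as "UND.objdump:".
    if ":" in fields[0]:
        fields[0] = fields[0].split(":", 1)[1]
    return {
        "address": int(fields[0], 16),
        "flags": "".join(fields[1:-3]),  # e.g. "gO" or "O"
        "section": fields[-3],           # "*UND*", ".sbss", ".bss", ...
        "size": int(fields[-2], 16),
        "name": fields[-1],
    }


def is_allocated_here(symbol):
    """True if the symbol lives in a section of this file (not *UND*)."""
    return symbol["section"] != "*UND*"


# The three lines quoted in this post.
lines = [
    "UND.objdump:0000000000000000 O *UND* 000000000000000c OOXA_interface_info",
    "sbss.objdump:0000000010019560 g O .sbss 000000000000000c OOXA_interface_info",
    "lib.objdump:0000000000012c60 g O .bss 000000000000000c OOXA_interface_info",
]
symbols = [parse_symbol_line(text) for text in lines]
for sym in symbols:
    print(sym["section"], sym["size"], is_allocated_here(sym))
```

In all three files the symbol size is 0xc (12 bytes); only the section differs.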
---------------
Now I made a simple script which links the binary many times in a row (>100).
This is done on -exactly- the same buildhost with -exactly- the same input files (objects, libs), tooling etc. (note that I have the clearmake config records to confirm this). To make sure I have a stable reproduction scenario, I only repeat the linking step over and over again.
... depending on the load (other builds) running on this buildhost I see that the output file differs (UND or .sbss?). If there is no other load on the buildhost I do not get the unwanted (sbss) binary, but at a certain moment something happens on the buildhost that influences the crosslinker...
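For reference, a repeat-the-link-step loop of this kind could look roughly like the Python sketch below. It is not the actual script: the function names are made up for illustration, and the real link command and output file name have to be filled in by the caller. It hashes the output after every run and counts how often each distinct result occurs.

```python
import collections
import hashlib
import subprocess


def make_command_runner(command):
    """Return a callable that runs the given command line once."""
    def run_once():
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(command, check=True)
    return run_once


def count_distinct_outputs(link_once, output_path, runs):
    """Call link_once() `runs` times and count each distinct output digest."""
    digests = collections.Counter()
    for _ in range(runs):
        link_once()
        with open(output_path, "rb") as handle:
            digests[hashlib.sha256(handle.read()).hexdigest()] += 1
    return digests
```

With a hypothetical wrapper script for the link step, this would be called as count_distinct_outputs(make_command_runner(["sh", "./link_step.sh"]), "a.out", 100). More than one key in the returned counter means the link step produced different files from identical inputs.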
---------------
So next I tried to reproduce the wrong/sbss situation..
- is it the load on the buildhost causing this behavior?
a. I created an application which simulates load on this machine. Running multiple instances at approx. 100% load does not influence the linker and the result is still ok
b. I did some random parallel builds but still could not reproduce it
- is it the memory swapping?
.. here I created a small app which really fills the memory, but the result is still ok
... but again, at moments when more builds (by other developers) are running on this buildhost I can reproduce it again. Note that in this situation not every build will reproduce the '.sbss' situation. Depending on the moment this varies between 0% and 25%, and sometimes even about 50%.
But what is causing this?!
---------------
Getting a little lost here, I'm dumping as much info as possible. For both the 'Ok' and 'not ok' situations I created the following files:
- truss output
- ps info
- objdump -x
- cleartool config record
The first difference I notice in the truss output is:
------------------------------------------
-------------- sbss (wrong) --------------
------------------------------------------
open("/vobs/litho/.caddata/lxfs_glp/lxfs_glp/devel/sysroots/wrs_sbc8548-glibc_cgl/sysroot/te500v2/usr/lib//libc.so", O_RDONLY) = 10
ioctl(10, TCGETA, 0xFFBE6C84) Err#25 ENOTTY
fstat64(10, 0xFFBE6150) = 0
d=0x04550167 i=6151272 m=0100644 l=1 u=14302 g=2010 sz=235
at = Nov 18 08:44:09 MET 2010 [ 1290066249 ]
mt = Dec 17 02:08:28 MET 2008 [ 1229476108 ]
ct = Feb 19 08:57:07 MET 2010 [ 1266566227 ]
bsz=8192 blks=8 fs=nfs
ioctl(10, TCGETA, 0xFFBE60DC) Err#25 ENOTTY
read(10, " / * G N U l d s c".., 8192) = 235
read(10, 0x0024D02C, 8192) = 0
------------------------------------------
-------------- UND (Ok) ------------------
------------------------------------------
open("/vobs/litho/.caddata/lxfs_glp/lxfs_glp/devel/sysroots/wrs_sbc8548-glibc_cgl/sysroot/te500v2/usr/lib//libc.so", O_RDONLY) = 10
ioctl(10, TCGETA, 0xFFBE6C84) Err#25 ENOTTY
fstat64(10, 0xFFBE6150) = 0
d=0x04550167 i=6151272 m=0100644 l=1 u=14302 g=2010 sz=235
at = Nov 18 14:00:36 MET 2010 [ 1290085236 ]
mt = Dec 17 02:08:28 MET 2008 [ 1229476108 ]
ct = Feb 19 08:57:07 MET 2010 [ 1266566227 ]
bsz=8192 blks=8 fs=nfs
ioctl(10, TCGETA, 0xFFBE60DC) Err#25 ENOTTY
read(10, " / * G N U l d s c".., 8192) = 235
read(10, 0x0024D114, 8192) = 0
------------------------------------------
...here you see a shift in the buffer address passed to the last read() call (0x0024D02C vs. 0x0024D114). Of course, at the end of the truss output there are more differences as the file created is different.
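The two excerpts can also be compared mechanically. The Python sketch below is only an illustration (differing_lines is a hypothetical helper, and the long open() line is left out because it is identical in both traces). It pairs the lines up by position and reports the ones that differ.

```python
def differing_lines(trace_a, trace_b):
    """Return (line_number, a, b) for each position where the traces differ.

    Lines are paired by position; extra lines in the longer trace are ignored.
    """
    diffs = []
    for number, (a, b) in enumerate(zip(trace_a, trace_b), start=1):
        if a.strip() != b.strip():
            diffs.append((number, a.strip(), b.strip()))
    return diffs


sbss_trace = """\
ioctl(10, TCGETA, 0xFFBE6C84) Err#25 ENOTTY
fstat64(10, 0xFFBE6150) = 0
d=0x04550167 i=6151272 m=0100644 l=1 u=14302 g=2010 sz=235
at = Nov 18 08:44:09 MET 2010 [ 1290066249 ]
mt = Dec 17 02:08:28 MET 2008 [ 1229476108 ]
ct = Feb 19 08:57:07 MET 2010 [ 1266566227 ]
bsz=8192 blks=8 fs=nfs
ioctl(10, TCGETA, 0xFFBE60DC) Err#25 ENOTTY
read(10, " / * G N U l d s c".., 8192) = 235
read(10, 0x0024D02C, 8192) = 0
""".splitlines()

und_trace = """\
ioctl(10, TCGETA, 0xFFBE6C84) Err#25 ENOTTY
fstat64(10, 0xFFBE6150) = 0
d=0x04550167 i=6151272 m=0100644 l=1 u=14302 g=2010 sz=235
at = Nov 18 14:00:36 MET 2010 [ 1290085236 ]
mt = Dec 17 02:08:28 MET 2008 [ 1229476108 ]
ct = Feb 19 08:57:07 MET 2010 [ 1266566227 ]
bsz=8192 blks=8 fs=nfs
ioctl(10, TCGETA, 0xFFBE60DC) Err#25 ENOTTY
read(10, " / * G N U l d s c".., 8192) = 235
read(10, 0x0024D114, 8192) = 0
""".splitlines()

for number, sbss_line, und_line in differing_lines(sbss_trace, und_trace):
    print(number, sbss_line, "|", und_line)
```

For these excerpts only two lines differ: the access time (at =), which is expected because the traces were taken at different moments, and the buffer address of the final read().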
-------------------
Next I see that there is a difference in the size and alignment of the .sbss section between the two objdump -x outputs (note the diff mentioned above):
Wrong/sbss file:
24 .sbss 00000014 0000000010019560 0000000010019560 00009560 2**3
Good/UND file:
24 .sbss 00000008 0000000010019560 0000000010019560 00009560 2**2
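To make the delta between these two section lines explicit, here is a small Python sketch. It assumes the usual column order of the objdump section table (Idx, Name, Size, VMA, LMA, File off, Algn), and parse_section_line is a hypothetical helper name.

```python
def parse_section_line(line):
    """Parse one section-header line: Idx Name Size VMA LMA File-off Algn."""
    idx, name, size, vma, lma, file_off, algn = line.split()
    base, exponent = algn.split("**")  # e.g. "2**3" -> alignment of 8 bytes
    return {
        "index": int(idx),
        "name": name,
        "size": int(size, 16),
        "vma": int(vma, 16),
        "lma": int(lma, 16),
        "file_offset": int(file_off, 16),
        "alignment": int(base) ** int(exponent),
    }


wrong = parse_section_line(
    "24 .sbss 00000014 0000000010019560 0000000010019560 00009560 2**3")
good = parse_section_line(
    "24 .sbss 00000008 0000000010019560 0000000010019560 00009560 2**2")

size_delta = wrong["size"] - good["size"]
print(hex(size_delta), wrong["alignment"], good["alignment"])
```

The .sbss section in the wrong file is 0xc bytes larger (0x14 vs. 0x08) and is aligned to 8 instead of 4 bytes. The 0xc matches the size of OOXA_interface_info in the symbol lines, and the wrong file places the symbol at 0x10019560, the start of .sbss, which fits the idea that the struct is really allocated there.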
-------------------
Any help is welcome, and please let me know if I should provide more info.
thanks!
Roger