[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#8824: sort occasionally hangs - appears to be in a merge-sort loop
From: |
Mina Naguib |
Subject: |
bug#8824: sort occasionally hangs - appears to be in a merge-sort loop |
Date: |
Wed, 8 Jun 2011 16:16:17 -0400 |
Hi
I've observed a few instances when `sort` simply "hangs" and never returns the
sorted data.
Currently using coreutils 8.12 on gentoo linux 2.6.25, `locale` outputs:
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
An example ongoing right now on one of my servers. Sort invoked as:
/usr/bin/sort -t '|' -k1,1 -k6,6n
It was fed via STDIN pipe-delimited line-based records, and STDIN was closed.
The parent process is waiting on sort to start receiving the output via STDOUT
which never happens.
Input was roughly 350 million records, each record looking something like this:
000WQg16MdC2Puk|direct_count|2049|63581|15090|1306695571
I dug into the sort process with strace and I believe it's doing one of the
final merge sorts:
read(12, "986|61518|14528|1301616354\nnHM5Og"..., 4096) = 4096
read(12, "|1967|61925|14584|1301634780\nnHMk"..., 4096) = 4096
read(12, "533\nnHNQDohSrKT3Ffn|direct_count|"..., 4096) = 4096
write(4, "4488|1302016964\nnGq7jG4hfFS9flL|d"..., 4096) = 4096
write(4, "nGq9POv11Jh7OiQ|direct_count|1965"..., 4096) = 4096
write(4, "05|1299763468\nnGqCgAT3jwkrzhc|dir"..., 4096) = 4096
write(4, "460\nnGqFItcr2F1Pz8n|tag|1995|6275"..., 4096) = 4096
write(4, "JuhUfMkOPC7J|direct_count|1914|57"..., 4096) = 4096
write(4, "|1914|57180|13801|1302640204\nnGqO"..., 4096) = 4096
After watching it do this diddy for 5.5 hours I dug in further. Here's its
open temp files:
$ date; lsof -n -p 31370 | grep tmp | nl
Wed Jun 8 15:37:41 EDT 2011
1 sort 31370 dmt 4u REG 8,1 3123896320 12705303
/tmp/sort0YtEtm
2 sort 31370 dmt 5r REG 8,1 1199907358 12705551
/tmp/sortWNOpxy
3 sort 31370 dmt 7r REG 8,1 1181994889 12705627
/tmp/sortIpm44r
4 sort 31370 dmt 8r REG 8,1 1199857626 12705636
/tmp/sort2WOPPo
5 sort 31370 dmt 9r REG 8,1 1196868088 12705640
/tmp/sortCUkW6a
6 sort 31370 dmt 10r REG 8,1 1199488185 12705644
/tmp/sortSDopQX
7 sort 31370 dmt 11r REG 8,1 1188267263 12705646
/tmp/sortUdXqke
8 sort 31370 dmt 12r REG 8,1 1176087055 12705665
/tmp/sortMiavw5
9 sort 31370 dmt 13r REG 8,1 1184572313 12705666
/tmp/sortUf5q03
10 sort 31370 dmt 14r REG 8,1 1187433621 12705672
/tmp/sorti3mKO9
11 sort 31370 dmt 15r REG 8,1 1184537051 12705680
/tmp/sort0iYrQr
12 sort 31370 dmt 16r REG 8,1 1186874578 12705696
/tmp/sortUwVkpQ
13 sort 31370 dmt 17r REG 8,1 1173767980 12705703
/tmp/sortov5uyz
14 sort 31370 dmt 18r REG 8,1 1172616590 12705711
/tmp/sort2ibXqt
15 sort 31370 dmt 19r REG 8,1 1184319006 12705714
/tmp/sortA1DmGE
16 sort 31370 dmt 20r REG 8,1 1188372691 12705728
/tmp/sortibGbvd
17 sort 31370 dmt 21r REG 8,1 1185407259 12705713
/tmp/sort4Ou1u6
again:
$ date; lsof -n -p 31370 | grep tmp | nl
Wed Jun 8 15:37:58 EDT 2011
1 sort 31370 dmt 4u REG 8,1 3432607744 12705303
/tmp/sort0YtEtm
2 sort 31370 dmt 5r REG 8,1 1199907358 12705551
/tmp/sortWNOpxy
3 sort 31370 dmt 7r REG 8,1 1181994889 12705627
/tmp/sortIpm44r
4 sort 31370 dmt 8r REG 8,1 1199857626 12705636
/tmp/sort2WOPPo
5 sort 31370 dmt 9r REG 8,1 1196868088 12705640
/tmp/sortCUkW6a
6 sort 31370 dmt 10r REG 8,1 1199488185 12705644
/tmp/sortSDopQX
7 sort 31370 dmt 11r REG 8,1 1188267263 12705646
/tmp/sortUdXqke
8 sort 31370 dmt 12r REG 8,1 1176087055 12705665
/tmp/sortMiavw5
9 sort 31370 dmt 13r REG 8,1 1184572313 12705666
/tmp/sortUf5q03
10 sort 31370 dmt 14r REG 8,1 1187433621 12705672
/tmp/sorti3mKO9
11 sort 31370 dmt 15r REG 8,1 1184537051 12705680
/tmp/sort0iYrQr
12 sort 31370 dmt 16r REG 8,1 1186874578 12705696
/tmp/sortUwVkpQ
13 sort 31370 dmt 17r REG 8,1 1173767980 12705703
/tmp/sortov5uyz
14 sort 31370 dmt 18r REG 8,1 1172616590 12705711
/tmp/sort2ibXqt
15 sort 31370 dmt 19r REG 8,1 1184319006 12705714
/tmp/sortA1DmGE
16 sort 31370 dmt 20r REG 8,1 1188372691 12705728
/tmp/sortibGbvd
17 sort 31370 dmt 21r REG 8,1 1185407259 12705713
/tmp/sort4Ou1u6
again several times, you see as the main writable file approaches the total
size sum of the other files combined:
Wed Jun 8 15:55:20 EDT 2011
1 sort 31370 dmt 4u REG 8,1 18965975040 12705303
/tmp/sort0YtEtm
2 sort 31370 dmt 5r REG 8,1 1199907358 12705551
/tmp/sortWNOpxy
3 sort 31370 dmt 7r REG 8,1 1181994889 12705627
/tmp/sortIpm44r
4 sort 31370 dmt 8r REG 8,1 1199857626 12705636
/tmp/sort2WOPPo
5 sort 31370 dmt 9r REG 8,1 1196868088 12705640
/tmp/sortCUkW6a
6 sort 31370 dmt 10r REG 8,1 1199488185 12705644
/tmp/sortSDopQX
7 sort 31370 dmt 11r REG 8,1 1188267263 12705646
/tmp/sortUdXqke
8 sort 31370 dmt 12r REG 8,1 1176087055 12705665
/tmp/sortMiavw5
9 sort 31370 dmt 13r REG 8,1 1184572313 12705666
/tmp/sortUf5q03
10 sort 31370 dmt 14r REG 8,1 1187433621 12705672
/tmp/sorti3mKO9
11 sort 31370 dmt 15r REG 8,1 1184537051 12705680
/tmp/sort0iYrQr
12 sort 31370 dmt 16r REG 8,1 1186874578 12705696
/tmp/sortUwVkpQ
13 sort 31370 dmt 17r REG 8,1 1173767980 12705703
/tmp/sortov5uyz
14 sort 31370 dmt 18r REG 8,1 1172616590 12705711
/tmp/sort2ibXqt
15 sort 31370 dmt 19r REG 8,1 1184319006 12705714
/tmp/sortA1DmGE
16 sort 31370 dmt 20r REG 8,1 1188372691 12705728
/tmp/sortibGbvd
17 sort 31370 dmt 21r REG 8,1 1185407259 12705713
/tmp/sort4Ou1u6
then what appears to be finalization and cleanup:
Wed Jun 8 15:55:28 EDT 2011
1 sort 31370 dmt 4u REG 8,1 18990366720 12705303
/tmp/sort0YtEtm
2 sort 31370 dmt 5r REG 8,1 1199907358 12705551
/tmp/sortWNOpxy
3 sort 31370 dmt 7r REG 8,1 1181994889 12705627
/tmp/sortIpm44r
4 sort 31370 dmt 8r REG 8,1 1199857626 12705636
/tmp/sort2WOPPo
5 sort 31370 dmt 9r REG 8,1 1196868088 12705640
/tmp/sortCUkW6a
6 sort 31370 dmt 12r REG 8,1 1176087055 12705665
/tmp/sortMiavw5
7 sort 31370 dmt 13r REG 8,1 1184572313 12705666
/tmp/sortUf5q03
8 sort 31370 dmt 14r REG 8,1 1187433621 12705672
/tmp/sorti3mKO9
9 sort 31370 dmt 15r REG 8,1 1184537051 12705680
/tmp/sort0iYrQr
10 sort 31370 dmt 16r REG 8,1 1186874578 12705696
/tmp/sortUwVkpQ
11 sort 31370 dmt 17r REG 8,1 1173767980 12705703
/tmp/sortov5uyz
12 sort 31370 dmt 18r REG 8,1 1172616590 12705711
/tmp/sort2ibXqt
again, almost done:
Wed Jun 8 15:55:34 EDT 2011
1 sort 31370 dmt 4u REG 8,1 18990370816 12705303
/tmp/sort0YtEtm
2 sort 31370 dmt 16r REG 8,1 1186874578 12705696
/tmp/sortUwVkpQ
again, 1 main file left:
Wed Jun 8 15:55:35 EDT 2011
1 sort 31370 dmt 4u REG 8,1 18990370816 12705303
/tmp/sort0YtEtm
But alas, 16 more readable files pop up and 1 writable starts adding up again:
Wed Jun 8 15:55:36 EDT 2011
1 sort 31370 dmt 4u REG 8,1 0 12705551
/tmp/sortevbKK1
2 sort 31370 dmt 5r REG 8,1 1178392494 12705739
/tmp/sortIIvZhW
3 sort 31370 dmt 7r REG 8,1 1180102559 12706259
/tmp/sortwZr2sU
4 sort 31370 dmt 8r REG 8,1 1181439270 12706530
/tmp/sort8B2hbx
5 sort 31370 dmt 9r REG 8,1 1183275901 12706731
/tmp/sorteGJa35
6 sort 31370 dmt 10r REG 8,1 1178538478 12706793
/tmp/sortoGXmOo
7 sort 31370 dmt 11r REG 8,1 1191511729 12706795
/tmp/sortG0RNr9
8 sort 31370 dmt 12r REG 8,1 1199965207 12706799
/tmp/sort6Q8tAy
9 sort 31370 dmt 13r REG 8,1 1192944995 12706817
/tmp/sortCGXzT2
10 sort 31370 dmt 14r REG 8,1 1191485559 12706823
/tmp/sortMwsygv
11 sort 31370 dmt 15r REG 8,1 1200144993 12706830
/tmp/sortyA2RWw
12 sort 31370 dmt 16r REG 8,1 1198251727 12706866
/tmp/sortyZlqHC
13 sort 31370 dmt 17r REG 8,1 1184876212 12706871
/tmp/sortCCjKOH
14 sort 31370 dmt 18r REG 8,1 1174916186 12706880
/tmp/sortKrUJSJ
15 sort 31370 dmt 19r REG 8,1 1184800950 12706824
/tmp/sort2Vjy7s
16 sort 31370 dmt 20r REG 8,1 1187817055 12706891
/tmp/sort89xsBm
17 sort 31370 dmt 21r REG 8,1 1185686697 12706944
/tmp/sortwjlE5L
Wed Jun 8 15:55:40 EDT 2011
1 sort 31370 dmt 4u REG 8,1 62631936 12705551
/tmp/sortevbKK1
2 sort 31370 dmt 5r REG 8,1 1178392494 12705739
/tmp/sortIIvZhW
3 sort 31370 dmt 7r REG 8,1 1180102559 12706259
/tmp/sortwZr2sU
4 sort 31370 dmt 8r REG 8,1 1181439270 12706530
/tmp/sort8B2hbx
5 sort 31370 dmt 9r REG 8,1 1183275901 12706731
/tmp/sorteGJa35
6 sort 31370 dmt 10r REG 8,1 1178538478 12706793
/tmp/sortoGXmOo
7 sort 31370 dmt 11r REG 8,1 1191511729 12706795
/tmp/sortG0RNr9
8 sort 31370 dmt 12r REG 8,1 1199965207 12706799
/tmp/sort6Q8tAy
9 sort 31370 dmt 13r REG 8,1 1192944995 12706817
/tmp/sortCGXzT2
10 sort 31370 dmt 14r REG 8,1 1191485559 12706823
/tmp/sortMwsygv
11 sort 31370 dmt 15r REG 8,1 1200144993 12706830
/tmp/sortyA2RWw
12 sort 31370 dmt 16r REG 8,1 1198251727 12706866
/tmp/sortyZlqHC
13 sort 31370 dmt 17r REG 8,1 1184876212 12706871
/tmp/sortCCjKOH
14 sort 31370 dmt 18r REG 8,1 1174916186 12706880
/tmp/sortKrUJSJ
15 sort 31370 dmt 19r REG 8,1 1184800950 12706824
/tmp/sort2Vjy7s
16 sort 31370 dmt 20r REG 8,1 1187817055 12706891
/tmp/sort89xsBm
17 sort 31370 dmt 21r REG 8,1 1185686697 12706944
/tmp/sortwjlE5L
I may be mistaken, but I don't think this is normal behaviour - once the 16-way
merge sort is done, I see no reason to observe this whole process happening
over and over.
Ironically, I had upgraded from coreutils 8.5 to 8.12 because other tools on
the same machine invokes sort with "--compress-program=gzip" and had observed
similar-but-different hanging on that tool's invocation. I dug in and found
this commit (
http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=1b31ce6982a9151d9dfe2ea3595ad7595cb9ca86
) which encouraged me to try the upgrade to fix the hang-while-compressing
issue, but I'm now seeing this hanging even in the non-compressed invocation.
In my case, I think I'll downgrade coreutils back down to 8.5 as well as
disable all tools from invoking it with "--compress-program" to get predictable
non-hanging behaviour across the board.
If I can be of further help please let me know.
Thank you.
- bug#8824: sort occasionally hangs - appears to be in a merge-sort loop,
Mina Naguib <=