[Gluster-devel] Gluster 3.x hangs


From: nicolas prochazka
Subject: [Gluster-devel] Gluster 3.x hangs
Date: Mon, 29 Mar 2010 12:33:24 +0200

Hello,
Some weeks ago I sent a report saying that GlusterFS 3.x reboots our systems when we test HA by deactivating a network interface (ifconfig eth0 down).
You could not reproduce it on your systems.

The reboot is caused by hung_task_panic and hung_task_timeout_secs: when a task stays blocked for 120 s, the Linux kernel panics.
So set hung_task_panic to 0, or hung_task_timeout_secs > 600, to leave some time before the panic.
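
A minimal sketch of how those two settings can be inspected, assuming a kernel built with hung-task detection so that the files under /proc/sys/kernel exist. The helper name show_hung_task is only illustrative. The helper reads the values without changing anything; the commented sysctl lines show how the workaround could be applied as root.

```shell
#!/bin/sh
# Print the current hung-task settings, or "unavailable" when the kernel
# does not expose them (e.g. built without CONFIG_DETECT_HUNG_TASK).
# show_hung_task is an illustrative name, not part of any tool.
show_hung_task() {
    for key in hung_task_panic hung_task_timeout_secs; do
        file="/proc/sys/kernel/$key"
        if [ -r "$file" ]; then
            echo "$key=$(cat "$file")"
        else
            echo "$key=unavailable"
        fi
    done
}

show_hung_task

# To apply the workaround (as root), either of these should be enough:
#   sysctl -w kernel.hung_task_panic=0
#   sysctl -w kernel.hung_task_timeout_secs=900   # any value above 600
```

Values set with sysctl -w do not survive a reboot; to keep them, they would also have to go into /etc/sysctl.conf.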


1 - Two machines, each acting as server and client, in replicate mode.
2 - The first server, 10.98.98.1, is the configuration server.
3 - Run gluster on both servers as follows:
/usr/local/sbin/glusterfsd --log-level=DEBUG --log-file=/tmpsafe/server.log -N -f /etc/glusterfs/glusterfs-server.vol
/usr/local/sbin/glusterfs --log-level=DEBUG --log-file=/tmpsafe/client.log -N -s 10.98.98.1 /mnt/vdisk/

4 - On 10.98.98.1, run ifconfig eth0 down.
5 - On 10.98.98.10, after a short timeout, ls /mnt/vdisk responds again (using 10.98.98.10 as server).
6 - On 10.98.98.1, ls /mnt/vdisk hangs forever.
7 - On 10.98.98.1, kill the glusterfs client and run glusterfs again; ls /mnt/vdisk then works again (using 10.98.98.1 as server).

During step 6, nothing is written to the server log or the client log on 10.98.98.1.
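
The hang in step 6 can be checked without blocking the shell. Below is a minimal sketch, assuming the coreutils timeout command is available. The helper probe_mount is hypothetical. If the ls process is stuck in an uninterruptible kernel wait, even SIGKILL may not stop it, so the probe itself could block in that case.

```shell
#!/bin/sh
# Hypothetical helper: list a directory with a time limit so that a hung
# mount shows up as "hung" instead of blocking the shell.
# Usage: probe_mount DIR [SECONDS]
probe_mount() {
    dir="$1"
    limit="${2:-10}"
    # coreutils timeout exits with 124 when the time limit is reached,
    # or 137 when it had to send SIGKILL (-k 2: two seconds after SIGTERM).
    timeout -k 2 "$limit" ls "$dir" >/dev/null 2>&1
    status=$?
    case "$status" in
        0)       echo "ok" ;;
        124|137) echo "hung" ;;
        *)       echo "error($status)" ;;
    esac
}

# /mnt/vdisk is the mount point used in the steps above
probe_mount "${1:-/mnt/vdisk}" 10
```

Run on both machines after step 4, this would be expected to report "ok" on 10.98.98.10 once the timeout has passed and "hung" on 10.98.98.1, if the behaviour matches the report.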

The configuration files and logs are below.
Regards, 
Nicolas Prochazka.

-----------------------------------------------



#This file is auto generated, not edit ( Nicolas Prochazka Sep 2009)
# -------------    Create Brick blade definition
volume 10.98.98.1
type protocol/client
option transport-type tcp/client
option remote-host 10.98.98.1
option transport.socket.nodelay on
option remote-subvolume brick
end-volume


volume 10.98.98.10
type protocol/client
option transport-type tcp/client
option remote-host 10.98.98.10
option transport.socket.nodelay on
option remote-subvolume brick
end-volume


# -------------    Create Brick Replicate  definition
# -------------    Create Distribute definition
volume last
type cluster/distribute
subvolumes  10.98.98.1 10.98.98.10
end-volume



volume iothreads
type performance/io-threads
option thread-count 8
subvolumes last
end-volume

volume io-cache
type performance/io-cache
option cache-size 2GB             # default is 32MB
option cache-timeout 5  # default is 1
subvolumes iothreads
end-volume

volume writebehind
type performance/write-behind
option cache-size 4MB
subvolumes io-cache
end-volume






DEV-10.98.98.1:~# cat /etc/glusterfs/glusterfs-server.vol  
volume brickless
type storage/posix
option directory /mnt/disks/export
end-volume

volume brickthread
type features/locks
subvolumes brickless
end-volume

volume brickcache
type performance/io-cache
option cache-size 2GB             # default is 32MB
option cache-timeout 2  # default is 1
subvolumes brickthread
end-volume


volume brick
type performance/io-threads
option thread-count 8
subvolumes brickcache
end-volume



volume server
type protocol/server
subvolumes brick
option client-volume-filename /etc/glusterfs/Gglusterfs-client.vol
option transport-type tcp
option transport.socket.nodelay on
option verify-volfile-checksum no
option auth.addr.brick.allow 10.98.98.*
end-volume


Client log on 10.98.98.10; everything seems OK:

[2010-03-29 12:48:04] E [client-protocol.c:415:client_ping_timer_expired] 10.98.98.1: Server 10.98.98.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-03-29 12:48:04] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.1: forced unwinding frame type(1) op(STATFS)
[2010-03-29 12:48:04] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.1: forced unwinding frame type(2) op(PING)
[2010-03-29 12:48:04] D [client-protocol.c:537:client_ping_cbk] 10.98.98.1: timer must have expired
[2010-03-29 12:48:04] N [client-protocol.c:6994:notify] 10.98.98.1: disconnected
[2010-03-29 12:48:06] E [socket.c:762:socket_connect_finish] 10.98.98.1: connection to 10.98.98.1:6996 failed (No route to host)
[2010-03-29 12:48:09] E [socket.c:762:socket_connect_finish] 10.98.98.1: connection to 10.98.98.1:6996 failed (No route to host)


Log on 10.98.98.1:


[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on subvolume '10.98.98.1': avail_percent is: 99.00 and avail_space is: 15069396992
[2010-03-29 16:30:17] N [client-protocol.c:6246:client_setvolume_cbk] 10.98.98.1: Connected to 10.98.98.1:6996, attached to remote volume 'brick'.
[2010-03-29 16:30:17] N [client-protocol.c:6246:client_setvolume_cbk] 10.98.98.10: Connected to 10.98.98.10:6996, attached to remote volume 'brick'.
[2010-03-29 16:30:17] N [client-protocol.c:6246:client_setvolume_cbk] 10.98.98.10: Connected to 10.98.98.10:6996, attached to remote volume 'brick'.
[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on subvolume '10.98.98.1': avail_percent is: 99.00 and avail_space is: 15069396992
[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on subvolume '10.98.98.10': avail_percent is: 99.00 and avail_space is: 88316628992
[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on subvolume '10.98.98.10': avail_percent is: 99.00 and avail_space is: 88316628992
[2010-03-29 16:30:21] D [dht-layout.c:576:dht_layout_normalize] last: found anomalies in /iso. holes=1 overlaps=0
[2010-03-29 16:30:21] D [dht-common.c:164:dht_lookup_dir_cbk] last: fixing assignment on /iso
[2010-03-29 16:30:21] D [dht-layout.c:576:dht_layout_normalize] last: found anomalies in /ha. holes=1 overlaps=0
[2010-03-29 16:30:21] D [dht-common.c:164:dht_lookup_dir_cbk] last: fixing assignment on /ha
[2010-03-29 16:30:21] D [dht-layout.c:576:dht_layout_normalize] last: found anomalies in /monitoring. holes=1 overlaps=0
[2010-03-29 16:30:21] D [dht-common.c:164:dht_lookup_dir_cbk] last: fixing assignment on /monitoring

Nothing is logged during the hang.
At restart:

[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] D [socket.c:1326:socket_submit] 10.98.98.10: not connected (priv->connected = 255)
[2010-03-29 16:58:26] D [dht-common.c:1590:dht_fd_cbk] last: subvolume 10.98.98.10 returned -1 (Transport endpoint is not connected)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] D [dht-common.c:1590:dht_fd_cbk] last: subvolume 10.98.98.10 returned -1 (Transport endpoint is not connected)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(2) op(PING)
[2010-03-29 16:58:26] D [client-protocol.c:537:client_ping_cbk] 10.98.98.10: timer must have expired
[2010-03-29 16:58:29] E [socket.c:762:socket_connect_finish] 10.98.98.10: connection to 10.98.98.10:6996 failed (No route to host)
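
To see at a glance which operations were still pending when the frames were unwound, the log can be condensed. Here is a minimal sketch that assumes the log line format shown above; the function name summarize_unwinds is only illustrative.

```shell
#!/bin/sh
# Count "forced unwinding" log entries per subvolume and operation.
# Reads a glusterfs client log on stdin; prints "SUBVOLUME OP COUNT".
# summarize_unwinds is an illustrative name, not a gluster tool.
summarize_unwinds() {
    sed -n 's/.*saved_frames_unwind] \([^:]*\): forced unwinding frame type([0-9]*) op(\([A-Z_]*\)).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Example with three lines taken from the log on 10.98.98.1 above
summarize_unwinds <<'EOF'
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
EOF
```

For the three sample lines this prints "10.98.98.10 LOOKUP 1" and "10.98.98.10 STATFS 2". On the real file it would be run as summarize_unwinds < /tmpsafe/client.log.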

