[Gluster-devel] 3.0.1

From:	Gordan Bobic
Subject:	[Gluster-devel] 3.0.1
Date:	Tue, 26 Jan 2010 13:00:30 +0000
User-agent:	Thunderbird 2.0.0.22 (X11/20090625)

I upgraded to 3.0.1 last night and it still doesn't seem as stable as 2.0.9. Things I have bumped into since the upgrade:

1) I've had unfsd lock up hard when exporting the volume; it couldn't be "kill -9"-ed. This happened just after a spurious disconnect (see 2).

2) Seeing random disconnects/timeouts between the servers that are on the same switch (this was happening with 2.0.x as well, though, so not sure what's going on). This is where the file clobbering/corruption used to occur that causes contents of one file to be replaced with contents of a different file, when the files are open. I HAVEN'T observed clobbering with 3.0.1 (yet at least - it wasn't a particularly frequent occurrence, but the chances of it were high on shared libraries during a big yum update when glfs is rootfs), but the disconnects still happen occasionally, usually under heavy-ish load.

My main concern here is that open file self-healing may cover up the underlying bug that causes the clobbering, and possibly make it occur in even more heisenbuggy ways.

ssh sessions to both servers don't show any problems/disconnections/dropouts at the same time as the disconnects on glfs happen. Is there a setting to set how many heartbeat packets have to be lost before the disconnect is initiated?


This is the sort of thing I see in the logs:

[2010-01-26 07:36:56] N [server-protocol.c:6780:notify] server: 10.2.0.13:1010 disconnected
[2010-01-26 07:36:56] N [server-protocol.c:6780:notify] server: 10.2.0.13:1013 disconnected
[2010-01-26 07:36:56] N [server-helpers.c:849:server_connection_destroy] server: destroyed connection of thor.winterhearth.co.uk-11823-2010/01/26-05:29:32:239464-home2
[2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(1) op(SETATTR)
[2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(1) op(SETXATTR)
[2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(2) op(PING)
[2010-01-26 07:37:25] N [client-protocol.c:6973:notify] home3: disconnected
[2010-01-26 07:38:19] E [client-protocol.c:415:client_ping_timer_expired] home3: Server 10.2.0.13:6997 has not responded in the last 42 seconds, disconnecting.
[2010-01-26 07:38:19] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(2) op(SETVOLUME)
[2010-01-26 07:38:19] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(2) op(SETVOLUME)
[2010-01-26 08:06:17] N [server-protocol.c:5811:mop_setvolume] server: accepted client from 10.2.0.13:1018
[2010-01-26 08:06:17] N [server-protocol.c:5811:mop_setvolume] server: accepted client from 10.2.0.13:1017
[2010-01-26 08:06:17] N [client-protocol.c:6225:client_setvolume_cbk] home3: Connected to 10.2.0.13:6997, attached to remote volume 'home3'.
[2010-01-26 08:06:17] N [client-protocol.c:6225:client_setvolume_cbk] home3: Connected to 10.2.0.13:6997, attached to remote volume 'home3'.
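To put numbers on how long each disconnect lasts, the log can be scanned for a client-side "disconnected" notice followed by the next "Connected to" notice for the same volume. The following is only a rough sketch: the line layout is inferred from the excerpt above, the function names parse_line and outages are made up for illustration, and other GlusterFS versions may format their logs differently.

```python
import re
from datetime import datetime

# Layout inferred from the excerpt above (an assumption, not a documented format):
# [timestamp] LEVEL [file.c:line:function] translator: message
LOG_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+(?P<level>[A-Z])\s*"
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s*"
    r"(?P<xlator>[^:\s]+):\s*(?P<msg>.*)"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%d %H:%M:%S")
    return rec


def outages(lines, xlator):
    """Yield (down_at, up_at) timestamp pairs for one client translator."""
    down_at = None
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["xlator"] != xlator:
            continue
        if rec["func"] == "notify" and rec["msg"].strip() == "disconnected":
            # Remember only the first disconnect of an outage.
            if down_at is None:
                down_at = rec["ts"]
        elif rec["func"] == "client_setvolume_cbk" and rec["msg"].startswith("Connected"):
            # Repeated "Connected" lines after the first are ignored.
            if down_at is not None:
                yield down_at, rec["ts"]
                down_at = None
```

For example, `for down, up in outages(open(path), "home3"): print(down, up - down)` prints the start and length of each outage, where `path` is wherever the client log happens to live on the system in question.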

3) Something that started off as not being able to ssh in using public keys turned out to be due to my home directory somehow acquiring 777 permissions. I certainly didn't do it, so at a guess it's a file corruption issue, possibly during an unclean shutdown. Further, I've found that / directory (I'm running glusterfs root on this cluster) had permissions 777, too, which seems to have happened at the same time as the home directory getting 777 permissions. If sendmail and ssh weren't failing to work properly because of this, it's possible I wouldn't have noticed. It's potentially quite a concerning problem, even if it is caused by an unclean shutdown (put it this way - I've never seen it happen on any other file system).
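Since this kind of permission change is easy to miss, a periodic check of a few key directories could catch it earlier. A minimal sketch, assuming it is enough to flag directories whose mode is exactly 777; the function name find_open_dirs and the choice of paths are illustrative only.

```python
import os
import stat


def find_open_dirs(paths):
    """Return (path, mode string) for each directory whose mode is exactly 0777."""
    hits = []
    for p in paths:
        st = os.lstat(p)  # lstat: do not follow symlinks
        mode = stat.S_IMODE(st.st_mode)  # permission bits only
        if stat.S_ISDIR(st.st_mode) and mode == 0o777:
            hits.append((p, oct(mode)))
    return hits
```

Calling `find_open_dirs(["/", os.path.expanduser("~")])` would cover the two directories mentioned above.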


4) This looks potentially a bit concerning:

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 5633 root  15  0 25.8g 119m 1532 S 36.7  3.0 36:25.42 /usr/sbin/glusterfs --log-level=NONE --log-file=/dev/null --disable-direct-io-mode --volfile=/etc/glusterfs.root/root2.vol /mnt/newroot

This is the rootfs daemon. 25.8GB of virtual address space mapped? Surely that can't be right, even if the resident size looks reasonably sane.

Worse - it's growing by about 100MB/minute during heavy compiling on the system. I've just tried to test the nvidia driver installer to see if that old bug report I filed is still valid, and it doesn't seem to get anywhere (just makes glusterfsd and gcc use CPU time but doesn't ever finish - which is certainly a different fail case from 2.0.9 - that at least finishes the compile stage).

The virtual memory bloat is rather reminiscent of the memory fragmentation/leak problem that was fixed on 2.0.x branch a while back that was arising when shared libraries were on glusterfs. A bit leaked every time a shared library call was made. A regression, perhaps? Wasn't there a memory consumption sanity check added to the test suite after this was fixed last time?


Other glfs daemons are exhibiting similar behaviour:

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 5633 root  15  0 26.1g 119m 1532 S  0.7  3.0 37:57.01 /usr/sbin/glusterfs --log-level=NONE --log-file=/dev/null --disable-direct-io-mode --volfile=/etc/glusterfs.root/root2.vol /mnt/newroot
12037 root  15  0 24.8g  68m 1072 S  0.0  1.7  3:21.41 /usr/sbin/glusterfs --log-level=NORMAL --volfile=/etc/glusterfs/shared.vol /shared
11977 root  15  0 24.8g  67m 1092 S  0.7  1.7  3:59.11 /usr/sbin/glusterfs --log-level=NORMAL --disable-direct-io-mode --volfile=/etc/glusterfs/home.vol /home
11915 root  15  0 24.9g  32m  972 S  0.0  0.8  0:21.65 /usr/sbin/glusterfs --log-level=NORMAL --volfile=/etc/glusterfs/boot.vol /boot

The home, shared and boot volumes don't have any shared libraries on them, and 24.9GB of virtual memory mapped for the /boot volume which is backed with a 250MB file system also seems a bit excessive.
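The growth rate can be measured more precisely than by watching top, by sampling VmSize and VmRSS from /proc/<pid>/status at a fixed interval. This is a rough sketch for Linux only; the helper names (parse_vm, read_vm, growth_mb_per_min, watch) are made up for illustration, and the interval and sample count are arbitrary defaults.

```python
import time


def parse_vm(status_text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    out = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            out[key] = int(rest.split()[0])  # the kernel reports these in kB
    return out


def read_vm(pid):
    """Read the current VmSize/VmRSS of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm(f.read())


def growth_mb_per_min(kb_start, kb_end, seconds):
    """Convert a kB delta over `seconds` into MB per minute."""
    return (kb_end - kb_start) / 1024.0 / (seconds / 60.0)


def watch(pid, interval=60, samples=5):
    """Print virtual and resident size, plus VmSize growth, once per interval."""
    prev = read_vm(pid)
    for _ in range(samples):
        time.sleep(interval)
        cur = read_vm(pid)
        rate = growth_mb_per_min(prev["VmSize"], cur["VmSize"], interval)
        print(f"pid {pid}: VmSize {cur['VmSize']} kB, "
              f"VmRSS {cur['VmRSS']} kB, {rate:+.1f} MB/min")
        prev = cur
```

Running `watch(5633)` during a compile would sample the rootfs daemon listed above once a minute, making it possible to see whether only the virtual size grows or the resident size follows it.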


Gordan
