
Re: [Gluster-devel] Major lock-up problem


From: Gareth Bult
Subject: Re: [Gluster-devel] Major lock-up problem
Date: Wed, 9 Jan 2008 19:07:41 +0000 (GMT)

Yup, already have that page. 

Currently I have a one-line shell script which does a "sync" every 10 seconds. 
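
A minimal sketch of such a loop, assuming a plain /bin/sh:

#!/bin/sh
# Crude workaround: push dirty pages out to disk every 10 seconds.
while true; do
    sync
    sleep 10
done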

I've tried a couple of pdflush tweaks to no avail; I'm fairly convinced it's broken. 

Gareth. 

----- Original Message ----- 
From: "Anand Avati" <address@hidden> 
To: "Gareth Bult" <address@hidden> 
Cc: "gluster-devel" <address@hidden> 
Sent: Wednesday, January 9, 2008 7:00:21 PM (GMT) Europe/London 
Subject: Re: [Gluster-devel] Major lock-up problem 

Gareth, 
See if flushing via pdflush more frequently, rather than letting it aggregate writes, changes 
the situation. Here is a link with some interesting tips. 

http://www.westnet.com/~gsmith/content/linux-pdflush.htm 
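
For what it's worth, the knobs involved are the vm.dirty_* sysctls; the values below are purely illustrative, not a recommendation:

# Make background writeback kick in earlier and run more often.
sysctl -w vm.dirty_background_ratio=2       # start background writeback at 2% of RAM dirty
sysctl -w vm.dirty_ratio=10                 # throttle writers at 10% of RAM dirty
sysctl -w vm.dirty_writeback_centisecs=100  # wake the flusher every 1 second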

avati 


2008/1/9, Gareth Bult < address@hidden >: 

Ok, this looks like a XEN/kernel issue, as I can reproduce it without actually "using" the 
glusterfs, even though it's there and mounted. 

I've included my XEN mailing list post here as this problem could well affect anyone else 
using gluster and XEN, and it's a bit nasty in that it becomes more frequent the less memory 
you have .. so the more XEN instances you add, the more unstable your server becomes. 

(and I'm fairly convinced gluster is "the" FS to use with XEN .. especially 
when the current feature requests are processed) 

:) 

Regards, 
Gareth. 

----------- 

Posting to XEN list: 

Ok, I've been chasing this for many days .. I have a server running 10 
instances that periodically freezes .. then sometimes "comes back." 

I tried many things to pin down the problem and finally found it by accident. 
It's a little frustrating as typically the Dom0 and one (or two) instances "go" 
and the rest carry on .. and there is diddley squat in the way of logging 
information or error messages. 

I'm now using 'watch "cat /proc/meminfo"' in the Dom0. 
I watch the Dirty figure increase, and occasionally decrease. 
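
A narrower view of just those two fields is something like:

watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'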

In an instance (this is just an easy way to reproduce it quickly) do: 
dd if=/dev/zero of=/tmp/bigfile bs=1M count=1000 

Watch the "dirty" rise and at some point you'll see "writeback" cut in. 
All looks good. 

Give it a few seconds and your "watch" of /proc/meminfo will freeze. 
On my system "Dirty" will at this point be reading about "500M" and "writeback" 
will have gone down to zero. 
"xm list" in another session will confirm that you have a major problem. (it 
will hang) 

For some reason pdflush is not working properly!!! 
Run "sync" in another shell and the machine instantly jumps back to life! 

I'm running a stock Ubuntu XEN 3.1 kernel. 
File-backed XEN instances, typically 5GB with 1GB swap. 
Dual dual-core 2.8GHz Xeons (4 cores in total) with 6GB RAM. 
Twin 500GB SATA HDDs (software RAID1). 

To my way of thinking (!) when it runs out of memory it should force a sync (or similar), 
but it's not; it's just sitting there. If I wait for the dirty_expire_centisecs timer to 
expire, I may get some life back; some instances will survive and some will have hung. 
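
For reference, that timer lives in procfs and can be read or shortened on the fly; the 10-second value below is just an example:

cat /proc/sys/vm/dirty_expire_centisecs           # commonly 3000, i.e. 30 seconds
echo 1000 > /proc/sys/vm/dirty_expire_centisecs   # expire dirty data after 10 seconds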

Here's a working "meminfo"; 

MemTotal: 860160 kB 
MemFree: 22340 kB 
Buffers: 49372 kB 
Cached: 498416 kB 
SwapCached: 15096 kB 
Active: 92452 kB 
Inactive: 491840 kB 
SwapTotal: 4194288 kB 
SwapFree: 4136916 kB 
Dirty: 3684 kB 
Writeback: 0 kB 
AnonPages: 29104 kB 
Mapped: 13840 kB 
Slab: 45088 kB 
SReclaimable: 25304 kB 
SUnreclaim: 19784 kB 
PageTables: 2440 kB 
NFS_Unstable: 0 kB 
Bounce: 0 kB 
CommitLimit: 4624368 kB 
Committed_AS: 362012 kB 
VmallocTotal: 34359738367 kB 
VmallocUsed: 3144 kB 
VmallocChunk: 34359735183 kB 

Here's one where "xm list" hangs, but my "watch" is still updating the 
/proc/meminfo display; 

MemTotal: 860160 kB 
MemFree: 13756 kB 
Buffers: 53656 kB 
Cached: 502420 kB 
SwapCached: 14812 kB 
Active: 84356 kB 
Inactive: 507624 kB 
SwapTotal: 4194288 kB 
SwapFree: 4136900 kB 
Dirty: 213096 kB 
Writeback: 0 kB 
AnonPages: 28832 kB 
Mapped: 13924 kB 
Slab: 45988 kB 
SReclaimable: 25728 kB 
SUnreclaim: 20260 kB 
PageTables: 2456 kB 
NFS_Unstable: 0 kB 
Bounce: 0 kB 
CommitLimit: 4624368 kB 
Committed_AS: 361796 kB 
VmallocTotal: 34359738367 kB 
VmallocUsed: 3144 kB 
VmallocChunk: 34359735183 kB 

Here's a frozen one: 

MemTotal: 860160 kB 
MemFree: 15840 kB 
Buffers: 2208 kB 
Cached: 533048 kB 
SwapCached: 7956 kB 
Active: 49992 kB 
Inactive: 519916 kB 
SwapTotal: 4194288 kB 
SwapFree: 4136916 kB 
Dirty: 505112 kB 
Writeback: 3456 kB 
AnonPages: 34676 kB 
Mapped: 14436 kB 
Slab: 64508 kB 
SReclaimable: 18624 kB 
SUnreclaim: 45884 kB 
PageTables: 2588 kB 
NFS_Unstable: 0 kB 
Bounce: 0 kB 
CommitLimit: 4624368 kB 
Committed_AS: 368064 kB 
VmallocTotal: 34359738367 kB 
VmallocUsed: 3144 kB 
VmallocChunk: 34359735183 kB 

Help!!! 

Gareth. 

-- 
Managing Director, Encryptec Limited 
Tel: 0845 25 77033, Mob: 07853 305393, Int: 00 44 1443205756 
Email: address@hidden 
Statements made are at all times subject to Encryptec's Terms and Conditions of 
Business, which are available upon request. 

----- Original Message ----- 
From: "Gareth Bult" < address@hidden > 
To: "gluster-devel" < address@hidden > 
Sent: Wednesday, January 9, 2008 3:40:49 PM (GMT) Europe/London 
Subject: [Gluster-devel] Major lock-up problem 


Hi, 

I've been developing a new system (which is now "live", hence the lack of debug information) 
and have been experiencing lots of inexplicable lock-up and pause problems with various 
components, and I've been working my way through the systems removing / fixing problems as I 
go. 

I seem to have a problem with gluster I can't nail down. 

When hitting the server with sustained (typically multi-file) writes, after a while the 
server goes into "D" state. 
If I have io-threads running on the server, only ONE process goes into "D" state. 

Trouble is, it stays in "D" state and starts to lock up other processes .. a favourite is 
"vi". 

Funny thing is, the machine is a XEN server (glusterfsd in the Dom0) and the 
XEN instances NOT using gluster are not affected. 
Some of the instances using the glusterfs are affected, depending on whether 
io-threads is used on the server. 

If I'm lucky, I kill the IO process and 5 mins later the machine springs back 
to life. 
If I'm not, I reboot. 

Anyone any ideas? 

glfs7 and tla. 

Gareth. 
_______________________________________________ 
Gluster-devel mailing list 
address@hidden 
http://lists.nongnu.org/mailman/listinfo/gluster-devel 






-- 
If I traveled to the end of the rainbow 
As Dame Fortune did intend, 
Murphy would be there to tell me 
The pot's at the other end. 

