From: Harald Stürzebecher
Subject: Re: [Gluster-devel] GlusterFS hangs/fails: Transport endpoint is not connected
Date: Tue, 25 Nov 2008 13:18:51 +0100

Hello!

<Disclaimer>
I'm just a small-scale user with only a few months' experience with
GlusterFS, so my conclusions might be totally wrong.
</Disclaimer>

2008/11/25 Fred Hucht <address@hidden>:
> Hi devels!
>
> We are considering GlusterFS as a parallel file server (8 server nodes) for our
> parallel Opteron cluster (88 nodes, ~500 cores), as well as for a unified
> nufa /scratch distributed over all nodes. We use the cluster in a
> scientific environment (theoretical physics) and run Scientific Linux with
> kernel 2.6.25.16. After similar problems with 1.3.x we installed 1.4.0qa61
> and set up a /scratch for testing using the following script
> "glusterconf.sh", which runs locally on all nodes at startup and writes the two
> config files /usr/local/etc/glusterfs-{server,client}.vol:

[...]

> The cluster uses MPI over Infiniband, while GlusterFS runs over TCP/IP
> Gigabit Ethernet. I use FUSE 2.7.4 with the patch fuse-2.7.3glfs10.diff
> (is that OK? The patch applied successfully.)

Interesting setup, not using Infiniband for GlusterFS. The GlusterFS
homepage says "GlusterFS can sustain 1 GB/s per storage brick over
Infiniband RDMA". Personally, I'd like to know whether you tried it at
some point and deliberately chose not to use it.
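
In case you want to try it: as far as I remember, with the 1.3-style
volfile syntax the switch is mostly a matter of the transport-type
options. I have not verified the exact option names against 1.4.0qa61,
and the volume/brick names below are only placeholders, so please check
the current examples first:

    # glusterfs-client.vol - one protocol/client volume per server node
    volume node01
      type protocol/client
      option transport-type ib-verbs/client   # instead of tcp/client
      option remote-host node01
      option remote-subvolume brick
    end-volume

plus the matching "option transport-type ib-verbs/server" in the
protocol/server volume on the server side.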

> Everything is fine until some nodes which are used by a job block on access
> to /scratch or, sometimes later, give
>
> df: `/scratch': Transport endpoint is not connected
>
> The glusterfs.log on node36 is flooded with
>

[...]

> On node68 I find

[...]

> The third affected node node77 says:

[...]

> As I said, similar problems occurred with version 1.3.x. If these problems
> cannot be solved, we will have to use a different file system, so any help
> is much appreciated.

If I read that correctly, only three nodes out of 88 are affected by
this problem. In that case I'd look for hardware problems first. Do you
have an easy way to check your network connections for e.g. packet
loss? (A quick check is sketched below.)
Increasing the client timeouts might help as a workaround until the
real problem is found and fixed (see the second sketch below).
Additionally, I'd like to suggest running a test over Infiniband - if
possible - to rule out any Ethernet-related problems.
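
A quick check for packet loss could be run from one of the affected
nodes (node36, node68, node77); the hostnames below are just
placeholders for your eight storage servers:

    # run on an affected node; prints only the ping statistics per server
    for host in server1 server2 server3 server4; do
        echo "=== $host ==="
        ping -c 1000 -q "$host" | tail -n 2
    done

Anything other than 0% packet loss on a quiet Gigabit link would make
me suspicious. Looking at the NIC error counters ("ethtool -S eth0",
"ifconfig") and the switch port statistics might also be worthwhile.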
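
Regarding the timeouts: if I remember correctly, the protocol/client
volumes accept a transport-timeout option (in seconds), so adding a
line like the one below to each client volume might buy you some time.
I'm not sure whether the option name or its default changed in 1.4, so
please double-check against the 1.4 documentation; the value is only a
guess:

    volume node01
      type protocol/client
      option transport-type tcp/client
      option remote-host node01
      option remote-subvolume brick
      option transport-timeout 120   # seconds; value is only a guess
    end-volume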


Harald Stürzebecher



