Re: [Gluster-devel] GlusterFS hangs/fails: Transport endpoint is not con

From:	Fred Hucht
Subject:	Re: [Gluster-devel] GlusterFS hangs/fails: Transport endpoint is not connected
Date:	Tue, 25 Nov 2008 13:55:33 +0100

Hello Harald!

I haven't tested the Infiniband transport so far, as I don't want to interfere with the parallel applications which are running over Infiniband. Gigabit Ethernet throughput would be sufficient for us at the moment.

Today "only" three nodes were affected; yesterday it was nine. The problems only occur on nodes running jobs that use /scratch as their working directory. We are testing the filesystem in normal operation: one user submits jobs to the queueing system which use /scratch/... as working directory. While some of his jobs run without problems, other jobs fail due to FS problems. No problems occur over the usual NFS home directory.

When I test the FS with, e.g., dd on all nodes in parallel, no problems occur.
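Since the failures only show up under real job load, a small probe run alongside the jobs could record when /scratch stops answering on a node. The following is a minimal sketch, assuming Python is available on the compute nodes; the function name `probe_mount` is hypothetical and not part of GlusterFS. It relies only on the fact that "Transport endpoint is not connected" is the system's message for the ENOTCONN error code. If the mount blocks instead of failing, the probe will block too, so it would need an external timeout.

```python
import errno
import os
import tempfile


def probe_mount(path):
    """Return 'ok', 'not-connected', or 'error: <text>' for a directory.

    Hypothetical helper for illustration; not part of GlusterFS.
    """
    try:
        # Query filesystem statistics for the path, similar to what df reports.
        os.statvfs(path)
        # Write and read back a small temporary file to exercise the data path.
        with tempfile.NamedTemporaryFile(dir=path) as handle:
            handle.write(b"probe")
            handle.flush()
            handle.seek(0)
            if handle.read() != b"probe":
                return "error: read-back mismatch"
        return "ok"
    except OSError as exc:
        # ENOTCONN is reported as "Transport endpoint is not connected".
        if exc.errno == errno.ENOTCONN:
            return "not-connected"
        return "error: " + str(exc)


if __name__ == "__main__":
    # /scratch is the mount point discussed in this thread.
    print(probe_mount("/scratch"))
```

Logging the result together with the hostname and a timestamp from a job prologue would show whether a node loses the mount before or during a job.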


Which timeout shall I increase?

Regards,

     Fred

On 25.11.2008, at 13:18, Harald Stürzebecher wrote:

Hello!

<Disclaimer>
I'm just a small-scale user with only a few months of experience with
GlusterFS, so my conclusions might be totally wrong.
</Disclaimer>

2008/11/25 Fred Hucht <address@hidden>:
Hi devels!
We are considering GlusterFS as a parallel file server (8 server nodes) for our
parallel Opteron cluster (88 nodes, ~500 cores), as well as for a unified
nufa /scratch distributed over all nodes. We use the cluster within a
scientific environment (theoretical physics) and use Scientific Linux with
kernel 2.6.25.16. After similar problems with 1.3.x we installed 1.4.0qa61
and set up a /scratch for testing using the following script
"glusterconf.sh", which runs locally on all nodes on startup and writes the two
config files /usr/local/etc/glusterfs-{server,client}.vol:
[...]
The cluster uses MPI over Infiniband, while GlusterFS runs over TCP/IP
Gigabit Ethernet. I use FUSE 2.7.4 with patch fuse-2.7.3glfs10.diff (Is that
OK? The patch succeeded)
Interesting setup, not using Infiniband for GlusterFS. The GlusterFS
homepage says "GlusterFS can sustain 1 GB/s per storage brick over
Infiniband RDMA". Personally I'd like to know if you did try it at
some time and chose not to use it?
Everything is fine until some nodes which are used by a job block on access
to /scratch or, sometimes later, give

df: `/scratch': Transport endpoint is not connected

The glusterfs.log on node36 is flooded by
[...]
On node68 I find
[...]
The third affected node node77 says:
[...]
As I said, similar problems occurred with version 1.3.x. If these problems
cannot be solved, we have to use a different file system, so any help is
very much appreciated.
If I read that correctly, there are only three nodes out of 88
affected by this problem. In that case I think I'd look for hardware
problems first. Do you have an easy way to check your network
connections for e.g. packet loss?
Increasing timeouts might help until the real problem can be found and fixed.
Additionally, I'd like to suggest running a test using Infiniband - if
possible - to rule out any Ethernet-related problems.
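One inexpensive way to look for Ethernet trouble on a Linux node is to read the per-interface error and drop counters the kernel exposes in /proc/net/dev. The sketch below is only an illustration: the function names `parse_net_dev` and `nonzero_counters` are hypothetical, and the counters show problems seen by the local network interface only, so losses inside a switch would not appear here.

```python
import os


def parse_net_dev(text):
    """Parse the content of /proc/net/dev into {interface: counters}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(value) for value in data.split()]
        # Receive columns come first (8 fields), then transmit columns.
        stats[name.strip()] = {
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
        }
    return stats


def nonzero_counters(stats):
    """Keep only interfaces that show at least one error or drop."""
    report = {}
    for name, counters in stats.items():
        bad = {key: value for key, value in counters.items() if value}
        if bad:
            report[name] = bad
    return report


if __name__ == "__main__":
    if os.path.exists("/proc/net/dev"):
        with open("/proc/net/dev") as handle:
            print(nonzero_counters(parse_net_dev(handle.read())))
```

Running this on the affected nodes and on a few healthy ones, and comparing the output, would indicate whether the failing nodes stand out at the network interface level.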


Harald Stürzebecher


Dr. Fred Hucht <address@hidden>
Institute for Theoretical Physics
University of Duisburg-Essen, 47048 Duisburg, Germany
