
[Freeipmi-devel] Information requested By Pradhap on 4-Aug-2004


From: anand.manian
Subject: [Freeipmi-devel] Information requested By Pradhap on 4-Aug-2004
Date: Thu, 5 Aug 2004 13:28:30 -0400

hi AB,

   Pradhap left me a voicemail yesterday asking me to verify that the
"portblocker" program is being run on the topnode (master node) that initiates
the rsh commands, and not just on the compute nodes.

   The question probably arose because at some point it was mentioned that
running dpcproxy may be sufficient to hold the reserved ports open, thereby
keeping rsh away from them.

Since much time has passed, I thought it best to re-do the test before
answering this query. Also, with nodes now being placed in production, my
test base has shrunk from a full cluster (which has since been shipped to
Italy) to the last 6 nodes (n43..n48) of cluster gvlhpctw06 in Greenville.

At this time, nodes n1..n42 of gvlhpctw06 are running with the BMC turned OFF
(using the SSU program: IP address set to 0.0.0.0, and the SOL and LAN access
modes set to "Disabled").

The last 6 nodes, n43..n48, have the BMC on (IP addresses on the BMC set to
match the eth1 interface, and the SOL + LAN access modes set to "Always
available").

Additionally, nodes n43..n48 run a modified version of the "portblocker"
program. This version, called bmcfenced, accepts multiple port numbers on the
command line and has the scripting necessary to start/stop it via the service
command. The attachment contains the sources of both the original
"portblocker" and my modified version.

My test:
========
Telnet to n43 and execute the commands below:

address@hidden hpcadmin]$ NODES="n44 n45 n46 n47 n48"
address@hidden hpcadmin]$ for node in $NODES
> do
>    rsh $node hostname
> done
n44.twulf06.cluster
n45.twulf06.cluster
n46.twulf06.cluster
n47.twulf06.cluster
n48.twulf06.cluster
address@hidden hpcadmin]$

The very first time I tried this, a long pause occurred between the n46 and
n47 hostnames appearing on screen.

I looked at /var/log/messages on the two nodes:

n47, where the "hiccup" occurred:
=================================
Aug  5 11:45:35 n47 pam_rhosts_auth[30126]: allowed to
address@hidden as hpcadmin
Aug  5 11:46:35 n47 rsh(pam_unix)[30126]: session opened for user hpcadmin
by (uid=0)
Aug  5 11:46:35 n47 rsh(pam_unix)[30126]: session closed for user hpcadmin

n46, the node where rsh worked fine just before n47:
=====================================================
Aug  5 11:45:35 n46 pam_rhosts_auth[26007]: allowed to
address@hidden as hpcadmin
Aug  5 11:45:35 n46 rsh(pam_unix)[26007]: session opened for user hpcadmin
by (uid=0)
Aug  5 11:45:35 n46 rsh(pam_unix)[26007]: session closed for user hpcadmin

As you can see, exactly 60 seconds elapsed on the hung node between the
"allowed" and "session opened" messages.

Next, verified that bmcfenced is indeed running on all the nodes involved:
==========================================================================

address@hidden hpcadmin]$ ps -efwww | grep -i fence
root      5478     1  0 Jul14 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
root      5479     1  0 Jul14 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
hpcadmin  8101  8053  0 11:49 pts/0    00:00:00 grep -i fence
address@hidden hpcadmin]$ 
address@hidden hpcadmin]$ 
address@hidden hpcadmin]$ for node in ${NODES}
> do
>   rsh $node "hostname; ps -efwww | grep bmc"
>   echo ---------
> done
n44.twulf06.cluster
root      3569     1  0 Jul14 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
root      3570     1  0 Jul14 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
hpcadmin 29966 29965  0 12:29 ?        00:00:00 bash -c hostname; ps -efwww | grep bmc
hpcadmin 29975 29966  0 12:29 ?        00:00:00 grep bmc
---------
n45.twulf06.cluster
root     12104     1  0 Jul14 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
root     12105     1  0 Jul14 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
hpcadmin  7210  7209  0 12:29 ?        00:00:00 bash -c hostname; ps -efwww | grep bmc
hpcadmin  7219  7210  0 12:29 ?        00:00:00 grep bmc
---------
n46.twulf06.cluster
root       583     1  0 Jul20 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
root       584     1  0 Jul20 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
hpcadmin 26112 26111  0 12:29 ?        00:00:00 bash -c hostname; ps -efwww | grep bmc
hpcadmin 26121 26112  0 12:29 ?        00:00:00 grep bmc
---------
n47.twulf06.cluster
root     12104     1  0 Jul14 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
root     12105     1  0 Jul14 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
hpcadmin 30232 30231  0 12:29 ?        00:00:00 bash -c hostname; ps -efwww | grep bmc
hpcadmin 30241 30232  0 12:29 ?        00:00:00 grep bmc
---------
n48.twulf06.cluster
root     11246     1  0 Jul14 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
root     11247     1  0 Jul14 ?        00:00:00 /usr/local/sbin/bmcfenced 623 664
hpcadmin 27533 27532  0 12:29 ?        00:00:00 bash -c hostname; ps -efwww | grep bmc
hpcadmin 27542 27533  0 12:29 ?        00:00:00 grep bmc
---------
address@hidden hpcadmin]$ 
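
As an aside, a tighter check than ps would be to confirm that ports 623 and
664 are actually being held. A hypothetical helper (not something I ran, and
not part of the attachment) could simply try to bind the two ports itself: if
bmcfenced has them, bind() fails with EADDRINUSE.

/* checkports.c -- hypothetical cross-check, needs root for ports < 1024.
 * On a node where bmcfenced is up it should print "held" for both ports. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

static void check(unsigned short port)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sin;

    if (s < 0) {
        perror("socket");
        return;
    }
    memset(&sin, 0, sizeof(sin));
    sin.sin_family      = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port        = htons(port);

    /* EADDRINUSE here means some process already owns the port */
    if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0 && errno == EADDRINUSE)
        printf("port %u: held (good)\n", port);
    else
        printf("port %u: NOT held\n", port);
    close(s);
}

int main(void)
{
    check(623);
    check(664);
    return 0;
}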

Attempting on another set of nodes, without BMC enabled:
========================================================
address@hidden hpcadmin]$ hostname
n37.twulf06.cluster
address@hidden hpcadmin]$ alias r
alias r='for node in n38 n39 n40 n41 n42; do rsh $node hostname;done'
address@hidden hpcadmin]$ r
n38.twulf06.cluster
n39.twulf06.cluster
n40.twulf06.cluster
n41.twulf06.cluster
n42.twulf06.cluster
address@hidden hpcadmin]$

I repeated the above command (aliased to r) 200 more times in quick
succession, as rapidly as I could type r and hit RETURN, and saw no hang.

Some thoughts:
==============

* I believe this issue surfaces after a sufficient number of rsh attempts
  have been initiated, and is perhaps not related to the speed at which these
  rsh commands are issued. At GE, rsh is used heavily since all MPI libraries
  and user job-launch/query scripts are rsh based. After a reboot, it has
  been seen that for a while there are no hangs, and after that they occur
  quite frequently. (See the sketch after these notes for why the number of
  attempts, rather than the rate, would matter.)

* Error messages that accompany socket oversubscription are not seen during
  the hang, so I am persuaded that it is not an issue caused by a lack of
  sockets. (This line of thought was pursued for a while by Pradhap/Bala.)

* Since I built (and currently control) the OS image on the CDC nodes, I am
  pretty sure that no changes to the xinetd parameters were made. The install
  was carried out as below:
    - RH-7.3 CDs loaded on a system for network install
    - CDC node plugged into the network using an old Compaq NIC (for driver
      compatibility)
    - Boot with the BOOTNET.IMG floppy and carry out "Server Installation"
    - Changed default package selections as follows:
                no Standard X config
                Add NFS file server, Add FTP server
                Add kernel sources (under Individual packages)
                Select "No Firewall"
    - Bring in the kernel sources (sent earlier to Bala) and build/install
      the kernel + modules
    - Bring in the sources for e1000.5.0.43 and build/install (sent to Bala
      earlier)
    - Reboot, remove the "Old NIC", and get rid of its config with kudzu

   If you proceed along the lines listed above, you will end up with pretty
   much the same system as we have here.

Please let me know if I can be of any further assistance.

-Anand Manian
 864-254-3405

Attachment: bmcfenced.tgz
Description: Binary data

