SUR Blue Gene Documentation


Quick Shutdown Procedure

  • Shut off the compute rack by closing the breaker on the top left corner.

Restart Procedure (Quick Shutdown)

This applies to restarting from a quick shutdown (for example, due to a cooling loss) or a power outage in which the supporting equipment rack has remained up.

https://linux-software.arc.rpi.edu/nicwiki/index.php/Systems/SURBlueGenePowerdown and https://linux-software.arc.rpi.edu/nicwiki/index.php/Systems/SURBlueGenePowerup have more details on full system shutdown/restart.


Starting Software Management Processes

1. Make sure that DB2 on the service node (Sn) is up:

su - bglsysdb
db2_local_ps
# A list of DB2 process IDs should be displayed; if not, start the instance:
db2start
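
The check and conditional start can be combined in one line; a minimal sketch, run as bglsysdb, that assumes the db2sysc engine process appears in the db2_local_ps listing whenever the instance is up:

# Start the instance only if no db2sysc engine process is listed.
db2_local_ps | grep -q db2sysc || db2start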

2. If the system was not shut down cleanly (i.e., power was unexpectedly lost), the database must be told that the hardware is missing before it tries to start finding it:

. /discovery/db.src
db2 "update bglidochip set status='M'"

3. Start the bare-metal processes (order does not matter; run as root):

. /bgl/BlueLight/ppcfloor/bglsys/discovery/db.src
/bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start
/discovery/SystemController start

Closing Service Actions

If any service actions were open when the machine was powered down, they will have to be closed. As root, run the EndServiceAction command in /discovery followed by the service action ID:

/discovery/EndServiceAction <ID>
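
If several service actions are open, they can be closed in one pass; a minimal sketch, where the IDs (3 and 4) are placeholders for the actual open service action IDs:

# Close each open service action by ID (IDs shown are placeholders).
for id in 3 4; do
    /discovery/EndServiceAction $id
done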

Discovering the Hardware

1. Execute as root:

/discovery/Discovery0 start
# When Discovery0 finishes, start Discovery1; run Discovery0 through Discovery3 one at a time until all four complete.
/discovery/PostDiscovery start

The progress of the DiscoveryN processes can be monitored by counting how many components in the database have been found (replace R0 with the row currently running discovery):

db2 "select count(*) from bglnode where status='A' and location like 'R0%'"

Note: 1 rack should have 1088 components.
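
The count query can be polled while each DiscoveryN process runs; a minimal sketch that repeats it every 30 seconds (replace R0 as before; stop it with Ctrl-C):

# Poll the discovered-component count for rack R0 every 30 seconds.
while true; do
    db2 "select count(*) from bglnode where status='A' and location like 'R0%'"
    sleep 30
done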

Finishing Startup Processes

1. When all of the Discovery processes have completed, the bare-metal processes may be stopped:

/discovery/SystemController stop
/discovery/DiscoveryN stop # N is {0,1,2,3}
/discovery/PostDiscovery stop
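
The four DiscoveryN stop commands can also be issued in a loop; a minimal sketch expanding N over 0-3 as noted above:

# Stop Discovery0 through Discovery3 in turn.
for n in 0 1 2 3; do
    /discovery/Discovery${n} stop
done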

2. Restart the bglmaster process:

cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster restart

3. Open mmcs_db_console and run pgood on the link cards:

cd /bgl/BlueLight/ppcfloor/bglsys/bin
./mmcs_db_console
pgood_linkcards all

SLURM

SLURM requires that MUNGE be running on both the service node (Sn) and the front end node (Fen); munge should start automatically when the machine boots. Check that it is indeed running (run as root):

/etc/init.d/munge status
Checking for MUNGE: running
# or
ps auxww | grep munge | grep -v grep
daemon    8647  0.0  0.0   8332  2284 ?        Sl   Jun25   0:20 /bgl/local/munge/sbin/munged
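
If munge did not come up at boot, it can be started through the same init script; a minimal sketch, assuming the script follows the usual convention of returning non-zero from status when the daemon is down:

# Start munge only if the status check reports it is not running (run on both Sn and Fen).
/etc/init.d/munge status || /etc/init.d/munge start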

Now, start the SLURM controller on the service node as the slurm user:

su - slurm
/bgl/local/slurm/sbin/slurmctld

Note that you should start slurmctld with the -c flag if you want it to clear out any jobs that were running when the system was last up (this should not affect jobs that were merely queued).
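
For example, a sketch of bringing the controller up with the old running-job state cleared, combining the commands and flag above:

su - slurm
/bgl/local/slurm/sbin/slurmctld -c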

Now, start slurmd on the front end node (or wherever user jobs are to be run from) as root:

/bgl/local/slurm/sbin/slurmd

Again, starting slurmd with the -c flag will clear out any job state data that was present from previous runs.
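
Similarly, a sketch of starting slurmd with stale job state cleared (as root on the front end node):

/bgl/local/slurm/sbin/slurmd -c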


Resume the two base partitions in SLURM, assuming we're coming back from a quick shutdown:

scontrol update nodename=bp000 state=resume
scontrol update nodename=bp001 state=resume


Check that the system is running and ready to accept jobs:

/bgl/local/slurm/bin/sinfo

Conclusion

This concludes the startup procedure for the Blue Gene/L system. Try running a simple job to check that the system is fully functional.
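
A minimal sketch of such a test, assuming srun lives alongside sinfo in /bgl/local/slurm/bin and that the normal partition used in the debug-day notes below is available; the executable path is a placeholder for a small program built for the compute nodes:

# <test-binary> is a placeholder; -N1 requests one SLURM node (one base partition here).
/bgl/local/slurm/bin/srun -N1 -p normal <test-binary>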


Tuesday/Thursday Debug Days

Two cron jobs remove bp000 from the normal partition and add it back, using scontrol update as above:

scontrol update partitionname=normal nodes=bp001
scontrol update partitionname=normal nodes=bp[000-001]
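
A minimal crontab sketch for those jobs; the times, days, and scontrol path shown here are illustrative assumptions, not the actual debug-day schedule:

# Hypothetical schedule: pull bp000 out of the normal partition at 08:00 and
# return it at 17:00 on Tuesdays (2) and Thursdays (4).
0 8  * * 2,4  /bgl/local/slurm/bin/scontrol update partitionname=normal nodes=bp001
0 17 * * 2,4  /bgl/local/slurm/bin/scontrol update partitionname=normal nodes=bp[000-001]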