SUR Blue Gene Documentation

From ScorecWiki

Revision as of 15:59, 30 October 2012


Quick Shutdown Procedure

  • Shut off the compute rack by closing the breaker on the top left corner.

Restart Procedure (Quick Shutdown)

Applies to restarting after a quick shutdown (e.g. due to a cooling loss) or a power outage during which the supporting equipment rack remained up. Separate pages cover the full system shutdown and restart procedures in more detail.

Starting Software Management Processes

1. Make sure that DB2 is up on the service node (Sn):

su - bglsysdb
ps -ef | grep db2      # a list of DB2 process IDs should be displayed
db2start               # if none are shown, start the database manager

2. If the system was not cleanly shut down (i.e. power was unexpectedly lost), the database must be told that the hardware is missing before it tries to find it:

. /discovery/db.src
db2 "update bglidochip set status='M'"

3. Start the bare-metal processes (order does not matter; run as root):

. /bgl/BlueLight/ppcfloor/bglsys/discovery/db.src
/bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start
/discovery/SystemController start

Closing Service Actions

If any service actions were open before the machine was powered down, they will have to be closed. This is done, as root, with the EndServiceAction command in /discovery, followed by the service action ID:

/discovery/EndServiceAction <ID>

Discovering the Hardware

1. Execute as root:

/discovery/Discovery0 start
# When Discovery0 finishes, start Discovery1; run the four DiscoveryN
# processes one at a time until all of them complete.
/discovery/PostDiscovery start

The progress of the DiscoveryN processes can be monitored by counting how many components in the database have been found (replace R0 with the rack currently running discovery):

db2 "select count(*) from bglnode where status='A' and location like 'R0%'"

Note: 1 rack should have 1088 components.
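Rather than rerunning the query by hand, it can be wrapped in a small polling loop. A sketch only: the `wait_for_discovery` name and the 60-second interval are illustrative, and it assumes db.src has been sourced so the db2 CLI is available (`db2 -x` prints the bare value without column headings).

```shell
# Poll the database until all 1088 components of a rack are active.
# Sketch: assumes db.src has been sourced so 'db2' works in this shell.
wait_for_discovery() {
    rack=$1                      # e.g. R0
    while :; do
        n=$(db2 -x "select count(*) from bglnode where status='A' and location like '${rack}%'")
        echo "components found: $n"
        [ "${n:-0}" -ge 1088 ] && return 0
        sleep 60
    done
}
```

For example, `wait_for_discovery R0` prints the count once per minute until the rack is fully discovered.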

Finishing Startup Processes

1. When all of the Discovery processes have completed, the bare metal processes may be stopped:

/discovery/SystemController stop
/discovery/Discovery0 stop
/discovery/PostDiscovery stop

2. Restart the bglmaster process:

cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster restart

3. Open mmcs_db_console and run pgood on the link cards:

cd /bgl/BlueLight/ppcfloor/bglsys/bin
./mmcs_db_console
pgood_linkcards all


SLURM requires that munge be running on both the service node (Sn) and the front-end node (Fen); munge should start automatically when the machine boots. Check that it is indeed running (run as root):

/etc/init.d/munge status
Checking for MUNGE: running
# or
ps auxww | grep munge|grep -v grep
daemon    8647  0.0  0.0   8332  2284 ?        Sl   Jun25   0:20 /bgl/local/munge/sbin/munged
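The two chained greps above can be collapsed with the usual bracket trick, which keeps the grep command from matching itself. A sketch helper (the `munge_running` name is illustrative):

```shell
# Return success if a 'ps auxww' listing on stdin contains a munged process.
# The [m] bracket keeps the grep command itself out of the results.
munge_running() {
    grep -q '[m]unged'
}
```

Usage: `ps auxww | munge_running || echo "munge is down"`.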

Now, start the SLURM controller on the service node as the slurm user:

su - slurm
slurmctld

Note: start slurmctld with the -c flag if you want it to clear out any jobs it may have been running when the system was last up (this should not affect jobs that were merely in the queue).

Now, start slurmd on the front end node (or wherever user jobs are to be run from) as root:

slurmd
Again, starting slurmd with the -c flag will clear out any job state data that was present from previous runs.

Resume the two base partitions in SLURM, assuming we're coming back from a quick shutdown:

scontrol update nodename=bp000 state=resume
scontrol update nodename=bp001 state=resume
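The two resumes can equally be done in one loop; a sketch (the `resume_base_partitions` name is illustrative, the node names are this system's):

```shell
# Resume both base partitions after a quick shutdown.
resume_base_partitions() {
    for bp in bp000 bp001; do
        scontrol update nodename="$bp" state=resume
    done
}
```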

Check that the system is running and ready to accept jobs:

sinfo

This concludes the startup procedure for the Blue Gene/L system. Try running a simple job to check that the system is fully functional.

Tuesday/Thursday Debug Days

Two cron jobs to remove/add bp000 to/from the normal queue:

scontrol update partitionname=normal nodes=bp001
scontrol update partitionname=normal nodes=bp[000-001]
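The crontab entries themselves are not shown on this page; a sketch of what they might look like (the 8am/5pm Tuesday/Thursday times are assumptions, not taken from the source):

```shell
# Illustrative crontab entries (cron fields: minute hour day month weekday).
# Drop bp000 from the normal partition before debug time on Tue/Thu,
# restore it afterwards.
0 8  * * 2,4 scontrol update partitionname=normal nodes=bp001
0 17 * * 2,4 scontrol update partitionname=normal nodes=bp[000-001]
```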