SUR Blue Gene Documentation
Quick Shutdown Procedure
- Shut off the compute rack by closing the breaker on the top left corner.
Restart Procedure (Quick Shutdown)
This procedure applies to restarting from a quick shutdown (e.g., due to a cooling loss) or a power outage in which the supporting equipment rack has remained up.
See https://linux-software.arc.rpi.edu/nicwiki/index.php/Systems/SURBlueGenePowerdown and https://linux-software.arc.rpi.edu/nicwiki/index.php/Systems/SURBlueGenePowerup for details on a full system shutdown and restart.
Starting Software Management Processes
1. Make sure that DB2 on the service node (Sn) is up:
su - bglsysdb
db2_local_ps    # a list of process IDs should be displayed; if not, run db2start
2. If the system was not cleanly shut down (i.e., power was unexpectedly lost), the database must first be told that the hardware is missing before discovery tries to find it again:
. /discovery/db.src
db2 "update bglidochip set status='M'"
3. Start the bare-metal processes (order does not matter; perform as root):
. /bgl/BlueLight/ppcfloor/bglsys/discovery/db.src
/bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start
/discovery/SystemController start
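Optionally, a quick process listing can confirm that the bare-metal processes came up (a sketch; it only checks for the process names started above):
ps auxww | egrep 'bglmaster|SystemController' | grep -v grep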
Closing Service Actions
If any service actions were open when the machine was powered down, they will have to be closed. This is done, as root, with the EndServiceAction command in /discovery, followed by the service action ID:
/discovery/EndServiceAction <ID>
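If several service actions were left open, a short shell loop can close them in sequence (a sketch; the IDs shown are placeholders, not real service action IDs):
for id in 101 102; do /discovery/EndServiceAction $id; done    # substitute the actual service action IDs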
Discovering the Hardware
1. Execute as root:
/discovery/Discovery0 start    # when Discovery0 finishes, start Discovery1; run the four DiscoveryN processes one at a time until they all complete
/discovery/PostDiscovery start
The progress of the DiscoveryN processes can be monitored by counting how many components in the database have been found (replace R0 with the row currently running discovery):
db2 "select count(*) from bglnode where status='A' and location like 'R0%'"
Note: 1 rack should have 1088 components.
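To avoid re-running the query by hand, a small polling loop can watch the count approach 1088 (a sketch; assumes db.src has been sourced and uses row R0 as an example):
while true; do
    db2 "select count(*) from bglnode where status='A' and location like 'R0%'"
    sleep 30    # re-check every 30 seconds
done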
Finishing Startup Processes
1. When all of the Discovery processes have completed, the bare-metal processes may be stopped:
/discovery/SystemController stop
/discovery/Discovery0 stop
/discovery/PostDiscovery stop
2. Restart the bglmaster process:
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster restart
3. Open mmcs_db_console and run pgood on the link cards:
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./mmcs_db_console
pgood_linkcards all    # run at the mmcs_db_console prompt
SLURM
SLURM requires that MUNGE be running on both the service node (Sn) and the front-end node (Fen); munge should start automatically when the machine boots. Check that it is indeed running (run as root):
/etc/init.d/munge status
Checking for MUNGE: running
# or
ps auxww | grep munge | grep -v grep
daemon    8647  0.0  0.0   8332  2284 ?  Sl  Jun25  0:20 /bgl/local/munge/sbin/munged
Now, start the SLURM controller on the service node as the slurm user:
su - slurm
/bgl/local/slurm/sbin/slurmctld
Note that you should start slurmctld with the -c flag if you want it to clear out any jobs that were running when the system was last up (this should not affect jobs that were merely queued).
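For example, to start the controller and discard stale job state in one go:
su - slurm
/bgl/local/slurm/sbin/slurmctld -c    # -c clears job state left over from the previous run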
Now, start slurmd on the front-end node (or wherever user jobs are to be run from) as root:
/bgl/local/slurm/sbin/slurmd
Again, starting slurmd with the -c flag will clear out any job state data that was present from previous runs.
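For example:
/bgl/local/slurm/sbin/slurmd -c    # discard job state data from previous runs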
Resume the two base partitions in SLURM, assuming we're coming back from a quick shutdown:
scontrol update nodename=bp000 state=resume
scontrol update nodename=bp001 state=resume
Check that the system is running and ready to accept jobs:
/bgl/local/slurm/bin/sinfo
Conclusion
This concludes the startup procedure for the Blue Gene/L system. Try running a simple job to check that the system is fully functional.
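One way to do this is a trivial SLURM job (a sketch only; the srun path mirrors the sinfo path above, and the actual launch method for compute jobs on this system may differ):
/bgl/local/slurm/bin/srun -N1 /bin/hostname    # hypothetical smoke test on a single base partition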
Tuesday/Thursday Debug Days
Two cron jobs remove bp000 from, and add it back to, the normal partition:
scontrol update partitionname=normal nodes=bp001    # remove bp000 (leave only bp001 in the normal partition)
scontrol update partitionname=normal nodes=bp[000-001]    # restore both base partitions
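A hedged sketch of what the corresponding crontab entries might look like (days, times, and the scontrol path are assumptions, not taken from this page):
# drop bp000 from the normal partition on Tuesday/Thursday mornings for debugging, restore it in the evening
0 8 * * 2,4 /bgl/local/slurm/bin/scontrol update partitionname=normal nodes=bp001
0 18 * * 2,4 /bgl/local/slurm/bin/scontrol update partitionname=normal nodes=bp[000-001]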