Using ASC Purple Exercise

For this workshop, we will be using an IBM Power5 (p5 575) system called "uP". uP is LC's 108-node unclassified Purple system. Each node has 8 processors and 32 GB of memory.

uP is an actual LC production system, which normally has a 99-node pbatch pool and a 2-node pdebug pool. For this workshop, a special pool has been configured to prevent competition with real users.

  1. Log in to uP

    Workshops differ in how this is done. The instructor will go over this beforehand.

  2. Review the login banner information

    Notice any announcements and news items. Try reading a news announcement, such as news large_pages or news dat_up.

  3. Check uP's configuration and job information

    Try any/all of the following commands. See the respective man pages if you have questions.

    spjstat
    squeue
    sinfo
    ju
    pstat -m up
    news job.lim.up

  4. Copy the lab exercise files

    1. In your home directory, create a subdirectory for the lab exercise codes and then cd to it.
      mkdir purple
      cd purple

    2. Then copy the exercise files to your purple subdirectory:
      cp  /usr/global/docs/training/blaise/purple/*   ~/purple

  5. List the contents of your purple subdirectory

    You should have the following files:

    File(s)                             Description
    hello.c, hello.f                    Simple MPI program which prints a task's rank and hostname
    bandwidth.c, bandwidth.f            MPI communications bandwidth test between two tasks only
    smp_bandwidth.c, smp_bandwidth.f    MPI communications bandwidth test between any even number of tasks
    par_io.c                            Parallel I/O example
    prog1, prog2, prog3, prog4          Simple shell scripts used for MPMD mode
    jobscript.example, batchbugs,       Batch exercise files
    batchbugs.fix, batchhang,
    hangme.c, hangme

  6. Determine which pool you will be using for the workshop

    Use either the spjstat or ju command as done previously to display the available pools. Which pool looks like it should be used for the class? Remember the name of this pool for later.

  7. Compile the hello program

    Depending upon your language preference, use one of the IBM parallel compilers to compile the hello program. Notice that this is a very simple compilation, explicitly requesting large pages (-blpdata), 64-bit mode (-q64), and level 2 optimization (-O2).

    C:
    mpxlc -blpdata -q64 -O2 -o hello hello.c
    Fortran:
    mpxlf -blpdata -q64 -O2 -o hello hello.f 
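
    For reference, a minimal MPI "hello" along these lines might look like the sketch below. This is only a sketch for orientation - the hello.c provided with the exercise files may differ in detail.

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char *argv[])
      {
          int rank, ntasks, len;
          char host[MPI_MAX_PROCESSOR_NAME];

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this task's rank       */
          MPI_Comm_size(MPI_COMM_WORLD, &ntasks);  /* total number of tasks  */
          MPI_Get_processor_name(host, &len);      /* node this task runs on */

          if (rank == 0)
              printf("Total number of tasks = %d\n", ntasks);
          printf("Hello! From task %d on host %s\n", rank, host);

          MPI_Finalize();
          return 0;
      }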

  8. Set up your POE environment

    In this step you'll set a few POE environment variables. Specifically, those which answer the three questions:

    • How many nodes do I need?
    • How many tasks do I need?
    • Which pool should I use?

    Set the following environment variables as shown. We'll accept the default POE settings for everything else.

    Environment Variable Setting    Description
    setenv MP_NODES 2               Request 2 nodes
    setenv MP_PROCS 8               Request 8 MPI tasks (processes)
    setenv MP_RMPOOL pclass         The workshop pool you determined previously with spjstat or ju

  9. Run your hello executable

    1. This is the simple part. Just issue the command:

      hello

    2. Provided that everything is working and set up correctly, you should receive output that looks something like the sample below (your node names may vary, and since POE interleaves stdout from all tasks, so may the line order).
      0:Total number of tasks = 8 
      0:Hello! From task 0 on host up037
      4:Hello! From task 4 on host up040
      1:Hello! From task 1 on host up037
      5:Hello! From task 5 on host up040
      2:Hello! From task 2 on host up037
      6:Hello! From task 6 on host up040
      3:Hello! From task 3 on host up037
      7:Hello! From task 7 on host up040
      

  10. Maximize your use of all 8 CPUs on a node

    The previous step used only 4 CPUs on each of 2 nodes. To make better use of the SMP nodes, try the following:

    1. Run 8 hello tasks on each of 2 nodes. Three different ways to do this are shown below, all of which use command line flags. The corresponding environment variables could be used instead. See the POE man page for details.

      Method 1: Specify POE flags for number of nodes and number of tasks:

      hello -nodes 2 -procs 16

      Method 2: Specify POE flags for number of tasks per node and number of tasks:

      hello -tasks_per_node 8 -procs 16

      Method 3: Specify POE flags for number of nodes and number of tasks per node:

      unsetenv MP_PROCS
      hello -nodes 2 -tasks_per_node 8

  11. Try the bandwidth exercise code

    1. Depending upon your language preference, compile the bandwidth source file as shown:

      C:
      mpxlc -blpdata -q64 -O2 -o bandwidth bandwidth.c
      Fortran:
      mpxlf -blpdata -q64 -O2 -o bandwidth bandwidth.f 

    2. This example only uses two tasks, but we want them to be on different nodes to test internode communication bandwidth. So:
      setenv MP_PROCS 2 
      setenv MP_NODES 2

    3. Run the executable:
      bandwidth

      Note: It is very possible that when you try this step, you will get an error message like the one below. This happens because others in the workshop are using nodes in the small workshop pool at the same time as you. If you get this error message, just try again in a few moments when the nodes are free.

      SLURMERROR: slurm_allocate_resources: Requested nodes are busy
      ERROR: 0031-362 Unexpected return code -5 from ll_request

      As the program runs, it will display the effective communications bandwidth between the two nodes over the HPS switch fabric. The output should look something like the sample below (a sketch of the ping-pong loop behind these numbers follows the sample output):

      Sample output from bandwidth example (C version)
         0:
         0:****** MPI/POE Bandwidth Test ******
         0:Message start size= 100000 bytes
         0:Message finish size= 2000000 bytes
         0:Incremented by 100000 bytes per iteration
         0:Roundtrips per iteration= 1000
         0:Task 0 running on: up037
         0:Task 1 running on: up040
         0:
         0:Message Size   Bandwidth (bytes/sec)
         0:   100000     1.284070e+09
         0:   200000     1.502403e+09
         0:   300000     1.573937e+09
         0:   400000     1.628769e+09
         0:   500000     1.665738e+09
         0:   600000     1.673687e+09
         0:   700000     1.687317e+09
         0:   800000     1.701772e+09
         0:   900000     1.717159e+09
         0:  1000000     1.726645e+09
         0:  1100000     1.734018e+09
         0:  1200000     1.734382e+09
         0:  1300000     1.744496e+09
         0:  1400000     1.744378e+09
         0:  1500000     1.749541e+09
         0:  1600000     1.756001e+09
         0:  1700000     1.757546e+09
         0:  1800000     1.758167e+09
         0:  1900000     1.758724e+09
         0:  2000000     1.764081e+09
      

  12. Try the bandwidth code with RDMA

    1. Now, try running the executable again, but this time explicitly specify use of RDMA communications.
      setenv MP_USE_BULK_XFER yes
      bandwidth

    2. Notice the output. You should see a significant increase in bandwidth.

      Sample output from bandwidth example with RDMA (C version)
         0:
         0:****** MPI/POE Bandwidth Test ******
         0:Message start size= 100000 bytes
         0:Message finish size= 2000000 bytes
         0:Incremented by 100000 bytes per iteration
         0:Roundtrips per iteration= 1000
         0:Task 0 running on: up037
         0:Task 1 running on: up040
         0:
         0:Message Size   Bandwidth (bytes/sec)
         0:   100000     1.216868e+09
         0:   200000     1.640740e+09
         0:   300000     2.356462e+09
         0:   400000     2.525593e+09
         0:   500000     2.638849e+09
         0:   600000     2.708294e+09
         0:   700000     2.765728e+09
         0:   800000     2.809319e+09
         0:   900000     2.846926e+09
         0:  1000000     2.871639e+09
         0:  1100000     2.893212e+09
         0:  1200000     2.913451e+09
         0:  1300000     2.923798e+09
         0:  1400000     2.937274e+09
         0:  1500000     2.948080e+09
         0:  1600000     2.955056e+09
         0:  1700000     2.960913e+09
         0:  1800000     2.963961e+09
         0:  1900000     2.971846e+09
         0:  2000000     2.976140e+09
      

  13. Determine per-task communication bandwidth behavior

    In this exercise, pairs of tasks, located on two different nodes, will communicate with each other.

    1. Compile the code:

      C:
      mpxlc -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.c 
      Fortran:
      mpxlf -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.f  

    2. Then use the smp_bandwidth code to determine per-task bandwidth characteristics on an SMP node:
      smp_bandwidth -nodes 2 -procs 2
      smp_bandwidth -nodes 2 -procs 4
      smp_bandwidth -nodes 2 -procs 8
      smp_bandwidth -nodes 2 -procs 16
      
      What happens to the average per-task bandwidth as the number of tasks increases? How about the aggregate bandwidth per node?
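
      One way to picture the pairing: with block task placement, the first half of the ranks lands on one node and the second half on the other, so rank i would exchange messages with rank i + ntasks/2. The helper below illustrates that scheme; it is an assumption made for illustration, and the actual pairing in smp_bandwidth.c may differ.

        /* Hypothetical pairing for a two-node test with block placement:
           ranks 0..n/2-1 on node 1 each talk to a partner on node 2. */
        int partner(int rank, int ntasks)
        {
            return (rank < ntasks / 2) ? rank + ntasks / 2
                                       : rank - ntasks / 2;
        }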

  14. Optimize intranode communication bandwidth

    When all of the task communications occur "on-node", it is possible to improve the effective per-task bandwidth by using shared memory instead of the switch network.

    1. First use shared memory and note the per-task bandwidth:

      setenv MP_SHARED_MEMORY yes
      smp_bandwidth -nodes 1 -procs 8

    2. Now try it without shared memory (using the switch network):

      setenv MP_SHARED_MEMORY no
      smp_bandwidth -nodes 1 -procs 8

      What differences do you notice?

  15. Generate diagnostic/statistical information for your run

    1. POE provides several environment variables / command flags that collect diagnostic and statistical information about a job's run. Three of the more useful ones are shown below. Try running a job after setting these as shown. Direct stdout to a file so that you can easily read the output after the job runs.
      setenv MP_SAVEHOSTFILE myhosts
      setenv MP_PRINTENV yes
      setenv MP_STATISTICS print
      bandwidth -nodes 2 -procs 2  > myoutput
      
    2. After the job completes, examine both the myhosts file and myoutput file. The MP_PRINTENV environment variable can be particularly useful for troubleshooting since it tells you all of the POE environment variable settings. See the POE man page if you have any questions.

    3. Be sure to unset these variables when you're done to prevent cluttering your screen with their output for the remaining exercises.
      unsetenv MP_SAVEHOSTFILE
      unsetenv MP_PRINTENV
      unsetenv MP_STATISTICS
      

  16. Compile and run a job using parallel I/O, then copy its output to HPSS storage

    1. First, you will need to edit par_io.c and change the line that reads:
      static char filename[] = "/p/gup1/class01/par_io.output";
      Instead of using class01, use your workshop userid, which appears on your OTP token.

    2. Compile the file and then run it:
      mpxlc -blpdata -q64 -O2 -o par_io par_io.c
      par_io -nodes 1 -procs 8

    3. After it finishes, check your GPFS parallel file system directory for the output file whose path you edited above. You should have a 32MB file. (A sketch of how such a file can be written in parallel appears at the end of this step.)

      Transfer your output file to storage and then delete your GPFS file. A sample session to accomplish this is shown below; the commands you type follow the shell and ftp prompts.

      up041{class01}61: cd /p/gup1/class01
      /p/gup1/class01
      up041{class01}62:  ls -l
      total 62720
      -rw-------   1 class01  class01    32000000 Jun 30 15:09 par_io.output
      up041{class01}63:  ftp storage
      Connected to toofast43.llnl.gov.
      220-NOTICE TO USERS
      220-
      
      [ blah blah blah removed ]
      
      220-
      220 toofast43 FTP server (HPSS 6.2 PFTPD V1.1.37 Thu Jun 15 10:09:51 PDT 2006) ready.
      Name (toofast43.llnl.gov:class01): just hit return 
      230 User class01 logged in as class01
      Remote system type is UNIX.
      Using binary mode to transfer files.
      ftp>  put par_io.output
      200 Command Complete (32000000, "par_io.output", 0, 1, 8388608, 0).
      200 Command Complete.
      150 Transfer starting.
      226 Transfer Complete.(moved = 32000000).
      32000000 bytes sent in 0.836 seconds (38.298 mbytes/s)
      200 Command Complete.
      ftp>  dir
      200 PORT command successful.
      150 Opening ASCII mode data connection for directory list.
      -rw-r-----   1 class01  class01 32000000 Jun 30 15:47 par_io.output
      226 Transfer complete.
      69 bytes received in 0.019 seconds (3.585 kbytes/s)
      ftp>  quit
      221 Goodbye.
      up041{class01}64:  rm par_io.output
      up041{class01}65: 
      

    4. Finally, cd back to your purple subdirectory to continue with the exercises:
      cd ~/purple
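
    For a picture of what par_io.c is doing, the sketch below shows one way a write of this shape can be expressed with MPI-IO: each of the 8 tasks writes its own 4,000,000-byte block of a single shared GPFS file, producing the 32MB total seen above. This is a sketch under assumptions - the actual par_io.c may be organized differently.

      #include <string.h>
      #include <mpi.h>

      #define BLOCK 4000000   /* bytes per task: 8 tasks x 4 MB = 32 MB */

      int main(int argc, char *argv[])
      {
          /* Edit to use your own workshop userid, as in step 16.1 */
          static char filename[] = "/p/gup1/class01/par_io.output";
          static char buf[BLOCK];
          int rank;
          MPI_File fh;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          memset(buf, 'a' + rank % 26, BLOCK);   /* task-specific fill data */

          /* All tasks open the same file; each writes at its own offset */
          MPI_File_open(MPI_COMM_WORLD, filename,
                        MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
          MPI_File_write_at(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                            MPI_BYTE, MPI_STATUS_IGNORE);
          MPI_File_close(&fh);

          MPI_Finalize();
          return 0;
      }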

  17. Try using POE's Multiple Program Multiple Data (MPMD) mode

    POE allows you to load and run different executables on different nodes. This is controlled by the MP_PGMMODEL environment variable.

    1. First, set some environment variables:

      Environment Variable Setting    Description
      setenv MP_PGMMODEL mpmd         Specify MPMD mode
      setenv MP_PROCS 4               Use 4 tasks again
      setenv MP_NODES 1               Use one node for all four tasks
      setenv MP_STDOUTMODE ordered    Sort the output by task

    2. Then, simply issue the poe command.

    3. After a moment, you will be prompted to enter your executables one at a time. Notice that the machine name where the executable will run is displayed as part of the prompt. In any order you choose, enter these four program names, one per prompt. For example:
      up041% poe
      0031-503  Enter program name and flags for each node
      0:up040> prog1
      1:up040> prog2
      2:up040> prog3
      3:up040> prog4

    4. After the last program name is entered, POE will run all four executables. Observe their different outputs. Note: these four programs are just simple shell scripts used to demonstrate how to use the MPMD programming model.

  18. Create an LCRM job control script

    1. Using your favorite text editor, create a file (name it whatever you like) that will be used to run a batch job. Your job control script should specify the following:
      • executes on the host "up"
      • runs within the workshop pool
      • uses two nodes
      • uses two tasks
      • has a time limit of 5 minutes
      • combines stdout and stderr
      • gives the job a name chosen by you
      • lists the hosts used
      • reports POE communication statistics
      • lists all of your POE environment variables
      • runs the executable bandwidth (which you created earlier)

    2. See the LCRM tutorial to assist with most of the above, in particular the Building a Job Control Script section. If you need more assistance, see the jobscript.example file provided with your other exercise files.

  19. Run your batch job

    1. Use the LCRM psub command to submit your job. For example:
      psub myjobscript
      Note the job id

    2. Check the status of your job as it queues, waits and eventually runs. Use the pstat command (several times) for this. See the pstat man page if you have questions about its output.

    3. Try the pstat -f jobid command for more info on your job.

    4. Try the pstat -m up command to view other jobs on the system. See the job detail on any job by using the pstat -f jobid command.

    5. After your job completes, check its output file. Does it show communication statistics and POE environment variables? Note the bandwidth report - how does it compare to the bandwidth numbers you obtained interactively earlier? That is, does it match the RDMA performance or not?

  20. Debug a job command script

    1. Submit the exercise script batchbugs.

    2. Monitor its progress (or lack thereof) with the pstat command. Also use the spjstat or ju commands to verify that adequate nodes are available.

    3. Figure out why it won't run and fix it. There are three problems with the script. The output file (when you get that far) should help diagnose two of them.

    4. Compare your solution to batchbugs.fix.

  21. Debug a batch job

    The primary purpose of this trivial exercise is to demonstrate that you can log in to a batch node while your job is running there, and then start a debugging session. This is just one way to debug in batch. (A sketch of a program that hangs, in the spirit of hangme.c, appears at the end of this step.)

    1. Submit the job batchhang. You may also want to review the script to make sure you understand what it does.

    2. Use the pstat command to monitor your job. When it starts to RUN, proceed to the next step.

    3. Find the node where your job is running. The squeue command can be used for this. For example:

      up041% squeue
        JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
         8565    pbatch cdp_if-l    eiur6   R    4:31:46      8 up[061-068]
         9616    pbatch checkHag   38kdcz   R    4:08:25      1 up047
         9035    pbatch cdp_if-l    88dj6   R    3:59:44      8 up[042,069-075]
         9330    pbatch Rep200_N     3kdh   R    3:56:10      8 up[033-036,105-108]
        10395    pbatch batchhan  thedude   R      26:23      1 up096
      

    4. ssh to the node where your job is running. After you log in, use the following command to verify that your job is running there:
      ps -A | grep hangme

    5. cd into your purple subdirectory. This is needed so that TotalView can find the source code for the hung program (hangme.c).

    6. Start TotalView with the totalview command. Two new TotalView windows will then appear.

    7. In the larger "New Program" window, select "Attach to an existing process" and then click on the "hangme" process. Then click OK.

    8. After a few moments, a large window will open showing the status "Thread is running". Click the Halt button. You can then see the hung program source code.

    9. In the real world, you could now begin debugging your hung program. However, this isn't a debugger workshop, so just click the Kill button to terminate the hung process.

    10. Quit TotalView: File --> Exit
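
    Incidentally, a program does not need to be exotic to hang. The sketch below hangs in the spirit of this exercise - one task blocks forever in a receive that has no matching send - though the actual hangme.c may hang in a different way.

      #include <mpi.h>

      int main(int argc, char *argv[])
      {
          int rank;
          char c;
          MPI_Status status;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          /* No task ever sends this message, so the receive blocks forever */
          if (rank != 0)
              MPI_Recv(&c, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);

          MPI_Finalize();
          return 0;
      }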

  22. Familiarize yourself with some of the other LCRM commands

    Try any/all of the following. See the LCRM tutorial or man page if you have questions - especially about the various options or output.

    • defbank
    • pshare
    • pshare -T root
    • pstat -f
    • pstat -o jid,sid,user,status,prio -s prio -m up

  23. Finally, familiarize yourself with the LC website

    1. Go to computing.llnl.gov.

    2. Notice the High Performance Computing section and the links found there.

    3. In particular, try the following links:
      • Important Notices and News - look for any news items regarding uP in the "Latest LC IBM AIX News" section.
      • OCF Machine Status - enter your workshop userid and PIN + 6 digit OTP when prompted for userid/password. Then find uP in the list of machines. Review uP's status information, and then click on the uP link for additional detailed information.
      • Computing Resources - find uP's hardware configuration information
      • Code Development - find out which compilers are installed on uP
      • Running Jobs - find the current job limits for uP
      • Training - find the "Using ASC Purple" tutorial. Notice what else is available.
      • Search (upper left corner) - look up "ASC Purple"


    This completes the exercise.

    Please complete the online evaluation form if you have not already done so for this tutorial.
