LCRM Exercise

  1. Login to the workshop machine

    Workshops differ in how this is done. The instructor will go over this beforehand.

  2. Copy the example files

    In your home directory, create a subdirectory for the LCRM example codes and cd to it. Then copy the example codes:

    mkdir ~/lcrm
    cd ~/lcrm
    cp /usr/global/docs/training/blaise/lcrm/* ~/lcrm

  3. List the contents of your lcrm subdirectory
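
    Use ls to list the contents of the directory:

    ls -l ~/lcrm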

    You should notice the following files:

    File Name                  Description
    lcrm1.cmd                  Simple job command script
    lcrm2.cmd                  Another simple job command script
    lcrm3.cmd                  Parallel job command script that runs the ep.B.X examples
    ep.B.4, ep.B.8, ep.B.16    Executables for lcrm3.cmd: the NAS Embarrassingly Parallel
                               benchmark code built for 4, 8 and 16 tasks
    pengra.cmd                 Parallel job command script for submission to the Linux cluster
    mpi_multibandwidth.c       MPI code that demonstrates varying bandwidths for different
                               send/receive pairs

  4. Display information about your bank

    Try the following commands to show your default bank, and then to show current allocation and usage information for that bank. Be sure to substitute your actual default bank for bank in the second command. See the man pages for defbank and pshare if you have questions.

    defbank -l
    pshare -p -0 -t bank

  5. Review the lcrm1.cmd command script file

    The commented example file, lcrm1.cmd, should be in your lcrm subdirectory. Review this file, taking note of its #PSUB directives as well as its shell commands.
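
    As a rough illustration only (not the actual contents of lcrm1.cmd), an LCRM job command script is an ordinary shell script whose #PSUB comment lines are read by psub as job options. The option letters below are typical but assumed; confirm the real ones against lcrm1.cmd and the psub man page:

    #!/bin/csh
    ## LCRM directives (option letters are assumptions)
    #PSUB -b mybank        # bank to charge (substitute your default bank)
    #PSUB -tM 5            # run time limit (see man psub for the format)
    #PSUB -eo              # combine stdout and stderr in one output file
    ## Ordinary shell commands
    date
    hostname
    echo "Job finished"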

  6. Submit a simple batch job

    When you are ready, use the psub command to submit this job to the LCRM system:

    psub lcrm1.cmd

    If your job is accepted by LCRM, you should immediately see a message that looks something like:

    Job 46847 submitted to batch

  7. Display your job's status

    You can now use the pstat command to monitor the status of your job. If you are quick, and your job hasn't already finished executing, you should see something similar to the following:

      JID NAME            USER     ACCOUNT  BANK     STATUS     EXEHOST   CL
    13105 lcrm1.cmd       class01  000000   cs        RUN       newberg   N 

    If you see no output from the pstat command, then it probably means your job has already completed. Try repeating the previous psub step and then issuing this command immediately afterwards.

  8. Review your job's output

    You will know that your job has successfully completed when it no longer shows up with the pstat command, and when you have a new file in your directory called lcrm1.c.o#####, where ##### matches the LCRM job ID assigned to your job upon submission. This is the default naming scheme used by LCRM.

    Review the contents of your output file and compare them to the original job command script. Do they agree?
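
    For example, to page through the output file (replace ##### with the job ID that psub reported):

    more lcrm1.c.o#####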

  9. Create and run a new job command script

    Now that you know the basics, create your own job command script based upon the previous example file. Begin by copying the lcrm1.cmd example file, and then modifying it to do these additional basic tasks:

    1. Give your job a name - rather than accept the LCRM default name
    2. Give your output file a name - rather than accept the LCRM default name
    3. List the contents of your lcrm subdirectory

    For assistance, you can check the psub man page. For an example solution, you can check lcrm2.cmd.
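
    If you get stuck, the following sketch shows the general idea. The option letters for the job name (-r) and output file name (-o) are assumptions here; verify them against the psub man page or lcrm2.cmd:

    #!/bin/csh
    #PSUB -r myjob         # job name (option letter assumed)
    #PSUB -o myjob.out     # output file name (option letter assumed)
    cd ~/lcrm
    ls -l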

    After you've created your new job command script, submit it to LCRM. When it completes, verify its output. Note: because this is a batch system, it's difficult to predict exactly when your job will run. If your job sits in the queue for more than a few minutes, proceed to the next step and come back here later after it has completed.

  10. Run a parallel job in LCRM

    The example job command script lcrm3.cmd will be used for this part of the exercise. This script runs the EP (embarrassingly parallel) benchmark from the NAS version 2.3 parallel benchmark set. You will be asked to run it three times, using a different number of tasks each time, and to evaluate its scalability.

    1. Review lcrm3.cmd. Notice the use of the #PSUB options for specifying 2 nodes with 4 tasks, 2 nodes with 8 tasks, and 2 nodes with 16 tasks.
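
      As a hedged illustration (the actual option letters are in lcrm3.cmd itself), node and task selection typically looks something like:

      #PSUB -ln 2          # number of nodes (option letter assumed)
      #PSUB -g 4           # number of tasks (option letter assumed)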

    2. Also note that the script sets two MP (POE) environment variables for showing diagnostic information.

      These will generate a lot of diagnostic/informational messages, both before and after the benchmark's own output. They are shown just for demonstration purposes and possibly for your use later when running real jobs.
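
      The output described in item 8 (POE environment information at the start, communication statistics at the end) suggests settings along the lines of the following, but these names are a guess; check lcrm3.cmd for the variables it actually sets:

      setenv MP_PRINTENV yes        # guess: print the POE environment settings
      setenv MP_STATISTICS print    # guess: print MPI communication statistics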

    3. Verify that your ep.B.4, ep.B.8 and ep.B.16 files are executable. If not, use the command: chmod +x ep* to make them so.
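
      A quick way to check is to list the files and look for the "x" permission bits:

      ls -l ep.B.4 ep.B.8 ep.B.16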

    4. Submit your 4 task job to the LCRM system.
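
      As in step 6, this is done with psub:

      psub lcrm3.cmd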

    5. Now modify the lcrm3.cmd job command script so that it:
      - uses 2 nodes and 8 tasks
      - uses the ep.B.8 executable
      - produces the 8 task output file (ep8task.out)
      This can be done simply by commenting/uncommenting the appropriate lines. Then, submit this 8 task job.

    6. Finally, modify the lcrm3.cmd job command script again, so that it:
      - uses 2 nodes and 16 tasks
      - uses the ep.B.16 executable
      - produces the 16 task output file (ep16task.out)
      This can be done simply by commenting/uncommenting the appropriate lines as before. Then, submit this 16 task job.

    7. You can use the pstat command to monitor your jobs' executions. Upon their completion, you should have three distinct output files.

    8. Review these three output files. There will be lots of information in the beginning about your POE environment. The actual benchmark results follow that, and then POE communications statistics appear at the end.

      From the benchmark section, determine the execution time for each job. The easiest way to do this is just grep/search for the string "CPU Time = ".
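
      For example, assuming the 4 task output file follows the same naming pattern as ep8task.out and ep16task.out:

      grep "CPU Time =" ep*task.out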

      How scalable is the ep.B.X benchmark code? Do your results come close to those shown below (assuming that they were run on the workshop machine newberg)?

      Sample ep.B.X execution timings

      SMP Nodes   MPI Tasks   Execution Time (sec.)
          2           4                83
          2           8                42
          2          16                21

      Note for the curious: this benchmark has been compiled so that it produces "gprof" profiling information, and generates a gmon*out file for each MPI task. This has nothing to do with LCRM. If you're curious, you can generate a gprof report of the benchmark run to see its profiling information. For example (any report file name will do):

      gprof ep.B.16 gmon*out > gprof.report
      *** Ignore the [nllookup] error messages ***

      The most interesting information might be the "flat profile", which lists each routine and how much CPU time it used. Open your gprof report (gprof.report in this example) and search for "flat profile", then scroll down a little way to see the profile information. Again, this has nothing to do with LCRM, but is simply provided for the curious.

  11. Run a parallel job on a Linux cluster machine

    This part of the exercise submits a job to a different system, a Linux cluster. The mpi_multibandwidth.c code demonstrates how MPI bandwidth can be a function of the types of send/receive calls used.

    1. Review the pengra.cmd job command script. Note several items:
      • Specification of where to run job
      • Name of output file
      • Compilation of mpi_multibandwidth.c
      • Cyclic distribution of hostname command
      • Use of the srun command to run the job
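
      As a very rough sketch only (pengra.cmd itself is the reference, and the psub option letters and srun flags here are assumptions), those pieces often look something like:

      #PSUB -c pengra                                   # constrain the job to the pengra cluster (option assumed)
      #PSUB -o pengra.out                               # output file name (option assumed)
      mpicc mpi_multibandwidth.c -o mpi_multibandwidth  # compile the MPI code
      srun -n8 -m cyclic hostname                       # run hostname with a cyclic task distribution (flags assumed)
      srun -n8 ./mpi_multibandwidth                     # run the MPI executable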

    2. Submit the job command script for execution on the pengra Linux cluster:
      psub pengra.cmd

    3. Monitor the job's progress with the pstat command.

    4. When the job completes, review the pengra.out output file and notice the different bandwidths for each MPI send/receive pair. Also notice the cyclic distribution of the hostname command.

  12. Try some of the other LCRM related commands/utilities

    The tutorial covered several RAC Utilities, such as pshare, bac and lrmusage. Try using these commands with some of the options shown in the tutorial.

This completes the exercise.

Evaluation Form: Please complete the online evaluation form if you have not already done so for this tutorial.
