Using Thunder Exercise


  1. Log in to the workshop machine

    The instructor will demonstrate how to log in to thunder. Once logged in, we will be using a special partition on the actual thunder machine. This partition is for the exclusive use of this workshop and will be removed after the workshop is over.

  2. Review the login banner, taking note of its various sections.

  3. Copy the example files

    First create a thunder subdirectory, then copy the files, and then cd into the thunder subdirectory:

    mkdir thunder
    cp -R /usr/global/docs/training/blaise/thunder/*   ~/thunder
    cd thunder

  4. Verify your exercise files

    Issue the ls -l command. Your output should look similar to the following:

    drwx------    2 class10  class10      4096 Apr 16 10:51 benchmarks
    drwx------    2 class10  class10      4096 Apr 16 10:51 bugs
    drwx------    4 class10  class10      4096 Apr 16 10:51 mpi
    drwx------    4 class10  class10      4096 Apr 16 10:51 openMP
    drwx------    2 class10  class10      4096 Apr 16 10:51 pthreads
    drwx------    4 class10  class10      4096 Apr 16 10:51 serial

Job Usage and Configuration Information:

  1. Before we attempt to actually compile and run anything, let's get familiar with some basic usage and configuration information for thunder. For the most part, these commands can be used on any parallel LC system with a high-speed interconnect.

  2. Try each of the following commands, comparing and contrasting them to each other. Most have a man page if you need more information.

    Command                 Description
    ...                     Basic job usage information for each partition. An LC-developed command.
    ...                     More job usage information, with more detail per job; shows running jobs only. An LC-developed command ported from the IBM SPs.
    ...                     Like the previous command, but also shows non-running (queued) jobs. An LC-developed command ported from the IBM SPs.
    ...                     Another job usage display. A SLURM command.
    ...                     Basic partition configuration information. A SLURM command.
    pstat -m thunder        Shows all LCRM jobs on thunder. An LCRM command.
    pstat -f jobid          Displays the full details for a specified job. For the jobid parameter, select any jobid (first column) from the previous command's output; a job with RUN status may be more interesting. An LCRM command.
    news job.lim.thunder    Shows job limits on thunder.

Building and Running Serial Applications:

  1. Go to either the C or Fortran versions of the serial applications:
    cd serial/c
    cd serial/fortran
    Review the Makefile, noting the compiler being used and its options. See the compiler man page for an explanation of the options.

    When you are ready, issue the make command to build the examples.

  2. Run any of the example codes by typing the executable name at the command prompt. For example: ser_array

  3. Time the untuned code - make a note of its timing:
    time untuned
  4. Edit the Makefile. Comment out the compiler option assignment line that begins with 'COPT=' or 'FOPT=', depending on whether you are using C or Fortran. Do this by inserting a '#' in front of the line. Then, uncomment the line beneath it that begins with '#COPT=' or '#FOPT='.

  5. Build the untuned example with the new compiler options:
    make clean
    make untuned
  6. Now time the new untuned executable:
    time untuned
    How does it perform compared to the previous run? What changes in compiler options account for the difference?

    Note: if you try both C and Fortran, the result differences are due to loop index variables - C starts at 0 and Fortran at 1.

A Bit More About Optimization:

  1. As you undoubtedly noticed from the unoptimized vs. optimized version of the untuned program, the Intel compilers do a good job at optimizing code. Additionally (and optionally), they provide highly detailed reports that specify which optimizations were/were not performed and why. This information can be helpful for further tuning efforts on critical portions of code.

  2. To see an example of this, compile the untuned code with a new option that requests an optimization (non-parallel) report. Depending on whether you are using C or Fortran:
    icc -O2 -tpp2 -opt_report_file optreport untuned.c
    ifort -O2 -tpp2 -opt_report_file optreport untuned.f

  3. After the compilation review the optreport file. Granted, a lot of the information implies understanding of optimization techniques, but for those who wish to dig deeper, know that you can. The icc and ifort documentation provide information on other optimization reporting options.

MPI Runs:

Resolving Unresolved Externals:

  1. Deliberately create a situation with missing externals by re-linking any of the above MPI codes using icc or ifort instead of mpiicc or mpiifort. For instance:
    icc -o mpi_ping mpi_ping.o
    ifort -o mpi_ping mpi_ping.o
  2. The linker will indicate a number of unresolved externals that prevent you from linking successfully. Select one of these symbol names and use it as an argument to findentry. For example, if you are using the C version, try:
    findentry MPI_Recv

  3. Notice the output of the findentry utility, such as the list of library directories it searches, and any possible matches to what you are looking for.

  4. With a real application, you could now attempt to link to a relevant library path and library to resolve the undefined reference. No need to do so here though...


Pthreads:

  1. cd to your ~/thunder/pthreads subdirectory. You will see several C files written with pthreads. There are no Fortran files because a standardized Fortran API for pthreads never materialized.

  2. If you are already familiar with pthreads, you can review the files to see what is intended. If you are not familiar with pthreads, this part of the exercise will probably not be of interest.

  3. Compiling with pthreads is easy: just add the -pthread option to your compile command. For example:
    icc -pthread hello.c -o hello
    Compile any/all of the example codes.

  4. To run, just enter the name of the executable.


OpenMP:

  1. Depending on your preference, cd to your ~/thunder/openMP/c/   or   ~/thunder/openMP/fortran/ subdirectory. You will see several OpenMP codes.

  2. If you are already familiar with OpenMP, you can review the files to see what is intended. If you are not familiar with OpenMP, this part of the exercise will probably not be of interest.

  3. Compiling with OpenMP is easy: just add the -openmp option to your compile command. For example:
    icc -openmp omp_hello.c -o hello
    ifort -openmp omp_reduction.f -o reduction
    Compile any/all of the example codes.

  4. To run, just enter the name of the executable.

  5. Note: by default, the number of OpenMP threads created will equal the number of CPUs on a node. You can override this by setting the OMP_NUM_THREADS environment variable to a value of your choice.

Run a Few Benchmarks/Tests:

  1. Run the STREAM memory bandwidth benchmark

    1. cd ~/thunder/benchmarks

    2. Compile either the C or the Fortran version of the code. You'll see some informational messages about OpenMP parallelization. The first command below builds the C version; the next two build the Fortran version.

      icc -O3 -tpp2 -openmp stream.c -o stream 
      icc -O3 -tpp2 -DUNDERSCORE -c mysecond.c
      ifort -O3 -tpp2 -openmp stream.f mysecond.o  -o stream

    3. Then run the code on a single node:
      srun -n1 -ppclass stream
    4. Note the timings when it completes. Compare to the theoretical peak memory-to-cpu bandwidth for the E8870 chipset of 6.4 GB/s. Note that we are running this as a 4-way OpenMP threaded job, using all 4 CPUs on the node.

    5. For more information on this benchmark, see the STREAM benchmark web site.

  2. Run an MPI message passing test, which shows the bandwidth depending upon number of nodes used and type of MPI routine used. This isn't an official benchmark - just a local test.

    1. Assuming you are still in your ~/thunder/benchmarks subdirectory, compile the code (sorry, only a C version at this time):
      mpiicc -O3 -tpp2 mpi_multibandwidth.c -o mpitest

    2. Run it using all 4 CPUs on each of 2 different nodes. Also be sure to redirect the output to a file rather than letting it go to stdout:
      srun -N2 -n8 -ppclass mpitest > mpitest.output8

    3. After the test runs, check the output file for the results. Notice in particular how bandwidth improves with message size and how much variation there can be between the 8 tasks at any given message size.

    4. To find the best OVERALL average do something like this:
      grep OVERALL mpitest.output8 | sort
      You can then search within your output file for the case that had the best performance.

    5. Now repeat the run, but this time use only 1 task on each of the 2 nodes, and send the output to a new file:
      srun -N2 -ppclass mpitest > mpitest.output2

    6. Find the best OVERALL average again for this run:
      grep OVERALL mpitest.output2 | sort

    7. Notice the large difference in performance? Why? If you're curious, ask the instructor.

A Few Bugs:

  1. cd ~/thunder/bugs

  2. bug1

      1. Look at the bug1 files. You'll notice bug1.c and bug1.32bit.output. Review bug1.32bit.output to see what this example does on an IA32 machine.

    2. Compile and run the program, and notice the incorrect output:
      icc bug1.c -o bug1

    3. If you have time and interest, see if you can find the source of the problem and fix it. Otherwise, see/compile/run the solution file, bug1.fix.c, which documents the problem.

  3. bug2

      1. Look at the bug2 files. You'll notice bug2.c and bug2.32bit.output. Review bug2.32bit.output to see what this example does on an IA32 machine.

    2. Compile and run the program.
      icc bug2.c -o bug2
    3. Notice what happens. Why? This is a very simple program. See if you can figure out why and how to fix it.

    4. As with the bug1 problem, there is a "fix" file which explains the problem and provides a workaround solution.

  4. bug3

    1. Review the bug3.c or bug3.f, depending upon whether you like C or Fortran. These are very simple OpenMP programs.

    2. Compile and run the program.

      icc -openmp bug3.c -o bug3
      ifort -openmp bug3.f -o bug3

    3. It should seg fault. The lecture notes discuss the reason in the memory constraints section, but it could take the average new programmer quite a while to figure out why this happens and what to do about it.

    4. For the solution, read the bug3.fix file and "source" it:
      source bug3.fix
    5. Now run your bug3 program again. It should work fine.

Online Thunder Status Information:

  1. Go to the main LC web page by clicking on the link below. It will open a new window so you can follow along with the rest of the instructions.

  2. Look for the little green/red arrows for "OCF Machine Status". Click there.

  3. When prompted for your user name and password, use your class## userid and the PIN + OTP token for your password. Ask the instructor if you're not sure what this means.

  4. You will then be taken to the "LC OCF Machines Status" web matrix. Find the line for Thunder and note what info is displayed.

  5. Now actually click on the hyperlinked word "Thunder" and you will be taken to lots of additional information about Thunder, including links to yet more information, which you can follow if you like.

This completes the exercise.

Evaluation Form: Please complete the online evaluation form if you haven't already done so for this tutorial.
