Introduction to Parallel Computing

Author: Blaise Barney, Lawrence Livermore National Laboratory UCRL-MI-133316

Table of Contents

  1. Abstract
  2. Overview
    1. What is Parallel Computing?
    2. Why Use Parallel Computing?
    3. Who is Using Parallel Computing?
  3. Concepts and Terminology
    1. von Neumann Computer Architecture
    2. Flynn's Classical Taxonomy
    3. Some General Parallel Terminology
    4. Limits and Costs of Parallel Programming
  4. Parallel Computer Memory Architectures
    1. Shared Memory
    2. Distributed Memory
    3. Hybrid Distributed-Shared Memory
  5. Parallel Programming Models
    1. Overview
    2. Shared Memory Model
    3. Threads Model
    4. Distributed Memory / Message Passing Model
    5. Data Parallel Model
    6. Hybrid Model
    7. SPMD and MPMD
  6. Designing Parallel Programs
    1. Automatic vs. Manual Parallelization
    2. Understand the Problem and the Program
    3. Partitioning
    4. Communications
    5. Synchronization
    6. Data Dependencies
    7. Load Balancing
    8. Granularity
    9. I/O
    10. Debugging
    11. Performance Analysis and Tuning
  7. Parallel Examples
    1. Array Processing
    2. PI Calculation
    3. Simple Heat Equation
    4. 1-D Wave Equation
  8. References and More Information


Abstract


This is the first tutorial in the "Livermore Computing Getting Started" workshop. It is intended to provide only a very quick overview of the extensive and broad topic of Parallel Computing, as a lead-in for the tutorials that follow it. As such, it covers just the very basics of parallel computing, and is intended for someone who is just becoming acquainted with the subject and who is planning to attend one or more of the other tutorials in this workshop. It is not intended to cover Parallel Programming in depth, as this would require significantly more time. The tutorial begins with a discussion on parallel computing - what it is and how it's used, followed by a discussion on concepts and terminology associated with parallel computing. The topics of parallel memory architectures and programming models are then explored. These topics are followed by a series of practical discussions on a number of the complex issues related to designing and running parallel programs. The tutorial concludes with several examples of how to parallelize simple serial programs.



Overview

What is Parallel Computing?

Serial Computing:

Parallel Computing:

Parallel Computers:



Overview

Why Use Parallel Computing?

The Real World is Massively Parallel:

  • In the natural world, many complex, interrelated events are happening at the same time, yet within a temporal sequence.

  • Compared to serial computing, parallel computing is much better suited for modeling, simulating and understanding complex, real world phenomena.

  • For example, imagine modeling these serially:



Main Reasons:

  • SAVE TIME AND/OR MONEY:
    • In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings.
    • Parallel computers can be built from cheap, commodity components.

  • SOLVE LARGER / MORE COMPLEX PROBLEMS:
    • Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory.
    • Example: "Grand Challenge Problems" (en.wikipedia.org/wiki/Grand_Challenge) requiring PetaFLOPS and PetaBytes of computing resources.
    • Example: Web search engines/databases processing millions of transactions every second

  • PROVIDE CONCURRENCY:
    • A single compute resource can only do one thing at a time. Multiple compute resources can do many things simultaneously.
    • Example: Collaborative Networks provide a global venue where people from around the world can meet and conduct work "virtually".

  • TAKE ADVANTAGE OF NON-LOCAL RESOURCES:
    • Using compute resources on a wide area network, or even the Internet when local compute resources are scarce or insufficient.
    • Example: SETI@home (setiathome.berkeley.edu) has over 1.5 million users in nearly every country in the world. Source: www.boincsynergy.com/stats/ (June 2015).
    • Example: Folding@home (folding.stanford.edu) uses over 160,000 computers globally (June 2015).

  • MAKE BETTER USE OF UNDERLYING PARALLEL HARDWARE:
    • Modern computers, even laptops, are parallel in architecture with multiple processors/cores.
    • Parallel software is specifically intended for parallel hardware with multiple cores, threads, etc.
    • In most cases, serial programs run on modern computers "waste" potential computing power.


      Intel Xeon processor with 6 cores and 6 L3 cache units

The Future:

  • During the past 20+ years, the trends indicated by ever faster networks, distributed systems, and multi-processor computer architectures (even at the desktop level) clearly show that parallelism is the future of computing.

  • In this same time period, there has been a greater than 500,000x increase in supercomputer performance, with no end currently in sight.

  • The race is already on for Exascale Computing!
    • Exaflop = 10^18 calculations per second

Source: Top500.org


Overview

Who is Using Parallel Computing?

Science and Engineering:
  • Historically, parallel computing has been considered to be "the high end of computing", and has been used to model difficult problems in many areas of science and engineering:
    • Atmosphere, Earth, Environment
    • Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
    • Bioscience, Biotechnology, Genetics
    • Chemistry, Molecular Sciences
    • Geology, Seismology
    • Mechanical Engineering - from prosthetics to spacecraft
    • Electrical Engineering, Circuit Design, Microelectronics
    • Computer Science, Mathematics
    • Defense, Weapons

Industrial and Commercial:

  • Today, commercial applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. For example:
    • "Big Data", databases, data mining
    • Oil exploration
    • Web search engines, web based business services
    • Medical imaging and diagnosis
    • Pharmaceutical design
    • Financial and economic modeling
    • Management of national and multi-national corporations
    • Advanced graphics and virtual reality, particularly in the entertainment industry
    • Networked video and multi-media technologies
    • Collaborative work environments

Global Applications:



Concepts and Terminology

von Neumann Architecture

  • Composed of four main components:
    • Memory
    • Control Unit
    • Arithmetic Logic Unit
    • Input/Output

  • Read/write, random access memory is used to store both program instructions and data
    • Program instructions are coded data which tell the computer to do something
    • Data is simply information to be used by the program

  • Control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task.

  • Arithmetic logic unit performs basic arithmetic operations

  • Input/Output is the interface to the human operator

John von Neumann circa 1940s
(Source: LANL archives)



Concepts and Terminology

Flynn's Classical Taxonomy




Single Instruction, Single Data (SISD):




Single Instruction, Multiple Data (SIMD):




Multiple Instruction, Single Data (MISD):


Multiple Instruction, Multiple Data (MIMD):



Concepts and Terminology

Some General Parallel Terminology



Concepts and Terminology

Limits and Costs of Parallel Programming

Amdahl's Law:

  • Amdahl's Law states that potential program speedup is defined by the fraction of code (P) that can be parallelized:

    
                         1
        speedup   =   -------- 
                       1  - P
    
    

  • If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup).

  • If all of the code is parallelized, P = 1 and the speedup is infinite (in theory).

  • If 50% of the code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast.

  • Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by:

    
                           1  
        speedup   =   ------------ 
                        P   +  S
                       ---
                        N
    
    

    where P = parallel fraction, N = number of processors and S = serial fraction (S = 1 - P).

  • It soon becomes obvious that there are limits to the scalability of parallelism. For example:

    
                           speedup
                 --------------------------------
        N        P = .50      P = .90     P = .99
      -----      -------      -------     -------
         10         1.82         5.26        9.17
        100         1.98         9.17       50.25     
      1,000         1.99         9.91       90.99
     10,000         1.99         9.99       99.02
    100,000         1.99         9.99       99.90
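
  • These values can be checked directly from the second formula. A minimal sketch in C (the loop simply evaluates speedup = 1 / (P/N + S) for the processor counts and parallel fractions used in the table):

    /* Quick numerical check of the Amdahl's Law table above.
       P = parallel fraction, S = 1 - P = serial fraction, N = processor count. */
    #include <stdio.h>

    static double speedup(double P, int N)
    {
        double S = 1.0 - P;          /* serial fraction */
        return 1.0 / (P / N + S);    /* Amdahl's Law */
    }

    int main(void)
    {
        const double fractions[] = { 0.50, 0.90, 0.99 };
        const int counts[] = { 10, 100, 1000, 10000, 100000 };

        for (int i = 0; i < 5; i++) {
            printf("N = %6d:", counts[i]);
            for (int j = 0; j < 3; j++)
                printf("  P = %.2f -> %6.2f", fractions[j], speedup(fractions[j], counts[i]));
            printf("\n");
        }
        return 0;
    }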
    
    


Complexity:

Portability:

Resource Requirements:

Scalability:



Parallel Computer Memory Architectures

Shared Memory

General Characteristics:

  • Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space.

  • Multiple processors can operate independently but share the same memory resources.

  • Changes in a memory location effected by one processor are visible to all other processors.

  • Historically, shared memory machines have been classified as UMA and NUMA, based upon memory access times.

Uniform Memory Access (UMA):

  • Most commonly represented today by Symmetric Multiprocessor (SMP) machines
  • Identical processors
  • Equal access and access times to memory
  • Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.

Non-Uniform Memory Access (NUMA):

  • Often made by physically linking two or more SMPs
  • One SMP can directly access memory of another SMP
  • Not all processors have equal access time to all memories
  • Memory access across link is slower
  • If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA

Advantages:

  • Global address space provides a user-friendly programming perspective to memory
  • Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

Shared Memory (UMA)



Shared Memory (NUMA)

Disadvantages:



Parallel Computer Memory Architectures

Distributed Memory

General Characteristics:

Advantages:

Disadvantages:



Parallel Computer Memory Architectures

Hybrid Distributed-Shared Memory

General Characteristics:

Advantages and Disadvantages:



Parallel Programming Models

Overview



Parallel Programming Models

Shared Memory Model (without threads)

Implementations:



Parallel Programming Models

Threads Model

Implementations:

More Information:



Parallel Programming Models

Distributed Memory / Message Passing Model

Implementations:

More Information:



Parallel Programming Models

Data Parallel Model


Implementations:



Parallel Programming Models

Hybrid Model

  • A hybrid model combines more than one of the previously described programming models.

  • Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with the threads model (OpenMP); a minimal sketch of this combination follows this list.
    • Threads perform computationally intensive kernels using local, on-node data
    • Communication between processes on different nodes occurs over the network using MPI

  • This hybrid model lends itself well to what is currently the most popular hardware environment: clusters of multi/many-core machines.

  • Another similar and increasingly popular example of a hybrid model is using MPI with CPU-GPU (Graphics Processing Unit) programming.
    • MPI tasks run on CPUs using local memory and communicate with each other over a network.
    • Computationally intensive kernels are off-loaded to GPUs on-node.
    • Data exchange between node-local memory and GPUs uses CUDA (or something equivalent).

  • Other hybrid models are common:
    • MPI with Pthreads
    • MPI with non-GPU accelerators
    • ...
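
  • For illustration, here is a minimal sketch of the MPI + OpenMP combination described above, assuming C; the work being divided (a partial sum) is purely illustrative:

    /* Sketch of the MPI + OpenMP hybrid model: MPI tasks communicate between
       nodes, OpenMP threads perform the on-node computation. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided, rank, ntasks;

        /* Request thread support, since OpenMP threads run inside each MPI task */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        /* Each MPI task takes a block of the iteration range ... */
        const long n = 100000000;
        long chunk = n / ntasks;
        long lo = rank * chunk;
        long hi = (rank == ntasks - 1) ? n : lo + chunk;

        /* ... and its OpenMP threads share the on-node work */
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (long i = lo; i < hi; i++)
            local += 1.0 / (double)(i + 1);

        /* Communication between tasks on different nodes goes through MPI */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f  (%d MPI tasks x up to %d threads each)\n",
                   total, ntasks, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }

    Built with something like mpicc -fopenmp, this would typically be run with one MPI task per node, letting the OpenMP threads fill that node's cores.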



Parallel Programming Models

SPMD and MPMD

Single Program Multiple Data (SPMD):

Multiple Program Multiple Data (MPMD):



Designing Parallel Programs

Automatic vs. Manual Parallelization



Designing Parallel Programs

Understand the Problem and the Program

  • Identify the program's hotspots:
    • Know where most of the real work is being done. The majority of scientific and technical programs usually accomplish most of their work in a few places.
    • Profilers and performance analysis tools can help here
    • Focus on parallelizing the hotspots and ignore those sections of the program that account for little CPU usage.

  • Identify bottlenecks in the program:
    • Are there areas that are disproportionately slow, or cause parallelizable work to halt or be deferred? For example, I/O is usually something that slows a program down.
    • May be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessary slow areas

  • Identify inhibitors to parallelism. One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence, where each term depends on the two preceding terms (see the short example after this list).

  • Investigate other algorithms if possible. This may be the single most important consideration when designing a parallel application.

  • Take advantage of optimized third party parallel software and highly optimized math libraries available from leading vendors (IBM's ESSL, Intel's MKL, AMD's ACML, etc.).
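
  • For example, the loop below cannot simply be split among tasks, because each iteration reads values produced by earlier iterations (a minimal sketch in C):

    /* Loop-carried data dependence: each new Fibonacci term needs the two terms
       computed on earlier iterations, so the iterations cannot simply be handed
       to different tasks. Contrast this with the array example later in the
       tutorial, where a(i,j) = fcn(i,j) has no such dependence. */
    #include <stdio.h>

    int main(void)
    {
        long fib[40];
        fib[0] = 0;
        fib[1] = 1;
        for (int i = 2; i < 40; i++)
            fib[i] = fib[i - 1] + fib[i - 2];   /* depends on prior iterations */
        printf("fib[39] = %ld\n", fib[39]);
        return 0;
    }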


Designing Parallel Programs

Partitioning

Domain Decomposition:

Functional Decomposition:



Designing Parallel Programs

Communications

Who Needs Communications?

Factors to Consider:



Designing Parallel Programs

Synchronization

Types of Synchronization:



Designing Parallel Programs

Data Dependencies

Definition:

Examples:

How to Handle Data Dependencies:



Designing Parallel Programs

Load Balancing

How to Achieve Load Balance:



Designing Parallel Programs

Granularity

Computation / Communication Ratio:

Fine-grain Parallelism:

  • Relatively small amounts of computational work are done between communication events

  • Low computation to communication ratio

  • Facilitates load balancing

  • Implies high communication overhead and less opportunity for performance enhancement

  • If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.

Coarse-grain Parallelism:

  • Relatively large amounts of computational work are done between communication/synchronization events

  • High computation to communication ratio

  • Implies more opportunity for performance increase

  • Harder to load balance efficiently

Which is Best?

  • The most efficient granularity is dependent on the algorithm and the hardware environment in which it runs.

  • In most cases the overhead associated with communications and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity.

  • Fine-grain parallelism can help reduce overheads due to load imbalance.


Designing Parallel Programs

I/O

The Bad News:

The Good News:



Designing Parallel Programs

Debugging



Designing Parallel Programs

Performance Analysis and Tuning



Parallel Examples

Array Processing

  • This example demonstrates calculations on 2-dimensional array elements; a function is evaluated on each array element.

  • The computation on each array element is independent from other array elements.

  • The problem is computationally intensive.

  • The serial program calculates one element at a time in sequential order.

  • Serial code could be of the form:

    
    do j = 1,n
      do i = 1,n
        a(i,j) = fcn(i,j)
      end do
    end do
    
    

  • Questions to ask:
    • Is this problem able to be parallelized?
    • How would the problem be partitioned?
    • Are communications needed?
    • Are there any data dependencies?
    • Are there synchronization needs?
    • Will load balancing be a concern?


Array Processing
Parallel Solution 1

One Possible Solution:
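
  • A minimal sketch of one way this could be done, assuming MPI, a block distribution of columns among tasks, and a trivial stand-in for fcn(); because each element is independent, no communication is needed during the computation:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000

    static double fcn(int i, int j)        /* stand-in for the function being evaluated */
    {
        return (double)i * (double)j;
    }

    int main(int argc, char *argv[])
    {
        int rank, ntasks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        /* Block distribution: each task owns a contiguous group of columns
           (assumes N is evenly divisible by the number of tasks) */
        int ncols = N / ntasks;
        int first = rank * ncols;

        /* Each task allocates and computes only its own columns */
        double (*a)[N] = malloc(sizeof(double[ncols][N]));   /* a[local_column][row] */
        for (int j = 0; j < ncols; j++)
            for (int i = 0; i < N; i++)
                a[j][i] = fcn(i, first + j);

        printf("task %d computed columns %d through %d\n", rank, first, first + ncols - 1);

        free(a);
        MPI_Finalize();
        return 0;
    }

    If the full array is later needed in one place, the per-task blocks could be gathered afterwards (for example, with MPI_Gather).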

Example Programs:


Array Processing
Parallel Solution 2: Pool of Tasks

Pool of Tasks Scheme:

Discussion:



Parallel Examples

PI Calculation

  • The value of PI can be calculated in a number of ways. Consider the following method of approximating PI:
    1. Inscribe a circle in a square
    2. Randomly generate points in the square
    3. Determine the number of points in the square that are also in the circle
    4. Let r be the number of points in the circle divided by the number of points in the square
    5. PI ~ 4 r, since the ratio of the circle's area to the square's area is PI/4
    6. Note that the more points generated, the better the approximation

  • Serial pseudo code for this procedure (a runnable C sketch follows this list):

    
    npoints = 10000
    circle_count = 0
    
    do j = 1,npoints
      generate 2 random numbers between 0 and 1
      xcoordinate = random1
      ycoordinate = random2
      if (xcoordinate, ycoordinate) inside circle
      then circle_count = circle_count + 1
    end do
    
    PI = 4.0*circle_count/npoints
    
    

  • The problem is computationally intensive - most of the time is spent executing the loop

  • Questions to ask:
    • Is this problem able to be parallelized?
    • How would the problem be partitioned?
    • Are communications needed?
    • Are there any data dependencies?
    • Are there synchronization needs?
    • Will load balancing be a concern?
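
  • A runnable rendering of the serial pseudo code above (a sketch in C; rand() and a quarter circle of unit radius are used only to keep the example short -- the area ratio is still PI/4):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long npoints = 10000;
        long circle_count = 0;

        for (long j = 0; j < npoints; j++) {
            /* generate 2 random numbers between 0 and 1 */
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            /* the point lies inside the quarter circle if x^2 + y^2 <= 1 */
            if (x * x + y * y <= 1.0)
                circle_count++;
        }

        double pi = 4.0 * (double)circle_count / (double)npoints;
        printf("PI is approximately %f\n", pi);
        return 0;
    }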


PI Calculation
Parallel Solution

  • Another problem that's easy to parallelize:
    • All point calculations are independent; no data dependencies
    • Work can be evenly divided; no load balance concerns
    • No need for communication or synchronization between tasks

  • Parallel strategy:
    • Divide the loop into equal portions that can be executed by the pool of tasks
    • Each task independently performs its work
    • A SPMD model is used
    • One task acts as the master to collect results and compute the value of PI

  • Pseudo code solution; the changes for parallelism are the task setup before the loop (p, num, MASTER/WORKER identification), the per-task loop count (num instead of npoints), and the MASTER/WORKER exchange after the loop.

    
    npoints = 10000
    circle_count = 0
    
    p = number of tasks
    num = npoints/p
    
    find out if I am MASTER or WORKER 
    
    do j = 1,num 
      generate 2 random numbers between 0 and 1
      xcoordinate = random1
      ycoordinate = random2
      if (xcoordinate, ycoordinate) inside circle
      then circle_count = circle_count + 1
    end do
    
    if I am MASTER
    
      receive from WORKERS their circle_counts
      compute PI (use MASTER and WORKER calculations)
    
    else if I am WORKER
    
      send to MASTER circle_count
    
    endif
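
  • A minimal MPI rendering of this parallel pseudo code (a sketch in C). The WORKERS send their counts to the MASTER, mirroring the pseudo code; MPI_Reduce would be the more idiomatic way to combine them:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MASTER 0

    int main(int argc, char *argv[])
    {
        const long npoints = 10000;
        long circle_count = 0;
        int rank, p;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* find out if I am MASTER or WORKER */
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        long num = npoints / p;                 /* this task's share of the points */
        srand(rank + 1);                        /* different random stream per task */

        for (long j = 0; j < num; j++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)
                circle_count++;
        }

        if (rank == MASTER) {
            long total = circle_count;          /* include the MASTER's own count */
            for (int w = 1; w < p; w++) {
                long c;
                MPI_Recv(&c, 1, MPI_LONG, w, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                total += c;
            }
            printf("PI is approximately %f\n", 4.0 * (double)total / (double)(num * p));
        } else {
            MPI_Send(&circle_count, 1, MPI_LONG, MASTER, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }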
    
    

Example Programs:



Parallel Examples

Simple Heat Equation

  • Most problems in parallel computing require communication among the tasks. A number of common problems require communication with "neighbor" tasks.

  • The heat equation describes the temperature change over time, given initial temperature distribution and boundary conditions.

  • A finite differencing scheme is employed to solve the heat equation numerically on a square region.
    • The elements of a 2-dimensional array represent the temperature at points on the square.
    • The initial temperature is zero on the boundaries and high in the middle.
    • The boundary temperature is held at zero.
    • A time stepping algorithm is used.

  • The calculation of an element is dependent upon its neighbor elements' values.

  • A serial program would contain code like:

    
    do iy = 2, ny - 1
      do ix = 2, nx - 1
        u2(ix, iy) =  u1(ix, iy)  +
            cx * (u1(ix+1,iy) + u1(ix-1,iy) - 2.*u1(ix,iy)) +
            cy * (u1(ix,iy+1) + u1(ix,iy-1) - 2.*u1(ix,iy))
      end do
    end do
    

  • Questions to ask:
    • Is this problem able to be parallelized?
    • How would the problem be partitioned?
    • Are communications needed?
    • Are there any data dependencies?
    • Are there synchronization needs?
    • Will load balancing be a concern?


Simple Heat Equation
Parallel Solution
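
  • One simple shared-memory approach (a minimal sketch, assuming OpenMP) parallelizes the sweep over interior points, since each u2 element within a time step can be updated independently. On distributed memory, each task would instead own a block of the grid and exchange its border data with neighbor tasks before each time step:

    /* One OpenMP-parallelized time-stepping sketch of the serial loops shown
       earlier. Boundaries stay at zero; only interior points are updated. */
    #include <stdio.h>

    #define NX 100
    #define NY 100

    static double u1[NX][NY], u2[NX][NY];

    void step(double cx, double cy)
    {
        #pragma omp parallel for collapse(2)
        for (int ix = 1; ix < NX - 1; ix++)
            for (int iy = 1; iy < NY - 1; iy++)
                u2[ix][iy] = u1[ix][iy]
                    + cx * (u1[ix + 1][iy] + u1[ix - 1][iy] - 2.0 * u1[ix][iy])
                    + cy * (u1[ix][iy + 1] + u1[ix][iy - 1] - 2.0 * u1[ix][iy]);
    }

    int main(void)
    {
        /* crude initial condition: zero boundaries, hot spot in the middle */
        u1[NX / 2][NY / 2] = 100.0;

        for (int t = 0; t < 1000; t++) {
            step(0.1, 0.1);
            /* copy u2 back into u1 for the next time step */
            #pragma omp parallel for collapse(2)
            for (int ix = 1; ix < NX - 1; ix++)
                for (int iy = 1; iy < NY - 1; iy++)
                    u1[ix][iy] = u2[ix][iy];
        }
        printf("temperature at center after 1000 steps: %f\n", u1[NX / 2][NY / 2]);
        return 0;
    }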

Example Programs:



Parallel Examples

1-D Wave Equation


1-D Wave Equation
Parallel Solution

Example Programs:




This completes the tutorial.




References and More Information