
Using resources effectively

Overview

Teaching: 20 min
Exercises: 25 min
Questions
  • What are the different types of parallelism?

  • How do we execute a task in parallel?

  • What benefits arise from parallel execution?

  • What are the limits of gains from execution in parallel?

Objectives
  • Prepare a job submission script for the parallel executable.

  • Launch jobs with parallel execution.

  • Record and summarize the timing and accuracy of jobs.

  • Describe the relationship between job parallelism and performance.

We now have the full toolset we need to run a job, so let’s learn how to scale up job performance using parallelism. This is a very important aspect of HPC systems, as parallelism is one of the primary tools we have for speeding up computational tasks.

In this lesson, we will learn about three ways to parallelize a problem: embarrassingly parallel problems, shared memory parallelism, and distributed memory parallelism.

Infinite Monkey Theorem

The infinite monkey theorem states that a monkey hitting keys independently and at random on a typewriter keyboard for an infinite amount of time will almost surely type any given text, including the complete works of William Shakespeare.

[Figure: Chimpanzee seated at a typewriter. Image credit: New York Zoological Society]

We don’t have infinite time or resources, but we can simulate this problem with A LOT of monkeys.

Create the following script to simulate a “monkey”. Let’s name this monkey.py:

#!/usr/bin/env python3

import random
import string

nwords = 100
minlen = 2
maxlen = 10
dictionaryfile = "wordlist.txt"

# Generate a random string of lowercase characters of a given length
def randomchars(length):
    return "".join([
        random.choice(string.ascii_lowercase) for _ in range(length)
        ])


# Create a list of random character strings with lengths randomly
# chosen between minlen and maxlen
wordlist = [
    randomchars(random.randint(minlen, maxlen)) for _ in range(nwords)
]

# Read the words from our downloaded dictionary
with open(dictionaryfile, "r") as f:
    englishwords = [ line.strip() for line in f ]

# Print all the randomly generated "words" that are also in the dictionary
for word in wordlist:
    if word in englishwords:
        print(word)
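The script expects a file called wordlist.txt containing one English word per line. If you do not already have one from earlier in the lesson, one possible stand-in on many Linux systems is the system dictionary; the path below is an assumption and may differ on your cluster:

[yourUsername@borah-login ~]$ cp /usr/share/dict/words wordlist.txt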

Let your monkey loose on a compute node

Create a submission file, requesting one task on a single node, then launch it.

[yourUsername@borah-login ~]$ nano monkey-job.sh
[yourUsername@borah-login ~]$ cat monkey-job.sh
#!/usr/bin/env bash
#SBATCH -J monkey
#SBATCH -p short
#SBATCH -N 1
#SBATCH -n 1

# Load the computing environment we need
module load python/3.9.7

# Execute the task
python ./monkey.py
[yourUsername@borah-login ~]$ sbatch monkey-job.sh

As before, use the Slurm status commands to check whether your job is running and when it ends:

[yourUsername@borah-login ~]$ squeue --me

Use ls to locate the output file. The -t flag sorts in reverse-chronological order: newest first. What was the output?

Read the Job Output

The cluster output should be written to a file in the folder you launched the job from. For example,

[yourUsername@borah-login ~]$ ls -t
slurm-2114623.out  monkey-job.sh   monkey.py  wordlist.txt
[yourUsername@borah-login ~]$ cat slurm-2114623.out
wow
ld
ax
tk
bl
kg

Your monkey no doubt came up with a different list of words.

Now that we have our single “monkey”, how can we scale this up using a compute cluster?

Running multiple jobs at once

This is an example of an embarrassingly parallel problem: each monkey doesn’t need to know anything about what the other monkeys are doing. So if our goal is for the monkeys to generate the most words, more monkeys is the solution.

By making the following modification to our submission script (monkey-job.sh), we can use a job array to put multiple monkeys to work!

#!/usr/bin/env bash
#SBATCH -J monkey
#SBATCH -p short
#SBATCH --array=0-10
#SBATCH -N 1
#SBATCH -n 1

# Load the computing environment we need
module load python/3.9.7

# Execute the task
python ./monkey.py
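Every task in the array runs the same script, and Slurm distinguishes them with the SLURM_ARRAY_TASK_ID environment variable (0 through 10 here). We don’t need it for this exercise, but as a sketch, you could label each monkey by adding a line like this to the script:

echo "This is monkey number ${SLURM_ARRAY_TASK_ID}"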

After modifying your submission script, resubmit your job:

[yourUsername@borah-login ~]$ sbatch monkey-job.sh

You can check on your job while it’s running using:

[yourUsername@borah-login ~]$ squeue --me

Or see the time, hostname, and exit code of finished jobs using:

[yourUsername@borah-login ~]$ sacct -X

Once your job is finished, we can use the word count program, wc, to see how many words were generated by our array of monkeys:

[yourUsername@borah-login ~]$ wc -l  slurm-2114626_*
  8 slurm-2114626_0.out
  2 slurm-2114626_10.out
  6 slurm-2114626_1.out
  4 slurm-2114626_2.out
  7 slurm-2114626_3.out
 11 slurm-2114626_4.out
  8 slurm-2114626_5.out
  4 slurm-2114626_6.out
  4 slurm-2114626_7.out
  5 slurm-2114626_8.out
 10 slurm-2114626_9.out
 69 total
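Different monkeys may stumble on the same word. As an optional extra (not required for the exercise), you can count how many distinct words the whole troop found by combining the outputs:

[yourUsername@borah-login ~]$ cat slurm-2114626_*.out | sort -u | wc -l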

By using a job array, we were able to increase our word output with no changes to our code and a single change to our submission script. Next we’ll learn about more complex types of parallelism.

Distributed vs shared memory parallelism

[Figure: Diagram of two nodes, each with its own memory and processors. Image credit: Cornell Virtual Workshop]

Hands-on activity

To demonstrate the difference between these types of parallelism, your instructor will lead you through an activity.

Takeaway

Shared memory parallelism requires workers to be able to access the same information. In the puzzle example, the people working together at the same table can reach the same pieces. In the HPC world, shared memory parallelism can be used on a single compute node, where the CPU cores can access the same memory.

Distributed memory parallelism requires a communication framework to give a task to and collect output from each worker. In the puzzle example, people seated at different tables must have someone bring them pieces, take the pieces back, and organize the partial results. In the HPC world, a framework called MPI allows workers across multiple nodes to communicate over a specialized network.
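We won’t write MPI code in this lesson, but to give a flavor of distributed memory parallelism, here is a minimal “hello world” sketch in Python. It assumes the mpi4py package is available, which is not required for this lesson’s exercises. Each copy of the program is a separate process with its own memory, and the communicator is how the copies learn about each other:

#!/usr/bin/env python3
# Minimal MPI sketch; assumes the mpi4py package is installed (not needed for this lesson)
from mpi4py import MPI

comm = MPI.COMM_WORLD    # communicator containing every process in the job
rank = comm.Get_rank()   # this process's id within the communicator
size = comm.Get_size()   # total number of processes

print(f"Hello from process {rank} of {size}")

Launched with several tasks (for example, srun -n 4 python hello_mpi.py, where hello_mpi.py is a made-up file name), each process prints its own rank.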

Choosing the right parameters for your SLURM job

When submitting jobs to SLURM, choosing the right combination of --ntasks and --cpus-per-task depends on understanding how your program implements parallelism.

--ntasks specifies the number of separate processes with independent memory spaces. Each task runs as a distinct process that cannot directly access another task’s memory. This maps to distributed memory parallelism, where processes communicate through message passing (like MPI). Use --ntasks when your program:

  • uses MPI or another message-passing framework, or

  • runs as several independent processes that need to be scheduled together, possibly across multiple nodes.

--cpus-per-task specifies how many CPU cores each task can use through threads that share the same memory space. This maps to shared memory parallelism, where threads within a process can directly access the same variables and data structures. Use --cpus-per-task when your program:

  • uses threads, for example through OpenMP or a multithreaded library, and

  • keeps all of its workers on a single node, where they can share memory (see the sketch below).
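As an illustration only (not one of this lesson’s exercises), a submission script for a hypothetical hybrid program that uses MPI processes and threads together might combine both options; my_hybrid_program is a made-up name:

#!/usr/bin/env bash
#SBATCH -J hybrid-example
#SBATCH -p short
#SBATCH --ntasks=4          # four processes, each with its own memory (distributed)
#SBATCH --cpus-per-task=8   # eight cores of threads per process (shared memory)

# Tell the threading library how many cores each process may use
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# srun starts one copy of the program per task
srun ./my_hybrid_program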

Giving your job more resources doesn’t necessarily mean it will have better performance. It’s important to evaluate what kind of parallelization your program can use and tailor your resource request to fit. To learn more about parallelization, see the parallel novice lesson and the Cornell Virtual Workshop on Parallel Programming Concepts and High Performance Computing.

Key Points

  • Parallel programming allows applications to take advantage of parallel hardware.

  • The queuing system facilitates executing parallel tasks.

  • Performance gains from parallel execution do not always scale linearly with the resources added.