Lecture 21

Theory and Design of PL (CS 538)

April 08, 2020

Parallelism

What is parallelism?

  • Multiple tasks executing at same instant in time
    • Think: multicore, datacenter, supercomputer
  • Property of actual execution on hardware
    • Not property of language/program

Why parallelism?

  1. We want to go really fast!
  2. We’re getting more and more cores on CPUs
    • CPU clocks aren’t getting much faster
  3. Custom chips becoming more common
    • GPUs, ASICs, TPUs, …

Examples

Data parallelism

  • Divide up data into a bunch of pieces
    • Useful when you have a lot of homogeneous data
    • Image data, log files, training examples, …
  • Process parts independently, usually in same way
  • Wait for tasks to finish, collect results
  • Examples: MapReduce, Hadoop
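  • A minimal sketch of this pattern with plain std threads (the data and
    chunk count are made up for illustration; this is not how MapReduce works internally):
use std::thread;

fn main() {
    // A pile of homogeneous data, standing in for pixels, log lines, ...
    let data: Vec<u64> = (1..=1_000_000).collect();
    let chunk_size = data.len() / 4 + 1;

    // Divide the data, process each piece independently, collect results.
    let total: u64 = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    });

    assert_eq!(total, 1_000_000 * 1_000_001 / 2);
}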

Task parallelism

  • Divide up task into a bunch of pieces
  • Try to run tasks at the same time
  • Complications
    • Some tasks may depend on other tasks
    • Often unclear how to split up a complex task
    • Scheduling tasks makes a big difference
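  • A rough sketch with std threads (the two sub-task names and bodies are
    hypothetical, just to show the shape):
use std::thread;

// Two independent pieces of a larger job (placeholders for illustration).
fn compress_logs() -> usize { 42 }
fn build_index() -> usize { 7 }

fn main() {
    // The pieces don't depend on each other, so try to run them at the same time.
    let t1 = thread::spawn(compress_logs);
    let t2 = thread::spawn(build_index);

    // Wait for both before doing anything that depends on their results.
    let (compressed, indexed) = (t1.join().unwrap(), t2.join().unwrap());
    println!("compressed {compressed} archives, built index of {indexed} docs");
}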

Bit-level

  • Instruction-set architecture level
  • Use bigger instructions to operate on more data
  • Get more done with every instruction
  • Examples
    • 16-bit, 32-bit, 64-bit microprocessors
    • SIMD: single instruction, multiple data
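  • A toy illustration of the “wider word, more work per instruction” idea
    (real SIMD uses dedicated vector instructions; this only shows the flavor):
// Toggle 64 flags with one 64-bit XOR, vs. one bit at a time.
fn toggle_wide(flags: u64, mask: u64) -> u64 {
    flags ^ mask // a single 64-bit operation
}

fn toggle_bitwise(mut flags: u64, mask: u64) -> u64 {
    for i in 0..64 {
        if (mask >> i) & 1 == 1 {
            flags ^= 1 << i; // up to 64 separate 1-bit operations
        }
    }
    flags
}

fn main() {
    let (flags, mask) = (0b1010_1010u64, 0b1111_0000u64);
    assert_eq!(toggle_wide(flags, mask), toggle_bitwise(flags, mask));
}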

Instruction-level

  • Instruction-set architecture level
  • Run instructions themselves in parallel
    • Each clock cycle, execute multiple instructions
  • Common in all modern processors
    • Pipelining
    • Gain when instructions don’t interfere

Multicore

  • Package several processors into one
  • 4-, 8-, 16-, or 32-core chips are not unusual
  • Each core is almost a separate CPU

GPUs and ASICs

  • Application-Specific Integrated Circuits
  • Specialized chips for specialized tasks
    • Really, really efficient for certain tasks
  • Examples
    • GPUs: processing graphics
    • TPUs: training neural networks
    • ASICs: mining bitcoin

Distributed computing

  • Geographically spread-out computers
  • Grid computing: borrow time from idle computers
    • SETI@Home, protein folding, …
  • Datacenters

Supercomputers

  • Really, really big computers
    • Footprint of several basketball courts
    • Hundreds of miles of cabling
  • Weather prediction, computational biology, …
  • Massively parallel
    • Millions, or even tens of millions of “cores”

Challenges

Data races

  • Two requirements:
    1. Multiple tasks read/write same piece of data
    2. Final state depends on the interleaving
  • This is almost never what you want!
    • Interleaving is not under programmer control
    • Data race: result not under programmer control

Example

X := 0; X := 1      ||      Y := X
  • Final result depends on when Y := X is executed
    • Earlier: Y ends up 0
    • Later: Y ends up 1
  • Easy to see in small programs, harder in big programs
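  • The same example sketched in Rust: safe Rust won’t let two threads share a
    plain variable, so X becomes an atomic here. That rules out the data race
    itself, but the result still depends on the interleaving:
use std::sync::atomic::{AtomicI32, Ordering};
use std::thread;

static X: AtomicI32 = AtomicI32::new(0);

fn main() {
    let writer = thread::spawn(|| {
        X.store(0, Ordering::SeqCst); // X := 0
        X.store(1, Ordering::SeqCst); // X := 1
    });
    let reader = thread::spawn(|| X.load(Ordering::SeqCst)); // Y := X

    writer.join().unwrap();
    let y = reader.join().unwrap();
    println!("Y = {y}"); // prints 0 or 1, depending on the schedule
}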

“Heisenbugs”

  • Unpredictable behavior
  • Sometimes show up, sometimes don’t
  • Very hard to reproduce and debug

"We’re hitting this catastrophic bug every 3 months or so. Can you fix it?"

(In)famous race conditions

  • Many, many security vulnerabilities due to races
  • Therac-25 radiation therapy machine
    • Radiation overdoses seriously injured and killed patients
  • GE energy management system
    • Caused Northeast blackout of 2003
    • Two day outage, more than 50 million affected

Feeding the cores

  • When some steps get faster, bottlenecks shift
    • “Amdahl’s law”: overall speedup is limited by the part that stays serial
  • How to effectively use cores?
    • If they sit idle, waste time
    • 4 cores? 16 cores? 128 cores?
  • How to get data to where it is needed?
    • Communication takes time
  • How to synchronize?
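  • A back-of-the-envelope version of Amdahl’s law: if a fraction p of the work
    parallelizes across n cores, speedup is at most 1 / ((1 - p) + p/n). The
    numbers below are made up to show how quickly extra cores stop paying off:
fn amdahl_speedup(p: f64, n: f64) -> f64 {
    1.0 / ((1.0 - p) + p / n)
}

fn main() {
    // Even with 90% of the work parallel, 128 cores give under a 10x speedup.
    for n in [4.0, 16.0, 128.0] {
        println!("{n:>5} cores: {:.2}x", amdahl_speedup(0.9, n));
    }
}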

Compiler and hardware are out to get you

  • Compiler may reorder instructions to optimize
  • Hardware also reorders instructions to go fast
  • The rules for which reorderings are OK are… not clear
    • Formally captured by a “memory model”
  • Most languages use the “C11 memory model”
    • C11 memory model not really formalized
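  • A sketch of where this surfaces in Rust (whose atomics follow the C11/C++
    model): every atomic access names an ordering, and the release/acquire pair
    below is what keeps the write to DATA from being reordered past the READY flag:
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        READY.store(true, Ordering::Release); // publish DATA
    });
    let consumer = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) {} // busy-wait, to keep the sketch short
        assert_eq!(DATA.load(Ordering::Relaxed), 42); // guaranteed by acquire/release
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}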

Fork-join model

Express parallelism in PL

  • Programmer knows something about the data
  • Can help compiler decide how to divide tasks
  • Indicate parts that can safely be done in parallel

Forking

  • Parent thread spawns child to execute some function
  • Parent thread doesn’t wait for child, keeps going
  • Child executes, hopefully in parallel with parent

Joining

  • Wait for another thread to finish before continuing
    • “Block on another thread”
  • Example: wait until all child tasks are done
    • Need to synchronize to collect results
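  • A minimal fork-join sketch with plain std threads (the workloads are
    placeholders), before looking at Rayon:
use std::thread;

fn main() {
    // Fork: parent spawns a child and keeps going without waiting for it.
    let child = thread::spawn(|| (1..=100u32).sum::<u32>());

    println!("parent keeps doing its own work here");

    // Join: block until the child finishes, then collect its result.
    let child_sum = child.join().unwrap();
    println!("child computed {child_sum}"); // 5050
}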

In Rust: Rayon

Data parallelism crate

  • Name refers to Cilk: C/C++ parallel extensions
  • Simple interface to write data-parallel stuff
  • Often: change into_iter to into_par_iter

Example: sequential

  • Suppose we have:
    • List of a bunch of shops
    • List of products we care about
  • Want to compute: sum of prices across all stores
let total = shops.iter()
                 .map(|store| store.compute_price(&products))
                 .sum();

Example: parallel

  • Using Rayon: easy to make this computation parallel
  • Each task shares products, but read-only: no races!
use rayon::prelude::*;  // brings par_iter() into scope
let total = shops.par_iter()
                 .map(|store| store.compute_price(&products))
                 .sum();

Building block: join

  • Run two closures in parallel, wait until done
fn quick_sort<T:PartialOrd+Send>(v: &mut [T]) {
   if v.len() > 1 {
       let mid = partition(v);  // pick pivot index, partition v
       let (lo, hi) = v.split_at_mut(mid);
       rayon::join(|| quick_sort(lo),
                   || quick_sort(hi));
   }
}

Ref rules: prevent races

  • Can only have one mutable ref to data at a time
  • Can’t have mutable and immutable refs at same time
fn quick_sort<T:PartialOrd+Send>(v: &mut [T]) {
   if v.len() > 1 {
       let mid = partition(v);  // pick pivot index, partition v
       let (lo, hi) = v.split_at_mut(mid);
       rayon::join(|| quick_sort(lo),
                    || quick_sort(lo));  // <-- oops: rejected, both closures need mutable access to `lo`
   }
}

Under the hood

  • Program suggests parallelism, but doesn’t require it
  • Library free to decide when and where to execute
  • Goal: balance out work among all cores
  • Work-stealing parallelism
    • Each worker keeps its own queue of pending tasks
    • Workers that run out of work steal tasks from other queues

Concurrency

What is concurrency?

  • Tasks can make progress over overlapping periods
  • Concurrency is a property of two things:
    1. Low-level execution (hardware level)
    2. High-level concept of a “task” (PL level or higher)

Not same as parallelism

  • Concurrency without parallelism
    • Multiple threads on a single core processor
    • Each task is a thread, tasks overlap in time
  • Parallelism without concurrency
    • SIMD parallelism: single instruction, multiple data
    • One task, operate on multiple data at same time

Why concurrency?

  • Tasks are a useful abstraction for programmers
    • Natural way to organize systems, group code
    • Threads for the UI, listening to the network, writing files
  • Don’t need to manually specify interleaving
    • Programmer usually can’t plan interleaving
    • “Run whatever is ready, I don’t care what order”

Challenges

Interleaving execution

  • Scheduler decides which task to run, for how long
    • Actual execution switches rapidly between tasks
  • Scheduler is not controlled by the programmer
    • Tasks can be paused, restarted at any time
    • Order may appear non-deterministic, random
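  • A quick way to see this (the loop bounds are arbitrary): run the sketch
    below a few times, and the two threads’ output may interleave differently
    from run to run:
use std::thread;

fn main() {
    let a = thread::spawn(|| {
        for i in 0..5 {
            println!("thread A: step {i}");
        }
    });
    let b = thread::spawn(|| {
        for i in 0..5 {
            println!("thread B: step {i}");
        }
    });
    a.join().unwrap();
    b.join().unwrap();
}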

Hard to think about!

  • One thread with 100 instructions
    • One possible ordering
  • 5 threads with 100 instructions each
    • Roughly 10^344 possible orderings
  • 1000 threads with 100,000 instructions
    • A whole lot of possible orderings
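  • Where these counts come from (a rough check, not lecture code): the
    interleavings of k threads with n instructions each number
    (k·n)! / (n!)^k, which we can size up in powers of ten:
fn log10_factorial(n: u64) -> f64 {
    (2..=n).map(|i| (i as f64).log10()).sum()
}

fn log10_interleavings(threads: u64, instrs: u64) -> f64 {
    log10_factorial(threads * instrs) - threads as f64 * log10_factorial(instrs)
}

fn main() {
    println!("2 threads x 100 instructions: ~10^{:.0}", log10_interleavings(2, 100)); // ~10^59
    println!("5 threads x 100 instructions: ~10^{:.0}", log10_interleavings(5, 100)); // ~10^344
}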

If even one interleaving has a bug, the whole program has a bug

Concurrency bugs: bad

  • Can be intermittent
    • Sometimes there, sometimes not
  • May be very rare, but still serious
    • Every 7 months, system wipes all files
  • Very hard to reproduce
    • Don’t know which interleaving caused bug