# Are You Ready for 1000-Way Parallelism on a Single Chip?

Andreas Olofsson, CODEMESH 2013

adapteva

## Who needs more Performance?



- Adaptive Cruise Control
- Cross Traffic Alert
- Blind Spot Detection






















# The Magical Cloud is NOT The solution!



Network latency and bandwidth kill all hope of real-time operation





Radio Transmission Burns Even More Power Than Processing

Sometimes it is not practical to be connected at all times



Locality, locality, locality...



# **The Free Lunch is Over!**





1. Scale Frequency



# 10 Trends that Will Shape the Future of Computing



# Any Reason to Think the Future of Computing is NOT Parallel?

- No Electronic Computing: -1943
- "Von Neumann Age" Serial Computing: 1943-2013?
- Parallel Computing: 2013-??





HELLO DAVE

# How to Scale?

Reduce shared and critical resources to zero (true for SW and HW)





# **A Brief History of Parallel Computing**



(1984)

- Ambric
- Asocs
- Aspex
- Axis Semi
- BOPS
- Boston Circuits
- Brightscale
- Chameleon
- Clearspeed

- Cognivue
- Coherent Logix
- CPUtech
- Cradle
- Cswitch
- ElementCXI
- Greenarrays
- Inmos
- Intellasys



- Icera
- Intrinsity
- IP-flex
- Mathstar
- Morphics
- Movidius
- Octasic

- PACT
- Picochip
- Plurality
- Quicksilver
- Rapport
- Recore
- Sandbridge
- SiByte



### Teraflop (Intel) (2007)

- SiCortex
- Silicon Hive
- Spiral Gateway
- Stream Processors


- Stretch
- Venray
- Xelerated
- XMOS
- Zililabs



# **Pragmatic Architecture Tradeoffs**

# IN

- Shared memory architecture
- Dual issue RISC processors
- 64 entry register file
- 32-128KB per core memory
- Multi-banked local memory
- Packet-based mesh NoC
- 32-bit IEEE float / int
- Memory protection
- Timers, Interrupts, DMAs

# OUT

- Special purpose instructions
- Hardware caching
- SIMD
- Optimized read accesses
- Memory management unit
- Strict memory ordering

# **Key Epiphany Software Considerations**

- Critical code and data must sit in core's local memory (<32KB)
- Optimized (but not restricted) to be a coprocessor
- Flat globally shared memory (NUMA) map (SRAM not cache)
- Row, column based "physical world" address mapping
- Remote writes MUCH faster than remote reads
- Off-chip bandwidth is VERY expensive
- On-chip core-to-core communication is cheap and plentiful
- Beware of read/write remote memory access races
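
The row, column based address mapping above can be illustrated with a small sketch. The bit layout used here (6-bit row, 6-bit column, 20-bit local offset in a 32-bit address) is an assumption that is not stated on the slides, and the function names are hypothetical; the point is only that a core's mesh coordinate sits in the upper address bits, so a remote write is an ordinary store into another core's window.

```python
# Sketch of a flat, row/column-based global address map.
# ASSUMPTION (not from the slides): a 32-bit address is laid out as
# 6-bit row | 6-bit column | 20-bit offset into that core's local window.
ROW_BITS = 6
COL_BITS = 6
OFFSET_BITS = 20


def global_address(row, col, offset):
    """Pack a mesh coordinate and a local offset into one flat address."""
    if not (0 <= row < (1 << ROW_BITS) and 0 <= col < (1 << COL_BITS)):
        raise ValueError("core coordinate out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset outside the per-core window")
    core_id = (row << COL_BITS) | col
    return (core_id << OFFSET_BITS) | offset


def decode_address(addr):
    """Split a flat address back into (row, col, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    core_id = addr >> OFFSET_BITS
    return core_id >> COL_BITS, core_id & ((1 << COL_BITS) - 1), offset


# Address of offset 0x2000 inside the core at row 32, column 9.
addr = global_address(row=32, col=9, offset=0x2000)
print(hex(addr))
print(decode_address(addr))
```

Because the coordinate is part of the address, a neighbour in the mesh differs only in the row or column field, which is what makes the mapping "physical world" friendly.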

# **Apples and Oranges...**





100 Epiphany CPU cores fit in the space of one Haswell CPU core!



AMD Jaguar 3.1mm<sup>2</sup>

ARM A15 1.62mm<sup>2</sup>

ARM A7 0.45mm<sup>2</sup>

Epiphany 0.13mm<sup>2</sup>

Intel Haswell 14.5mm<sup>2</sup>

Intel Atom 5.6mm<sup>2</sup>
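
As a quick check on the "100 cores in one Haswell core" claim, the ratio can be worked out from the core areas quoted on this slide. This is only arithmetic on the slide's numbers; the dictionary and function names are hypothetical.

```python
# Core areas in mm^2, as quoted on the slide.
core_area_mm2 = {
    "Intel Haswell": 14.5,
    "Intel Atom": 5.6,
    "AMD Jaguar": 3.1,
    "ARM A15": 1.62,
    "ARM A7": 0.45,
    "Epiphany": 0.13,
}


def epiphany_cores_per(name):
    """How many Epiphany cores fit in the area of one core of the given type."""
    return core_area_mm2[name] / core_area_mm2["Epiphany"]


for name in core_area_mm2:
    print(f"{name:14s} {epiphany_cores_per(name):7.1f} Epiphany cores")
```

By these figures a Haswell core covers the area of roughly 111 Epiphany cores, so "100" is a rounded-down version of the ratio.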

# **Processor Comparison**

| Technology     | FPGA   | DSP        | GPU   | CPU    | Manycore | Manycore |
|----------------|--------|------------|-------|--------|----------|----------|
| Process        | 28nm   | 40nm       | 28nm  | 32nm   | 28nm     | 28nm     |
| Programming    | VHDL   | C++/Asmbly | CUDA  | C/C++  | C/C++    | C/C++    |
| Area (mm^2)    | 590    | 108        | 294   | 216    | 10       | 130      |
| Price          | \$6900 | \$200      | \$499 | \$285  | TBD      | TBD      |
| Chip Power (W) | 40     | 22         | 135   | 130    | 2        | 25       |
| CPUs           | n/a    | 8          | 32    | 4      | 64       | 1024     |
| Max GFLOPS     | 1500   | 160        | 3000  | 115    | 102      | 2048     |
| Max GMACS      | 3000   | 320        | n/a   | n/a    | 51.2     | 1024     |
| GHz * Cores    | n/a    | 12         | 16    | 14.4   | 51.2     | 1024     |
| L1 Memory      | 6MB    | 512KB      | 2.5MB | 256KB  | 2MB      | 32MB     |
| L2 Memory      | n/a    | 4MB        | 512KB | 1024KB | n/a      | n/a      |
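
One way to read the table is peak GFLOPS per watt. The sketch below derives that figure from the "Max GFLOPS" and "Chip Power (W)" rows only; the labels for the two manycore columns are taken from their core counts, and peak numbers say nothing about sustained performance on a real workload.

```python
# (max GFLOPS, chip power in W) taken from the comparison table.
chips = {
    "FPGA": (1500, 40),
    "DSP": (160, 22),
    "GPU": (3000, 135),
    "CPU": (115, 130),
    "Manycore (64 cores)": (102, 2),
    "Manycore (1024 cores)": (2048, 25),
}

# Peak efficiency: GFLOPS divided by watts.
gflops_per_watt = {
    name: gflops / watts for name, (gflops, watts) in chips.items()
}

for name, value in sorted(gflops_per_watt.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {value:6.2f} GFLOPS/W")
```

On these numbers the two manycore columns come out ahead of the FPGA, GPU, DSP, and CPU columns in peak efficiency, which is the comparison the slide appears to be setting up.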



#### **Epiphany: A Truly Scalable Architecture**

A Single Unified Instruction Set Architecture!

[Figure: GFLOPS versus number of Epiphany CPU cores (axis ticks from 4 to 4096), with labeled points 64 / 0.35W, 256 / 1.4W, 1,024 / 5.7W, 4,096 / 23W, and 16,384 / 92W.]



# Think 32KB is too small...no problem?

| Memory per core | IEEE float | Cores | Array Size (mm^2) | Frequency |
|-----------------|------------|-------|-------------------|-----------|
| 32KB            | SPF        | 16    | 2.45              | 800MHz    |
| 64KB            | SPF        | 16    | 3.2               | 700MHz    |
| 64KB            | DPF        | 16    | 3.5               | 600MHz    |
| 128KB           | DPF        | 16    | 4.9               | 600MHz    |
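
The trade-off in this table can be made concrete by computing total on-chip memory and area per core for each configuration. This is a sketch using only the table's values; the field names are hypothetical.

```python
# (memory per core in KB, float type, cores, array area in mm^2, MHz)
configs = [
    (32, "SPF", 16, 2.45, 800),
    (64, "SPF", 16, 3.2, 700),
    (64, "DPF", 16, 3.5, 600),
    (128, "DPF", 16, 4.9, 600),
]

rows = []
for mem_kb, fp, cores, area, mhz in configs:
    total_kb = mem_kb * cores
    rows.append({
        "config": f"{mem_kb}KB {fp}",
        "total_kb": total_kb,                # memory across the whole array
        "area_per_core": area / cores,       # mm^2 per core
        "kb_per_mm2": total_kb / area,       # memory density of the array
        "mhz": mhz,
    })

for r in rows:
    print(f"{r['config']:10s} {r['total_kb']:5d}KB total  "
          f"{r['area_per_core']:.3f} mm^2/core  "
          f"{r['kb_per_mm2']:.0f} KB/mm^2  {r['mhz']}MHz")
```

Going from 32KB to 128KB per core quadruples total memory while the array area doubles, at the cost of a lower clock frequency in the table's figures.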



# How the \$#@% Do We Program This Thing?



# **Parallel Possibilities with Epiphany**

- Run a different program on each core
- Run the same program on each core (SPMD)
- Look for natural data tiling (in images, packets etc)
- Pipeline functions across different cores (send/receive)
- Keep the data in place and move functions around the chip
- Use a host to send programs to the Epiphany coprocessor
- Use a framework like OpenCL
- Fork/join threads from an Epiphany core
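
Two of the options above, SPMD and natural data tiling, can be sketched together. This is an illustration only: Python threads stand in for cores, the 2x2 grid and all function names are hypothetical, and the image dimensions are assumed to divide evenly by the grid. It is not Epiphany SDK code.

```python
from concurrent.futures import ThreadPoolExecutor

ROWS, COLS = 2, 2  # hypothetical 2x2 grid of "cores"


def make_image(height, width):
    """Build a small test image whose pixel value is its linear index."""
    return [[y * width + x for x in range(width)] for y in range(height)]


def tile_for(image, row, col):
    """Return the sub-block of the image owned by core (row, col)."""
    tile_h = len(image) // ROWS
    tile_w = len(image[0]) // COLS
    lines = image[row * tile_h:(row + 1) * tile_h]
    return [line[col * tile_w:(col + 1) * tile_w] for line in lines]


def kernel(tile):
    """The same program runs on every core; only the data differs."""
    return sum(sum(line) for line in tile)


def run_spmd(image):
    """Host side: hand each core its tile, then join the partial results."""
    coords = [(r, c) for r in range(ROWS) for c in range(COLS)]
    with ThreadPoolExecutor(max_workers=len(coords)) as pool:
        partials = list(
            pool.map(lambda rc: kernel(tile_for(image, *rc)), coords)
        )
    return dict(zip(coords, partials))


image = make_image(4, 4)
partials = run_spmd(image)
total = sum(partials.values())
print(partials, total)
```

Each worker touches only its own tile, so no shared state is needed until the final join, which is the property that lets this pattern scale with core count.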



# **Parallel Programming Frameworks**

| Erlang    | SystemC  | Intel TBB | Co-Fortran | Lisp     | Janus    |
|-----------|----------|-----------|------------|----------|----------|
| Scala     | Haskell  | Pragmas   | Fortress   | Hadoop   | Linda    |
| Smalltalk | CUDA     | Clojure   | UPC        | PVM      | Rust     |
| Julia     | OpenCL   | Go        | X10        | Posix    | XC       |
| Occam     | OpenHMPP | ParaSail  | APL        | Simulink | Charm++  |
| Occam-pi  | OpenMP   | Ada       | Labview    | Ptolemy  | StreamIt |
| Verilog   | OpenACC  | C++Amp    | Rust       | Sisal    | Star-P   |
| VHDL      | Cilk     | Chapel    | MPI        | MCAPI    | ??       |

# **The Parallella Project**

### "Raspberry Pi for parallel"

- 16/64-core Epiphany CPU
- Dual-core A9 ARM SoC with FPGA
- 1GB RAM
- Ethernet, USB, HDMI, etc
- Linux/Android OS
- Credit card form factor
- 5 Watts!
- Open Source SDKs
- ~6,500 backers/customers
- \$99-\$199
- Visit parallella.org







## **Epiphany Roadmap**



# **Recommendations/Predictions**

- No easy fix, need to rewrite whole stack for massive parallelism
  - Get ready for explicit memory management
  - Software should be processor agnostic
  - Know where your bits are stored
- Hardware will fail, plan accordingly
- 1K core chips possible today and we will have 16K cores with 1MB/core by 2020 ... get ready!
