#### Designing DNN Accelerators

**Qijing Jenny Huang** 

# Outline

- Deep Neural Network (DNN) Basics
- DNN Accelerators
- High-Level Synthesis (HLS)

# **DNN** Basics

# Learning from the Brain



- The basic computational unit of the brain is a neuron
  - 86B neurons in the brain
- Neurons are connected by roughly 10<sup>14</sup> to 10<sup>15</sup> synapses
- Neurons receive input signals through **dendrites** and produce output signals along the **axon**, which interacts with the dendrites of other neurons via **synaptic weights**
- Synaptic weights are learnable and control the strength of influence

\* Slide from http://cs231n.github.io/

#### **Neural Networks**



- NNs are usually feed-forward computational graphs constructed from many computational "neurons"
- The "neurons":
  - Integrate: typically a linear transform (dot product over the receptive field)
  - Fire: apply a non-linear "activation" function to the result

\* Slide from http://cs231n.github.io/

#### **Deep Neural Networks (DNN)**

- A neural network with multiple layers between the inputs and outputs



#### **DNN Examples**



AlexNet 2012 (8 layers)

GoogLeNet 2014 (22 layers)

ResNet 2015 (152 layers)



DenseNet 2016 (dense connections)



DLA 2017 (deep aggregation)



NasNet 2017 (NAS design)

## **Training vs. Inference**



#### Training (supervised)

Process for a machine to learn by optimizing models (weights) from labeled data.

#### Inference

Using trained models to predict or estimate outcomes from new inputs.

#### **DNN Applications**



Autonomous Vehicles



Security Camera



Drones



Medical Imaging



Robots



**Mobile Applications** 

## **Computer Vision (CV) Tasks**



Image Classification



**Object Detection** 



Semantic Segmentation



Super Resolution



Activity Recognition

#### Natural Language Processing (NLP) Tasks



#### **Many Other Tasks**

- Recommendation Systems (DLRM)
- Machine Translation (Transformer and GNMT)
- Deep Reinforcement Learning (AlphaGo)

## **DNN Evaluation Metrics**

1. Accuracy
2. Computational Complexity
3. Model Size



# **DNN Accelerators**

# **Many AI Chips**

In the Cloud (Training + Inference)

- 10s TFLOPs
- 10s MB on-chip memory
- 8 to 32 bit precision
- 700 MHz to 1 GHz
- 10-100s Watts



Cloud TPU v3 (45 TFLOP/s)

At the Edge (Inference)

- 100s-1000s GFLOPs
- 100s KB on-chip memory
- 1 to 16 bit precision
- 50 MHz to 400 MHz
- 1-10s Watts



Intel Movidius (4 TFLOP/s)

In the Edge SoC/SiP (Inference)

More than 112 AI chip companies worldwide (https://github.com/basicmi/AI-Chip)

- 10s-1000s GFLOPs
- 100s KB on-chip memory
- 1 to 16 bit precision
- 600 MHz to 1 GHz
- 10-100s mWatts



Cambricon-1M IP

\* Data adapted from Prof. Kurt Keutzer's talk at DAC 2018



\* Image from https://www.electronicproducts.com/Digital\_ICs/Designer\_s\_Guide\_Selecting\_AI\_chips\_for\_embedded\_designs.aspx

#### **Accelerator Evaluation Metrics**

1. Throughput
   - Frames per second
2. Latency
   - Time to finish one frame
3. Power
4. Energy
5. Hardware Cost
   - Resource utilization





https://mlperf.org/

#### Example Hardware Comparison

| #  | Metric | Google<br>TPU v3 | Nvidia<br>V100 | Nvidia<br>A100 | Cerebras<br>WSE | GraphCore<br>IPU1 | GraphCore<br>IPU2 |
|----|--------|------------------|----------------|----------------|-----------------|-------------------|-------------------|
| 1  | Technology node | >12nm<br>(16 nm est.) | TSMC 12 nm | TSMC 7 nm | TSMC 16 nm | TSMC 16 nm | TSMC 7 nm |
| 2  | Die Area (mm<sup>2</sup>) | <648 (600 est.) | 815 | 826 | 46225 | 900 (est.) | 823 |
| 3  | Transistor Count (B) | 11 (est.) | 21 | 54.2 | 1200 | 23.6 | 59.4 |
| 4  | Architecture | Systolic Array | SIMD | SIMD | SIMD | SIMD | SIMD |
| 5  | Theoretical TFLOPS (16-bit mixed precision) | 123 | 125 | 312 | 2500 | 125 | 250 |
| 6  | Freq (GHz) | 0.92 | 1.5 | 1.4 | Unknown | 1.6 | Unknown |
| 7  | DRAM Capacity (GB) | 32 | 32 | 80 | N/A | N/A | 112 |
| 8  | DRAM BW (GB/sec) | 900 | 900 | 2039 | N/A | N/A | 64 (est.) |
| 9  | Total SRAM Capacity | 32 MB | 36 MB<br>(RF+L1+L2) | 87 MB<br>(RF+L1+L2) | 18 GB | 300 MB | 900 MB |
| 10 | SRAM BW (TB/sec) | Unknown | 224 @RF +<br>14 @L1 +<br>3 @L2 | 608 @RF +<br>19 @L1 +<br>7 @L2 | 9000 | 45 | 47.5 |
| 11 | Max TDP (Watts) | 450 | 450 | 400 | 20K | 150 | 150 (est.) |
| 12 | GEMM Achievable TFLOPS | 98%<br>(120 TFLOPS) | 88%<br>(110 TFLOPS) | 93%<br>(290 TFLOPS) | Unknown | 47%<br>(58 TFLOPS) | 61%<br>(154 TFLOPS) |
| 13 | Energy Efficiency (Achievable<br>GEMM TFLOPS/Max Watts) | 0.26 | 0.24 | 0.72 | Unknown | 0.39 | 1.0 |
| 14 | Theoretical Energy Efficiency<br>(Theoretical TFLOPS/Max Watts) | 0.27 | 0.27 | 0.78 | 0.125 | 0.83 | 1.6 |
| 15 | Memory Capacity (GB) | 16 | 32 | 80 | 18 | 0.3 | 112 |
| 16 | Memory Efficiency<br>(FLOP/DRAMByte) | 133 | 122 | 158 | N/A | N/A | Unknown |
| 17 | Memory Efficiency<br>(FLOP/SRAMByte) | Unknown | 32 | 35 | Unknown | 1.28 | 3.2 |
| 18 | Area Efficiency<br>(Achievable TFLOPS/mm<sup>2</sup>) | 0.2 | 0.13 | 0.35 | Unknown | 0.06 | 0.17 |
| 19 | Area Efficiency<br>(Achievable TFLOPS/BTran) | 11 | 5.2 | 5.3 | Unknown | 2.5 | 2.6 |

\* Table from https://khairy2011.medium.com/tpu-vs-gpu-vs-cerebras-vs-graphcore-a-fair-comparison-between-ml-hardware-3f5a19d89e38

#### How to design your own DNN accelerator?

1. Understand the basic operations

#### **Common DNN Operations**

- Convolution (Groupwise, Dilated, Transposed, 3D, etc.)
- ReLU
- Pooling (Average, Max)
- Fully-Connected
- Batch Normalization

#### **Activation/Feature Maps**

- Input images have three dimensions with RGB channels
- Intermediate data might have more channels after performing convolution
- We refer to them as feature maps



#### Weights/Kernels

- Weights for a full convolution typically have four dimensions:
  - input channels, width, height, output channels
- The input channel size matches the channel dimension of the input features
- The output channel size specifies the channel dimension of the output features



## **3x3 Convolution - Spatially**





Output feature map

Input feature map

- 3x3 Conv with No Stride, No Padding
- Weights = [[0, 1, 2], [2,2,0], [0,1,2]]





Output feature map

Input feature map

- 3x3 Conv with Stride 2, Padding 1
- Weights = [[2, 0, 1], [1,0,0], [0,1,1]]

 $O_{00} = I_{00} \times W_{00} + I_{01} \times W_{01} + I_{02} \times W_{02} + I_{10} \times W_{10} + I_{11} \times W_{11} + I_{12} \times W_{12} + I_{20} \times W_{20} + I_{21} \times W_{21} + I_{22} \times W_{22}$ 

\* GIF from http://deeplearning.net/software/theano\_versions/dev/\_images/
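The spatial convolution on these slides can be sketched as a direct loop nest. This is a minimal single-channel illustration, not the lecture's implementation: the function name `conv2d`, the `Matrix` alias, and the zero-padding convention are assumptions made for the example.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Direct 2D convolution (cross-correlation, as is conventional in DNNs) of a
// single-channel input with a K x K kernel. Pixels outside the input read as 0.
Matrix conv2d(const Matrix& in, const Matrix& w, int stride, int pad) {
    assert(stride >= 1 && pad >= 0);
    const int H = static_cast<int>(in.size());
    const int W = static_cast<int>(in[0].size());
    const int K = static_cast<int>(w.size());
    const int OH = (H + 2 * pad - K) / stride + 1;  // output height
    const int OW = (W + 2 * pad - K) / stride + 1;  // output width
    Matrix out(OH, std::vector<int>(OW, 0));
    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox)
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx) {
                    const int iy = oy * stride + ky - pad;
                    const int ix = ox * stride + kx - pad;
                    if (iy >= 0 && iy < H && ix >= 0 && ix < W)
                        out[oy][ox] += in[iy][ix] * w[ky][kx];
                }
    return out;
}
```

Calling `conv2d(input, weights, 1, 0)` corresponds to the first slide (stride 1, no padding), and `conv2d(input, weights, 2, 1)` to the second (stride 2, padding 1).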

#### 3x3 Convolution - 3D



# **Fully-Connected Layer (FC)**

- Each input activation is connected to every output activation
- Essentially a matrix-vector multiplication
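As a rough sketch of the matrix-vector view, the snippet below computes one fully-connected layer without a bias term. The name `fully_connected` and the row-per-output weight layout are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = W * x, where row o of W holds the weights that connect every input
// activation to output activation o.
std::vector<float> fully_connected(const std::vector<std::vector<float>>& W,
                                   const std::vector<float>& x) {
    std::vector<float> y(W.size(), 0.0f);
    for (std::size_t o = 0; o < W.size(); ++o) {
        assert(W[o].size() == x.size());
        for (std::size_t i = 0; i < x.size(); ++i)
            y[o] += W[o][i] * x[i];  // one multiply-accumulate per connection
    }
    return y;
}
```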





#### **ReLU Activation Function**

- Implements the concept of "Firing"
- Introduces non-linearity
- Rectified Linear Unit
  - $R(z) = \max(0, z)$
- Not differentiable at 0



### **Batch Normalization (BN)**

- Shifts and scales activations to achieve a <u>zero-centered distribution with unit variance</u>

- Subtracts mean
- Divides by standard deviation
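The two steps above can be sketched for a single channel as follows. This is only the normalization itself; the function name `batch_norm` and the small constant `eps` (added to avoid division by zero) are assumptions, and any learned scale or shift parameters are left out.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Normalizes the activations of one channel to zero mean and approximately
// unit variance. eps is an assumed small constant guarding the division.
std::vector<double> batch_norm(const std::vector<double>& x, double eps = 1e-5) {
    assert(!x.empty());
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= static_cast<double>(x.size());

    double var = 0.0;
    for (double v : x) var += (v - mean) * (v - mean);
    var /= static_cast<double>(x.size());

    const double inv_std = 1.0 / std::sqrt(var + eps);
    std::vector<double> y;
    y.reserve(x.size());
    for (double v : x) y.push_back((v - mean) * inv_std);  // subtract mean, divide by std
    return y;
}
```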




\* images from https://en.wikipedia.org/wiki/Normal_distribution

### Pooling

- Downsamples feature maps
  - Max pooling takes the maximum of each window
  - Average pooling takes the average of each window
- Operates on each feature map independently
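A minimal sketch of max pooling on one feature map, assuming non-overlapping windows whose size divides the map dimensions (the name `max_pool` and this restriction are choices made for the example):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Map = std::vector<std::vector<int>>;

// Max pooling over non-overlapping P x P windows of a single feature map.
Map max_pool(const Map& in, int P) {
    assert(P >= 1 && in.size() % P == 0 && in[0].size() % P == 0);
    const int OH = static_cast<int>(in.size()) / P;
    const int OW = static_cast<int>(in[0].size()) / P;
    Map out(OH, std::vector<int>(OW));
    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox) {
            int best = in[oy * P][ox * P];
            for (int dy = 0; dy < P; ++dy)
                for (int dx = 0; dx < P; ++dx)
                    best = std::max(best, in[oy * P + dy][ox * P + dx]);
            out[oy][ox] = best;
        }
    return out;
}
```

Average pooling follows the same loop structure with a running sum divided by `P * P` in place of the maximum.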



\* images from http://cs231n.github.io/convolutional-networks/


#### **Full DNN Example: AlexNet**



| Metric         | Value          |
|----------------|----------------|
| Top-1 Accuracy | 57.1%          |
| Top-5 Accuracy | 80.2%          |
| Model Size     | 61M parameters |
| MACs           | 725M           |
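Model size and MAC counts like those in the table are accumulated layer by layer. The sketch below counts both for a single convolution layer; the `ConvShape` struct and the helper names are hypothetical, and the layer shape used in the usage note (3 input channels, 96 filters of 11x11, 55x55 output) is an assumed description of AlexNet's first convolution that does not come from the slides.

```cpp
#include <cassert>
#include <cstdint>

// Shape of one convolution layer (hypothetical helper type for this example).
struct ConvShape {
    int in_ch;   // input channels
    int out_ch;  // output channels (number of filters)
    int kernel;  // kernel is kernel x kernel
    int out_h;   // output feature map height
    int out_w;   // output feature map width
};

// Weight count of the layer (biases ignored).
std::int64_t conv_params(const ConvShape& s) {
    return static_cast<std::int64_t>(s.out_ch) * s.kernel * s.kernel * s.in_ch;
}

// Multiply-accumulates: every weight is used once per output pixel.
std::int64_t conv_macs(const ConvShape& s) {
    return conv_params(s) * s.out_h * s.out_w;
}
```

With the assumed shape `ConvShape{3, 96, 11, 55, 55}`, this gives 34,848 weights and about 105M MACs for that one layer.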



#### How to design your own DNN accelerator?

1. Understand the basic operations
2. Analyze the workload

#### **The Roofline Model**



- $\pi$: the peak compute performance
- $\beta$: the peak bandwidth
- $I$: the arithmetic intensity
- The attainable throughput $P$:

$$P = \min \begin{cases} \pi \\ \beta \times I \end{cases}$$

- **Performance** is upper bounded by <u>the peak performance</u>, <u>the communication</u> <u>bandwidth</u>, and <u>the operational intensity</u>
- Arithmetic Intensity is the ratio of the compute to the memory traffic
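The roofline bound is simple enough to evaluate directly. The sketch below assumes consistent units (for example FLOP/s, byte/s, and FLOP/byte); the function names are illustrative, and `ridge_point` is an added helper for the intensity at which the two bounds meet.

```cpp
#include <algorithm>
#include <cassert>

// Attainable throughput P = min(pi, beta * I).
//   peak_compute (pi):   peak compute performance
//   peak_bw      (beta): peak memory bandwidth
//   intensity    (I):    arithmetic intensity (compute per unit of memory traffic)
double attainable_throughput(double peak_compute, double peak_bw, double intensity) {
    assert(peak_compute >= 0.0 && peak_bw >= 0.0 && intensity >= 0.0);
    return std::min(peak_compute, peak_bw * intensity);
}

// Intensity above which a workload is compute-bound rather than memory-bound.
double ridge_point(double peak_compute, double peak_bw) {
    assert(peak_bw > 0.0);
    return peak_compute / peak_bw;
}
```

A workload whose intensity is below the ridge point is limited by bandwidth, so adding compute units would not help it; above the ridge point, the peak compute rate is the limit.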

#### **The Roofline Model**



Figure from https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

#### How to design your own DNN accelerator?



### **Conv Mapping 1: Matrix-Matrix Multiplication**

- Im2Col stores in each column the necessary pixels for each kernel map
  - Duplicates input feature maps in memory
  - Restores output feature map structure
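The mapping can be sketched for the simplest case: one input channel, stride 1, no padding. The names `im2col` and `matmul` and the row-major flattening of the kernel are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// im2col for a single-channel H x W input and a K x K kernel (stride 1, no
// padding): column c holds the K*K input pixels under output position c.
// Overlapping windows mean most pixels are copied into several columns.
Matrix im2col(const Matrix& in, int K) {
    const int H = static_cast<int>(in.size());
    const int W = static_cast<int>(in[0].size());
    const int OH = H - K + 1, OW = W - K + 1;
    Matrix cols(K * K, std::vector<int>(OH * OW));
    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox)
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    cols[ky * K + kx][oy * OW + ox] = in[oy + ky][ox + kx];
    return cols;
}

// Plain matrix-matrix multiplication C = A * B.
Matrix matmul(const Matrix& A, const Matrix& B) {
    assert(A[0].size() == B.size());
    Matrix C(A.size(), std::vector<int>(B[0].size(), 0));
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t k = 0; k < B.size(); ++k)
            for (std::size_t j = 0; j < B[0].size(); ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}
```

Each kernel is flattened into one row of the left matrix, so `matmul(kernels, im2col(input, K))` yields one row of outputs per kernel; reshaping a row to `OH x OW` restores the output feature map.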



#### **Im2col Transform**



\* from https://www.researchgate.net/publication/327070011\_Accelerating\_Deep\_Neural\_Networks\_on\_Low\_Power\_Heterogeneous\_Architectures

#### Image to column operation (im2col)

Slide over the input image as in a convolution, but turn each patch into a column vector.



## **Optimization: Winograd Algorithm**

**Winograd** performs convolution in a transformed domain to reduce the total number of multiplications.
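As a small illustration of the idea, the sketch below uses the one-dimensional Winograd form usually written F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications where the direct computation needs 6. The slides do not specify which variant they have in mind, so the choice of F(2,3) and the function name are assumptions.

```cpp
#include <array>
#include <cassert>

// Winograd F(2,3): y0 = d0*g0 + d1*g1 + d2*g2 and y1 = d1*g0 + d2*g1 + d3*g2.
std::array<float, 2> winograd_f2_3(const std::array<float, 4>& d,
                                   const std::array<float, 3>& g) {
    // Filter transform; it depends only on the weights, so it can be precomputed.
    const float u0 = g[0];
    const float u1 = (g[0] + g[1] + g[2]) * 0.5f;
    const float u2 = (g[0] - g[1] + g[2]) * 0.5f;
    const float u3 = g[2];
    // Input transform followed by the four data-dependent multiplications.
    const float m1 = (d[0] - d[2]) * u0;
    const float m2 = (d[1] + d[2]) * u1;
    const float m3 = (d[2] - d[1]) * u2;
    const float m4 = (d[1] - d[3]) * u3;
    // Output transform: additions only.
    return {m1 + m2 + m3, m2 - m3 - m4};
}
```

The saving comes at the cost of extra additions and of transformed weights, which is a trade-off an accelerator designer has to weigh against multiplier area.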

#### **GEMM Example:**



#### **Conv Mapping 2: Matrix-Vector Multiplication**

Input Channels (IC)

- For each pixel, we can first perform Matrix-Vector Multiplication along the input channel dimension
- Then we can use an adder tree to aggregate the sum of K x K pixels (K is the kernel size)
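The two steps can be sketched in software for one output pixel. The helper names (`matvec`, `adder_tree`, `conv_output_pixel`) and the data layout, one OC x IC weight slice per kernel tap, are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;
using Mat = std::vector<Vec>;

// Step 1: matrix-vector product along the input-channel dimension.
// W is OC x IC (weights of one kernel tap); x holds the IC values of one pixel.
Vec matvec(const Mat& W, const Vec& x) {
    Vec y(W.size(), 0);
    for (std::size_t o = 0; o < W.size(); ++o) {
        assert(W[o].size() == x.size());
        for (std::size_t c = 0; c < x.size(); ++c) y[o] += W[o][c] * x[c];
    }
    return y;
}

// Step 2: pairwise (tree-shaped) reduction of the K*K partial-sum vectors.
Vec adder_tree(std::vector<Vec> level) {
    assert(!level.empty());
    while (level.size() > 1) {
        std::vector<Vec> next;
        for (std::size_t i = 0; i + 1 < level.size(); i += 2) {
            Vec sum = level[i];
            for (std::size_t o = 0; o < sum.size(); ++o) sum[o] += level[i + 1][o];
            next.push_back(sum);
        }
        if (level.size() % 2 == 1) next.push_back(level.back());  // odd one passes through
        level = next;
    }
    return level[0];
}

// One output pixel: taps[t] is the weight slice of kernel tap t and
// pixels[t] is the input pixel that lies under that tap.
Vec conv_output_pixel(const std::vector<Mat>& taps, const std::vector<Vec>& pixels) {
    assert(taps.size() == pixels.size());
    std::vector<Vec> partial;
    for (std::size_t t = 0; t < taps.size(); ++t)
        partial.push_back(matvec(taps[t], pixels[t]));
    return adder_tree(partial);
}
```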



#### **Implementation: Systolic Array**

- **Systolic Array** is a homogeneous network of tightly coupled data processing units (DPUs).
- Each **DPU** independently computes a partial result as a function of the data received from its upstream neighbors, stores the result within itself and passes it downstream.
- Advantages of systolic array design:
  - Shorter wires -> lower propagation delay and lower power consumption
  - High degree of pipelining -> faster clock
  - High degree of parallelism -> high throughput
  - Simple control logic -> less design effort
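To make the data movement concrete, here is a cycle-by-cycle software model of a systolic array that computes a matrix product. It assumes an output-stationary arrangement (each processing element keeps its own accumulator, A values flow left to right, B values flow top to bottom); the slides do not fix a dataflow, so that choice and the name `systolic_matmul` are assumptions.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Models an R x C grid of processing elements computing A (R x M) * B (M x C).
Matrix systolic_matmul(const Matrix& A, const Matrix& B) {
    const int R = static_cast<int>(A.size());
    const int M = static_cast<int>(B.size());
    const int C = static_cast<int>(B[0].size());
    assert(static_cast<int>(A[0].size()) == M);
    Matrix a(R, std::vector<int>(C, 0));    // A operand register of each PE
    Matrix b(R, std::vector<int>(C, 0));    // B operand register of each PE
    Matrix acc(R, std::vector<int>(C, 0));  // accumulator of each PE
    for (int t = 0; t < M + R + C - 2; ++t) {
        // Every PE takes its operands from the upstream neighbours.
        for (int i = 0; i < R; ++i)
            for (int j = C - 1; j >= 1; --j) a[i][j] = a[i][j - 1];
        for (int i = R - 1; i >= 1; --i)
            for (int j = 0; j < C; ++j) b[i][j] = b[i - 1][j];
        // Edge PEs read new inputs, skewed by one cycle per row or column.
        for (int i = 0; i < R; ++i) a[i][0] = (t >= i && t < i + M) ? A[i][t - i] : 0;
        for (int j = 0; j < C; ++j) b[0][j] = (t >= j && t < j + M) ? B[t - j][j] : 0;
        // All PEs perform one multiply-accumulate in parallel.
        for (int i = 0; i < R; ++i)
            for (int j = 0; j < C; ++j) acc[i][j] += a[i][j] * b[i][j];
    }
    return acc;
}
```

The input skew is what lines up `A[i][k]` and `B[k][j]` at element `(i, j)` in the same cycle, using only nearest-neighbour connections.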



\* Images from http://www.telesens.co/2018/07/30/systolic-architectures/

#### **DNN Accelerator Design 1: Layer-based**



#### **DNN Accelerator Design 2: Spatially-mapped**



# **Line-Buffer Design**



- Buffers inputs to perform spatial operations
- Buffers inputs for reuse to improve the arithmetic intensity
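A line buffer can be modelled in software as a sketch like the one below: pixels arrive once in raster order, the previous K-1 rows are kept on chip, and a K x K window register slides along. The function name `stream_window_sums` and the use of a plain window sum in place of a weighted convolution are simplifications made for this example.

```cpp
#include <cassert>
#include <vector>

// Streams an H x W image through a (K-1)-row line buffer and a K x K window,
// emitting the sum of each complete window. Requires K >= 2.
std::vector<int> stream_window_sums(const std::vector<int>& pixels, int H, int W, int K) {
    assert(K >= 2 && static_cast<int>(pixels.size()) == H * W);
    std::vector<std::vector<int>> line(K - 1, std::vector<int>(W, 0));  // previous K-1 rows
    std::vector<std::vector<int>> win(K, std::vector<int>(K, 0));       // sliding window
    std::vector<int> out;
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            const int p = pixels[y * W + x];  // each pixel is fetched exactly once
            // Shift the window one column to the left.
            for (int r = 0; r < K; ++r)
                for (int c = 0; c + 1 < K; ++c) win[r][c] = win[r][c + 1];
            // New right column: buffered pixels from the rows above, then the new pixel.
            for (int r = 0; r + 1 < K; ++r) win[r][K - 1] = line[r][x];
            win[K - 1][K - 1] = p;
            // Update column x of the line buffer: drop the oldest row, keep the new pixel.
            for (int r = 0; r + 2 < K; ++r) line[r][x] = line[r + 1][x];
            line[K - 2][x] = p;
            if (y >= K - 1 && x >= K - 1) {
                int s = 0;
                for (int r = 0; r < K; ++r)
                    for (int c = 0; c < K; ++c) s += win[r][c];
                out.push_back(s);
            }
        }
    return out;
}
```

Each input pixel is read from external memory once but takes part in up to K x K windows, which is how the buffer raises arithmetic intensity.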

 \* Ritchie Zhao, et al. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17)

[Figure: line-buffer animation frames. Three buffered rows, (4, 2, 5, 6, 9), (1, 3, 8, 7, 3), and (6, 4, 2, 8, 1), stay fixed while new values (4, then 8) are written one at a time into a further row.]

### How to design your own DNN accelerator?







AlexNet Design



# **High-Level Synthesis (HLS)**

- Allows users to specify algorithm logic in high-level languages
  - No concept of clock
  - Not specifying register-transfer level activities
- HLS compiler generates RTL based on high-level algorithmic description
  - Allocation
  - Scheduling
  - Binding
- Advantages:
  - Faster development and debugging cycles
  - More structural code
  - Focuses on larger architecture design tradeoffs

#### **HLS Abstraction**

- High-level Languages
  - C/C++, OpenCL, GoLang
- Typical hardware mapping
  - C Function -> Verilog Module
  - Function Arguments -> Memory Ports
  - Basic Blocks (blocks without branches) -> Hardware Logic
  - Operators -> Functional Units
  - Arrays -> BRAMs
  - Control Flow Graph (CFG) -> Finite-state Machine (FSM)
- Limitations:
  - No dynamic memory allocation allowed
  - No recursion support

#### **Example: Matrix Multiplication**

#### **Step 1: Partition Local Arrays**

```
// Local memory to store input and output matrices
int localA[MAX_SIZE][MAX_SIZE];
#pragma HLS ARRAY_PARTITION variable=localA dim=1 complete

int localB[MAX_SIZE][MAX_SIZE];
#pragma HLS ARRAY_PARTITION variable=localB dim=2 complete

int localC[MAX_SIZE][MAX_SIZE];
#pragma HLS ARRAY_PARTITION variable=localC dim=0 complete
```

#### **Step 2: Design Systolic Array (Implicit)**

```
systolic1: for(int k = 0; k < a_col; k++) {
#pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
#pragma HLS PIPELINE II=1
    systolic2: for(int i = 0; i < MAX_SIZE; i++) {
        systolic3: for(int j = 0; j < MAX_SIZE; j++) {
            // Get previous sum
            int last = (k==0) ? 0 : localC[i][j];

            // Update current sum
            // Handle boundary conditions
            int a_val = (i < a_row && k < a_col)? localA[i][k] : 0;
            int b_val = (k < b_row && j < b_col)? localB[k][j] : 0;
            int result = last + a_val*b_val;

            // Write back results
            localC[i][j] = result;
        }
    }
}
```

#### **Step 2: Design Systolic Array (Explicit)**

```
// N, ii, and jj are defined outside this snippet
for (int r = 0; r < N + 2 * MAX_SIZE - 2; r++) {
#pragma HLS pipeline
    // update data (i.e., reads data from previous PE)
    for (int i = 0; i < MAX_SIZE; i++)
        for (int j = MAX_SIZE - 1; j >= 1; j--)
            localA[i][j] = localA[i][j - 1];
    for (int i = MAX_SIZE - 1; i >= 1; i--)
        for (int j = 0; j < MAX_SIZE; j++)
            localB[i][j] = localB[i - 1][j];
    // read new data from inputs
    // not ok here!
    for (int i = 0; i < MAX_SIZE; i++) {
        if (r >= i && r < i + N)
            localA[i][0] = A[i + ii * MAX_SIZE][r - i];
        else
            localA[i][0] = 0;
    }
    for (int j = 0; j < MAX_SIZE; j++) {
        if (r >= j && r < j + N)
            localB[0][j] = B[r - j][j + jj * MAX_SIZE];
        else
            localB[0][j] = 0;
    }
    // PE
    for (int i = 0; i < MAX_SIZE; i++)
        for (int j = 0; j < MAX_SIZE; j++)
            C[i + ii * MAX_SIZE][j + jj * MAX_SIZE] += localA[i][j] * localB[i][j];
}
```

#### **Step 3: Schedule Outer Loop Control Logic and Memory Accesses**

```
// Burst reads on input matrices from global memory
// Read Input A
readA: for(int loc = 0, i = 0, j = 0; loc < a_row*a_col; loc++, j++) {
#pragma HLS LOOP_TRIPCOUNT min=c_size*c_size max=c_size*c_size
#pragma HLS PIPELINE II=1
    if(j == a_col) { i++; j = 0; }
    localA[i][j] = a[loc];
}
// Read Input B
readB: for(int loc = 0, i = 0, j = 0; loc < b_row*b_col; loc++, j++) {
#pragma HLS LOOP_TRIPCOUNT min=c_size*c_size max=c_size*c_size
#pragma HLS PIPELINE II=1
    if(j == b_col) { i++; j = 0; }
    localB[i][j] = b[loc];
}
// Burst write from output matrices to global memory
// Burst write from matrix C
writeC: for(int loc = 0, i = 0, j = 0; loc < c_row*c_col; loc++, j++) {
#pragma HLS LOOP_TRIPCOUNT min=c_size*c_size max=c_size*c_size
#pragma HLS PIPELINE II=1
    if(j == c_col) { i++; j = 0; }
    c[loc] = localC[i][j];
}
```

\* Please see the <u>SDAccel page</u> for detailed source code

#### Resources

- EE290-2: Hardware for Machine Learning
- MIT Eyeriss Tutorial
- Vivado HLS Design Hubs
- Parallel Programming for FPGAs
- Cornell ECE 5775: High-Level Digital Design Automation
- LegUp: Open-source HLS Compiler
- VTA design example
- <u>Vivado SDAccel design examples</u>

Questions?