OSFNTC
An Open-Source Framework for Efficient Numerically-Tailored Computations
This repository accompanies the paper "An Open-Source Framework for Efficient Numerically-Tailored Computations", hence the acronym OSFNTC.
Table of Contents
- About the Paper
- Project Structure
- Installation
- Usage
- Evaluation
- ASIC Tapeout
- Contribution
- License
- Citing
- Authors
About the Paper
We introduce a flexible open-source framework specifically designed to streamline efficient and numerically-optimized Matrix-Matrix Multiplications (MMMs). This framework offers two key features: Firstly, it provides an automated pipeline for precise arithmetic datapath generation, which enables the creation of highly customized systolic MMM kernels. Secondly, it allows for the effortless integration of these generated kernels into any user code, regardless of the programming language used, without any need for modifications.
We utilize this framework within a cutting-edge platform that consists of a Power9 host, an OpenCAPI link, and a Xilinx Virtex UltraScale+ FPGA. The framework exhibits a systemic improvement in terms of accuracy per energy cost across a range of High-Performance Computing (HPC) workloads. These workloads present diverse numerical requirements, including those found in Artificial Intelligence (AI) inference and Sea Surface Height (SSH) computation.
For AI inference, we consider a variety of leading-edge neural network models: ResNet18, ResNet34, ResNet50, DenseNet121, DenseNet161, DenseNet169, and VGG11. We use two datasets and two computer formats in combination with 27 distinct intermediate arithmetic datapaths. Our approach consistently reduces energy consumption across all scenarios; for instance, we achieve reductions of 3.3x for IEEE754-32 and 1.4x for Bfloat16 during ImageNet inference with ResNet50, while maintaining accuracies of 82.3% and 86%, comparable to the results achieved with traditional Floating-Point Units (FPUs).
In the context of SSH computation, our methodology obtains fully reproducible results using double-precision words, exceeding the accuracy of traditional double- and quad-precision arithmetic in FPUs. Our approach increases SSH computation accuracy by at least 27x and 5x compared to IEEE754-64 and IEEE754-128, respectively, and improves accuracy per power cost by 5.6x and 15.1x, respectively.
The two-phase framework is depicted in the following image. The left side shows the runtime execution flow, while the right side depicts the a priori hardware generation flow:
Project Structure
The repository is organized as follows:
OSFNTC/
├── OpenBLAS/ # Custom OpenBLAS library directory
├── PySigmoid/ # Custom PySigmoid library directory
├── SoftPosit/ # Custom SoftPosit library directory
├── eval/ # Evaluation scripts and data
├── flopoco/ # Custom FloPoCo library directory
├── misc/ # Miscellaneous scripts and files
├── oc-accel/ # OpenCAPI acceleration framework directory
├── ocse/ # OpenCAPI Simulation Engine directory
├── runs/ # Directory for run scripts and logs
├── sim_config/ # Hardware simulation configuration files
├── .gitignore # Git ignore rules
├── Makefile # Makefile for building the project
├── README.md # This file, a concise overview of the project
├── LICENSE.txt # The used license
├── requirements.txt # Python requirements for x86 virtual env
└── requirements_P9.txt # Python requirements for Power9 virtual env
Please refer to the individual directories for additional README files and more detailed explanations where applicable.
Installation
The installation process for this project is complex and varies depending on your specific goals, machine, kernel, and operating system. Each subframework included here typically has its own README explaining the necessary installation steps.
To set up the environment variables used by the underlying framework, execute the following script:
source ./misc/script_add_stuff_to_venv.sh
This process is iterative and may require several attempts to succeed. Testing on three different machines revealed slight variations in the required steps.
Below are specific commands and comments required to install some of the frameworks.
OCSE (OpenCAPI Simulation Engine)
The libocxl/Makefile has been modified to ensure the correct misc/ocxl version is pulled.
OpenBLAS
make USE_OPENMP=1
sudo make install # installs to the default path under /opt, which is fine
Pytorch
git submodule sync
git submodule update --init --recursive
pip install pyyaml
pip install typing_extensions
source misc/script_add_stuff_to_venv.sh
python setup.py install
Numpy
pip install cython
git submodule update --init
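# then build and install from source (assumed standard NumPy flow; not shown in the original notes)
python setup.py install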
PySigmoid
Our modified PySigmoid handles arbitrary kinds of accumulators. Among other things, we use the library to generate input matrices.
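As an illustration, quantizing a value to a given posit configuration could look like the following sketch, written against the upstream PySigmoid API (set_posit_env, Posit); our modified version may differ:

from PySigmoid import Posit, set_posit_env

set_posit_env(8, 0)   # posit<8,0>: 8-bit words, no exponent bits
p = Posit(0.3)        # rounds to the nearest representable posit<8,0> value
print(p)              # prints the rounded value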
OpenNMT Ctranslate2
# Requires OpenBLAS to be installed in /usr
git clone --recursive https://github.com/OpenNMT/CTranslate2.git
mkdir build && cd build
cmake -DWITH_MKL=OFF -DWITH_OPENBLAS=ON -DOPENMP_RUNTIME=COMP -DENABLE_CPU_DISPATCH=OFF ..
make -j8
sudo make install
Python wrapper
# Set CTRANSLATE2_ROOT to build folder
export CTRANSLATE2_ROOT="$(pwd)"
cd python
pip install -r install_requirements.txt
python setup.py bdist_wheel
pip install dist/*.whl
Runtime
pip install OpenNMT-py sentencepiece
wget https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
tar xf transformer-ende-wmt-pyOnmt.tar.gz
ct2-opennmt-py-converter --model_path averaged-10-epoch.pt --output_dir ende_ctranslate2
# export LD_LIBRARY_PATH to the CTranslate2 build path
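Once converted, the model can be exercised from Python roughly as follows. This is a sketch: the sentencepiece model filename inside the downloaded archive is an assumption, and older CTranslate2 versions return dicts instead of result objects.

import ctranslate2
import sentencepiece as spm

# Load the subword tokenizer shipped with the pretrained model (path assumed).
sp = spm.SentencePieceProcessor()
sp.load("sentencepiece.model")

# Point the translator at the converted model directory from the step above.
translator = ctranslate2.Translator("ende_ctranslate2", device="cpu")
tokens = sp.encode("Hello world!", out_type=str)
results = translator.translate_batch([tokens])
print(sp.decode(results[0].hypotheses[0]))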
Usage
This section is currently under development, but the basic process involves following the steps laid out by oc-accel to create a custom action and generate a bitstream. Once the setup is complete, you can run high-level code that invokes General Matrix Multiply (GEMM) operations, which will then be processed by the FPGA.
User-level code and numerical libraries do not need to be changed or recompiled to redirect GEMM calls to our customized Matrix-Matrix Multiplication (MMM) units. Typically, an application would allocate some virtual memory space for the input and output matrices, then call one of the GEMM subroutines (sgemm, dgemm, zgemm, cgemm). This process is illustrated in steps 1) and 2) of the overall framework figure provided in the introduction. Often, these applications are statically or dynamically linked with a Basic Linear Algebra Subprograms (BLAS) library.
The GEMM operation can be dispatched to the FPGA or executed normally on the CPU or GPU, depending on the matrix dimensions, as demonstrated below:
LD_LIBRARY_PATH=/opt/lib/our_openblas.lib ./gemm.py # dispatches the GEMM execution to the FPGA (depending on matrix dimensions)
LD_LIBRARY_PATH=/opt/lib/OpenBLAS.lib ./gemm.py # works as normal, executing on the CPU or GPU
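To verify which BLAS library the process actually loaded after such an LD_LIBRARY_PATH switch, one quick Linux-only check (our own sketch, not part of the framework) is to inspect the process memory map:

import numpy as np  # importing NumPy forces its BLAS backend to load

# List the BLAS shared objects mapped into this process (Linux-only).
with open("/proc/self/maps") as f:
    libs = {line.split()[-1] for line in f if "blas" in line.lower()}
for lib in sorted(libs):
    print(lib)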
Here’s how to call the dgemm and sgemm functions in Python:
import numpy as np

m, n, k = 1024, 1024, 1024

# float64 operands: np.matmul resolves to BLAS dgemm
A = np.random.random((m, k))
B = np.random.random((k, n))
C = np.matmul(A, B)

# float32 operands: np.matmul resolves to BLAS sgemm
A = np.random.random((m, k)).astype(np.float32)
B = np.random.random((k, n)).astype(np.float32)
C = np.matmul(A, B)
Evaluation
Here, we briefly summarize the two main families of HPC workloads that we evaluated, each with distinct numerical requirements.
SSH (Sea Surface Height)
Sea Surface Height (SSH) is a crucial metric in ocean circulation model development, aiding in the tracking of ocean currents, eddies, and climate changes. SSH represents sea surface volume, derived from the product of integrated sea surface area and height.
Because floating-point addition is not associative, the traversal order of the same grid changes the result:
sum = 0
for j = 128 to 1: # longitude, descending
    for i = 1 to 64: # latitude
        sum = sum + ssh(i,j)
    end
end
print(sum) # 32.302734375

sum = 0
for i = 1 to 64: # latitude
    for j = 1 to 128: # longitude
        sum = sum + ssh(i,j)
    end
end
print(sum) # 0.6732654571533203
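The same effect is easy to reproduce in plain Python with float32 partial sums; the snippet below uses synthetic data, not the actual SSH field:

import numpy as np

rng = np.random.default_rng(0)
ssh = (rng.standard_normal((64, 128)) * 1e4).astype(np.float32)  # synthetic grid

# Same data, two traversal orders: float32 addition is not associative,
# so the two accumulation orders can disagree.
s1 = np.float32(0)
for j in range(127, -1, -1):   # longitude, descending
    for i in range(64):        # latitude
        s1 += ssh[i, j]

s2 = np.float32(0)
for i in range(64):            # latitude
    for j in range(128):       # longitude
        s2 += ssh[i, j]

print(s1, s2)  # typically differ in the low-order digits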
To reduce errors and approximate the correct result, several techniques are commonly used. Self-Compensated and Double-Compensated Summation (SCS and DCS) estimate the round-off error at each step and subtract it in subsequent steps. Sorting the values in decreasing magnitude order is also effective, especially when values alternate in sign.
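As an illustration of the idea, here is a minimal self-compensated (Kahan-style) summation sketch in Python; it is a textbook scheme, not the hardware datapath used in this work:

def compensated_sum(values):
    # Track the running round-off error and fold it back into the next addition.
    total = 0.0
    compensation = 0.0  # estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation
        t = total + y                    # low-order bits of y may be lost here
        compensation = (t - total) - y   # recover what was just lost
        total = t
    return total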
This study focuses on the effectiveness of hardware units, comparing our Fused Dot Products (FDPs) to the double- and quad-precision FMAs found in computational systems. The units compared are the IEEE-754 double-precision FMA, the IEEE-754 quad-precision FMA, and our 91-bit FDP fed with IEEE754-64 words.
We assessed the average, relative standard deviation (RSD), accuracy, and power cost per accurate bit of the SSH variable for different vector sizes.
The 64-bit and 128-bit FPUs showed decreased reproducibility as vector size increased, while our 91-bit FDP maintained reproducibility across all vector sizes. Quad-precision FPUs improved numerical quality over double-precision FPUs but didn’t offer reproducibility.
Our FDP consistently delivers 52 correct bits, at least 5 and 27.7 times more than quad- and double-precision, respectively. We also measured the power cost of one correct bit: our 91-bit FDP is the most efficient, providing more correct bits per watt than both quad- and double-precision FMAs.
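The "correct bits" metric is not spelled out in this README; a common formulation, which we assume here, takes the negative log2 of the relative error against a high-precision reference, capped at the 52 stored significand bits of an IEEE754-64 word:

import math

def correct_bits(computed: float, reference: float) -> float:
    # Assumed metric: -log2(relative error), capped at 52 bits (IEEE754-64 significand).
    if computed == reference:
        return 52.0
    rel_err = abs(computed - reference) / abs(reference)
    return max(0.0, min(52.0, -math.log2(rel_err)))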
In conclusion, our study demonstrates that a sufficiently precise accumulator provides reproducibility and greater accuracy in HPC workloads, at a lower cost than double- and extended-precision methods.
AI (Artificial Intelligence)
We evaluate the accuracy and power trade-offs of low-precision accumulators across various neural network models, datasets, and computer formats. Our focus lies on the inference portion of neural network computation, utilizing pre-trained neural networks in their original floating-point formats.
We employ Pytorch as a base framework and link it to our modified OpenBLAS. We use popular neural network models such as ResNet18, ResNet34, ResNet50, DenseNet121, DenseNet161, DenseNet169, and VGG11 with batch normalization, and evaluate them on the CIFAR-10 and ImageNet datasets.
The following figure shows the Top-1 score versus the energy cost of inferring the whole validation set. In some instances, we observe that adjusting the accumulator by just a few bits saves an amount of energy equivalent to that required for a 3-year-old toddler to climb a 3-meter hill.
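For scale, the energy behind that analogy can be estimated from gravitational potential energy; the toddler's mass below is our assumption, not a figure from the paper:

m, g, h = 14.0, 9.81, 3.0        # kg, m/s^2, m (toddler mass assumed)
print(f"E = {m * g * h:.0f} J")  # ~412 J, roughly 0.1 Wh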
The following figure depicts the full range of evaluated configurations in terms of Top-1/Top-5 scores and score cost.
ASIC Tapeout
This work is both target-agnostic and open source, which inspired us to push its boundaries by turning it into a chip. Using the exact same toolchain, we successfully manufactured a functional, open-source tapeout, made possible through a fruitful collaboration with Google, SkyWater, and Efabless.
Special thanks to @mattvenn of the Zero To ASIC course for making it possible.
The image below showcases a ray-tracing render of the 3D view of the GDS file. We replaced the metal with glass to achieve this stunning, glowing visual effect. The featured design is a 3x3 Systolic Array that uses posit<8,0> arithmetic and exact accumulators, also known as Quires.
We invite you to explore the following links for a deeper understanding of this project. They will guide you to the code that generated this chip and offer additional insights into this open-source PDK (Process Design Kit) collaboration.
Contribution
If you would like to contribute, great! Here are some guidelines you can follow:
- Fork the Repository - Click the "Fork" button at the top-right of this page. This creates a copy of this repository in your GitHub account.
- Clone the Repository - In your GitHub account, open the forked repository, click the "Code" button, and copy the URL. Then open a terminal and run:
git clone "url you just copied"
- Create a New Branch - Change to the repository directory on your computer (if you are not already there):
cd repository-name
Now create a new branch using the git checkout command:
git checkout -b your-new-branch-name
- Make Necessary Changes and Commit Those Changes - Make your changes in the source code, then stage the new and modified files:
git add .
and commit them:
git commit -m "My super contribution that does a 10x speed improvement"
- Push Changes to GitHub - These changes are now in the HEAD of your local working copy. Send them to your remote repository with:
git push origin <your-branch-name>
- Submit Your Changes for Review - On your repository page on GitHub, you'll see a "Compare & pull request" button. Click it to create a pull request.
License
Academic Free License (“AFL”) v. 3.0
Citing
To cite this work, please refer to the articles published at FCCM 2022 and FPL 2023, whose BibTeX entries are shown below:
@INPROCEEDINGS{ledoux2022,
author={Ledoux, Louis and Casas, Marc},
booktitle={2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)},
title={A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs},
year={2022},
doi={10.1109/FCCM53951.2022.9786164}
}
@INPROCEEDINGS{ledoux2023,
author={Ledoux, Louis and Casas, Marc},
booktitle={FPL},
title={An Open-Source Framework for Efficient Numerically-Tailored Computations},
year={2023}
}
Subsets of this work have also been presented at non-peer-reviewed venues, such as the BSC Symposium 2023 and the OpenPOWER Summit 2019; their BibTeX entries follow:
@misc{ledoux:hal-04094835,
author = {Ledoux, Louis and Casas, Marc},
url = {https://hal.science/hal-04094835},
note = {Poster},
year = {2023},
month = may,
keywords = {GEMMs; matrix-matrix-multiply; full stack framework; automated pipeline; flopoco; OpenCAPI; OpenBLAS; High Performance Computing; approximate/trans/extended precision},
pdf = {https://hal.science/hal-04094835/file/BSC_Symposium_10_Louis_Ledoux_final.pdf},
hal_id = {hal-04094835},
hal_version = {v1},
}
@inproceedings{ledoux:hal-04094850,
author = {Ledoux, Louis and Casas, Marc},
url = {https://hal.science/hal-04094850},
address = {Lyon, France},
year = {2019},
month = oct,
keywords = {FPGA; posit; acceleration; PCIE; CAPI; CAPI2; POWER9},
hal_id = {hal-04094850},
hal_version = {v1},
}
Authors
- Louis Ledoux (@Bynaryman)
- Marc Casas (@Marccg1)