publications | Louis Ledoux

Theses

2024

Floating-Point Arithmetic Paradigms for High-Performance Computing: Software Algorithms and Hardware Designs

Louis Ledoux

Oct 2024

Abs Bib HTML PDF

This dissertation explores the challenges and advancements in arithmetic representations and computations within computer architectures, focusing on the limitations of the IEEE 754 standard. Modern computing demands, driven by advancements in AI, HPC, and scientific simulations, make efficient and precise numerical representations crucial. This work investigates these challenges and proposes innovative solutions, evaluating their impact on computational efficiency and accuracy. The core problem is the inefficiencies of the IEEE 754 standard for floating-point arithmetic, which do not meet the needs of modern workloads. These inefficiencies result in higher energy consumption, inadequate precision, and suboptimal performance, especially in energy-constrained environments and high-precision applications. To address these challenges, this thesis explores various facets of arithmetic computation, from algorithmic concepts to metal and silicon structures. It introduces mechanisms to improve the adaptability of numerical representations, allowing precision adjustments according to computational tasks, resulting in more efficient circuits. Focusing on improving arithmetic performance, the thesis addresses energy consumption and highlights the importance of efficient arithmetic logic units. It also shows how these solutions can be integrated into various software frameworks, revealing a correlation between numerical requirements and internal precision, highlighting an underexploited aspect of general-purpose floating-point formats. Firstly, it develops a framework for generating Posit operators in hardware, improving accuracy and performance in tasks like image classification. The Posit Operator Framework, described in SystemVerilog, enables the construction of Multi-Layer Perceptrons for inference engines, applicable in POWER9/CAPI2 environments with FPGA acceleration. Secondly, it presents a generator for Systolic Arrays optimized for Matrix-Matrix Multiplication (MMM), showing the impact of custom hardware configurations on accuracy and energy efficiency. The MMM units are fully parametrizable and adapted to the numerical specifications of the workload, facilitated by a core generator with automated pipelining. These units allow evaluations with CAPI2 on FPGA and POWER9 systems, achieving up to two Tera floating-point operations per second. They have also demonstrated success in ASIC generation. Additionally, it establishes an open-source framework to integrate MMM units into high-level software, offering energy savings and enhanced precision for applications like AI and scientific computations. The methodology involves mapping General Matrix-Matrix Multiplication calls in BLAS libraries to our accelerators via the OpenCAPI coherent link, saturating the 22 GBps bandwidth by tuning computer formats to accommodate more Processing Elements while preserving accuracy. Finally, the resurgence of vector processing leads to a reevaluation of division algorithms, revealing opportunities to use smaller and slower computing units, allowing more units within varied energy and power budgets. This approach shows a broad Design Space Exploration. We developed an open-source EDA ASIC flow, facilitating parallel generation of multiple chip designs, enabling systematic exploration of power, performance, and area across various process design kits to identify optimal configurations. These contributions form an interdisciplinary thesis that advances solutions to computing challenges from an arithmetic perspective, overcoming the "arithmetic wall."
@phdthesis{ledoux:tel-04754167, title = {{Floating-Point Arithmetic Paradigms for High-Performance Computing: Software Algorithms and Hardware Designs}}, author = {Ledoux, Louis}, url = {https://hal.science/tel-04754167}, school = {{Universitat Polit{\`e}cnica de Catalunya}}, year = {2024}, month = oct, keywords = {Arithmetic ; Matrix Multiplication ; Workload Analysis ; HPC ; FPGA - field-Programmable gate array ; ASIC ; Arithm{\'e}tique IEEE 754 virgule flottante entiers}, type = {Theses}, hal_id = {tel-04754167}, hal_version = {v2}, }

Conference Articles

2023

FPL
An Open-Source Framework for Efficient Numerically-Tailored Computations

Louis Ledoux, and Marc Casas

In 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL) , Sep 2023

ISSN: 1946-1488

Abs Bib HTML

We present a versatile open-source framework designed to facilitate efficient, numerically-tailored Matrix-Matrix Multiplications (MMMs). The framework offers two primary contributions: first, a fine-tuned, automated pipeline for arithmetic datapath generation, enabling highly customizable systolic MMM kernels; second, seamless integration of the generated kernels into user code, irrespective of the programming language employed, without necessitating modifications. We employ this framework within a cutting-edge platform, comprising a Power9 host, an OpenCAPI link, and a Xilinx Virtex UltraScale+ FPGA. The framework demonstrates a systematic enhancement in accuracy per energy cost across diverse High Performance Computing (HPC) workloads displaying a variety of numerical requirements, such as Artificial Intelligence (AI) inference and Sea Surface Height (SSH) computation. For AI inference, we consider a set of state-of-the-art neural network models, namely ResNetl8, ResNet34, ResNet50, DenseNet121, DenseNet161, DenseNet169, and VGG11, in conjunction with two datasets, two computer formats, and 27 distinct intermediate arithmetic datapaths. Our approach consistently reduces energy consumption across all cases, with a notable example being the reduction by factors of 3.3x for IEEE754-32 and 1.4x for Bfloat16 during ImageNet inference with ResNet50. This is accomplished while maintaining accuracies of 82.3% and 86%, comparable to those achieved with conventional Floating-Point Units (FPUs). In the context of SSH computation, our method achieves fully-reproducible results using double-precision words, surpassing the accuracy of conventional double- and quad-precision arithmetic in FPUs. Our approach enhances SSH computation accuracy by a minimum of 5× and 27× compared to IEEE754-64 and IEEE754-128, respectively, resulting in 5.6× and 15.1 × improvements in accuracy per power cost.
@inproceedings{ledoux_framework_2023, title = {An {Open}-{Source} {Framework} for {Efficient} {Numerically}-{Tailored} {Computations}}, url = {https://ieeexplore.ieee.org/document/10296314}, doi = {10.1109/FPL60245.2023.00011}, booktitle = {2023 33rd {International} {Conference} on {Field}-{Programmable} {Logic} and {Applications} ({FPL})}, author = {Ledoux, Louis and Casas, Marc}, month = sep, year = {2023}, note = {ISSN: 1946-1488}, pages = {19--26}, google_scholar_id = {d1gkVwhDpl0C}, }

2022

FCCM
A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs

Louis Ledoux, and Marc Casas

In 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) , May 2022

Abs Bib HTML

We propose a hardware generator of GEMM accelerators. Our generator produces vendor-agnostic HDL describing highly customizable systolic arrays guided by accuracy and energy efficiency goals. The generated arrays have three main novel aspects. First, the accelerators handle a large variety of computer number formats using intermediate representations based on our Sign Scale Significand (S3) format. Second, the processing elements perform all intermediate dot-product arithmetic operations required by the GEMM kernel without any intermediate rounding, which makes it possible to deliver better energy efficiency than state-of-the-art approaches while offering more accuracy and reproducible results. Third, our accelerators feature the Half-Speed Sink Down (HSSD) mechanism, which maximizes the overlap of host-accelerator data transfers with GEMM computations.We evaluate our automatically generated designs in a cutting-edge setup composed of a POWER9 host, CAPI (Coherent Accelerator Processor Interface) link, and a Virtex Ultrascale Plus FPGA. Arrays can operate at the speed of the link and saturate it to reach a 13GB/s throughput. Our fine-grain customization approach allows to cover a wide range of accuracy versus efficiency scenarios and can reach 0.65GOps/s/W while producing 1024 accurate bits or 148.7GOps/s/W with 6 accurate bits. Our configurations achieve up to 1613GOps/s system performance and power efficiencies of up to 240GOps/s/W for the FPGA. This automatic generator is the first being able to produce such a variety of designs. We improve the single-precision energy efficiency of state-of-the-art FPGA GEMM accelerators by 1.86×.
@inproceedings{ledoux_generator_2022, title = {A {Generator} of {Numerically}-{Tailored} and {High}-{Throughput} {Accelerators} for {Batched} {GEMMs}}, doi = {10.1109/FCCM53951.2022.9786164}, booktitle = {2022 {IEEE} 30th {Annual} {International} {Symposium} on {Field}-{Programmable} {Custom} {Computing} {Machines} ({FCCM})}, author = {Ledoux, Louis and Casas, Marc}, month = may, year = {2022}, pages = {1--10}, google_scholar_id = {u5HHmVD_uO8C}, }

Posters

2024

DATE

The Grafted Superset Approach: Bridging Python to Silicon with Asynchronous Compilation and Beyond

Louis Ledoux, and Marc Casas

Mar 2024

Poster

Bib HTML PDF

@misc{ledoux_suf_2024,
  title = {{The Grafted Superset Approach: Bridging Python to Silicon with Asynchronous Compilation and Beyond}},
  author = {Ledoux, Louis and Casas, Marc},
  url = {https://hal.science/hal-04587458},
  note = {Poster},
  howpublished = {{DATE}},
  year = {2024},
  month = mar,
  keywords = {EDA ; ASIC ; flow ; Open Source Silicon ; Arithmetic ; Floating-Point ; Automated Pipeline ; FloPoCo},
  hal_id = {hal-04587458},
  hal_version = {v1},
}

LLMMMM: Large Language Models Matrix-Matrix Multiplications Characterization on Open Silicon

Louis Ledoux, and Marc Casas

May 2024

Poster

Bib HTML PDF

@misc{ledoux_llmmmm_2024,
  title = {{LLMMMM: Large Language Models Matrix-Matrix Multiplications Characterization on Open Silicon}},
  author = {Ledoux, Louis and Casas, Marc},
  url = {https://hal.science/hal-04592229},
  note = {Poster},
  howpublished = {{11th BSC Symposium}},
  organization = {{Barcelona Supercomputing Center}},
  volume = {11},
  year = {2024},
  month = may,
  keywords = {Multi-Perspective Layout visualizations and Power Performance Area (PPA) Analysis introduction of PPAA (Power Performance Area Accuracy) metric PDK vs. Computer format ASAP7: 4096 PEs 8.2 TFLOP/s <1 mm 2 Congestion heatmap 64 PEs e4m3-$\beta$ 8 ; Multi-Perspective Layout visualizations and Power ; Performance ; Area (PPA) Analysis introduction of PPAA (Power ; Area ; Accuracy) metric PDK vs. Computer format ASAP7: 4096 PEs ; 8.2 TFLOP/s ; <1 mm 2 Congestion heatmap ; 64 PEs ; e4m3-$\beta$ 8},
  hal_id = {hal-04592229},
  hal_version = {v1},
}

2023

An Open-Source Framework for Efficient Numerically-Tailored Computations

Louis Ledoux, and Marc Casas

May 2023

Poster

Bib HTML PDF

@misc{ledoux_osfntc_poster_2023,
  title = {{An Open-Source Framework for Efficient Numerically-Tailored Computations}},
  author = {Ledoux, Louis and Casas, Marc},
  url = {https://hal.science/hal-04094835},
  note = {Poster},
  howpublished = {{BSCSymposium23}},
  organization = {{Barcelona Supercomputing Center}},
  year = {2023},
  month = may,
  keywords = {GEMMs ; matrix-matrix-multiply ; full stack framework ; automated pipeline ; flopoco ; OpenCAPI ; OpenBLAS ; High Performance Computing ; approximate/trans/extended precision},
  hal_id = {hal-04094835},
  hal_version = {v1},
}

Invited talks and Seminars

2024

The Walls and the Dark Silicon Era: An Arithmetic Perspective

Louis Ledoux

Dec 2024

Abs Bib

The memory wall, power wall, dark silicon, and even the adage "communication dominates arithmetic" are persistent challenges for computer architects and the High Performance Computing (HPC) community. The term "computer," derived from the Latin computare ("to calculate"), implies that computation is central to its purpose. However, HPC systems and supercomputers perform numerous non-arithmetic tasks, such as synchronization, data movement, and operating system management. Given the general-purpose nature of these HPC tasks and the emphasis on efficiency, careful attention must be paid to the design of arithmetic units. This creates a high-dimensional parameter space, and the sparsity of numerical requirements in scientific workloads further broadens the design space, allowing for extensive exploration of arithmetic architectures and number representations. This presentation addresses software and hardware solutions aimed at optimizing these architectural choices while tackling the challenges posed by the various computational barriers. Specifically, we focus on two major trade-offs: (1) accuracy (precision) versus energy efficiency, considering numerical requirements, and (2) parallelism versus latency in the context of SIMD/vector floating-point operations. Additionally, we may explore topics such as custom arithmetic ASIC design flows or the use of Multi-Level Intermediate Representations (MLIR) to enhance flexibility and efficiency across this design space.
@techreport{ledoux_taran_dark_era_2024, address = {Rennes, France}, title = {The Walls and the Dark Silicon Era: An Arithmetic Perspective}, author = {Ledoux, Louis}, month = dec, year = {2024}, keywords = {dark silicon, arithmetic, float, posit, llm, gemm, accuracy, energy}, }

2019

OpenPOWER
Accelerating DL inference with (Open)CAPI and posit numbers

Louis Ledoux, and Marc Casas

Oct 2019

Abs Bib HTML

In order to tackle the communication-bound problems in heterogeneous systems, Louis discussed how intrinsic arithmetics can have significant impacts on energy and throughputs. Thanks to the denser representation offered by the posit arithmetic BSC can: - significantly divide bandwidth needed (PCIE: HOST - FPGA) - maintain all weights on chip - save power consumption
@techreport{ledoux_accelerating_2019, address = {Lyon, France}, title = {Accelerating {DL} inference with ({Open}){CAPI} and posit numbers}, url = {https://hal.science/hal-04094850}, booktitle = {{OpenPOWER} summit 2019}, publisher = {linux foundation}, author = {Ledoux, Louis and Casas, Marc}, month = oct, year = {2019}, keywords = {acceleration, CAPI, CAPI2, FPGA, PCIE, posit, POWER9}, }