Flash Attention Implementation + Benchmarking

Final project for ECE 4122 (Advanced Programming Techniques for Engineering Applications) at Georgia Tech.

Description

This project implements Flash Attention, an optimization of the attention mechanism at the heart of the transformer architecture that powers many of machine learning's recent advances (e.g. ChatGPT, where GPT stands for Generative Pretrained Transformer). Flash Attention claims up to 4× faster attention computation, significant reductions in GPU memory usage, and better scaling to longer sequence lengths. The goal of this final project is to implement a CUDA-based Flash Attention kernel, compare it against a standard attention implementation, and investigate how much it can speed up training of a small transformer model.

The project focuses on implementing the attention mechanism of a transformer using GPU programming with CUDA. A baseline version of scaled dot-product attention will first be implemented using standard matrix operations. An optimized Flash Attention version will then perform the computation in tiled blocks, reducing reads and writes to GPU global memory and never materializing the full attention matrix, which improves both memory efficiency and performance (a sketch of the underlying accumulation follows below).
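
For illustration, here is a minimal CPU sketch of the online-softmax accumulation that makes the tiled approach possible, shown for a single query row. The function and variable names are illustrative rather than taken from the project code; the real kernel performs the same update per tile on the GPU.

#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of the online-softmax accumulation behind Flash Attention, for one
// query row q against keys K and values V (row-major [seq_len x head_dim]).
// Keys/values are consumed in blocks; only a running max (m), a running
// softmax denominator (l), and a running output accumulator (acc) are kept,
// so the full row of attention scores is never stored.
void flash_attention_row(const float* q, const float* K, const float* V,
                         float* out, int seq_len, int head_dim, int block_size) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    float m = -INFINITY;                  // running maximum of the scores seen so far
    float l = 0.0f;                       // running softmax denominator
    std::vector<float> acc(head_dim, 0.0f);

    for (int j0 = 0; j0 < seq_len; j0 += block_size) {
        int j1 = std::min(j0 + block_size, seq_len);
        for (int j = j0; j < j1; ++j) {
            // Score of q against key row j.
            float s = 0.0f;
            for (int d = 0; d < head_dim; ++d) s += q[d] * K[j * head_dim + d];
            s *= scale;

            // Rescale the running statistics whenever a new maximum appears.
            float m_new = std::max(m, s);
            float correction = std::exp(m - m_new);
            float p = std::exp(s - m_new);
            l = l * correction + p;
            for (int d = 0; d < head_dim; ++d)
                acc[d] = acc[d] * correction + p * V[j * head_dim + d];
            m = m_new;
        }
    }
    for (int d = 0; d < head_dim; ++d) out[d] = acc[d] / l;
}
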

To structure the project, several custom C++ classes will be created. For example, a TransformerAttention class will manage attention computation, while a CudaFlashAttention class will implement the GPU kernel responsible for the optimized attention algorithm. These classes will encapsulate GPU memory allocation, kernel launches, and data transfer between CPU and GPU memory.
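
A possible shape for the CudaFlashAttention wrapper is sketched below. Only the class name comes from the plan above; the constructor arguments, members, and forward method are illustrative assumptions about how the GPU resources could be encapsulated.

#include <cuda_runtime.h>
#include <cstddef>

// Sketch of a RAII wrapper that owns the device buffers and launches the
// Flash Attention kernel. Member and method names are illustrative.
class CudaFlashAttention {
public:
    CudaFlashAttention(int seq_len, int head_dim)
        : seq_len_(seq_len), head_dim_(head_dim) {
        const size_t bytes = static_cast<size_t>(seq_len) * head_dim * sizeof(float);
        cudaMalloc(&d_Q_, bytes);
        cudaMalloc(&d_K_, bytes);
        cudaMalloc(&d_V_, bytes);
        cudaMalloc(&d_O_, bytes);
    }
    ~CudaFlashAttention() {
        cudaFree(d_Q_); cudaFree(d_K_); cudaFree(d_V_); cudaFree(d_O_);
    }

    // Copies inputs to the GPU, launches the tiled kernel, copies the result back.
    void forward(const float* h_Q, const float* h_K, const float* h_V, float* h_O);

private:
    int seq_len_, head_dim_;
    float *d_Q_ = nullptr, *d_K_ = nullptr, *d_V_ = nullptr, *d_O_ = nullptr;
};
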

The project will satisfy the requirement of using GPU programming (CUDA), one of the approved special topics for the final project. The CUDA kernels will be used to parallelize attention score computation and softmax normalization across GPU threads.
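
One straightforward way to map this onto GPU threads, shown here as a sketch rather than the project's actual kernel, is to give each thread one query row: the thread computes that row's scores, normalizes them with a softmax, and accumulates the output row.

#include <math_constants.h>

// Baseline kernel sketch: one thread per query row. `scores` is a preallocated
// [seq_len x seq_len] scratch buffer in global memory; the Flash Attention
// kernel avoids this buffer entirely by working on tiles staged in shared
// memory. Names and layout are illustrative.
__global__ void naive_attention_kernel(const float* Q, const float* K, const float* V,
                                       float* scores, float* O,
                                       int seq_len, int head_dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // query row owned by this thread
    if (i >= seq_len) return;
    float scale = rsqrtf((float)head_dim);

    // Attention scores of query row i against every key row, tracking the row max.
    float max_score = -CUDART_INF_F;
    for (int j = 0; j < seq_len; ++j) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d)
            s += Q[i * head_dim + d] * K[j * head_dim + d];
        s *= scale;
        scores[i * seq_len + j] = s;
        max_score = fmaxf(max_score, s);
    }

    // Numerically stable softmax normalization of the score row.
    float denom = 0.0f;
    for (int j = 0; j < seq_len; ++j) {
        float e = __expf(scores[i * seq_len + j] - max_score);
        scores[i * seq_len + j] = e;
        denom += e;
    }

    // Output row: softmax-weighted sum of the value rows.
    for (int d = 0; d < head_dim; ++d) {
        float acc = 0.0f;
        for (int j = 0; j < seq_len; ++j)
            acc += scores[i * seq_len + j] * V[j * head_dim + d];
        O[i * head_dim + d] = acc / denom;
    }
}
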

To evaluate the effectiveness of the implementation, measurable performance metrics will be collected. Specifically, the following comparisons will be made:

  • Execution time of standard attention vs. Flash Attention
  • GPU memory usage during attention computation
  • Scaling behavior with increasing sequence lengths

Benchmarks will be performed on several sequence lengths (for example 128, 256, 512, and 1024 tokens). The results will be measured using CUDA timing utilities and reported as runtime speedup and memory usage differences.
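
A sketch of how such measurements can be taken with CUDA events and cudaMemGetInfo follows; the helper name and the launch callback are assumptions for illustration, and the benchmark's actual utilities may differ.

#include <cuda_runtime.h>
#include <cstdio>

// Sketch: measure kernel runtime with CUDA events and approximate device
// memory use with cudaMemGetInfo (a rough proxy, since other allocations can
// change between the two queries). `launch_attention` stands in for whatever
// function launches the kernel under test.
void benchmark_once(void (*launch_attention)(), const char* label) {
    size_t free_before = 0, free_after = 0, total = 0;
    cudaMemGetInfo(&free_before, &total);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch_attention();             // kernel launch(es) being timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);     // wait for the stop event before reading the timer

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemGetInfo(&free_after, &total);

    printf("%s: %.3f ms, ~%zu bytes of device memory allocated during the run\n",
           label, ms, free_before - free_after);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
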

The final deliverables will include the full source code, a README file explaining how to compile and run the program, and either a video demonstration or instructions to compile and run the project on the PACE-ICE system. The project output will include benchmark results showing the performance differences between the two implementations.

The expected outcome of this project is a working CUDA implementation of Flash Attention that demonstrates measurable improvements over a baseline attention implementation, along with a clear comparison of performance and memory usage.

Build

Build using CMake.
Please ensure that CUDA and the NVIDIA Nsight profiling tools are installed.

$ mkdir build && cd build
$ cmake ..
$ make

Usage

Run ./attention_benchmark from the build directory to compare the runtimes of the naive implementation, the naive implementation with cuBLAS, and Flash Attention.

To profile with Nsight Systems, run tools/profile_nsys.sh to capture a trace, then tools/open_nsys_ui.sh to view the profiling results.
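
The script names suggest they wrap the nsys command line; a typical capture-and-view workflow (an assumption about the scripts' contents, not a guarantee) looks like:

$ nsys profile -o attention_report ./attention_benchmark
$ nsys-ui attention_report.nsys-rep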
