Flash Attention Implementation + Benchmarking

Final project for ECE 4122 (Advanced Programming Techniques for Engineering Applications) at Georgia Tech.

Description

This project implements Flash Attention, an optimization of the attention mechanism at the heart of the transformer architecture that powers many of machine learning's recent advances (e.g. ChatGPT, where GPT stands for Generative Pretrained Transformer). Flash Attention claims up to 4× faster attention computation, significant reductions in GPU memory usage, and better scaling to longer sequence lengths. The goal of this final project is to implement a CUDA-based Flash Attention kernel, compare it against a standard attention implementation, and investigate how much it can speed up training of a small transformer model.

The project focuses on implementing the attention mechanism of a transformer using GPU programming with CUDA. A baseline version of scaled dot-product attention will first be implemented using standard matrix operations. An optimized Flash Attention version will then perform the computation in tiled blocks, reducing reads and writes to GPU global memory and never materializing the full attention matrix, which improves both memory efficiency and performance (a sketch of the underlying accumulation follows below).
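
For illustration, here is a minimal CPU sketch of the online-softmax accumulation that makes the tiled approach possible, shown for a single query row. The function and variable names are illustrative rather than taken from the project code; the real kernel performs the same update per tile on the GPU.

#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of the online-softmax accumulation behind Flash Attention, for one
// query row q against keys K and values V (row-major [seq_len x head_dim]).
// Keys/values are consumed in blocks; only a running max (m), a running
// softmax denominator (l), and a running output accumulator (acc) are kept,
// so the full row of attention scores is never stored.
void flash_attention_row(const float* q, const float* K, const float* V,
                         float* out, int seq_len, int head_dim, int block_size) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    float m = -INFINITY;                  // running maximum of the scores seen so far
    float l = 0.0f;                       // running softmax denominator
    std::vector<float> acc(head_dim, 0.0f);

    for (int j0 = 0; j0 < seq_len; j0 += block_size) {
        int j1 = std::min(j0 + block_size, seq_len);
        for (int j = j0; j < j1; ++j) {
            // Score of q against key row j.
            float s = 0.0f;
            for (int d = 0; d < head_dim; ++d) s += q[d] * K[j * head_dim + d];
            s *= scale;

            // Rescale the running statistics whenever a new maximum appears.
            float m_new = std::max(m, s);
            float correction = std::exp(m - m_new);
            float p = std::exp(s - m_new);
            l = l * correction + p;
            for (int d = 0; d < head_dim; ++d)
                acc[d] = acc[d] * correction + p * V[j * head_dim + d];
            m = m_new;
        }
    }
    for (int d = 0; d < head_dim; ++d) out[d] = acc[d] / l;
}
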

To structure the project, several custom C++ classes will be created. For example, a TransformerAttention class will manage attention computation, while a CudaFlashAttention class will implement the GPU kernel responsible for the optimized attention algorithm. These classes will encapsulate GPU memory allocation, kernel launches, and data transfer between CPU and GPU memory.
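
A possible shape for the CudaFlashAttention wrapper is sketched below. Only the class name comes from the plan above; the constructor arguments, members, and forward method are illustrative assumptions about how the GPU resources could be encapsulated.

#include <cuda_runtime.h>
#include <cstddef>

// Sketch of a RAII wrapper that owns the device buffers and launches the
// Flash Attention kernel. Member and method names are illustrative.
class CudaFlashAttention {
public:
    CudaFlashAttention(int seq_len, int head_dim)
        : seq_len_(seq_len), head_dim_(head_dim) {
        const size_t bytes = static_cast<size_t>(seq_len) * head_dim * sizeof(float);
        cudaMalloc(&d_Q_, bytes);
        cudaMalloc(&d_K_, bytes);
        cudaMalloc(&d_V_, bytes);
        cudaMalloc(&d_O_, bytes);
    }
    ~CudaFlashAttention() {
        cudaFree(d_Q_); cudaFree(d_K_); cudaFree(d_V_); cudaFree(d_O_);
    }

    // Copies inputs to the GPU, launches the tiled kernel, copies the result back.
    void forward(const float* h_Q, const float* h_K, const float* h_V, float* h_O);

private:
    int seq_len_, head_dim_;
    float *d_Q_ = nullptr, *d_K_ = nullptr, *d_V_ = nullptr, *d_O_ = nullptr;
};
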

The project will satisfy the requirement of using GPU programming (CUDA), one of the approved special topics for the final project. The CUDA kernels will be used to parallelize attention score computation and softmax normalization across GPU threads.
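
One straightforward way to map this onto GPU threads, shown here as a sketch rather than the project's actual kernel, is to give each thread one query row: the thread computes that row's scores, normalizes them with a softmax, and accumulates the output row.

#include <math_constants.h>

// Baseline kernel sketch: one thread per query row. `scores` is a preallocated
// [seq_len x seq_len] scratch buffer in global memory; the Flash Attention
// kernel avoids this buffer entirely by working on tiles staged in shared
// memory. Names and layout are illustrative.
__global__ void naive_attention_kernel(const float* Q, const float* K, const float* V,
                                       float* scores, float* O,
                                       int seq_len, int head_dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // query row owned by this thread
    if (i >= seq_len) return;
    float scale = rsqrtf((float)head_dim);

    // Attention scores of query row i against every key row, tracking the row max.
    float max_score = -CUDART_INF_F;
    for (int j = 0; j < seq_len; ++j) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d)
            s += Q[i * head_dim + d] * K[j * head_dim + d];
        s *= scale;
        scores[i * seq_len + j] = s;
        max_score = fmaxf(max_score, s);
    }

    // Numerically stable softmax normalization of the score row.
    float denom = 0.0f;
    for (int j = 0; j < seq_len; ++j) {
        float e = __expf(scores[i * seq_len + j] - max_score);
        scores[i * seq_len + j] = e;
        denom += e;
    }

    // Output row: softmax-weighted sum of the value rows.
    for (int d = 0; d < head_dim; ++d) {
        float acc = 0.0f;
        for (int j = 0; j < seq_len; ++j)
            acc += scores[i * seq_len + j] * V[j * head_dim + d];
        O[i * head_dim + d] = acc / denom;
    }
}
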

To evaluate the effectiveness of the implementation, measurable performance metrics will be collected. Specifically, the following comparisons will be made:

  • Execution time of standard attention vs. Flash Attention
  • GPU memory usage during attention computation
  • Scaling behavior with increasing sequence lengths

Benchmarks will be performed on several sequence lengths (for example 128, 256, 512, and 1024 tokens). The results will be measured using CUDA timing utilities and reported as runtime speedup and memory usage differences.
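
A sketch of how such measurements can be taken with CUDA events and cudaMemGetInfo follows; the helper name and the launch callback are assumptions for illustration, and the benchmark's actual utilities may differ.

#include <cuda_runtime.h>
#include <cstdio>

// Sketch: measure kernel runtime with CUDA events and approximate device
// memory use with cudaMemGetInfo (a rough proxy, since other allocations can
// change between the two queries). `launch_attention` stands in for whatever
// function launches the kernel under test.
void benchmark_once(void (*launch_attention)(), const char* label) {
    size_t free_before = 0, free_after = 0, total = 0;
    cudaMemGetInfo(&free_before, &total);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch_attention();             // kernel launch(es) being timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);     // wait for the stop event before reading the timer

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemGetInfo(&free_after, &total);

    printf("%s: %.3f ms, ~%zu bytes of device memory allocated during the run\n",
           label, ms, free_before - free_after);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
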

The final deliverables will include the full source code, a README file explaining how to compile and run the program, and either a video demonstration or instructions to compile and run the project on the PACE-ICE system. The project output will include benchmark results showing the performance differences between the two implementations.

The expected outcome of this project is a working CUDA implementation of Flash Attention that demonstrates measurable improvements over a baseline attention implementation, along with a clear comparison of performance and memory usage.

Build

Build using CMake.
Please ensure that CUDA and the NVIDIA Nsight profiling tools are installed.

$ mkdir build && cd build
$ cmake ..
$ make

Usage

Run ./attention_benchmark from the build directory to compare the runtimes of the naive implementation, the naive implementation with cuBLAS, and Flash Attention.

To profile with Nsight Systems, run tools/profile_nsys.sh to capture a trace, then tools/open_nsys_ui.sh to view the profiling results.
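
The script names suggest they wrap the nsys command line; a typical capture-and-view workflow (an assumption about the scripts' contents, not a guarantee) looks like:

$ nsys profile -o attention_report ./attention_benchmark
$ nsys-ui attention_report.nsys-rep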
