Writing Speed-of-Light Flash Attention for 5090 in CUDA C++