Let's reproduce GPT-2 (1.6B): one 8XH100 node, 24 hours, $672, in llm.c
by karpathy (Andrej)
In this post we are reproducing GPT-2 in llm.c. This is "the GPT-2", the full 1558M-parameter version that was introduced in OpenAI's blog post Better Language Models and their Implications on February 14, 2019. llm.c does so directly in C/CUDA (~5,000 lines of code total), without the typical training stack that would involve the Python interpreter and a significantly more complex deep learning library like PyTorch/JAX or huggingface/transformers. In 2019, training GPT-2 was an involved project for an entire team and was considered a big model run, but ~5 years later, thanks to improvements in compute (H100 GPUs), software (CUDA, cuBLAS, cuDNN, FlashAttention), and data (e.g. the FineWeb-Edu dataset), we can reproduce this model on a single 8XH100 node in 24 hours, and for $672 (24 hours at roughly $28/hour for the node, i.e. $3.50 per GPU-hour), which is quite incredible.
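To make the setup concrete, here is a sketch of what building and launching such a run looks like, following the flag conventions documented in the llm.c README (`-e "d48"` selects the 48-layer, 1558M-parameter GPT-2 config). The data paths and hyperparameter values below are illustrative assumptions, not the exact settings of this run; the full post gives those.

```bash
# Build the CUDA training binary; USE_CUDNN=1 enables the cuDNN FlashAttention path.
make train_gpt2cu USE_CUDNN=1

# Launch across all 8 GPUs of the node via MPI/NCCL.
# -i/-j: train/val data shard patterns   -o: output log directory
# -e "d48": init a depth-48 (1558M) GPT-2 from scratch
# -b: per-GPU micro-batch  -t: sequence length  -d: total batch size in tokens
# -l: learning rate  -x: total number of steps
# (values here are placeholders for illustration)
mpirun -np 8 ./train_gpt2cu \
    -i "dev/data/edu_fineweb100B/edu_fineweb_train_*.bin" \
    -j "dev/data/edu_fineweb100B/edu_fineweb_val_*.bin" \
    -o log_gpt2_1558M \
    -e "d48" \
    -b 16 -t 1024 \
    -d 1048576 \
    -l 0.0006 \
    -x 32000
```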