All the papers cited in Andrej Karpathy's YouTube videos

I’ve been following Andrej Karpathy’s YouTube channel for a long time, and his videos made me want to dive deep into AI, neural networks and LLMs. They are interesting, precise and engaging, and they pushed me to read the primary sources, i.e. the papers and articles they build on. I’ve collected here all the papers mentioned in the various videos, as a personal reminder to read them all.

Since I’d already seen all the videos and didn’t want to watch them again just to find the papers, I downloaded the subtitles with the command below, searched them for every occurrence of the term “paper”, and then looked at the corresponding timestamps.

yt-dlp --skip-download --write-subs --write-auto-subs --sub-lang en --sub-format ttml --convert-subs srt --output "transcript_%(id)s_%(title)s.%(ext)s" ${URL}
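The search step can be done with a quick grep over the converted .srt files. Below is a minimal sketch, assuming the transcript files produced by the command above; the -B 2 option also prints the two lines preceding each match, which in an SRT file normally hold the cue number and its timestamp.

# list every mention of "paper" together with the preceding cue/timestamp lines
grep -i -B 2 "paper" transcript_*.srt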

Here is the list of videos, papers and PDF files:

The spelled-out intro to neural networks and backpropagation: building micrograd

https://www.youtube.com/watch?v=VMj-3S1tku0

No papers mentioned.

The spelled-out intro to language modeling: building makemore

https://www.youtube.com/watch?v=PaCmpygFfXo

No papers mentioned.

Building makemore Part 2: MLP

https://www.youtube.com/watch?v=TCH_1BHY58I

Bengio et al. 2003 - A Neural Probabilistic Language Model

https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Building makemore Part 3: Activations & Gradients, BatchNorm

https://www.youtube.com/watch?v=P6sfmUTpUmc

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

https://arxiv.org/abs/1502.01852

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

https://arxiv.org/abs/1502.03167

Building makemore Part 4: Becoming a Backprop Ninja

https://www.youtube.com/watch?v=q8SA3rM6ckI

Reducing the Dimensionality of Data with Neural Networks

https://www.cs.toronto.edu/~hinton/absps/science.pdf

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

https://arxiv.org/abs/1406.5679

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

https://arxiv.org/abs/1502.03167

Bessel’s Correction

https://mathcenter.oxford.emory.edu/site/math117/besselCorrection/

Building makemore Part 5: Building a WaveNet

https://www.youtube.com/watch?v=t3YJ5hKiMQ0

WaveNet: A Generative Model for Raw Audio

https://arxiv.org/abs/1609.03499

Bengio et al. 2003 - A Neural Probabilistic Language Model

https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Let’s build GPT: from scratch, in code, spelled out.

https://www.youtube.com/watch?v=kCc8FmEb1nY

Attention is All You Need

https://arxiv.org/abs/1706.03762

Layer Normalization

https://arxiv.org/abs/1607.06450

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

Language Models are Few-Shot Learners

https://arxiv.org/abs/2005.14165

Introducing ChatGPT

https://openai.com/index/chatgpt/

[1hr Talk] Intro to Large Language Models

https://www.youtube.com/watch?v=zjkBMFhNj_g

Llama 2: Open Foundation and Fine-Tuned Chat Models

https://arxiv.org/abs/2307.09288

Training language models to follow instructions with human feedback

https://arxiv.org/abs/2203.02155

Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al. 2023

https://arxiv.org/abs/2305.10601

Training Compute-Optimal Large Language Models

https://arxiv.org/abs/2203.15556

Sparks of Artificial General Intelligence: Early experiments with GPT-4, Bubeck et al. 2023

https://arxiv.org/abs/2303.12712

Mastering the game of Go with deep neural networks and tree search

https://www.researchgate.net/publication/292074166_Mastering_the_game_of_Go_with_deep_neural_networks_and_tree_search

Jailbroken: How Does LLM Safety Training Fail?

https://arxiv.org/abs/2307.02483

Universal and Transferable Adversarial Attacks on Aligned Language Models

https://arxiv.org/abs/2307.15043

Visual Adversarial Examples Jailbreak Aligned Large Language Models

https://arxiv.org/abs/2306.13213

Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

https://arxiv.org/abs/2302.12173

Hacking Google Bard - From Prompt Injection to Data Exfiltration

https://embracethered.com/blog/posts/2023/google-bard-data-exfiltration/

Poisoning Language Models During Instruction Tuning

https://arxiv.org/abs/2305.00944

Poisoning Web-Scale Training Datasets is Practical

https://arxiv.org/abs/2302.10149

OWASP Top 10 for LLM Applications

https://owasp.org/www-project-top-10-for-large-language-model-applications/

Let’s build the GPT Tokenizer

https://www.youtube.com/watch?v=zduSFxRajkE

Language Models are Unsupervised Multitask Learners

https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Llama 2: Open Foundation and Fine-Tuned Chat Models

https://arxiv.org/abs/2307.09288

A Programmer’s Introduction to Unicode

https://www.reedbeta.com/blog/programmers-intro-to-unicode/

MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

https://arxiv.org/abs/2305.07185

Efficient Training of Language Models to Fill in the Middle

https://arxiv.org/abs/2207.14255

Learning to Compress Prompts with Gist Tokens

https://arxiv.org/abs/2304.08467

Taming Transformers for High-Resolution Image Synthesis

https://arxiv.org/abs/2012.09841

https://compvis.github.io/taming-transformers/

Video generation models as world simulators

https://openai.com/index/video-generation-models-as-world-simulators/

Integer tokenization is insane

https://www.beren.io/2023-02-04-Integer-tokenization-is-insane/

SolidGoldMagikarp (plus, prompt generation)

https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

Let’s reproduce GPT-2 (124M)

https://www.youtube.com/watch?v=l8pRSuU81PU

Language Models are Unsupervised Multitask Learners

https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Language Models are Few-Shot Learners

https://arxiv.org/abs/2005.14165

Attention is All You Need

https://arxiv.org/abs/1706.03762

Gaussian Error Linear Units (GELUs)

https://arxiv.org/abs/1606.08415

Using the Output Embedding to Improve Language Models

https://arxiv.org/abs/1608.05859

NVIDIA A100 Tensor Core GPU Architecture

https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

https://arxiv.org/abs/2205.14135

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

https://arxiv.org/abs/2307.08691

Online normalizer calculation for softmax

https://arxiv.org/abs/1805.02867

HellaSwag: Can a Machine Really Finish Your Sentence?

https://arxiv.org/abs/1905.07830

Deep Dive into LLMs like ChatGPT

https://www.youtube.com/watch?v=7xTGNNLPyMI

Language Models are Unsupervised Multitask Learners

https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

The Llama 3 Herd of Models

https://arxiv.org/abs/2407.21783

Training language models to follow instructions with human feedback

https://arxiv.org/abs/2203.02155

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

https://arxiv.org/abs/2501.12948

Mastering the Game of Go without Human Knowledge

https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf

Fine-Tuning Language Models from Human Preferences

https://arxiv.org/abs/1909.08593

How I use LLMs

https://www.youtube.com/watch?v=EWvNQjAaOHw

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

https://arxiv.org/abs/2501.12948