All the papers cited in Andrej Karpathy's YouTube videos
I’ve been following Andrej Karpathy’s YouTube channel for a long time, and his videos made me want to dive deep into AI, neural networks, and LLMs. They are clear, precise, and engaging, and they pushed me to read the primary sources, i.e. the papers and articles he references. I’ve collected here all the papers mentioned across the videos, as a personal reminder to read them all.
Since I had already seen all the videos and didn’t want to watch them again just to find the papers, I downloaded the subtitles with the command below, searched for every occurrence of the word “paper”, and then checked the corresponding timestamps.
yt-dlp --skip-download --write-subs --write-auto-subs --sub-lang en --sub-format ttml --convert-subs srt --output "transcript_%(id)s_%(title)s.%(ext)s" ${URL}
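Once the .srt files are there, something like the grep below is enough to locate the mentions; it prints each occurrence of “paper” together with the timestamp lines just above it (the filenames simply follow the output template of the command above):

grep -i -B 2 "paper" transcript_*.srt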
Here is the list of videos, papers, and PDF files:
The spelled-out intro to neural networks and backpropagation: building micrograd
https://www.youtube.com/watch?v=VMj-3S1tku0 No papers mentioned.
The spelled-out intro to language modeling: building makemore
https://www.youtube.com/watch?v=PaCmpygFfXo No papers mentioned.
Building makemore Part 2: MLP
https://www.youtube.com/watch?v=TCH_1BHY58I
Bengio et al. 2003 - A Neural Probabilistic Language Model
https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
Building makemore Part 3: Activations & Gradients, BatchNorm
https://www.youtube.com/watch?v=P6sfmUTpUmc
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
https://arxiv.org/abs/1502.01852
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
https://arxiv.org/abs/1502.03167
Building makemore Part 4: Becoming a Backprop Ninja
https://www.youtube.com/watch?v=q8SA3rM6ckI
Reducing the Dimensionality of Data with Neural Networks
https://www.cs.toronto.edu/~hinton/absps/science.pdf
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
https://arxiv.org/abs/1406.5679
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
https://arxiv.org/abs/1502.03167
Bessel’s Correction
https://mathcenter.oxford.emory.edu/site/math117/besselCorrection/
Building makemore Part 5: Building a WaveNet
https://www.youtube.com/watch?v=t3YJ5hKiMQ0
WaveNet: A Generative Model for Raw Audio
https://arxiv.org/abs/1609.03499
Bengio et al. 2003 - A Neural Probabilistic Language Model
https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
Let’s build GPT: from scratch, in code, spelled out.
https://www.youtube.com/watch?v=kCc8FmEb1nY
Attention is All You Need
https://arxiv.org/abs/1706.03762
Layer Normalization
https://arxiv.org/abs/1607.06450
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
Language Models are Few-Shot Learners
https://arxiv.org/abs/2005.14165
Introducing ChatGPT
https://openai.com/index/chatgpt/
[1hr Talk] Intro to Large Language Models
https://www.youtube.com/watch?v=zjkBMFhNj_g
Llama 2: Open Foundation and Fine-Tuned Chat Models
https://arxiv.org/abs/2307.09288
Training language models to follow instructions with human feedback
https://arxiv.org/abs/2203.02155
Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al. 2023
https://arxiv.org/abs/2305.10601
Training Compute-Optimal Large Language Models
https://arxiv.org/abs/2203.15556
Sparks of Artificial General Intelligence: Early experiments with GPT-4, Bubeck et al. 2023
https://arxiv.org/abs/2303.12712
Mastering the game of Go with deep neural networks and tree search
Jailbroken: How Does LLM Safety Training Fail?
https://arxiv.org/abs/2307.02483
Universal and Transferable Adversarial Attacks on Aligned Language Models
https://arxiv.org/abs/2307.15043
Visual Adversarial Examples Jailbreak Aligned Large Language Models
https://arxiv.org/abs/2306.13213
Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
https://arxiv.org/abs/2302.12173
Hacking Google Bard - From Prompt Injection to Data Exfiltration
https://embracethered.com/blog/posts/2023/google-bard-data-exfiltration/
Poisoning Language Models During Instruction Tuning
https://arxiv.org/abs/2305.00944
Poisoning Web-Scale Training Datasets is Practical
https://arxiv.org/abs/2302.10149
OWASP Top 10 for LLM Applications
https://owasp.org/www-project-top-10-for-large-language-model-applications/
Let’s build the GPT Tokenizer
https://www.youtube.com/watch?v=zduSFxRajkE
Language Models are Unsupervised Multitask Learners
Llama 2: Open Foundation and Fine-Tuned Chat Models
https://arxiv.org/abs/2307.09288
A Programmer’s Introduction to Unicode
https://www.reedbeta.com/blog/programmers-intro-to-unicode/
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
https://arxiv.org/abs/2305.07185
Efficient Training of Language Models to Fill in the Middle
https://arxiv.org/abs/2207.14255
Learning to Compress Prompts with Gist Tokens
https://arxiv.org/abs/2304.08467
Taming Transformers for High-Resolution Image Synthesis
https://arxiv.org/abs/2012.09841 https://compvis.github.io/taming-transformers/
Video generation models as world simulators
https://openai.com/index/video-generation-models-as-world-simulators/
Integer tokenization is insane
https://www.beren.io/2023-02-04-Integer-tokenization-is-insane/
SolidGoldMagikarp (plus, prompt generation)
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
Let’s reproduce GPT-2 (124M)
https://www.youtube.com/watch?v=l8pRSuU81PU
Language Models are Unsupervised Multitask Learners
Language Models are Few-Shot Learners
https://arxiv.org/abs/2005.14165
Attention is All You Need
https://arxiv.org/abs/1706.03762
Gaussian Error Linear Units (GELUs)
https://arxiv.org/abs/1606.08415
Using the Output Embedding to Improve Language Models
https://arxiv.org/abs/1608.05859
NVIDIA A100 Tensor Core GPU Architecture
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
https://arxiv.org/abs/2205.14135
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
https://arxiv.org/abs/2307.08691
Online normalizer calculation for softmax
https://arxiv.org/abs/1805.02867
HellaSwag: Can a Machine Really Finish Your Sentence?
https://arxiv.org/abs/1905.07830
Deep Dive into LLMs like ChatGPT
https://www.youtube.com/watch?v=7xTGNNLPyMI
Language Models are Unsupervised Multitask Learners
The Llama 3 Herd of Models
https://arxiv.org/abs/2407.21783
Training language models to follow instructions with human feedback
https://arxiv.org/abs/2203.02155
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
https://arxiv.org/abs/2501.12948
Mastering the Game of Go without Human Knowledge
https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf
Fine-Tuning Language Models from Human Preferences
https://arxiv.org/abs/1909.08593
How I use LLMs
https://www.youtube.com/watch?v=EWvNQjAaOHw