Learn AI

April 14, 2023

A creative robot

(last edit: Mar 26, 2025)

Any sufficiently advanced technology is indistinguishable from magic.

Modern AIs and ChatGPT in particular look like magic to many people. This can lead to misunderstanding about their strengths and weaknesses, and a lot of unsubstantiated hype.

Learning about a technology is the best antidote. For curious computer scientists, software engineers and anyone else who isn't afraid of digging a bit deeper, I've compiled a list of useful resources on the topic.

This is basically my reading/watching list, organized from more fundamental or beginner friendly to the latest advances in the field. It's not exhaustive, but should give you (and me) enough knowledge to continue exploring and experimenting on your own.

General overviews

If you don't have a lot of time or don't know if you want to dedicate effort in learning the ins-and-outs of modern AIs, watch these first to give you a general overview:

Deep Dive into LLMs like ChatGPT – for text-generation and chatbot AIs
Diffusion models explained in 4 difficulty levels – for image generation AIs

Fundamentals of neural networks

The videos here provide both teorethical and hands-on introduction to the fundamentals of neural networks.

MIT Introduction to Deep Learning

A good theoretical intro is MIT's 6.S191 class lectures, especially the Introduction to Deep Learning and Recurrent Neural Networks, Transformers and Attention.

These overview lectures briefly introduce all the major elements and algorithms involved in creating and training neural networks. I don't think they works on their own (unless you're a student there, do all the in-class excercises, etc), but it's a good place to start with.

The topics discussed here will probably make your head spin and it won't be clear at all how to apply them in real life, but this will give you the lay of the land and prepare you for practical dive-in with, for example, Andrej's “Zero to Hero”.

The example code slides use TensorFlow. Since Andrej's course uses PyTorch, going through both sets of lectures will expose you to two most popular deep learning libraries.

Neural Networks: Zero to Hero

An awesome practical intro is the Neural Networks: Zero to Hero course by Andrej Karpathy (he also did the busy person's intro to LLMs linked above).

Andrej starts out slowly, by spelling out the computation involved in forward and backward passes of the neural network, and then gradually builds up to a single neuron, a single-layer network, multi-layer perceptron, deep networks and finally transformers (like GPT).

Throughout this, he introduces and uses tools like PyTorch (library for writing neural networks), and Jupyter Notebook, and Google Collab. Importantly, he first introduces and implements a concept manually, and only later switches to a PyTorch API that provides the same thing.

The only part where things look a bit rushed is the (currently) last – “Let's build GPT from scratch”. There's so much ground to cover there that Andrej skips over some parts (like the Adam optimization algorithm) and quickly goes over the others (self-attention, cross-attention).

The latest video in the series (not yet on the site as of this writing) is Let's Reproduce GPT-2, which can be also viewed standalone. In the video, he implements the full GPT-2 model (as described in the original paper), using PyTorch. By training on a newer, higher quality dataset, the model even approaches GPT-3 level of intelligence!

Overall a great guide. You only need to know the basics of Python, not be afraid of math (the heaviest of which is matrix multiplication which is spelled out), and do the excercises (code along the videos) without skipping the videos that don't seem exciting.

Understanding embeddings

Both the MIT and Andrej's lectures touch on embeddings (the way to turn words into numbers that a neural net can use) only lightly. To deepen your understanding, What Are Embeddings by Vicki Boykis will teach you everyhing (and I mean everything) about embeddings.

If you don't want to read an entire book but still dive deep, Illustrated word2vec article explains word2vec, a popular word embedding algorithm, step by step. It also features a video explanation for those that prefer it to text.

Another good lecture on the topic is Understanding Word2vec.

CNNs, autoencoders and GANs

The MIT lectures mention earlier also contain lessons on Convolutional Neural Networks, autoencoders and GANs, which are important building blocks in neural networks used in vision.

Again, these are high level overviews and although formulas are present, the lectures more give an overview of the algorithms without going into too much detail. That makes them ideal prequel to the Practical Deep Learning course by Fast.ai.

Diffusion models

Diffusion models build on top of CNNs to create image-generating and manipulating AI models. Beyond the general overview linked earlier, the Introduction to Diffusion Models for Machine Learning is a deep dive into exactly how they work.

Coding Stable Diffusion from scratch in PyTorch is a hands-on implementation of the stable diffusion paper, similar in style to Karpathy's.

Practical Deep Learning

Practical Deep Learning is a free course by Fast.ai that has (current count) 25 lectures covering both high-level practical parts of neural networks and the underlying fundamentals.

In particular in Part 2 they cover “zero to hero” on Stable Diffusion, a powerful image-generation AI model.

Large Language Models

These resources go in-depth about constructing and using large language models (like GPT):

Transformers

Andrej's course goes over the transformer (building blocks of GPT) architecture, but the complexity makes it easy to get lost at first pass. To solidify your understanding of the topic, these two are super useful:

The Illustrated Transformer describes the transformer (building blocks of GPT) in detail while avoiding tedious math or programming details. It provides a good intuition into what's going on (and there's even an accompanying video you may want to watch as a gentler intro).

Follow that up with The Annotated Transformer which describes the scientific paper that introduced Transformers and implements it in PyTorch. Since it's 1:1 annotation of the paper, you need a lot of understanding already so only attempt going through this once you've watched both Andrej's course and once you've read and understood the Illustrated Transformer.

Reinforcement Learning through Human Feedback

Language models are good at predicting and generating text, which is different from answering questions or having a conversation. RLHF is used to fine tune the models to be able to communicate in this way.

The way RLHF works is that people score (a limited number of) outputs from the model. These outputs and scores are then used to train a separate “reward model”, which proxies for human judgement. The reward model is then used to train the LLM.

Illustrating Reinforcement Learning through Human Feedback from folks at HuggingFace (an open source AI community) provides a good overview of RLHF. They also did a webinar based on it (video is the complete webinar, link jumps directly to start of RLHF description) based on the blog post.

If you want to dive deeper, here's the InstructGPT paper from OpenAI, which basically describes the method they used to create ChatGPT out of GPT3 (InstructGPT was a research precursor to ChatGPT).

Fine-tuning

Fine-tuning allows us to refine or specialize an already (pre)-trained LLM to be better at a specific task (RLHF is one example).

Sebastian Rashka's Finetuning Large Language Models explains a few common aproaches to finetuning, with code examples using PyTorch. He follows that up with Understanding Parameter-Efficient LLM Finetuning, a blog post discussing ways to lower the number of parameters required, and an in-depth article about Parameter-Efficient LLM Finetuning with Low-Rank Adaptation (LoRA).

While fine-tuning excels at making the model behave differently, RAG (Retrieval-Augumented Generation) is often a better choice for giving additional information or context to the LLM. A good overview of use cases for both technologies is A Survey of Techniques for Maximizing LLM Performance by OpenAI.

Reasoning

Newer models like OpenAIs o1 and o3 and DeepSeek R1, are reasoning models. This means they're optimized to spend more time/tokens at inference time (while answering a question) to “think through” before answering.

Jay Alammar (from “Illustrated *” series) wrote a good overview of the R1 process at The Illustrated DeepSeek-R1. Notably, R1 uses Reinforcement Learning (RL), not RLHF (see above) – the feedback on LLM output is based on objective non-human factors (functions, heuristics, etc).

Another approach is s1: Simple test-time scaling (PDF) paper, which is pretty technical but the major point (forcing LLM to continue by injecting “wait” into its output) is simple and straightforward.

Check also Sebastian Raschka's Understanding Reasoning LLMs for an overview of the reasoning LLM methods.

Agents

Agents are LLM-driven components that can use tools (for example access the web) and communicate with other agents in the system to collaboratively solve a complex problem. Here's a good overview of agentic workflows by Andrew Ng, followed by a deeper dive by Harrison Chase (both videos from the Sequoia “AI Ascent” conference).

Building Effective Agents is a practical guide for building agents, workflows and other agentic AI patterns.

Non-LLM neural networks

Aside from the increasingly dominant Large Language Models, there are many specialized neural networks (some also using transformer architecture, or part of it) optimized for specific domains. Here are some of them:

Time series

Time series models attempt to undertand the patterns of numerical data series and forecast future ones. Typically the neural network is trained on a large number of publicly-available time series. At inference, the network is given time series as a context and outputs the forecast. Some models are base models and can be further fine-tuned to better model a specific domain.

Many modern time series models are transformer based:

Transformers are effective for time-series forecasting gives an overview of transformer-based time series model architecture
Moirai introduces an open source time series model by Salesforme, and also includes a link to the research paper describing it in great detail
Chronos and Chronos Bolt are based on T5 (Text-to-Text Transfer Transformer) architecture
TimesFM by Google and TimeGPT by Nixtla are other notable foundation models

A video overview of Chronos is also a good starting point to dive into the time series models in general. Decoding time series: the role of transformers in forecasting is a good overview with some benchmarks of the models.

Full courses

If you want a really deep dive (undergrad or higher level), follow these courses including doing the excercises / playing around with the code:

Neural Networks Zero to Hero (Andrej Karpathy)
Introduction to Deep Learning (MIT)
Introduction to Deep Learning (Sebastian Raschka)
Practical Deep Learning for Coders (Fast.ai)
What Are Embeddings (Vicki Boykis)

This is a living document (ie. it's a work in progress and always will be). Come back in a few weeks and check if there's anything new.