TIL: Things I've Learned About AI

A running collection of rules of thumb, mental models, and non-obvious insights from working with AI systems.


Grokking: Delayed Generalization in Transformers

Models can appear to “memorize” training data for a long time before suddenly generalizing — this is called grokking. It happens because generalization requires the model to discover a compact internal algorithm, which takes many more gradient steps than simple memorization. The implication: don’t stop training too early, and watch validation loss long after training loss has plateaued. Weight decay helps grokking happen faster.
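The "watch validation long after the train plateau" advice can be encoded as a stopping rule. A minimal sketch, with purely illustrative thresholds and function names (not from any library):

```python
# Sketch of a grokking-aware stopping rule: stop only when training loss has
# been flat AND validation loss has not improved for a long patience window,
# since generalization can arrive many steps after the train-loss plateau.
# `plateau_tol` and `patience` are illustrative values.

def should_stop(train_losses, val_losses, plateau_tol=1e-3, patience=5000):
    """Return True only if train loss is flat and validation has stalled
    for `patience` consecutive evaluations."""
    if len(train_losses) < patience + 1:
        return False
    recent_train = train_losses[-patience:]
    train_flat = max(recent_train) - min(recent_train) < plateau_tol
    best_before = min(val_losses[:-patience])
    val_stalled = min(val_losses[-patience:]) >= best_before
    return train_flat and val_stalled
```

With a plain "stop when train loss plateaus" rule, a run that is about to grok would be killed; here the flat-train condition alone is never enough to stop.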

Research: grokking conditions →


Weight Decay

Weight decay pushes weights toward zero each step, which biases the model toward simpler solutions and helps generalization. Trade-off: it also makes the model forget faster — useful in some continual learning setups, problematic in others.
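The "pushes weights toward zero each step" mechanism is easy to see in a decoupled (AdamW-style) update, sketched here with plain SGD for clarity; the function name and hyperparameter values are illustrative:

```python
# Decoupled weight decay: each step takes a gradient step AND shrinks every
# weight toward zero by a factor of (1 - lr * weight_decay).

def sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=0.01):
    """One update: gradient descent plus a pull of each weight toward zero."""
    return [wi - lr * gi - lr * weight_decay * wi for wi, gi in zip(w, grad)]

weights = [1.0, -2.0, 0.5]
zero_grad = [0.0, 0.0, 0.0]

# With a zero gradient, decay acts alone: every weight shrinks by 0.1%.
weights = sgd_step_with_weight_decay(weights, zero_grad)
print(weights)
```

The zero-gradient step isolates the decay term: weights that the loss doesn't care about drift steadily toward zero, which is exactly the bias toward simpler solutions (and the faster forgetting) described above.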

LR Warmup & Cosine Decay

LR warmup starts with a tiny learning rate and ramps up over the first few thousand steps. Early in training gradients are noisy and a large LR causes instability, so you ease in.

Cosine decay slowly reduces the learning rate toward zero following a cosine curve. The reason: a high LR late in training keeps the model bouncing around the loss landscape instead of settling. Bringing it down lets the model commit to a solution. Trade-off: without decay the loss can plateau above its potential minimum because the model never settles; with decay the late-training steps are smaller, so the model may learn more slowly than it otherwise could.
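Both pieces fit in one schedule function. A minimal sketch with illustrative hyperparameters (the step counts and LR values are made up, not from any particular run):

```python
import math

def lr_schedule(step, max_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ease in: ramp linearly from near zero while gradients are still noisy.
        return max_lr * (step + 1) / warmup_steps
    # Cosine goes from 1 at the end of warmup to 0 at total_steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine

print(lr_schedule(0))        # tiny LR at the first step
print(lr_schedule(2000))     # peak LR right after warmup
print(lr_schedule(100_000))  # back near min_lr at the end
```

Calling this once per step (rather than mutating the LR in place) also makes the checkpoint gotcha below easy to reason about: the LR is a pure function of the step.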

Known gotcha: if you restore from a checkpoint mid-training, the learning rate resets to wherever the schedule says it should be at that step — but the model weights are already adapted to a different LR. This mismatch causes a visible jump in training and validation loss.
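One way to keep the restored LR consistent with the weights is to checkpoint the schedule position alongside them. A sketch assuming a step-to-LR function and a dict-style checkpoint (both the layout and the schedule here are illustrative):

```python
# Save the step with the weights, so the resumed run computes the LR for the
# step the weights actually reached, rather than restarting the schedule.

def save_checkpoint(weights, step):
    return {"weights": list(weights), "step": step}

def resume(checkpoint, schedule):
    step = checkpoint["step"]
    return checkpoint["weights"], step, schedule(step)

ckpt = save_checkpoint([0.1, 0.2], step=40_000)
_, step, lr = resume(ckpt, lambda s: 3e-4 * (1 - s / 100_000))
print(step, lr)  # resumes at the mid-schedule LR, not the initial one
```

If the schedule itself changed between runs (different peak LR, different total steps), no amount of state restoration avoids the jump; the LR the weights were adapted to simply no longer exists in the new schedule.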

Research: three-way comparison →


Learning vs. Retaining: Models Forget When They Learn New Things

Training a model on new knowledge tends to degrade previously learned knowledge — catastrophic forgetting. This creates a fundamental tension: a model optimized to learn a new task quickly will often overwrite representations it needs to retain old ones. Understanding this trade-off is essential before designing any continual learning or fine-tuning strategy.
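One common mitigation for this tension is rehearsal: mixing a few stored old-task examples into each new-task batch so fine-tuning never sees only the new distribution. A minimal sketch with an illustrative buffer and mix ratio (not a prescription for real replay-buffer sizing):

```python
import random

# Rehearsal sketch: each training batch is part replay (old task), part new.

def make_batch(new_examples, replay_buffer, batch_size=8, replay_fraction=0.25):
    """Build a batch with roughly replay_fraction old-task examples."""
    n_replay = min(int(batch_size * replay_fraction), len(replay_buffer))
    batch = random.sample(replay_buffer, n_replay)
    batch += random.sample(new_examples, batch_size - n_replay)
    random.shuffle(batch)
    return batch

old = [("old", i) for i in range(100)]
new = [("new", i) for i in range(100)]
batch = make_batch(new, old)
print(sum(1 for tag, _ in batch if tag == "old"))  # 2 of the 8 come from the old task
```

The design choice is the trade-off from the section above made explicit: raising `replay_fraction` protects old knowledge at the cost of slower learning on the new task.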

Research: sequential modular eval →

Last modified February 28, 2026: Minor tweak (f28e17e)
Citations
Baraban (2026). TIL: Things I've Learned About AI. https://KintaroAI.com/blog/2026/02/28/til-things-ive-learned-about-ai/ (KintaroAI)
@misc{baraban2026tilthingsivelearnedaboutai,
    author = {Baraban},
    title = {TIL: Things I've Learned About AI},
    year = {2026},
    url = {https://KintaroAI.com/blog/2026/02/28/til-things-ive-learned-about-ai/},
}