
Understanding the Core Concepts of AI: Part 1

A Deep Dive into Large Language Models and Their Building Blocks


I've been working as a software engineer at a startup for quite some time, and now I'm excited to move into the AI field. There are so many topics to explore, and it can feel overwhelming. To make it easier, I started by learning the basics of Large Language Models to understand how they work. I found a lot of interesting topics, so I decided to write a series of blog posts about them. This series will cover the key Building Blocks of AI.

Large Language Model

A Large Language Model (LLM) is a large Neural Network made up of many stacked transformer layers. It is trained to predict the next token in an input sequence. The model breaks the user's input into tokens and represents each one as a vector. Each transformer layer has multiple sub-layers that let every token compare itself mathematically with every other token. This process repeats layer after layer, and at the end the model produces a probability distribution over the next token.

For example, if we type “All that glitters is not …” into ChatGPT or Gemini, it predicts “gold.” Similarly, if we ask a well-read person about a book involving a ship sinking, they will immediately suggest the Titanic.

Tokenization

User text → Tokenizer → Token IDs

As mentioned earlier, the user's query is broken down into smaller pieces (tokens) that the AI can understand. This process is called tokenization. For example, if the user writes "All that glitters," the LLM can split it into tokens like “All,” “that,” “glitt,” “ers.” Other examples include "eat" + "ing," "danc" + "ing," "sing" + "ing." Tokens are not necessarily whole words; they are pieces of text, each mapped to a numeric ID, because a Neural Network cannot process raw characters directly.

Tokenization is important because words vary a lot: "run," "running," and "runners" are different words with similar meanings. Tokenization creates a fixed-size vocabulary that can represent any text. The final query might come back as something like [72, 1632, 9872, 3123, …], token IDs that are then sent to the embedding layer.
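
To make this concrete, here is a toy sketch of greedy longest-match tokenization. The vocabulary and its ID numbers are invented for illustration (chosen to echo the example IDs above); real tokenizers such as BPE learn their vocabulary from data.

```python
# Toy greedy longest-match tokenizer. The vocabulary and IDs below are
# made up for illustration; real tokenizers learn these from data.
VOCAB = {
    "All": 72, " that": 1632, " glitt": 9872, "ers": 3123,
    " is": 318, " not": 407, " gold": 3869,
}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest vocabulary piece at each position."""
    ids = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible piece first.
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                match = piece
                break
        if match is None:
            raise ValueError(f"no token covers position {i}: {text[i:]!r}")
        ids.append(VOCAB[match])
        i += len(match)
    return ids

print(tokenize("All that glitters"))  # [72, 1632, 9872, 3123]
```

Note how "glitters" has no entry of its own, so it falls apart into two smaller pieces, " glitt" and "ers"; this is exactly how a fixed vocabulary can still cover any text.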

Vectorization

Text → Tokens → Token IDs → Vectors (Embeddings) → Transformer Layers

The token IDs are fed into the embedding layer, which converts each of them into a high-dimensional vector. These vectors then pass through the transformer layers, each of which contains attention and feed-forward sub-layers.

Words with similar meanings are placed close to each other. For example, "happy" and "joy" are positioned mathematically near each other. This vector represents the token’s meaning, context, and relationship with other tokens.

This process is essential because a Neural Network requires continuous values (floating-point numbers) to learn, and the meanings should be mathematically compressed. "Run" and "jog" should be close together, while "run" and "sofa" should be far apart. Vectors are learned during training and are not calculated by a formula.
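
As a sketch of how "close together" is measured, here is cosine similarity over hand-made three-dimensional vectors. The numbers are invented for illustration; real embeddings have hundreds or thousands of dimensions, and their values are learned during training rather than chosen by hand.

```python
import math

# Hand-made 3-dimensional "embeddings" for illustration only.
EMBEDDINGS = {
    "run":  [0.9, 0.8, 0.1],
    "jog":  [0.8, 0.9, 0.2],
    "sofa": [0.1, 0.0, 0.9],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(EMBEDDINGS["run"], EMBEDDINGS["jog"]))   # high: close in meaning
print(cosine_similarity(EMBEDDINGS["run"], EMBEDDINGS["sofa"]))  # low: unrelated
```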

Attention

Tokens → Token IDs → Vectors → Transformer Layer (Attention → Feed-forward) → Next layer → Repeat

Attention is a mathematical tool that allows each token to determine which other tokens are important and to what extent. It examines nearby tokens to clear up any confusion and calculates how much "attention" one token should give to others. For example, in the phrase “Apple’s Revenue,” the model focuses on “revenue” to understand that Apple refers to the company, not the fruit. This mechanism aids in understanding context.

The LLMs we see today exist because of the Attention mechanism described in the well-known 2017 paper Attention Is All You Need by researchers at Google. Before Attention was introduced, language models relied on Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which processed text from left to right and often lost the context of earlier tokens.

For example, in the sentence “The dog that chased the cat was hungry,” to understand "was hungry," the model needs to connect it back to "dog." RNNs had difficulty with this. Thanks to Attention, even if two words are thousands of tokens apart (anywhere within the context window), they can be linked directly, and since all tokens are processed simultaneously, it is also very fast. Nothing in LLMs functions without Attention; it is the core engine of their intelligence.
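
The mechanism itself is compact. Below is a minimal sketch of scaled dot-product attention, the softmax(QKᵀ/√d)·V computation from the paper, in plain Python. The toy Q, K, V vectors are invented for illustration; in a real model they are produced from the token embeddings by learned weight matrices.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    out = []
    for q in Q:
        # How relevant is each key to this query token?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output = attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two toy token vectors; each one attends over both.
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
print(attention(Q, K, V))
```

Each token ends up attending mostly to itself here, because the toy queries match their own keys best; with learned weights, the pattern instead reflects which tokens carry the context a token needs.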

Self-Supervised Learning

Vectors → Transformer → Predict next token → Compute loss → Backpropagation → Update weights

Self-supervised learning is a training method in which the model learns from unlabeled data by creating its own training labels, instead of relying on humans to label everything manually. It hides parts of the data and tries to guess what’s missing. For example, if the sentence is “the sky is ___” and the model answers "red," the loss is high and the weights behind that prediction are adjusted; when it answers "blue," the loss is low and those weights are reinforced.

Each time it predicts the next token, embeddings become more refined, attention weights are adjusted, and the multilayer representation improves. Gradually, the model learns that “cat” often appears near “fur,” “pet,” “animal,” so it creates the vector embedding accordingly. This intelligence comes from compressing patterns; Self-Supervision is essentially a large pattern compression engine.
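
The training signal behind this can be sketched as a cross-entropy loss on the next token. The three-word vocabulary and the probabilities below are made up for illustration; real models compute this over a vocabulary of tens of thousands of tokens.

```python
import math

# Toy vocabulary and a made-up model distribution for "the sky is ___".
vocab = ["red", "blue", "green"]
predicted_probs = [0.2, 0.7, 0.1]
target = "blue"  # the actual next token in the training text

# Cross-entropy loss: -log(probability assigned to the correct token).
loss = -math.log(predicted_probs[vocab.index(target)])
print(round(loss, 4))  # 0.3567 -- low loss, most mass was on "blue"

# If the text had continued with "red" instead, the loss would be higher,
# and backpropagation would push the weights toward that answer:
loss_if_red = -math.log(predicted_probs[vocab.index("red")])
print(round(loss_if_red, 4))  # 1.6094
```

There is no reward signal as such: "reward" and "penalty" are both just this one loss number, which backpropagation uses to nudge every weight in the right direction.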

Transformer

The Transformer is the architecture used in modern LLMs; it employs the Attention mechanism to process all tokens simultaneously through self-attention and feed-forward networks. It forms the foundation of all modern LLMs, and the “T” in GPT stands for Transformer.

Each layer of a transformer consists of two main components:

  • Multi-Head Self-Attention: Generates Q, K, V vectors and calculates relevance.

  • Feed-Forward Neural Network (FFNN): Once attention provides context, the FFNN further transforms the vector.

In addition to these two components, each transformer layer also includes normalization layers and residual connections. These ensure that information flows smoothly through very deep networks without vanishing or exploding, allowing transformers to scale to hundreds of layers. Residual connections help the model “remember” the original input signal while still applying complex transformations. Layer normalization stabilizes training and improves convergence, making transformers far more efficient and scalable than older architectures like RNNs or LSTMs.
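
Putting the pieces together, a single layer can be sketched as two residual blocks, each wrapped around a layer-normalized sub-layer (the pre-norm arrangement used by many modern models). The attention and FFN stand-ins below are placeholders for illustration; in a real model they are learned functions.

```python
import math

def layer_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    """Normalize a vector to zero mean, unit variance (no learned scale/shift here)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def transformer_block(x, attention, ffn):
    """Pre-norm transformer layer: residual connections around each sub-layer."""
    # Residual connection 1: x + Attention(LayerNorm(x))
    a = attention(layer_norm(x))
    x = [xi + ai for xi, ai in zip(x, a)]
    # Residual connection 2: x + FFN(LayerNorm(x))
    f = ffn(layer_norm(x))
    return [xi + fi for xi, fi in zip(x, f)]

# Stand-in sub-layers; real ones are learned weight matrices.
identity = lambda v: v
halve = lambda v: [0.5 * vi for vi in v]

print(transformer_block([1.0, 2.0, 3.0], identity, halve))
```

Because each sub-layer's output is added back onto its input, the original signal always has a direct path through the layer, which is what lets gradients survive stacks of hundreds of layers.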

This is the first step in my deep dive into how modern AI systems actually work. I’ll keep exploring the remaining building blocks: training, inference, optimizers, quantization, fine-tuning, and more, and publish them as the next parts of this series. If you want to follow the full breakdown end to end, the upcoming posts will connect everything together.