Raghav's Blog

Understanding the Core Concepts of AI: Part 1

Raghav Shukla — Wed, 26 Nov 2025 11:41:16 GMT

I've been working as a software engineer at a startup for quite some time, and now I'm excited to move into the AI field. There are so many topics to explore, and it can feel overwhelming. To make it easier, I started by learning the basics of Large Language Models to understand how they work. I found a lot of interesting topics, so I decided to write a series of blog posts about them. This series will cover the key Building Blocks of AI.

Large Language Model

A Large Language Model (LLM) is a large neural network made up of many transformer layers. It is trained to predict the next token in a sequence. The model breaks the user input into tokens and represents each one as a vector. Each transformer layer has multiple sub-layers that let every token compare itself mathematically with all the other tokens in the input. This process is repeated in every layer (modern models stack dozens of them), and at the end the model produces a probability distribution for the next token.

For example, if we type “All that glitters is not …” into ChatGPT or Gemini, it predicts “gold.” In the same way, if we ask a well-read person about a famous ship that sank, they will immediately suggest the Titanic.

Tokenization

User text → Tokenizer → Token IDs

As mentioned earlier, the user's query is broken down into smaller parts (tokens) that the model can work with. This process is called tokenization. For example, if the user writes "All that glitters," the tokenizer might split it into pieces like “All,” “that,” “glit,” “ters” (the exact split depends on the tokenizer). Similarly, words such as "eating," "dancing," and "singing" may be split into a stem plus a shared “ing” piece. Tokens are not the same as words; they are pieces of text, and each piece is mapped to an integer ID. This step is needed because a neural network operates on numbers, not on raw text.

Tokenization is important because words can vary a lot: "run," "running," and "runners" are different words with related meanings, and subword tokens let them share pieces. It also gives the model a fixed-size vocabulary that can represent any text. The tokenizer's output might look like [72, 1632, 9872, 3123, …], and these IDs are then sent to the embedding layer.
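
To make the text-to-IDs step concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary, the IDs, and the function name are all made up for illustration; real tokenizers learn their vocabulary from data and are far larger.

```python
# Toy greedy longest-match subword tokenizer.
# VOCAB and its IDs are invented for this example; real tokenizers
# learn tens of thousands of pieces from training data.
VOCAB = {"all": 0, "that": 1, "glit": 2, "ters": 3,
         "run": 4, "ning": 5, "ner": 6, "s": 7, " ": 8}
UNKNOWN_ID = 9  # fallback for characters the vocabulary does not cover


def tokenize(text):
    """Split text into the longest known pieces and return their IDs."""
    text = text.lower()
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest possible piece first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            # Nothing matched: emit the unknown ID and skip one character.
            ids.append(UNKNOWN_ID)
            pos += 1
    return ids


print(tokenize("All that glitters"))
print(tokenize("running"))
```

Notice that "running" and "runners" both start with the ID for "run", which is how related words end up sharing pieces.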

Vectorization

Text → Tokens → Token IDs → Vectors (Embeddings) → Transformer Layers

The token IDs are fed into the embedding layer, which converts them into high-dimensional vectors. These vectors then pass through the stack of transformer layers, each of which contains an attention sub-layer and a feed-forward sub-layer.

Words with similar meanings are placed close to each other. For example, "happy" and "joy" end up near each other in the vector space. Each token's vector encodes its meaning and its relationship to other tokens; the surrounding context is mixed in later by the attention layers.

This process is essential because a neural network needs continuous values (floating-point numbers) to learn, and meaning has to be encoded in a form where distance can be measured. "Run" and "jog" should be close together, while "run" and "sofa" should be far apart. The vectors are learned during training; they are not calculated by a fixed formula.
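
One common way to measure "close together" is cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors that I invented for illustration; real embeddings have hundreds or thousands of dimensions and come out of training.

```python
import math

# Hand-made vectors for illustration only; real embeddings are learned.
EMBEDDINGS = {
    "run": [0.9, 0.8, 0.1],
    "jog": [0.8, 0.9, 0.2],
    "sofa": [0.1, 0.0, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity(EMBEDDINGS["run"], EMBEDDINGS["jog"]))   # close to 1
print(cosine_similarity(EMBEDDINGS["run"], EMBEDDINGS["sofa"]))  # much lower
```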

Attention

Tokens → Token IDs → Vectors → Transformer Layer → Attention → Feed-forward → Next layer → Repeat

Attention is a mathematical tool that allows each token to determine which other tokens are important and to what extent. It looks at the other tokens in the context, not only the neighbouring ones, to resolve ambiguity, and calculates how much "attention" one token should give to each of the others. For example, in the phrase “Apple’s Revenue,” the model focuses on “revenue” to understand that Apple refers to the company, not the fruit. This mechanism is how the model understands context.

The LLMs we see today exist because of the Transformer architecture, introduced in the well-known 2017 paper Attention Is All You Need by researchers at Google, which is built entirely around attention. Before Transformers, language models relied on Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which processed tokens one at a time from left to right and often lost the context from earlier tokens.

For example, in the sentence “The dog that chased the cat was hungry,” to understand "was hungry," the model needs to connect it back to "dog." RNNs had difficulty with this. With attention, two words can be linked directly even if they are thousands of tokens apart, as long as both fit in the model's context window. Attention also processes all tokens in parallel instead of one by one, which makes training much faster. It is the core mechanism of modern LLMs.
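
Here is a minimal sketch of scaled dot-product attention, the calculation at the heart of the mechanism, written in plain Python. The token vectors are made up, and I reuse them directly as queries, keys, and values; a real model first multiplies them by learned projection matrices.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How well does this query match every key?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Three tokens with made-up 2-dimensional vectors (illustration only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one token attends to every token in the sequence, including itself.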

Self-Supervised Learning

Vectors → Transformer → Predict next token → Compute loss → Backpropagation → Update weights

Self-supervised learning is a training method where the model learns from unlabeled data by creating its own training labels, instead of relying on humans to label everything manually. It hides part of the data and tries to guess what’s missing; for an LLM, the hidden part is simply the next token. For example, given “the sky is ___” where the original text continues with "blue," a model that puts most of its probability on "red" gets a high loss, and backpropagation adjusts the weights so that "blue" becomes more likely next time. If the model already favors "blue," the loss is low and the weights barely change.

Each time it predicts the next token, embeddings become more refined, attention weights are adjusted, and the multilayer representation improves. Gradually, the model learns that “cat” often appears near “fur,” “pet,” “animal,” so it creates the vector embedding accordingly. This intelligence comes from compressing patterns; Self-Supervision is essentially a large pattern compression engine.
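
The "compute loss" step can be sketched with cross-entropy, the loss typically used for next-token prediction. The scores (logits) below are numbers I made up for the “the sky is ___” example; a real model produces one score per vocabulary entry.

```python
import math


def next_token_loss(logits, correct_token):
    """Cross-entropy loss: -log(probability given to the correct token)."""
    highest = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - highest) for tok, score in logits.items()}
    total = sum(exps.values())
    probability = exps[correct_token] / total
    return -math.log(probability)


# Made-up scores for the next token after "the sky is".
favors_red = {"blue": 0.5, "red": 3.0, "green": 0.1}
favors_blue = {"blue": 3.0, "red": 0.5, "green": 0.1}

print(next_token_loss(favors_red, "blue"))   # high loss: the model was wrong
print(next_token_loss(favors_blue, "blue"))  # low loss: the model was right
```

Training repeats this over billions of tokens, and backpropagation nudges the weights in whichever direction lowers the loss.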

Transformer

The Transformer is the architecture used in modern LLMs. It uses the Attention mechanism to process all tokens simultaneously through self-attention and feed-forward networks, and it forms the foundation of nearly all modern LLMs. The “T” in GPT stands for Transformer.

Each layer of a transformer consists of two main components:

Multi-Head Self-Attention: Generates query (Q), key (K), and value (V) vectors for each token and uses them to calculate how relevant every token is to every other token.
Feed-Forward Neural Network (FFNN): Once attention provides context, the FFNN further transforms the vector.

In addition to these two components, each transformer layer also includes normalization layers and residual connections. These ensure that information flows smoothly through very deep networks without vanishing or exploding, allowing transformers to scale to hundreds of layers. Residual connections help the model “remember” the original input signal while still applying complex transformations. Layer normalization stabilizes training and improves convergence, making transformers far more efficient and scalable than older architectures like RNNs or LSTMs.
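
A rough sketch of how a residual connection and layer normalization wrap a sub-layer is shown below. It uses the "pre-norm" arrangement (normalize, transform, then add the input back); the original Transformer paper applied normalization after the addition instead. The learnable scale and bias of real layer normalization are left out, and toy_sublayer is a made-up stand-in for attention or the FFNN.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Rescale a vector to zero mean and (almost exactly) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def residual_block(vector, sublayer):
    """Pre-norm residual block: output = x + sublayer(layer_norm(x))."""
    transformed = sublayer(layer_norm(vector))
    # Adding the input back is what lets the original signal pass through.
    return [x + t for x, t in zip(vector, transformed)]


def toy_sublayer(vector):
    """Made-up stand-in for attention or the feed-forward network."""
    return [0.1 * x for x in vector]


print(residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer))
```

If the sub-layer contributes nothing, the block returns its input unchanged, which is why very deep stacks of such blocks can still pass information through.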

This is the first step in my deep dive into how modern AI systems actually work. I’ll keep exploring the remaining building blocks (training, inference, optimizers, quantization, fine-tuning, and more) and publish them as the next parts of this series. If you want to follow the full breakdown end-to-end, the upcoming posts will connect everything together.

System Design 101 : Scale from Zero to Millions of Users

Raghav Shukla — Tue, 26 Aug 2025 17:58:02 GMT

In the past three years, I’ve mostly worked with startups, and the common approach was always “build fast, optimize later.” It definitely helped us move quickly, but when the user base started growing, we ran into scaling issues that were tough to handle in production.

That’s when I realized—it’s not about choosing between speed and scale. The real trick is finding a balance: shipping fast while also putting just enough foundation in place so the system can handle growth. After all, every startup dreams of reaching a million users, and being a little prepared early makes the ride much smoother.

Recently, while learning more about system design and reading the ByteByteGo book, I picked up a few lessons that I think are worth sharing.

Single Server Setup

To keep things simple, we’ll start by running everything on a single server. When a user accesses the website by entering a domain name like raghavdev.in, the DNS server returns the corresponding IP address. Once the IP is received, the browser sends an HTTP request to our server, which then responds with either HTML or JSON for rendering. This forms the initial setup of our system.

Database

As we discussed earlier, the single server setup works well in the beginning, but it quickly becomes a limitation once the user base starts growing. To scale more effectively, we need to separate the web server and the database so that each can grow independently.

Now a big question arises:

Which Database to choose?

Broadly, there are two types of databases:

SQL (Relational Databases) → Store data in tables with rows and columns. Popular examples are MySQL and PostgreSQL.
NoSQL (Non-Relational Databases) → Store data in more flexible formats like key-value pairs, documents, graphs, or columns. Examples include MongoDB, Cassandra, Neo4j, CouchDB, and Amazon DynamoDB.

The right choice depends on the project’s requirements, and in many cases, teams even combine multiple databases to meet different use cases.

Load Balancer

Now that the web server and database are separated, a new problem arises when many users access the website at the same time. Once the server limit is reached, responses slow down, which is something we don’t want. To handle this, we add a Load Balancer that evenly distributes incoming traffic across multiple servers.

Now the users connect only to the Load Balancer (public IP) and not directly to the servers. It communicates with the web servers through private IPs within the same network, making the servers unreachable from outside.

A Load Balancer also prevents downtime if one server goes offline. If Server 1 fails, all requests are automatically redirected to Server 2, so users don’t face interruptions. As traffic grows, we can keep adding more servers, and the load balancer will use its algorithm to distribute requests evenly among them. This makes the system both scalable and highly available.
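
To illustrate the idea, here is a small sketch of a round-robin load balancer that skips servers marked as down. The class, method names, and private IPs are my own inventions; real load balancers (NGINX, HAProxy, cloud load balancers) run health checks themselves and support several algorithms.

```python
class LoadBalancer:
    """Round-robin balancer that skips unhealthy servers (illustrative only)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


# Hypothetical private IPs of two web servers.
lb = LoadBalancer(["10.0.0.1", "10.0.0.2"])
print([lb.pick() for _ in range(4)])
lb.mark_down("10.0.0.1")  # Server 1 fails
print([lb.pick() for _ in range(2)])
```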

Database Replication

With the Load Balancer in place, we solved the issue of servers going offline. But what if the same thing happens to our database? A single database failure would still bring the whole system down, which we need to avoid.

To keep the database safe, we can use Database Replication, where databases follow a master–slave relationship. The Master DB holds the original data and handles all write operations (insert, update, delete), while the Slave DBs maintain copies of that data and serve read operations. Since most systems perform more reads than writes, the number of slave databases is usually greater than the number of masters.

As we discussed earlier, the Load Balancer improves system availability for servers, and replication does the same for databases. If a Slave DB goes offline, the reads can temporarily go to the master, and if the Master DB fails, one of the slaves can be promoted to master so the system keeps running with little or no downtime.
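
The routing and failover rules above can be sketched like this. The class and node names are hypothetical, and the promotion logic is deliberately naive: a real setup has to pick the most up-to-date replica and deal with replication lag.

```python
class ReplicatedDatabase:
    """Routes writes to the master and reads to the slaves (illustrative only)."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self._next_read = 0

    def route(self, operation):
        """Return the node that should serve a 'read' or a 'write'."""
        if operation == "write":
            return self.master
        if operation != "read":
            raise ValueError("operation must be 'read' or 'write'")
        if not self.slaves:
            # No replicas left: the master serves reads temporarily.
            return self.master
        node = self.slaves[self._next_read % len(self.slaves)]
        self._next_read += 1
        return node

    def slave_failed(self, node):
        if node in self.slaves:
            self.slaves.remove(node)

    def master_failed(self):
        """Promote the first remaining slave to master."""
        if not self.slaves:
            raise RuntimeError("no slave available to promote")
        self.master = self.slaves.pop(0)


db = ReplicatedDatabase("db-master", ["db-slave-1", "db-slave-2"])
print(db.route("write"), db.route("read"), db.route("read"))
db.master_failed()
print(db.route("write"), db.route("read"))
```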

Cache

By now we’ve achieved good availability, but the next challenge is improving response time. To do this, we add a caching layer on top of the database. A cache is a high-speed storage layer that keeps frequently accessed or recently used data so it can be served much faster than fetching it directly from the database, disk, or an external API.

When a request comes in, the web server first checks the cache. If the data is found, it’s returned immediately; if not, the server fetches it from the database, stores it in the cache, and then sends it to the client. This is called a read-through cache (when the application code does the lookup and fill itself, the same pattern is often called cache-aside). Depending on the use case, different caching strategies can be applied, and we can cache things like JSON responses, JS files, or other static content to speed things up.

When to use caching?
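
The check-cache-then-database flow can be sketched in a few lines. The class name and the fake database are made up; a production cache (Redis, Memcached) would also need expiry, eviction, and invalidation when the underlying data changes.

```python
class CachedReader:
    """Checks an in-memory cache before falling back to the database."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # function: key -> value
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Cache miss: fetch from the slow source and remember the result.
        self.misses += 1
        value = self._load_from_db(key)
        self._cache[key] = value
        return value


# A dictionary standing in for the real database (illustration only).
FAKE_DB = {"user:1": {"name": "Asha"}}
reader = CachedReader(FAKE_DB.get)

reader.get("user:1")  # miss: goes to the database
reader.get("user:1")  # hit: served from the cache
print(reader.hits, reader.misses)
```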

Frequent Reads – The same data is requested repeatedly, so caching avoids recomputing or re-fetching it.
Slow Data Source – The original source (like a database, disk, or external API) is slower than memory, so caching speeds things up.
High Latency – If accessing the main source takes noticeable time, caching reduces response times.
Performance Improvement – By serving data from cache, you reduce load on the server or database.
Cost Efficiency – If data queries or API calls are expensive, caching lowers usage and costs.

Content delivery network (CDN)

A CDN is a global network of servers that cache and deliver static content like images, CSS, JS, or videos. Instead of always fetching files from the origin server, users get them from the nearest CDN server, which makes websites load much faster. For example, a user in Delhi will get content quicker from a Mumbai CDN server than from one in the US.

User Request – User A requests an image via a CDN URL (e.g., CloudFront or Akamai).
Cache Miss – If the CDN doesn’t have it, it fetches the file from the origin server/storage (e.g., Amazon S3).
Origin Response – The origin sends the file with a TTL (how long it should stay cached).
CDN Cache – The CDN stores the file and delivers it to User A.
Another Request – User B requests the same image.
Cache Hit – The CDN serves the image directly from its cache (until TTL expires).
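
The miss/hit/TTL cycle above can be simulated with a small sketch. Everything here (class name, origin function, 60-second TTL) is hypothetical; real CDNs take the TTL from HTTP caching headers sent by the origin.

```python
import time


class CdnEdge:
    """One CDN edge server with a TTL-based cache (illustrative only)."""

    def __init__(self, fetch_from_origin, clock=time.monotonic):
        self._fetch = fetch_from_origin  # returns (content, ttl_seconds)
        self._clock = clock
        self._cache = {}  # path -> (content, expires_at)

    def get(self, path):
        now = self._clock()
        entry = self._cache.get(path)
        if entry is not None and now < entry[1]:
            return entry[0], "HIT"
        # Missing or expired: go back to the origin and cache the result.
        content, ttl = self._fetch(path)
        self._cache[path] = (content, now + ttl)
        return content, "MISS"


def origin(path):
    """Pretend origin server that asks the CDN to cache files for 60 seconds."""
    return "bytes of " + path, 60


edge = CdnEdge(origin)
print(edge.get("/logo.png")[1])  # User A: not cached yet
print(edge.get("/logo.png")[1])  # User B: served from the edge cache
```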

Stateless Architecture

In a stateful architecture, the server remembers client data from one request to the next. For example, in an online banking system, the server keeps track of your session (login details, account info, and transactions) across multiple steps. In a stateless architecture, the web servers keep no client data between requests, so an HTTP request can be sent to any server; the session state lives in a shared data store that every server can reach.

By moving state data out of the web servers, we make auto-scaling much easier. Now, servers can be added or removed based on traffic load without worrying about losing session data, making the system more flexible and scalable.
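
A small sketch of the idea: two web servers that hold no session data of their own and read everything from a shared store. The class names, the cart example, and the session ID are made up; in practice the shared store would be something like Redis or a database.

```python
class SessionStore:
    """Stand-in for a shared store that every web server can reach."""

    def __init__(self):
        self._data = {}

    def load(self, session_id):
        return dict(self._data.get(session_id, {}))

    def save(self, session_id, state):
        self._data[session_id] = dict(state)


class WebServer:
    """Keeps no session data itself; everything lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle_add_to_cart(self, session_id, item):
        state = self.store.load(session_id)
        cart = state.get("cart", []) + [item]
        state["cart"] = cart
        self.store.save(session_id, state)
        return {"served_by": self.name, "cart": cart}


store = SessionStore()
server_1 = WebServer("server-1", store)
server_2 = WebServer("server-2", store)

server_1.handle_add_to_cart("session-abc", "book")
# The next request lands on a different server and still sees the cart.
print(server_2.handle_add_to_cart("session-abc", "pen"))
```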

Data Centers

As the user base grows globally, a single server location is no longer enough. To reduce latency, we add multiple data centers around the world. Suppose a website has Data Center 1 in Mumbai and Data Center 2 in New York.

A user in Delhi is routed to the Mumbai data center → faster response.
A user in San Francisco is routed to the New York data center.

When a user request comes in, DNS uses geo-routing (geoDNS) to direct it to the closest data center. Static content is served from the nearest CDN, and the remaining requests reach that data center's Load Balancer. Inside, the web servers work with caches and databases to serve the response. If one data center fails, traffic is automatically rerouted to a healthy one, and data is replicated between data centers so the healthy one can still serve those users.
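
Here is a rough sketch of "route to the closest healthy data center". It uses straight-line distance between approximate city coordinates as a stand-in; real geoDNS services typically rely on IP geolocation or measured latency, and the names below are my own.

```python
import math

# Approximate (latitude, longitude) of each data center.
DATA_CENTERS = {"mumbai": (19.08, 72.88), "new-york": (40.71, -74.01)}


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine formula)."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))


def pick_data_center(user_location, healthy):
    """Return the nearest data center that is currently healthy."""
    candidates = [name for name in DATA_CENTERS if name in healthy]
    if not candidates:
        raise RuntimeError("no healthy data center available")
    return min(candidates,
               key=lambda name: distance_km(user_location, DATA_CENTERS[name]))


DELHI = (28.61, 77.21)
SAN_FRANCISCO = (37.77, -122.42)

print(pick_data_center(DELHI, {"mumbai", "new-york"}))
print(pick_data_center(SAN_FRANCISCO, {"mumbai", "new-york"}))
print(pick_data_center(DELHI, {"new-york"}))  # Mumbai is down: fail over
```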

Message Queues

A Message Queue is a system that stores messages and lets services communicate asynchronously. It helps decouple producers and consumers, making applications more scalable and reliable, since messages can still be processed even if one side is temporarily unavailable. Producers publish messages to the queue, and consumers pick them up whenever they’re ready to process them.

A message queue is like a waiting line where tasks (messages) are stored until someone picks them up. For example, when we place an order on an e-commerce platform, the inventory update and report generation don’t happen instantly but are handled in background queues.

Logs, Metrics, Automation

Now that the website has grown, we need to invest in logging, metrics, and automation.
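
A minimal in-process sketch of the producer/consumer pattern, using Python's standard queue and threading modules. The order IDs and message shape are made up, and a real system would use a separate broker such as RabbitMQ, Kafka, or SQS so that producers and consumers can run on different machines.

```python
import queue
import threading

order_queue = queue.Queue()
processed = []


def consumer():
    """Background worker: picks up orders whenever it is ready."""
    while True:
        message = order_queue.get()
        if message is None:  # sentinel value: no more work
            order_queue.task_done()
            break
        processed.append(f"inventory updated for order {message['order_id']}")
        order_queue.task_done()


worker = threading.Thread(target=consumer)
worker.start()

# Producer: publishing is quick and does not wait for processing.
for order_id in (101, 102, 103):
    order_queue.put({"order_id": order_id})
order_queue.put(None)

order_queue.join()  # wait until every message has been handled
worker.join()
print(processed)
```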

Logging → Keeps a record of what’s happening in the system (errors, requests, events). Needed for debugging, audits, and finding issues fast.
Metrics → Numbers that show system health (CPU, memory, response time, traffic). Needed to measure performance and know when to scale or fix something.
Automation → Automatically handles deployments, scaling, monitoring, and recovery. Needed to reduce human error, speed up processes, and keep systems reliable.

Database scaling

As the data grows, the database gets overloaded, and we need ways to fix this. We can use the following approaches.

Vertical scaling

It means improving a single server’s capacity by adding more resources like CPU, RAM, or storage. Example: Upgrading memory from 8 GB to 32 GB allows the server to handle more traffic.

Horizontal scaling

It means adding more servers instead of making one server stronger. For instance, you can deploy 5 servers and use a Load Balancer to spread the traffic among them.

Sharding

It is a way of splitting a large database into smaller pieces (called shards), where each shard holds a portion of the data. Example: Instead of one database storing data for all users, you split it so users A–M are stored in Shard 1, and users N–Z are stored in Shard 2.

After implementing these steps, our architecture can gracefully handle millions of users and beyond. But system design is never truly “finished”; it’s an iterative process where we continuously refine, decouple layers, add more caching strategies, and adjust components as the system grows.

Thanks for reading! 🎉 A lot of these learnings are inspired by the amazing content from ByteByteGo. If you’re serious about system design, I highly recommend checking out their course.
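
Two simple ways to decide which shard a user belongs to are sketched below: the A–M / N–Z range split from the example, and a hash-based split that spreads users more evenly. The function and shard names are made up, and usernames are assumed to be non-empty.

```python
import hashlib


def shard_by_range(username):
    """Range-based sharding: A-M goes to shard 1, everything else to shard 2."""
    first = username[0].upper()
    return "shard-1" if "A" <= first <= "M" else "shard-2"


def shard_by_hash(user_id, shard_count):
    """Hash-based sharding: a stable hash of the key picks the shard number."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


print(shard_by_range("alice"), shard_by_range("nina"))
print(shard_by_hash("user-42", 4))
```

One thing to keep in mind with the simple modulo approach is that changing the number of shards moves most keys to a different shard, which is why techniques like consistent hashing exist.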

How GZIP Compression Made My Spring Boot App 80% Lighter

Raghav Shukla — Sun, 18 May 2025 12:52:12 GMT

Recently, while working on my Spring Boot backend, I stumbled on one of those optimizations that gives huge results for minimal effort: GZIP compression.

No complex setups. No code changes. Just a few config tweaks and boom—API payloads dropped by 80% in size.
Let me walk you through what GZIP is, how I enabled it, and the massive impact it had on my app’s performance.

🤔 What Even Is GZIP?

Think of GZIP like a vacuum pack for your data—shrinks it down, zips it across the wire, and the browser puffs it back up instantly.

GZIP is a lossless compression algorithm that reduces the size of your HTTP responses before they're sent to the client.

It's like zipping a file before sending it. The browser automatically unzips it, so the user receives the same data, just more quickly.

Why GZIP Rocks:

📉 Smaller payloads = faster APIs
📱 Better experience on slow networks
📈 Boosts SEO (Google loves fast apps)
💰 Reduces bandwidth costs

⚙️ Setting Up GZIP in Spring Boot

You don’t need a library or dependency—just flip a few switches in application.properties:

# Enable compression
server.compression.enabled=true

# Only compress responses larger than 1KB
server.compression.min-response-size=1024

# Compress these content types
server.compression.mime-types=application/json,application/xml,text/html,text/xml,text/plain

Here’s what each line does:

enabled=true: Turns on compression
mime-types: Targets responses like JSON, HTML, text
min-response-size=1024: Compress only if it’s bigger than 1KB

That’s it. Restart your app, and you’re good to go.

🔍 Testing It Out

To confirm it’s working:

Postman: Look for Content-Encoding: gzip in the response headers.
Chrome DevTools → Network tab: Same thing, check the response headers.
Or use curl:

curl -H "Accept-Encoding: gzip" -I http://localhost:8080/api/your-endpoint

📊 Before vs After: The Real Impact

I tested a few of my endpoints returning a big JSON array. Here’s what changed:

[Screenshots: response size before GZIP and after GZIP]

GZIP compression cuts the time spent sending data over the network because there are fewer bytes to transfer, which improves load times, especially on slow connections. Because GZIP is lossless, the client gets back exactly the original data after decompression, so the faster transfer comes at no cost to data integrity.

💡 A Few Tips
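
If you want to see the effect outside of Spring Boot, here is a quick sketch using Python's built-in gzip module on a made-up JSON array. The payload is invented and very repetitive, so the exact saving will differ from what a real API returns.

```python
import gzip
import json

# A made-up, repetitive JSON payload similar to a large API response.
records = [{"id": i, "status": "active", "city": "Mumbai"} for i in range(1000)]
raw = json.dumps(records).encode("utf-8")

compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

saving = 1 - len(compressed) / len(raw)
print(f"original: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"saved: {saving:.0%}")
print("lossless:", restored == raw)
```

The same thing happens between Spring Boot and the browser: the server compresses, the client decompresses, and the JSON that arrives is byte-for-byte identical.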

🔁 Pair with caching (GZIP + Redis = ultra-fast)
📶 Test on slow networks (e.g., Chrome’s 3G throttle)
🖥️ Watch CPU usage—compression takes some cycles

🎯 Wrapping It Up

GZIP gave my Spring Boot app a major speed boost for practically no effort. It’s one of those tiny tweaks that should be in every dev’s toolbox.
Got an API? Turn on GZIP. You’ll thank yourself later. 😄

🗣️ Got any other backend speed hacks? I’d love to hear them—drop them in the comments or hit me up!

Reddit Backend Project Using Microservice Architecture

Raghav Shukla — Thu, 17 Oct 2024 18:30:00 GMT

What are Microservices

Microservices represent a software architecture approach where a complex application is built as a collection of small, independent services, each running in its own process and communicating with the others through well-defined APIs. The key principle is to decouple various functionalities into modular services, promoting flexibility, scalability, and ease of maintenance. Each microservice is responsible for a specific business capability and can be developed, deployed, and scaled independently.

Advantages

Enhanced modularity: Microservices break down applications into smaller, independent services.
Scalability: Each microservice can be scaled independently to meet specific demands.
Parallel development: Developers can work on individual components without affecting the entire system.
Flexibility: Easy integration of new features and updates without disrupting the entire application.
Resource optimization: Efficient utilization of resources by scaling specific microservices as needed.

Overview

Project Architecture

The project consists of 6 microservices. Each service is independent and loosely coupled so it doesn't affect other services.

Java version: 17, Dependency management: Maven, Spring Boot version: 3.2+

Spring Cloud Config

Spring Cloud Config is a tool in the Spring Cloud ecosystem that centralizes and manages application configuration settings. Instead of hardcoding configuration in every service, we can fetch it from a central repository, which makes the config easier to manage and update.
It also supports different configurations for different environments by keeping a separate YML file for each environment, and it allows dynamic updates without redeploying the services.

I used the following dependency to bring in Spring Cloud Config:

<dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-dependencies</artifactId>
        <version>${spring-cloud.version}</version>
</dependency>

Discovery Service

This project uses Eureka as the discovery service. It helps microservices find and communicate with each other in a distributed system: each service registers itself with Eureka, and other services look up its location when they need to interact with it.

Service discovery: Services locate and connect to each other automatically, without hardcoded addresses, which makes the system more flexible and scalable.

Load balancing: Incoming requests are distributed among multiple instances of a service to improve performance and reliability.
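
The register-then-look-up idea can be sketched with a tiny in-memory registry. This is not Eureka's API; the class, the service name, and the addresses are made up purely to show the concept, and real registries also track heartbeats to drop dead instances.

```python
class ServiceRegistry:
    """Tiny in-memory registry with round-robin lookup (not Eureka's API)."""

    def __init__(self):
        self._instances = {}  # service name -> list of addresses
        self._counters = {}   # service name -> next index to hand out

    def register(self, name, address):
        instances = self._instances.setdefault(name, [])
        if address not in instances:
            instances.append(address)

    def deregister(self, name, address):
        instances = self._instances.get(name, [])
        if address in instances:
            instances.remove(address)

    def lookup(self, name):
        """Return one instance of the service, rotating between instances."""
        instances = self._instances.get(name, [])
        if not instances:
            raise LookupError(f"no instances registered for {name}")
        index = self._counters.get(name, 0) % len(instances)
        self._counters[name] = index + 1
        return instances[index]


registry = ServiceRegistry()
registry.register("post-service", "10.0.0.5:8081")
registry.register("post-service", "10.0.0.6:8081")

# A caller asks for the service by name instead of hardcoding an address.
print(registry.lookup("post-service"))
print(registry.lookup("post-service"))
```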

API Gateway Service

An API gateway acts as a central hub for managing and securing interactions between clients and backend services. It streamlines tasks like authentication, authorization, and routing, enhancing security, performance, and scalability. By providing a unified entry point, it simplifies client access and enables monitoring and analytics for better insights and decision-making. In summary, API gateways play a vital role in modern architecture by optimizing communication and abstracting complexities.

User Service

Manages user-related operations such as user creation, retrieval, update, deletion, status change, searching, filtering, and association with subreddits, posts, and comments.

Subreddit Service

Manages subreddits, offering functionalities for creating, retrieving, updating, and deleting subreddits. Additionally, it handles user membership management within subreddits, association of posts, and interaction with comments.

Post Service

Comprising controllers for comments, posts, and votes, this service manages the core functionalities of the Reddit clone project. It facilitates the creation, retrieval, updating, and deletion of comments and posts, as well as the handling of voting interactions, ensuring a robust and engaging user experience.

Check out my project on GitHub

https://github.com/Raghav-byte/Reddit-Clone-Backend