3 Things That Matter When Building A RAG Pipeline
Issue #2 - Build AI With Me
Model selection is only part of the equation (covered in the last issue). The real magic of a RAG system happens in the details: how you chunk your data, format your prompts, structure your retrieval, and evaluate your results. Here’s what I learned about some of the tweaks you can make.
Data Chunking Strategy
Your retriever can only find what you’ve chunked. If your chunks are too large, you’ll retrieve irrelevant information. Too small, and you’ll lose critical context.
Strategy 1: Fixed-Size Chunking
# Simple fixed-size chunks (character-based; a token-based sketch follows the list below)
chunk_size = 512  # characters per chunk
overlap = 0
chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

When this works:
Highly structured data (databases, tables)
When every section is independent
Technical documentation with clear sections
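The snippet above slices characters for simplicity. If you want chunks measured in actual tokens, here’s a minimal sketch using tiktoken (my tokenizer choice is an assumption; any tokenizer with encode/decode works):
# Token-based fixed-size chunking (sketch)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(text)  # text is the document you're chunking
chunk_size = 512  # tokens

# Slice the token list, then decode each slice back into text
chunks = [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]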
Strategy 2: Fixed-Size with Overlap
chunk_size = 512
overlap = 50  # characters of overlap between consecutive chunks
chunks = []
for i in range(0, len(text), chunk_size - overlap):
    chunk = text[i:i+chunk_size]
    chunks.append(chunk)

Trade-offs:
✅ Better context preservation
✅ More forgiving if retrieval misses optimal chunk
❌ Increased storage (chunks overlap)
❌ Potential redundancy in retrieved context
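To put a number on the storage trade-off above: with a stride of chunk_size - overlap, each piece of text is stored roughly chunk_size / (chunk_size - overlap) times. A quick back-of-the-envelope check:
# Rough storage overhead introduced by overlapping chunks
chunk_size = 512
overlap = 50
overhead = chunk_size / (chunk_size - overlap)
print(f"Overlapping chunks store ~{(overhead - 1) * 100:.0f}% extra text")  # ~11% here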
Strategy 3: Semantic Chunking
# Split by semantic boundaries (paragraphs, sections, topics)
# Using sentence transformers to find natural breakpoints
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works here

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def semantic_chunking(text, similarity_threshold=0.7):
    sentences = text.split('. ')
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
        if similarity > similarity_threshold:
            current_chunk.append(sentences[i])
        else:
            chunks.append('. '.join(current_chunk))
            current_chunk = [sentences[i]]
    chunks.append('. '.join(current_chunk))  # don't drop the final chunk
    return chunks

Best context preservation, but more complex to implement and tune.
When this works:
Long-form content (articles, reports, books)
Narrative or flowing text
When topic shifts matter
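A quick usage sketch; the 0.7 threshold is an assumption to tune against your own corpus (higher values break chunks more often, producing more, smaller chunks):
# Example usage of semantic_chunking
article_text = "..."  # your long-form document goes here
chunks = semantic_chunking(article_text, similarity_threshold=0.7)
print(f"Produced {len(chunks)} chunks")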
Recommendations
The chunk size should match your typical query complexity.
If users ask simple factual questions → smaller chunks (256-384 tokens)
If users ask complex analytical questions → larger chunks (512-1024 tokens)
I discovered this by analyzing my queries: most review analysis questions needed 2-3 data points to answer properly, so 512-token chunks with overlap worked perfectly.
Data Format Alignment
Think of it this way: LLMs are pattern-matching machines trained on internet text. When they see familiar patterns, they “know” what comes next and how to interpret the information.
If your data format resembles what the model saw during training, you’re essentially speaking its native language. If it doesn’t, you’re making the model work harder to understand what you’re asking for. The closer your chunked data matches these formats, the better the model performs.
For example, look at the two ways the same customer review data is formatted below.
# Poor formatting
chunk = "Great product. Fast shipping. Would buy again."

# Better formatting (mimics review structure)
chunk = """
Review: 5 stars
Product: XYZ Widget
Pros: Great quality, fast shipping, good value
Summary: Customer highly recommends and would purchase again.
"""

Why this works: The structured format matches how product reviews appear in the model’s training data. The model recognizes the pattern and extracts information more accurately.
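In practice I do this kind of reshaping at ingestion time. A minimal sketch, assuming your raw reviews arrive as dicts with rating/product/pros/summary fields (the field names are mine; map them to your own schema):
# Sketch: turn a raw review record into a training-data-like chunk
# Field names below are assumptions, not a fixed schema
def format_review(review: dict) -> str:
    return (
        f"Review: {review['rating']} stars\n"
        f"Product: {review['product']}\n"
        f"Pros: {', '.join(review['pros'])}\n"
        f"Summary: {review['summary']}\n"
    )

chunk = format_review({
    "rating": 5,
    "product": "XYZ Widget",
    "pros": ["great quality", "fast shipping", "good value"],
    "summary": "Customer highly recommends and would purchase again.",
})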
Prompt Engineering
Every model is tuned differently, and what works for GPT-4 might fail for Llama. Here are some classic principles of prompt engineering:
Be explicit about constraints
“Use only the provided context”
“Do not make assumptions.”
“If information is missing, state it.”
Define the role clearly
“You are a technical support agent.”
“You are analyzing financial reports.”
“You are a medical information assistant.”
Specify output format
“Provide your answer in this format:
- Summary: [one sentence]
- Details: [2-3 bullet points]
- Confidence: [High/Medium/Low]”
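Putting the three principles together, a RAG prompt template might look like this (a sketch; the wording and variable names are mine, not a fixed recipe):
# Sketch: one prompt template combining constraints, role, and output format
PROMPT_TEMPLATE = """You are a technical support agent.

Use only the provided context. Do not make assumptions.
If information is missing, state it.

Context:
{context}

Question:
{query}

Provide your answer in this format:
- Summary: [one sentence]
- Details: [2-3 bullet points]
- Confidence: [High/Medium/Low]
"""

# retrieved_chunks and user_query stand in for your retrieval output and the user's question
retrieved_chunks = ["Review: 5 stars ...", "Review: 2 stars ..."]
user_query = "What do customers complain about most?"
prompt = PROMPT_TEMPLATE.format(context="\n\n".join(retrieved_chunks), query=user_query)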
Also, different models respond to different prompt styles.
Llama models: Like structured, explicit instructions
“Based on the following context, provide a detailed answer...”
Flan-T5: Prefers concise, imperative prompts
“Answer: {query}\nContext: {context}”
GPT models: Handle conversational, nuanced prompts well
“Given the context below, I need you to...”
Final Thoughts
Trial and Error Beats All Other Learning
No guide can tell you exactly what will work for your specific:
Domain (medical vs. customer reviews vs. legal documents)
Data format (structured vs. unstructured)
Query types (factual vs. analytical vs. comparative)
Hardware constraints
Let me know if this helps, and I’m curious to hear what tweaks you make that help you get better results!


