The Problem: Why Retail Needs Intelligent Product Search

Modern retail generates massive amounts of data: product catalogs with millions of SKUs, customer reviews, sales transactions, inventory movements, and pricing information. When a business user asks "What products are trending among millennials in the electronics category?", traditional search systems fall short:

  • Keyword Limitations — Traditional search relies on exact matches. A query for "trending electronics" misses products described as "popular gadgets" or "best-selling tech"
  • Context Blindness — Conventional systems cannot understand nuanced queries like "products similar to what sold well last quarter"
  • Data Silos — Sales data, customer feedback, and inventory information exist in separate systems, making unified insights impossible
  • Scale Challenges — Searching through millions of product vectors in real-time requires specialized indexing that most databases cannot provide

Retailers need a system that understands natural language, retrieves semantically relevant information from multiple data sources, and generates coherent, contextual responses. This is where the Smart Retail Navigator comes in.

The Solution: RAG Architecture with Annoy Indexing

The Smart Retail Navigator combines three technologies that each solve a piece of the puzzle. Together, they create a system that understands queries like humans do, retrieves relevant data at scale, and generates actionable insights.

Smart Retail Navigator Architecture

  1. Query Understanding — LLM parses natural language intent and context
  2. Vector Embedding — Query converted to a dense vector representation
  3. Annoy Search — Find nearest neighbors in the product vector space
  4. Context Retrieval — Fetch relevant documents and metadata
  5. Response Generation — LLM synthesizes a contextual answer

Why These Three Technologies?

Each component addresses a specific challenge:

  • RAG (Retrieval-Augmented Generation) — Grounds LLM responses in actual data, preventing hallucination and ensuring accuracy
  • LLM (Large Language Models) — Provides natural language understanding and human-like response generation
  • Annoy (Approximate Nearest Neighbors) — Enables millisecond-scale similarity search across millions of vectors

How It Works: Vector Similarity and Retrieval-Augmented Generation

Vector Embeddings: The Foundation

Every piece of retail data — product descriptions, customer reviews, sales summaries — gets converted into a dense vector representation. These vectors capture semantic meaning, so "wireless headphones" and "Bluetooth earbuds" end up close together in vector space.

Vector Embedding Process
# Convert text to a vector representation
from sentence_transformers import SentenceTransformer

# Any 768-dimensional sentence-embedding model works; 'all-mpnet-base-v2' is one example
model = SentenceTransformer('all-mpnet-base-v2')

def embed_product(product_description):
    # Use the pre-trained model to generate an embedding
    embedding = model.encode(product_description)
    return embedding  # 768-dimensional vector

# Example: Similar products cluster together
headphones_vec = embed_product("Wireless noise-canceling headphones")
earbuds_vec = embed_product("Bluetooth earbuds with ANC")
# cosine_similarity(headphones_vec, earbuds_vec) ≈ 0.87

Annoy: Fast Approximate Nearest Neighbor Search

Annoy (Approximate Nearest Neighbors Oh Yeah) uses random projection trees to partition the vector space. Instead of comparing a query against every product vector (O(n) complexity), it traverses trees to find approximate nearest neighbors in O(log n) time.

Annoy Index Construction
# Build Annoy index from product embeddings
from annoy import AnnoyIndex

dimension = 768  # Embedding dimension
index = AnnoyIndex(dimension, 'angular')  # angular distance, equivalent to ranking by cosine similarity

# Add all product vectors
for product_id, embedding in product_embeddings.items():
    index.add_item(product_id, embedding)

# Build index with 10 trees (more trees = better accuracy)
index.build(n_trees=10)

# Search: Find 10 most similar products in ~1ms
similar_ids = index.get_nns_by_vector(query_vector, n=10)

RAG: Grounding Generation in Reality

Retrieval-Augmented Generation solves the hallucination problem. Instead of relying solely on the LLM's training data, we retrieve relevant documents and include them in the prompt. The LLM generates responses based on actual, current information.

RAG Pipeline
# RAG: Retrieve relevant context, then generate
def answer_retail_query(user_query):
    # Step 1: Embed the query
    query_embedding = embed_query(user_query)

    # Step 2: Find relevant documents via Annoy
    relevant_ids = annoy_index.get_nns_by_vector(query_embedding, n=5)
    context_docs = fetch_documents(relevant_ids)

    # Step 3: Build augmented prompt
    augmented_prompt = f"""
    Based on the following retail data:
    {context_docs}

    Answer this question: {user_query}
    """

    # Step 4: Generate response with LLM
    response = llm.generate(augmented_prompt)
    return response

Dual LLM Strategy

The system employs two specialized models for optimal performance:

  • eCeLLM (e-commerce domain expertise) — complex product queries, category analysis, trend detection
  • DistilGPT-2 (real-time processing) — quick responses, simple queries, high-throughput scenarios

eCeLLM, trained specifically on e-commerce data, excels at understanding retail terminology and product relationships. DistilGPT-2, a distilled version of GPT-2, provides faster inference for time-sensitive queries.
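A routing rule along these lines can direct each query to the right model. This is a hypothetical sketch — the function name, keyword markers, and latency threshold are illustrative assumptions, not the project's actual logic:

```python
def select_model(query: str, latency_budget_ms: float) -> str:
    """Route a query to a model: analytical queries (or generous latency
    budgets) go to the domain model; everything else goes to the fast
    distilled model. Toy heuristic for illustration only."""
    analytical_markers = ('trend', 'compare', 'why', 'analyze', 'summarize')
    is_analytical = any(marker in query.lower() for marker in analytical_markers)
    if is_analytical or latency_budget_ms >= 500:
        return 'eCeLLM'       # domain expertise, slower inference
    return 'DistilGPT-2'      # distilled model, fast inference

print(select_model("What categories are trending with Gen-Z?", 200))  # eCeLLM
print(select_model("price of SKU 4812", 100))                         # DistilGPT-2
```

In production the router could also consider query length, time of day, or current GPU load.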

The Mathematics: Cosine Similarity and Random Projections

Cosine Similarity for Semantic Matching

The system uses cosine similarity to measure how semantically related two pieces of text are. This metric is ideal for comparing embeddings because it focuses on direction (meaning) rather than magnitude.

Similarity Computation
// Cosine similarity between query (q) and document (d) vectors
cos(θ) = (q · d) / (||q|| × ||d||)

// Range: [-1, 1], where 1 = identical meaning
// In practice, retail vectors typically range [0.3, 0.95]

// Example similarity scores:
query: "wireless earbuds"
"Bluetooth headphones"     → 0.89
"USB charging cable"       → 0.31
"Running shoes"            → 0.12
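The formula above is easy to verify in plain Python. A minimal sketch, with toy 3-dimensional vectors standing in for real 768-dimensional embeddings (the vector values are made up for illustration):

```python
import math

def cosine_similarity(q, d):
    # cos(theta) = (q . d) / (||q|| * ||d||)
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_q * norm_d)

query = [0.9, 0.1, 0.0]      # stands in for "wireless earbuds"
similar = [0.8, 0.2, 0.1]    # stands in for "Bluetooth headphones"
unrelated = [0.0, 0.1, 0.9]  # stands in for "running shoes"

print(round(cosine_similarity(query, similar), 2))    # high
print(round(cosine_similarity(query, unrelated), 2))  # low
```

Because the metric ignores magnitude, a long product description and a short one with the same meaning still score as similar.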

Random Projection Trees in Annoy

Annoy builds a forest of random projection trees. Each tree recursively splits the vector space using random hyperplanes until each leaf contains a small number of items.

Annoy Tree Structure
// Building a random projection tree
1. Select random hyperplane through data points
2. Split points into "left" and "right" based on which side they fall
3. Recursively split until leaf size < threshold

// Search traversal
1. For each tree, descend to leaf containing query point
2. Collect candidate neighbors from all trees
3. Compute exact distances for candidates
4. Return top-k closest

// Complexity
Build: O(n × t × log n)  where t = number of trees
Search: O(t × log n)     logarithmic in n, effectively flat at retail scale

System Architecture

  • Data Layer (Mock Data Generators) — Sales, customer feedback, inventory simulation
  • Embedding Layer (Sentence Transformers) — Convert text to 768-dimensional vectors
  • Index Layer (Annoy) — Fast approximate nearest neighbor search
  • Intelligence Layer (eCeLLM, DistilGPT-2) — Query understanding and response generation
  • Orchestration (Jupyter Notebook) — Pipeline coordination and experimentation

Data Flow

The system processes three primary data streams that feed into the unified search index:

  • Sales Data — Transaction records, revenue metrics, seasonal patterns
  • Customer Feedback — Reviews, ratings, sentiment signals
  • Inventory Information — Stock levels, supplier data, availability status
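Before indexing, the three streams have to be joined per product into a single text document that can be embedded. A minimal sketch — the record fields and formatting here are illustrative assumptions, not the project's actual schema:

```python
def merge_streams(sales, feedback, inventory):
    """Join per-product records from the three streams into one indexable text."""
    docs = {}
    for s in sales:
        docs.setdefault(s['product_id'], []).append(
            f"Sold {s['units']} units for ${s['revenue']:.2f}")
    for f in feedback:
        docs.setdefault(f['product_id'], []).append(
            f"Review ({f['rating']}/5): {f['text']}")
    for i in inventory:
        docs.setdefault(i['product_id'], []).append(
            f"Stock: {i['stock']} ({i['status']})")
    return {pid: ' | '.join(parts) for pid, parts in docs.items()}

docs = merge_streams(
    sales=[{'product_id': 1, 'units': 120, 'revenue': 5400.0}],
    feedback=[{'product_id': 1, 'rating': 4, 'text': 'Great sound quality'}],
    inventory=[{'product_id': 1, 'stock': 35, 'status': 'in stock'}],
)
print(docs[1])
```

Each merged document is then embedded and added to the Annoy index, so a single similarity search surfaces sales, sentiment, and stock signals together.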

Retail Intelligence Use Cases

Product Discovery

"Find products similar to our top sellers from last quarter" — semantic search across catalog

Trend Analysis

"What categories are gaining momentum with Gen-Z customers?" — cross-reference sales and demographics

Inventory Insights

"Which products need restocking based on current sales velocity?" — predictive inventory queries

Customer Sentiment

"Summarize recent feedback for our electronics category" — aggregate review analysis

Performance Characteristics

The combination of Annoy indexing and dual-LLM architecture delivers both speed and accuracy:

  • Vector search latency: under 10 ms
  • Embedding dimensions: 768
  • Search complexity: O(log n)

Why Annoy Over Alternatives?

  • Memory Efficiency — Index can be memory-mapped, allowing indexes larger than RAM
  • Static Index — Once built, index is immutable and thread-safe for concurrent reads
  • Simple API — Minimal setup compared to distributed solutions like Milvus or Pinecone
  • Proven Scale — Used in production at Spotify for music recommendations

Implementation Highlights

Mock Data Generation

The project includes data generators that simulate realistic retail scenarios, enabling experimentation without requiring production data access.

Data Generation Pipeline
# Generate synthetic retail data
# (the generate_* helpers and CATEGORIES are assumed defined earlier in the notebook)
import random

def generate_retail_dataset(n_products=10000):
    products = []
    for i in range(n_products):
        product = {
            'id': i,
            'name': generate_product_name(),
            'description': generate_description(),
            'category': random.choice(CATEGORIES),
            'price': generate_price(),
            'reviews': generate_reviews(n=random.randint(5, 50))
        }
        products.append(product)
    return products

# Index all products
dataset = generate_retail_dataset()
for product in dataset:
    embedding = embed_product(product['description'])
    annoy_index.add_item(product['id'], embedding)

Query Processing Pipeline

The orchestration layer coordinates the flow from user query to final response:

Complete Query Pipeline
# End-to-end query processing
class RetailNavigator:
    def __init__(self):
        self.annoy_index = load_annoy_index()
        self.ecellm = load_ecellm()
        self.distilgpt = load_distilgpt()

    def process_query(self, query, mode='accurate'):
        # Select LLM based on mode
        llm = self.ecellm if mode == 'accurate' else self.distilgpt

        # Embed and search
        query_vec = embed_query(query)
        relevant_ids = self.annoy_index.get_nns_by_vector(query_vec, 10)
        context = self.fetch_context(relevant_ids)

        # Generate response with RAG
        prompt = self.build_rag_prompt(query, context)
        response = llm.generate(prompt)

        return response

Explore the Code

The complete implementation is available on GitHub as a Jupyter notebook with documentation and examples for building your own retail intelligence system.
