The Problem: Why Traditional Data Catalogs Fail

Traditional data catalogs rely on string matching. They use frameworks like Lucene or Elasticsearch to index metadata and return results based on keyword overlap. This approach has a fundamental flaw: it does not understand meaning.

  • Vocabulary Mismatch — Searching for "customer" misses assets labeled "client" or "patron" even though they represent the same concept
  • Context Blindness — String matching cannot distinguish between "bank" (financial institution) and "bank" (river edge)
  • Relationship Ignorance — Traditional search does not understand that "revenue" is related to "sales" and "income"
  • Data Mesh Complexity — As organizations adopt Data Mesh architectures, data products spread across domains, making keyword-based discovery increasingly inadequate

Consider a data analyst searching for "pizza toppings data" in an enterprise catalog. A traditional system would only return results containing those exact terms. But what about datasets labeled "ingredient catalog" or "menu items database"? These are semantically relevant but lexically invisible.

The Solution: Semantic Search Through Ontologies

The Semantic Data Catalog solves this by combining traditional data catalogs with semantic search capabilities. The formula is straightforward:

Core Concept
Semantic Data Catalog = Data Catalog + Semantic Search

Where Semantic Search =
    Ontology-based concept modeling +
    Vector embeddings for meaning representation +
    Similarity search for contextual retrieval

Instead of matching strings, we match meanings. Each data asset is described using an ontology — a formal representation of concepts and their relationships. These ontologies are then converted into numerical vectors that capture semantic meaning. When users search, their queries are also converted to vectors, and we find assets with similar meaning rather than similar spelling.

Semantic Data Catalog Pipeline

  1. Ontology Catalog: store structured ontologies for each data asset
  2. Embedding Generation: convert ontologies to vectors via OWL2Vec*
  3. Vector Indexing: load embeddings into FAISS for fast retrieval
  4. Query Processing: convert queries to vectors and find similar concepts
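The four stages can be sketched end-to-end with toy stand-ins: a bag-of-terms `embed` function in place of the ontology embedding model, and brute-force cosine ranking in place of FAISS. All names and data below are invented for illustration.

```python
import math

# Stage 1: a toy "ontology catalog" -- each asset described by concept terms.
CATALOG = {
    "ingredient_catalog": ["topping", "ingredient", "vegetable", "cheese"],
    "sales_ledger": ["revenue", "income", "order", "payment"],
}

# Stage 2: stand-in embedding -- a bag-of-terms vector over a fixed vocabulary.
# (In the real pipeline an ontology embedding model produces dense vectors.)
VOCAB = sorted({t for terms in CATALOG.values() for t in terms} | {"pizza"})

def embed(terms):
    return [terms.count(v) for v in VOCAB]

# Stage 3: "index" the embeddings (here just a dict; FAISS in production).
INDEX = {name: embed(terms) for name, terms in CATALOG.items()}

# Stage 4: embed the query and rank assets by cosine similarity.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_terms, k=1):
    q = embed(query_terms)
    ranked = sorted(INDEX, key=lambda name: cosine(q, INDEX[name]), reverse=True)
    return ranked[:k]

print(search(["topping", "cheese"]))  # ['ingredient_catalog']
```

The sketch preserves the architecture's shape (catalog, embedding, index, query) while each component stays small enough to read in one sitting.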

How It Works: The Technical Deep Dive

Understanding Ontologies and Knowledge Graphs

An ontology is a formal specification of a conceptualization. In practical terms, it is a structured way to describe what things exist in a domain and how they relate to each other. We represent ontologies as 5-tuples:

Ontology Structure
Ontology O = (C, R, F, I, A)

Where:
    C = Concepts      (e.g., Pizza, Topping, Customer)
    R = Relationships (e.g., hasTopping, orderedBy)
    F = Functions     (e.g., calculatePrice, validateOrder)
    I = Instances     (e.g., Margherita, Pepperoni)
    A = Axioms        (e.g., VegetarianPizza ⊑ Pizza ⊓ ∀hasTopping.VegetableTopping)

This structure enables the system to understand that a Margherita pizza is a type of pizza, which has toppings, and those toppings are vegetables. When someone searches for "vegetarian options," the system can return Margherita even if the word "vegetarian" never appears in its metadata.
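As a minimal sketch, the 5-tuple can be mirrored directly in code. The container and sample values below are illustrative only, not loaded from any real OWL file.

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    """Minimal O = (C, R, F, I, A) container -- illustrative only."""
    concepts: set = field(default_factory=set)       # C
    relationships: set = field(default_factory=set)  # R
    functions: set = field(default_factory=set)      # F
    instances: dict = field(default_factory=dict)    # I: instance -> concept
    axioms: list = field(default_factory=list)       # A, as plain strings here

pizza_onto = Ontology(
    concepts={"Pizza", "Topping", "VegetarianPizza"},
    relationships={"hasTopping"},
    functions={"calculatePrice"},
    instances={"Margherita": "Pizza", "Pepperoni": "Pizza"},
    axioms=["VegetarianPizza SubClassOf Pizza and hasTopping only VegetableTopping"],
)

print(pizza_onto.instances["Margherita"])  # Pizza
```

In practice the axioms would be expressed in OWL and maintained in a tool like Protégé rather than as strings.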

OWL2Vec*: Converting Ontologies to Vectors

OWL2Vec*, developed at the University of Oxford, is an embedding model designed specifically for OWL ontologies. It converts the rich semantic structure of an ontology, including its graph structure, lexical labels, and logical axioms, into high-dimensional numerical vectors.

Vector Representation
# Words with similar meanings get similar vectors
dog   = [1.6, -0.3, 7.2, 19.6, 3.1, ..., 20.6]
puppy = [1.5, -0.4, 7.2, 19.5, 3.2, ..., 20.8]

# The distance between these vectors is small
# indicating semantic similarity

# Unrelated concepts have distant vectors
car   = [8.2, 14.1, -3.4, 2.1, 9.7, ..., -5.3]
# Large distance from dog/puppy vectors

The embedding process captures not just individual concepts but also the relationships between them. If "topping" has a "partOf" relationship with "pizza" in the ontology, this relationship is encoded in the vector space — concepts connected in the ontology will be positioned near each other in vector space.

FAISS: Efficient Vector Search at Scale

Once we have vector representations, we need to search them efficiently. With millions of data assets, comparing every query against every vector is computationally prohibitive. FAISS (Facebook AI Similarity Search) solves this through Approximate Nearest Neighbors (ANN) algorithms.

FAISS Search Implementation
# Initialize a FAISS index with IVF clustering
import faiss
import numpy as np

dimension = 300  # embedding dimension (must match the embedding model's output)
nlist = 100      # number of clusters

quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Train on existing ontology embeddings:
# ontology_vectors is a float32 numpy array of shape (n_assets, dimension)
index.train(ontology_vectors)
index.add(ontology_vectors)

# Search for similar concepts; FAISS expects a 2-D float32 query batch
query_vector = np.asarray([embed_query("pizza toppings")], dtype="float32")
distances, indices = index.search(query_vector, 10)

# indices[0] holds the ids of the top 10 semantically similar data assets

FAISS uses inverted file indexing with clustering. It first groups similar vectors into clusters, then searches only the most promising clusters (controlled by the nprobe parameter) at query time. This makes search cost sublinear in practice, scanning a small fraction of the vectors instead of all n, at a modest cost in recall.
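The clustering idea can be illustrated without FAISS itself: partition vectors by nearest centroid, then scan only the nprobe closest clusters at query time. This is a toy inverted-file index for intuition, not FAISS's actual implementation.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to its nearest centroid's inverted list."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))[:nprobe]
    candidates = [vid for i in probe for vid in lists[i]]
    return sorted(candidates, key=lambda vid: dist(query, vectors[vid]))[:k]

# Two well-separated clusters with fixed centroids (k-means omitted for brevity).
vectors = [(0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.3, 4.8)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
lists = build_ivf(vectors, centroids)

print(ivf_search((5.1, 5.0), vectors, centroids, lists, k=1))  # [2]
```

With nprobe=1 only half the vectors are ever compared against the query; raising nprobe trades speed for recall, exactly the knob FAISS exposes on IVF indexes.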

Semantic Reasoners: Inferring New Knowledge

The system uses the HermiT reasoner (with alternatives like ELK, Ontop, or Pellet) to infer logical consequences from ontological axioms. This enables automated classification and validation of concept models.

Reasoning Example
# Given axioms in the ontology (Pizza-ontology style):
# 1. VegetarianPizza ≡ Pizza ⊓ ∀hasTopping.(VegetableTopping ⊔ CheeseTopping)
# 2. Margherita hasTopping Tomato
# 3. Margherita hasTopping Mozzarella
# 4. Margherita has no toppings other than these (a closure axiom)
# 5. Tomato ⊑ VegetableTopping
# 6. Mozzarella ⊑ CheeseTopping

# The reasoner can infer:
# Margherita is a VegetarianPizza

# This inference happens automatically, enriching search results
# without manual classification
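A reasoner like HermiT implements full OWL DL semantics; the flavor of the inference can be sketched with a toy classifier that walks subclass axioms via transitive closure. This is a tiny fraction of what a real DL reasoner does, for illustration only.

```python
# Toy classifier over hand-written subclass axioms -- not a real DL reasoner.
SUBCLASS = {                     # child -> parents
    "Tomato": {"VegetableTopping"},
    "Mozzarella": {"CheeseTopping"},
}
# VegetarianPizza admits only vegetable or cheese toppings (Pizza-ontology style).
ALLOWED_VEGETARIAN = {"VegetableTopping", "CheeseTopping"}

def ancestors(cls):
    """All superclasses of cls, via transitive closure of SUBCLASS."""
    seen, todo = set(), [cls]
    while todo:
        c = todo.pop()
        for parent in SUBCLASS.get(c, ()):
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen

def is_vegetarian(toppings):
    """True if every topping falls under an allowed category."""
    return all(({t} | ancestors(t)) & ALLOWED_VEGETARIAN for t in toppings)

# The topping list is assumed complete (the closure axiom from the example):
print(is_vegetarian(["Tomato", "Mozzarella"]))  # True
print(is_vegetarian(["Tomato", "Pepperoni"]))   # False
```

Note how the closure assumption is essential: under open-world semantics, a universal restriction like ∀hasTopping cannot be satisfied without knowing the topping list is complete.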

Similarity Metrics: Measuring Semantic Distance

When comparing vectors, we use distance metrics to quantify similarity. The two primary options are cosine similarity and Euclidean distance:

Distance Metrics
# Cosine Similarity
# Measures angle between vectors (ignores magnitude)
cos(θ) = (A · B) / (||A|| × ||B||)

# Range: -1 to 1 (1 = identical direction, 0 = orthogonal)

# Euclidean Distance
# Measures straight-line distance in vector space
d(A, B) = √(Σ(Ai - Bi)²)

# Smaller distance = more similar

# Example:
query = "margherita and onion"
query_vector = embed(query)

# Find assets where:
# cos(query_vector, asset_vector) → 1 (high similarity)
# OR
# euclidean(query_vector, asset_vector) → 0 (low distance)

Cosine similarity is typically preferred for text embeddings because it focuses on direction rather than magnitude, making it robust to documents of different lengths.
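Both metrics need only the standard library; the vectors below are made up to show the key difference, that cosine ignores magnitude while Euclidean distance does not.

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    """Straight-line distance between two points in vector space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 2.0, 3.0]
near = [2.0, 4.0, 6.0]    # same direction as query, double the magnitude
far = [3.0, -1.0, 0.5]    # different direction

# Cosine ignores magnitude: query and near point the same way.
print(round(cosine_similarity(query, near), 3))   # 1.0
print(round(cosine_similarity(query, far), 3))    # much lower

# Euclidean does not: the doubled vector is "far" by straight-line distance.
print(round(euclidean_distance(query, near), 3))  # 3.742
```

This is why a long document and a short query about the same topic can still score near 1.0 under cosine similarity.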

System Architecture

Component             Technology             Purpose
Ontology Management   OWL, Protégé           Define and maintain semantic models
Embedding Engine      OWL2Vec*, Python       Convert ontologies to vectors
Vector Store          FAISS                  Index and search embeddings
Semantic Reasoner     HermiT, ELK, Pellet    Infer relationships and validate models
Query Interface       Python, REST API       Process natural language queries
Catalog Backend       Configurable           Store metadata and asset information

The architecture separates concerns cleanly: ontology management handles the semantic modeling, the embedding engine handles vectorization, FAISS handles efficient search, and the reasoner handles inference. This modularity allows each component to be scaled and optimized independently.

Evaluation: Measuring Search Quality

We evaluate semantic search quality using standard information retrieval metrics:

  • MRR: Mean Reciprocal Rank
  • Hit@K: Hit Rate at K Results
  • P@K: Precision at K
Evaluation Metrics
# Mean Reciprocal Rank (MRR)
# Average of reciprocal ranks of first relevant result
MRR = (1/|Q|) × Σ(1/rank_i)

# Hit Rate at K
# Proportion of queries with relevant result in top K
Hit@K = |{q : relevant_result ∈ top_k(q)}| / |Q|

# These metrics help tune:
# - Embedding model hyperparameters
# - FAISS index configuration
# - Ontology granularity

Continuous evaluation is essential because embedding quality depends on ontology training and model configuration. The Pizza OWL ontology from Stanford's Protégé project serves as our proof-of-concept benchmark, but production deployments require domain-specific ontologies and evaluation datasets.
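The two formulas translate directly into a few lines of Python; the ranked results and relevance judgments below are synthetic.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR over queries: reciprocal rank of the first relevant hit (0 if none)."""
    total = 0.0
    for qid, results in ranked_results.items():
        for rank, item in enumerate(results, start=1):
            if item in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

def hit_at_k(ranked_results, relevant, k):
    """Fraction of queries with at least one relevant result in the top k."""
    hits = sum(1 for qid, results in ranked_results.items()
               if any(item in relevant[qid] for item in results[:k]))
    return hits / len(ranked_results)

# Synthetic ranked results for two queries:
ranked = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
relevant = {"q1": {"b"}, "q2": {"z"}}

print(mean_reciprocal_rank(ranked, relevant))  # (1/2 + 1/3) / 2 ≈ 0.4167
print(hit_at_k(ranked, relevant, k=2))         # 0.5
```

Running these over a held-out query set after every re-embedding makes regressions in index or ontology configuration visible immediately.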

Practical Considerations

The Metadata Balance

More metadata is not always better. Term frequencies follow a heavily skewed (Zipfian) distribution, so piling on ever more descriptors mostly adds rare, noisy terms that can degrade search performance. The key is curating metadata that captures essential semantic relationships without overwhelming the embedding space.

Index Maintenance

Semantic catalogs require ongoing maintenance as data assets evolve. New assets need ontology mappings, embeddings must be regenerated when ontologies change, and FAISS indices need periodic rebuilding to maintain query performance.
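One common pattern for incremental maintenance (a sketch under assumed names, not part of the described system) is to hash each ontology's serialized form and re-embed only the assets whose ontologies changed since the last index build:

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a serialized ontology."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_assets(ontologies, known_hashes):
    """Return asset names whose ontology text changed since the last build.

    ontologies:   asset name -> serialized ontology text (e.g. OWL/XML)
    known_hashes: asset name -> hash recorded at the previous index build
    """
    return [name for name, text in ontologies.items()
            if known_hashes.get(name) != content_hash(text)]

# "pizza" was edited since the last build; "sales" was not.
ontologies = {"pizza": "<owl>...v2...</owl>", "sales": "<owl>...v1...</owl>"}
known = {"pizza": content_hash("<owl>...v1...</owl>"),
         "sales": content_hash("<owl>...v1...</owl>")}

print(stale_assets(ontologies, known))  # ['pizza'] -> re-embed, then rebuild index
```

Only the stale assets need new embeddings; the FAISS index can then be rebuilt (or the changed vectors re-added) on a schedule rather than on every edit.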

Two Modes of Search

Semantic search enables two complementary capabilities:

  • Searching for Data — Finding relevant datasets based on conceptual queries
  • Searching in Data — Understanding what concepts exist within a dataset and how they relate to other assets

Benefits: Why This Matters

Improved Discovery

Find relevant data even when terminology differs. Contextual understanding dramatically increases search recall without sacrificing precision.

Enhanced Governance

Clear semantic relationships facilitate better data management. Understand lineage, ownership, and compliance through formal concept models.

Data Mesh Ready

Handle distributed data products across domains. Semantic search provides unified discovery regardless of how teams label their assets.

Reduced Time-to-Insight

Data analysts spend less time searching and more time analyzing. The right data surfaces faster through meaning-based retrieval.

Explore the Code

The complete implementation is available on GitHub, including the OWL2Vec* integration, FAISS indexing, and the Pizza ontology demonstration.
