The Problem: Why Traditional Data Catalogs Fail
Traditional data catalogs rely on string matching. They use frameworks like Lucene or Elasticsearch to index metadata and return results based on keyword overlap. This approach has a fundamental flaw: it does not understand meaning.
- Vocabulary Mismatch — Searching for "customer" misses assets labeled "client" or "patron" even though they represent the same concept
- Context Blindness — String matching cannot distinguish between "bank" (financial institution) and "bank" (river edge)
- Relationship Ignorance — Traditional search does not understand that "revenue" is related to "sales" and "income"
- Data Mesh Complexity — As organizations adopt Data Mesh architectures, data products spread across domains, making keyword-based discovery increasingly inadequate
Consider a data analyst searching for "pizza toppings data" in an enterprise catalog. A traditional system would only return results containing those exact terms. But what about datasets labeled "ingredient catalog" or "menu items database"? These are semantically relevant but lexically invisible.
The Solution: Semantic Search Through Ontologies
The Semantic Data Catalog solves this by combining traditional data catalogs with semantic search capabilities. The formula is straightforward:
Semantic Data Catalog = Data Catalog + Semantic Search
Where Semantic Search =
Ontology-based concept modeling +
Vector embeddings for meaning representation +
Similarity search for contextual retrieval
Instead of matching strings, we match meanings. Each data asset is described using an ontology — a formal representation of concepts and their relationships. These ontologies are then converted into numerical vectors that capture semantic meaning. When users search, their queries are also converted to vectors, and we find assets with similar meaning rather than similar spelling.
1. Ontology Catalog — store structured ontologies for each data asset
2. Embedding Generation — convert ontologies to vectors via OWL2Vec*
3. Vector Indexing — load embeddings into FAISS for fast retrieval
4. Query Processing — convert queries to vectors and find similar concepts
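The four stages above can be sketched end to end in a few lines. This is a toy illustration only: `embed()` is a hypothetical bag-of-words stand-in for OWL2Vec*, and the dictionary index stands in for FAISS; only the overall flow is realistic.

```python
# Toy sketch of the four pipeline stages. embed() is a hypothetical
# stand-in for OWL2Vec*; the dict index stands in for FAISS.
import math

VOCAB = ["pizza", "topping", "ingredient", "customer", "order", "revenue"]

def embed(text: str) -> list[float]:
    # Bag-of-words over a fixed vocabulary (crude plural stripping for the demo)
    tokens = [t.rstrip("s") for t in text.lower().split()]
    return [float(tokens.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Ontology catalog: textual descriptions derived from each asset's ontology
catalog = {
    "ingredient_catalog": "pizza topping ingredient",
    "orders_fact": "customer order revenue",
}
# 2. + 3. Embedding generation and indexing
index = {name: embed(desc) for name, desc in catalog.items()}
# 4. Query processing: embed the query, rank assets by similarity
query = embed("pizza toppings data")
ranked = sorted(index, key=lambda n: cosine(query, index[n]), reverse=True)
# ranked[0] is "ingredient_catalog" — it shares meaning-bearing terms
# with the query even though the asset never says "toppings data"
```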
How It Works: The Technical Deep Dive
Understanding Ontologies and Knowledge Graphs
An ontology is a formal specification of a conceptualization. In practical terms, it is a structured way to describe what things exist in a domain and how they relate to each other. We represent ontologies as 5-tuples:
Ontology O = (C, R, F, I, A)
Where:
C = Concepts (e.g., Pizza, Topping, Customer)
R = Relationships (e.g., hasTopping, orderedBy)
F = Functions (e.g., calculatePrice, validateOrder)
I = Instances (e.g., Margherita, Pepperoni)
A = Axioms (e.g., VegetarianPizza ≡ Pizza ⊓ ∀hasTopping.(VegetableTopping ⊔ CheeseTopping))
This structure enables the system to understand that a Margherita pizza is a type of pizza, which has toppings, and that those toppings are vegetables or cheeses. When someone searches for "vegetarian options," the system can return Margherita even if the word "vegetarian" never appears in its metadata.
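The 5-tuple maps naturally onto a plain data structure. A minimal Python sketch follows; the class and field names are illustrative, not taken from any OWL library:

```python
# Illustrative model of the O = (C, R, F, I, A) 5-tuple.
# Names here are hypothetical, not from an OWL toolkit.
from dataclasses import dataclass

@dataclass
class Ontology:
    concepts: set        # C: concept names
    relationships: dict  # R: relation name -> (domain, range)
    functions: dict      # F: function name -> callable
    instances: dict      # I: instance name -> concept
    axioms: list         # A: logical constraints (plain strings here)

pizza_onto = Ontology(
    concepts={"Pizza", "Topping", "VegetarianPizza", "VegetableTopping"},
    relationships={"hasTopping": ("Pizza", "Topping")},
    functions={"calculatePrice": lambda base, toppings: base + 1.5 * len(toppings)},
    instances={"Margherita": "Pizza", "Tomato": "VegetableTopping"},
    axioms=["VegetarianPizza ≡ Pizza ⊓ ∀hasTopping.(VegetableTopping ⊔ CheeseTopping)"],
)

# Instances resolve to their concept; functions are directly callable
margherita_concept = pizza_onto.instances["Margherita"]
price = pizza_onto.functions["calculatePrice"](8.0, ["Tomato", "Mozzarella"])
```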
OWL2Vec*: Converting Ontologies to Vectors
OWL2Vec* (also written OWL2Vec-Star), developed at the University of Oxford, is an embedding model specifically designed for OWL ontologies. It converts the rich semantic structure of ontologies into high-dimensional numerical vectors.
# Words with similar meanings get similar vectors
dog = [1.6, -0.3, 7.2, 19.6, 3.1, ..., 20.6]
puppy = [1.5, -0.4, 7.2, 19.5, 3.2, ..., 20.8]
# The distance between these vectors is small
# indicating semantic similarity
# Unrelated concepts have distant vectors
car = [8.2, 14.1, -3.4, 2.1, 9.7, ..., -5.3]
# Large distance from dog/puppy vectors
The embedding process captures not just individual concepts but also the relationships between them. If "topping" has a "partOf" relationship with "pizza" in the ontology, this relationship is encoded in the vector space — concepts connected in the ontology will be positioned near each other in vector space.
FAISS: Efficient Vector Search at Scale
Once we have vector representations, we need to search them efficiently. With millions of data assets, comparing every query against every vector is computationally prohibitive. FAISS (Facebook AI Similarity Search) solves this through Approximate Nearest Neighbors (ANN) algorithms.
import faiss
import numpy as np

# Initialize FAISS index with IVF clustering
dimension = 300  # embedding dimension
nlist = 100      # number of clusters (Voronoi cells)
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Train on existing ontology embeddings
# (ontology_vectors: float32 array of shape [n_assets, dimension])
index.train(ontology_vectors)
index.add(ontology_vectors)
index.nprobe = 10  # number of clusters to visit per query

# Search for similar concepts
# embed_query() stands for the OWL2Vec*-based query embedder
query_vector = embed_query("pizza toppings")
query_matrix = np.asarray([query_vector], dtype="float32")  # FAISS expects a 2D batch
distances, indices = index.search(query_matrix, 10)
# Returns distances and indices of the top 10 semantically similar data assets
FAISS uses inverted file indexing with clustering: it first groups similar vectors into clusters, then at query time searches only the few clusters nearest the query (controlled by nprobe). This replaces a full O(n) scan with a scan over a small fraction of the index while maintaining high recall.
Semantic Reasoners: Inferring New Knowledge
The system uses the HermiT reasoner (with alternatives like ELK, Ontop, or Pellet) to infer logical consequences from ontological axioms. This enables automated classification and validation of concept models.
# Given axioms in the ontology:
# 1. VegetarianPizza ≡ Pizza ⊓ ∀hasTopping.(VegetableTopping ⊔ CheeseTopping)
# 2. Margherita ⊑ Pizza
# 3. Margherita hasTopping Tomato
# 4. Margherita hasTopping Mozzarella
# 5. Margherita ⊑ ∀hasTopping.(Tomato ⊔ Mozzarella)  (closure axiom:
#    needed under the open-world assumption to rule out other toppings)
# 6. Tomato ⊑ VegetableTopping
# 7. Mozzarella ⊑ CheeseTopping
# The reasoner can infer:
# Margherita ⊑ VegetarianPizza
# This inference happens automatically, enriching search results
# without manual classification
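The chain of inferences can be sketched as a toy forward check in plain Python. This only illustrates the logic; a production system would load the ontology with a library such as owlready2 and invoke HermiT through its reasoner integration rather than hand-rolling the rules.

```python
# Toy subsumption check illustrating the inference above.
# A real system delegates this to HermiT; the structures here are illustrative.
SUBCLASS = {  # direct subclass axioms: child -> parent
    "Tomato": "VegetableTopping",
    "Mozzarella": "CheeseTopping",
    "Salami": "MeatTopping",
}
TOPPINGS = {  # asserted (and closed) hasTopping fillers per pizza
    "Margherita": ["Tomato", "Mozzarella"],
    "Diavola": ["Tomato", "Salami"],
}
# VegetarianPizza ≡ pizza whose every topping is a vegetable or cheese topping
VEGETARIAN_SAFE = {"VegetableTopping", "CheeseTopping"}

def ancestors(cls: str):
    # Walk up the subclass hierarchy, yielding each superclass
    while cls in SUBCLASS:
        cls = SUBCLASS[cls]
        yield cls

def is_vegetarian(pizza: str) -> bool:
    # Every topping must have a vegetarian-safe superclass
    return all(
        any(a in VEGETARIAN_SAFE for a in ancestors(t))
        for t in TOPPINGS[pizza]
    )

print(is_vegetarian("Margherita"))  # → True
print(is_vegetarian("Diavola"))    # → False (Salami is a MeatTopping)
```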
Similarity Metrics: Measuring Semantic Distance
When comparing vectors, we use distance metrics to quantify similarity. The two primary options are cosine similarity and Euclidean distance:
# Cosine similarity and Euclidean distance, using NumPy
import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = (A · B) / (||A|| × ||B||)
    # Range: -1 to 1 (1 = identical direction, 0 = orthogonal)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # d(A, B) = √(Σ(Aᵢ - Bᵢ)²); smaller distance = more similar
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

# Example (embed() is the ontology/query embedding model):
query_vector = embed("margherita and onion")
# Rank assets where:
#   cosine_similarity(query_vector, asset_vector) → 1  (high similarity)
# or where:
#   euclidean_distance(query_vector, asset_vector) → 0  (low distance)
Cosine similarity is typically preferred for text embeddings because it focuses on direction rather than magnitude, making it robust to documents of different lengths.
System Architecture
| Component | Technology | Purpose |
|---|---|---|
| Ontology Management | OWL, Protégé | Define and maintain semantic models |
| Embedding Engine | OWL2Vec*, Python | Convert ontologies to vectors |
| Vector Store | FAISS | Index and search embeddings |
| Semantic Reasoner | HermiT, ELK, Pellet | Infer relationships and validate models |
| Query Interface | Python, REST API | Process natural language queries |
| Catalog Backend | Configurable | Store metadata and asset information |
The architecture separates concerns cleanly: ontology management handles the semantic modeling, the embedding engine handles vectorization, FAISS handles efficient search, and the reasoner handles inference. This modularity allows each component to be scaled and optimized independently.
Evaluation: Measuring Search Quality
We evaluate semantic search quality using standard information retrieval metrics:
# Mean Reciprocal Rank (MRR)
# Average of reciprocal ranks of first relevant result
MRR = (1/|Q|) × Σ(1/rank_i)
# Hit Rate at K
# Proportion of queries with relevant result in top K
Hit@K = |{q : relevant_result ∈ top_k(q)}| / |Q|
# These metrics help tune:
# - Embedding model hyperparameters
# - FAISS index configuration
# - Ontology granularity
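Both metrics are a few lines of Python. A minimal sketch, assuming each query's ranked result list and its ground-truth set of relevant assets are already available:

```python
# MRR and Hit@K for evaluating ranked retrieval results.
def mrr(ranked_results: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit
    per query (0 if no relevant result is returned)."""
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, asset in enumerate(results, start=1):
            if asset in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

def hit_at_k(ranked_results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Proportion of queries with at least one relevant result in the top K."""
    hits = sum(
        1 for results, rel in zip(ranked_results, relevant)
        if rel & set(results[:k])
    )
    return hits / len(ranked_results)

# Two toy queries: the first finds its relevant asset at rank 2,
# the second finds nothing relevant
runs = [["a", "b", "c"], ["x", "y", "z"]]
truth = [{"b"}, {"q"}]
print(mrr(runs, truth))          # → 0.25  ((1/2 + 0) / 2)
print(hit_at_k(runs, truth, 2))  # → 0.5   (1 of 2 queries hits in top 2)
```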
Continuous evaluation is essential because embedding quality depends on how the ontology is modeled and how the embedding model is configured. The Pizza ontology from Stanford's Protégé project serves as our proof-of-concept benchmark, but production deployments require domain-specific ontologies and evaluation datasets.
Practical Considerations
The Metadata Balance
More metadata is not always better: excessive or low-quality metadata introduces noise and can degrade search performance. The key is curating metadata that captures the essential semantic relationships without overwhelming the embedding space.
Index Maintenance
Semantic catalogs require ongoing maintenance as data assets evolve. New assets need ontology mappings, embeddings must be regenerated when ontologies change, and FAISS indices need periodic rebuilding to maintain query performance.
Two Modes of Search
Semantic search enables two complementary capabilities:
- Searching for Data — Finding relevant datasets based on conceptual queries
- Searching in Data — Understanding what concepts exist within a dataset and how they relate to other assets
Benefits: Why This Matters
Improved Discovery
Find relevant data even when terminology differs. Contextual understanding dramatically increases search recall without sacrificing precision.
Enhanced Governance
Clear semantic relationships facilitate better data management. Understand lineage, ownership, and compliance through formal concept models.
Data Mesh Ready
Handle distributed data products across domains. Semantic search provides unified discovery regardless of how teams label their assets.
Reduced Time-to-Insight
Data analysts spend less time searching and more time analyzing. The right data surfaces faster through meaning-based retrieval.
Explore the Code
The complete implementation is available on GitHub, including the OWL2Vec* integration, FAISS indexing, and the Pizza ontology demonstration.