The Problem: Why Traditional Analytics Is Broken
Enterprise analytics teams face a fundamental architectural problem. Despite investing heavily in data lakes and warehouses, they remain stuck in a pattern that does not scale:
- Centralized Bottlenecks — A single data team handles all requests from across the organization, creating queue times measured in weeks
- Report Factories — Teams spend their time building one-off reports instead of reusable data products that compound in value
- No Self-Service — Business users cannot access data without going through gatekeepers, stifling innovation and decision velocity
- Siloed Knowledge — Each dashboard or report exists in isolation, with duplicated logic and inconsistent metrics across teams
The result? Analytics departments become report-building factories rather than enablers of data-driven decisions. Data Mesh emerged as an answer to these problems, but implementing it requires thinking about data products as services, not just tables in a lake.
The Insight: Data Products Need Service Interfaces
Data Mesh proposes domain-oriented, decentralized data ownership. But ownership alone does not solve the consumption problem. A data product sitting in a Delta table is like an API without documentation or endpoints. The missing piece is the service layer.
micro-service + micro-frontend = agility
end-to-end automation = simplicity
Smart Analytics = Data Product + Service Interface + Visualization
This equation captures the core insight: applying proven software engineering practices (microservices, APIs, CI/CD) to analytics transforms how organizations consume data. Instead of building reports, you build applications.
The Solution: Databricks SQL Endpoints as the Service Layer
Our architecture exposes data products through Databricks SQL Endpoints (now called SQL Warehouses in recent Databricks releases), providing a standardized service interface with enterprise-grade capabilities built in:
Data Product Layer
Delta tables in Lakehouse with versioning and time-travel
Service Layer
Databricks SQL Endpoint with OAuth and RBAC
Micro-Frontend
Streamlit or Dash application for visualization
Deployment
Containerized app on Fargate or Kubernetes
Why Databricks SQL Endpoints?
The SQL Endpoint acts as the API gateway for your data products. It handles the cross-cutting concerns that every data service needs:
- Authentication — Native OAuth and Okta integration means no custom auth code
- Authorization — Fine-grained access control at the table and column level
- Performance — Serverless compute scales automatically with query load
- Governance — Unity Catalog provides lineage, discovery, and compliance
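The fine-grained access control above is expressed as ordinary SQL grants against Unity Catalog; a sketch, assuming a schema named smart_analytics and an illustrative consumer group:

```sql
-- Grant the analytics consumer group read-only access to the data product
GRANT USE SCHEMA ON SCHEMA smart_analytics TO `analytics-consumers`;
GRANT SELECT ON TABLE smart_analytics.taxi_trips TO `analytics-consumers`;
```

Because the grants live in the catalog rather than in application code, the same policy applies whether the data product is consumed from a notebook, a BI tool, or the micro-frontend described below.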
How It Works: Building a Data Product Service
Step 1: Define Your Data Product in Delta Lake
Data products live as Delta tables in your Lakehouse. Delta Lake provides ACID transactions, schema evolution, and time-travel queries out of the box.
-- Create managed Delta table for the NYC Taxi data product
CREATE TABLE IF NOT EXISTS smart_analytics.taxi_trips (
  pickup_datetime TIMESTAMP,
  dropoff_datetime TIMESTAMP,
  pickup_location_id INT,
  dropoff_location_id INT,
  passenger_count INT,
  trip_distance DOUBLE,
  fare_amount DOUBLE,
  tip_amount DOUBLE,
  total_amount DOUBLE,
  pickup_latitude DOUBLE,
  pickup_longitude DOUBLE,
  -- Expressions are not allowed in PARTITIONED BY, so derive the
  -- partition date as a generated column instead
  pickup_date DATE GENERATED ALWAYS AS (CAST(pickup_datetime AS DATE))
)
USING DELTA
PARTITIONED BY (pickup_date)
COMMENT 'NYC Taxi trip data product - updated daily';
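The time-travel capability mentioned above is available through plain SQL; for example (the version number and timestamp here are illustrative):

```sql
-- Inspect the table's commit history
DESCRIBE HISTORY smart_analytics.taxi_trips;

-- Query the data product as of an earlier version or point in time
SELECT COUNT(*) FROM smart_analytics.taxi_trips VERSION AS OF 12;
SELECT COUNT(*) FROM smart_analytics.taxi_trips TIMESTAMP AS OF '2023-01-15';
```

This is what makes auditable reporting (for example, regulatory reports that must be reproducible months later) essentially free with this architecture.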
Step 2: Connect via Databricks SQL Endpoint
The Python application connects to the data product through the SQL Endpoint. Credentials are supplied via environment variables, keeping secrets out of the application code.
from databricks import sql
from datetime import date
import os


def get_connection():
    """Establish a connection to the Databricks SQL Endpoint."""
    return sql.connect(
        server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
        http_path=os.getenv("DATABRICKS_HTTP_PATH"),
        access_token=os.getenv("DATABRICKS_TOKEN"),
    )


def query_taxi_data(date_filter: str, limit: int = 10000):
    """Query the taxi trips data product for a single pickup date."""
    # Validate inputs before interpolating them into SQL, to prevent injection
    safe_date = date.fromisoformat(date_filter).isoformat()
    safe_limit = int(limit)
    with get_connection() as conn:
        with conn.cursor() as cursor:
            cursor.execute(f"""
                SELECT
                    pickup_latitude,
                    pickup_longitude,
                    fare_amount,
                    tip_amount,
                    trip_distance
                FROM smart_analytics.taxi_trips
                WHERE DATE(pickup_datetime) = '{safe_date}'
                LIMIT {safe_limit}
            """)
            return cursor.fetchall_arrow().to_pandas()
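Because all data access is funneled through one function, the query-building and validation logic can be unit-tested without a live endpoint. A minimal sketch of that idea, with a hypothetical helper that is not part of the repo:

```python
from datetime import date


def build_taxi_query(date_filter: str, limit: int = 10000) -> str:
    """Build the SELECT for the taxi data product, validating inputs first.

    Raising early on malformed input keeps unvalidated strings out of SQL.
    """
    safe_date = date.fromisoformat(date_filter).isoformat()  # ValueError if malformed
    safe_limit = int(limit)
    return (
        "SELECT pickup_latitude, pickup_longitude, fare_amount, tip_amount, trip_distance "
        "FROM smart_analytics.taxi_trips "
        f"WHERE DATE(pickup_datetime) = '{safe_date}' "
        f"LIMIT {safe_limit}"
    )
```

An injection attempt such as `build_taxi_query("2023-01-15'; DROP TABLE x; --")` fails the date parse and raises before any SQL is assembled.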
Step 3: Build the Micro-Frontend with Streamlit
Streamlit transforms the data product into an interactive application. Users can filter, explore, and visualize without writing SQL.
import streamlit as st
import pydeck as pdk
from datetime import date

from data_product import query_taxi_data

st.set_page_config(page_title="NYC Taxi Analytics", layout="wide")
st.title("NYC Taxi Trip Analysis")

# User controls
selected_date = st.date_input("Select Date", value=date.today())
metric = st.selectbox("Color by", ["fare_amount", "tip_amount", "trip_distance"])

# Fetch data from the data product
df = query_taxi_data(str(selected_date))

# Render map visualization
st.pydeck_chart(pdk.Deck(
    map_style="mapbox://styles/mapbox/dark-v10",
    initial_view_state=pdk.ViewState(
        latitude=40.7128,
        longitude=-74.0060,
        zoom=11,
        pitch=45,
    ),
    layers=[
        pdk.Layer(
            "HexagonLayer",
            data=df,
            get_position=["pickup_longitude", "pickup_latitude"],
            radius=100,
            elevation_scale=4,
            elevation_range=[0, 1000],
            pickable=True,
            extruded=True,
        )
    ],
))
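Rather than hard-coding the Manhattan coordinates in the initial view state, the map center could also be derived from the data itself. A small stdlib sketch; the helper name is hypothetical, not from the repo:

```python
from statistics import mean


def center_view(latitudes, longitudes):
    """Compute a map center from pickup coordinates, e.g. for pdk.ViewState."""
    return {"latitude": mean(latitudes), "longitude": mean(longitudes)}
```

The returned dict can be splatted into the view state, e.g. `pdk.ViewState(**center_view(df["pickup_latitude"], df["pickup_longitude"]), zoom=11)`, so the map stays centered even if the data product later covers other cities.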
Step 4: Containerize and Deploy
The application is packaged as a Docker container, enabling deployment to any serverless platform.
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 9999
CMD ["streamlit", "run", "app.py", "--server.port=9999", "--server.address=0.0.0.0"]
# .env — runtime configuration injected into the container
DATABRICKS_SERVER_HOSTNAME=your-workspace.cloud.databricks.com
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/your-warehouse-id
DATABRICKS_TOKEN=your-access-token
MAPBOX_TOKEN=your-mapbox-token
Architecture Deep Dive
| Layer | Technology | Purpose |
|---|---|---|
| Storage | Delta Lake on S3/ADLS | ACID transactions, time-travel, schema evolution |
| Compute | Databricks SQL Endpoint | Serverless query execution with auto-scaling |
| Security | OAuth, Unity Catalog | Authentication, authorization, governance |
| Visualization | Streamlit, Dash | Python-native micro-frontends |
| Deployment | Docker, Fargate, DAPR | Containerized serverless deployment |
Why This Architecture Works
This architecture embodies the Data Mesh principles while adding the service layer that makes self-serve possible:
- Domain Ownership — Each team owns their data products as Delta tables with clear schemas and SLAs
- Self-Serve Infrastructure — The SQL Endpoint and container platform are shared infrastructure that teams use without managing
- Federated Governance — Unity Catalog enforces policies across all data products while allowing domain autonomy
- Product Thinking — Each data product has an interface (SQL), documentation (catalog), and application (micro-frontend)
Extending to Real-Time: Streaming Data Products
The architecture extends naturally to streaming scenarios. Delta Lake supports streaming writes, enabling near real-time data products.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (
    StructType, StructField, TimestampType, IntegerType, DoubleType
)

spark = SparkSession.builder.appName("TaxiStreaming").getOrCreate()

# Schema of the incoming JSON events (must match the Delta table)
schema = StructType([
    StructField("pickup_datetime", TimestampType()),
    StructField("dropoff_datetime", TimestampType()),
    StructField("pickup_location_id", IntegerType()),
    StructField("dropoff_location_id", IntegerType()),
    StructField("passenger_count", IntegerType()),
    StructField("trip_distance", DoubleType()),
    StructField("fare_amount", DoubleType()),
    StructField("tip_amount", DoubleType()),
    StructField("total_amount", DoubleType()),
    StructField("pickup_latitude", DoubleType()),
    StructField("pickup_longitude", DoubleType()),
])

# Read from Kafka stream
stream_df = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "taxi-events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Write to the Delta table (data product); toTable() starts the streaming query
(stream_df
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/taxi")
    .toTable("smart_analytics.taxi_trips")
)
The micro-frontend can poll the data product or use Streamlit's auto-refresh to provide near real-time dashboards without additional infrastructure.
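One lightweight way to implement that polling without hammering the SQL Endpoint on every rerun is a time-based cache around the query function. A minimal sketch with a stand-in fetcher; in Streamlit itself, decorating the query with `st.cache_data(ttl=...)` achieves the same effect:

```python
import time


def make_ttl_cache(fetch, ttl_seconds: float):
    """Wrap `fetch` so its result is reused until ttl_seconds elapse."""
    state = {"value": None, "expires": 0.0}

    def cached():
        now = time.monotonic()
        if now >= state["expires"]:
            state["value"] = fetch()          # refresh from the data product
            state["expires"] = now + ttl_seconds
        return state["value"]

    return cached
```

Each dashboard rerun inside the TTL window is served from memory; only the first call after expiry hits the endpoint, bounding query load regardless of how many users keep the page open.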
Impact: From Report Factory to Smart Analytics
Implementing this architecture transforms how analytics teams operate.
The shift is cultural as much as technical. Teams move from asking "Can you build me a report?" to "What data products can I use?" This is the promise of Data Mesh realized through thoughtful service architecture.
Where This Pattern Applies
Financial Services
Risk dashboards, trading analytics, regulatory reporting with audit trails via Delta time-travel
Retail & E-commerce
Inventory visibility, customer analytics, demand forecasting with real-time streaming updates
Healthcare
Patient journey analytics, operational dashboards, clinical research data products
Manufacturing
IoT sensor analytics, quality control, supply chain visibility with geospatial visualization
Getting Started
To run the reference implementation locally:
# Clone the repository
git clone https://github.com/mgorav/data-product-as-service.git
cd data-product-as-service
# Configure your environment
cp .env.example .env
# Edit .env with your Databricks and Mapbox credentials
# Build and run with Docker
make docker-build
make docker-run
# Access the application
open http://localhost:9999
Prerequisites
- Databricks workspace with SQL Endpoint enabled
- Mapbox token (free tier works for development)
- Docker and Make installed locally
- Sample data loaded into Delta table
Explore the Code
The complete implementation is available on GitHub with setup instructions, sample data, and deployment guides.