Multi-Cloud LLM Training Architecture: GCP, AWS & RunPod GPU Infrastructure
Complete reference architecture for LLM fine-tuning with data on GCP/AWS and training on RunPod GPUs. Covers networking options, CIDR planning, Docker registry integration, and secure data transfer patterns.
Introduction
Training large language models (LLMs) requires substantial GPU compute resources that traditional cloud providers may not always offer at competitive prices. RunPod has emerged as a popular alternative, providing access to high-end GPUs like A100 and H100 at significantly lower costs than AWS or GCP.
However, many enterprises already have their training data stored in cloud platforms like AWS S3 or Google Cloud Storage. This creates a common architectural challenge: how do you efficiently and securely transfer data from your existing cloud infrastructure to external GPU compute providers like RunPod?
This article provides a complete reference architecture for multi-cloud LLM training, covering:
- RunPod Networking Options - Cloud Sync, Global Networking, and optional WireGuard
- Cloud Data Storage - GCP and AWS data lake patterns
- Docker Image Management - Where to store and pull training container images
- Network Architecture - CIDR planning and secure connectivity
- LLM Fine-Tuning Data Flow - End-to-end training pipeline
Key Finding from Research: RunPod does NOT support native VPC peering with external clouds. Data transfer happens via Cloud Sync (TLS/HTTPS) or optional self-hosted WireGuard tunnels.
Multi-Cloud Training Architecture Overview
Architecture Components
| Component | Purpose | Location |
|---|---|---|
| GCS / S3 | Training data storage | GCP / AWS |
| Artifact Registry / ECR | Docker image storage | GCP / AWS |
| Cloud Sync | Data transfer to RunPod | RunPod built-in |
| Global Networking | Pod-to-Pod communication | RunPod internal |
| GPU Pods | LLM fine-tuning compute | RunPod |
RunPod Networking: What You Need to Know
RunPod offers several networking features, but it's crucial to understand their capabilities and limitations.
Global Networking (Internal)
RunPod's Global Networking creates a secure, private network connecting all your Pods within your account:
| Feature | Specification |
|---|---|
| Speed | 100 Mbps between Pods |
| DNS | <podid>.runpod.internal |
| Isolation | Complete isolation from external networks |
| Availability | NVIDIA GPU Pods only |
| Regions | 17+ data centers worldwide |
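The `<podid>.runpod.internal` DNS names can double as the rendezvous address for multi-node fine-tuning over Global Networking. A minimal sketch, assuming a placeholder Pod ID and port (torchrun and DeepSpeed read `MASTER_ADDR`/`MASTER_PORT` by convention):

```python
def internal_addr(pod_id: str) -> str:
    """Build the Global Networking DNS name for a Pod (<podid>.runpod.internal)."""
    return f"{pod_id}.runpod.internal"


def rendezvous_env(master_pod_id: str, port: int = 29500) -> dict:
    """Environment variables torchrun/DeepSpeed expect for multi-node training."""
    return {
        "MASTER_ADDR": internal_addr(master_pod_id),
        "MASTER_PORT": str(port),
    }


# "abc123" is a placeholder Pod ID, not a real one
env = rendezvous_env("abc123")
print(env["MASTER_ADDR"])  # abc123.runpod.internal
```

Because the internal network is fully isolated from the public internet, no ports need to be exposed for worker Pods to reach the master.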
Connectivity Options Comparison
| Option | Pros | Cons | Use Case |
|---|---|---|---|
| Cloud Sync (Default) | Easy setup, built-in, TLS encrypted | Uses public internet | Standard training workflows |
| WireGuard (Advanced) | Private connectivity, no public ports | Self-hosted, more complex | High-security requirements |
GCP Network Architecture
When your training data lives in GCP, implement a hub-spoke VPC architecture for optimal security and scalability.
GCP Hub-Spoke Network (CIDR Planning)
| VPC | CIDR Block | Purpose |
|---|---|---|
| Hub VPC | 10.0.0.0/16 | Centralized egress, VPN gateway |
| Data Spoke | 10.1.0.0/16 | GCS, BigQuery data storage |
| ML Spoke | 10.2.0.0/16 | Artifact Registry, ML services |
| Shared Services | 10.3.0.0/16 | Vault, secrets management |
GCP Best Practices
- Private Service Connect - Access GCS without public IPs
- VPC Service Controls - Prevent data exfiltration
- Cloud NAT - Controlled egress for RunPod API calls
- Cloud VPN - Optional WireGuard gateway
AWS Network Architecture
For AWS-based training data, use a multi-subnet VPC with VPC endpoints for private access.
AWS VPC Network (CIDR Planning)
| Subnet | CIDR Block | Purpose |
|---|---|---|
| VPC | 172.16.0.0/16 | Main VPC |
| Public Subnet | 172.16.0.0/24 | NAT Gateway, bastion |
| Data Subnet | 172.16.1.0/24 | S3/ECR endpoints |
| ML Subnet | 172.16.2.0/24 | VPN endpoint |
AWS Best Practices
- S3 Gateway Endpoint - Free private S3 access
- ECR Interface Endpoint - Private Docker pulls
- Security Groups - Least privilege access
- VPN Endpoint - Optional WireGuard connectivity
Docker Image Storage
Registry Comparison
| Registry | Cloud | Features | RunPod Access |
|---|---|---|---|
| Artifact Registry | GCP | Vulnerability scanning, IAM | Public or VPC endpoint |
| ECR | AWS | Lifecycle policies, cross-account | Public or PrivateLink |
| Docker Hub | Neutral | Easy public access | Direct pull |
| RunPod Registry | RunPod | S3-compatible API | Native integration |
Docker Image Strategy
```yaml
# CI/CD pipeline for multi-registry push
name: Build and Push
on:
  push:
    branches: [main]
    paths: ['docker/**']

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build Image
        run: docker build -t llm-trainer:${{ github.sha }} .

      # Push to GCP Artifact Registry
      - name: Push to Artifact Registry
        run: |
          gcloud auth configure-docker us-central1-docker.pkg.dev
          docker tag llm-trainer:${{ github.sha }} \
            us-central1-docker.pkg.dev/project/repo/llm-trainer:${{ github.sha }}
          docker push us-central1-docker.pkg.dev/project/repo/llm-trainer:${{ github.sha }}

      # Push to AWS ECR
      - name: Push to ECR
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin \
            ${{ secrets.AWS_ACCOUNT }}.dkr.ecr.us-east-1.amazonaws.com
          docker tag llm-trainer:${{ github.sha }} \
            ${{ secrets.AWS_ACCOUNT }}.dkr.ecr.us-east-1.amazonaws.com/llm-trainer:${{ github.sha }}
          docker push ${{ secrets.AWS_ACCOUNT }}.dkr.ecr.us-east-1.amazonaws.com/llm-trainer:${{ github.sha }}
```
LLM Fine-Tuning Data Flow
Training Pipeline Stages
| Stage | Location | Tool |
|---|---|---|
| Raw Data | GCP/AWS | GCS/S3 bucket |
| Tokenization | Cloud | Dataflow/Glue |
| Data Sharding | Cloud | Custom script |
| Secure Transfer | Network | Cloud Sync (TLS) |
| Data Loading | RunPod | PyTorch DataLoader |
| Fine-Tuning | RunPod | DeepSpeed/FSDP |
| Checkpoint Save | RunPod → Cloud | Cloud Sync |
| Model Export | Cloud | GCS/S3 |
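The data sharding stage above can be sketched with the standard library alone. A minimal sketch, assuming a JSONL source file; the shard size and file naming are illustrative, not a fixed convention:

```python
from pathlib import Path


def _flush(lines: list[str], out_dir: Path, idx: int) -> Path:
    """Write one shard to disk and return its path."""
    path = out_dir / f"shard-{idx:05d}.jsonl"
    path.write_text("".join(lines))
    return path


def shard_jsonl(src: Path, out_dir: Path, records_per_shard: int = 1000) -> list[Path]:
    """Split one JSONL dataset into fixed-size shards for parallel DataLoader workers."""
    out_dir.mkdir(parents=True, exist_ok=True)
    shards: list[Path] = []
    buf: list[str] = []
    with src.open() as f:
        for line in f:
            buf.append(line)
            if len(buf) == records_per_shard:
                shards.append(_flush(buf, out_dir, len(shards)))
                buf = []
    if buf:  # final partial shard
        shards.append(_flush(buf, out_dir, len(shards)))
    return shards
```

Fixed-size shards let each GPU worker stream an even slice of the dataset without loading everything into memory.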
Cloud Sync Configuration
```python
# RunPod Cloud Sync for LLM training
import os

from runpod.cloud_sync import CloudSync

# Initialize sync with GCS
gcs_sync = CloudSync(
    provider="gcs",
    credentials_json=os.environ["GCS_SERVICE_ACCOUNT"],
    bucket="training-data-bucket",
    local_path="/workspace/data",
)

# Sync training data
gcs_sync.download(
    source_path="datasets/llm-finetune/",
    include_patterns=["*.parquet", "*.jsonl"],
)

# ... run fine-tuning ...

# After training, sync checkpoints back
run_id = os.environ.get("RUN_ID", "dev")  # unique identifier for this training run
s3_sync = CloudSync(
    provider="aws",
    access_key=os.environ["AWS_ACCESS_KEY"],
    secret_key=os.environ["AWS_SECRET_KEY"],
    region="us-east-1",
    bucket="model-checkpoints-bucket",
    local_path="/workspace/checkpoints",
)
s3_sync.upload(
    destination_path=f"run-{run_id}/",
    include_patterns=["*.pt", "*.safetensors"],
)
```
CIDR Planning for Multi-Cloud
Complete CIDR Allocation
| Environment | CIDR | Notes |
|---|---|---|
| GCP Hub | 10.0.0.0/16 | 65,536 IPs |
| GCP Data | 10.1.0.0/16 | Storage VPC |
| GCP ML | 10.2.0.0/16 | Registry, ML services |
| AWS VPC | 172.16.0.0/16 | Non-overlapping with GCP |
| AWS Public | 172.16.0.0/24 | NAT, bastion |
| AWS Private | 172.16.1.0/24 | Data subnets |
| WireGuard Tunnel | 10.100.0.0/24 | Optional overlay |
| RunPod Internal | Managed | *.runpod.internal |
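The allocation above can be sanity-checked programmatically. Python's `ipaddress` module flags any overlapping pair of top-level blocks (AWS subnets are omitted because they intentionally nest inside the AWS VPC, and RunPod's internal range is managed for you):

```python
from ipaddress import ip_network
from itertools import combinations

# Top-level CIDR plan from the table above
cidr_plan = {
    "gcp-hub": "10.0.0.0/16",
    "gcp-data": "10.1.0.0/16",
    "gcp-ml": "10.2.0.0/16",
    "gcp-shared": "10.3.0.0/16",
    "aws-vpc": "172.16.0.0/16",
    "wireguard": "10.100.0.0/24",
}


def find_overlaps(plan: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named CIDR blocks that overlap."""
    nets = {name: ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for (a, na), (b, nb) in combinations(nets.items(), 2)
        if na.overlaps(nb)
    ]


print(find_overlaps(cidr_plan))  # [] — no overlaps in this plan
```

Running this in CI whenever the network plan changes catches overlaps before they reach Terraform or the VPN configuration.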
Secure Network Path
Security Layers
| Layer | GCP | AWS | RunPod |
|---|---|---|---|
| Data at Rest | CMEK encryption | KMS encryption | AES-256 |
| Data in Transit | TLS 1.3 | TLS 1.3 | TLS 1.3 |
| Access Control | IAM + VPC-SC | IAM + SG | API keys |
| Network Isolation | Private endpoints | PrivateLink | Global Networking |
When to Use WireGuard vs Cloud Sync
Based on our research, here's the decision framework:
Use Cloud Sync (Default) When:
- Standard LLM training workflows
- Data is not extremely sensitive
- You want simple, managed connectivity
- Transfer speeds over the public internet are acceptable for your dataset sizes
Use WireGuard When:
- Regulatory requirements prohibit public endpoints
- You need private IP connectivity
- You want to avoid exposing any ports publicly
- You're running sensitive workloads
WireGuard is FREE and Open Source
WireGuard is licensed under:
- Kernel module: GPLv2
- User-space tools and implementations: a mix of GPL-2.0 (wireguard-tools) and permissive licenses such as MIT (wireguard-go)
You can deploy WireGuard at no licensing cost—you only pay for the VM/bandwidth.
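For illustration, a self-hosted tunnel between a gateway VM in the cloud hub VPC and a RunPod Pod might look like the following; the keys, hostname, and addresses are placeholders, and the `10.0.0.0/14` route is simply the aggregate of the GCP hub/spoke ranges from the CIDR plan:

```ini
# /etc/wireguard/wg0.conf on the RunPod Pod (all values are placeholders)
[Interface]
Address = 10.100.0.2/24          # Pod's address inside the tunnel overlay
PrivateKey = <pod-private-key>

[Peer]
# Cloud-side WireGuard gateway (e.g., a small VM in the GCP hub VPC)
PublicKey = <gateway-public-key>
Endpoint = gateway.example.com:51820
AllowedIPs = 10.0.0.0/14         # Route GCP hub/spoke ranges through the tunnel
PersistentKeepalive = 25         # Keep NAT mappings alive from behind RunPod's NAT
```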
Best Practices Summary
1. Data Storage
- Use private endpoints for GCS/S3 access
- Implement VPC Service Controls (GCP) or VPC endpoints (AWS)
- Enable encryption at rest with customer-managed keys
2. Networking
- Single cloud? Use native VPC networking
- Multi-cloud? Consider Google Cloud Cross-Cloud Interconnect for dedicated private links to AWS
- External GPU? Use Cloud Sync or WireGuard
3. Docker Images
- Push to both Artifact Registry and ECR
- Use VPC endpoints for private pulls
- Implement image signing and vulnerability scanning
4. Security
- Never hard-code API keys in scripts
- Use environment variables or secrets managers
- Implement least privilege IAM policies
- Enable audit logging for all data access
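The "never hard-code API keys" rule above is worth enforcing in code: read credentials from the environment (populated by your secrets manager) and fail loudly when one is missing. A minimal sketch; the variable names are illustrative:

```python
import os


def require_secret(name: str) -> str:
    """Read a credential from the environment, failing fast if it is absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing required secret: {name} (inject it via your secrets manager)"
        )
    return value


# Example usage at startup, before any training work begins:
# aws_key = require_secret("AWS_ACCESS_KEY")
# gcs_creds = require_secret("GCS_SERVICE_ACCOUNT")
```

Failing at startup is far cheaper than discovering a missing credential hours into a fine-tuning run when the checkpoint upload begins.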
Conclusion
Building a multi-cloud LLM training architecture requires understanding the networking capabilities and limitations of each platform. Key takeaways:
- RunPod uses Cloud Sync for external data transfer—there's no native VPC peering
- Global Networking provides 100 Mbps Pod-to-Pod connectivity within RunPod
- WireGuard is optional for cases requiring private connectivity without public ports
- Docker images should be stored in cloud registries (Artifact Registry, ECR) with VPC endpoints
- CIDR planning must avoid overlaps between GCP, AWS, and tunnel networks