SOLLOL - Production-Ready Orchestration for Local LLM Clusters
Open-source orchestration layer that combines intelligent task routing with distributed model inference for local LLM clusters.
🎯 What is SOLLOL?
SOLLOL (Super Ollama Load balancer & Orchestration Layer) transforms your collection of Ollama nodes into an intelligent AI cluster with adaptive routing and automatic failover—all running on your own hardware.
The Problem
You have multiple machines with GPUs running Ollama, but:
- ❌ Manual node selection for each request
- ❌ No way to run models larger than your biggest GPU
- ❌ Can't distribute multi-agent workloads efficiently
- ❌ No automatic failover or load balancing
- ❌ Zero visibility into cluster performance
The SOLLOL Solution
SOLLOL provides:
- ✅ Intelligent routing that learns which nodes work best for each task
- ✅ Model sharding to run 70B+ models across multiple machines
- ✅ Parallel agent execution for multi-agent frameworks
- ✅ Auto-discovery of all nodes and capabilities
- ✅ Built-in observability with real-time metrics
- ✅ Zero-config deployment - just point and go
⚡ Quickstart (3 Commands)
# 1. Install SOLLOL
pip install sollol
# 2. Start the dashboard (optional but recommended)
python3 -m sollol.dashboard_service &
# 3. Run your first query
python3 -c "from sollol import OllamaPool; pool = OllamaPool.auto_configure(); print(pool.chat(model='llama3.2', messages=[{'role': 'user', 'content': 'Hello!'}])['message']['content'])"
What just happened?
- ✅ SOLLOL auto-discovered all Ollama nodes on your network
- ✅ Intelligently routed your request to the best available node
- ✅ Dashboard live at http://localhost:8080 (shows routing decisions, metrics, logs)
Expected output:
Discovering Ollama nodes...
Found 3 nodes: 10.9.66.45:11434, 10.9.66.154:11434, localhost:11434
Selected node: 10.9.66.45:11434 (GPU, 12ms latency)
Hello! How can I help you today?
Next steps:
- Visit http://localhost:8080 to see the dashboard
- Check Full Quick Start for production setup
- Read Examples for multi-agent, batch, and sharding patterns
🚀 Full Quick Start
Installation
pip install sollol
Basic Usage
from sollol import OllamaPool
# Auto-discover nodes and start routing
pool = OllamaPool.auto_configure()
# Make requests - SOLLOL routes intelligently
response = pool.chat(
model="llama3.2",
messages=[{"role": "user", "content": "Hello!"}]
)
Enable Real-Time GPU Monitoring
For accurate VRAM-aware routing, install the GPU reporter on each node:
# On each Ollama node, run:
sollol install-gpu-reporter --redis-host <redis-server-ip>
# Example:
sollol install-gpu-reporter --redis-host 10.9.66.154
What this does:
- Installs vendor-agnostic GPU monitoring (NVIDIA/AMD/Intel via gpustat)
- Publishes real-time VRAM stats to Redis every 5 seconds
- SOLLOL uses this data for intelligent routing decisions
- See GPU Monitoring Guide for details
Without GPU monitoring: SOLLOL falls back to estimates which may be inaccurate.
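To illustrate why live VRAM figures matter for routing, here is a minimal sketch of VRAM-aware node selection. It is not SOLLOL's implementation: the `pick_node` helper, the layout of the stats dictionary, and the numbers are all hypothetical stand-ins for whatever the GPU reporter publishes to Redis.

```python
from typing import Optional


def pick_node(vram_free_mb: dict[str, int], required_mb: int) -> Optional[str]:
    """Return the node with the most free VRAM that can fit the model, else None."""
    # Keep only nodes that currently have enough free VRAM (hypothetical stats layout).
    candidates = {node: free for node, free in vram_free_mb.items() if free >= required_mb}
    if not candidates:
        return None
    # Prefer the node with the largest headroom.
    return max(candidates, key=candidates.get)


# Hypothetical snapshot of free VRAM per node, in MB.
stats = {"10.9.66.45:11434": 9000, "10.9.66.154:11434": 3000}
print(pick_node(stats, required_mb=6000))
```

With stale or estimated figures the filter above could pick a node that no longer has room, which is the failure mode real-time reporting is meant to avoid.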
📸 Screenshots
Dashboard Overview
Real-time monitoring with P50/P95/P99 latency metrics, network nodes, RPC backends, and active applications
Ray & Dask Integration
Embedded Ray and Dask dashboards for distributed task monitoring
Activity Monitoring
Live request/response activity streams from Ollama nodes and RPC backends
Applications & Traces
Applications, distributed traces, and Ollama activity logs with real-time request/response tracking
🔥 Why SOLLOL?
1. Two Distribution Modes in One System
SOLLOL combines both task distribution and model sharding:
📊 Task Distribution (Horizontal Scaling)
Distribute multiple requests across your cluster in parallel:
# Run 10 agents simultaneously across 5 nodes
pool = OllamaPool.auto_configure()
responses = await asyncio.gather(*[
pool.chat(model="llama3.2", messages=[...])
for _ in range(10)
])
# Parallel execution across available nodes
🧩 Model Sharding (Vertical Scaling)
Run single large models that don't fit on one machine:
# Run larger models across multiple nodes
# Note: Verified with 13B across 2-3 nodes; larger models not extensively tested
router = HybridRouter(
enable_distributed=True,
num_rpc_backends=4
)
response = await router.route_request(
model="llama3:70b", # Sharded automatically
messages=[...]
)
Use them together! Small models use task distribution, large models use sharding.
2. Intelligent, Not Just Balanced
SOLLOL doesn't just distribute requests randomly—it learns and optimizes:
| Feature | Simple Load Balancer | SOLLOL |
|---|---|---|
| Routing | Round-robin | Context-aware scoring |
| Learning | None | Adapts from performance history |
| Resource Awareness | None | GPU/CPU/memory-aware |
| Task Optimization | None | Routes by task type complexity |
| Failover | Manual | Automatic with health checks |
| Priority | FIFO | Priority queue with fairness |
Example: SOLLOL automatically routes:
- Heavy generation tasks → GPU nodes with 24GB VRAM
- Fast embeddings → CPU nodes or smaller GPUs
- Critical requests → Fastest, most reliable nodes
- Batch processing → Lower priority, distributed load
3. Production-Ready from Day One
from sollol import SOLLOL, SOLLOLConfig
# Literally 3 lines to production
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start() # ✅ Gateway running on :8000
Out of the box:
- Auto-discovery of Ollama nodes
- Health monitoring and failover
- Prometheus metrics
- Web dashboard
- Connection pooling
- Request hedging
- Priority queuing
4. Unified Observability for Your Entire AI Network
SOLLOL provides a single pane of glass to monitor every application and every node in your distributed AI network.
- ✅ Centralized Dashboard: One web interface shows all applications, nodes, and RPC backends.
- ✅ Multi-App Tracking: See which applications (e.g., SynapticLlamas, custom agents) are using the cluster in real-time.
- ✅ Network-Wide Visibility: The dashboard runs as a persistent service, discovering and monitoring all components even if no applications are running.
- ✅ Zero-Config: Applications automatically appear in the dashboard with no extra code required.
This moves beyond per-application monitoring to provide true, centralized observability for your entire infrastructure.
🏗️ Architecture
High-Level Overview
┌────────────────────────────────────────────────────────┐
│ Your Application │
│ (SynapticLlamas, custom agents, etc.) │
└──────────────────────┬─────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ SOLLOL Gateway (:8000) │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Intelligent Routing Engine │ │
│ │ • Analyzes: task type, complexity, resources │ │
│ │ • Scores: all nodes based on context │ │
│ │ • Learns: from performance history │ │
│ │ • Routes: to optimal node │ │
│ └──────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Priority Queue + Failover │ │
│ └──────────────────────────────────────────────────┘ │
└────────┬─────────────────────────┬─────────────────────┘
│ │
▼ ▼
┌─────────────┐ ┌──────────────┐
│ Task Mode │ │ Shard Mode │
│ Ray Cluster │ │ llama.cpp │
└──────┬──────┘ └──────┬───────┘
│ │
▼ ▼
┌────────────────────────────────────────────────────────┐
│ Your Heterogeneous Cluster │
│ GPU (24GB) │ GPU (16GB) │ CPU (64c) │ GPU (8GB) │... │
└────────────────────────────────────────────────────────┘
How Routing Works
# 1. Request arrives
POST /api/chat {
"model": "llama3.2",
"messages": [{"role": "user", "content": "Complex analysis task..."}],
"priority": 8
}
# 2. SOLLOL analyzes
task_type = "generation" # Auto-detected
complexity = "high" # Token count analysis
requires_gpu = True # Based on task
estimated_duration = 3.2s # From history
# 3. SOLLOL scores all nodes
Node A (GPU 24GB, load: 0.2, latency: 120ms) → Score: 185.3 ✓ WINNER
Node B (GPU 8GB, load: 0.6, latency: 200ms) → Score: 92.1
Node C (CPU only, load: 0.1, latency: 80ms) → Score: 41.2
# 4. Routes to Node A, monitors execution, learns for next time
Scoring Algorithm:
Score = 100.0 (baseline)
× success_rate (0.0-1.0)
÷ (1 + latency_penalty)
× gpu_bonus (1.5x if GPU available & needed)
÷ (1 + load_penalty)
× priority_alignment
× task_specialization
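The formula above can be sketched in Python as follows. This is an illustrative reading, not SOLLOL's actual code: the shape of `latency_penalty` (latency in seconds) and `load_penalty` (load as a 0..1 fraction) are assumptions, so the scores will not reproduce the example numbers shown earlier.

```python
def score_node(success_rate, latency_ms, load, has_gpu, needs_gpu,
               priority_alignment=1.0, task_specialization=1.0):
    """Multiplicative node score following the formula in this section."""
    # Assumed penalty shapes; the real definitions may differ.
    latency_penalty = latency_ms / 1000.0
    load_penalty = load
    # 1.5x bonus only when a GPU is both available and needed.
    gpu_bonus = 1.5 if (has_gpu and needs_gpu) else 1.0

    score = 100.0                      # baseline
    score *= success_rate              # 0.0-1.0
    score /= 1.0 + latency_penalty
    score *= gpu_bonus
    score /= 1.0 + load_penalty
    score *= priority_alignment
    score *= task_specialization
    return score


gpu_node = score_node(0.99, latency_ms=120, load=0.2, has_gpu=True, needs_gpu=True)
cpu_node = score_node(0.99, latency_ms=80, load=0.1, has_gpu=False, needs_gpu=True)
print(f"GPU node: {gpu_node:.1f}, CPU node: {cpu_node:.1f}")
```

Because every factor multiplies or divides the baseline, a node with a zero success rate always scores zero regardless of its hardware.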
📦 Installation
Quick Install (PyPI)
pip install sollol
From Source
git clone https://github.com/BenevolentJoker-JohnL/SOLLOL.git
cd SOLLOL
pip install -e .
⚡ Quick Start
1. Synchronous API (No async/await needed!)
New in v0.3.6: SOLLOL now provides a synchronous API for easier integration with non-async applications.
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority
# Auto-discover and connect to all Ollama nodes
pool = OllamaPool.auto_configure()
# Make requests - SOLLOL routes intelligently
# No async/await needed!
response = pool.chat(
model="llama3.2",
messages=[{"role": "user", "content": "Hello!"}],
priority=Priority.HIGH, # Semantic priority levels
timeout=60 # Request timeout in seconds
)
print(response['message']['content'])
print(f"Routed to: {response.get('_sollol_routing', {}).get('host', 'unknown')}")
Key features of synchronous API:
- ✅ No async/await syntax required
- ✅ Works with synchronous agent frameworks
- ✅ Same intelligent routing and features
- ✅ Runs async code in background thread automatically
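The "background thread" point can be illustrated with a generic pattern: run an asyncio event loop in a daemon thread and submit coroutines to it from synchronous code. This is a sketch of the general technique, not the contents of `sollol.sync_wrapper`; `BackgroundLoop` and `fake_chat` are hypothetical names.

```python
import asyncio
import threading


class BackgroundLoop:
    """Runs an asyncio event loop in a daemon thread and exposes blocking calls."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=None):
        # Schedule the coroutine on the background loop and block for its result.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self):
        # Stop the loop from its own thread, then wait for the thread to exit.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_chat(prompt):
    await asyncio.sleep(0.01)  # stands in for a network call
    return {"message": {"content": f"echo: {prompt}"}}


runner = BackgroundLoop()
reply = runner.run(fake_chat("Hello!"), timeout=5)
print(reply["message"]["content"])
runner.close()
```

The caller never touches `await`, which is why this style fits synchronous agent frameworks.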
2. Async API (Original)
For async applications, use the original async API:
from sollol import OllamaPool
# Auto-discover and connect to all Ollama nodes
pool = await OllamaPool.auto_configure()
# Make requests - SOLLOL routes intelligently
response = await pool.chat(
model="llama3.2",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response['message']['content'])
print(f"Routed to: {response['_sollol_routing']['host']}")
print(f"Task type: {response['_sollol_routing']['task_type']}")
3. Priority-Based Multi-Agent Execution
New in v0.3.6: Use semantic priority levels and role-based mapping.
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority, get_priority_for_role
pool = OllamaPool.auto_configure()
# Define agents with different priorities
agents = [
{"name": "Researcher", "role": "researcher"}, # Priority 8
{"name": "Editor", "role": "editor"}, # Priority 6
{"name": "Summarizer", "role": "summarizer"}, # Priority 5
]
for agent in agents:
priority = get_priority_for_role(agent["role"])
response = pool.chat(
model="llama3.2",
messages=[{"role": "user", "content": f"Task for {agent['name']}"}],
priority=priority
)
# User-facing agents get priority, background tasks wait
Priority levels available:
- Priority.CRITICAL (10) - Mission-critical
- Priority.URGENT (9) - Fast response needed
- Priority.HIGH (7) - Important tasks
- Priority.NORMAL (5) - Default
- Priority.LOW (3) - Background tasks
- Priority.BATCH (1) - Can wait
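For readers who want to see how such levels and role mappings fit together, here is a small self-contained sketch. The numeric levels and the three role values come from this section; the enum layout, the `priority_for_role` helper, and the fallback to NORMAL for unknown roles are assumptions, not the actual `sollol.priority_helpers` code.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 10
    URGENT = 9
    HIGH = 7
    NORMAL = 5
    LOW = 3
    BATCH = 1


# Role values taken from the comments in the example above.
ROLE_PRIORITIES = {"researcher": 8, "editor": 6, "summarizer": 5}


def priority_for_role(role: str) -> int:
    """Map a role name to a numeric priority (unknown roles assumed NORMAL)."""
    return ROLE_PRIORITIES.get(role.lower(), int(Priority.NORMAL))


for role in ("researcher", "editor", "translator"):
    print(role, priority_for_role(role))
```

Using an `IntEnum` keeps the semantic names while still allowing plain integers such as `priority=8` to be compared against them.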
4. Model Sharding with llama.cpp (Large Models)
Run models larger than your biggest GPU by distributing layers across multiple machines.
When to Use Model Sharding
Use model sharding when:
- ✅ Model doesn't fit on your largest GPU (e.g., 70B models on 16GB GPUs)
- ✅ You have multiple machines with network connectivity
- ✅ You can tolerate slower inference for capability
Don't use sharding when:
- ❌ Model fits on a single GPU (use task distribution instead)
- ❌ You need maximum inference speed
- ❌ Network latency is high (>10ms between machines)
Quick Start: Auto-Setup (Easiest)
from sollol.sync_wrapper import HybridRouter, OllamaPool
# SOLLOL handles all setup automatically
router = HybridRouter(
ollama_pool=OllamaPool.auto_configure(),
enable_distributed=True, # Enable model sharding
auto_setup_rpc=True, # Auto-configure RPC backends
num_rpc_backends=3 # Distribute across 3 machines
)
# Use large model that doesn't fit on one machine
response = router.route_request(
model="llama3.1:70b", # Automatically sharded across backends
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response['message']['content'])
What happens automatically:
- SOLLOL discovers available RPC backends on your network
- Extracts the GGUF model from Ollama storage
- Starts llama-server coordinator with optimal settings
- Distributes model layers across backends
- Routes your request to the coordinator
RPC Server Auto-Installation
SOLLOL can automatically clone, build, and start llama.cpp RPC servers for you!
One-line installation:
from sollol.rpc_auto_setup import auto_setup_rpc_backends
# Automatically: clone → build → start RPC servers
backends = auto_setup_rpc_backends(num_backends=2)
# Output: [{'host': '127.0.0.1', 'port': 50052}, {'host': '127.0.0.1', 'port': 50053}]
What this does:
- ✅ Scans network for existing RPC servers
- ✅ If none found: clones llama.cpp to ~/llama.cpp
- ✅ Builds llama.cpp with RPC support (cmake -DGGML_RPC=ON)
- ✅ Starts RPC servers on ports 50052-50053
- ✅ Returns ready-to-use backend list
CLI installation:
# Full automated setup (clone + build + install systemd service)
python3 -m sollol.setup_llama_cpp --all
# Or step by step
python3 -m sollol.setup_llama_cpp --clone # Clone llama.cpp
python3 -m sollol.setup_llama_cpp --build # Build with RPC support
python3 -m sollol.setup_llama_cpp --start # Start RPC server
Docker IP Resolution:
SOLLOL automatically resolves Docker container IPs to accessible host IPs:
# If Docker container reports IP 172.17.0.5:11434
# SOLLOL automatically resolves to:
# → 127.0.0.1:11434 (published port mapping)
# → host IP (if accessible)
# → Docker host gateway
from sollol import is_docker_ip, resolve_docker_ip
# Check if IP is Docker internal
is_docker = is_docker_ip("172.17.0.5") # True
# Resolve Docker IP to accessible IP
accessible_ip = resolve_docker_ip("172.17.0.5", port=11434)
# Returns: "127.0.0.1" or host IP
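As a rough idea of what a "Docker-internal" check can look like, the sketch below uses the standard `ipaddress` module. It is an assumption-laden illustration, not the logic behind `is_docker_ip`: the helper name `looks_like_docker_ip` is hypothetical, and treating all of 172.16.0.0/12 as Docker-internal is a simplification that also matches ordinary private LANs in that range.

```python
import ipaddress

# Assumption: 172.16.0.0/12 contains Docker's default bridge network
# (172.17.0.0/16); a real check may use a narrower or configurable range.
DOCKER_LIKE = ipaddress.ip_network("172.16.0.0/12")


def looks_like_docker_ip(ip: str) -> bool:
    """Return True if the address falls inside the assumed Docker range."""
    try:
        return ipaddress.ip_address(ip) in DOCKER_LIKE
    except ValueError:
        # Not a valid IP address at all (e.g. a hostname).
        return False


for candidate in ("172.17.0.5", "10.9.66.45", "gpu-1.local"):
    print(candidate, looks_like_docker_ip(candidate))
```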
Network Discovery with Docker Support:
from sollol import OllamaPool
# Auto-discover nodes (automatically resolves Docker IPs)
pool = OllamaPool.auto_configure()
# Manual control
from sollol.discovery import discover_ollama_nodes
nodes = discover_ollama_nodes(auto_resolve_docker=True)
Multi-Node Production Setup:
For distributed clusters, use systemd services on each node:
# On each RPC node
sudo systemctl enable llama-rpc@50052.service
sudo systemctl start llama-rpc@50052.service
See SOLLOL_RPC_SETUP.md for complete installation guide.
Architecture: How It Works
┌────────────────────────────────────────────┐
│ Llama 3.1 70B Model (40GB total) │
│ Distributed Sharding │
└────────────────────────────────────────────┘
│
┌────────────┼────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Machine 1 │ │ Machine 2 │ │ Machine 3 │
│ Layers 0-26 │ │ Layers 27-53 │ │ Layers 54-79 │
│ (~13GB) │ │ (~13GB) │ │ (~13GB) │
│ RPC Backend │ │ RPC Backend │ │ RPC Backend │
└──────────────┘ └──────────────┘ └──────────────┘
▲ ▲ ▲
└────────────┼────────────┘
│
┌──────────┴──────────┐
│ llama-server │
│ Coordinator │
│ (Port 18080) │
└─────────────────────┘
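The layer ranges in the diagram correspond to an even split of 80 layers over 3 backends. The sketch below reproduces that arithmetic; `split_layers` is a hypothetical helper, and an even split is an assumption, since llama.cpp may weight the split by each backend's available memory.

```python
def split_layers(n_layers: int, n_backends: int) -> list[tuple[int, int]]:
    """Split layer indices into contiguous, near-equal inclusive ranges."""
    if not 0 < n_backends <= n_layers:
        raise ValueError("need between 1 and n_layers backends")
    base, extra = divmod(n_layers, n_backends)
    ranges, start = [], 0
    for i in range(n_backends):
        # The first `extra` backends take one additional layer each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


print(split_layers(80, 3))  # [(0, 26), (27, 53), (54, 79)]
```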
Manual Setup (Advanced)
For explicit control over RPC backends:
from sollol.llama_cpp_coordinator import LlamaCppCoordinator
from sollol.rpc_registry import RPCBackendRegistry
# 1. Register RPC backends explicitly
registry = RPCBackendRegistry()
registry.add_backend("rpc_1", "grpc://10.9.66.45:50052")
registry.add_backend("rpc_2", "grpc://10.9.66.46:50052")
registry.add_backend("rpc_3", "grpc://10.9.66.47:50052")
# 2. Create coordinator
coordinator = LlamaCppCoordinator(
coordinator_port=18080,
rpc_backends=registry.get_all_backends(),
context_size=4096,
gpu_layers=-1 # Use all available GPU layers
)
# 3. Start and use
await coordinator.start(model_name="llama3.1:70b")
response = await coordinator.generate(
prompt="Explain the theory of relativity",
max_tokens=500
)
Performance Expectations
| Model Size | Single GPU | Sharded (3 nodes) | Trade-off |
|---|---|---|---|
| 13B | ✅ 20 tok/s | ✅ 5 tok/s | -75% speed, works on 3×smaller GPUs |
| 70B | ❌ OOM | ⚠️ 3-5 tok/s (est.) | Enables model that won't run otherwise |
Trade-offs:
- 🐌 Startup: 2-5 minutes (model distribution + loading)
- 🐌 Inference: ~4x slower than local (network overhead)
- ✅ Capability: Run models that won't fit on single GPU
Learn More:
- 📖 Complete llama.cpp Guide - Setup, optimization, troubleshooting
- 💻 Working Examples - 5 complete examples including conversation, batch processing, error handling
5. Batch Processing API
New in v0.7.0: RESTful API for asynchronous batch job management.
Submit large-scale batch operations (thousands of embeddings, bulk inference) and track progress via job IDs:
import requests
# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
"model": "nomic-embed-text",
"documents": ["Document 1", "Document 2", ...], # Can be thousands
"metadata": {"source": "knowledge_base"} # Optional metadata
})
job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")
# Poll for job status
import time
while True:
status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}").json()
progress = status["progress"]["percent"]
print(f"Progress: {progress}%")
if status["status"] == "completed":
break
time.sleep(1)
# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}").json()
embeddings = results["results"] # List of embedding vectors
print(f"Processed {len(embeddings)} documents in {status['duration_seconds']}s")
Available Batch Endpoints:
- POST /api/batch/embed - Submit batch embedding job
- GET /api/batch/jobs/{job_id} - Get job status
- GET /api/batch/results/{job_id} - Get job results
- GET /api/batch/jobs?limit=100 - List recent jobs
- DELETE /api/batch/jobs/{job_id} - Cancel job
Use cases:
- Embedding large document collections (thousands of documents)
- Bulk inference for batch predictions
- Background processing without blocking
- Long-running operations with progress tracking
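The polling loop in the batch example waits only for a "completed" status, so it would spin forever if a job failed. A more defensive variant might look like the sketch below. The `wait_for_job` helper is hypothetical, and the "failed" and "cancelled" status names are assumptions; only "completed" appears in this README. The status source is passed in as a callable so the sketch runs without a server.

```python
import time


def wait_for_job(get_status, timeout_s=300.0, interval_s=1.0):
    """Poll get_status() until the job completes, fails, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status["status"] == "completed":
            return status
        if status["status"] in ("failed", "cancelled"):  # assumed state names
            raise RuntimeError(f"job ended with status {status['status']!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError("batch job did not finish in time")
        time.sleep(interval_s)


# Demo with a fake status source instead of an HTTP call.
_fake = iter([{"status": "running"}, {"status": "running"}, {"status": "completed"}])
final = wait_for_job(lambda: next(_fake), timeout_s=5, interval_s=0.01)
print(final["status"])
```

In real use, `get_status` would wrap the `requests.get(...)` call to the job status endpoint shown above.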
6. SOLLOL Detection
New in v0.3.6: Detect if SOLLOL is running vs native Ollama.
import requests
def is_sollol(url="http://localhost:11434"):
"""Check if SOLLOL is running at the given URL."""
# Method 1: Check X-Powered-By header
response = requests.get(url)
if response.headers.get("X-Powered-By") == "SOLLOL":
return True
# Method 2: Check health endpoint
response = requests.get(f"{url}/api/health")
data = response.json()
if data.get("service") == "SOLLOL":
return True
return False
# Use it
if is_sollol("http://localhost:11434"):
print("✓ SOLLOL detected - using intelligent routing")
else:
print("Native Ollama detected")
Why this matters:
- Enables graceful fallback in client applications
- Makes SOLLOL a true drop-in replacement
- Clients can auto-detect and use SOLLOL features when available
7. Production Gateway
from sollol import SOLLOL, SOLLOLConfig
# Full production setup
config = SOLLOLConfig(
ray_workers=4,
dask_workers=2,
hosts=["gpu-1:11434", "gpu-2:11434", "cpu-1:11434"],
gateway_port=8000,
metrics_port=9090
)
sollol = SOLLOL(config)
sollol.start() # Blocks and runs gateway
# Access via HTTP:
# curl http://localhost:8000/api/chat -d '{...}'
# curl http://localhost:8000/api/stats
# curl http://localhost:8000/api/dashboard
🎓 Use Cases
1. Multi-Agent AI Systems (SynapticLlamas, CrewAI, AutoGPT)
Problem: Running 10 agents sequentially takes 10x longer than necessary.
Solution: SOLLOL distributes agents across nodes in parallel.
# Before: Sequential execution on one node
# After: Parallel execution with SOLLOL
pool = OllamaPool.auto_configure()
agents = await asyncio.gather(*[
pool.chat(model="llama3.2", messages=agent_prompts[i])
for i in range(10)
])
# Speedup depends on number of available nodes and their capacity
2. Large Model Inference
Problem: Your model doesn't fit in available VRAM.
Solution: SOLLOL can shard models across multiple machines via llama.cpp.
# Distribute model across multiple nodes
# Note: Verified with 13B models; larger models not extensively tested
router = HybridRouter(
enable_distributed=True,
num_rpc_backends=4
)
# Trade-off: Slower startup/inference but enables running larger models
3. Mixed Workloads
Problem: Different tasks need different resources.
Solution: SOLLOL routes each task to the optimal node.
pool = OllamaPool.auto_configure()
# Heavy generation → GPU node
chat = pool.chat(model="llama3.1:70b", messages=[...])
# Fast embeddings → CPU node
embeddings = pool.embed(model="nomic-embed-text", input=[...])
# SOLLOL automatically routes each to the best available node
4. High Availability Production
Problem: Node failures break your service.
Solution: SOLLOL auto-fails over and recovers.
# Node A fails mid-request
# ✅ SOLLOL automatically:
# 1. Detects failure
# 2. Retries on Node B
# 3. Marks Node A as degraded
# 4. Periodically re-checks Node A
# 5. Restores Node A when healthy
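The five steps above can be sketched with mock nodes. This is a simplified illustration of the retry-and-degrade idea, not SOLLOL's failover code: `Node`, `send_with_failover`, and `recheck` are hypothetical names, and the health probe is faked with a boolean flag.

```python
class Node:
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.degraded = name, healthy, False

    def send(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unreachable")
        return f"{self.name} handled {request}"


def send_with_failover(nodes, request):
    """Try healthy-looking nodes first; mark failures as degraded and move on."""
    for node in sorted(nodes, key=lambda n: n.degraded):
        try:
            return node.send(request)
        except ConnectionError:
            node.degraded = True  # steps 1 and 3: detect failure, mark degraded
    raise RuntimeError("all nodes failed")


def recheck(nodes):
    """Steps 4 and 5: probe degraded nodes and restore those that recovered."""
    for node in nodes:
        if node.degraded and node.healthy:  # stands in for a real health probe
            node.degraded = False


a, b = Node("A", healthy=False), Node("B")
print(send_with_failover([a, b], "req-1"))  # retried on B (step 2)
a.healthy = True                            # node A comes back
recheck([a, b])
print(send_with_failover([a, b], "req-2"))
```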
Simulate Failure & Recovery
Want to see SOLLOL's automatic failover in action? Run the included simulation:
python test_failure_recovery.py
What the simulation does:
- Starts 3 mock Ollama nodes
- Sends baseline requests (all nodes healthy)
- Kills node #1 mid-execution
- Continues sending requests (SOLLOL routes around failed node)
- Restores node #1
- Resumes sending requests (traffic returns to recovered node)
Expected output:
STEP 1: Starting Mock Nodes
✅ Started 3 mock nodes
BASELINE: Requests with all nodes healthy
Request 1: ✓ Routed to localhost:21434
Request 2: ✓ Routed to localhost:21435
...
STEP 3: Simulating Node Failure (killing node 0)
Killing node on port 21434...
✅ Node 21434 terminated
STEP 4: Requests after node failure (observe failover)
Request 1: ✓ Routed to localhost:21435 ← Automatically avoided dead node
Request 2: ✓ Routed to localhost:21436
...
STEP 5: Simulating Node Recovery
✅ Node 21434 recovered successfully
✅ Key Observations:
1. Requests succeeded even after node failure
2. SOLLOL automatically routed around the dead node
3. Node recovered and rejoined the pool
4. Traffic resumed to recovered node
This demonstrates SOLLOL's production-grade resilience without needing real infrastructure.
📊 Performance & Benchmarks
Validation Status
What's Been Validated ✅
- Single-node baseline performance measured
- Code exists and is reviewable (75+ modules)
- Tests pass in CI (57 tests, coverage tracked)
- Architecture implements intelligent routing
What Needs Validation ⚠️
- Comparative benchmarks (SOLLOL vs round-robin)
- Multi-node performance improvements
- Real-world latency/throughput gains
📖 See BENCHMARKING.md for complete validation roadmap and how to run comparative tests.
Measured Baseline Performance
Single Ollama Node (llama3.2-3B, 50 requests, concurrency=5):
- ✅ Success Rate: 100%
- ⚡ Throughput: 0.51 req/s
- 📈 Average Latency: 5,659 ms
- 📈 P95 Latency: 11,299 ms
- 📈 P99 Latency: 12,259 ms
Hardware: Single Ollama instance with 75+ models loaded
Data: See benchmarks/results/ for raw JSON
Run Your Own:
# Baseline test (no cluster needed)
python benchmarks/simple_ollama_benchmark.py llama3.2 50
# Comparative test (requires docker-compose)
docker-compose up -d
python benchmarks/run_benchmarks.py --sollol-url http://localhost:8000 --duration 60
Projected Performance (Unvalidated)
Note: These are architectural projections, not measured results. Requires multi-node cluster setup for validation.
Theory: With N nodes and parallelizable workload:
- Task distribution can approach N× parallelization (limited by request rate)
- Intelligent routing should reduce tail latencies vs random selection
- Resource-aware placement reduces contention and failures
Reality: Requires multi-node cluster validation. See BENCHMARKING.md for test procedure and CODE_WALKTHROUGH.md for implementation details.
Model Sharding Performance
| Model | Single 24GB GPU | SOLLOL (3×16GB) | Status |
|---|---|---|---|
| 13B | ✅ ~20 tok/s | ✅ ~5 tok/s | ✅ Verified working |
| 70B | ❌ OOM | ⚠️ Estimated ~3-5 tok/s | ⚠️ Not extensively tested |
When to use sharding: When model doesn't fit on your largest GPU. You trade speed for capability.
Performance trade-offs: Distributed inference takes 2-5 minutes to start and runs about 4x slower than local inference. Use it only when the model will not fit on a single GPU.
Overhead
- Routing decision: ~5-10ms (tested with 5-10 nodes)
- Network overhead: Varies by network (typically 5-20ms)
- Total added latency: ~20-50ms
- Benefit: Better resource utilization + automatic failover
🛠️ Advanced Configuration
Custom Routing Strategy
from sollol import OllamaPool
pool = OllamaPool(
nodes=[
{"host": "gpu-1.local", "port": 11434, "priority": 10}, # Prefer this
{"host": "gpu-2.local", "port": 11434, "priority": 5},
{"host": "cpu-1.local", "port": 11434, "priority": 1}, # Last resort
],
enable_intelligent_routing=True,
enable_hedging=True, # Duplicate critical requests
max_queue_size=100
)
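Request hedging generally means sending the same request to more than one node and keeping whichever answer arrives first. The sketch below shows that general technique with asyncio; it is not a description of what `enable_hedging` does internally, and `hedged` and `fake_request` are hypothetical names with made-up delays.

```python
import asyncio


async def hedged(request_fn, nodes):
    """Send the same request to every node and return the first result."""
    tasks = [asyncio.create_task(request_fn(node)) for node in nodes]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Cancel the slower duplicates and wait for the cancellations to settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    # Note: if the first task to finish raised, this re-raises its exception.
    return done.pop().result()


async def fake_request(node):
    delay = {"gpu-1.local": 0.05, "gpu-2.local": 0.01}[node]  # made-up latencies
    await asyncio.sleep(delay)
    return node


winner = asyncio.run(hedged(fake_request, ["gpu-1.local", "gpu-2.local"]))
print(f"fastest node: {winner}")
```

Hedging trades extra load for lower tail latency, which is why it is usually reserved for critical requests.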
Priority-Based Scheduling
# Critical user-facing request
response = pool.chat(
model="llama3.2",
messages=[...],
priority=10 # Highest priority
)
# Background batch job
response = pool.chat(
model="llama3.2",
messages=[...],
priority=1 # Lowest priority
)
# SOLLOL ensures high-priority requests jump the queue
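"Jumping the queue" can be pictured with a standard heap-based priority queue. The sketch below is a generic illustration, assuming higher numbers mean higher priority as in this README; `PriorityRequestQueue` is a hypothetical class, and it does not model the fairness mechanism mentioned earlier.

```python
import heapq
import itertools


class PriorityRequestQueue:
    """Pops the highest-priority request first; FIFO among equal priorities."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # insertion order, used as tie-break

    def put(self, request, priority=5):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), request))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = PriorityRequestQueue()
queue.put("batch job", priority=1)
queue.put("user chat", priority=10)
queue.put("second batch job", priority=1)
order = [queue.get() for _ in range(len(queue))]
print(order)
```

A production scheduler would typically add aging so that low-priority requests are not starved indefinitely.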
Observability & Monitoring
Zero-Config Auto-Registration 🎯
SOLLOL provides automatic observability with zero configuration required. All applications automatically register with the dashboard when they create an OllamaPool:
from sollol import OllamaPool
# Creates pool AND auto-registers with dashboard (if running)
pool = OllamaPool.auto_configure()
# ✅ Application automatically appears in dashboard at http://localhost:8080
How it works:
- OllamaPool automatically detects if a dashboard is running on port 8080
- Auto-discovers RPC backends and Ollama nodes
- Registers application with metadata (node count, GPU info, etc.)
- Sends periodic heartbeats to maintain "alive" status
- No manual DashboardClient setup needed!
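The periodic heartbeat can be pictured as a small background thread that calls a send function at a fixed interval until it is closed. This is a generic sketch, not the `DashboardClient` implementation: the `Heartbeat` class is hypothetical and the send function here just records timestamps instead of contacting a dashboard.

```python
import threading
import time


class Heartbeat:
    """Calls send() every interval_s seconds on a daemon thread until closed."""

    def __init__(self, send, interval_s=5.0):
        self._send = send
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Event.wait acts as an interruptible sleep; it returns True once stopped.
        while not self._stop.wait(self._interval_s):
            self._send()

    def close(self):
        self._stop.set()
        self._thread.join()


beats = []
hb = Heartbeat(lambda: beats.append(time.monotonic()), interval_s=0.02)
time.sleep(0.15)  # application work would happen here
hb.close()
print(f"sent {len(beats)} heartbeats")
```

Using a daemon thread means a crashed application simply stops sending heartbeats, which lets the dashboard mark it as no longer alive.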
Architecture:
- ONE persistent dashboard service runs independently
- Multiple applications (SynapticLlamas, FlockParser, etc.) auto-register
- Dashboard survives application exits
- Zero-config auto-discovery of nodes and RPC backends
Custom Application Names 🏷️
By default, applications register as "OllamaPool (hostname)". To give your application a custom name in the dashboard:
from sollol import OllamaPool
# Register with custom application name
pool = OllamaPool(
nodes=[{"host": "localhost", "port": 11434}],
enable_intelligent_routing=True,
app_name="MyApplication" # Shows as "MyApplication" in dashboard
)
Example - Multi-application setup:
# Application 1: FlockParser
from sollol import OllamaPool
pool = OllamaPool.auto_configure(app_name="FlockParser")
# Dashboard shows: "FlockParser"
# Application 2: SynapticLlamas
from sollol.dashboard_client import DashboardClient
dashboard_client = DashboardClient(
app_name="SynapticLlamas",
router_type="IntelligentRouter",
version="1.0.0",
dashboard_url="http://localhost:8080",
metadata={"agents": 3, "distributed": True},
auto_register=True
)
# Dashboard shows: "SynapticLlamas"
Why use custom names?
- Distinguish between multiple applications using SOLLOL
- Better visibility in multi-tenant environments
- Easier debugging and monitoring
- Professional dashboard presentation
Manual/Programmatic Registration 🔧
For applications that don't use OllamaPool or need custom registration logic, use DashboardClient directly:
from sollol.dashboard_client import DashboardClient
# Create dashboard client with custom metadata
dashboard_client = DashboardClient(
app_name="CustomApplication",
router_type="CustomRouter", # Or "OllamaPool", "HybridRouter", etc.
version="1.0.0",
dashboard_url="http://localhost:8080",
metadata={
# Custom metadata shown in dashboard
"nodes": 5,
"distributed": True,
"custom_field": "value"
},
auto_register=True # Registers immediately
)
# Dashboard client automatically sends heartbeats every 5 seconds
# to keep application status as "active"
# When application exits, clean up:
dashboard_client.close() # Stops heartbeat thread
Advanced: Custom Heartbeat Logic
from sollol.dashboard_client import DashboardClient
import time
# Create client without auto-registration
dashboard_client = DashboardClient(
app_name="BackgroundWorker",
router_type="WorkerPool",
version="2.0.0",
dashboard_url="http://localhost:8080",
metadata={"worker_count": 10},
auto_register=False # Don't register yet
)
# Register when ready
dashboard_client.register()
# Update metadata dynamically
dashboard_client.update_metadata({"worker_count": 15, "status": "processing"})
# Send manual heartbeat
dashboard_client.heartbeat()
# Application logic here...
time.sleep(60)
# Deregister when done
dashboard_client.deregister()
dashboard_client.close()
Use cases for manual registration:
- Custom routers or load balancers
- Background workers or daemons
- Applications that need dynamic metadata updates
- Testing and debugging
- Applications without OllamaPool
Registration Methods Comparison 📊
| Method | Use Case | Complexity | Customization |
|---|---|---|---|
| Auto-registration | Standard SOLLOL applications | ✅ Zero config | Limited (app_name only) |
| Custom app_name | Multiple apps, better naming | ✅ One parameter | App name |
| Manual DashboardClient | Custom applications | ⚠️ More code | Full control |
Quick decision guide:
- Using OllamaPool? → Use app_name parameter
- Need custom metadata? → Use DashboardClient directly
- Need dynamic updates? → Use DashboardClient with manual heartbeats
- Just want it to work? → Use auto-registration (default)
Persistent Dashboard Service
Start the persistent dashboard once (survives application exits):
# Start dashboard service (runs until stopped)
python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379
# Or run in background
nohup python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379 > /tmp/dashboard_service.log 2>&1 &
Features:
- 📊 Real-time metrics: System status, latency, success rate, GPU memory, Ray workers
- 📜 Live log streaming: WebSocket-based log tailing (via Redis pub/sub)
- 🌐 Activity monitoring: Ollama server and llama.cpp RPC activity
- 🔷 Embedded Ray dashboard: Task-level distributed tracing
- 📈 Embedded Dask dashboard: Performance profiling and task graphs
- 🔍 Auto-discovery: Automatically discovers Ollama nodes and RPC backends when no router context
Embedded Dashboard (Alternative)
Applications can also start their own embedded dashboards:
from sollol import run_unified_dashboard
import threading
# Start embedded dashboard with router context
dashboard_thread = threading.Thread(
target=run_unified_dashboard,
kwargs={
"router": pool, # Provides node/backend context
"dashboard_port": 8080,
"host": "0.0.0.0",
"enable_dask": False
},
daemon=True
)
dashboard_thread.start()
Environment Variables (configure before initializing):
# Disable the dashboard (enabled by default)
export SOLLOL_DASHBOARD=false
# Change dashboard port (default: 8080)
export SOLLOL_DASHBOARD_PORT=9090
# Disable Dask dashboard integration (enabled by default)
export SOLLOL_DASHBOARD_DASK=false
Multi-Application Pattern ✨
The persistent dashboard service enables multiple applications to share observability:
# Terminal 1: Start persistent dashboard
python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379
# Terminal 2: Start application 1
python my_app1.py # Auto-registers with dashboard
# Terminal 3: Start application 2
python my_app2.py # Also auto-registers
# Visit http://localhost:8080 to see both applications!
Benefits:
- Single dashboard for all SOLLOL-based applications
- Dashboard stays running when applications exit
- Aggregated logs from all applications (via Redis pub/sub)
- Centralized observability for distributed systems
Programmatic Stats Access
# Get detailed stats
stats = pool.get_stats()
print(f"Total requests: {stats['total_requests']}")
print(f"Average latency: {stats['avg_latency_ms']}ms")
print(f"Success rate: {stats['success_rate']:.2%}")
# Per-node breakdown
for host, metrics in stats['hosts'].items():
print(f"{host}: {metrics['latency_ms']}ms, {metrics['success_rate']:.2%}")
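As a small follow-on, the per-node breakdown can be ranked to spot slow nodes. This is a sketch that assumes only the dictionary shape used above (`stats['hosts'][host]['latency_ms']`); `slowest_hosts` is a hypothetical helper and the sample values are made up.

```python
def slowest_hosts(stats, limit=3):
    """Return up to `limit` (host, latency_ms) pairs, slowest first."""
    per_host = stats.get("hosts", {})
    ranked = sorted(
        ((host, metrics["latency_ms"]) for host, metrics in per_host.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:limit]


# Made-up sample in the same shape as the stats printed above
sample_stats = {
    "hosts": {
        "gpu-1:11434": {"latency_ms": 234.0, "success_rate": 0.98},
        "gpu-2:11434": {"latency_ms": 410.5, "success_rate": 0.95},
        "cpu-1:11434": {"latency_ms": 1200.0, "success_rate": 0.99},
    }
}

for host, latency in slowest_hosts(sample_stats, limit=2):
    print(f"{host}: {latency}ms")
```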
Prometheus Metrics
# Prometheus metrics endpoint
curl http://localhost:9090/metrics
# sollol_requests_total{host="gpu-1:11434",model="llama3.2"} 1234
# sollol_latency_seconds{host="gpu-1:11434"} 0.234
# sollol_success_rate{host="gpu-1:11434"} 0.98
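If you want to consume these metrics from a script rather than a Prometheus server, the text format is simple enough to parse by hand. The sketch below is a deliberately simplified parser (it ignores comment lines, timestamps, and escaped quotes in label values); `parse_metrics` is a hypothetical helper, and the sample text reuses the metric names shown above.

```python
import re

# name{label="value",...} value   (the label block is optional)
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


sample_text = """\
sollol_requests_total{host="gpu-1:11434",model="llama3.2"} 1234
sollol_latency_seconds{host="gpu-1:11434"} 0.234
sollol_success_rate{host="gpu-1:11434"} 0.98
"""

for name, labels, value in parse_metrics(sample_text):
    print(name, labels, value)
```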
🔌 Integration Examples
🔗 Integration with SynapticLlamas & FlockParser
SOLLOL is the distributed inference platform for the complete AI ecosystem, powering both SynapticLlamas (multi-agent orchestration) and FlockParser (document RAG).
The Complete Stack
┌─────────────────────────────────────────────────────────────┐
│ SynapticLlamas (v0.1.0+) │
│ Multi-Agent System & Orchestration │
│ • Research agents • Editor agents • Storyteller agents │
└───────────┬────────────────────────────────────┬───────────┘
│ │
│ RAG Queries │ Distributed
│ (with pre-computed embeddings) │ Inference
│ │
┌──────▼──────────┐ ┌─────────▼────────────┐
│ FlockParser │ │ SOLLOL │
│ API (v1.0.4+) │ │ Load Balancer │
│ Port: 8000 │ │ (v0.9.31+) │
└─────────────────┘ └──────────────────────┘
│ │
│ ChromaDB │ Intelligent
│ Vector Store │ GPU/CPU Routing
│ │
┌──────▼──────────┐ ┌─────────▼────────────┐
│ Knowledge Base │ │ Ollama Nodes │
│ 41 Documents │ │ (Distributed) │
│ 6,141 Chunks │ │ GPU + CPU │
└─────────────────┘ └──────────────────────┘
Why This Integration Matters
| Component | Role | Key Feature |
|---|---|---|
| SOLLOL | Distributed Inference | Intelligent GPU/CPU routing with load balancing |
| SynapticLlamas | Multi-Agent Orchestration | Research, Editor, Storyteller agents |
| FlockParser | Document RAG & Knowledge Base | ChromaDB vector store with 6,141+ chunks |
Quick Start: Complete Ecosystem
# Install all three packages (auto-installs dependencies)
pip install synaptic-llamas # Pulls in flockparser>=1.0.4 and sollol>=0.9.31
# Start FlockParser API
flockparse
# Run SynapticLlamas with SOLLOL + FlockParser integration
synaptic-llamas --interactive --distributed
Integration Example: Load Balanced RAG
from sollol import OllamaPool
from flockparser_adapter import FlockParserAdapter
# Initialize SOLLOL for distributed inference
sollol = OllamaPool.auto_configure()
# Initialize FlockParser adapter
flockparser = FlockParserAdapter("http://localhost:8000", remote_mode=True)
# Step 1: Generate embedding using SOLLOL (load balanced!)
user_query = "What does research say about quantum entanglement?"
embedding = sollol.embed(
model="mxbai-embed-large",
input=user_query
)
# SOLLOL routes to fastest GPU automatically
# Step 2: Query FlockParser with pre-computed embedding
rag_results = flockparser.query_remote(
query=user_query,
embedding=embedding, # Skip FlockParser's embedding generation
n_results=5
)
# FlockParser returns relevant chunks from 41 documents
# Performance gain: 2-5x faster when SOLLOL has faster nodes!
Production Integrations
SOLLOL is actively used in production by:
- FlockParser - Document RAG Intelligence with distributed processing. FlockParser's legacy load balancing code was refactored and became core SOLLOL logic. FlockParser now uses SOLLOL directly via `OllamaPool` for intelligent routing across document embeddings and LLM queries.
- SynapticLlamas - Multi-agent collaborative research framework. Uses SOLLOL's `HybridRouter` for distributed agent execution with RAG-enhanced research capabilities via FlockParser integration.
Related Projects:
- SynapticLlamas - Multi-Agent Orchestration
- FlockParser - Document RAG Intelligence
SynapticLlamas Integration
from sollol import SOLLOL, SOLLOLConfig
from synaptic_llamas import AgentOrchestrator
# Setup SOLLOL for multi-agent orchestration
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start(blocking=False)
# SynapticLlamas now uses SOLLOL for intelligent routing
orchestrator = AgentOrchestrator(
llm_endpoint="http://localhost:8000/api/chat"
)
# All agents automatically distributed and optimized
orchestrator.run_parallel_agents([...])
FlockParser Integration
from sollol import OllamaPool
# FlockParser uses SOLLOL's OllamaPool directly
pool = OllamaPool(
nodes=None, # Auto-discover all Ollama nodes
enable_intelligent_routing=True,
exclude_localhost=True,
discover_all_nodes=True,
app_name="FlockParser",
enable_ray=True
)
# All FlockParser document embeddings and queries route through SOLLOL
embeddings = pool.embed(model="mxbai-embed-large", input="document text")
response = pool.chat(model="llama3.2", messages=[{"role": "user", "content": "query"}])
LangChain Integration
from langchain.llms import Ollama
from sollol import OllamaPool
# Use SOLLOL as LangChain backend
pool = OllamaPool.auto_configure()
llm = Ollama(
base_url="http://localhost:8000",
model="llama3.2"
)
# LangChain requests now go through SOLLOL
response = llm("What is quantum computing?")
🏭 Production Deployment (Bare Metal)
For teams preferring bare metal infrastructure over containers, SOLLOL provides systemd-based deployment for production environments.
Multi-Node Bare Metal Setup
This setup assumes you have 3+ physical machines with Ollama installed. We'll configure SOLLOL as a centralized routing layer.
Architecture:
┌─────────────────────────────────────────┐
│ Central Router Machine (Control Plane) │
│ - SOLLOL Dashboard (port 8080) │
│ - Redis (port 6379) │
│ - Optional: GPU reporter │
└────────────┬────────────────────────────┘
│ Auto-discovery via network
│ scan (ports 11434)
┌───────┼──────────┬─────────────┐
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │ │ Node N │
│ Ollama │ │ Ollama │ │ Ollama │ │ Ollama │
│ :11434 │ │ :11434 │ │ :11434 │ │ :11434 │
│ GPU 24GB│ │ GPU 16GB│ │ CPU 64c │ │ ... │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
Step 1: Install Ollama on each node
On each worker node (Node 1, 2, 3, ...):
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Start Ollama service
sudo systemctl enable ollama
sudo systemctl start ollama
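# Verify it's running
curl http://localhost:11434/api/tags
# Ollama listens on 127.0.0.1 by default; set OLLAMA_HOST=0.0.0.0 in the
# ollama service environment so the control plane can reach this node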
Step 2: Install SOLLOL on control plane machine
On your central router machine:
# Install SOLLOL and dependencies
pip install sollol redis
# Install Redis
sudo apt-get install redis-server # Ubuntu/Debian
# OR
sudo yum install redis # RHEL/CentOS
# Start Redis
sudo systemctl enable redis
sudo systemctl start redis
Step 3: Create systemd service for SOLLOL Dashboard
Create /etc/systemd/system/sollol-dashboard.service:
[Unit]
Description=SOLLOL Dashboard Service
After=network.target redis.service
Requires=redis.service
[Service]
Type=simple
# Run as a dedicated user for security (systemd does not allow trailing comments on a setting)
User=sollol
Group=sollol
WorkingDirectory=/opt/sollol
Environment="SOLLOL_DASHBOARD=true"
Environment="SOLLOL_DASHBOARD_PORT=8080"
Environment="REDIS_URL=redis://localhost:6379"
ExecStart=/usr/bin/python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Enable and start:
sudo useradd -r -s /bin/false sollol # Create dedicated user
sudo mkdir -p /opt/sollol
sudo chown sollol:sollol /opt/sollol
sudo systemctl daemon-reload
sudo systemctl enable sollol-dashboard
sudo systemctl start sollol-dashboard
# Verify
sudo systemctl status sollol-dashboard
curl http://localhost:8080/health
Step 4: Install GPU reporters on nodes (optional but recommended)
On each GPU node for accurate VRAM monitoring:
# Install on each node with GPUs
pip install sollol gpustat
# Run GPU reporter (publishes to central Redis)
sollol install-gpu-reporter --redis-host <control-plane-ip>
# Example: run on the GPU node at 10.9.66.45, reporting to the control plane at 10.9.66.154
sollol install-gpu-reporter --redis-host 10.9.66.154
Create /etc/systemd/system/sollol-gpu-reporter.service on each GPU node:
[Unit]
Description=SOLLOL GPU Reporter
After=network.target
[Service]
Type=simple
User=sollol
ExecStart=/usr/local/bin/sollol-gpu-reporter --redis-host <control-plane-ip> --interval 5
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Step 5: Configure firewall rules
On all nodes:
# Allow Ollama traffic (port 11434)
sudo ufw allow 11434/tcp comment "Ollama API"
# On control plane only: allow dashboard access
sudo ufw allow 8080/tcp comment "SOLLOL Dashboard"
sudo ufw allow from 10.9.66.0/24 to any port 6379 proto tcp comment "Redis" # Trusted subnet only; adjust to your network
# Reload firewall
sudo ufw reload
Step 6: Test the deployment
From any machine with network access:
from sollol import OllamaPool
# SOLLOL auto-discovers all nodes via network scan
pool = OllamaPool.auto_configure()
# Verify nodes discovered
stats = pool.get_stats()
print(f"Discovered {stats['active_nodes']} nodes")
# Make a test request
response = pool.chat(
model="llama3.2",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response['message']['content'])
print(f"Routed to: {response['_sollol_routing']['host']}")
Step 7: Monitor with systemd
# Check dashboard status
sudo systemctl status sollol-dashboard
# View live logs
sudo journalctl -u sollol-dashboard -f
# Check GPU reporters
sudo systemctl status sollol-gpu-reporter
# View metrics
curl http://localhost:8080/api/stats | jq
Production Hardening
Security:
# 1. Run SOLLOL as dedicated unprivileged user
sudo useradd -r -s /bin/false sollol
# 2. Configure Redis authentication
sudo vi /etc/redis/redis.conf
# Add: requirepass <strong-password>
# 3. Use firewall to restrict access
sudo ufw allow from 10.9.66.0/24 to any port 6379 # Redis from trusted subnet only
sudo ufw allow from 10.9.66.0/24 to any port 8080 # Dashboard from trusted subnet
High Availability:
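# Restart=always brings the service back after a crash.
# WatchdogSec also restarts a hung service, but only if the service sends
# sd_notify watchdog keepalives; leave it out for services that do not.
[Service]
WatchdogSec=30
Restart=always
RestartSec=10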
Monitoring:
# Integrate with Prometheus
curl http://localhost:9090/metrics
# Or use systemd monitoring
systemctl status sollol-dashboard | grep "Active:"
Troubleshooting
Nodes not discovered:
# Check network connectivity
for ip in 10.9.66.{1..254}; do
timeout 0.5 bash -c "cat < /dev/null > /dev/tcp/$ip/11434 2>/dev/null" && echo "$ip:11434 reachable"
done
# Check Ollama is listening on all interfaces (not just localhost)
curl http://<node-ip>:11434/api/tags
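The same reachability check can be run from Python, which is handy on machines without bash's `/dev/tcp` support. This is a sketch using only the standard library; `is_port_open` and `scan_subnet` are hypothetical helpers, not part of SOLLOL, and the subnet in the usage line is the example network used elsewhere in this guide.

```python
import ipaddress
import socket
from concurrent.futures import ThreadPoolExecutor


def is_port_open(host, port=11434, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def scan_subnet(cidr, port=11434, timeout=0.5, workers=64):
    """Return the host addresses in `cidr` that accept connections on `port`."""
    hosts = [str(ip) for ip in ipaddress.ip_network(cidr, strict=False).hosts()]
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = executor.map(lambda h: is_port_open(h, port, timeout), hosts)
        return [host for host, reachable in zip(hosts, results) if reachable]


if __name__ == "__main__":
    # Example subnet from this guide; replace with your own network
    for host in scan_subnet("10.9.66.0/24"):
        print(f"{host}:11434 reachable")
```

A node that answers here but is still not discovered points at SOLLOL configuration rather than networking; a node that does not answer is usually bound to localhost only or blocked by a firewall.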
Dashboard not starting:
# Check Redis is running
systemctl status redis
redis-cli ping # Should return "PONG"
# Check port not in use
sudo lsof -i :8080
# View detailed logs
journalctl -u sollol-dashboard --since "10 minutes ago"
Performance issues:
# Check node health
curl http://localhost:8080/api/stats | jq '.node_performance'
# Monitor resource usage
htop
nvidia-smi # On GPU nodes
# Check network latency between nodes
ping <node-ip>
📚 Documentation
- Architecture Guide - Deep dive into system design
- Backend Architecture - Backend extensibility and adding new LLM backends
- Batch Processing API - Complete guide to batch job management (NEW in v0.7.0)
  - API endpoints and examples
  - Job lifecycle and progress tracking
  - Best practices and error handling
- llama.cpp Distributed Inference Guide - Complete guide to model sharding
  - Setup and configuration
  - Performance optimization
  - Troubleshooting common issues
  - Advanced topics (custom layer distribution, monitoring, etc.)
- Integration Examples - Practical integration patterns
- llama.cpp Distributed Examples - Model sharding examples
  - Auto-setup and manual configuration
  - Multi-turn conversations with monitoring
  - Batch processing with multiple models
  - Error handling and recovery patterns
- Deployment Guide - Production deployment patterns
- API Reference - Complete API documentation
- Performance Tuning - Optimization guide
- SynapticLlamas Learnings - Features from production use
🆕 What's New in v0.7.0
📦 Batch Processing API
Complete RESTful API for asynchronous batch job management. Submit large-scale batch operations (embeddings, bulk inference) and track progress via job IDs.
import requests
# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
"model": "nomic-embed-text",
"documents": ["doc1", "doc2", ...], # Thousands of documents
})
job_id = response.json()["job_id"]
# Check status
status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}")
print(status.json()["progress"]["percent"]) # 100.0
# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}")
embeddings = results.json()["results"]
Batch API Endpoints:
- `POST /api/batch/embed` - Submit batch embedding job
- `GET /api/batch/jobs/{job_id}` - Get job status with progress tracking
- `GET /api/batch/results/{job_id}` - Retrieve job results and errors
- `DELETE /api/batch/jobs/{job_id}` - Cancel running jobs
- `GET /api/batch/jobs?limit=100` - List recent jobs
Features:
- UUID-based job tracking with 5 states (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
- Automatic TTL-based cleanup (1 hour default)
- Progress tracking: completed_items, failed_items, percentage
- Duration calculation and metadata storage
- Async job execution via Dask distributed processing
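Since jobs move through the five states listed above, a client typically polls until a terminal state is reached. The sketch below shows one way to do that. `wait_for_job` is a hypothetical helper, `fetch_status` is any callable you supply that returns the job as a dict, and the `"status"` field name is an assumption to check against the actual response payload.

```python
import time

# The three states from the list above after which a job no longer changes
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job(fetch_status, poll_interval=1.0, timeout=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status()` until the job reaches a terminal state.

    Assumes the returned dict carries the state under "status" (an assumption).
    Raises TimeoutError if no terminal state is seen within `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_status()
        if job["status"] in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout}s")
        sleep(poll_interval)
```

With `requests`, `fetch_status` could be something like `lambda: requests.get(url).json()` for the job status URL; passing it in as a callable keeps the polling logic testable without a running server.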
⚡ Performance Optimizations (v0.9.18+)
SOLLOL now includes 8 production-grade performance optimizations designed to improve throughput and latency:
⚠️ Transparency Note: These features are implemented and functional, but claimed performance improvements are projections based on architecture, NOT independently validated benchmarks. See Performance Impact section below for details.
🚀 Response Caching Layer
Expected Impact: Reduces latency for repeated queries (cache hit/miss tracking validated)
Intelligent LRU cache with TTL expiration:
from sollol import OllamaPool
# Enable response caching (enabled by default)
pool = OllamaPool.auto_configure(
enable_cache=True,
cache_max_size=1000, # Cache up to 1000 responses
cache_ttl=3600 # 1 hour TTL
)
# First request: normal latency
response1 = pool.embed(model="mxbai-embed-large", input="Hello world")
# Cached request: faster
response2 = pool.embed(model="mxbai-embed-large", input="Hello world") # Cache hit
# Programmatic cache management
pool.clear_cache() # Clear all
pool.invalidate_cache_by_model("llama3.2") # Invalidate by model
cache_data = pool.export_cache() # Export for persistence
pool.import_cache(cache_data) # Restore from export
# Get cache stats
stats = pool.get_cache_stats()
print(f"Hit rate: {stats['hit_rate']:.1%}") # 85.2%
print(f"Cache size: {stats['size']}") # 234/1000
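To make "LRU cache with TTL expiration" concrete, here is a minimal sketch of the general technique. It is not SOLLOL's implementation: `TTLLRUCache` is a hypothetical class, and only the defaults (1000 entries, 3600 seconds) are borrowed from the example above.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Least-recently-used cache whose entries also expire after `ttl` seconds."""

    def __init__(self, max_size=1000, ttl=3600.0, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value), oldest first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            # Missing or expired: drop any stale entry and count a miss
            self._entries.pop(key, None)
            self.misses += 1
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return entry[1]

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A real response cache would key on something like the model name plus a hash of the request payload, so that identical requests map to the same entry.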
🌊 Streaming Support
Expected Impact: Better UX, reduced perceived latency (streaming functionality validated)
Token-by-token streaming for chat() and generate():
# Stream chat responses
for chunk in pool.chat(
model="llama3.2",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True
):
content = chunk.get("message", {}).get("content", "")
print(content, end="", flush=True)
# Stream text generation
for chunk in pool.generate(
model="llama3.2",
prompt="Explain quantum computing",
stream=True
):
print(chunk.get("response", ""), end="", flush=True)
🔥 Smart Model Prefetching
Expected Impact: 1-5 seconds reduced first-request latency (projection, not measured)
Pre-load models into VRAM before first use:
# Warm a single model
pool.warm_model("llama3.2")
# Warm multiple models in parallel
results = pool.warm_models(
models=["llama3.2", "codellama", "mistral"],
parallel=True
)
print(f"Warmed {sum(results.values())} models")
⚡ Async I/O Support
Expected Impact: 2-3x throughput for concurrent requests (projection, not measured)
True non-blocking I/O with httpx AsyncClient:
import asyncio
# Async methods for concurrent requests
async def process_batch():
responses = await asyncio.gather(
pool.chat_async("llama3.2", messages=[...]),
pool.generate_async("llama3.2", prompt="..."),
pool.embed_async("mxbai-embed-large", input="...")
)
return responses
# Run async batch
results = asyncio.run(process_batch())
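The benefit of `asyncio.gather` is easiest to see with simulated I/O. In this sketch `fake_request` is a stand-in that only sleeps, so it illustrates the concurrency pattern and says nothing about real SOLLOL throughput: three 0.1-second waits overlap instead of running back to back.

```python
import asyncio
import time


async def fake_request(name, delay):
    """Stand-in for a network call: waits `delay` seconds, then returns `name`."""
    await asyncio.sleep(delay)
    return name


async def run_concurrently():
    start = time.perf_counter()
    # gather() runs the coroutines concurrently and returns results in call order
    results = await asyncio.gather(
        fake_request("chat", 0.1),
        fake_request("generate", 0.1),
        fake_request("embed", 0.1),
    )
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(run_concurrently())
print(results, f"{elapsed:.2f}s")  # elapsed is close to one delay, not the sum of three
```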
🔗 HTTP/2 Multiplexing
Expected Impact: 30-50% latency reduction for concurrent requests (projection, not measured)
Automatic HTTP/2 support when httpx is installed:
# Automatically uses HTTP/2 if available
pool = OllamaPool.auto_configure()
# Check if HTTP/2 is enabled
stats = pool.get_stats()
print(f"HTTP/2 enabled: {stats['http2_enabled']}") # True
📊 Additional Optimizations
Connection Pool Tuning (10-20% better concurrency):
- Optimized pool sizes: 10-20 connections per node
- Automatic retry with exponential backoff
- Connection reuse with keep-alive
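Retry with exponential backoff, as listed above, generally looks like the sketch below. This is the generic pattern rather than SOLLOL's internal code: `retry_with_backoff` and all of its parameters and defaults are hypothetical.

```python
import random
import time


def retry_with_backoff(call, attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep, jitter=random.random):
    """Run `call()`, retrying on any exception with exponentially growing delays.

    The delay ceiling doubles each attempt (base_delay, 2x, 4x, ...) up to
    `max_delay`; jitter scales each sleep to between 50% and 100% of that ceiling.
    The last failure is re-raised once `attempts` is exhausted.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(ceiling * (0.5 + 0.5 * jitter()))
```

The jitter keeps many clients from retrying against a recovering node at exactly the same moment.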
Adaptive Health Checks (5-10% overhead reduction):
- Dynamic intervals based on node stability:
  - Very stable (<1% failures): 60s interval
  - Stable (<5% failures): 30s interval
  - Degraded (5-15% failures): 15s interval
  - Unstable (>15% failures): 5s interval
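The thresholds above translate directly into a lookup function. This is a sketch of the mapping only; `health_check_interval` is a hypothetical name, and how SOLLOL measures the failure rate (window size, smoothing) is not covered here.

```python
def health_check_interval(failure_rate):
    """Map a node's failure rate (0.0 to 1.0) to a check interval in seconds."""
    if failure_rate < 0.01:
        return 60  # very stable
    if failure_rate < 0.05:
        return 30  # stable
    if failure_rate <= 0.15:
        return 15  # degraded
    return 5       # unstable


for rate in (0.0, 0.03, 0.10, 0.25):
    print(f"{rate:.0%} failures -> check every {health_check_interval(rate)}s")
```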
Telemetry Sampling (~90% overhead reduction):
- Configurable sampling for info-level events (default: 10%)
- Always logs errors and critical events
- Reduces dashboard logging overhead
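Sampling of this kind usually comes down to one decision per event. The sketch below shows the idea under stated assumptions: `should_log` is a hypothetical function, the level names are illustrative, and passing through levels other than info (such as warnings) unsampled is an assumption rather than documented SOLLOL behavior.

```python
import random

ALWAYS_LOGGED = {"ERROR", "CRITICAL"}


def should_log(level, sample_rate=0.10, rng=random.random):
    """Decide whether to emit a telemetry event of the given level."""
    if level in ALWAYS_LOGGED:
        return True  # errors and critical events are never dropped
    if level == "INFO":
        return rng() < sample_rate  # keep roughly `sample_rate` of info events
    return True  # assumption: other levels are not sampled


kept = sum(should_log("INFO") for _ in range(10_000))
print(f"kept {kept} of 10000 info events")
```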
📈 Performance Impact
⚠️ IMPORTANT: These are architectural projections, NOT measured results
These optimizations are implemented and functional, but multi-node performance gains have not been independently validated:
Projected improvements (unvalidated):
- Throughput: +150-300% for concurrent workloads (theory: parallel request handling)
- Latency: -40-70% for typical requests (theory: caching + HTTP/2)
- Cache hits: Significant latency reduction for repeated queries (validated in single-node tests)
What's actually measured:
- ✅ Response caching works (cache hit/miss rates tracked)
- ✅ Streaming works (token-by-token delivery confirmed)
- ✅ HTTP/2 enabled (httpx connection verified)
- ⚠️ Multi-node throughput gains: Not independently benchmarked
To validate these claims yourself:
# Run comparative benchmarks
cd benchmarks
python run_benchmarks.py --sollol-url http://localhost:8000 --duration 120
See BENCHMARKING.md for methodology.
Previous Features (v0.3.6+)
Synchronous API - No async/await required:
from sollol.sync_wrapper import OllamaPool
pool = OllamaPool.auto_configure()
response = pool.chat(...) # Synchronous call
Priority Helpers - Semantic priority levels:
from sollol.priority_helpers import Priority
priority = Priority.HIGH # 7
SOLLOL Detection:
- `X-Powered-By: SOLLOL` header on all responses
- `/api/health` endpoint returns `{"service": "SOLLOL", "version": "0.7.0"}`
🆚 Comparison
SOLLOL vs. Simple Load Balancers
| Feature | nginx/HAProxy | SOLLOL |
|---|---|---|
| Routing | Round-robin/random | Context-aware, adapts from history |
| Resource awareness | None | GPU/CPU/memory-aware |
| Failover | Manual config | Automatic detection & recovery |
| Model sharding | ❌ | ✅ llama.cpp integration |
| Task prioritization | ❌ | ✅ Priority queue |
| Observability | Basic | Rich metrics + dashboard |
| Setup | Complex config | Auto-discover |
SOLLOL vs. Kubernetes
| Feature | Kubernetes | SOLLOL |
|---|---|---|
| Complexity | High - requires cluster setup | Low - pip install |
| AI-specific | Generic container orchestration | Purpose-built for LLMs |
| Intelligence | None | Task-aware routing |
| Model sharding | Manual | Automatic |
| Best for | Large-scale production | AI-focused teams |
Use both! Deploy SOLLOL on Kubernetes for ultimate scalability.
🤝 Contributing
We welcome contributions! Areas we'd love help with:
- ML-based routing predictions
- Additional monitoring integrations
- Cloud provider integrations
- Performance optimizations
- Documentation improvements
See CONTRIBUTING.md for guidelines.
📜 License
MIT License - see LICENSE file for details.
🙏 Credits
Created by BenevolentJoker-JohnL
Part of the Complete AI Ecosystem:
- SynapticLlamas - Multi-Agent Orchestration
- FlockParser - Document RAG Intelligence
- SOLLOL - Distributed Inference Platform (this project)
Special Thanks:
- Dallan Loomis - For always providing invaluable support, feedback, and guidance throughout development. Your insights and encouragement have been instrumental in shaping this project.
Built with: Ray, Dask, FastAPI, llama.cpp, Ollama
🎯 What Makes SOLLOL Different?
- Combines task distribution AND model sharding in one system
- Context-aware routing that adapts based on performance metrics
- Auto-discovery of nodes with minimal configuration
- Built-in failover and priority queuing
- Purpose-built for Ollama clusters (understands GPU requirements, task types)
Limitations to know:
- Model sharding verified with 13B models; larger models not extensively tested
- Performance benefits depend on network latency and workload patterns
- Not a drop-in replacement for single-node setups in all scenarios
Stop manually managing your LLM cluster. Let SOLLOL optimize it for you.