Python 97.2%
HTML 1.7%
Shell 1%

Hackall 6c69f8506e Merge pull request #29 from hackall360/codex/create-typer-cli-for-full-project-demo Add full project demo CLI for orchestrating SOLLOL examples		2025-10-18 10:43:11 -07:00
.github	feat: add GitHub Actions CI with multi-node integration testing	2025-10-16 23:07:47 -05:00
benchmarks	Add real benchmark results and fix CLI entry point	2025-10-06 08:31:07 -05:00
config	Add production deployment artifacts and real benchmarks	2025-10-06 08:36:17 -05:00
docs	docs: add dashboard screenshots	2025-10-17 08:31:35 -05:00
examples	Add full project demo CLI and update live dashboard summary	2025-10-18 10:42:51 -07:00
kubernetes	Add production deployment artifacts and real benchmarks	2025-10-06 08:36:17 -05:00
scripts	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
src/sollol	feat: add routing decision tooltips to dashboard	2025-10-16 23:20:07 -05:00
SynapticLlamas@24be170a2d	feat: add routing decision tooltips to dashboard	2025-10-16 23:20:07 -05:00
systemd	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
tests	Rename mock server helper to start_background_process	2025-10-18 05:30:29 -07:00
.dockerignore	Add production-ready security, deployment, and fault tolerance	2025-10-03 05:49:02 -05:00
.gitignore	docs: update .gitignore and add Credits section with special thanks to Dallan Loomis	2025-10-16 22:26:41 -05:00
ARCHITECTURE.md	Make documentation more honest and factual, reduce overselling	2025-10-05 20:39:20 -05:00
BACKENDS.md	Document backend extensibility architecture	2025-10-07 00:15:07 -05:00
BATCH_API.md	Update documentation for v0.7.0 batch API release	2025-10-06 17:38:59 -05:00
BENCHMARKING.md	Add honest validation status and benchmarking guide	2025-10-06 09:39:23 -05:00
BENCHMARKS.md	Add professional portfolio enhancements	2025-10-03 00:10:29 -05:00
CODE_WALKTHROUGH.md	Clean up claims and add detailed code walkthrough	2025-10-06 11:55:32 -05:00
compare_discovery_modes.py	fix: network discovery now finds ALL nodes on subnet	2025-10-06 21:00:29 -05:00
COMPLETE_SUMMARY.md	Add complete implementation summary for v0.3.6	2025-10-05 22:19:09 -05:00
config.yml	Initial commit: SOLLOL - Super Ollama Load Balancer	2025-10-02 18:51:38 -05:00
CONFIGURATION.md	Add comprehensive configuration guide	2025-10-06 14:43:52 -05:00
CONTRIBUTING.md	Add professional portfolio enhancements	2025-10-03 00:10:29 -05:00
COORDINATOR_REUSE_INVESTIGATION.md	Fix flake8 F824 errors: Remove unused global declarations	2025-10-05 18:05:21 -05:00
dashboard.html	feat: add routing decision tooltips to dashboard	2025-10-16 23:20:07 -05:00
DEPLOY_GPU_REPORTER.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
DEPLOYMENT.md	feat: Integrate SynapticLlamas with SOLLOL as drop-in replacement	2025-10-03 10:04:12 -05:00
DEPLOYMENT_AWARE_RESOLUTION.md	feat: add heterogeneous GPU parallelization with VRAM-aware routing	2025-10-06 20:13:57 -05:00
DISTRIBUTED_INFERENCE_STATUS.md	feat: distributed inference stability and GPU monitoring	2025-10-12 10:47:59 -05:00
docker-compose.test.yml	Clean up claims and add detailed code walkthrough	2025-10-06 11:55:32 -05:00
docker-compose.yml	feat: Integrate SynapticLlamas with SOLLOL as drop-in replacement	2025-10-03 10:04:12 -05:00
DOCKER_IP_RESOLUTION.md	feat: add heterogeneous GPU parallelization with VRAM-aware routing	2025-10-06 20:13:57 -05:00
DOCKER_SETUP.md	Add Docker functional testing and comprehensive setup guide	2025-10-06 12:03:02 -05:00
Dockerfile	feat: Integrate SynapticLlamas with SOLLOL as drop-in replacement	2025-10-03 10:04:12 -05:00
FLOCKPARSER_FEATURES_ANALYSIS.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
GPU_MONITORING_GUIDE.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
GPU_MONITORING_SETUP.md	feat: distributed inference stability and GPU monitoring	2025-10-12 10:47:59 -05:00
gpu_reporter.py	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
GRAFANA_SETUP.md	release: v0.9.47 - RPC backend fixes and FlockParser integration	2025-10-11 09:16:57 -05:00
HONEST_STATUS.md	Initial commit: SOLLOL - Super Ollama Load Balancer	2025-10-02 18:51:38 -05:00
INTEGRATION_BASIC.md	Add CI/CD pipeline and enhance repository visibility	2025-10-03 00:19:25 -05:00
INTEGRATION_COMPLETE.md	docs: Add integration completion summary	2025-10-03 10:12:41 -05:00
INTEGRATION_GUIDE.md	Add CI/CD pipeline and enhance repository visibility	2025-10-03 00:19:25 -05:00
KNOWN_ISSUES.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
KNOWN_LIMITATIONS.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
LEGACY_GPU_SUPPORT.md	feat: add heterogeneous GPU parallelization with VRAM-aware routing	2025-10-06 20:13:57 -05:00
LICENSE	🚀 Integrate distributed inference and make SOLLOL standalone package	2025-10-04 14:38:12 -05:00
llamacpp_github_issue.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
MANIFEST.in	Prepare v0.3.6 for PyPI publication	2025-10-05 22:01:18 -05:00
mkdocs.yml	Add professional documentation site with MkDocs and GitHub Pages	2025-10-03 06:27:50 -05:00
MULTI_APP_ARCHITECTURE.md	Add comprehensive multi-application architecture guide	2025-10-06 12:25:48 -05:00
ollama_discussion_discord_final.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
ollama_discussion_draft.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
ollama_discussion_draft_v2.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
ollama_discussion_final.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
ollama_discussion_followup.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
ollama_github_issue.md	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
PHASE1_IMPLEMENTATION_COMPLETE.md	Add Phase 1 features: Sync API, Priority Helpers, SOLLOL Detection (v0.3.6)	2025-10-05 21:58:07 -05:00
PHASE2_COMPLETE.md	Document Phase 2 completion: Code consolidation successful	2025-10-05 22:10:18 -05:00
PHASE2_PROGRESS.md	Document Phase 2 completion: Code consolidation successful	2025-10-05 22:10:18 -05:00
PRODUCTION_READINESS.md	Fix missing exports and add production readiness assessment	2025-10-05 22:46:35 -05:00
prometheus.yml	Add production-ready security, deployment, and fault tolerance	2025-10-03 05:49:02 -05:00
PUBLISH.md	🚀 Integrate distributed inference and make SOLLOL standalone package	2025-10-04 14:38:12 -05:00
PUBLISHING.md	Add PyPI packaging for pip install support	2025-10-03 06:41:59 -05:00
PYPI_PUBLICATION_SUCCESS.md	Published SOLLOL v0.3.6 to PyPI	2025-10-05 22:16:59 -05:00
pyproject.toml	Move gateway local cluster integration test and gate on Ollama	2025-10-18 01:46:58 -07:00
QUICK_START.md	release: v0.9.47 - RPC backend fixes and FlockParser integration	2025-10-11 09:16:57 -05:00
README.md	fix: make performance claims transparent and honest	2025-10-17 09:28:10 -05:00
REDIS_SETUP.md	Implement Redis-based distributed coordination	2025-10-06 12:33:14 -05:00
REMOTE_ACCESS_SETUP_GUIDE.md	release: v0.9.47 - RPC backend fixes and FlockParser integration	2025-10-11 09:16:57 -05:00
REMOTE_ACCESS_STATUS.md	release: v0.9.47 - RPC backend fixes and FlockParser integration	2025-10-11 09:16:57 -05:00
ROUTING_LOGS.md	feat: add live routing decision logs with real-time dashboard	2025-10-11 14:48:57 -05:00
ROUTING_STRATEGIES.md	feat: add 5 routing strategies with clean extension points	2025-10-16 22:47:34 -05:00
RPC_BACKEND_FIX.md	docs: remove IP addresses from RPC_BACKEND_FIX.md	2025-10-12 10:55:00 -05:00
SECURITY.md	Add production-ready security, deployment, and fault tolerance	2025-10-03 05:49:02 -05:00
SESSION_SUMMARY.md	release: v0.9.47 - RPC backend fixes and FlockParser integration	2025-10-11 09:16:57 -05:00
setup.py	docs: reorganize documentation structure	2025-10-16 17:54:37 -05:00
setup_llama_cpp.py	🚀 Integrate distributed inference and make SOLLOL standalone package	2025-10-04 14:38:12 -05:00
test_activity.py	release: v0.9.47 - RPC backend fixes and FlockParser integration	2025-10-11 09:16:57 -05:00
test_batch_api.py	Add batch processing API endpoints (v0.7.0)	2025-10-06 16:58:32 -05:00
test_connection_reuse.py	perf: add HTTP connection reuse to OllamaPool	2025-10-12 11:11:51 -05:00
test_dashboard.py	release: v0.9.47 - RPC backend fixes and FlockParser integration	2025-10-11 09:16:57 -05:00
test_dashboard_fallback_simple.py	docs: document Dask worker logging as known issue	2025-10-07 10:33:21 -05:00
test_dashboards.py	feat: restore network infrastructure panels to unified dashboard	2025-10-07 00:51:22 -05:00
test_dask_adaptive.py	feat: add Dask distributed batch processing with adaptive routing	2025-10-12 19:08:35 -05:00
test_dask_batch.py	feat: add Dask distributed batch processing with adaptive routing	2025-10-12 19:08:35 -05:00
test_dask_comparison.py	feat: add Dask distributed batch processing with adaptive routing	2025-10-12 19:08:35 -05:00
test_docker.sh	Add Docker functional testing and comprehensive setup guide	2025-10-06 12:03:02 -05:00
test_embed_batch.py	feat: distributed inference stability and GPU monitoring	2025-10-12 10:47:59 -05:00
test_failure_recovery.py	feat: add failure & recovery simulation demonstration	2025-10-16 23:11:54 -05:00
test_full_network_discovery.py	fix: network discovery now finds ALL nodes on subnet	2025-10-06 21:00:29 -05:00
test_multi_app_dashboard.py	docs: document Dask worker logging as known issue	2025-10-07 10:33:21 -05:00
test_new_features.py	Add comprehensive feature test script	2025-10-06 16:04:25 -05:00
test_observer_debug.py	release: v0.9.47 - RPC backend fixes and FlockParser integration	2025-10-11 09:16:57 -05:00
test_optimizations.py	feat: add Dask distributed batch processing with adaptive routing	2025-10-12 19:08:35 -05:00
test_ray_features.py	fix: correct imports for GGUF resolver in Ray routers	2025-10-06 22:32:36 -05:00
test_resilience.py	Add production resilience features to SOLLOL v0.5.0	2025-10-06 16:18:48 -05:00
TEST_RESULTS.md	Initial commit: SOLLOL - Super Ollama Load Balancer	2025-10-02 18:51:38 -05:00
test_routing_log.py	feat: add live routing decision logs with real-time dashboard	2025-10-11 14:48:57 -05:00
test_routing_strategies.py	feat: add 5 routing strategies with clean extension points	2025-10-16 22:47:34 -05:00
test_rpc_backend_fix.py	release: v0.9.47 - RPC backend fixes and FlockParser integration	2025-10-11 09:16:57 -05:00
test_vram_monitoring.py	Add pre-routing VRAM size checks and model size estimation	2025-10-06 18:16:08 -05:00
test_websocket_client.py	release: v0.9.47 - RPC backend fixes and FlockParser integration	2025-10-11 09:16:57 -05:00
UNIVERSAL_DASHBOARD.md	feat: add universal network observability and application tracking	2025-10-07 00:30:30 -05:00
verify_dashboards.py	feat: add universal network observability and application tracking	2025-10-07 00:30:30 -05:00

README.md

SOLLOL - Production-Ready Orchestration for Local LLM Clusters

Open-source orchestration layer that combines intelligent task routing with distributed model inference for local LLM clusters.

Quick Start • Features • Architecture • Documentation • Examples

🎯 What is SOLLOL?

SOLLOL (Super Ollama Load balancer & Orchestration Layer) transforms your collection of Ollama nodes into an intelligent AI cluster with adaptive routing and automatic failover—all running on your own hardware.

The Problem

You have multiple machines with GPUs running Ollama, but:

❌ Manual node selection for each request
❌ No way to run models larger than your biggest GPU
❌ Can't distribute multi-agent workloads efficiently
❌ No automatic failover or load balancing
❌ Zero visibility into cluster performance

The SOLLOL Solution

SOLLOL provides:

✅ Intelligent routing that learns which nodes work best for each task
✅ Model sharding to run 70B+ models across multiple machines
✅ Parallel agent execution for multi-agent frameworks
✅ Auto-discovery of all nodes and capabilities
✅ Built-in observability with real-time metrics
✅ Zero-config deployment - just point and go

⚡ Quickstart (3 Commands)

# 1. Install SOLLOL
pip install sollol

# 2. Start the dashboard (optional but recommended)
python3 -m sollol.dashboard_service &

# 3. Run your first query
python3 -c "from sollol import OllamaPool; pool = OllamaPool.auto_configure(); print(pool.chat(model='llama3.2', messages=[{'role': 'user', 'content': 'Hello!'}])['message']['content'])"

What just happened?

✅ SOLLOL auto-discovered all Ollama nodes on your network
✅ Intelligently routed your request to the best available node
✅ Dashboard live at http://localhost:8080 (shows routing decisions, metrics, logs)

Expected output:

Discovering Ollama nodes...
Found 3 nodes: 10.9.66.45:11434, 10.9.66.154:11434, localhost:11434
Selected node: 10.9.66.45:11434 (GPU, 12ms latency)
Hello! How can I help you today?

Next steps:

Visit http://localhost:8080 to see the dashboard
Check Full Quick Start for production setup
Read Examples for multi-agent, batch, and sharding patterns

🚀 Full Quick Start

Installation

pip install sollol

Basic Usage

from sollol import OllamaPool

# Auto-discover nodes and start routing
pool = OllamaPool.auto_configure()

# Make requests - SOLLOL routes intelligently
response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)

Enable Real-Time GPU Monitoring

For accurate VRAM-aware routing, install the GPU reporter on each node:

# On each Ollama node, run:
sollol install-gpu-reporter --redis-host <redis-server-ip>

# Example:
sollol install-gpu-reporter --redis-host 10.9.66.154

What this does:

Installs vendor-agnostic GPU monitoring (NVIDIA/AMD/Intel via gpustat)
Publishes real-time VRAM stats to Redis every 5 seconds
SOLLOL uses this data for intelligent routing decisions
See GPU Monitoring Guide for details

Without GPU monitoring, SOLLOL falls back to VRAM estimates, which may be inaccurate.

📸 Screenshots

Dashboard Overview

Real-time monitoring with P50/P95/P99 latency metrics, network nodes, RPC backends, and active applications

Ray & Dask Integration

Embedded Ray and Dask dashboards for distributed task monitoring

Activity Monitoring

Live request/response activity streams from Ollama nodes and RPC backends

Applications & Traces

Applications, distributed traces, and Ollama activity logs with real-time request/response tracking

🔥 Why SOLLOL?

1. Two Distribution Modes in One System

SOLLOL combines both task distribution and model sharding:

📊 Task Distribution (Horizontal Scaling)

Distribute multiple requests across your cluster in parallel:

# Run 10 agents simultaneously across 5 nodes
pool = OllamaPool.auto_configure()
responses = await asyncio.gather(*[
    pool.chat(model="llama3.2", messages=[...])
    for _ in range(10)
])
# Parallel execution across available nodes
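When the client call is blocking rather than awaitable, the same fan-out can be sketched with a thread pool. This is a minimal illustration, not SOLLOL's implementation: `fake_chat` is a hypothetical stand-in for a blocking chat call, the node names are made up, and nodes are assigned round-robin rather than by SOLLOL's scoring.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical node names; a real pool would discover these.
NODES = ["node-a", "node-b", "node-c"]


def fake_chat(node: str, prompt: str) -> dict:
    """Stand-in for a blocking chat call; sleeps to mimic inference time."""
    time.sleep(0.05)
    return {"node": node, "content": f"echo: {prompt}"}


def run_agents(prompts, nodes, max_workers=10):
    """Send every prompt concurrently, assigning nodes round-robin."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(fake_chat, nodes[i % len(nodes)], prompt)
            for i, prompt in enumerate(prompts)
        ]
        # Collect results in submission order.
        return [future.result() for future in futures]


prompts = [f"task {i}" for i in range(10)]
start = time.perf_counter()
responses = run_agents(prompts, NODES)
elapsed = time.perf_counter() - start
print(f"{len(responses)} responses in {elapsed:.2f}s")
```

Because the ten calls overlap, the wall-clock time is close to one call's duration instead of ten, provided enough workers and nodes are available.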

🧩 Model Sharding (Vertical Scaling)

Run single large models that don't fit on one machine:

# Run larger models across multiple nodes
# Note: Verified with 13B across 2-3 nodes; larger models not extensively tested
router = HybridRouter(
    enable_distributed=True,
    num_rpc_backends=4
)
response = await router.route_request(
    model="llama3:70b",  # Sharded automatically
    messages=[...]
)

Use them together! Small models use task distribution, large models use sharding.

2. Intelligent, Not Just Balanced

SOLLOL doesn't just distribute requests randomly—it learns and optimizes:

Feature	Simple Load Balancer	SOLLOL
Routing	Round-robin	Context-aware scoring
Learning	None	Adapts from performance history
Resource Awareness	None	GPU/CPU/memory-aware
Task Optimization	None	Routes by task type complexity
Failover	Manual	Automatic with health checks
Priority	FIFO	Priority queue with fairness

Example: SOLLOL automatically routes:

Heavy generation tasks → GPU nodes with 24GB VRAM
Fast embeddings → CPU nodes or smaller GPUs
Critical requests → Fastest, most reliable nodes
Batch processing → Lower priority, distributed load

3. Production-Ready from Day One

from sollol import SOLLOL, SOLLOLConfig

# Literally 3 lines to production
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start()  # ✅ Gateway running on :8000

Out of the box:

Auto-discovery of Ollama nodes
Health monitoring and failover
Prometheus metrics
Web dashboard
Connection pooling
Request hedging
Priority queuing

4. Unified Observability for Your Entire AI Network

SOLLOL provides a single pane of glass to monitor every application and every node in your distributed AI network.

✅ Centralized Dashboard: One web interface shows all applications, nodes, and RPC backends.
✅ Multi-App Tracking: See which applications (e.g., SynapticLlamas, custom agents) are using the cluster in real-time.
✅ Network-Wide Visibility: The dashboard runs as a persistent service, discovering and monitoring all components even if no applications are running.
✅ Zero-Config: Applications automatically appear in the dashboard with no extra code required.

This moves beyond per-application monitoring to provide true, centralized observability for your entire infrastructure.

🏗️ Architecture

High-Level Overview

┌────────────────────────────────────────────────────────┐
│                  Your Application                       │
│         (SynapticLlamas, custom agents, etc.)          │
└──────────────────────┬─────────────────────────────────┘
                       │
                       ▼
┌────────────────────────────────────────────────────────┐
│                 SOLLOL Gateway (:8000)                  │
│  ┌──────────────────────────────────────────────────┐  │
│  │         Intelligent Routing Engine               │  │
│  │  • Analyzes: task type, complexity, resources    │  │
│  │  • Scores: all nodes based on context            │  │
│  │  • Learns: from performance history              │  │
│  │  • Routes: to optimal node                       │  │
│  └──────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────┐  │
│  │          Priority Queue + Failover               │  │
│  └──────────────────────────────────────────────────┘  │
└────────┬─────────────────────────┬─────────────────────┘
         │                         │
         ▼                         ▼
  ┌─────────────┐          ┌──────────────┐
  │ Task Mode   │          │  Shard Mode  │
  │ Ray Cluster │          │  llama.cpp   │
  └──────┬──────┘          └──────┬───────┘
         │                         │
         ▼                         ▼
┌────────────────────────────────────────────────────────┐
│              Your Heterogeneous Cluster                 │
│  GPU (24GB) │ GPU (16GB) │ CPU (64c) │ GPU (8GB) │...  │
└────────────────────────────────────────────────────────┘

How Routing Works

# 1. Request arrives
POST /api/chat {
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Complex analysis task..."}],
  "priority": 8
}

# 2. SOLLOL analyzes
task_type = "generation"       # Auto-detected
complexity = "high"             # Token count analysis
requires_gpu = True             # Based on task
estimated_duration = 3.2s       # From history

# 3. SOLLOL scores all nodes
Node A (GPU 24GB, load: 0.2, latency: 120ms) → Score: 185.3 ✓ WINNER
Node B (GPU 8GB,  load: 0.6, latency: 200ms) → Score: 92.1
Node C (CPU only, load: 0.1, latency: 80ms)  → Score: 41.2

# 4. Routes to Node A, monitors execution, learns for next time

Scoring Algorithm:

Score = 100.0 (baseline)
      × success_rate (0.0-1.0)
      ÷ (1 + latency_penalty)
      × gpu_bonus (1.5x if GPU available & needed)
      ÷ (1 + load_penalty)
      × priority_alignment
      × task_specialization
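The formula above can be sketched in a few lines of Python. The text does not define `latency_penalty` or `load_penalty`, so this sketch assumes latency in seconds and raw load as the penalties; the `NodeStats` type, the success rates, and the default multipliers are also illustrative assumptions, so the numbers will not match the example scores shown earlier.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    name: str
    success_rate: float  # 0.0-1.0
    latency_ms: float
    load: float          # 0.0-1.0 utilisation
    has_gpu: bool


def score_node(node, requires_gpu, priority_alignment=1.0, task_specialization=1.0):
    """Apply the multiplicative scoring formula to one node."""
    # Assumed penalty definitions (not specified in the README).
    latency_penalty = node.latency_ms / 1000.0
    load_penalty = node.load
    gpu_bonus = 1.5 if (requires_gpu and node.has_gpu) else 1.0

    score = 100.0                      # baseline
    score *= node.success_rate
    score /= 1.0 + latency_penalty
    score *= gpu_bonus
    score /= 1.0 + load_penalty
    score *= priority_alignment
    score *= task_specialization
    return score


# Latency and load values mirror the three example nodes; success rates are made up.
nodes = [
    NodeStats("A", success_rate=0.99, latency_ms=120, load=0.2, has_gpu=True),
    NodeStats("B", success_rate=0.95, latency_ms=200, load=0.6, has_gpu=True),
    NodeStats("C", success_rate=0.99, latency_ms=80, load=0.1, has_gpu=False),
]
best = max(nodes, key=lambda n: score_node(n, requires_gpu=True))
print(f"Best node: {best.name}")
```

Every factor is multiplicative, so a node with a success rate near zero scores near zero no matter how fast or idle it is.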

📦 Installation

Quick Install (PyPI)

pip install sollol

From Source

git clone https://github.com/BenevolentJoker-JohnL/SOLLOL.git
cd SOLLOL
pip install -e .

⚡ Quick Start

1. Synchronous API (No async/await needed!)

New in v0.3.6: SOLLOL now provides a synchronous API for easier integration with non-async applications.

from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority

# Auto-discover and connect to all Ollama nodes
pool = OllamaPool.auto_configure()

# Make requests - SOLLOL routes intelligently
# No async/await needed!
response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    priority=Priority.HIGH,  # Semantic priority levels
    timeout=60  # Request timeout in seconds
)

print(response['message']['content'])
print(f"Routed to: {response.get('_sollol_routing', {}).get('host', 'unknown')}")

Key features of synchronous API:

✅ No async/await syntax required
✅ Works with synchronous agent frameworks
✅ Same intelligent routing and features
✅ Runs async code in background thread automatically
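The last point, running async code in a background thread, follows a common pattern that can be sketched with the standard library alone. This is a generic illustration and not the code in `sollol.sync_wrapper`; `BackgroundLoop` and `async_chat` are hypothetical names.

```python
import asyncio
import threading


class BackgroundLoop:
    """Runs an asyncio event loop in a daemon thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=None):
        """Submit a coroutine to the loop and block until it finishes."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self):
        """Stop the loop, wait for the thread, and release resources."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def async_chat(prompt):
    # Hypothetical coroutine standing in for an async client method.
    await asyncio.sleep(0.01)
    return {"message": {"content": f"echo: {prompt}"}}


runner = BackgroundLoop()
response = runner.run(async_chat("Hello!"), timeout=5)
print(response["message"]["content"])
runner.close()
```

The caller never touches `async`/`await`: it hands a coroutine to the loop thread and blocks on the returned future.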

2. Async API (Original)

For async applications, use the original async API:

from sollol import OllamaPool

# Auto-discover and connect to all Ollama nodes
pool = await OllamaPool.auto_configure()

# Make requests - SOLLOL routes intelligently
response = await pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response['message']['content'])
print(f"Routed to: {response['_sollol_routing']['host']}")
print(f"Task type: {response['_sollol_routing']['task_type']}")

3. Priority-Based Multi-Agent Execution

New in v0.3.6: Use semantic priority levels and role-based mapping.

from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority, get_priority_for_role

pool = OllamaPool.auto_configure()

# Define agents with different priorities
agents = [
    {"name": "Researcher", "role": "researcher"},  # Priority 8
    {"name": "Editor", "role": "editor"},          # Priority 6
    {"name": "Summarizer", "role": "summarizer"},  # Priority 5
]

for agent in agents:
    priority = get_priority_for_role(agent["role"])

    response = pool.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": f"Task for {agent['name']}"}],
        priority=priority
    )
    # User-facing agents get priority, background tasks wait

Priority levels available:

Priority.CRITICAL (10) - Mission-critical
Priority.URGENT (9) - Fast response needed
Priority.HIGH (7) - Important tasks
Priority.NORMAL (5) - Default
Priority.LOW (3) - Background tasks
Priority.BATCH (1) - Can wait
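To make the ordering concrete, the levels above can be mirrored in a local enum and fed into a heap. This is a sketch under stated assumptions, not the `sollol.priority_helpers` module: the role values come from the comments in the example above, unknown roles are assumed to default to NORMAL, and `priority_for_role` and `enqueue` are hypothetical helpers.

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    """Local mirror of the priority levels listed above."""
    CRITICAL = 10
    URGENT = 9
    HIGH = 7
    NORMAL = 5
    LOW = 3
    BATCH = 1


# Role values taken from the comments in the multi-agent example.
ROLE_PRIORITIES = {"researcher": 8, "editor": 6, "summarizer": 5}

_sequence = itertools.count()  # tie-breaker: FIFO within the same priority


def priority_for_role(role, default=Priority.NORMAL):
    """Look up a role's priority; unknown roles get the default (assumption)."""
    return ROLE_PRIORITIES.get(role.lower(), int(default))


def enqueue(queue, role, task):
    # heapq is a min-heap, so negate the priority to pop the highest first.
    heapq.heappush(queue, (-priority_for_role(role), next(_sequence), task))


queue = []
enqueue(queue, "summarizer", "summarize")
enqueue(queue, "unknown-role", "misc")
enqueue(queue, "editor", "edit")
enqueue(queue, "researcher", "research")

order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
print(order)
```

Higher-priority tasks are popped first, and tasks with equal priority keep their submission order.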

4. Model Sharding with llama.cpp (Large Models)

Run models larger than your biggest GPU by distributing layers across multiple machines.

When to Use Model Sharding

Use model sharding when:

✅ Model doesn't fit on your largest GPU (e.g., 70B models on 16GB GPUs)
✅ You have multiple machines with network connectivity
✅ You can accept slower inference in exchange for running a larger model

Don't use sharding when:

❌ Model fits on a single GPU (use task distribution instead)
❌ You need maximum inference speed
❌ Network latency is high (>10ms between machines)

Quick Start: Auto-Setup (Easiest)

from sollol.sync_wrapper import HybridRouter, OllamaPool

# SOLLOL handles all setup automatically
router = HybridRouter(
    ollama_pool=OllamaPool.auto_configure(),
    enable_distributed=True,  # Enable model sharding
    auto_setup_rpc=True,      # Auto-configure RPC backends
    num_rpc_backends=3        # Distribute across 3 machines
)

# Use large model that doesn't fit on one machine
response = router.route_request(
    model="llama3.1:70b",  # Automatically sharded across backends
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

print(response['message']['content'])

What happens automatically:

SOLLOL discovers available RPC backends on your network
Extracts the GGUF model from Ollama storage
Starts llama-server coordinator with optimal settings
Distributes model layers across backends
Routes your request to the coordinator

RPC Server Auto-Installation

SOLLOL can automatically clone, build, and start llama.cpp RPC servers for you!

One-line installation:

from sollol.rpc_auto_setup import auto_setup_rpc_backends

# Automatically: clone → build → start RPC servers
backends = auto_setup_rpc_backends(num_backends=2)
# Output: [{'host': '127.0.0.1', 'port': 50052}, {'host': '127.0.0.1', 'port': 50053}]

What this does:

✅ Scans network for existing RPC servers
✅ If none found: clones llama.cpp to ~/llama.cpp
✅ Builds llama.cpp with RPC support (cmake -DGGML_RPC=ON)
✅ Starts RPC servers on ports 50052-50053
✅ Returns ready-to-use backend list

CLI installation:

# Full automated setup (clone + build + install systemd service)
python3 -m sollol.setup_llama_cpp --all

# Or step by step
python3 -m sollol.setup_llama_cpp --clone  # Clone llama.cpp
python3 -m sollol.setup_llama_cpp --build  # Build with RPC support
python3 -m sollol.setup_llama_cpp --start  # Start RPC server

Docker IP Resolution:

SOLLOL automatically resolves Docker container IPs to accessible host IPs:

# If Docker container reports IP 172.17.0.5:11434
# SOLLOL automatically resolves to:
# → 127.0.0.1:11434 (published port mapping)
# → host IP (if accessible)
# → Docker host gateway

from sollol import is_docker_ip, resolve_docker_ip

# Check if IP is Docker internal
is_docker = is_docker_ip("172.17.0.5")  # True

# Resolve Docker IP to accessible IP
accessible_ip = resolve_docker_ip("172.17.0.5", port=11434)
# Returns: "127.0.0.1" or host IP
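For intuition, a check like `is_docker_ip` can be approximated with the standard `ipaddress` module. This is a heuristic sketch, not SOLLOL's implementation: it assumes that any address in the private 172.16.0.0/12 block (which contains Docker's default bridge network, 172.17.0.0/16) is a container address, and `looks_like_docker_ip` and `candidate_addresses` are hypothetical names.

```python
import ipaddress

# Assumption: treat the whole 172.16.0.0/12 private block as "probably Docker".
# Docker's default bridge network (172.17.0.0/16) falls inside it.
DOCKER_LIKE_NETWORKS = [ipaddress.ip_network("172.16.0.0/12")]


def looks_like_docker_ip(ip: str) -> bool:
    """Return True if the address falls in a Docker-style private range."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # not a valid IP address at all
    return any(addr in net for net in DOCKER_LIKE_NETWORKS)


def candidate_addresses(ip: str, host_ip=None):
    """List addresses to try, in the fallback order described above."""
    if not looks_like_docker_ip(ip):
        return [ip]
    candidates = ["127.0.0.1"]  # published port mapping on the local host
    if host_ip:
        candidates.append(host_ip)
    return candidates


print(looks_like_docker_ip("172.17.0.5"))
print(candidate_addresses("172.17.0.5", host_ip="10.9.66.154"))
```

A real resolver would also probe each candidate on the target port before accepting it; this sketch only orders the candidates.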

Network Discovery with Docker Support:

from sollol import OllamaPool

# Auto-discover nodes (automatically resolves Docker IPs)
pool = OllamaPool.auto_configure()

# Manual control
from sollol.discovery import discover_ollama_nodes
nodes = discover_ollama_nodes(auto_resolve_docker=True)

Multi-Node Production Setup:

For distributed clusters, use systemd services on each node:

# On each RPC node
sudo systemctl enable llama-rpc@50052.service
sudo systemctl start llama-rpc@50052.service

See SOLLOL_RPC_SETUP.md for the complete installation guide.

Architecture: How It Works

┌────────────────────────────────────────────┐
│    Llama 3.1 70B Model (40GB total)        │
│           Distributed Sharding             │
└────────────────────────────────────────────┘
                    │
       ┌────────────┼────────────┐
       │            │            │
       ▼            ▼            ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  Machine 1   │ │  Machine 2   │ │  Machine 3   │
│ Layers 0-26  │ │ Layers 27-53 │ │ Layers 54-79 │
│   (~13GB)    │ │   (~13GB)    │ │   (~13GB)    │
│ RPC Backend  │ │ RPC Backend  │ │ RPC Backend  │
└──────────────┘ └──────────────┘ └──────────────┘
       ▲            ▲            ▲
       └────────────┼────────────┘
                    │
         ┌──────────┴──────────┐
         │ llama-server        │
         │ Coordinator         │
         │ (Port 18080)        │
         └─────────────────────┘
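The layer ranges in the diagram follow from an even split of 80 layers across 3 backends. The arithmetic can be sketched as below; the even split is an assumption for illustration (a real backend may weight the split by available memory), and `split_layers` is a hypothetical helper.

```python
def split_layers(num_layers: int, num_backends: int):
    """Split layer indices into contiguous, near-equal inclusive ranges."""
    if num_backends <= 0:
        raise ValueError("num_backends must be positive")
    base, extra = divmod(num_layers, num_backends)
    ranges = []
    start = 0
    for i in range(num_backends):
        # The first `extra` backends take one additional layer.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


layer_ranges = split_layers(80, 3)
per_machine_gb = 40 / 3  # 40GB model spread over 3 machines

for machine, (first, last) in enumerate(layer_ranges, start=1):
    print(f"Machine {machine}: layers {first}-{last} (~{per_machine_gb:.0f}GB)")
```

With 80 layers and 3 backends this reproduces the ranges in the diagram: two machines hold 27 layers each and the third holds 26.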

Manual Setup (Advanced)

For explicit control over RPC backends:

from sollol.llama_cpp_coordinator import LlamaCppCoordinator
from sollol.rpc_registry import RPCBackendRegistry

# 1. Register RPC backends explicitly
registry = RPCBackendRegistry()
registry.add_backend("rpc_1", "grpc://10.9.66.45:50052")
registry.add_backend("rpc_2", "grpc://10.9.66.46:50052")
registry.add_backend("rpc_3", "grpc://10.9.66.47:50052")

# 2. Create coordinator
coordinator = LlamaCppCoordinator(
    coordinator_port=18080,
    rpc_backends=registry.get_all_backends(),
    context_size=4096,
    gpu_layers=-1  # Use all available GPU layers
)

# 3. Start and use
await coordinator.start(model_name="llama3.1:70b")
response = await coordinator.generate(
    prompt="Explain the theory of relativity",
    max_tokens=500
)

Performance Expectations

Model Size	Single GPU	Sharded (3 nodes)	Trade-off
13B	✅ 20 tok/s	✅ 5 tok/s	-75% speed, but fits on GPUs with roughly one-third the VRAM
70B	❌ OOM	⚠️ 3-5 tok/s (est.)	Enables model that won't run otherwise

Trade-offs:

🐌 Startup: 2-5 minutes (model distribution + loading)
🐌 Inference: ~4x slower than local (network overhead)
✅ Capability: Run models that won't fit on single GPU

Learn More:

📖 Complete llama.cpp Guide - Setup, optimization, troubleshooting
💻 Working Examples - 5 complete examples including conversation, batch processing, error handling

5. Batch Processing API

New in v0.7.0: RESTful API for asynchronous batch job management.

Submit large-scale batch operations (thousands of embeddings, bulk inference) and track progress via job IDs:

import requests

# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
    "model": "nomic-embed-text",
    "documents": ["Document 1", "Document 2", ...],  # Can be thousands
    "metadata": {"source": "knowledge_base"}  # Optional metadata
})

job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")

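# Poll for job status (give up after 10 minutes so a stuck job cannot loop forever)
import time
deadline = time.monotonic() + 600
while time.monotonic() < deadline:
    status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}").json()

    progress = status["progress"]["percent"]
    print(f"Progress: {progress}%")

    if status["status"] == "completed":
        break
    time.sleep(1)
else:
    # The loop ended without reaching "completed"
    raise TimeoutError(f"Batch job {job_id} did not complete in time")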

# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}").json()
embeddings = results["results"]  # List of embedding vectors
print(f"Processed {len(embeddings)} documents in {status['duration_seconds']}s")

Available Batch Endpoints:

POST /api/batch/embed - Submit batch embedding job
GET /api/batch/jobs/{job_id} - Get job status
GET /api/batch/results/{job_id} - Get job results
GET /api/batch/jobs?limit=100 - List recent jobs
DELETE /api/batch/jobs/{job_id} - Cancel job

Use cases:

Embedding large document collections (thousands of documents)
Bulk inference for batch predictions
Background processing without blocking
Long-running operations with progress tracking

6. SOLLOL Detection

New in v0.3.6: Detect if SOLLOL is running vs native Ollama.

import requests

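def is_sollol(url="http://localhost:11434"):
    """Check if SOLLOL is running at the given URL."""
    try:
        # Method 1: Check X-Powered-By header
        response = requests.get(url, timeout=5)
        if response.headers.get("X-Powered-By") == "SOLLOL":
            return True

        # Method 2: Check health endpoint
        response = requests.get(f"{url}/api/health", timeout=5)
        data = response.json()
        if data.get("service") == "SOLLOL":
            return True
    except (requests.RequestException, ValueError):
        # Server unreachable, or the endpoint did not return JSON:
        # treat both cases as "not SOLLOL" instead of crashing the caller
        pass

    return False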

# Use it
if is_sollol("http://localhost:11434"):
    print("✓ SOLLOL detected - using intelligent routing")
else:
    print("Native Ollama detected")

Why this matters:

Enables graceful fallback in client applications
Makes SOLLOL a true drop-in replacement
Clients can auto-detect and use SOLLOL features when available

7. Production Gateway

from sollol import SOLLOL, SOLLOLConfig

# Full production setup
config = SOLLOLConfig(
    ray_workers=4,
    dask_workers=2,
    hosts=["gpu-1:11434", "gpu-2:11434", "cpu-1:11434"],
    gateway_port=8000,
    metrics_port=9090
)

sollol = SOLLOL(config)
sollol.start()  # Blocks and runs gateway

# Access via HTTP:
# curl http://localhost:8000/api/chat -d '{...}'
# curl http://localhost:8000/api/stats
# curl http://localhost:8000/api/dashboard

🎓 Use Cases

1. Multi-Agent AI Systems (SynapticLlamas, CrewAI, AutoGPT)

Problem: Running 10 agents sequentially on one node can take up to 10x longer than running them in parallel across enough nodes.

Solution: SOLLOL distributes agents across nodes in parallel.

# Before: Sequential execution on one node
# After: Parallel execution with SOLLOL
pool = OllamaPool.auto_configure()
agents = await asyncio.gather(*[
    pool.chat(model="llama3.2", messages=agent_prompts[i])
    for i in range(10)
])
# Speedup depends on number of available nodes and their capacity

2. Large Model Inference

Problem: Your model doesn't fit in available VRAM.

Solution: SOLLOL can shard models across multiple machines via llama.cpp.

# Distribute model across multiple nodes
# Note: Verified with 13B models; larger models not extensively tested
router = HybridRouter(
    enable_distributed=True,
    num_rpc_backends=4
)
# Trade-off: Slower startup/inference but enables running larger models

3. Mixed Workloads

Problem: Different tasks need different resources.

Solution: SOLLOL routes each task to the optimal node.

pool = OllamaPool.auto_configure()

# Heavy generation → GPU node
chat = pool.chat(model="llama3.1:70b", messages=[...])

# Fast embeddings → CPU node
embeddings = pool.embed(model="nomic-embed-text", input=[...])

# SOLLOL automatically routes each to the best available node
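
To make "routes each task to the best available node" concrete, here is a minimal scoring sketch. It is not SOLLOL's actual routing code: the `NodeStats` fields, the weights, and the `score` and `pick_node` helpers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeStats:
    """Hypothetical per-node metrics a router might track."""
    host: str
    has_gpu: bool
    avg_latency_ms: float
    success_rate: float      # 0.0 to 1.0
    active_requests: int


def score(node: NodeStats, needs_gpu: bool) -> float:
    """Higher is better. The weights are illustrative, not SOLLOL's."""
    s = node.success_rate / (1.0 + node.avg_latency_ms / 1000.0)
    s /= 1 + node.active_requests        # penalise nodes that are already busy
    if needs_gpu and not node.has_gpu:
        s *= 0.1                         # strongly prefer GPU nodes for heavy work
    return s


def pick_node(nodes: List[NodeStats], needs_gpu: bool) -> NodeStats:
    return max(nodes, key=lambda n: score(n, needs_gpu))


nodes = [
    NodeStats("gpu-1:11434", True, avg_latency_ms=800, success_rate=0.99, active_requests=2),
    NodeStats("cpu-1:11434", False, avg_latency_ms=300, success_rate=0.99, active_requests=0),
]
heavy_target = pick_node(nodes, needs_gpu=True).host
light_target = pick_node(nodes, needs_gpu=False).host
print(heavy_target, light_target)
```

With these sample numbers the heavy request goes to the GPU node even though it is busier, while the lightweight request goes to the idle CPU node.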

4. High Availability Production

Problem: Node failures break your service.

Solution: SOLLOL auto-fails over and recovers.

# Node A fails mid-request
# ✅ SOLLOL automatically:
# 1. Detects failure
# 2. Retries on Node B
# 3. Marks Node A as degraded
# 4. Periodically re-checks Node A
# 5. Restores Node A when healthy
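
The retry-and-mark-degraded steps above can be sketched in a few lines. This is an illustration of the pattern, not SOLLOL's implementation: `send_with_failover`, the `health` dictionary, and the fake transport are hypothetical, and the periodic background re-check is left out.

```python
from typing import Callable, Dict, List


class AllNodesFailed(Exception):
    """Raised when every candidate node fails for one request."""


def send_with_failover(
    nodes: List[str],
    health: Dict[str, str],
    send: Callable[[str], str],
) -> str:
    """Try healthy nodes first, then degraded ones; record outcomes in `health`."""
    # Stable sort: nodes not marked "degraded" keep their order and come first.
    ordered = sorted(nodes, key=lambda n: health.get(n) == "degraded")
    for node in ordered:
        try:
            result = send(node)
        except ConnectionError:
            health[node] = "degraded"   # mark the failed node
            continue                    # retry the request on the next node
        health[node] = "healthy"        # a successful call restores the node
        return result
    raise AllNodesFailed(f"no node could serve the request: {nodes}")


# Demo with a fake transport in which node-a is down.
def fake_send(node: str) -> str:
    if node == "node-a":
        raise ConnectionError("node-a unreachable")
    return f"served by {node}"


health: Dict[str, str] = {}
result = send_with_failover(["node-a", "node-b"], health, fake_send)
print(result, health)
```

On the next call, `node-a` is tried last because it is marked degraded, so healthy nodes absorb the traffic until it succeeds again.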

Simulate Failure & Recovery

Want to see SOLLOL's automatic failover in action? Run the included simulation:

python test_failure_recovery.py

What the simulation does:

Starts 3 mock Ollama nodes
Sends baseline requests (all nodes healthy)
Kills node #1 mid-execution
Continues sending requests (SOLLOL routes around failed node)
Restores node #1
Resumes sending requests (traffic returns to recovered node)

Expected output:

STEP 1: Starting Mock Nodes
✅ Started 3 mock nodes

BASELINE: Requests with all nodes healthy
  Request 1: ✓ Routed to localhost:21434
  Request 2: ✓ Routed to localhost:21435
  ...

STEP 3: Simulating Node Failure (killing node 0)
Killing node on port 21434...
✅ Node 21434 terminated

STEP 4: Requests after node failure (observe failover)
  Request 1: ✓ Routed to localhost:21435  ← Automatically avoided dead node
  Request 2: ✓ Routed to localhost:21436
  ...

STEP 5: Simulating Node Recovery
✅ Node 21434 recovered successfully

✅ Key Observations:
  1. Requests succeeded even after node failure
  2. SOLLOL automatically routed around the dead node
  3. Node recovered and rejoined the pool
  4. Traffic resumed to recovered node

This demonstrates SOLLOL's production-grade resilience without needing real infrastructure.

📊 Performance & Benchmarks

Validation Status

What's Been Validated ✅

Single-node baseline performance measured
Code exists and is reviewable (75+ modules)
Tests pass in CI (57 tests, coverage tracked)
Architecture implements intelligent routing

What Needs Validation ⚠️

Comparative benchmarks (SOLLOL vs round-robin)
Multi-node performance improvements
Real-world latency/throughput gains

📖 See BENCHMARKING.md for complete validation roadmap and how to run comparative tests.

Measured Baseline Performance

Single Ollama Node (llama3.2-3B, 50 requests, concurrency=5):

✅ Success Rate: 100%
⚡ Throughput: 0.51 req/s
📈 Average Latency: 5,659 ms
📈 P95 Latency: 11,299 ms
📈 P99 Latency: 12,259 ms

Hardware: Single Ollama instance with 75+ models loaded
Data: See benchmarks/results/ for raw JSON

Run Your Own:

# Baseline test (no cluster needed)
python benchmarks/simple_ollama_benchmark.py llama3.2 50

# Comparative test (requires docker-compose)
docker-compose up -d
python benchmarks/run_benchmarks.py --sollol-url http://localhost:8000 --duration 60

Projected Performance (Unvalidated)

Note: These are architectural projections, not measured results. Requires multi-node cluster setup for validation.

Theory: With N nodes and parallelizable workload:

Task distribution can approach N× parallelization (limited by request rate)
Intelligent routing should reduce tail latencies vs random selection
Resource-aware placement reduces contention and failures

Reality: Requires multi-node cluster validation. See BENCHMARKING.md for test procedure and CODE_WALKTHROUGH.md for implementation details.

Model Sharding Performance

Model	Single 24GB GPU	SOLLOL (3×16GB)	Status
13B	✅ ~20 tok/s	✅ ~5 tok/s	✅ Verified working
70B	❌ OOM	⚠️ Estimated ~3-5 tok/s	⚠️ Not extensively tested

When to use sharding: When the model doesn't fit on your largest GPU. You trade speed for capability.

Performance trade-offs: Distributed inference takes 2-5 minutes longer to start and runs about 4x slower than local inference (~5 tok/s vs ~20 tok/s for the 13B model above). Use it only when necessary.

Overhead

Routing decision: ~5-10ms (tested with 5-10 nodes)
Network overhead: Varies by network (typically 5-20ms)
Total added latency: ~20-50ms
Benefit: Better resource utilization + automatic failover

🛠️ Advanced Configuration

Custom Routing Strategy

from sollol import OllamaPool

pool = OllamaPool(
    nodes=[
        {"host": "gpu-1.local", "port": 11434, "priority": 10},  # Prefer this
        {"host": "gpu-2.local", "port": 11434, "priority": 5},
        {"host": "cpu-1.local", "port": 11434, "priority": 1},   # Last resort
    ],
    enable_intelligent_routing=True,
    enable_hedging=True,  # Duplicate critical requests
    max_queue_size=100
)

Priority-Based Scheduling

# Critical user-facing request
response = pool.chat(
    model="llama3.2",
    messages=[...],
    priority=10  # Highest priority
)

# Background batch job
response = pool.chat(
    model="llama3.2",
    messages=[...],
    priority=1  # Lowest priority
)

# SOLLOL ensures high-priority requests jump the queue
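
One common way to let high-priority requests jump the queue is a heap keyed on priority with an arrival counter as tie-breaker. The sketch below shows that general technique; `PriorityRequestQueue` is a hypothetical class, not SOLLOL's internal queue.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class PriorityRequestQueue:
    """Pops the highest-priority request first; FIFO among equal priorities."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, request: Any, priority: int = 5) -> None:
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), request))

    def pop(self) -> Any:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityRequestQueue()
queue.push("batch job", priority=1)
queue.push("user chat", priority=10)
queue.push("second batch job", priority=1)
order = [queue.pop() for _ in range(len(queue))]
print(order)
```

The user-facing request is served first even though it arrived second, and the two batch jobs keep their original order.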

Observability & Monitoring

Zero-Config Auto-Registration 🎯

SOLLOL provides automatic observability with zero configuration required. All applications automatically register with the dashboard when they create an OllamaPool:

from sollol import OllamaPool

# Creates pool AND auto-registers with dashboard (if running)
pool = OllamaPool.auto_configure()
# ✅ Application automatically appears in dashboard at http://localhost:8080

How it works:

OllamaPool automatically detects if a dashboard is running on port 8080
Auto-discovers RPC backends and Ollama nodes
Registers application with metadata (node count, GPU info, etc.)
Sends periodic heartbeats to maintain "alive" status
No manual DashboardClient setup needed!
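
The heartbeat mechanism can be pictured from the dashboard's side as a table of last-seen timestamps. The sketch below is an assumption about how such liveness tracking typically works, not the dashboard's real code: `HeartbeatRegistry` and the 15-second timeout are hypothetical.

```python
import time
from typing import Callable, Dict, List


class HeartbeatRegistry:
    """Tracks when each application last sent a heartbeat."""

    def __init__(self, timeout_s: float = 15.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s      # assumed value, not taken from SOLLOL
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, app_name: str) -> None:
        self._last_seen[app_name] = self._clock()

    def alive(self) -> List[str]:
        """Return applications whose last heartbeat is within the timeout."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen <= self._timeout_s
        )


# Demo with a fake clock so the example runs instantly.
fake_now = [0.0]
registry = HeartbeatRegistry(timeout_s=15.0, clock=lambda: fake_now[0])
registry.heartbeat("FlockParser")
registry.heartbeat("SynapticLlamas")
fake_now[0] = 10.0
registry.heartbeat("FlockParser")      # only FlockParser keeps reporting
fake_now[0] = 20.0
alive_apps = registry.alive()
print(alive_apps)
```

An application that stops sending heartbeats simply ages out of the "alive" list, which is why no explicit deregistration is needed when a process crashes.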

Architecture:

ONE persistent dashboard service runs independently
Multiple applications (SynapticLlamas, FlockParser, etc.) auto-register
Dashboard survives application exits
Zero-config auto-discovery of nodes and RPC backends

Custom Application Names 🏷️

By default, applications register as "OllamaPool (hostname)". To give your application a custom name in the dashboard:

from sollol import OllamaPool

# Register with custom application name
pool = OllamaPool(
    nodes=[{"host": "localhost", "port": 11434}],
    enable_intelligent_routing=True,
    app_name="MyApplication"  # Shows as "MyApplication" in dashboard
)

Example - Multi-application setup:

# Application 1: FlockParser
from sollol import OllamaPool

pool = OllamaPool.auto_configure(app_name="FlockParser")
# Dashboard shows: "FlockParser"

# Application 2: SynapticLlamas
from sollol.dashboard_client import DashboardClient

dashboard_client = DashboardClient(
    app_name="SynapticLlamas",
    router_type="IntelligentRouter",
    version="1.0.0",
    dashboard_url="http://localhost:8080",
    metadata={"agents": 3, "distributed": True},
    auto_register=True
)
# Dashboard shows: "SynapticLlamas"

Why use custom names?

Distinguish between multiple applications using SOLLOL
Better visibility in multi-tenant environments
Easier debugging and monitoring
Professional dashboard presentation

Manual/Programmatic Registration 🔧

For applications that don't use OllamaPool or need custom registration logic, use DashboardClient directly:

from sollol.dashboard_client import DashboardClient

# Create dashboard client with custom metadata
dashboard_client = DashboardClient(
    app_name="CustomApplication",
    router_type="CustomRouter",  # Or "OllamaPool", "HybridRouter", etc.
    version="1.0.0",
    dashboard_url="http://localhost:8080",
    metadata={
        # Custom metadata shown in dashboard
        "nodes": 5,
        "distributed": True,
        "custom_field": "value"
    },
    auto_register=True  # Registers immediately
)

# Dashboard client automatically sends heartbeats every 5 seconds
# to keep application status as "active"

# When application exits, clean up:
dashboard_client.close()  # Stops heartbeat thread

Advanced: Custom Heartbeat Logic

from sollol.dashboard_client import DashboardClient
import time

# Create client without auto-registration
dashboard_client = DashboardClient(
    app_name="BackgroundWorker",
    router_type="WorkerPool",
    version="2.0.0",
    dashboard_url="http://localhost:8080",
    metadata={"worker_count": 10},
    auto_register=False  # Don't register yet
)

# Register when ready
dashboard_client.register()

# Update metadata dynamically
dashboard_client.update_metadata({"worker_count": 15, "status": "processing"})

# Send manual heartbeat
dashboard_client.heartbeat()

# Application logic here...
time.sleep(60)

# Deregister when done
dashboard_client.deregister()
dashboard_client.close()

Use cases for manual registration:

Custom routers or load balancers
Background workers or daemons
Applications that need dynamic metadata updates
Testing and debugging
Applications without OllamaPool

Registration Methods Comparison 📊

Method	Use Case	Complexity	Customization
Auto-registration	Standard SOLLOL applications	✅ Zero config	Limited (app_name only)
Custom app_name	Multiple apps, better naming	✅ One parameter	App name
Manual DashboardClient	Custom applications	⚠️ More code	Full control

Quick decision guide:

Using OllamaPool? → Use app_name parameter
Need custom metadata? → Use DashboardClient directly
Need dynamic updates? → Use DashboardClient with manual heartbeats
Just want it to work? → Use auto-registration (default)

Persistent Dashboard Service

Start the persistent dashboard once (survives application exits):

# Start dashboard service (runs until stopped)
python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379

# Or run in background
nohup python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379 > /tmp/dashboard_service.log 2>&1 &

Features:

📊 Real-time metrics: System status, latency, success rate, GPU memory, Ray workers
📜 Live log streaming: WebSocket-based log tailing (via Redis pub/sub)
🌐 Activity monitoring: Ollama server and llama.cpp RPC activity
🔷 Embedded Ray dashboard: Task-level distributed tracing
📈 Embedded Dask dashboard: Performance profiling and task graphs
🔍 Auto-discovery: Automatically discovers Ollama nodes and RPC backends when no router context is provided

Embedded Dashboard (Alternative)

Applications can also start their own embedded dashboards:

from sollol import run_unified_dashboard
import threading

# Start embedded dashboard with router context
dashboard_thread = threading.Thread(
    target=run_unified_dashboard,
    kwargs={
        "router": pool,  # Provides node/backend context
        "dashboard_port": 8080,
        "host": "0.0.0.0",
        "enable_dask": False
    },
    daemon=True
)
dashboard_thread.start()

Environment Variables (configure before initializing):

# Disable the dashboard (enabled by default)
export SOLLOL_DASHBOARD=false

# Change dashboard port (default: 8080)
export SOLLOL_DASHBOARD_PORT=9090

# Disable Dask dashboard integration (enabled by default)
export SOLLOL_DASHBOARD_DASK=false

Multi-Application Pattern ✨

The persistent dashboard service enables multiple applications to share observability:

# Terminal 1: Start persistent dashboard
python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379

# Terminal 2: Start application 1
python my_app1.py  # Auto-registers with dashboard

# Terminal 3: Start application 2
python my_app2.py  # Also auto-registers

# Visit http://localhost:8080 to see both applications!

Benefits:

Single dashboard for all SOLLOL-based applications
Dashboard stays running when applications exit
Aggregated logs from all applications (via Redis pub/sub)
Centralized observability for distributed systems

Programmatic Stats Access

# Get detailed stats
stats = pool.get_stats()
print(f"Total requests: {stats['total_requests']}")
print(f"Average latency: {stats['avg_latency_ms']}ms")
print(f"Success rate: {stats['success_rate']:.2%}")

# Per-node breakdown
for host, metrics in stats['hosts'].items():
    print(f"{host}: {metrics['latency_ms']}ms, {metrics['success_rate']:.2%}")

Prometheus Metrics

# Prometheus metrics endpoint
curl http://localhost:9090/metrics

# sollol_requests_total{host="gpu-1:11434",model="llama3.2"} 1234
# sollol_latency_seconds{host="gpu-1:11434"} 0.234
# sollol_success_rate{host="gpu-1:11434"} 0.98
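
If you want to consume these metrics without a Prometheus server, the text format is simple enough to parse directly. The sketch below handles plain `name{labels} value` lines like the samples above; `parse_metrics` is a hypothetical helper, and it does not handle escaped quotes inside label values.

```python
import re
from typing import Dict, List, Tuple

SAMPLE = '''\
sollol_requests_total{host="gpu-1:11434",model="llama3.2"} 1234
sollol_latency_seconds{host="gpu-1:11434"} 0.234
sollol_success_rate{host="gpu-1:11434"} 0.98
'''

_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> List[Tuple[str, Dict[str, str], float]]:
    """Return (metric name, labels, value) for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


metrics = parse_metrics(SAMPLE)
for name, labels, value in metrics:
    print(name, labels, value)
```

In practice the input would be the body returned by the metrics endpoint rather than the `SAMPLE` string.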

🔌 Integration Examples

🔗 Integration with SynapticLlamas & FlockParser

SOLLOL is the distributed inference platform for the complete AI ecosystem, powering both SynapticLlamas (multi-agent orchestration) and FlockParser (document RAG).

The Complete Stack

┌─────────────────────────────────────────────────────────────┐
│              SynapticLlamas (v0.1.0+)                       │
│          Multi-Agent System & Orchestration                 │
│  • Research agents  • Editor agents  • Storyteller agents  │
└───────────┬────────────────────────────────────┬───────────┘
            │                                    │
            │ RAG Queries                        │ Distributed
            │ (with pre-computed embeddings)     │ Inference
            │                                    │
     ┌──────▼──────────┐              ┌─────────▼────────────┐
     │  FlockParser    │              │      SOLLOL          │
     │  API (v1.0.4+)  │              │  Load Balancer       │
     │  Port: 8000     │              │  (v0.9.31+)          │
     └─────────────────┘              └──────────────────────┘
            │                                    │
            │ ChromaDB                          │ Intelligent
            │ Vector Store                      │ GPU/CPU Routing
            │                                    │
     ┌──────▼──────────┐              ┌─────────▼────────────┐
     │  Knowledge Base │              │  Ollama Nodes        │
     │  41 Documents   │              │  (Distributed)       │
     │  6,141 Chunks   │              │  GPU + CPU           │
     └─────────────────┘              └──────────────────────┘

Why This Integration Matters

Component	Role	Key Feature
SOLLOL	Distributed Inference	Intelligent GPU/CPU routing with load balancing
SynapticLlamas	Multi-Agent Orchestration	Research, Editor, Storyteller agents
FlockParser	Document RAG & Knowledge Base	ChromaDB vector store with 6,141+ chunks

Quick Start: Complete Ecosystem

# Install all three packages (auto-installs dependencies)
pip install synaptic-llamas  # Pulls in flockparser>=1.0.4 and sollol>=0.9.31

# Start FlockParser API
flockparse

# Run SynapticLlamas with SOLLOL + FlockParser integration
synaptic-llamas --interactive --distributed

Integration Example: Load Balanced RAG

from sollol import OllamaPool
from flockparser_adapter import FlockParserAdapter

# Initialize SOLLOL for distributed inference
sollol = OllamaPool.auto_configure()

# Initialize FlockParser adapter
flockparser = FlockParserAdapter("http://localhost:8000", remote_mode=True)

# Step 1: Generate embedding using SOLLOL (load balanced!)
user_query = "What does research say about quantum entanglement?"
embedding = sollol.embed(
    model="mxbai-embed-large",
    input=user_query
)
# SOLLOL routes to fastest GPU automatically

# Step 2: Query FlockParser with pre-computed embedding
rag_results = flockparser.query_remote(
    query=user_query,
    embedding=embedding,  # Skip FlockParser's embedding generation
    n_results=5
)
# FlockParser returns relevant chunks from 41 documents

# Potential gain: faster embedding when SOLLOL has faster nodes than FlockParser's host (2-5x is the author's estimate)

Production Integrations

SOLLOL is actively used in production by:

FlockParser - Document RAG Intelligence with distributed processing. FlockParser's legacy load balancing code was refactored and became core SOLLOL logic. FlockParser now uses SOLLOL directly via OllamaPool for intelligent routing across document embeddings and LLM queries.
SynapticLlamas - Multi-agent collaborative research framework. Uses SOLLOL's HybridRouter for distributed agent execution with RAG-enhanced research capabilities via FlockParser integration.

Related Projects:

SynapticLlamas - Multi-Agent Orchestration
FlockParser - Document RAG Intelligence

SynapticLlamas Integration

from sollol import SOLLOL, SOLLOLConfig
from synaptic_llamas import AgentOrchestrator

# Setup SOLLOL for multi-agent orchestration
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start(blocking=False)

# SynapticLlamas now uses SOLLOL for intelligent routing
orchestrator = AgentOrchestrator(
    llm_endpoint="http://localhost:8000/api/chat"
)

# All agents automatically distributed and optimized
orchestrator.run_parallel_agents([...])

FlockParser Integration

from sollol import OllamaPool

# FlockParser uses SOLLOL's OllamaPool directly
pool = OllamaPool(
    nodes=None,  # Auto-discover all Ollama nodes
    enable_intelligent_routing=True,
    exclude_localhost=True,
    discover_all_nodes=True,
    app_name="FlockParser",
    enable_ray=True
)

# All FlockParser document embeddings and queries route through SOLLOL
embeddings = pool.embed(model="mxbai-embed-large", input="document text")
response = pool.chat(model="llama3.2", messages=[{"role": "user", "content": "query"}])

LangChain Integration

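from langchain_community.llms import Ollama  # older LangChain releases: from langchain.llms import Ollama
from sollol import OllamaPool

# Use SOLLOL as LangChain backend
pool = OllamaPool.auto_configure()

# Point LangChain at the SOLLOL gateway (must already be running on port 8000)
llm = Ollama(
    base_url="http://localhost:8000",
    model="llama3.2"
)

# LangChain requests now go through SOLLOL
response = llm.invoke("What is quantum computing?")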

🏭 Production Deployment (Bare Metal)

For teams preferring bare metal infrastructure over containers, SOLLOL provides systemd-based deployment for production environments.

Multi-Node Bare Metal Setup

This setup assumes you have 3+ physical machines with Ollama installed. We'll configure SOLLOL as a centralized routing layer.

Architecture:

┌─────────────────────────────────────────┐
│   Central Router Machine (Control Plane)│
│   - SOLLOL Dashboard (port 8080)       │
│   - Redis (port 6379)                  │
│   - Optional: GPU reporter             │
└────────────┬────────────────────────────┘
             │ Auto-discovery via network
             │ scan (ports 11434)
     ┌───────┼──────────┬─────────────┐
     ▼       ▼          ▼             ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node 1  │ │ Node 2  │ │ Node 3  │ │ Node N  │
│ Ollama  │ │ Ollama  │ │ Ollama  │ │ Ollama  │
│ :11434  │ │ :11434  │ │ :11434  │ │ :11434  │
│ GPU 24GB│ │ GPU 16GB│ │ CPU 64c │ │ ...     │
└─────────┘ └─────────┘ └─────────┘ └─────────┘

Step 1: Install Ollama on each node

On each worker node (Node 1, 2, 3, ...):

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
sudo systemctl enable ollama
sudo systemctl start ollama

# Verify it's running
curl http://localhost:11434/api/tags

Step 2: Install SOLLOL on control plane machine

On your central router machine:

# Install SOLLOL and dependencies
pip install sollol redis

# Install Redis
sudo apt-get install redis-server  # Ubuntu/Debian
# OR
sudo yum install redis              # RHEL/CentOS

# Start Redis
sudo systemctl enable redis
sudo systemctl start redis

Step 3: Create systemd service for SOLLOL Dashboard

Create /etc/systemd/system/sollol-dashboard.service:

[Unit]
Description=SOLLOL Dashboard Service
After=network.target redis.service
Requires=redis.service

[Service]
Type=simple
# Create a dedicated user for security (systemd does not allow trailing comments on a setting line)
User=sollol
Group=sollol
WorkingDirectory=/opt/sollol
Environment="SOLLOL_DASHBOARD=true"
Environment="SOLLOL_DASHBOARD_PORT=8080"
Environment="REDIS_URL=redis://localhost:6379"
ExecStart=/usr/bin/python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Enable and start:

sudo useradd -r -s /bin/false sollol  # Create dedicated user
sudo mkdir -p /opt/sollol
sudo chown sollol:sollol /opt/sollol

sudo systemctl daemon-reload
sudo systemctl enable sollol-dashboard
sudo systemctl start sollol-dashboard

# Verify
sudo systemctl status sollol-dashboard
curl http://localhost:8080/health

Step 4: Install GPU reporters on nodes (optional but recommended)

On each GPU node for accurate VRAM monitoring:

# Install on each node with GPUs
pip install sollol gpustat

# Run GPU reporter (publishes to central Redis)
sollol install-gpu-reporter --redis-host <control-plane-ip>

# Example: run on GPU node 10.9.66.45, reporting to the control plane at 10.9.66.154
sollol install-gpu-reporter --redis-host 10.9.66.154

Create /etc/systemd/system/sollol-gpu-reporter.service on each GPU node:

[Unit]
Description=SOLLOL GPU Reporter
After=network.target

[Service]
Type=simple
User=sollol
ExecStart=/usr/local/bin/sollol-gpu-reporter --redis-host <control-plane-ip> --interval 5
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Step 5: Configure firewall rules

On all nodes:

# Allow Ollama traffic (port 11434)
sudo ufw allow 11434/tcp comment "Ollama API"

# On control plane only: allow dashboard access
sudo ufw allow 8080/tcp comment "SOLLOL Dashboard"
sudo ufw allow from 10.9.66.0/24 to any port 6379 proto tcp comment "Redis"  # Only from the trusted subnet (adjust to yours)

# Reload firewall
sudo ufw reload

Step 6: Test the deployment

From any machine with network access:

from sollol import OllamaPool

# SOLLOL auto-discovers all nodes via network scan
pool = OllamaPool.auto_configure()

# Verify nodes discovered
stats = pool.get_stats()
print(f"Discovered {stats['active_nodes']} nodes")

# Make a test request
response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response['message']['content'])
print(f"Routed to: {response['_sollol_routing']['host']}")

Step 7: Monitor with systemd

# Check dashboard status
sudo systemctl status sollol-dashboard

# View live logs
sudo journalctl -u sollol-dashboard -f

# Check GPU reporters
sudo systemctl status sollol-gpu-reporter

# View metrics
curl http://localhost:8080/api/stats | jq

Production Hardening

Security:

# 1. Run SOLLOL as dedicated unprivileged user
sudo useradd -r -s /bin/false sollol

# 2. Configure Redis authentication
sudo vi /etc/redis/redis.conf
# Add: requirepass <strong-password>

# 3. Use firewall to restrict access
sudo ufw allow from 10.9.66.0/24 to any port 6379  # Redis from trusted subnet only
sudo ufw allow from 10.9.66.0/24 to any port 8080  # Dashboard from trusted subnet

High Availability:

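# Use systemd watchdog for automatic restart on crashes and hangs
# Note: WatchdogSec requires the service to send sd_notify("WATCHDOG=1") keep-alives;
# drop that line if it does not, or systemd will kill the process when the timeout expires.
[Service]
WatchdogSec=30
Restart=always
RestartSec=10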

Monitoring:

# Integrate with Prometheus
curl http://localhost:9090/metrics

# Or use systemd monitoring
systemctl status sollol-dashboard | grep "Active:"

Troubleshooting

Nodes not discovered:

# Check network connectivity
for ip in 10.9.66.{1..254}; do
    timeout 0.5 bash -c "cat < /dev/null > /dev/tcp/$ip/11434 2>/dev/null" && echo "$ip:11434 reachable"
done

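# Check Ollama is listening on all interfaces (not just localhost)
# Ollama binds to 127.0.0.1 by default; set OLLAMA_HOST=0.0.0.0 in its service environment to expose it
curl http://<node-ip>:11434/api/tags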

Dashboard not starting:

# Check Redis is running
systemctl status redis
redis-cli ping  # Should return "PONG"

# Check port not in use
sudo lsof -i :8080

# View detailed logs
journalctl -u sollol-dashboard --since "10 minutes ago"

Performance issues:

# Check node health
curl http://localhost:8080/api/stats | jq '.node_performance'

# Monitor resource usage
htop
nvidia-smi  # On GPU nodes

# Check network latency between nodes
ping <node-ip>

📚 Documentation

Architecture Guide - Deep dive into system design
Backend Architecture - Backend extensibility and adding new LLM backends
Batch Processing API - Complete guide to batch job management (NEW in v0.7.0)
- API endpoints and examples
- Job lifecycle and progress tracking
- Best practices and error handling
llama.cpp Distributed Inference Guide - Complete guide to model sharding
- Setup and configuration
- Performance optimization
- Troubleshooting common issues
- Advanced topics (custom layer distribution, monitoring, etc.)
Integration Examples - Practical integration patterns
llama.cpp Distributed Examples - Model sharding examples
- Auto-setup and manual configuration
- Multi-turn conversations with monitoring
- Batch processing with multiple models
- Error handling and recovery patterns
Deployment Guide - Production deployment patterns
API Reference - Complete API documentation
Performance Tuning - Optimization guide
SynapticLlamas Learnings - Features from production use

🆕 What's New in v0.7.0

📦 Batch Processing API

Complete RESTful API for asynchronous batch job management. Submit large-scale batch operations (embeddings, bulk inference) and track progress via job IDs.

import requests

# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
    "model": "nomic-embed-text",
    "documents": ["doc1", "doc2", ...],  # Thousands of documents
})
job_id = response.json()["job_id"]

# Check status
status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}")
print(status.json()["progress"]["percent"])  # e.g. 100.0 once the job has completed

# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}")
embeddings = results.json()["results"]

Batch API Endpoints:

POST /api/batch/embed - Submit batch embedding job
GET /api/batch/jobs/{job_id} - Get job status with progress tracking
GET /api/batch/results/{job_id} - Retrieve job results and errors
DELETE /api/batch/jobs/{job_id} - Cancel running jobs
GET /api/batch/jobs?limit=100 - List recent jobs

Features:

UUID-based job tracking with 5 states (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
Automatic TTL-based cleanup (1 hour default)
Progress tracking: completed_items, failed_items, percentage
Duration calculation and metadata storage
Async job execution via Dask distributed processing
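
Because jobs run asynchronously, a client usually polls the job endpoint until it reaches a terminal state. The helper below is a sketch of that loop. The five state names come from the list above, but the `status` field name in the response and the `wait_for_job` helper itself are assumptions.

```python
import time
from typing import Callable, Dict

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job(
    fetch_status: Callable[[], Dict],
    poll_interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll `fetch_status` until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:   # assumed field name
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job.get('status')!r} after {timeout_s}s")
        sleep(poll_interval_s)


# Demo with canned responses instead of a live server.
responses = iter([
    {"status": "PENDING", "progress": {"percent": 0.0}},
    {"status": "RUNNING", "progress": {"percent": 50.0}},
    {"status": "COMPLETED", "progress": {"percent": 100.0}},
])
final = wait_for_job(lambda: next(responses), sleep=lambda _seconds: None)
print(final["status"], final["progress"]["percent"])
```

Against a real server, `fetch_status` would wrap a GET request to the job status endpoint and return its decoded JSON body.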

⚡ Performance Optimizations (v0.9.18+)

SOLLOL now includes 8 production-grade performance optimizations designed to improve throughput and latency:

⚠️ Transparency Note: These features are implemented and functional, but claimed performance improvements are projections based on architecture, NOT independently validated benchmarks. See Performance Impact section below for details.

🚀 Response Caching Layer

Expected Impact: Reduces latency for repeated queries (cache hit/miss tracking validated)

Intelligent LRU cache with TTL expiration:

from sollol import OllamaPool

# Enable response caching (enabled by default)
pool = OllamaPool.auto_configure(
    enable_cache=True,
    cache_max_size=1000,  # Cache up to 1000 responses
    cache_ttl=3600        # 1 hour TTL
)

# First request: normal latency
response1 = pool.embed(model="mxbai-embed-large", input="Hello world")

# Cached request: faster
response2 = pool.embed(model="mxbai-embed-large", input="Hello world")  # Cache hit

# Programmatic cache management
pool.clear_cache()                              # Clear all
pool.invalidate_cache_by_model("llama3.2")     # Invalidate by model
cache_data = pool.export_cache()                # Export for persistence
pool.import_cache(cache_data)                   # Restore from export

# Get cache stats
stats = pool.get_cache_stats()
print(f"Hit rate: {stats['hit_rate']:.1%}")    # 85.2%
print(f"Cache size: {stats['size']}")           # 234/1000
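
For readers unfamiliar with the combination, an LRU cache with TTL expiration can be sketched with an `OrderedDict`. This shows the general technique only; `LRUTTLCache` is a hypothetical class and says nothing about how SOLLOL builds cache keys or stores responses.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class LRUTTLCache:
    """Evicts the least recently used entry when full; entries expire after ttl_s."""

    def __init__(self, max_size: int = 1000, ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_size = max_size
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: "OrderedDict[Hashable, tuple]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is not None and self._clock() - entry[0] < self._ttl_s:
            self._entries.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return entry[1]
        self._entries.pop(key, None)           # drop the entry if it has expired
        self.misses += 1
        return None

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used


# Demo with a fake clock and a tiny capacity.
now = [0.0]
cache = LRUTTLCache(max_size=2, ttl_s=10.0, clock=lambda: now[0])
cache.put("a", 1)
cache.put("b", 2)
first = cache.get("a")      # hit; "a" becomes most recently used
cache.put("c", 3)           # capacity exceeded, so "b" is evicted
evicted = cache.get("b")    # miss
now[0] = 11.0
expired = cache.get("a")    # miss: the entry is older than the TTL
print(first, evicted, expired, cache.hits, cache.misses)
```

The hit rate reported by a cache like this is `hits / (hits + misses)`.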

🌊 Streaming Support

Expected Impact: Better UX, reduced perceived latency (streaming functionality validated)

Token-by-token streaming for chat() and generate():

# Stream chat responses
for chunk in pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
):
    content = chunk.get("message", {}).get("content", "")
    print(content, end="", flush=True)

# Stream text generation
for chunk in pool.generate(
    model="llama3.2",
    prompt="Explain quantum computing",
    stream=True
):
    print(chunk.get("response", ""), end="", flush=True)

🔥 Smart Model Prefetching

Expected Impact: 1-5 seconds reduced first-request latency (projection, not measured)

Pre-load models into VRAM before first use:

# Warm a single model
pool.warm_model("llama3.2")

# Warm multiple models in parallel
results = pool.warm_models(
    models=["llama3.2", "codellama", "mistral"],
    parallel=True
)
print(f"Warmed {sum(results.values())} models")

⚡ Async I/O Support

Expected Impact: 2-3x throughput for concurrent requests (projection, not measured)

True non-blocking I/O with httpx AsyncClient:

import asyncio

# Async methods for concurrent requests
async def process_batch():
    responses = await asyncio.gather(
        pool.chat_async("llama3.2", messages=[...]),
        pool.generate_async("llama3.2", prompt="..."),
        pool.embed_async("mxbai-embed-large", input="...")
    )
    return responses

# Run async batch
results = asyncio.run(process_batch())

🔗 HTTP/2 Multiplexing

Expected Impact: 30-50% latency reduction for concurrent requests (projection, not measured)

Automatic HTTP/2 support when httpx is installed:

# Automatically uses HTTP/2 if available
pool = OllamaPool.auto_configure()

# Check if HTTP/2 is enabled
stats = pool.get_stats()
print(f"HTTP/2 enabled: {stats['http2_enabled']}")  # True

📊 Additional Optimizations

Connection Pool Tuning (10-20% better concurrency):

Optimized pool sizes: 10-20 connections per node
Automatic retry with exponential backoff
Connection reuse with keep-alive
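
Retry with exponential backoff, as mentioned in the list above, generally looks like the sketch below. The attempt count, delays, and jitter range are illustrative assumptions, and `retry_with_backoff` is a hypothetical helper rather than a SOLLOL function.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    """Retry `call` on ConnectionError, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                                    # out of attempts
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(delay * (0.5 + 0.5 * jitter()))        # wait 50-100% of the delay
    raise ValueError("max_attempts must be at least 1")


# Demo: the call fails twice, then succeeds. Sleeps are recorded, not performed.
attempts = []
delays = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient failure")
    return "ok"


outcome = retry_with_backoff(flaky, sleep=delays.append, jitter=lambda: 1.0)
print(outcome, delays)
```

The jitter spreads retries from many clients over time so that a recovering node is not hit by all of them at once.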

Adaptive Health Checks (5-10% overhead reduction):

Dynamic intervals based on node stability:
- Very stable (<1% failures): 60s interval
- Stable (<5% failures): 30s interval
- Degraded (5-15% failures): 15s interval
- Unstable (>15% failures): 5s interval
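
The interval table above maps directly onto a small function. This is a sketch of the thresholds as listed, with one assumption: a failure rate of exactly 15% is treated as "degraded". `health_check_interval` is a hypothetical name.

```python
def health_check_interval(failure_rate: float) -> int:
    """Return the health-check interval in seconds for a failure rate in [0, 1]."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if failure_rate < 0.01:
        return 60    # very stable
    if failure_rate < 0.05:
        return 30    # stable
    if failure_rate <= 0.15:
        return 15    # degraded (boundary handling is an assumption)
    return 5         # unstable


for rate in (0.0, 0.03, 0.10, 0.50):
    print(f"{rate:.0%} failures -> check every {health_check_interval(rate)}s")
```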

Telemetry Sampling (~90% overhead reduction):

Configurable sampling for info-level events (default: 10%)
Always logs errors and critical events
Reduces dashboard logging overhead
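
Sampling of this kind can be expressed as a standard `logging.Filter`. The sketch below keeps every record above info level and a random 10% of the rest. It illustrates the idea with the standard library and is not SOLLOL's telemetry code: `SamplingFilter` is a hypothetical class, and it also keeps all warnings, which is an assumption.

```python
import logging
import random
from typing import Optional


class SamplingFilter(logging.Filter):
    """Always passes WARNING and above; passes lower levels at `sample_rate`."""

    def __init__(self, sample_rate: float = 0.10,
                 rng: Optional[random.Random] = None) -> None:
        super().__init__()
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True                        # never drop warnings or errors
        return self._rng.random() < self.sample_rate


# Demo: feed synthetic records through the filter and count what survives.
sampler = SamplingFilter(sample_rate=0.10, rng=random.Random(0))
info = logging.makeLogRecord({"levelno": logging.INFO, "levelname": "INFO"})
error = logging.makeLogRecord({"levelno": logging.ERROR, "levelname": "ERROR"})
kept_info = sum(sampler.filter(info) for _ in range(1000))
kept_errors = sum(sampler.filter(error) for _ in range(1000))
print(f"info kept: {kept_info}/1000, errors kept: {kept_errors}/1000")
```

A filter like this is attached with `addFilter` on a logger or handler, so the sampling decision happens before any formatting or network I/O.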

📈 Performance Impact

⚠️ IMPORTANT: These are architectural projections, NOT measured results

These optimizations are implemented and functional, but multi-node performance gains have not been independently validated:

Projected improvements (unvalidated):

Throughput: +150-300% for concurrent workloads (theory: parallel request handling)
Latency: -40-70% for typical requests (theory: caching + HTTP/2)
Cache hits: Significant latency reduction for repeated queries (validated in single-node tests)

What's actually measured:

✅ Response caching works (cache hit/miss rates tracked)
✅ Streaming works (token-by-token delivery confirmed)
✅ HTTP/2 enabled (httpx connection verified)
⚠️ Multi-node throughput gains: Not independently benchmarked

To validate these claims yourself:

# Run comparative benchmarks
cd benchmarks
python run_benchmarks.py --sollol-url http://localhost:8000 --duration 120

See BENCHMARKING.md for methodology.

Previous Features (v0.3.6+)

Synchronous API - No async/await required:

from sollol.sync_wrapper import OllamaPool
pool = OllamaPool.auto_configure()
response = pool.chat(...)  # Synchronous call

Priority Helpers - Semantic priority levels:

from sollol.priority_helpers import Priority
priority = Priority.HIGH  # 7

SOLLOL Detection:

X-Powered-By: SOLLOL header on all responses
/api/health endpoint returns {"service": "SOLLOL", "version": "0.7.0"}

🆚 Comparison

SOLLOL vs. Simple Load Balancers

Feature	nginx/HAProxy	SOLLOL
Routing	Round-robin/random	Context-aware, adapts from history
Resource awareness	None	GPU/CPU/memory-aware
Failover	Manual config	Automatic detection & recovery
Model sharding	❌	✅ llama.cpp integration
Task prioritization	❌	✅ Priority queue
Observability	Basic	Rich metrics + dashboard
Setup	Complex config	Auto-discover

SOLLOL vs. Kubernetes

Feature	Kubernetes	SOLLOL
Complexity	High - requires cluster setup	Low - pip install
AI-specific	Generic container orchestration	Purpose-built for LLMs
Intelligence	None	Task-aware routing
Model sharding	Manual	Automatic
Best for	Large-scale production	AI-focused teams

Use both! Deploy SOLLOL on Kubernetes for ultimate scalability.

🤝 Contributing

We welcome contributions! Areas we'd love help with:

ML-based routing predictions
Additional monitoring integrations
Cloud provider integrations
Performance optimizations
Documentation improvements

See CONTRIBUTING.md for guidelines.

📜 License

MIT License - see LICENSE file for details.

🙏 Credits

Created by BenevolentJoker-JohnL

Part of the Complete AI Ecosystem:

SynapticLlamas - Multi-Agent Orchestration
FlockParser - Document RAG Intelligence
SOLLOL - Distributed Inference Platform (this project)

Special Thanks:

Dallan Loomis - For always providing invaluable support, feedback, and guidance throughout development. Your insights and encouragement have been instrumental in shaping this project.

Built with: Ray, Dask, FastAPI, llama.cpp, Ollama

🎯 What Makes SOLLOL Different?

Combines task distribution AND model sharding in one system
Context-aware routing that adapts based on performance metrics
Auto-discovery of nodes with minimal configuration
Built-in failover and priority queuing
Purpose-built for Ollama clusters (understands GPU requirements, task types)

Limitations to know:

Model sharding verified with 13B models; larger models not extensively tested
Performance benefits depend on network latency and workload patterns
Not a drop-in replacement for single-node setups in all scenarios

Stop manually managing your LLM cluster. Let SOLLOL optimize it for you.

Get Started • View on GitHub • Report Issue
