No description
  • Python 97.2%
  • HTML 1.7%
  • Shell 1%
Find a file
Hackall 6c69f8506e
Some checks failed
Lint / lint (push) Has been cancelled
Tests / test (3.10) (push) Has been cancelled
Tests / test (3.11) (push) Has been cancelled
Tests / test (3.9) (push) Has been cancelled
Tests / integration-test (push) Has been cancelled
Merge pull request #29 from hackall360/codex/create-typer-cli-for-full-project-demo
Add full project demo CLI for orchestrating SOLLOL examples
2025-10-18 10:43:11 -07:00
.github feat: add GitHub Actions CI with multi-node integration testing 2025-10-16 23:07:47 -05:00
benchmarks Add real benchmark results and fix CLI entry point 2025-10-06 08:31:07 -05:00
config Add production deployment artifacts and real benchmarks 2025-10-06 08:36:17 -05:00
docs docs: add dashboard screenshots 2025-10-17 08:31:35 -05:00
examples Add full project demo CLI and update live dashboard summary 2025-10-18 10:42:51 -07:00
kubernetes Add production deployment artifacts and real benchmarks 2025-10-06 08:36:17 -05:00
scripts docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
src/sollol feat: add routing decision tooltips to dashboard 2025-10-16 23:20:07 -05:00
SynapticLlamas@24be170a2d feat: add routing decision tooltips to dashboard 2025-10-16 23:20:07 -05:00
systemd docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
tests Rename mock server helper to start_background_process 2025-10-18 05:30:29 -07:00
.dockerignore Add production-ready security, deployment, and fault tolerance 2025-10-03 05:49:02 -05:00
.gitignore docs: update .gitignore and add Credits section with special thanks to Dallan Loomis 2025-10-16 22:26:41 -05:00
ARCHITECTURE.md Make documentation more honest and factual, reduce overselling 2025-10-05 20:39:20 -05:00
BACKENDS.md Document backend extensibility architecture 2025-10-07 00:15:07 -05:00
BATCH_API.md Update documentation for v0.7.0 batch API release 2025-10-06 17:38:59 -05:00
BENCHMARKING.md Add honest validation status and benchmarking guide 2025-10-06 09:39:23 -05:00
BENCHMARKS.md Add professional portfolio enhancements 2025-10-03 00:10:29 -05:00
CODE_WALKTHROUGH.md Clean up claims and add detailed code walkthrough 2025-10-06 11:55:32 -05:00
compare_discovery_modes.py fix: network discovery now finds ALL nodes on subnet 2025-10-06 21:00:29 -05:00
COMPLETE_SUMMARY.md Add complete implementation summary for v0.3.6 2025-10-05 22:19:09 -05:00
config.yml Initial commit: SOLLOL - Super Ollama Load Balancer 2025-10-02 18:51:38 -05:00
CONFIGURATION.md Add comprehensive configuration guide 2025-10-06 14:43:52 -05:00
CONTRIBUTING.md Add professional portfolio enhancements 2025-10-03 00:10:29 -05:00
COORDINATOR_REUSE_INVESTIGATION.md Fix flake8 F824 errors: Remove unused global declarations 2025-10-05 18:05:21 -05:00
dashboard.html feat: add routing decision tooltips to dashboard 2025-10-16 23:20:07 -05:00
DEPLOY_GPU_REPORTER.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
DEPLOYMENT.md feat: Integrate SynapticLlamas with SOLLOL as drop-in replacement 2025-10-03 10:04:12 -05:00
DEPLOYMENT_AWARE_RESOLUTION.md feat: add heterogeneous GPU parallelization with VRAM-aware routing 2025-10-06 20:13:57 -05:00
DISTRIBUTED_INFERENCE_STATUS.md feat: distributed inference stability and GPU monitoring 2025-10-12 10:47:59 -05:00
docker-compose.test.yml Clean up claims and add detailed code walkthrough 2025-10-06 11:55:32 -05:00
docker-compose.yml feat: Integrate SynapticLlamas with SOLLOL as drop-in replacement 2025-10-03 10:04:12 -05:00
DOCKER_IP_RESOLUTION.md feat: add heterogeneous GPU parallelization with VRAM-aware routing 2025-10-06 20:13:57 -05:00
DOCKER_SETUP.md Add Docker functional testing and comprehensive setup guide 2025-10-06 12:03:02 -05:00
Dockerfile feat: Integrate SynapticLlamas with SOLLOL as drop-in replacement 2025-10-03 10:04:12 -05:00
FLOCKPARSER_FEATURES_ANALYSIS.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
GPU_MONITORING_GUIDE.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
GPU_MONITORING_SETUP.md feat: distributed inference stability and GPU monitoring 2025-10-12 10:47:59 -05:00
gpu_reporter.py docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
GRAFANA_SETUP.md release: v0.9.47 - RPC backend fixes and FlockParser integration 2025-10-11 09:16:57 -05:00
HONEST_STATUS.md Initial commit: SOLLOL - Super Ollama Load Balancer 2025-10-02 18:51:38 -05:00
INTEGRATION_BASIC.md Add CI/CD pipeline and enhance repository visibility 2025-10-03 00:19:25 -05:00
INTEGRATION_COMPLETE.md docs: Add integration completion summary 2025-10-03 10:12:41 -05:00
INTEGRATION_GUIDE.md Add CI/CD pipeline and enhance repository visibility 2025-10-03 00:19:25 -05:00
KNOWN_ISSUES.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
KNOWN_LIMITATIONS.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
LEGACY_GPU_SUPPORT.md feat: add heterogeneous GPU parallelization with VRAM-aware routing 2025-10-06 20:13:57 -05:00
LICENSE 🚀 Integrate distributed inference and make SOLLOL standalone package 2025-10-04 14:38:12 -05:00
llamacpp_github_issue.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
MANIFEST.in Prepare v0.3.6 for PyPI publication 2025-10-05 22:01:18 -05:00
mkdocs.yml Add professional documentation site with MkDocs and GitHub Pages 2025-10-03 06:27:50 -05:00
MULTI_APP_ARCHITECTURE.md Add comprehensive multi-application architecture guide 2025-10-06 12:25:48 -05:00
ollama_discussion_discord_final.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
ollama_discussion_draft.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
ollama_discussion_draft_v2.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
ollama_discussion_final.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
ollama_discussion_followup.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
ollama_github_issue.md docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
PHASE1_IMPLEMENTATION_COMPLETE.md Add Phase 1 features: Sync API, Priority Helpers, SOLLOL Detection (v0.3.6) 2025-10-05 21:58:07 -05:00
PHASE2_COMPLETE.md Document Phase 2 completion: Code consolidation successful 2025-10-05 22:10:18 -05:00
PHASE2_PROGRESS.md Document Phase 2 completion: Code consolidation successful 2025-10-05 22:10:18 -05:00
PRODUCTION_READINESS.md Fix missing exports and add production readiness assessment 2025-10-05 22:46:35 -05:00
prometheus.yml Add production-ready security, deployment, and fault tolerance 2025-10-03 05:49:02 -05:00
PUBLISH.md 🚀 Integrate distributed inference and make SOLLOL standalone package 2025-10-04 14:38:12 -05:00
PUBLISHING.md Add PyPI packaging for pip install support 2025-10-03 06:41:59 -05:00
PYPI_PUBLICATION_SUCCESS.md Published SOLLOL v0.3.6 to PyPI 2025-10-05 22:16:59 -05:00
pyproject.toml Move gateway local cluster integration test and gate on Ollama 2025-10-18 01:46:58 -07:00
QUICK_START.md release: v0.9.47 - RPC backend fixes and FlockParser integration 2025-10-11 09:16:57 -05:00
README.md fix: make performance claims transparent and honest 2025-10-17 09:28:10 -05:00
REDIS_SETUP.md Implement Redis-based distributed coordination 2025-10-06 12:33:14 -05:00
REMOTE_ACCESS_SETUP_GUIDE.md release: v0.9.47 - RPC backend fixes and FlockParser integration 2025-10-11 09:16:57 -05:00
REMOTE_ACCESS_STATUS.md release: v0.9.47 - RPC backend fixes and FlockParser integration 2025-10-11 09:16:57 -05:00
ROUTING_LOGS.md feat: add live routing decision logs with real-time dashboard 2025-10-11 14:48:57 -05:00
ROUTING_STRATEGIES.md feat: add 5 routing strategies with clean extension points 2025-10-16 22:47:34 -05:00
RPC_BACKEND_FIX.md docs: remove IP addresses from RPC_BACKEND_FIX.md 2025-10-12 10:55:00 -05:00
SECURITY.md Add production-ready security, deployment, and fault tolerance 2025-10-03 05:49:02 -05:00
SESSION_SUMMARY.md release: v0.9.47 - RPC backend fixes and FlockParser integration 2025-10-11 09:16:57 -05:00
setup.py docs: reorganize documentation structure 2025-10-16 17:54:37 -05:00
setup_llama_cpp.py 🚀 Integrate distributed inference and make SOLLOL standalone package 2025-10-04 14:38:12 -05:00
test_activity.py release: v0.9.47 - RPC backend fixes and FlockParser integration 2025-10-11 09:16:57 -05:00
test_batch_api.py Add batch processing API endpoints (v0.7.0) 2025-10-06 16:58:32 -05:00
test_connection_reuse.py perf: add HTTP connection reuse to OllamaPool 2025-10-12 11:11:51 -05:00
test_dashboard.py release: v0.9.47 - RPC backend fixes and FlockParser integration 2025-10-11 09:16:57 -05:00
test_dashboard_fallback_simple.py docs: document Dask worker logging as known issue 2025-10-07 10:33:21 -05:00
test_dashboards.py feat: restore network infrastructure panels to unified dashboard 2025-10-07 00:51:22 -05:00
test_dask_adaptive.py feat: add Dask distributed batch processing with adaptive routing 2025-10-12 19:08:35 -05:00
test_dask_batch.py feat: add Dask distributed batch processing with adaptive routing 2025-10-12 19:08:35 -05:00
test_dask_comparison.py feat: add Dask distributed batch processing with adaptive routing 2025-10-12 19:08:35 -05:00
test_docker.sh Add Docker functional testing and comprehensive setup guide 2025-10-06 12:03:02 -05:00
test_embed_batch.py feat: distributed inference stability and GPU monitoring 2025-10-12 10:47:59 -05:00
test_failure_recovery.py feat: add failure & recovery simulation demonstration 2025-10-16 23:11:54 -05:00
test_full_network_discovery.py fix: network discovery now finds ALL nodes on subnet 2025-10-06 21:00:29 -05:00
test_multi_app_dashboard.py docs: document Dask worker logging as known issue 2025-10-07 10:33:21 -05:00
test_new_features.py Add comprehensive feature test script 2025-10-06 16:04:25 -05:00
test_observer_debug.py release: v0.9.47 - RPC backend fixes and FlockParser integration 2025-10-11 09:16:57 -05:00
test_optimizations.py feat: add Dask distributed batch processing with adaptive routing 2025-10-12 19:08:35 -05:00
test_ray_features.py fix: correct imports for GGUF resolver in Ray routers 2025-10-06 22:32:36 -05:00
test_resilience.py Add production resilience features to SOLLOL v0.5.0 2025-10-06 16:18:48 -05:00
TEST_RESULTS.md Initial commit: SOLLOL - Super Ollama Load Balancer 2025-10-02 18:51:38 -05:00
test_routing_log.py feat: add live routing decision logs with real-time dashboard 2025-10-11 14:48:57 -05:00
test_routing_strategies.py feat: add 5 routing strategies with clean extension points 2025-10-16 22:47:34 -05:00
test_rpc_backend_fix.py release: v0.9.47 - RPC backend fixes and FlockParser integration 2025-10-11 09:16:57 -05:00
test_vram_monitoring.py Add pre-routing VRAM size checks and model size estimation 2025-10-06 18:16:08 -05:00
test_websocket_client.py release: v0.9.47 - RPC backend fixes and FlockParser integration 2025-10-11 09:16:57 -05:00
UNIVERSAL_DASHBOARD.md feat: add universal network observability and application tracking 2025-10-07 00:30:30 -05:00
verify_dashboards.py feat: add universal network observability and application tracking 2025-10-07 00:30:30 -05:00

SOLLOL - Production-Ready Orchestration for Local LLM Clusters

Python 3.8+ License: MIT Tests codecov Ollama llama.cpp

Open-source orchestration layer that combines intelligent task routing with distributed model inference for local LLM clusters.

Quick StartFeaturesArchitectureDocumentationExamples


🎯 What is SOLLOL?

SOLLOL (Super Ollama Load balancer & Orchestration Layer) transforms your collection of Ollama nodes into an intelligent AI cluster with adaptive routing and automatic failover—all running on your own hardware.

The Problem

You have multiple machines with GPUs running Ollama, but:

  • Manual node selection for each request
  • No way to run models larger than your biggest GPU
  • Can't distribute multi-agent workloads efficiently
  • No automatic failover or load balancing
  • Zero visibility into cluster performance

The SOLLOL Solution

SOLLOL provides:

  • Intelligent routing that learns which nodes work best for each task
  • Model sharding to run 70B+ models across multiple machines
  • Parallel agent execution for multi-agent frameworks
  • Auto-discovery of all nodes and capabilities
  • Built-in observability with real-time metrics
  • Zero-config deployment - just point and go

Quickstart (3 Commands)

# 1. Install SOLLOL
pip install sollol

# 2. Start the dashboard (optional but recommended)
python3 -m sollol.dashboard_service &

# 3. Run your first query
python3 -c "from sollol import OllamaPool; pool = OllamaPool.auto_configure(); print(pool.chat(model='llama3.2', messages=[{'role': 'user', 'content': 'Hello!'}])['message']['content'])"

What just happened?

  • SOLLOL auto-discovered all Ollama nodes on your network
  • Intelligently routed your request to the best available node
  • Dashboard live at http://localhost:8080 (shows routing decisions, metrics, logs)

Expected output:

Discovering Ollama nodes...
Found 3 nodes: 10.9.66.45:11434, 10.9.66.154:11434, localhost:11434
Selected node: 10.9.66.45:11434 (GPU, 12ms latency)
Hello! How can I help you today?

Next steps:

  • Visit http://localhost:8080 to see the dashboard
  • Check Full Quick Start for production setup
  • Read Examples for multi-agent, batch, and sharding patterns

🚀 Full Quick Start

Installation

pip install sollol

Basic Usage

from sollol import OllamaPool

# Auto-discover nodes and start routing
pool = OllamaPool.auto_configure()

# Make requests - SOLLOL routes intelligently
response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)

Enable Real-Time GPU Monitoring

For accurate VRAM-aware routing, install the GPU reporter on each node:

# On each Ollama node, run:
sollol install-gpu-reporter --redis-host <redis-server-ip>

# Example:
sollol install-gpu-reporter --redis-host 10.9.66.154

What this does:

  • Installs vendor-agnostic GPU monitoring (NVIDIA/AMD/Intel via gpustat)
  • Publishes real-time VRAM stats to Redis every 5 seconds
  • SOLLOL uses this data for intelligent routing decisions
  • See GPU Monitoring Guide for details

Without GPU monitoring: SOLLOL falls back to estimates which may be inaccurate.


📸 Screenshots

Dashboard Overview

SOLLOL Unified Dashboard Real-time monitoring with P50/P95/P99 latency metrics, network nodes, RPC backends, and active applications

Ray & Dask Integration

Ray and Dask Dashboards Embedded Ray and Dask dashboards for distributed task monitoring

Activity Monitoring

Real-time Activity Logs Live request/response activity streams from Ollama nodes and RPC backends

Applications & Traces

Performance Analytics Applications, distributed traces, and Ollama activity logs with real-time request/response tracking


🔥 Why SOLLOL?

1. Two Distribution Modes in One System

SOLLOL combines both task distribution and model sharding:

📊 Task Distribution (Horizontal Scaling)

Distribute multiple requests across your cluster in parallel:

# Run 10 agents simultaneously across 5 nodes
pool = OllamaPool.auto_configure()
responses = await asyncio.gather(*[
    pool.chat(model="llama3.2", messages=[...])
    for _ in range(10)
])
# Parallel execution across available nodes

🧩 Model Sharding (Vertical Scaling)

Run single large models that don't fit on one machine:

# Run larger models across multiple nodes
# Note: Verified with 13B across 2-3 nodes; larger models not extensively tested
router = HybridRouter(
    enable_distributed=True,
    num_rpc_backends=4
)
response = await router.route_request(
    model="llama3:70b",  # Sharded automatically
    messages=[...]
)

Use them together! Small models use task distribution, large models use sharding.


2. Intelligent, Not Just Balanced

SOLLOL doesn't just distribute requests randomly—it learns and optimizes:

Feature Simple Load Balancer SOLLOL
Routing Round-robin Context-aware scoring
Learning None Adapts from performance history
Resource Awareness None GPU/CPU/memory-aware
Task Optimization None Routes by task type complexity
Failover Manual Automatic with health checks
Priority FIFO Priority queue with fairness

Example: SOLLOL automatically routes:

  • Heavy generation tasks → GPU nodes with 24GB VRAM
  • Fast embeddings → CPU nodes or smaller GPUs
  • Critical requests → Fastest, most reliable nodes
  • Batch processing → Lower priority, distributed load

3. Production-Ready from Day One

from sollol import SOLLOL, SOLLOLConfig

# Literally 3 lines to production
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start()  # ✅ Gateway running on :8000

Out of the box:

  • Auto-discovery of Ollama nodes
  • Health monitoring and failover
  • Prometheus metrics
  • Web dashboard
  • Connection pooling
  • Request hedging
  • Priority queuing

4. Unified Observability for Your Entire AI Network

SOLLOL provides a single pane of glass to monitor every application and every node in your distributed AI network.

  • Centralized Dashboard: One web interface shows all applications, nodes, and RPC backends.
  • Multi-App Tracking: See which applications (e.g., SynapticLlamas, custom agents) are using the cluster in real-time.
  • Network-Wide Visibility: The dashboard runs as a persistent service, discovering and monitoring all components even if no applications are running.
  • Zero-Config: Applications automatically appear in the dashboard with no extra code required.

This moves beyond per-application monitoring to provide true, centralized observability for your entire infrastructure.


🏗️ Architecture

High-Level Overview

┌────────────────────────────────────────────────────────┐
│                  Your Application                       │
│         (SynapticLlamas, custom agents, etc.)          │
└──────────────────────┬─────────────────────────────────┘
                       │
                       ▼
┌────────────────────────────────────────────────────────┐
│                 SOLLOL Gateway (:8000)                  │
│  ┌──────────────────────────────────────────────────┐  │
│  │         Intelligent Routing Engine               │  │
│  │  • Analyzes: task type, complexity, resources    │  │
│  │  • Scores: all nodes based on context            │  │
│  │  • Learns: from performance history              │  │
│  │  • Routes: to optimal node                       │  │
│  └──────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────┐  │
│  │          Priority Queue + Failover               │  │
│  └──────────────────────────────────────────────────┘  │
└────────┬─────────────────────────┬─────────────────────┘
         │                         │
         ▼                         ▼
  ┌─────────────┐          ┌──────────────┐
  │ Task Mode   │          │  Shard Mode  │
  │ Ray Cluster │          │  llama.cpp   │
  └──────┬──────┘          └──────┬───────┘
         │                         │
         ▼                         ▼
┌────────────────────────────────────────────────────────┐
│              Your Heterogeneous Cluster                 │
│  GPU (24GB) │ GPU (16GB) │ CPU (64c) │ GPU (8GB) │...  │
└────────────────────────────────────────────────────────┘

How Routing Works

# 1. Request arrives
POST /api/chat {
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Complex analysis task..."}],
  "priority": 8
}

# 2. SOLLOL analyzes
task_type = "generation"       # Auto-detected
complexity = "high"             # Token count analysis
requires_gpu = True             # Based on task
estimated_duration = 3.2s       # From history

# 3. SOLLOL scores all nodes
Node A (GPU 24GB, load: 0.2, latency: 120ms)  Score: 185.3  WINNER
Node B (GPU 8GB,  load: 0.6, latency: 200ms)  Score: 92.1
Node C (CPU only, load: 0.1, latency: 80ms)   Score: 41.2

# 4. Routes to Node A, monitors execution, learns for next time

Scoring Algorithm:

Score = 100.0 (baseline)
      × success_rate (0.0-1.0)
      ÷ (1 + latency_penalty)
      × gpu_bonus (1.5x if GPU available & needed)
      ÷ (1 + load_penalty)
      × priority_alignment
      × task_specialization

📦 Installation

Quick Install (PyPI)

pip install sollol

From Source

git clone https://github.com/BenevolentJoker-JohnL/SOLLOL.git
cd SOLLOL
pip install -e .

Quick Start

1. Synchronous API (No async/await needed!)

New in v0.3.6: SOLLOL now provides a synchronous API for easier integration with non-async applications.

from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority

# Auto-discover and connect to all Ollama nodes
pool = OllamaPool.auto_configure()

# Make requests - SOLLOL routes intelligently
# No async/await needed!
response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    priority=Priority.HIGH,  # Semantic priority levels
    timeout=60  # Request timeout in seconds
)

print(response['message']['content'])
print(f"Routed to: {response.get('_sollol_routing', {}).get('host', 'unknown')}")

Key features of synchronous API:

  • No async/await syntax required
  • Works with synchronous agent frameworks
  • Same intelligent routing and features
  • Runs async code in background thread automatically

2. Async API (Original)

For async applications, use the original async API:

from sollol import OllamaPool

# Auto-discover and connect to all Ollama nodes
pool = await OllamaPool.auto_configure()

# Make requests - SOLLOL routes intelligently
response = await pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response['message']['content'])
print(f"Routed to: {response['_sollol_routing']['host']}")
print(f"Task type: {response['_sollol_routing']['task_type']}")

3. Priority-Based Multi-Agent Execution

New in v0.3.6: Use semantic priority levels and role-based mapping.

from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority, get_priority_for_role

pool = OllamaPool.auto_configure()

# Define agents with different priorities
agents = [
    {"name": "Researcher", "role": "researcher"},  # Priority 8
    {"name": "Editor", "role": "editor"},          # Priority 6
    {"name": "Summarizer", "role": "summarizer"},  # Priority 5
]

for agent in agents:
    priority = get_priority_for_role(agent["role"])

    response = pool.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": f"Task for {agent['name']}"}],
        priority=priority
    )
    # User-facing agents get priority, background tasks wait

Priority levels available:

  • Priority.CRITICAL (10) - Mission-critical
  • Priority.URGENT (9) - Fast response needed
  • Priority.HIGH (7) - Important tasks
  • Priority.NORMAL (5) - Default
  • Priority.LOW (3) - Background tasks
  • Priority.BATCH (1) - Can wait

4. Model Sharding with llama.cpp (Large Models)

Run models larger than your biggest GPU by distributing layers across multiple machines.

When to Use Model Sharding

Use model sharding when:

  • Model doesn't fit on your largest GPU (e.g., 70B models on 16GB GPUs)
  • You have multiple machines with network connectivity
  • You can tolerate slower inference for capability

Don't use sharding when:

  • Model fits on a single GPU (use task distribution instead)
  • You need maximum inference speed
  • Network latency is high (>10ms between machines)

Quick Start: Auto-Setup (Easiest)

from sollol.sync_wrapper import HybridRouter, OllamaPool

# SOLLOL handles all setup automatically
router = HybridRouter(
    ollama_pool=OllamaPool.auto_configure(),
    enable_distributed=True,  # Enable model sharding
    auto_setup_rpc=True,      # Auto-configure RPC backends
    num_rpc_backends=3        # Distribute across 3 machines
)

# Use large model that doesn't fit on one machine
response = router.route_request(
    model="llama3.1:70b",  # Automatically sharded across backends
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

print(response['message']['content'])

What happens automatically:

  1. SOLLOL discovers available RPC backends on your network
  2. Extracts the GGUF model from Ollama storage
  3. Starts llama-server coordinator with optimal settings
  4. Distributes model layers across backends
  5. Routes your request to the coordinator

RPC Server Auto-Installation

SOLLOL can automatically clone, build, and start llama.cpp RPC servers for you!

One-line installation:

from sollol.rpc_auto_setup import auto_setup_rpc_backends

# Automatically: clone → build → start RPC servers
backends = auto_setup_rpc_backends(num_backends=2)
# Output: [{'host': '127.0.0.1', 'port': 50052}, {'host': '127.0.0.1', 'port': 50053}]

What this does:

  1. Scans network for existing RPC servers
  2. If none found: clones llama.cpp to ~/llama.cpp
  3. Builds llama.cpp with RPC support (cmake -DGGML_RPC=ON)
  4. Starts RPC servers on ports 50052-50053
  5. Returns ready-to-use backend list

CLI installation:

# Full automated setup (clone + build + install systemd service)
python3 -m sollol.setup_llama_cpp --all

# Or step by step
python3 -m sollol.setup_llama_cpp --clone  # Clone llama.cpp
python3 -m sollol.setup_llama_cpp --build  # Build with RPC support
python3 -m sollol.setup_llama_cpp --start  # Start RPC server

Docker IP Resolution:

SOLLOL automatically resolves Docker container IPs to accessible host IPs:

# If Docker container reports IP 172.17.0.5:11434
# SOLLOL automatically resolves to:
# → 127.0.0.1:11434 (published port mapping)
# → host IP (if accessible)
# → Docker host gateway

from sollol import is_docker_ip, resolve_docker_ip

# Check if IP is Docker internal
is_docker = is_docker_ip("172.17.0.5")  # True

# Resolve Docker IP to accessible IP
accessible_ip = resolve_docker_ip("172.17.0.5", port=11434)
# Returns: "127.0.0.1" or host IP

Network Discovery with Docker Support:

from sollol import OllamaPool

# Auto-discover nodes (automatically resolves Docker IPs)
pool = OllamaPool.auto_configure()

# Manual control
from sollol.discovery import discover_ollama_nodes
nodes = discover_ollama_nodes(auto_resolve_docker=True)

Multi-Node Production Setup:

For distributed clusters, use systemd services on each node:

# On each RPC node
sudo systemctl enable llama-rpc@50052.service
sudo systemctl start llama-rpc@50052.service

See SOLLOL_RPC_SETUP.md for complete installation guide.

Architecture: How It Works

┌────────────────────────────────────────────┐
│    Llama 3.1 70B Model (40GB total)        │
│           Distributed Sharding             │
└────────────────────────────────────────────┘
                    │
       ┌────────────┼────────────┐
       │            │            │
       ▼            ▼            ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  Machine 1   │ │  Machine 2   │ │  Machine 3   │
│ Layers 0-26  │ │ Layers 27-53 │ │ Layers 54-79 │
│   (~13GB)    │ │   (~13GB)    │ │   (~13GB)    │
│ RPC Backend  │ │ RPC Backend  │ │ RPC Backend  │
└──────────────┘ └──────────────┘ └──────────────┘
       ▲            ▲            ▲
       └────────────┼────────────┘
                    │
         ┌──────────┴──────────┐
         │ llama-server        │
         │ Coordinator         │
         │ (Port 18080)        │
         └─────────────────────┘

Manual Setup (Advanced)

For explicit control over RPC backends:

from sollol.llama_cpp_coordinator import LlamaCppCoordinator
from sollol.rpc_registry import RPCBackendRegistry

# 1. Register RPC backends explicitly
registry = RPCBackendRegistry()
registry.add_backend("rpc_1", "grpc://10.9.66.45:50052")
registry.add_backend("rpc_2", "grpc://10.9.66.46:50052")
registry.add_backend("rpc_3", "grpc://10.9.66.47:50052")

# 2. Create coordinator
coordinator = LlamaCppCoordinator(
    coordinator_port=18080,
    rpc_backends=registry.get_all_backends(),
    context_size=4096,
    gpu_layers=-1  # Use all available GPU layers
)

# 3. Start and use
await coordinator.start(model_name="llama3.1:70b")
response = await coordinator.generate(
    prompt="Explain the theory of relativity",
    max_tokens=500
)

Performance Expectations

Model Size Single GPU Sharded (3 nodes) Trade-off
13B 20 tok/s 5 tok/s -75% speed, works on 3×smaller GPUs
70B OOM ⚠️ 3-5 tok/s (est.) Enables model that won't run otherwise

Trade-offs:

  • 🐌 Startup: 2-5 minutes (model distribution + loading)
  • 🐌 Inference: ~4x slower than local (network overhead)
  • Capability: Run models that won't fit on single GPU

Learn More:


5. Batch Processing API

New in v0.7.0: RESTful API for asynchronous batch job management.

Submit large-scale batch operations (thousands of embeddings, bulk inference) and track progress via job IDs:

import requests

# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
    "model": "nomic-embed-text",
    "documents": ["Document 1", "Document 2", ...],  # Can be thousands
    "metadata": {"source": "knowledge_base"}  # Optional metadata
})

job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")

# Poll for job status
import time
while True:
    status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}").json()

    progress = status["progress"]["percent"]
    print(f"Progress: {progress}%")

    if status["status"] == "completed":
        break
    time.sleep(1)

# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}").json()
embeddings = results["results"]  # List of embedding vectors
print(f"Processed {len(embeddings)} documents in {status['duration_seconds']}s")

Available Batch Endpoints:

  • POST /api/batch/embed - Submit batch embedding job
  • GET /api/batch/jobs/{job_id} - Get job status
  • GET /api/batch/results/{job_id} - Get job results
  • GET /api/batch/jobs?limit=100 - List recent jobs
  • DELETE /api/batch/jobs/{job_id} - Cancel job

Use cases:

  • Embedding large document collections (thousands of documents)
  • Bulk inference for batch predictions
  • Background processing without blocking
  • Long-running operations with progress tracking

6. SOLLOL Detection

New in v0.3.6: Detect if SOLLOL is running vs native Ollama.

import requests

def is_sollol(url="http://localhost:11434"):
    """Check if SOLLOL is running at the given URL."""

    # Method 1: Check X-Powered-By header
    response = requests.get(url)
    if response.headers.get("X-Powered-By") == "SOLLOL":
        return True

    # Method 2: Check health endpoint
    response = requests.get(f"{url}/api/health")
    data = response.json()
    if data.get("service") == "SOLLOL":
        return True

    return False

# Use it
if is_sollol("http://localhost:11434"):
    print("✓ SOLLOL detected - using intelligent routing")
else:
    print("Native Ollama detected")

Why this matters:

  • Enables graceful fallback in client applications
  • Makes SOLLOL a true drop-in replacement
  • Clients can auto-detect and use SOLLOL features when available

7. Production Gateway

from sollol import SOLLOL, SOLLOLConfig

# Full production setup
config = SOLLOLConfig(
    ray_workers=4,
    dask_workers=2,
    hosts=["gpu-1:11434", "gpu-2:11434", "cpu-1:11434"],
    gateway_port=8000,
    metrics_port=9090
)

sollol = SOLLOL(config)
sollol.start()  # Blocks and runs gateway

# Access via HTTP:
# curl http://localhost:8000/api/chat -d '{...}'
# curl http://localhost:8000/api/stats
# curl http://localhost:8000/api/dashboard

🎓 Use Cases

1. Multi-Agent AI Systems (SynapticLlamas, CrewAI, AutoGPT)

Problem: Running 10 agents sequentially takes 10x longer than necessary.

Solution: SOLLOL distributes agents across nodes in parallel.

# Before: Sequential execution on one node
# After: Parallel execution with SOLLOL
pool = OllamaPool.auto_configure()
agents = await asyncio.gather(*[
    pool.chat(model="llama3.2", messages=agent_prompts[i])
    for i in range(10)
])
# Speedup depends on number of available nodes and their capacity

2. Large Model Inference

Problem: Your model doesn't fit in available VRAM.

Solution: SOLLOL can shard models across multiple machines via llama.cpp.

# Distribute model across multiple nodes
# Note: Verified with 13B models; larger models not extensively tested
router = HybridRouter(
    enable_distributed=True,
    num_rpc_backends=4
)
# Trade-off: Slower startup/inference but enables running larger models

3. Mixed Workloads

Problem: Different tasks need different resources.

Solution: SOLLOL routes each task to the optimal node.

pool = OllamaPool.auto_configure()

# Heavy generation → GPU node
chat = pool.chat(model="llama3.2:70b", messages=[...])

# Fast embeddings → CPU node
embeddings = pool.embed(model="nomic-embed-text", input=[...])

# SOLLOL automatically routes each to the best available node

4. High Availability Production

Problem: Node failures break your service.

Solution: SOLLOL auto-fails over and recovers.

# Node A fails mid-request
# ✅ SOLLOL automatically:
# 1. Detects failure
# 2. Retries on Node B
# 3. Marks Node A as degraded
# 4. Periodically re-checks Node A
# 5. Restores Node A when healthy

Simulate Failure & Recovery

Want to see SOLLOL's automatic failover in action? Run the included simulation:

python test_failure_recovery.py

What the simulation does:

  1. Starts 3 mock Ollama nodes
  2. Sends baseline requests (all nodes healthy)
  3. Kills node #1 mid-execution
  4. Continues sending requests (SOLLOL routes around failed node)
  5. Restores node #1
  6. Resumes sending requests (traffic returns to recovered node)

Expected output:

STEP 1: Starting Mock Nodes
✅ Started 3 mock nodes

BASELINE: Requests with all nodes healthy
  Request 1: ✓ Routed to localhost:21434
  Request 2: ✓ Routed to localhost:21435
  ...

STEP 3: Simulating Node Failure (killing node 0)
Killing node on port 21434...
✅ Node 21434 terminated

STEP 4: Requests after node failure (observe failover)
  Request 1: ✓ Routed to localhost:21435  ← Automatically avoided dead node
  Request 2: ✓ Routed to localhost:21436
  ...

STEP 5: Simulating Node Recovery
✅ Node 21434 recovered successfully

✅ Key Observations:
  1. Requests succeeded even after node failure
  2. SOLLOL automatically routed around the dead node
  3. Node recovered and rejoined the pool
  4. Traffic resumed to recovered node

This demonstrates SOLLOL's production-grade resilience without needing real infrastructure.


📊 Performance & Benchmarks

Validation Status

What's Been Validated

  • Single-node baseline performance measured
  • Code exists and is reviewable (75+ modules)
  • Tests pass in CI (57 tests, coverage tracked)
  • Architecture implements intelligent routing

What Needs Validation ⚠️

  • Comparative benchmarks (SOLLOL vs round-robin)
  • Multi-node performance improvements
  • Real-world latency/throughput gains

📖 See BENCHMARKING.md for complete validation roadmap and how to run comparative tests.


Measured Baseline Performance

Single Ollama Node (llama3.2-3B, 50 requests, concurrency=5):

  • Success Rate: 100%
  • Throughput: 0.51 req/s
  • 📈 Average Latency: 5,659 ms
  • 📈 P95 Latency: 11,299 ms
  • 📈 P99 Latency: 12,259 ms

Hardware: Single Ollama instance with 75+ models loaded Data: See benchmarks/results/ for raw JSON

Run Your Own:

# Baseline test (no cluster needed)
python benchmarks/simple_ollama_benchmark.py llama3.2 50

# Comparative test (requires docker-compose)
docker-compose up -d
python benchmarks/run_benchmarks.py --sollol-url http://localhost:8000 --duration 60

Projected Performance (Unvalidated)

Note: These are architectural projections, not measured results. Requires multi-node cluster setup for validation.

Theory: With N nodes and parallelizable workload:

  • Task distribution can approach N× parallelization (limited by request rate)
  • Intelligent routing should reduce tail latencies vs random selection
  • Resource-aware placement reduces contention and failures

Reality: Requires multi-node cluster validation. See BENCHMARKING.md for test procedure and CODE_WALKTHROUGH.md for implementation details.

Model Sharding Performance

Model Single 24GB GPU SOLLOL (3×16GB) Status
13B ~20 tok/s ~5 tok/s Verified working
70B OOM ⚠️ Estimated ~3-5 tok/s ⚠️ Not extensively tested

When to use sharding: When model doesn't fit on your largest GPU. You trade speed for capability.

Performance trade-offs: Distributed inference is 2-5 minutes slower to start and ~4x slower for inference compared to local. Use only when necessary.

Overhead

  • Routing decision: ~5-10ms (tested with 5-10 nodes)
  • Network overhead: Varies by network (typically 5-20ms)
  • Total added latency: ~20-50ms
  • Benefit: Better resource utilization + automatic failover

🛠️ Advanced Configuration

Custom Routing Strategy

from sollol import OllamaPool

pool = OllamaPool(
    nodes=[
        {"host": "gpu-1.local", "port": 11434, "priority": 10},  # Prefer this
        {"host": "gpu-2.local", "port": 11434, "priority": 5},
        {"host": "cpu-1.local", "port": 11434, "priority": 1},   # Last resort
    ],
    enable_intelligent_routing=True,
    enable_hedging=True,  # Duplicate critical requests
    max_queue_size=100
)

Priority-Based Scheduling

# Critical user-facing request
response = pool.chat(
    model="llama3.2",
    messages=[...],
    priority=10  # Highest priority
)

# Background batch job
response = pool.chat(
    model="llama3.2",
    messages=[...],
    priority=1  # Lowest priority
)

# SOLLOL ensures high-priority requests jump the queue

Observability & Monitoring

Zero-Config Auto-Registration 🎯

SOLLOL provides automatic observability with zero configuration required. All applications automatically register with the dashboard when they create an OllamaPool:

from sollol import OllamaPool

# Creates pool AND auto-registers with dashboard (if running)
pool = OllamaPool.auto_configure()
# ✅ Application automatically appears in dashboard at http://localhost:8080

How it works:

  1. OllamaPool automatically detects if a dashboard is running on port 8080
  2. Auto-discovers RPC backends and Ollama nodes
  3. Registers application with metadata (node count, GPU info, etc.)
  4. Sends periodic heartbeats to maintain "alive" status
  5. No manual DashboardClient setup needed!

Architecture:

  • ONE persistent dashboard service runs independently
  • Multiple applications (SynapticLlamas, FlockParser, etc.) auto-register
  • Dashboard survives application exits
  • Zero-config auto-discovery of nodes and RPC backends

Custom Application Names 🏷️

By default, applications register as "OllamaPool (hostname)". To give your application a custom name in the dashboard:

from sollol import OllamaPool

# Register with custom application name
pool = OllamaPool(
    nodes=[{"host": "localhost", "port": 11434}],
    enable_intelligent_routing=True,
    app_name="MyApplication"  # Shows as "MyApplication" in dashboard
)

Example - Multi-application setup:

# Application 1: FlockParser
from sollol import OllamaPool

pool = OllamaPool.auto_configure(app_name="FlockParser")
# Dashboard shows: "FlockParser"

# Application 2: SynapticLlamas
from sollol.dashboard_client import DashboardClient

dashboard_client = DashboardClient(
    app_name="SynapticLlamas",
    router_type="IntelligentRouter",
    version="1.0.0",
    dashboard_url="http://localhost:8080",
    metadata={"agents": 3, "distributed": True},
    auto_register=True
)
# Dashboard shows: "SynapticLlamas"

Why use custom names?

  • Distinguish between multiple applications using SOLLOL
  • Better visibility in multi-tenant environments
  • Easier debugging and monitoring
  • Professional dashboard presentation

Manual/Programmatic Registration 🔧

For applications that don't use OllamaPool or need custom registration logic, use DashboardClient directly:

from sollol.dashboard_client import DashboardClient

# Create dashboard client with custom metadata
dashboard_client = DashboardClient(
    app_name="CustomApplication",
    router_type="CustomRouter",  # Or "OllamaPool", "HybridRouter", etc.
    version="1.0.0",
    dashboard_url="http://localhost:8080",
    metadata={
        # Custom metadata shown in dashboard
        "nodes": 5,
        "distributed": True,
        "custom_field": "value"
    },
    auto_register=True  # Registers immediately
)

# Dashboard client automatically sends heartbeats every 5 seconds
# to keep application status as "active"

# When application exits, clean up:
dashboard_client.close()  # Stops heartbeat thread

Advanced: Custom Heartbeat Logic

from sollol.dashboard_client import DashboardClient
import time

# Create client without auto-registration
dashboard_client = DashboardClient(
    app_name="BackgroundWorker",
    router_type="WorkerPool",
    version="2.0.0",
    dashboard_url="http://localhost:8080",
    metadata={"worker_count": 10},
    auto_register=False  # Don't register yet
)

# Register when ready
dashboard_client.register()

# Update metadata dynamically
dashboard_client.update_metadata({"worker_count": 15, "status": "processing"})

# Send manual heartbeat
dashboard_client.heartbeat()

# Application logic here...
time.sleep(60)

# Deregister when done
dashboard_client.deregister()
dashboard_client.close()

Use cases for manual registration:

  • Custom routers or load balancers
  • Background workers or daemons
  • Applications that need dynamic metadata updates
  • Testing and debugging
  • Applications without OllamaPool

Registration Methods Comparison 📊

Method Use Case Complexity Customization
Auto-registration Standard SOLLOL applications Zero config Limited (app_name only)
Custom app_name Multiple apps, better naming One parameter App name
Manual DashboardClient Custom applications ⚠️ More code Full control

Quick decision guide:

  • Using OllamaPool? → Use app_name parameter
  • Need custom metadata? → Use DashboardClient directly
  • Need dynamic updates? → Use DashboardClient with manual heartbeats
  • Just want it to work? → Use auto-registration (default)

Persistent Dashboard Service

Start the persistent dashboard once (survives application exits):

# Start dashboard service (runs until stopped)
python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379

# Or run in background
nohup python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379 > /tmp/dashboard_service.log 2>&1 &

Features:

  • 📊 Real-time metrics: System status, latency, success rate, GPU memory, Ray workers
  • 📜 Live log streaming: WebSocket-based log tailing (via Redis pub/sub)
  • 🌐 Activity monitoring: Ollama server and llama.cpp RPC activity
  • 🔷 Embedded Ray dashboard: Task-level distributed tracing
  • 📈 Embedded Dask dashboard: Performance profiling and task graphs
  • 🔍 Auto-discovery: Automatically discovers Ollama nodes and RPC backends when no router context

Embedded Dashboard (Alternative)

Applications can also start their own embedded dashboards:

from sollol import run_unified_dashboard
import threading

# Start embedded dashboard with router context
dashboard_thread = threading.Thread(
    target=run_unified_dashboard,
    kwargs={
        "router": pool,  # Provides node/backend context
        "dashboard_port": 8080,
        "host": "0.0.0.0",
        "enable_dask": False
    },
    daemon=True
)
dashboard_thread.start()

Environment Variables (configure before initializing):

# Disable dashboard (default: true)
export SOLLOL_DASHBOARD=false

# Change dashboard port (default: 8080)
export SOLLOL_DASHBOARD_PORT=9090

# Disable Dask dashboard integration (default: true)
export SOLLOL_DASHBOARD_DASK=false

Multi-Application Pattern

The persistent dashboard service enables multiple applications to share observability:

# Terminal 1: Start persistent dashboard
python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379

# Terminal 2: Start application 1
python my_app1.py  # Auto-registers with dashboard

# Terminal 3: Start application 2
python my_app2.py  # Also auto-registers

# Visit http://localhost:8080 to see both applications!

Benefits:

  • Single dashboard for all SOLLOL-based applications
  • Dashboard stays running when applications exit
  • Aggregated logs from all applications (via Redis pub/sub)
  • Centralized observability for distributed systems

Programmatic Stats Access

# Get detailed stats
stats = pool.get_stats()
print(f"Total requests: {stats['total_requests']}")
print(f"Average latency: {stats['avg_latency_ms']}ms")
print(f"Success rate: {stats['success_rate']:.2%}")

# Per-node breakdown
for host, metrics in stats['hosts'].items():
    print(f"{host}: {metrics['latency_ms']}ms, {metrics['success_rate']:.2%}")

Prometheus Metrics

# Prometheus metrics endpoint
curl http://localhost:9090/metrics

# sollol_requests_total{host="gpu-1:11434",model="llama3.2"} 1234
# sollol_latency_seconds{host="gpu-1:11434"} 0.234
# sollol_success_rate{host="gpu-1:11434"} 0.98

🔌 Integration Examples

🔗 Integration with SynapticLlamas & FlockParser

SOLLOL is the distributed inference platform for the complete AI ecosystem, powering both SynapticLlamas (multi-agent orchestration) and FlockParser (document RAG).

The Complete Stack

┌─────────────────────────────────────────────────────────────┐
│              SynapticLlamas (v0.1.0+)                       │
│          Multi-Agent System & Orchestration                 │
│  • Research agents  • Editor agents  • Storyteller agents  │
└───────────┬────────────────────────────────────┬───────────┘
            │                                    │
            │ RAG Queries                        │ Distributed
            │ (with pre-computed embeddings)     │ Inference
            │                                    │
     ┌──────▼──────────┐              ┌─────────▼────────────┐
     │  FlockParser    │              │      SOLLOL          │
     │  API (v1.0.4+)  │              │  Load Balancer       │
     │  Port: 8000     │              │  (v0.9.31+)          │
     └─────────────────┘              └──────────────────────┘
            │                                    │
            │ ChromaDB                          │ Intelligent
            │ Vector Store                      │ GPU/CPU Routing
            │                                    │
     ┌──────▼──────────┐              ┌─────────▼────────────┐
     │  Knowledge Base │              │  Ollama Nodes        │
     │  41 Documents   │              │  (Distributed)       │
     │  6,141 Chunks   │              │  GPU + CPU           │
     └─────────────────┘              └──────────────────────┘

Why This Integration Matters

Component Role Key Feature
SOLLOL Distributed Inference Intelligent GPU/CPU routing with load balancing
SynapticLlamas Multi-Agent Orchestration Research, Editor, Storyteller agents
FlockParser Document RAG & Knowledge Base ChromaDB vector store with 6,141+ chunks

Quick Start: Complete Ecosystem

# Install all three packages (auto-installs dependencies)
pip install synaptic-llamas  # Pulls in flockparser>=1.0.4 and sollol>=0.9.31

# Start FlockParser API
flockparse

# Run SynapticLlamas with SOLLOL + FlockParser integration
synaptic-llamas --interactive --distributed

Integration Example: Load Balanced RAG

from sollol import OllamaPool
from flockparser_adapter import FlockParserAdapter

# Initialize SOLLOL for distributed inference
sollol = OllamaPool.auto_configure()

# Initialize FlockParser adapter
flockparser = FlockParserAdapter("http://localhost:8000", remote_mode=True)

# Step 1: Generate embedding using SOLLOL (load balanced!)
user_query = "What does research say about quantum entanglement?"
embedding = sollol.embed(
    model="mxbai-embed-large",
    input=user_query
)
# SOLLOL routes to fastest GPU automatically

# Step 2: Query FlockParser with pre-computed embedding
rag_results = flockparser.query_remote(
    query=user_query,
    embedding=embedding,  # Skip FlockParser's embedding generation
    n_results=5
)
# FlockParser returns relevant chunks from 41 documents

# Performance gain: 2-5x faster when SOLLOL has faster nodes!

Production Integrations

SOLLOL is actively used in production by:

  • FlockParser - Document RAG Intelligence with distributed processing. FlockParser's legacy load balancing code was refactored and became core SOLLOL logic. FlockParser now uses SOLLOL directly via OllamaPool for intelligent routing across document embeddings and LLM queries.

  • SynapticLlamas - Multi-agent collaborative research framework. Uses SOLLOL's HybridRouter for distributed agent execution with RAG-enhanced research capabilities via FlockParser integration.

Related Projects:


SynapticLlamas Integration

from sollol import SOLLOL, SOLLOLConfig
from synaptic_llamas import AgentOrchestrator

# Setup SOLLOL for multi-agent orchestration
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start(blocking=False)

# SynapticLlamas now uses SOLLOL for intelligent routing
orchestrator = AgentOrchestrator(
    llm_endpoint="http://localhost:8000/api/chat"
)

# All agents automatically distributed and optimized
orchestrator.run_parallel_agents([...])

FlockParser Integration

from sollol import OllamaPool

# FlockParser uses SOLLOL's OllamaPool directly
pool = OllamaPool(
    nodes=None,  # Auto-discover all Ollama nodes
    enable_intelligent_routing=True,
    exclude_localhost=True,
    discover_all_nodes=True,
    app_name="FlockParser",
    enable_ray=True
)

# All FlockParser document embeddings and queries route through SOLLOL
embeddings = pool.embed(model="mxbai-embed-large", input="document text")
response = pool.chat(model="llama3.2", messages=[{"role": "user", "content": "query"}])

LangChain Integration

from langchain.llms import Ollama
from sollol import OllamaPool

# Use SOLLOL as LangChain backend
pool = OllamaPool.auto_configure()

llm = Ollama(
    base_url="http://localhost:8000",
    model="llama3.2"
)

# LangChain requests now go through SOLLOL
response = llm("What is quantum computing?")

🏭 Production Deployment (Bare Metal)

For teams preferring bare metal infrastructure over containers, SOLLOL provides systemd-based deployment for production environments.

Multi-Node Bare Metal Setup

This setup assumes you have 3+ physical machines with Ollama installed. We'll configure SOLLOL as a centralized routing layer.

Architecture:

┌─────────────────────────────────────────┐
│   Central Router Machine (Control Plane│
│   - SOLLOL Dashboard (port 8080)       │
│   - Redis (port 6379)                  │
│   - Optional: GPU reporter             │
└────────────┬────────────────────────────┘
             │ Auto-discovery via network
             │ scan (ports 11434)
     ┌───────┼──────────┬─────────────┐
     ▼       ▼          ▼             ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node 1  │ │ Node 2  │ │ Node 3  │ │ Node N  │
│ Ollama  │ │ Ollama  │ │ Ollama  │ │ Ollama  │
│ :11434  │ │ :11434  │ │ :11434  │ │ :11434  │
│ GPU 24GB│ │ GPU 16GB│ │ CPU 64c │ │ ...     │
└─────────┘ └─────────┘ └─────────┘ └─────────┘

Step 1: Install Ollama on each node

On each worker node (Node 1, 2, 3, ...):

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service
sudo systemctl enable ollama
sudo systemctl start ollama

# Verify it's running
curl http://localhost:11434/api/tags

Step 2: Install SOLLOL on control plane machine

On your central router machine:

# Install SOLLOL and dependencies
pip install sollol redis

# Install Redis
sudo apt-get install redis-server  # Ubuntu/Debian
# OR
sudo yum install redis              # RHEL/CentOS

# Start Redis
sudo systemctl enable redis
sudo systemctl start redis

Step 3: Create systemd service for SOLLOL Dashboard

Create /etc/systemd/system/sollol-dashboard.service:

[Unit]
Description=SOLLOL Dashboard Service
After=network.target redis.service
Requires=redis.service

[Service]
Type=simple
User=sollol  # Create dedicated user for security
Group=sollol
WorkingDirectory=/opt/sollol
Environment="SOLLOL_DASHBOARD=true"
Environment="SOLLOL_DASHBOARD_PORT=8080"
Environment="REDIS_URL=redis://localhost:6379"
ExecStart=/usr/bin/python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Enable and start:

sudo useradd -r -s /bin/false sollol  # Create dedicated user
sudo mkdir -p /opt/sollol
sudo chown sollol:sollol /opt/sollol

sudo systemctl daemon-reload
sudo systemctl enable sollol-dashboard
sudo systemctl start sollol-dashboard

# Verify
sudo systemctl status sollol-dashboard
curl http://localhost:8080/health

On each GPU node for accurate VRAM monitoring:

# Install on each node with GPUs
pip install sollol gpustat

# Run GPU reporter (publishes to central Redis)
sollol install-gpu-reporter --redis-host <control-plane-ip>

# Example for node at 10.9.66.45
sollol install-gpu-reporter --redis-host 10.9.66.154

Create /etc/systemd/system/sollol-gpu-reporter.service on each GPU node:

[Unit]
Description=SOLLOL GPU Reporter
After=network.target

[Service]
Type=simple
User=sollol
ExecStart=/usr/local/bin/sollol-gpu-reporter --redis-host <control-plane-ip> --interval 5
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Step 5: Configure firewall rules

On all nodes:

# Allow Ollama traffic (port 11434)
sudo ufw allow 11434/tcp comment "Ollama API"

# On control plane only: allow dashboard access
sudo ufw allow 8080/tcp comment "SOLLOL Dashboard"
sudo ufw allow 6379/tcp comment "Redis"  # Only from trusted nodes

# Reload firewall
sudo ufw reload

Step 6: Test the deployment

From any machine with network access:

from sollol import OllamaPool

# SOLLOL auto-discovers all nodes via network scan
pool = OllamaPool.auto_configure()

# Verify nodes discovered
stats = pool.get_stats()
print(f"Discovered {stats['active_nodes']} nodes")

# Make a test request
response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response['message']['content'])
print(f"Routed to: {response['_sollol_routing']['host']}")

Step 7: Monitor with systemd

# Check dashboard status
sudo systemctl status sollol-dashboard

# View live logs
sudo journalctl -u sollol-dashboard -f

# Check GPU reporters
sudo systemctl status sollol-gpu-reporter

# View metrics
curl http://localhost:8080/api/stats | jq

Production Hardening

Security:

# 1. Run SOLLOL as dedicated unprivileged user
sudo useradd -r -s /bin/false sollol

# 2. Configure Redis authentication
sudo vi /etc/redis/redis.conf
# Add: requirepass <strong-password>

# 3. Use firewall to restrict access
sudo ufw allow from 10.9.66.0/24 to any port 6379  # Redis from trusted subnet only
sudo ufw allow from 10.9.66.0/24 to any port 8080  # Dashboard from trusted subnet

High Availability:

# Use systemd watchdog for automatic restart on crashes
[Service]
WatchdogSec=30
Restart=always
RestartSec=10

Monitoring:

# Integrate with Prometheus
curl http://localhost:9090/metrics

# Or use systemd monitoring
systemctl status sollol-dashboard | grep "Active:"

Troubleshooting

Nodes not discovered:

# Check network connectivity
for ip in 10.9.66.{1..255}; do
    timeout 0.5 bash -c "cat < /dev/null > /dev/tcp/$ip/11434 2>/dev/null" && echo "$ip:11434 reachable"
done

# Check Ollama is listening on all interfaces (not just localhost)
curl http://<node-ip>:11434/api/tags

Dashboard not starting:

# Check Redis is running
systemctl status redis
redis-cli ping  # Should return "PONG"

# Check port not in use
sudo lsof -i :8080

# View detailed logs
journalctl -u sollol-dashboard --since "10 minutes ago"

Performance issues:

# Check node health
curl http://localhost:8080/api/stats | jq '.node_performance'

# Monitor resource usage
htop
nvidia-smi  # On GPU nodes

# Check network latency between nodes
ping <node-ip>

📚 Documentation


🆕 What's New in v0.7.0

📦 Batch Processing API

Complete RESTful API for asynchronous batch job management. Submit large-scale batch operations (embeddings, bulk inference) and track progress via job IDs.

import requests

# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
    "model": "nomic-embed-text",
    "documents": ["doc1", "doc2", ...],  # Thousands of documents
})
job_id = response.json()["job_id"]

# Check status
status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}")
print(status.json()["progress"]["percent"])  # 100.0

# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}")
embeddings = results.json()["results"]

Batch API Endpoints:

  • POST /api/batch/embed - Submit batch embedding job
  • GET /api/batch/jobs/{job_id} - Get job status with progress tracking
  • GET /api/batch/results/{job_id} - Retrieve job results and errors
  • DELETE /api/batch/jobs/{job_id} - Cancel running jobs
  • GET /api/batch/jobs?limit=100 - List recent jobs

Features:

  • UUID-based job tracking with 5 states (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
  • Automatic TTL-based cleanup (1 hour default)
  • Progress tracking: completed_items, failed_items, percentage
  • Duration calculation and metadata storage
  • Async job execution via Dask distributed processing

Performance Optimizations (v0.9.18+)

SOLLOL now includes 8 production-grade performance optimizations designed to improve throughput and latency:

⚠️ Transparency Note: These features are implemented and functional, but claimed performance improvements are projections based on architecture, NOT independently validated benchmarks. See Performance Impact section below for details.

🚀 Response Caching Layer

Expected Impact: Reduces latency for repeated queries (cache hit/miss tracking validated)

Intelligent LRU cache with TTL expiration:

from sollol import OllamaPool

# Enable response caching (enabled by default)
pool = OllamaPool.auto_configure(
    enable_cache=True,
    cache_max_size=1000,  # Cache up to 1000 responses
    cache_ttl=3600        # 1 hour TTL
)

# First request: normal latency
response1 = pool.embed(model="mxbai-embed-large", input="Hello world")

# Cached request: faster
response2 = pool.embed(model="mxbai-embed-large", input="Hello world")  # Cache hit

# Programmatic cache management
pool.clear_cache()                              # Clear all
pool.invalidate_cache_by_model("llama3.2")     # Invalidate by model
cache_data = pool.export_cache()                # Export for persistence
pool.import_cache(cache_data)                   # Restore from export

# Get cache stats
stats = pool.get_cache_stats()
print(f"Hit rate: {stats['hit_rate']:.1%}")    # 85.2%
print(f"Cache size: {stats['size']}")           # 234/1000

🌊 Streaming Support

Expected Impact: Better UX, reduced perceived latency (streaming functionality validated)

Token-by-token streaming for chat() and generate():

# Stream chat responses
for chunk in pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
):
    content = chunk.get("message", {}).get("content", "")
    print(content, end="", flush=True)

# Stream text generation
for chunk in pool.generate(
    model="llama3.2",
    prompt="Explain quantum computing",
    stream=True
):
    print(chunk.get("response", ""), end="", flush=True)

🔥 Smart Model Prefetching

Expected Impact: 1-5 seconds reduced first-request latency (projection, not measured)

Pre-load models into VRAM before first use:

# Warm a single model
pool.warm_model("llama3.2")

# Warm multiple models in parallel
results = pool.warm_models(
    models=["llama3.2", "codellama", "mistral"],
    parallel=True
)
print(f"Warmed {sum(results.values())} models")

Async I/O Support

Expected Impact: 2-3x throughput for concurrent requests (projection, not measured)

True non-blocking I/O with httpx AsyncClient:

import asyncio

# Async methods for concurrent requests
async def process_batch():
    responses = await asyncio.gather(
        pool.chat_async("llama3.2", messages=[...]),
        pool.generate_async("llama3.2", prompt="..."),
        pool.embed_async("mxbai-embed-large", input="...")
    )
    return responses

# Run async batch
results = asyncio.run(process_batch())

🔗 HTTP/2 Multiplexing

Expected Impact: 30-50% latency reduction for concurrent requests (projection, not measured)

Automatic HTTP/2 support when httpx is installed:

# Automatically uses HTTP/2 if available
pool = OllamaPool.auto_configure()

# Check if HTTP/2 is enabled
stats = pool.get_stats()
print(f"HTTP/2 enabled: {stats['http2_enabled']}")  # True

📊 Additional Optimizations

Connection Pool Tuning (10-20% better concurrency):

  • Optimized pool sizes: 10-20 connections per node
  • Automatic retry with exponential backoff
  • Connection reuse with keep-alive

Adaptive Health Checks (5-10% overhead reduction):

  • Dynamic intervals based on node stability:
    • Very stable (<1% failures): 60s interval
    • Stable (<5% failures): 30s interval
    • Degraded (5-15% failures): 15s interval
    • Unstable (>15% failures): 5s interval

Telemetry Sampling (~90% overhead reduction):

  • Configurable sampling for info-level events (default: 10%)
  • Always logs errors and critical events
  • Reduces dashboard logging overhead

📈 Performance Impact

⚠️ IMPORTANT: These are architectural projections, NOT measured results

These optimizations are implemented and functional, but multi-node performance gains have not been independently validated:

Projected improvements (unvalidated):

  • Throughput: +150-300% for concurrent workloads (theory: parallel request handling)
  • Latency: -40-70% for typical requests (theory: caching + HTTP/2)
  • Cache hits: Significant latency reduction for repeated queries (validated in single-node tests)

What's actually measured:

  • Response caching works (cache hit/miss rates tracked)
  • Streaming works (token-by-token delivery confirmed)
  • HTTP/2 enabled (httpx connection verified)
  • ⚠️ Multi-node throughput gains: Not independently benchmarked

To validate these claims yourself:

# Run comparative benchmarks
cd benchmarks
python run_benchmarks.py --sollol-url http://localhost:8000 --duration 120

See BENCHMARKING.md for methodology.


Previous Features (v0.3.6+)

Synchronous API - No async/await required:

from sollol.sync_wrapper import OllamaPool
pool = OllamaPool.auto_configure()
response = pool.chat(...)  # Synchronous call

Priority Helpers - Semantic priority levels:

from sollol.priority_helpers import Priority
priority = Priority.HIGH  # 7

SOLLOL Detection:

  • X-Powered-By: SOLLOL header on all responses
  • /api/health endpoint returns {"service": "SOLLOL", "version": "0.7.0"}

🆚 Comparison

SOLLOL vs. Simple Load Balancers

Feature nginx/HAProxy SOLLOL
Routing Round-robin/random Context-aware, adapts from history
Resource awareness None GPU/CPU/memory-aware
Failover Manual config Automatic detection & recovery
Model sharding llama.cpp integration
Task prioritization Priority queue
Observability Basic Rich metrics + dashboard
Setup Complex config Auto-discover

SOLLOL vs. Kubernetes

Feature Kubernetes SOLLOL
Complexity High - requires cluster setup Low - pip install
AI-specific Generic container orchestration Purpose-built for LLMs
Intelligence None Task-aware routing
Model sharding Manual Automatic
Best for Large-scale production AI-focused teams

Use both! Deploy SOLLOL on Kubernetes for ultimate scalability.


🤝 Contributing

We welcome contributions! Areas we'd love help with:

  • ML-based routing predictions
  • Additional monitoring integrations
  • Cloud provider integrations
  • Performance optimizations
  • Documentation improvements

See CONTRIBUTING.md for guidelines.


📜 License

MIT License - see LICENSE file for details.


🙏 Credits

Created by BenevolentJoker-JohnL

Part of the Complete AI Ecosystem:

Special Thanks:

  • Dallan Loomis - For always providing invaluable support, feedback, and guidance throughout development. Your insights and encouragement have been instrumental in shaping this project.

Built with: Ray, Dask, FastAPI, llama.cpp, Ollama


🎯 What Makes SOLLOL Different?

  1. Combines task distribution AND model sharding in one system
  2. Context-aware routing that adapts based on performance metrics
  3. Auto-discovery of nodes with minimal configuration
  4. Built-in failover and priority queuing
  5. Purpose-built for Ollama clusters (understands GPU requirements, task types)

Limitations to know:

  • Model sharding verified with 13B models; larger models not extensively tested
  • Performance benefits depend on network latency and workload patterns
  • Not a drop-in replacement for single-node setups in all scenarios

Stop manually managing your LLM cluster. Let SOLLOL optimize it for you.

Get StartedView on GitHubReport Issue