
QuinkGL: Decentralized Gossip Learning Framework


QuinkGL is a fully decentralized, peer-to-peer (P2P) federated learning framework that enables collaborative model training across distributed devices without relying on a central parameter server. Built on gossip-based protocols, QuinkGL addresses the core challenges of decentralized learning: communication efficiency, non-IID data heterogeneity, and Byzantine fault tolerance.


Motivation

Centralized federated learning (FL) architectures such as FedAvg [McMahan et al., 2017] depend on a parameter server for global aggregation, introducing a single point of failure and a communication bottleneck. As edge computing scales — driven by IoT proliferation and privacy-sensitive domains like healthcare — decentralized alternatives become essential.

QuinkGL draws from the gossip learning paradigm [Ormándi et al., 2013], where nodes exchange model updates directly with randomly selected peers. This eliminates server dependency and enables organic convergence through repeated local interactions. The framework extends this foundation with:

  • Data-aware peer selection via privacy-preserving fingerprints
  • Entropy-weighted aggregation inspired by RNEP [Kang & Lee, 2024]
  • Byzantine-resilient strategies including Krum [Blanchard et al., 2017] and TrimmedMean
  • Pluggable architecture for topology, aggregation, and model strategies

Key Features

| Feature | Description |
|---|---|
| Fully Decentralized | No central server — pure P2P gossip protocol |
| Non-IID Resilient | AffinityTopology + EntropyWeightedAvg + FedProx + SCAFFOLD for heterogeneous data |
| Privacy-Preserving Fingerprints | Quantized, noised, schema-validated data summaries with per-round binding for peer matching |
| Byzantine Fault Tolerance | Krum, MultiKrum, TrimmedMean aggregation strategies |
| NAT Traversal | IPv8 with UDP hole punching + automatic tunnel fallback |
| Framework Agnostic | PyTorch, TensorFlow, or custom model wrappers |
| Swarm Manifest | Canonical SHA-256 commitment to training protocol and privacy policy |
| Personalized FL | APFL adaptive mixing, FedRep-style backbone/head split |
| Staleness-Aware | StalenessWeightedFedAvg for asynchronous environments |
| Variance Reduction | SCAFFOLD with gossip-adapted control variates (Karimireddy et al., 2020) |
| Spectral Analysis | Runtime algebraic connectivity (λ₂) and spectral gap measurement for topology evaluation |
| Observability | Event-driven telemetry with terminal rendering |

Installation

pip install quinkgl

For development:

git clone https://github.com/QuinkGL/quinkgl-framework.git
cd quinkgl-framework
pip install -e ".[dev]"

Quick Start

CLI (New in Phase 1)

# Install
pip install quinkgl

# 1. Create a manifest (this is the swarm blueprint, not the swarm itself)
quinkgl manifest create \
  --name my-swarm \
  --task-type class \
  --input-shape 3,224,224 \
  --output-shape 10 \
  --label-type integer \
  --model-framework pytorch \
  --model-arch-hash sha256:7f2c1a9b3e4d0123456789abcdef0123456789abcdef0123456789abcdef0123 \
  --aggregation FedAvg \
  --topology Random \
  --output swarm.qgl

# 2. Verify the manifest
quinkgl manifest verify swarm.qgl

# 3. Get a shareable magnet URI
quinkgl manifest magnet swarm.qgl

# 4. Scaffold a custom peer project
quinkgl init --output-dir my-peer --template pytorch-vision --manifest swarm.qgl

# 5. Start a peer — the swarm is born when the first peer runs
quinkgl run --manifest swarm.qgl --script my-peer/peer_script.py --dry-run

Note: Creating the manifest does not start a swarm. The manifest is only a static blueprint. A swarm comes into existence when the first peer calls quinkgl run with that manifest.

Python API

import asyncio
import torch.nn as nn
from quinkgl import GossipNode, PyTorchModel, AffinityTopology, EntropyWeightedAvg

# 1. Define your model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = x.view(x.size(0), -1)
        return self.fc2(self.relu(self.fc1(x)))

# 2. Wrap the model
model = PyTorchModel(SimpleNet(), device="cpu")

# 3. Create and run the node
async def main():
    node = GossipNode(
        node_id="alice",
        domain="mnist",
        model=model,
        port=7000,
        topology=AffinityTopology(min_affinity=0.3),
        aggregation=EntropyWeightedAvg(),
    )

    await node.start()
    # training_data: your local dataset (e.g. a DataLoader); not defined in this snippet
    await node.run_continuous(training_data)
    await node.shutdown()

asyncio.run(main())

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                          GossipNode                              │
│    (Production-ready node with P2P networking + fallback)        │
├──────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────────────┐ │
│  │ PyTorchModel │  │ RandomTopology │  │      FedAvg          │ │
│  │ TensorFlow   │  │ CyclonTopology │  │ FedProx  │ FedAvgM  │ │
│  │ CustomModel  │  │ AffinityTopol. │  │ Krum │ TrimmedMean  │ │
│  │              │  │                │  │ EntropyWeightedAvg   │ │
│  │              │  │                │  │ StalenessWeighted    │ │
│  └──────────────┘  └────────────────┘  └──────────────────────┘ │
├──────────────────────────────────────────────────────────────────┤
│  ┌────────────────────────────────────────────────────────────┐  │
│  │    DataFingerprint ─► AffinityScore ─► Peer Selection     │  │
│  │    (Privacy-preserving data distribution summaries)       │  │
│  └────────────────────────────────────────────────────────────┘  │
├──────────────────────────────────────────────────────────────────┤
│  ┌────────────────────────────────────────────────────────────┐  │
│  │           ModelAggregator (Train → Gossip → Aggregate)    │  │
│  └────────────────────────────────────────────────────────────┘  │
├──────────────────────────────────────────────────────────────────┤
│  ┌────────────────────────────────────────────────────────────┐  │
│  │         IPv8 Network Layer + Tunnel Fallback              │  │
│  │      (P2P, NAT Traversal, UDP Hole Punching, Relay)      │  │
│  └────────────────────────────────────────────────────────────┘  │
├──────────────────────────────────────────────────────────────────┤
│  ┌────────────────────────────────────────────────────────────┐  │
│  │    Observability: EventEmitter → TelemetryClient          │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘

Project Structure

QuinkGL/
├── src/quinkgl/
│   ├── core/                  # LearningNode (network-agnostic abstraction)
│   ├── models/                # PyTorch, TensorFlow, personalized model wrappers
│   ├── topology/              # RandomTopology, CyclonTopology, AffinityTopology, SpectralAnalyzer
│   ├── aggregation/           # FedAvg, FedProx, FedAvgM, Krum, TrimmedMean,
│   │                          # EntropyWeightedAvg, StalenessWeightedFedAvg, Scaffold
│   ├── fingerprint/           # DataFingerprint, AffinityWeights, FingerprintComputer
│   ├── manifest/              # SwarmManifest, DataPolicy, CollaborationPolicy
│   ├── gossip/                # Protocol primitives, ModelAggregator orchestration
│   ├── network/               # GossipNode, IPv8 manager, gossip community
│   ├── training/              # Convergence monitoring, prototype-based alignment
│   ├── serialization/         # Model weight serialization, compression pipeline, Error Feedback
│   ├── storage/               # Model checkpointing
│   ├── observability/         # EventEmitter, RuntimeEvent, TerminalObserver
│   ├── telemetry/             # TelemetryClient
│   └── utils/                 # Shared utilities
├── tests/                     # 364+ unit tests
└── docs/                      # Deployment guides, research notes

Package Responsibilities

| Package | Responsibility |
|---|---|
| core | Public node abstraction without transport concerns |
| gossip | Round orchestration and protocol primitives |
| network | IPv8 transport, NAT traversal, and wire delivery |
| aggregation | Model merge strategies (pluggable) |
| topology | Peer selection, partial-view management, spectral analysis |
| fingerprint | Privacy-preserving data distribution summaries |
| manifest | Cryptographic swarm identity and policy declaration |
| training | Convergence monitoring, prototype alignment (FedProto/FedPAC) |
| serialization | Model weight serialization, compression pipeline, error feedback |
| observability | Event-driven runtime telemetry |

Topology Strategies

QuinkGL provides pluggable peer selection strategies that determine which peers to exchange models with each round.

| Strategy | Approach | Literature |
|---|---|---|
| RandomTopology | Uniform random peer selection | Ormándi et al., 2013 |
| CyclonTopology | Periodic shuffling for network exploration | Voulgaris et al., 2005 |
| AffinityTopology | Data-aware peer selection via fingerprint similarity with exploration–exploitation balancing | Domain-aware collaboration (this work) |

Spectral Analysis

The SpectralAnalyzer provides runtime measurement of topology quality through algebraic connectivity and spectral gap — quantities that directly determine gossip convergence speed [Koloskova et al., 2020].

from quinkgl.topology import SpectralAnalyzer, build_ring_adjacency

analyzer = SpectralAnalyzer()
report = analyzer.analyze(build_ring_adjacency(10))
print(report.summary())
# n=10 e=10 λ₂=0.3820 gap=0.1315 connected=True mix_time≤17.5

| Metric | Meaning |
|---|---|
| algebraic_connectivity (λ₂) | Fiedler value — positive ↔ connected graph |
| spectral_gap (1−\|λ₂(W)\|) | Larger gap → faster gossip convergence |
| mixing_time_upper | Upper bound: log(n) / spectral_gap |
| is_connected | Whether the graph is fully connected |
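
For intuition, the Fiedler value the analyzer reports can be reproduced with plain NumPy. This is a sketch of the underlying math, not the SpectralAnalyzer implementation; the `ring_adjacency` helper stands in for build_ring_adjacency:

```python
import numpy as np

def ring_adjacency(n: int) -> np.ndarray:
    # n-node ring: each node is connected to its two neighbours.
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    return A

def algebraic_connectivity(A: np.ndarray) -> float:
    # Graph Laplacian L = D - A; its second-smallest eigenvalue is the Fiedler value.
    L = np.diag(A.sum(axis=1)) - A
    return float(np.sort(np.linalg.eigvalsh(L))[1])

lam2 = algebraic_connectivity(ring_adjacency(10))
print(round(lam2, 4))  # 0.382, matching the λ₂ in the report above
```

A positive value confirms the graph is connected; denser topologies push λ₂ (and convergence speed) up.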

AffinityTopology — Like-Attracts-Like

AffinityTopology selects peers based on data distribution similarity using privacy-preserving fingerprints. It incorporates:

  • Multi-signal affinity — label buckets (40%), feature moments (30%), gradient similarity (15%), collaboration history (15%)
  • Cold-start resilience — three phases (blind → learning → exploiting) with decaying exploration ratio
  • Adaptive collaboration graph — EMA-blended edge weights with automatic decay and eviction of stale edges
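
The multi-signal blend can be sketched in a few lines, assuming each signal is pre-normalized to [0, 1]. The function and its signature are illustrative, not QuinkGL's internal API; the weights are the defaults quoted above:

```python
def affinity_score(label_sim, moment_sim, grad_sim, history_sim,
                   weights=(0.40, 0.30, 0.15, 0.15)):
    # Weighted blend of the four affinity signals: label buckets, feature
    # moments, gradient similarity, and collaboration history.
    signals = (label_sim, moment_sim, grad_sim, history_sim)
    return sum(w * s for w, s in zip(weights, signals))

print(round(affinity_score(0.9, 0.8, 0.5, 0.6), 3))  # 0.765
```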

Communication Efficiency — Error Feedback

QuinkGL's compression pipeline (Delta → Sparsify → Quantize → Serialize → Zlib) uses biased compressors (Top-k, QSGD). Without correction, these break convergence guarantees. The ErrorFeedbackState module implements the Error Feedback mechanism [Alistarh et al., 2018] that accumulates the compression residual and re-injects it in the next round:

from quinkgl.serialization import CompressionConfig, SparsificationConfig

config = CompressionConfig(
    sparsification=SparsificationConfig(top_k_ratio=0.01),
    error_feedback=True,   # activate EF: compensates the compressor's bias across rounds
)

Key property: Over K rounds, Σ compressed_outputs + final_residual = Σ raw_deltas (information conservation, verified by unit tests). Supports EF21-style momentum blending and optional residual norm capping.
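
The mechanism can be sketched in a few lines of NumPy with a toy Top-k compressor (not the framework's pipeline), including the conservation check stated above:

```python
import numpy as np

def top_k(x, k):
    # Biased Top-k compressor: keep only the k largest-magnitude entries.
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
residual = np.zeros(100)
sent_total = np.zeros(100)
delta_total = np.zeros(100)

for _ in range(5):                        # K gossip rounds
    delta = rng.normal(size=100)          # raw model delta this round
    corrected = delta + residual          # re-inject last round's residual
    compressed = top_k(corrected, k=10)   # what actually gets transmitted
    residual = corrected - compressed     # accumulate what was dropped
    sent_total += compressed
    delta_total += delta

# Information conservation: everything sent plus the leftover residual
# equals the sum of the raw deltas.
assert np.allclose(sent_total + residual, delta_total)
```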

Aggregation Strategies

All strategies implement the AggregationStrategy interface and are hot-swappable.

| Strategy | Type | Description | Reference |
|---|---|---|---|
| FedAvg | Standard | Weighted averaging by sample count | McMahan et al., 2017 |
| FedProx | Non-IID | Proximal term to limit client drift | Li et al., 2020 |
| FedAvgM | Stability | Server momentum for smoother convergence | Hsu et al., 2019 |
| EntropyWeightedAvg | Non-IID | Shannon entropy–based weighting (RNEP-inspired) | Kang & Lee, 2024 |
| StalenessWeightedFedAvg | Async | Exponential penalty for stale updates | |
| Scaffold | Non-IID | Control-variate drift correction (gossip variant) | Karimireddy et al., 2020 |
| TrimmedMean | Byzantine | Trim extreme values before averaging | Yin et al., 2018 |
| Krum / MultiKrum | Byzantine | Select most central update(s) | Blanchard et al., 2017 |
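
As a sketch of the Byzantine-resilient idea, here is standard Krum scoring in NumPy (the textbook algorithm, not QuinkGL's implementation): each update is scored by its distance to its closest neighbours, so an outlier injected by a malicious peer is never selected.

```python
import numpy as np

def krum(updates, f):
    # Krum [Blanchard et al., 2017]: score each update by the sum of squared
    # distances to its n - f - 2 nearest neighbours; return the lowest-scoring index.
    n = len(updates)
    scores = []
    for i, u in enumerate(updates):
        dists = sorted(float(np.sum((u - v) ** 2))
                       for j, v in enumerate(updates) if j != i)
        scores.append(sum(dists[: n - f - 2]))
    return int(np.argmin(scores))

rng = np.random.default_rng(1)
honest = [rng.normal(0, 0.1, size=4) for _ in range(5)]
byzantine = [np.full(4, 100.0)]          # one malicious, extreme update
updates = honest + byzantine
winner = krum(updates, f=1)
assert winner != 5                       # the Byzantine update is never chosen
```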

EntropyWeightedAvg — RNEP-Inspired Aggregation

Weights each peer's contribution by the Shannon entropy of its local label distribution. Peers with diverse (high-entropy) data exert more influence on the aggregated model, while skewed (low-entropy) peers contribute less — preventing overfitting to biased local distributions.

from quinkgl import EntropyWeightedAvg

aggregation = EntropyWeightedAvg(
    entropy_floor=0.01,    # minimum weight for single-class peers
    fallback_weight=1.0,   # weight when no distribution metadata available
)
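
The weighting rule described above can be sketched as a stand-alone function using normalized Shannon entropy (illustrative; the strategy's actual internals may differ):

```python
import math

def entropy_weight(label_counts, entropy_floor=0.01):
    # Shannon entropy of the peer's label distribution, normalized to [0, 1]
    # and floored so single-class peers still contribute a little.
    total = sum(label_counts)
    probs = [c / total for c in label_counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(len(label_counts))
    return max(entropy / max_entropy, entropy_floor) if max_entropy > 0 else entropy_floor

print(entropy_weight([100, 100, 100, 100]))  # 1.0  (uniform -> full weight)
print(entropy_weight([397, 1, 1, 1]))        # ≈ 0.038 (skewed -> low weight)
print(entropy_weight([400, 0, 0, 0]))        # 0.01 (single class -> floor)
```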

Scaffold — Variance Reduction via Control Variates

Implements the SCAFFOLD algorithm [Karimireddy et al., 2020] adapted for gossip topology. Each node maintains a control variate that estimates its local gradient drift. The gossip variant replaces the central server's global control variate with a running EMA of peer control variates.

from quinkgl import Scaffold

aggregation = Scaffold(
    learning_rate=0.01,       # local SGD learning rate
    global_learning_rate=1.0, # aggregation-side scaling
    control_momentum=0.0,     # EMA momentum for peer control variates (0.0 = simple average)
)

Key property: SCAFFOLD provably reduces the gradient variance caused by non-IID data, unlike FedProx which only adds a proximal penalty.


Privacy-Preserving Data Fingerprints

Each node computes a lightweight, privacy-preserving summary of its local data distribution. Raw statistics are never shared — all fields are transformed before transmission.

| Raw Field | Privacy Transform | Output |
|---|---|---|
| Label distribution | Quantize into buckets (low/medium/high) | label_buckets |
| Feature moments (mean, var) | Add calibrated Gaussian noise | noised_moments |
| Sample count | Bucket into ranges (e.g., "1k–10k") | sample_bucket |
| Gradient moments | Noise + disabled by default (gradient inversion risk) | gradient_moments |

Fingerprints are exchanged during peer discovery and used by AffinityTopology to compute affinity scores.

Fingerprint payloads are schema-versioned, strictly validated on parse, and can be refreshed with a per-round nonce during long-running gossip sessions to reduce cross-round linkability.
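
For intuition, the first two transforms in the table above might look like this. Thresholds and noise scale are illustrative, not QuinkGL's defaults:

```python
import random

def bucketize_labels(label_fractions, thresholds=(0.1, 0.4)):
    # Quantize each class fraction into a coarse low/medium/high bucket,
    # so exact label counts are never revealed.
    lo, hi = thresholds
    return ["low" if f < lo else "medium" if f < hi else "high"
            for f in label_fractions]

def noise_moments(mean, var, sigma=0.05, seed=None):
    # Add Gaussian noise to feature moments before transmission.
    rng = random.Random(seed)
    return mean + rng.gauss(0, sigma), var + rng.gauss(0, sigma)

print(bucketize_labels([0.70, 0.25, 0.05]))  # ['high', 'medium', 'low']
```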


Swarm Manifest

The Swarm Manifest (.qgl file) is the canonical protocol-identity layer: it binds swarm compatibility to a deterministic description of the training protocol, model architecture, aggregation strategy, topology, and trust boundary.

A manifest is not a running swarm — it is only a static blueprint. The swarm comes into existence when peers call quinkgl run --manifest swarm.qgl.

Manifests are:

  • Canonically hashed (SHA-256 over deterministic JSON) so any change to policy or architecture produces a new swarm identity.
  • Schema-versioned and strictly validated to avoid silent field drops or incompatible policy mixes.
  • Optionally signed with Ed25519 so peers can verify creator identity before joining.
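
The canonical-hashing idea in the first bullet can be sketched with the standard library (field names here are illustrative, not the real manifest schema):

```python
import hashlib
import json

def canonical_hash(manifest: dict) -> str:
    # Deterministic JSON: sorted keys, compact separators, UTF-8 bytes;
    # the same logical manifest always hashes to the same identity.
    blob = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(blob).hexdigest()

a = canonical_hash({"name": "my-swarm", "aggregation": "FedAvg"})
b = canonical_hash({"aggregation": "FedAvg", "name": "my-swarm"})  # key order irrelevant
c = canonical_hash({"name": "my-swarm", "aggregation": "FedProx"})
assert a == b and a != c  # any policy change yields a new swarm identity
```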

To create a manifest you need the architecture hash of your model, which is a fingerprint of layer names, shapes, and dtypes (not weights). Compute it with quinkgl.manifest.compute_arch_hash(model) and pass it to quinkgl manifest create --model-arch-hash <hash>.


Personalized Federated Learning

QuinkGL supports personalization techniques to handle statistical heterogeneity:

| Technique | Description |
|---|---|
| APFL (Adaptive Personalized FL) | Adaptive mixing coefficient between local and global models |
| FedRep-style split | Shared backbone + personalized head via ModelSplit |
| FedProto / FedPAC | Prototype-based alignment and classifier collaboration |
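
The APFL mixing step can be sketched on flat NumPy weight vectors (a toy version of the rule in Deng et al.; function and parameter names are illustrative):

```python
import numpy as np

def apfl_mix(w_local, w_global, alpha):
    # Personalized model: convex combination of local and global weights.
    return alpha * w_local + (1 - alpha) * w_global

def apfl_alpha_step(alpha, grad_mixed, w_local, w_global, lr=0.1):
    # Adaptive coefficient update: d(loss)/d(alpha) is the inner product of the
    # gradient at the mixed point with (w_local - w_global); clip alpha to [0, 1].
    d_alpha = float(np.dot(grad_mixed, w_local - w_global))
    return float(np.clip(alpha - lr * d_alpha, 0.0, 1.0))

w_l, w_g = np.array([1.0, 0.0]), np.array([0.0, 1.0])
assert np.allclose(apfl_mix(w_l, w_g, 1.0), w_l)  # alpha=1 -> purely local
assert np.allclose(apfl_mix(w_l, w_g, 0.0), w_g)  # alpha=0 -> purely global
```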

Public API Overview

Core

| Class | Description |
|---|---|
| LearningNode | Framework node without networking (bring your own transport) |
| GossipNode | Production node with IPv8 P2P + automatic tunnel fallback |

Models

| Class | Description |
|---|---|
| PyTorchModel | Wrapper for PyTorch nn.Module with NaN validation, gradient clipping |
| TensorFlowModel | Wrapper for TensorFlow/Keras models |
| ModelWrapper | Base class for custom framework wrappers |
| PersonalizedModelWrapper | Base for APFL-style personalized models |
| TrainingConfig | Training configuration (epochs, batch_size, lr, grad_clip, optimizer) |

Fingerprint

| Class | Description |
|---|---|
| DataFingerprint | Privacy-preserving data distribution summary |
| FingerprintComputer | Computes fingerprints from raw data with configurable privacy |
| AffinityWeights | Weights for multi-signal affinity computation |
| FingerprintPrivacyConfig | ε-DP budget, noise levels, bucket granularity |

Manifest & Policy

| Class | Description |
|---|---|
| DataPolicy | Minimum affinity, privacy level, cold-start rounds |
| CollaborationPolicy | Aggregation and topology parameters |
| PersonalizationPolicy | APFL, FedRep configuration |
| PrototypePolicy | FedProto/FedPAC alignment settings |

Observability

| Class | Description |
|---|---|
| EventEmitter | Publish/subscribe runtime events |
| RuntimeEvent | Structured event payload |
| TerminalObserver | Human-readable terminal rendering |
| TelemetryClient | Telemetry data collection |

Requirements

  • Python 3.10+
  • PyTorch 1.9+ (optional, for PyTorchModel)
  • TensorFlow 2.x (optional, for TensorFlowModel)
  • IPv8 2.0+ (for P2P networking)
  • NumPy

Documentation

The canonical documentation set lives under docs/. Use docs/index.md as the entry point: it has a short decision tree and a table of contents that mirrors the book layout (Sphinx toctree).

Quick entry

| Document | Description |
|---|---|
| docs/index.md | Hub: decision tree and links into all sections |
| docs/quickstart.md | Minimal "get running" path |
| docs/getting-started.md | Full getting started (English) |
| docs/getting-started-tr.md | Full getting started (Turkish) |
| docs/faq.md | Frequently asked questions |

By section

| Section | Start here |
|---|---|
| User guide | docs/user-guide/index.md (manifest, peer script, trust, telemetry, troubleshooting) |
| CLI | docs/cli/index.md (manifest, run, init, keygen, …) |
| Tutorials | docs/tutorials/index.md (T1–T6) |
| Concepts | docs/concepts/index.md (gossip, swarm, fingerprints) |
| Reference | docs/reference/index.md (API, manifest schema, error codes) |
| Security | docs/security/index.md (threat model, signing, TOFU, rate limits) |
| Cookbook | docs/cookbook/index.md (local swarm, multi-peer testing, custom wrappers) |
| Migration | docs/migration/index.md |

References

  • McMahan et al. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS. (FedAvg)
  • Ormándi et al. (2013). Gossip Learning with Linear Models on Fully Distributed Data. Concurrency and Computation.
  • Blanchard et al. (2017). Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. NeurIPS. (Krum)
  • Yin et al. (2018). Byzantine-Robust Distributed Learning. ICML. (TrimmedMean)
  • Li et al. (2020). Federated Optimization in Heterogeneous Networks. MLSys. (FedProx)
  • Hsu et al. (2019). Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification. (FedAvgM)
  • Kang & Lee (2024). RNEP: Random Node Entropy Pairing for Efficient Decentralized Training with Non-IID Local Data. Electronics, 13(21), 4193. (EntropyWeightedAvg)
  • Karimireddy et al. (2020). SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. ICML. (Scaffold)
  • Alistarh et al. (2018). The Convergence of Sparsified Gradient Methods. NeurIPS. (Error Feedback)
  • Richtárik et al. (2021). EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback. NeurIPS. (EF21 momentum)
  • Koloskova et al. (2020). Unified Theory of Decentralized SGD with Changing Topology and Local Updates. ICML. (Spectral Gap)
  • Boyd et al. (2006). Randomized Gossip Algorithms. IEEE Trans. Inf. Theory. (Metropolis–Hastings mixing)
  • Voulgaris et al. (2005). Cyclon: Inexpensive Membership Management for Unstructured P2P Overlays. JNSM. (CyclonTopology)
  • Deng et al. (2021). Adaptive Personalized Federated Learning. (APFL)

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2026 Ali Seyhan, Baki Turhan


Contributing

Contributions are welcome! Please read our contributing guidelines and submit pull requests to the main repository.