Best Practices Guide for ISP Operations

This guide consolidates operational best practices for deploying and running LibreQoS in ISP environments.

Use this page as an operations playbook. Use feature-specific pages for implementation details.

How to Use This Guide

  1. Pick your operating model with the decision tables.

  2. Complete the pre-deployment checklist before first production cutover.

  3. Use the runbooks during maintenance windows and incidents.

  4. Use the day-2 checklist as your recurring operational cadence.

1) Purpose and Audience

This guide is for:

  • ISP owners and technical managers

  • Network engineers and architects

  • NOC/support operators

This guide is not a replacement for installation and component documentation. Start with Quickstart, then use this page to standardize real-world operations.

2) Quick Decision Matrix

Decision Table A: Source of Truth

If this describes your operation

Recommended model

Why

You use a supported CRM/NMS integration end-to-end

Built-in integration mode

Lowest manual drift and simpler recurring updates

You have an internal orchestration pipeline

Custom source of truth mode

Preserves your automation while keeping ownership explicit

You are small and intentionally manage files directly

Manual files mode

Acceptable only when change volume is low and discipline is high

Reference: Operating Modes and Source of Truth, CRM/NMS Integrations

Decision Table B: Topology Strategy

Operational requirement

Recommended strategy

Maximum performance, minimal hierarchy

flat

AP-level aggregation with better performance headroom

ap_only

Site/AP visibility with moderate overhead

ap_site

Full backhaul/path hierarchy is required

full

Reference: Scale Planning and Topology Design

Decision Table C: Deployment Substrate

Deployment choice

Choose when

Main caution

Bare metal

Production-critical throughput and lowest latency overhead are required

Validate NIC support and single-thread CPU performance

VM (for example Proxmox)

You already operate mature virtualization and throughput targets fit VM envelope

Account for virtualization overhead and align virtio multiqueue to vCPU

Reference: System Requirements, Recipe: Proxmox VM Deployment

Feature Coverage Matrix

Feature Area

Operational Best-Practice Focus

Primary Reference

Integrations

Single source-of-truth ownership, overwrite discipline, controlled parameter changes

Integrations

Topology Strategies

Right-size hierarchy depth, balance parent distribution, maintain stable naming

Scale Planning and Topology Design

High Availability and Bypass

Deterministic active/backup routing policy, recurring failover/failback drills

High Availability and Failure Domains

StormGuard

Use as adaptive protection where WAN conditions vary, validate behavior during events

StormGuard

Node Manager UI

Verify operational state after changes; distinguish path failures from UI-only symptoms

Node Manager UI

Deployment Recipes

Apply proven implementation patterns for real-world topology scenarios

Deployment Recipes

3) Pre-Deployment Best Practices

  1. Validate platform fit before cabling.

    • Supported NIC family

    • Sufficient single-thread CPU performance

    • RAM sized for expected subscriber/circuit scale

  2. Validate baseline service health before pilot traffic.

sudo systemctl status lqosd lqos_scheduler
journalctl -u lqosd -u lqos_scheduler --since "10 minutes ago"
  1. Validate source-of-truth ownership before first sync.

    • Decide whether integrations, external scripts, or manual files own persistence

    • Avoid mixed ownership from day one (the most common cause of configuration drift)

  2. Validate integration data hygiene before production shaping.

    • Duplicate IP checks

    • ParentNode consistency checks

    • allow_subnets and ignore_subnets scope checks

Checklist: Pre-Deploy Readiness

  • [ ] Platform meets supported NIC/CPU/RAM requirements

  • [ ] lqosd and lqos_scheduler are healthy

  • [ ] Source-of-truth owner is explicitly documented

  • [ ] Integration credentials and sync settings are validated

  • [ ] Data hygiene checks are clean (duplicates/parents/subnets)

  • [ ] Rollback path is documented before first cutover

4) Source-of-Truth and Integration Best Practices

  1. Enforce one durable owner for shaping data.

    • Built-in integration mode: integration jobs own regenerated files

    • Custom mode: your external system owns generated files

    • Manual mode: direct edits are the durable path

  2. Treat overwrite behavior as an explicit design choice.

    • If integrations own topology, use overwrite behavior intentionally

    • Do not rely on manual edits to generated files unless ownership policy explicitly allows it

  3. Change integration parameters one set at a time.

    • Save changes

    • Restart scheduler if required by your workflow

    • Validate logs and WebUI state before applying additional changes

  4. Use recurring scheduler behavior intentionally.

    • Faster refresh intervals increase control-plane churn

    • Slower intervals reduce churn but delay corrections

Reference: Integrations, Configuration, Troubleshooting

5) Topology and Scale Best Practices

  1. Keep hierarchy only as deep as operationally necessary.

  2. Balance parent distribution to avoid single-core concentration.

  3. Favor stable naming and parent relationships to reduce queue churn.

  4. For multi-edge environments, prefer explicit, operator-controlled path policy over inferred assumptions.

  5. Validate topology changes in a maintenance window, not ad hoc during peak load.

Field pattern:

  • Operators commonly recover performance and stability by moving from unnecessary full hierarchy depth to ap_site or flat when full path control is not required.

Reference: Scale Planning and Topology Design, Recipes

6) Performance and Capacity Best Practices

  1. Design around single-thread performance, not only total core count.

  2. Verify queue/CPU distribution after topology or strategy changes.

  3. In VM deployments, align virtio multiqueue with vCPU and verify under realistic peak load.

  4. Treat MTU/encapsulation mismatches as first-class suspects in throughput anomalies.

  5. Use capacity planning discipline before symptoms force emergency hardware changes.

Reference: System Requirements, Performance Tuning

7) High Availability, Bypass, and Maintenance Best Practices

  1. Use deterministic active/backup policy (OSPF/BGP) for failover and failback.

  2. In switch-centric designs, validate shaped path and bypass path behavior independently.

  3. Keep failover drills routine; do not wait for incidents to test convergence behavior.

  4. Ensure backup paths are sized for realistic degraded-state demand.

Runbook: Maintenance Cutover Validation

  1. Confirm backup path health and expected capacity.

  2. Record current service and path state.

  3. Shift preference to backup policy.

  4. Validate subscriber traffic continuity plus key latency/throughput indicators.

  5. Execute maintenance on primary path.

  6. Restore primary preference.

  7. Validate failback behavior and post-change stability.

  8. Document outcome and any required corrections.

Maintenance Cutover Sequence (Visual)

        flowchart TD
    A[Confirm backup path health and capacity] --> B[Record current service and path state]
    B --> C[Shift routing preference to backup path]
    C --> D{Traffic continuity and KPIs healthy?}
    D -->|No| E[Stop and remediate before maintenance]
    D -->|Yes| F[Perform maintenance on primary path]
    F --> G[Restore primary path preference]
    G --> H{Failback stable?}
    H -->|No| I[Investigate and hold on backup/controlled state]
    H -->|Yes| J[Document closure and outcomes]
    

Reference: High Availability and Failure Domains, Recipe: Switch-Centric Fabric

8) Monitoring and Incident Response Best Practices

  1. Start triage with service health and logs, then move to topology and integration.

  2. Distinguish shaping-path failures from telemetry/UI presentation issues.

  3. Capture reproducible evidence before making broad corrective changes.

  4. Keep a standard incident evidence bundle for escalation.

Standard evidence bundle:

sudo systemctl status lqosd lqos_scheduler
journalctl -u lqosd --since "30 minutes ago"
journalctl -u lqos_scheduler --since "30 minutes ago"
ls -lh /opt/libreqos/src/network.json /opt/libreqos/src/ShapedDevices.csv

Runbook: Shaping Not Applied or Coverage Drops

  1. Confirm services are healthy.

  2. Check scheduler logs for validation failures.

  3. Check for duplicate IP, parent mismatch, or subnet-scope misalignment.

  4. Confirm source-of-truth ownership (manual edits vs integration regeneration).

  5. Re-run scheduler refresh after corrections.

  6. Validate shaped/unshaped trend and any high-priority impacted subscribers.

  7. If unresolved, collect the evidence bundle and escalate with timestamps and recent config changes.

Incident Triage Flow (Visual)

        flowchart TD
    A[Check lqosd and lqos_scheduler service health] --> B{Services healthy?}
    B -->|No| C[Restore service health and re-check]
    B -->|Yes| D[Review scheduler logs for validation failures]
    D --> E{Data hygiene issues present?}
    E -->|Yes| F[Correct duplicate IP, parent mismatch, subnet scope]
    E -->|No| G[Confirm source-of-truth ownership]
    F --> H[Re-run scheduler refresh]
    G --> H
    H --> I{Shaping coverage restored?}
    I -->|Yes| J[Validate subscriber impact and close]
    I -->|No| K[Collect evidence bundle and escalate]
    

Reference: Troubleshooting, Integrations

9) Change Management Best Practices

  1. Use pilot-first progression for strategy and topology changes.

  2. Apply one change set per window and validate before next set.

  3. Always preserve rollback artifacts before major changes.

  4. Log change intent, execution, validation evidence, and closure.

Checklist: Change Window Execution

  • [ ] Scope and success criteria are defined

  • [ ] Rollback plan and artifacts are ready

  • [ ] One change set only (no mixed experiments)

  • [ ] Post-change validation completed and recorded

  • [ ] Escalation path and owner are defined before closure

Checklist: Day-2 Operations Cadence

  • [ ] Daily: service health + urgent issues reviewed

  • [ ] Daily: shaped/unshaped trend reviewed for regressions

  • [ ] Weekly: topology and parent distribution sanity review

  • [ ] Weekly: scheduler/log anomaly review

  • [ ] Monthly: capacity/headroom and hardware-fit review

10) Common Anti-Patterns to Avoid

  1. Competing sources of truth for shaping inputs.

  2. Treating unsupported NICs as production-safe.

  3. Defaulting to deep hierarchy without operational need.

  4. Making failover assumptions without explicit validation.

  5. Mixing multiple major changes in one maintenance window.

  6. Ignoring data hygiene errors (duplicates, parent mismatch, subnet mis-scope).

11) MikroTik RouterOS v7 Practical Notes

These are operational notes, not full router design guidance.

  1. Keep routing policy deterministic between shaped and bypass paths.

  2. Keep interface naming and policy mapping consistent with your LibreQoS runbooks.

  3. Validate failover/failback behavior under maintenance conditions, not only in theory.

  4. For multi-WAN/PCC environments, avoid ambiguous path ownership; ensure each subscriber flow has a predictable shaping path.

Conceptual policy example (cost preference pattern):

/routing ospf interface-template
add interfaces=vlan-primary-path area=backbone-v2 cost=10
add interfaces=vlan-bypass-path area=backbone-v2 cost=200

Reference: Recipe: Switch-Centric Fabric

12) NLNet Verification Mapping (Milestone 10a)

This guide satisfies 10a Best practices guide by providing:

  1. End-to-end operational decision framework (source of truth, topology strategy, substrate choice).

  2. Actionable checklists for pre-deploy, change windows, and day-2 operations.

  3. Incident response runbooks for common operational failure classes.

  4. Field-aligned anti-pattern guidance derived from real deployment behavior.

  5. Cross-linked references to detailed implementation docs and recipes.