Best Practices Guide for ISP Operations

This guide consolidates operational best practices for deploying and running LibreQoS in ISP environments.

Use this page as an operations playbook. Use feature-specific pages for implementation details.

How to Use This Guide

Pick your operating model with the decision tables.
Complete the pre-deployment checklist before first production cutover.
Use the runbooks during maintenance windows and incidents.
Use the day-2 checklist as your recurring operational cadence.

1) Purpose and Audience

This guide is for:

ISP owners and technical managers
Network engineers and architects
NOC/support operators

This guide is not a replacement for installation and component documentation. Start with Quickstart, then use this page to standardize real-world operations.

2) Quick Decision Matrix

Decision Table A: Source of Truth

If this describes your operation	Recommended model	Why
You use a supported CRM/NMS integration end-to-end	Built-in integration mode	Lowest manual drift and simpler recurring updates
You have an internal orchestration pipeline	Custom source of truth mode	Preserves your automation while keeping ownership explicit
You are small and intentionally manage files directly	Manual files mode	Acceptable only when change volume is low and discipline is high

Reference: Operating Modes and Source of Truth, CRM/NMS Integrations

Decision Table B: Topology Strategy

Operational requirement	Recommended strategy
Maximum performance, minimal hierarchy	`flat`
AP-level aggregation with better performance headroom	`ap_only`
Site/AP visibility with moderate overhead	`ap_site`
Full backhaul/path hierarchy is required	`full`

Reference: Scale Planning and Topology Design

Decision Table C: Deployment Substrate

Deployment choice	Choose when	Main caution
Bare metal	Production-critical throughput and lowest latency overhead are required	Validate NIC support and single-thread CPU performance
VM (for example Proxmox)	You already operate mature virtualization and throughput targets fit VM envelope	Account for virtualization overhead and align virtio multiqueue to vCPU

Reference: System Requirements, Recipe: Proxmox VM Deployment

Feature Coverage Matrix

Feature Area	Operational Best-Practice Focus	Primary Reference
Integrations	Single source-of-truth ownership, overwrite discipline, controlled parameter changes	Integrations
Topology Strategies	Right-size hierarchy depth, balance parent distribution, maintain stable naming	Scale Planning and Topology Design
High Availability and Bypass	Deterministic active/backup routing policy, recurring failover/failback drills	High Availability and Failure Domains
StormGuard	Use as adaptive protection where WAN conditions vary, validate behavior during events	StormGuard
Node Manager UI	Verify operational state after changes; distinguish path failures from UI-only symptoms	Node Manager UI
Deployment Recipes	Apply proven implementation patterns for real-world topology scenarios	Deployment Recipes

3) Pre-Deployment Best Practices

Validate platform fit before cabling.
  - Supported NIC family
  - Sufficient single-thread CPU performance
  - RAM sized for expected subscriber/circuit scale
Validate baseline service health before pilot traffic.

sudo systemctl status lqosd lqos_scheduler
journalctl -u lqosd -u lqos_scheduler --since "10 minutes ago"

Validate source-of-truth ownership before first sync.
  - Decide whether integrations, external scripts, or manual files own persistence
  - Avoid mixed ownership from day one (the most common cause of configuration drift)
Validate integration data hygiene before production shaping.
  - Duplicate IP checks
  - ParentNode consistency checks
  - allow_subnets and ignore_subnets scope checks
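
The duplicate IP and parent consistency checks can be scripted before the first sync. The following is a minimal sketch, not the LibreQoS validator: it assumes ShapedDevices.csv has columns named "Device Name", "Parent Node", "IPv4", and "IPv6", and that network.json nests child nodes under a "children" key. Adjust those names to match your files. The sample data is invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Assumed column names; adjust them to match the header of your ShapedDevices.csv.
IP_COLUMNS = ("IPv4", "IPv6")
PARENT_COLUMN = "Parent Node"
DEVICE_COLUMN = "Device Name"


def node_names(network):
    """Collect every node name from a nested network.json-style dict."""
    names = set()
    for name, node in network.items():
        names.add(name)
        names |= node_names(node.get("children", {}))
    return names


def hygiene_report(csv_text, network):
    """Return (duplicate_ips, unknown_parents) for the given inputs."""
    seen = defaultdict(list)   # IP string -> device names that use it
    unknown_parents = []       # (device, parent) pairs missing from the tree
    known = node_names(network)
    for row in csv.DictReader(io.StringIO(csv_text)):
        device = row.get(DEVICE_COLUMN, "")
        for column in IP_COLUMNS:
            # A quoted cell may hold several comma-separated addresses.
            for ip in (row.get(column) or "").split(","):
                if ip.strip():
                    seen[ip.strip()].append(device)
        parent = (row.get(PARENT_COLUMN) or "").strip()
        if parent and parent not in known:
            unknown_parents.append((device, parent))
    duplicates = {ip: devs for ip, devs in seen.items() if len(devs) > 1}
    return duplicates, unknown_parents


# Hypothetical sample data for illustration only.
SAMPLE_CSV = """Device Name,Parent Node,IPv4,IPv6
cpe-1,AP_A,100.64.0.10,
cpe-2,AP_A,100.64.0.11,
cpe-3,AP_Missing,100.64.0.10,
"""
SAMPLE_NETWORK = {"Site_1": {"children": {"AP_A": {}}}}

duplicates, unknown_parents = hygiene_report(SAMPLE_CSV, SAMPLE_NETWORK)
print("Duplicate IPs:", duplicates)
print("Unknown parents:", unknown_parents)
```

To run it against real inputs, read the CSV file into a string and load network.json with json.load, then treat any non-empty result as a blocker for production shaping.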

Checklist: Pre-Deploy Readiness

[ ] Platform meets supported NIC/CPU/RAM requirements
[ ] lqosd and lqos_scheduler are healthy
[ ] Source-of-truth owner is explicitly documented
[ ] Integration credentials and sync settings are validated
[ ] Data hygiene checks are clean (duplicates/parents/subnets)
[ ] Rollback path is documented before first cutover

4) Source-of-Truth and Integration Best Practices

Enforce one durable owner for shaping data.
  - Built-in integration mode: integration jobs own regenerated files
  - Custom mode: your external system owns generated files
  - Manual mode: direct edits are the durable path
Treat overwrite behavior as an explicit design choice.
  - If integrations own topology, use overwrite behavior intentionally
  - Do not rely on manual edits to generated files unless ownership policy explicitly allows it
Change integration parameters one set at a time.
  - Save changes
  - Restart the scheduler if required by your workflow
  - Validate logs and WebUI state before applying additional changes
Use recurring scheduler behavior intentionally.
  - Faster refresh intervals increase control-plane churn
  - Slower intervals reduce churn but delay corrections

Reference: Integrations, Configuration, Troubleshooting

5) Topology and Scale Best Practices

Keep hierarchy only as deep as operationally necessary.
Balance the distribution of circuits across parent nodes so that shaping load does not concentrate on a single CPU core.
Favor stable naming and parent relationships to reduce queue churn.
For multi-edge environments, prefer explicit, operator-controlled path policy over inferred assumptions.
Validate topology changes in a maintenance window, not ad hoc during peak load.
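
Hierarchy depth and parent balance can be estimated offline before a maintenance window. This is a rough sketch under stated assumptions: network.json is treated as a nested dict with child nodes under a "children" key, and each circuit is represented only by the name of its parent node. The function names and sample topology are hypothetical, and the per-top-level count is only a proxy for how load may spread across cores.

```python
from collections import Counter


def max_depth(network):
    """Depth of the deepest branch in a nested network.json-style dict."""
    if not network:
        return 0
    return 1 + max(max_depth(node.get("children", {})) for node in network.values())


def top_level_of(network):
    """Map every node name to the top-level node it sits under."""
    mapping = {}

    def walk(children, root):
        for name, node in children.items():
            mapping[name] = root
            walk(node.get("children", {}), root)

    for root, node in network.items():
        mapping[root] = root
        walk(node.get("children", {}), root)
    return mapping


def circuits_per_top_level(network, circuit_parents):
    """Count circuits under each top-level node, given each circuit's parent name."""
    roots = top_level_of(network)
    return Counter(roots.get(parent, "(unparented)") for parent in circuit_parents)


# Hypothetical topology and circuit parents for illustration only.
SAMPLE_NETWORK = {
    "Site_A": {"children": {"AP_1": {}, "AP_2": {}}},
    "Site_B": {"children": {"Tower_1": {"children": {"AP_3": {}}}}},
}
SAMPLE_PARENTS = ["AP_1", "AP_1", "AP_2", "AP_3", "Site_A"]

depth = max_depth(SAMPLE_NETWORK)
load = circuits_per_top_level(SAMPLE_NETWORK, SAMPLE_PARENTS)
share = max(load.values()) / sum(load.values())
print(f"depth={depth}, load={dict(load)}, busiest share={share:.0%}")
```

A high busiest share suggests that one branch of the tree carries most circuits, which is worth reviewing before and after any strategy change.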

Field pattern:

Operators commonly recover performance and stability by moving from an unnecessarily deep `full` hierarchy to `ap_site` or `flat` when full path control is not required.

Reference: Scale Planning and Topology Design, Recipes

6) Performance and Capacity Best Practices

Design around single-thread performance, not only total core count.
Verify queue/CPU distribution after topology or strategy changes.
In VM deployments, set the virtio multiqueue count to match the vCPU count and verify under realistic peak load.
Treat MTU/encapsulation mismatches as first-class suspects in throughput anomalies.
Plan capacity ahead of demand so that symptoms do not force emergency hardware changes.
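
When an MTU or encapsulation mismatch is suspected, the arithmetic is quick to check. The sketch below uses standard per-packet overheads for a few common encapsulations and a hypothetical helper, effective_mtu, to show how much room is left for the inner packet. It covers only the listed encapsulations and assumes no IP or TCP options.

```python
# Per-packet overhead in bytes for some common encapsulations.
ENCAP_OVERHEAD = {
    "pppoe": 8,        # 6-byte PPPoE header + 2-byte PPP protocol field
    "mpls_label": 4,   # one MPLS label stack entry
    "gre_ipv4": 24,    # 20-byte outer IPv4 header + 4-byte basic GRE header
    "vxlan_ipv4": 50,  # outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14)
}


def effective_mtu(link_mtu, encapsulations):
    """Largest inner packet that fits in link_mtu after the listed encapsulations."""
    return link_mtu - sum(ENCAP_OVERHEAD[name] for name in encapsulations)


def tcp_mss(mtu, ipv6=False):
    """TCP MSS for a given IP MTU, assuming no IP or TCP options."""
    return mtu - (40 if ipv6 else 20) - 20


pppoe_mtu = effective_mtu(1500, ["pppoe"])
print(pppoe_mtu, tcp_mss(pppoe_mtu))        # 1492 1452
print(effective_mtu(1500, ["vxlan_ipv4"]))  # 1450
```

If subscriber throughput collapses only for large packets, compare the MTU configured along the shaped path with the value this arithmetic predicts.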

Reference: System Requirements, Performance Tuning

7) High Availability, Bypass, and Maintenance Best Practices

Use a deterministic active/backup routing policy (for example OSPF or BGP) for failover and failback.
In switch-centric designs, validate shaped path and bypass path behavior independently.
Keep failover drills routine; do not wait for incidents to test convergence behavior.
Ensure backup paths are sized for realistic degraded-state demand.
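
Backup path sizing can be sanity-checked with simple arithmetic. This is an illustrative sketch only: the function name, the 20 percent margin, and the traffic figures are assumptions, not recommended values.

```python
def backup_headroom_mbps(backup_capacity_mbps, peak_demand_mbps, margin_percent=20):
    """Capacity left on the backup path after peak demand plus a safety margin."""
    required = peak_demand_mbps * (100 + margin_percent) / 100
    return backup_capacity_mbps - required


# Illustrative numbers only: a 10 Gbps backup path against two possible peaks.
for peak in (7_500, 9_000):
    spare = backup_headroom_mbps(10_000, peak)
    verdict = "fits" if spare >= 0 else "undersized"
    print(f"peak {peak} Mbps: {spare:+.0f} Mbps spare, backup path {verdict}")
```

Use measured peak demand from your own monitoring rather than contracted rates, since degraded-state demand is what the backup path must carry.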

Runbook: Maintenance Cutover Validation

Confirm backup path health and expected capacity.
Record current service and path state.
Shift routing preference to the backup path.
Validate subscriber traffic continuity plus key latency/throughput indicators.
Execute maintenance on primary path.
Restore primary path preference.
Validate failback behavior and post-change stability.
Document outcome and any required corrections.

Maintenance Cutover Sequence (Visual)

flowchart TD
    A[Confirm backup path health and capacity] --> B[Record current service and path state]
    B --> C[Shift routing preference to backup path]
    C --> D{Traffic continuity and KPIs healthy?}
    D -->|No| E[Stop and remediate before maintenance]
    D -->|Yes| F[Perform maintenance on primary path]
    F --> G[Restore primary path preference]
    G --> H{Failback stable?}
    H -->|No| I[Investigate and hold on backup/controlled state]
    H -->|Yes| J[Document closure and outcomes]

Reference: High Availability and Failure Domains, Recipe: Switch-Centric Fabric

8) Monitoring and Incident Response Best Practices

Start triage with service health and logs, then move to topology and integration.
Distinguish shaping-path failures from telemetry/UI presentation issues.
Capture reproducible evidence before making broad corrective changes.
Keep a standard incident evidence bundle for escalation.

Standard evidence bundle:

sudo systemctl status lqosd lqos_scheduler
journalctl -u lqosd --since "30 minutes ago"
journalctl -u lqos_scheduler --since "30 minutes ago"
ls -lh /opt/libreqos/src/network.json /opt/libreqos/src/ShapedDevices.csv
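
Collecting the bundle the same way every time makes escalations easier to compare. The following sketch wraps the commands above in a small collector that writes each output to a timestamped directory. The file names, directory prefix, and function names are hypothetical, and the demo at the end uses a stand-in runner so that nothing is executed against the system.

```python
import subprocess
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Commands copied from the evidence bundle above, one output file per command.
BUNDLE_COMMANDS = {
    "systemctl_status.txt": ["systemctl", "status", "lqosd", "lqos_scheduler"],
    "lqosd.log": ["journalctl", "-u", "lqosd", "--since", "30 minutes ago"],
    "lqos_scheduler.log": ["journalctl", "-u", "lqos_scheduler", "--since", "30 minutes ago"],
    "input_files.txt": ["ls", "-lh", "/opt/libreqos/src/network.json",
                        "/opt/libreqos/src/ShapedDevices.csv"],
}


def run_command(argv):
    """Run one command and return its combined stdout and stderr as text."""
    try:
        done = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                              text=True, errors="replace", check=False)
    except OSError as exc:  # for example, the binary is not installed
        return f"could not run {argv[0]}: {exc}\n"
    return done.stdout


def collect_bundle(base_dir, runner=run_command):
    """Write each command's output into a new UTC-timestamped directory."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir = Path(base_dir) / f"lqos-evidence-{stamp}"
    out_dir.mkdir(parents=True)
    for filename, argv in BUNDLE_COMMANDS.items():
        (out_dir / filename).write_text(runner(argv))
    return out_dir


# Dry run with a stand-in runner, so the demo touches no real services.
with tempfile.TemporaryDirectory() as tmp:
    bundle = collect_bundle(tmp, runner=lambda argv: "dry run: " + " ".join(argv) + "\n")
    written = sorted(p.name for p in bundle.iterdir())
print(written)
```

On a real host, call collect_bundle with a persistent directory and the default runner, with enough privileges to read the journal.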

Runbook: Shaping Not Applied or Coverage Drops

Confirm services are healthy.
Check scheduler logs for validation failures.
Check for duplicate IP, parent mismatch, or subnet-scope misalignment.
Confirm source-of-truth ownership (manual edits vs integration regeneration).
Re-run scheduler refresh after corrections.
Validate shaped/unshaped trend and any high-priority impacted subscribers.
If unresolved, collect the evidence bundle and escalate with timestamps and recent config changes.
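
The subnet-scope step of this runbook can be checked with Python's standard ipaddress module. This is a minimal sketch with a hypothetical function name: it flags device addresses that fall inside any ignore_subnets entry or outside every allow_subnets entry, and does not attempt to reproduce how LibreQoS itself evaluates those settings. It expects single addresses, not CIDR ranges, as device IPs.

```python
import ipaddress


def scope_problems(device_ips, allow_subnets, ignore_subnets):
    """Return {ip: reason} for addresses that look out of shaping scope."""
    allowed = [ipaddress.ip_network(s) for s in allow_subnets]
    ignored = [ipaddress.ip_network(s) for s in ignore_subnets]
    problems = {}
    for text in device_ips:
        ip = ipaddress.ip_address(text)
        # Mixed IPv4/IPv6 membership tests simply evaluate to False.
        if any(ip in net for net in ignored):
            problems[text] = "inside ignore_subnets"
        elif not any(ip in net for net in allowed):
            problems[text] = "outside allow_subnets"
    return problems


# Hypothetical values for illustration only.
problems = scope_problems(
    device_ips=["100.64.1.5", "100.64.255.9", "192.168.1.20"],
    allow_subnets=["100.64.0.0/10"],
    ignore_subnets=["100.64.255.0/24"],
)
print(problems)
```

Any address reported here is a candidate explanation for a subscriber that is provisioned but not shaped.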

Incident Triage Flow (Visual)

flowchart TD
    A[Check lqosd and lqos_scheduler service health] --> B{Services healthy?}
    B -->|No| C[Restore service health and re-check]
    B -->|Yes| D[Review scheduler logs for validation failures]
    D --> E{Data hygiene issues present?}
    E -->|Yes| F[Correct duplicate IP, parent mismatch, subnet scope]
    E -->|No| G[Confirm source-of-truth ownership]
    F --> H[Re-run scheduler refresh]
    G --> H
    H --> I{Shaping coverage restored?}
    I -->|Yes| J[Validate subscriber impact and close]
    I -->|No| K[Collect evidence bundle and escalate]

Reference: Troubleshooting, Integrations

9) Change Management Best Practices

Use pilot-first progression for strategy and topology changes.
Apply one change set per window and validate it before starting the next set.
Always preserve rollback artifacts before major changes.
Log change intent, execution, validation evidence, and closure.
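
Preserving rollback artifacts can be as simple as copying the shaping inputs to a timestamped directory before the window opens. The sketch below assumes the two input files under /opt/libreqos/src are the artifacts that matter. The function name and directory prefix are hypothetical, and the demo runs against throwaway files.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Shaping input files; extend the list to suit your deployment.
DEFAULT_ARTIFACTS = (
    "/opt/libreqos/src/network.json",
    "/opt/libreqos/src/ShapedDevices.csv",
)


def snapshot(paths, backup_root):
    """Copy each existing file into a new timestamped directory and return it."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(backup_root) / f"lqos-rollback-{stamp}"
    dest.mkdir(parents=True)
    for path in map(Path, paths):
        if path.is_file():
            shutil.copy2(path, dest / path.name)  # copy2 also preserves timestamps
        else:
            print(f"skipped (not found): {path}")
    return dest


# Demo against throwaway files so the sketch can be tried anywhere.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "network.json"
    src.write_text("{}")
    dest = snapshot([src, Path(tmp) / "missing.csv"], Path(tmp) / "backups")
    copied = sorted(p.name for p in dest.iterdir())
print(copied)
```

For a real change window, call snapshot(DEFAULT_ARTIFACTS, backup_root) with a persistent backup location and record the returned path in the change log.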

Checklist: Change Window Execution

[ ] Scope and success criteria are defined
[ ] Rollback plan and artifacts are ready
[ ] One change set only (no mixed experiments)
[ ] Post-change validation completed and recorded
[ ] Escalation path and owner are defined before closure

Checklist: Day-2 Operations Cadence

[ ] Daily: service health + urgent issues reviewed
[ ] Daily: shaped/unshaped trend reviewed for regressions
[ ] Weekly: topology and parent distribution sanity review
[ ] Weekly: scheduler/log anomaly review
[ ] Monthly: capacity/headroom and hardware-fit review

10) Common Anti-Patterns to Avoid

Competing sources of truth for shaping inputs.
Treating unsupported NICs as production-safe.
Defaulting to deep hierarchy without operational need.
Making failover assumptions without explicit validation.
Mixing multiple major changes in one maintenance window.
Ignoring data hygiene errors (duplicates, parent mismatch, subnet mis-scope).

11) MikroTik RouterOS v7 Practical Notes

These are operational notes, not full router design guidance.

Keep routing policy deterministic between shaped and bypass paths.
Keep interface naming and policy mapping consistent with your LibreQoS runbooks.
Validate failover/failback behavior under maintenance conditions, not only in theory.
For multi-WAN/PCC environments, avoid ambiguous path ownership; ensure each subscriber flow has a predictable shaping path.

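Conceptual policy example (cost preference pattern: OSPF prefers the lower-cost path, so the primary path carries traffic while it is up; interface and area names are placeholders):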

/routing ospf interface-template
add interfaces=vlan-primary-path area=backbone-v2 cost=10
add interfaces=vlan-bypass-path area=backbone-v2 cost=200

Reference: Recipe: Switch-Centric Fabric

12) NLnet Verification Mapping (Milestone 10a)

This guide satisfies milestone 10a (Best practices guide) by providing:

End-to-end operational decision framework (source of truth, topology strategy, substrate choice).
Actionable checklists for pre-deploy, change windows, and day-2 operations.
Incident response runbooks for common operational failure classes.
Field-aligned anti-pattern guidance derived from real deployment behavior.
Cross-linked references to detailed implementation docs and recipes.