Best Practices Guide for ISP Operations
This guide consolidates operational best practices for deploying and running LibreQoS in ISP environments.
Use this page as an operations playbook. Use feature-specific pages for implementation details.
How to Use This Guide
Pick your operating model with the decision tables.
Complete the pre-deployment checklist before first production cutover.
Use the runbooks during maintenance windows and incidents.
Use the day-2 checklist as your recurring operational cadence.
1) Purpose and Audience
This guide is for:
ISP owners and technical managers
Network engineers and architects
NOC/support operators
This guide is not a replacement for installation and component documentation. Start with Quickstart, then use this page to standardize real-world operations.
2) Quick Decision Matrix
Decision Table A: Source of Truth
| If this describes your operation | Recommended model | Why |
|---|---|---|
| You use a supported CRM/NMS integration end-to-end | Built-in integration mode | Lowest manual drift and simpler recurring updates |
| You have an internal orchestration pipeline | Custom source of truth mode | Preserves your automation while keeping ownership explicit |
| You are small and intentionally manage files directly | Manual files mode | Acceptable only when change volume is low and discipline is high |
Reference: Operating Modes and Source of Truth, CRM/NMS Integrations
Decision Table B: Topology Strategy
| Operational requirement | Recommended strategy |
|---|---|
| Maximum performance, minimal hierarchy | |
| AP-level aggregation with better performance headroom | |
| Site/AP visibility with moderate overhead | |
| Full backhaul/path hierarchy is required | |
Reference: Scale Planning and Topology Design
Decision Table C: Deployment Substrate
| Deployment choice | Choose when | Main caution |
|---|---|---|
| Bare metal | Production-critical throughput and lowest latency overhead are required | Validate NIC support and single-thread CPU performance |
| VM (for example Proxmox) | You already operate mature virtualization and throughput targets fit the VM envelope | Account for virtualization overhead and align virtio multiqueue to vCPU count |
Reference: System Requirements, Recipe: Proxmox VM Deployment
Feature Coverage Matrix
| Feature Area | Operational Best-Practice Focus | Primary Reference |
|---|---|---|
| Integrations | Single source-of-truth ownership, overwrite discipline, controlled parameter changes | |
| Topology Strategies | Right-size hierarchy depth, balance parent distribution, maintain stable naming | |
| High Availability and Bypass | Deterministic active/backup routing policy, recurring failover/failback drills | |
| StormGuard | Use as adaptive protection where WAN conditions vary, validate behavior during events | |
| Node Manager UI | Verify operational state after changes; distinguish path failures from UI-only symptoms | |
| Deployment Recipes | Apply proven implementation patterns for real-world topology scenarios | |
3) Pre-Deployment Best Practices
Validate platform fit before cabling.
Supported NIC family
Sufficient single-thread CPU performance
RAM sized for expected subscriber/circuit scale
Validate baseline service health before pilot traffic.
sudo systemctl status lqosd lqos_scheduler
journalctl -u lqosd -u lqos_scheduler --since "10 minutes ago"
Validate source-of-truth ownership before first sync.
Decide whether integrations, external scripts, or manual files own persistence
Avoid mixed ownership from day one (the most common cause of configuration drift)
Validate integration data hygiene before production shaping.
Duplicate IP checks
ParentNode consistency checks
allow_subnets and ignore_subnets scope checks
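The three hygiene checks above can be sketched in a few lines of Python. This is a minimal illustration, not the validation LibreQoS itself performs: the column names (IPv4, Parent Node), the nested "children" layout of the hierarchy, and every function name are assumptions made for the example, so adjust them to match your actual files.

```python
import ipaddress
from collections import defaultdict

# Assumed column names for illustration; match them to your real CSV header.
IP_COLUMN = "IPv4"
PARENT_COLUMN = "Parent Node"


def row_ips(row):
    """Return the non-empty, comma-separated IP entries of one row."""
    return [ip.strip() for ip in row.get(IP_COLUMN, "").split(",") if ip.strip()]


def collect_node_names(tree):
    """Return every node name in an assumed {name: {"children": {...}}} tree."""
    names = set()
    for name, node in tree.items():
        names.add(name)
        names |= collect_node_names(node.get("children", {}))
    return names


def find_duplicate_ips(rows):
    """Map each IP entry used by more than one row to the row indexes using it."""
    seen = defaultdict(list)
    for index, row in enumerate(rows):
        for ip in row_ips(row):
            seen[ip].append(index)
    return {ip: indexes for ip, indexes in seen.items() if len(indexes) > 1}


def find_unknown_parents(rows, tree):
    """Return parent names that rows reference but the hierarchy does not define."""
    known = collect_node_names(tree)
    parents = {row.get(PARENT_COLUMN, "") for row in rows}
    return sorted(p for p in parents if p and p not in known)


def find_out_of_scope_ips(rows, allow_subnets, ignore_subnets):
    """Return IPs that fall outside allow_subnets or inside ignore_subnets."""
    allowed = [ipaddress.ip_network(s) for s in allow_subnets]
    ignored = [ipaddress.ip_network(s) for s in ignore_subnets]
    flagged = set()
    for row in rows:
        for entry in row_ips(row):
            # ip_interface accepts both "a.b.c.d" and "a.b.c.d/nn" entries.
            addr = ipaddress.ip_interface(entry).ip
            in_allowed = any(addr in net for net in allowed)
            in_ignored = any(addr in net for net in ignored)
            if not in_allowed or in_ignored:
                flagged.add(entry)
    return sorted(flagged)
```

Rows can be loaded with csv.DictReader and the hierarchy with json.load; an empty result from all three functions corresponds to the "data hygiene checks are clean" state in the readiness checklist.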
Checklist: Pre-Deploy Readiness
[ ] Platform meets supported NIC/CPU/RAM requirements
[ ] lqosd and lqos_scheduler are healthy
[ ] Source-of-truth owner is explicitly documented
[ ] Integration credentials and sync settings are validated
[ ] Data hygiene checks are clean (duplicates/parents/subnets)
[ ] Rollback path is documented before first cutover
4) Source-of-Truth and Integration Best Practices
Enforce one durable owner for shaping data.
Built-in integration mode: integration jobs own regenerated files
Custom mode: your external system owns generated files
Manual mode: direct edits are the durable path
Treat overwrite behavior as an explicit design choice.
If integrations own topology, use overwrite behavior intentionally
Do not rely on manual edits to generated files unless ownership policy explicitly allows it
Change integration parameters one set at a time.
Save changes
Restart scheduler if required by your workflow
Validate logs and WebUI state before applying additional changes
Use recurring scheduler behavior intentionally.
Faster refresh intervals increase control-plane churn
Slower intervals reduce churn but delay corrections
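The interval trade-off can be made concrete with simple arithmetic. The sketch below assumes that a correction made in the source of truth takes effect at the next scheduled refresh, so the worst-case delay is roughly one interval (run time not included); the function name is hypothetical.

```python
def refresh_tradeoff(interval_minutes):
    """Return (refresh runs per day, approximate worst-case correction delay in minutes)."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    runs_per_day = (24 * 60) / interval_minutes
    # Assumption: a fix lands at the next refresh, so it waits at most one interval.
    worst_case_delay_minutes = interval_minutes
    return runs_per_day, worst_case_delay_minutes
```

Comparing two or three candidate intervals this way makes the churn-versus-delay choice an explicit decision rather than a default.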
Reference: Integrations, Configuration, Troubleshooting
5) Topology and Scale Best Practices
Keep hierarchy only as deep as operationally necessary.
Balance parent distribution to avoid single-core concentration.
Favor stable naming and parent relationships to reduce queue churn.
For multi-edge environments, prefer explicit, operator-controlled path policy over inferred assumptions.
Validate topology changes in a maintenance window, not ad hoc during peak load.
Field pattern:
Operators commonly recover performance and stability by moving from unnecessary full hierarchy depth to ap_site or flat when full path control is not required.
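Parent balance can be spot-checked by counting circuits under each top-level node of the hierarchy. The sketch below assumes a nested {name: {"children": {...}}} layout and treats each top-level node as the unit of distribution; both are assumptions for illustration, and the function names are hypothetical.

```python
from collections import Counter


def top_level_of(tree):
    """Map every node name to the name of its top-level ancestor."""
    mapping = {}

    def walk(subtree, root):
        for name, node in subtree.items():
            mapping[name] = name if root is None else root
            walk(node.get("children", {}), mapping[name])

    walk(tree, None)
    return mapping


def circuits_per_top_level(parents, tree):
    """Count circuits per top-level node, given each circuit's parent node name."""
    mapping = top_level_of(tree)
    return Counter(mapping.get(parent, "<unknown>") for parent in parents)


def largest_share(counts):
    """Return (node name, fraction of all circuits) for the busiest top-level node."""
    total = sum(counts.values())
    name, count = counts.most_common(1)[0]
    return name, count / total
```

A single top-level node holding most circuits is a hint that load may concentrate on one core; what fraction counts as too much depends on your hardware and traffic profile.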
Reference: Scale Planning and Topology Design, Recipes
6) Performance and Capacity Best Practices
Design around single-thread performance, not only total core count.
Verify queue/CPU distribution after topology or strategy changes.
In VM deployments, align virtio multiqueue with vCPU and verify under realistic peak load.
Treat MTU/encapsulation mismatches as first-class suspects in throughput anomalies.
Plan capacity ahead of demand so that symptoms do not force emergency hardware changes.
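For the VM guidance above, a quick way to compare queue counts with vCPU count is to read the interface's queues directory under /sys/class/net. This is a rough sketch: it counts the rx-N and tx-N entries the kernel currently exposes for an interface, the interface name is whatever you pass in, and the function names are hypothetical. It does not replace testing under realistic peak load.

```python
import os
from pathlib import Path


def count_queues(iface, sysfs_root="/sys/class/net"):
    """Return (rx, tx) queue counts exposed under <sysfs_root>/<iface>/queues."""
    queues = Path(sysfs_root) / iface / "queues"
    rx = len(list(queues.glob("rx-*")))
    tx = len(list(queues.glob("tx-*")))
    return rx, tx


def queue_alignment(iface, vcpus=None, sysfs_root="/sys/class/net"):
    """Compare queue counts with the vCPU count (defaults to os.cpu_count())."""
    vcpus = vcpus or os.cpu_count()
    rx, tx = count_queues(iface, sysfs_root)
    return {"vcpus": vcpus, "rx": rx, "tx": tx, "aligned": rx == tx == vcpus}
```

Run it inside the guest for each shaping interface; a mismatch is a prompt to review the hypervisor's multiqueue setting, not proof of a performance problem.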
Reference: System Requirements, Performance Tuning
7) High Availability, Bypass, and Maintenance Best Practices
Use a deterministic active/backup routing policy (OSPF/BGP) for failover and failback.
In switch-centric designs, validate shaped path and bypass path behavior independently.
Keep failover drills routine; do not wait for incidents to test convergence behavior.
Ensure backup paths are sized for realistic degraded-state demand.
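Backup-path sizing reduces to a simple comparison of degraded-state demand against backup capacity. The sketch below is illustrative only: the 20% headroom default and the function name are assumptions, and the peak figure should come from your own measurements.

```python
def backup_path_check(peak_demand_mbps, backup_capacity_mbps, headroom=0.2):
    """Check whether the backup path covers peak demand plus a headroom fraction."""
    # Assumption: headroom is a fraction of peak demand (0.2 means 20% extra).
    required_mbps = peak_demand_mbps * (1 + headroom)
    return {
        "required_mbps": required_mbps,
        "sufficient": backup_capacity_mbps >= required_mbps,
    }
```

For example, a backup path of 10000 Mbps covers an 8000 Mbps peak with 20% headroom, but not a 9000 Mbps peak.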
Runbook: Maintenance Cutover Validation
Confirm backup path health and expected capacity.
Record current service and path state.
Shift preference to backup policy.
Validate subscriber traffic continuity plus key latency/throughput indicators.
Execute maintenance on primary path.
Restore primary preference.
Validate failback behavior and post-change stability.
Document outcome and any required corrections.
Maintenance Cutover Sequence (Visual)
flowchart TD
A[Confirm backup path health and capacity] --> B[Record current service and path state]
B --> C[Shift routing preference to backup path]
C --> D{Traffic continuity and KPIs healthy?}
D -->|No| E[Stop and remediate before maintenance]
D -->|Yes| F[Perform maintenance on primary path]
F --> G[Restore primary path preference]
G --> H{Failback stable?}
H -->|No| I[Investigate and hold on backup/controlled state]
H -->|Yes| J[Document closure and outcomes]
Reference: High Availability and Failure Domains, Recipe: Switch-Centric Fabric
8) Monitoring and Incident Response Best Practices
Start triage with service health and logs, then move to topology and integration.
Distinguish shaping-path failures from telemetry/UI presentation issues.
Capture reproducible evidence before making broad corrective changes.
Keep a standard incident evidence bundle for escalation.
Standard evidence bundle:
sudo systemctl status lqosd lqos_scheduler
journalctl -u lqosd --since "30 minutes ago"
journalctl -u lqos_scheduler --since "30 minutes ago"
ls -lh /opt/libreqos/src/network.json /opt/libreqos/src/ShapedDevices.csv
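To keep the bundle consistent between operators, the same commands can be wrapped in a small collector that writes one timestamped file. This is a sketch under assumptions: the function names and output format are hypothetical, --no-pager is added so output is not paged, and the script is assumed to run with sufficient privileges to read the journals.

```python
import subprocess
from datetime import datetime, timezone

# File paths taken from the evidence bundle above.
FILES = ["/opt/libreqos/src/network.json", "/opt/libreqos/src/ShapedDevices.csv"]


def evidence_commands(since="30 minutes ago"):
    """Return the evidence bundle commands as argument lists."""
    return [
        ["systemctl", "status", "lqosd", "lqos_scheduler", "--no-pager"],
        ["journalctl", "-u", "lqosd", "--since", since, "--no-pager"],
        ["journalctl", "-u", "lqos_scheduler", "--since", since, "--no-pager"],
        ["ls", "-lh", *FILES],
    ]


def collect_bundle(out_path, since="30 minutes ago", run=subprocess.run):
    """Run each command and write its output to out_path, then return out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"# collected {datetime.now(timezone.utc).isoformat()}\n")
        for cmd in evidence_commands(since):
            # check=False: a failing command is itself evidence, so keep going.
            result = run(cmd, capture_output=True, text=True, check=False)
            out.write(f"\n$ {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return out_path
```

Calling collect_bundle("/tmp/lqos-evidence.txt") (path chosen for illustration) produces a single file that can be attached to an escalation together with timestamps and recent config changes.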
Runbook: Shaping Not Applied or Coverage Drops
Confirm services are healthy.
Check scheduler logs for validation failures.
Check for duplicate IP, parent mismatch, or subnet-scope misalignment.
Confirm source-of-truth ownership (manual edits vs integration regeneration).
Re-run scheduler refresh after corrections.
Validate shaped/unshaped trend and any high-priority impacted subscribers.
If unresolved, collect the evidence bundle and escalate with timestamps and recent config changes.
Incident Triage Flow (Visual)
flowchart TD
A[Check lqosd and lqos_scheduler service health] --> B{Services healthy?}
B -->|No| C[Restore service health and re-check]
B -->|Yes| D[Review scheduler logs for validation failures]
D --> E{Data hygiene issues present?}
E -->|Yes| F[Correct duplicate IP, parent mismatch, subnet scope]
E -->|No| G[Confirm source-of-truth ownership]
F --> H[Re-run scheduler refresh]
G --> H
H --> I{Shaping coverage restored?}
I -->|Yes| J[Validate subscriber impact and close]
I -->|No| K[Collect evidence bundle and escalate]
Reference: Troubleshooting, Integrations
9) Change Management Best Practices
Use pilot-first progression for strategy and topology changes.
Apply one change set per window and validate before next set.
Always preserve rollback artifacts before major changes.
Log change intent, execution, validation evidence, and closure.
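Preserving rollback artifacts can be as simple as copying the current shaping inputs into a labeled directory before the window opens. The sketch below is one possible approach, not a LibreQoS feature: the function name, the backup location, and the timestamp label format are assumptions.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot_files(paths, backup_root, label=None):
    """Copy existing files into <backup_root>/<label>/ and return (dir, copied paths)."""
    label = label or datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / label
    # exist_ok=False: refuse to overwrite an earlier snapshot with the same label.
    dest.mkdir(parents=True, exist_ok=False)
    copied = []
    for src in map(Path, paths):
        if src.is_file():
            target = dest / src.name
            shutil.copy2(src, target)  # copy2 also preserves timestamps
            copied.append(target)
    return dest, copied
```

Typical inputs would be the network.json and ShapedDevices.csv paths used in the evidence bundle; record the returned directory in the change log so the rollback step can reference it.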
Checklist: Change Window Execution
[ ] Scope and success criteria are defined
[ ] Rollback plan and artifacts are ready
[ ] One change set only (no mixed experiments)
[ ] Post-change validation completed and recorded
[ ] Escalation path and owner are defined before closure
Checklist: Day-2 Operations Cadence
[ ] Daily: service health + urgent issues reviewed
[ ] Daily: shaped/unshaped trend reviewed for regressions
[ ] Weekly: topology and parent distribution sanity review
[ ] Weekly: scheduler/log anomaly review
[ ] Monthly: capacity/headroom and hardware-fit review
10) Common Anti-Patterns to Avoid
Competing sources of truth for shaping inputs.
Treating unsupported NICs as production-safe.
Defaulting to deep hierarchy without operational need.
Making failover assumptions without explicit validation.
Mixing multiple major changes in one maintenance window.
Ignoring data hygiene errors (duplicates, parent mismatch, subnet mis-scope).
11) MikroTik RouterOS v7 Practical Notes
These are operational notes, not full router design guidance.
Keep routing policy deterministic between shaped and bypass paths.
Keep interface naming and policy mapping consistent with your LibreQoS runbooks.
Validate failover/failback behavior under maintenance conditions, not only in theory.
For multi-WAN/PCC environments, avoid ambiguous path ownership; ensure each subscriber flow has a predictable shaping path.
Conceptual policy example (cost preference pattern):
/routing ospf interface-template
add interfaces=vlan-primary-path area=backbone-v2 cost=10
add interfaces=vlan-bypass-path area=backbone-v2 cost=200
Reference: Recipe: Switch-Centric Fabric
12) NLNet Verification Mapping (Milestone 10a)
This guide satisfies milestone 10a (Best practices guide) by providing:
End-to-end operational decision framework (source of truth, topology strategy, substrate choice).
Actionable checklists for pre-deploy, change windows, and day-2 operations.
Incident response runbooks for common operational failure classes.
Field-aligned anti-pattern guidance derived from real deployment behavior.
Cross-linked references to detailed implementation docs and recipes.