[NX-102] Improve collector resilience with retry/backoff and log throttling #8

Closed
opened 2026-02-13 13:12:29 +00:00 by nessi · 1 comment

Goal

Prevent noisy logs and unstable behavior when targets are unreachable.

Scope

  • Add retry logic with exponential backoff.
  • Add per-target error log throttling.
  • Keep collector loop stable under repeated failures.

Acceptance Criteria

  • No log flood when a target is down.
  • Collector continues polling healthy targets.
  • Recovery works automatically after a target comes back.
nessi added this to the v1.0 - Stability, Reliability & Security (P0) milestone 2026-02-13 13:12:29 +00:00
nessi added the P0 label 2026-02-13 13:12:29 +00:00
nessi added reference development 2026-02-14 14:51:08 +00:00

Implemented in commit(s) for NX-102.

Goal Achieved

Improved collector resilience to prevent noisy logs and unstable behavior when targets are unreachable.

What Was Done

  • Added per-target exponential backoff for failed targets.
    • Retry delay now increases with consecutive failures (with jitter) up to a max cap.
    • Unreachable targets are skipped until next_attempt_at, instead of being retried every poll cycle.
  • Added/kept per-target log throttling for repeated failures.
    • No traceback flood for expected connection failures.
    • Warning logs now include retry timing metadata (retry_in_seconds).
  • Ensured collector loop remains stable and cadence-aware.
    • Poll loop now compensates for collection runtime (sleep = poll_interval - elapsed) to reduce interval drift.
  • Added recovery and cleanup behavior.
    • On target recovery, collector logs recovery context (after_failures, downtime_seconds).
    • Stale failure state is cleaned when targets are removed.
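
The scheduling described above can be sketched roughly as follows. This is a minimal illustration, not the code from the NX-102 commits: the class and function names, the constants `BASE_DELAY`, `MAX_DELAY`, and `JITTER`, and the choice of symmetric jitter are all assumptions. Only `next_attempt_at`, `retry_in_seconds`, `after_failures`, `downtime_seconds`, and the `sleep = poll_interval - elapsed` rule come from the notes above.

```python
import random
from dataclasses import dataclass

BASE_DELAY = 5.0    # hypothetical: delay after the first failure, in seconds
MAX_DELAY = 300.0   # hypothetical: upper cap on the retry delay
JITTER = 0.2        # hypothetical: randomize each delay by +/-20%


@dataclass
class FailureState:
    failures: int = 0
    first_failure_at: float = 0.0
    next_attempt_at: float = 0.0


def backoff_delay(failures, rng=random.random):
    """Delay after the n-th consecutive failure: exponential, jittered, capped."""
    exponent = min(failures - 1, 30)  # bound the exponent so 2**n stays small
    raw = min(MAX_DELAY, BASE_DELAY * 2 ** exponent)
    jittered = raw * (1 + JITTER * (2 * rng() - 1))
    return min(MAX_DELAY, jittered)


class BackoffTracker:
    """Per-target failure state (hypothetical helper, not the project's class)."""

    def __init__(self):
        self._states = {}  # target name -> FailureState

    def should_attempt(self, target, now):
        # Healthy targets have no state and are always polled.
        state = self._states.get(target)
        return state is None or now >= state.next_attempt_at

    def record_failure(self, target, now, rng=random.random):
        state = self._states.setdefault(target, FailureState(first_failure_at=now))
        state.failures += 1
        retry_in_seconds = backoff_delay(state.failures, rng)
        state.next_attempt_at = now + retry_in_seconds
        return retry_in_seconds

    def record_success(self, target, now):
        # Returns recovery context if the target had been failing, else None.
        state = self._states.pop(target, None)
        if state is None:
            return None
        return {
            "after_failures": state.failures,
            "downtime_seconds": now - state.first_failure_at,
        }

    def prune(self, active_targets):
        # Drop stale state for targets that no longer exist.
        for target in set(self._states) - set(active_targets):
            del self._states[target]


def sleep_seconds(poll_interval, elapsed):
    """Compensate for collection runtime; never sleep a negative amount."""
    return max(0.0, poll_interval - elapsed)
```

In a poll cycle, a loop built on this sketch would call `should_attempt` before each target, `record_failure` or `record_success` after it, `prune` with the current target list, and then sleep for `sleep_seconds(poll_interval, elapsed)`. Passing `now` and `rng` as arguments is a design choice made here so the behavior can be checked deterministically.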

Acceptance Criteria

  • ✅ No log flood when a target is down.
  • ✅ Collector continues polling healthy targets while failed targets are backoff-scheduled.
  • ✅ Recovery works automatically when target comes back online.
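
As an illustration of the first criterion, per-target log throttling could look something like the sketch below. It is an assumption-laden example rather than the collector's actual code: `LogThrottle`, `log_failure`, the `"collector"` logger name, the 60-second window, and the `suppressed` counter are hypothetical. Only the `retry_in_seconds` field and the absence of tracebacks for expected connection failures come from the notes above.

```python
import logging

logger = logging.getLogger("collector")  # hypothetical logger name


class LogThrottle:
    """Allow at most one log line per target within `min_interval` seconds."""

    def __init__(self, min_interval=60.0):
        self.min_interval = min_interval
        self._last_logged = {}  # target name -> time of the last emitted line
        self._suppressed = {}   # target name -> lines skipped since then

    def allow(self, target, now):
        last = self._last_logged.get(target)
        if last is not None and now - last < self.min_interval:
            self._suppressed[target] = self._suppressed.get(target, 0) + 1
            return False
        self._last_logged[target] = now
        return True

    def take_suppressed(self, target):
        # Return and reset the number of lines skipped for this target.
        return self._suppressed.pop(target, 0)


def log_failure(throttle, target, error, retry_in_seconds, now):
    """Emit one warning without a traceback, or stay silent if throttled."""
    if not throttle.allow(target, now):
        return False
    suppressed = throttle.take_suppressed(target)
    # logger.warning without exc_info keeps expected connection errors
    # from producing a traceback on every failed attempt.
    logger.warning(
        "target %s unreachable: %s (retry_in_seconds=%.1f, suppressed=%d)",
        target, error, retry_in_seconds, suppressed,
    )
    return True
```

Because the throttle is keyed by target name, a target that is down cannot silence warnings about a different target that starts failing later.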

Notes

  • This change is internal to collector behavior; no API schema changes required.
  • No DB migration required.
nessi closed this issue 2026-02-14 14:52:01 +00:00
Reference: nessi/NexaPG#8