[NX-102] Improve collector resilience with retry/backoff and log throttling #8

nessi · 2026-02-13T13:12:29Z

nessi commented

2026-02-13 13:12:29 +00:00

Goal

Prevent noisy logs and unstable behavior when targets are unreachable.

Scope

Add exponential backoff retry logic.
Add per-target error log throttling.
Keep collector loop stable under repeated failures.

Acceptance Criteria

No log flood when a target is down.
Collector continues polling healthy targets.
Recovery works automatically after target comes back.

nessi added this to the v1.0 - Stability, Reliability & Security (P0) milestone 2026-02-13 13:12:29 +00:00

nessi added the P0 label 2026-02-13 13:12:29 +00:00

nessi added reference development

2026-02-14 14:51:08 +00:00

nessi commented

2026-02-14 14:52:01 +00:00

Implemented in commit(s) for NX-102.

Goal Achieved

Improved collector resilience to prevent noisy logs and unstable behavior when targets are unreachable.

What Was Done

Added per-target exponential backoff for failed targets.
- Retry delay now increases with consecutive failures (with jitter) up to a max cap.
- Unreachable targets are skipped until next_attempt_at, instead of being retried every poll cycle.
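
For reference, the backoff schedule described above could be sketched roughly as follows. This is a minimal illustration, not the committed code: the function names, the doubling factor, and the defaults (`base`, `cap`, `jitter`) are assumptions; only `next_attempt_at` comes from this issue.

```python
import random


def backoff_delay(failures, base=5.0, cap=300.0, jitter=0.2, rng=random):
    """Seconds to wait after `failures` consecutive failures (hypothetical helper).

    The delay doubles per failure, gets +/- `jitter` (as a fraction) of random
    spread, and never exceeds `cap`. All defaults are illustrative assumptions.
    """
    if failures < 1:
        return 0.0
    # Bound the exponent so very long outages cannot overflow float conversion.
    raw = min(cap, base * 2 ** min(failures - 1, 32))
    spread = raw * jitter
    return min(cap, raw + rng.uniform(-spread, spread))


def should_skip(state, now):
    """True while a failed target is still waiting for its next attempt."""
    return now < state.get("next_attempt_at", 0.0)
```

A collector would set `state["next_attempt_at"] = now + backoff_delay(failures)` after each failure and call `should_skip` at the start of every poll cycle.
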
Added/kept per-target log throttling for repeated failures.
- No traceback flood for expected connection failures.
- Warning logs now include retry timing metadata (retry_in_seconds).
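
A rough sketch of how per-target throttling like this might look. The class name, the `min_interval` default, and the logger name are assumptions for illustration; `retry_in_seconds` is the only field name taken from this issue.

```python
import logging
import time

logger = logging.getLogger("collector")  # logger name is an assumption


class FailureLogThrottle:
    """Allows at most one failure log per target per `min_interval` seconds."""

    def __init__(self, min_interval=60.0):
        self.min_interval = min_interval
        self._last_logged = {}  # target name -> time of last emitted warning

    def should_log(self, target, now):
        last = self._last_logged.get(target)
        if last is not None and now - last < self.min_interval:
            return False
        self._last_logged[target] = now
        return True


def log_failure(throttle, target, error, retry_in_seconds, now=None):
    """Emit a single-line warning (no traceback) unless throttled."""
    now = time.monotonic() if now is None else now
    if not throttle.should_log(target, now):
        return False
    logger.warning(
        "target %s unreachable: %s (retry in %.0fs)",
        target, error, retry_in_seconds,
        extra={"retry_in_seconds": retry_in_seconds},
    )
    return True
```

Because the throttle state is keyed by target, one noisy target does not suppress warnings for another.
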
Ensured collector loop remains stable and cadence-aware.
- Poll loop now compensates for collection runtime (sleep = poll_interval - elapsed) to reduce interval drift.
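
The cadence compensation can be illustrated with a small sketch. The function names and the injectable `sleep`/`clock` parameters are hypothetical, and clamping the sleep at zero when a cycle overruns the interval is an assumption not stated in this issue.

```python
import time


def compute_sleep(poll_interval, elapsed):
    """sleep = poll_interval - elapsed, clamped at zero (clamp is an assumption)."""
    return max(0.0, poll_interval - elapsed)


def run_loop(collect_once, poll_interval, cycles,
             sleep=time.sleep, clock=time.monotonic):
    """Run `cycles` poll cycles, subtracting collection time from each sleep."""
    for _ in range(cycles):
        started = clock()
        collect_once()
        sleep(compute_sleep(poll_interval, clock() - started))
```

With a 10-second interval and a collection pass that takes 4 seconds, the loop sleeps 6 seconds, so cycles start roughly every 10 seconds instead of every 14.
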
Added recovery and cleanup behavior.
- On target recovery, collector logs recovery context (after_failures, downtime_seconds).
- Stale failure state is cleaned when targets are removed.
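
As a sketch of the recovery and cleanup steps, assuming failure state is kept in a plain dict keyed by target name. The dict layout and function names are hypothetical; `after_failures` and `downtime_seconds` are the field names from this issue.

```python
def record_success(failure_state, target, now):
    """Clear a target's failure state; return recovery context if it had failed.

    Returns None when the target was already healthy.
    """
    state = failure_state.pop(target, None)
    if state is None:
        return None
    return {
        "after_failures": state["failures"],
        "downtime_seconds": now - state["first_failure_at"],
    }


def prune_removed(failure_state, active_targets):
    """Drop failure state for targets that are no longer configured."""
    for target in list(failure_state):
        if target not in active_targets:
            del failure_state[target]
```

The returned dict can be attached to the recovery log line, and `prune_removed` can run once per poll cycle after the target list is refreshed.
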

Acceptance Criteria

✅ No log flood when a target is down.
✅ Collector continues polling healthy targets while failed targets are backoff-scheduled.
✅ Recovery works automatically when target comes back online.

Notes

This change is internal to collector behavior; no API schema changes required.
No DB migration required.

nessi closed this issue

2026-02-14 14:52:01 +00:00

Reference: nessi/NexaPG#8