NX-10x: Reliability, error handling, runtime UX hardening, and migration safety gate (NX-101, NX-102, NX-103, NX-104) #32

Merged
nessi merged 7 commits from development into main 2026-02-14 15:28:44 +00:00
PR Description

Summary

This PR merges the full NX-10x package into main:

  • NX-101: Consistent API error format
  • NX-102: Collector resilience under repeated target failures
  • NX-103: Clean handling of expected runtime connectivity failures
  • NX-104: Alembic migration safety CI gate

NX-101 — Consistent API Error Format

Implemented

  • Introduced shared API error payload:
    • code
    • message
    • details
    • request_id
  • Added centralized exception handling and request-id middleware.
  • Replaced ad-hoc error details in affected routes/services with structured error payloads.
  • Updated frontend API error parsing to reliably consume code/details/request_id.
  • Documented common error behavior and structure.
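As a rough illustration of the shared payload, the sketch below builds an error body with the four fields listed above. Only the field names come from this PR; the `ApiError` class, the `build_error_payload` helper, and the UUID-based request ID are assumptions made for the example and may differ from the actual implementation.

```python
import uuid
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class ApiError:
    """Hypothetical container for the shared error payload fields."""
    code: str
    message: str
    details: Optional[dict[str, Any]] = None
    request_id: Optional[str] = None


def new_request_id() -> str:
    # Assumption: the PR does not say how request IDs are generated.
    return uuid.uuid4().hex


def build_error_payload(
    code: str,
    message: str,
    details: Optional[dict[str, Any]] = None,
    request_id: Optional[str] = None,
) -> dict[str, Any]:
    """Return a JSON-serializable error body with a stable set of keys."""
    error = ApiError(code, message, details, request_id or new_request_id())
    return asdict(error)
```

Because every error body has the same four keys, a client can branch on `code` and show `request_id` for support purposes without inspecting free-form text.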

Outcome

  • API errors are now predictable and machine-parseable across endpoints.
  • Frontend can consistently render user-facing errors.

NX-102 — Collector Stability and Log Flood Prevention

Implemented

  • Added per-target exponential backoff with jitter for repeated failures.
  • Added/retained per-target error log throttling for unreachable targets.
  • Collector now skips failed targets until next_attempt_at.
  • Healthy targets continue polling normally.
  • Added recovery logging context (after_failures, downtime_seconds).
  • Improved collector loop cadence (sleep = poll_interval - elapsed) so time spent collecting no longer adds to the cycle length, reducing drift and freshness flapping.
  • Cleaned stale internal failure state for removed targets.
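The backoff and cadence behavior described above could look roughly like the following sketch. The names `next_attempt_at`, `after_failures`, `downtime_seconds`, and `poll_interval` come from this PR; the `FailureState` class, the default base, cap, and jitter values, and the clamp to zero in `sleep_duration` are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FailureState:
    """Hypothetical per-target failure bookkeeping."""
    failures: int = 0
    first_failure_at: Optional[float] = None
    next_attempt_at: float = 0.0


def backoff_delay(
    failures: int,
    base: float = 5.0,        # assumed default, in seconds
    max_delay: float = 300.0,  # assumed cap, in seconds
    jitter: float = 0.2,       # assumed +/- 20 % spread
    rng: Callable[[], float] = random.random,
) -> float:
    """Delay before the next attempt after `failures` consecutive failures."""
    # Cap the exponent so very long outages cannot overflow a float.
    exponent = min(max(failures - 1, 0), 32)
    delay = min(max_delay, base * (2 ** exponent))
    # Spread retries by +/- jitter so targets do not retry in lockstep.
    return delay * (1 + jitter * (2 * rng() - 1))


def record_failure(state: FailureState, now: float) -> None:
    if state.first_failure_at is None:
        state.first_failure_at = now
    state.failures += 1
    state.next_attempt_at = now + backoff_delay(state.failures)


def should_poll(state: FailureState, now: float) -> bool:
    # Failed targets are skipped until next_attempt_at; healthy ones always pass.
    return now >= state.next_attempt_at


def record_success(state: FailureState, now: float) -> Optional[dict]:
    """Reset the state and return recovery logging context, if any."""
    if state.failures == 0 or state.first_failure_at is None:
        return None
    context = {
        "after_failures": state.failures,
        "downtime_seconds": now - state.first_failure_at,
    }
    state.failures = 0
    state.first_failure_at = None
    state.next_attempt_at = 0.0
    return context


def sleep_duration(poll_interval: float, elapsed: float) -> float:
    # Assumption: clamp at zero so a slow cycle starts the next one immediately.
    return max(0.0, poll_interval - elapsed)
```

Keeping the state per target means one unreachable host only delays its own retries, while the loop keeps its regular cadence for everything else.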

Outcome

  • No repetitive log flood when targets are down.
  • Collector remains stable and self-recovers automatically.

NX-103 — No Generic Runtime Failure UX for Expected Connectivity Issues

Implemented

  • Connectivity/runtime target failures are now returned as an explicit target_unreachable error (HTTP 503).
  • Target Detail UI now shows a dedicated Target Offline state instead of raw error text.
  • The Target Offline state includes actionable guidance and context (host, port, optional request_id).
  • UI continues to behave cleanly for expected downtime/network refusal scenarios.
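One way to picture the mapping from expected connectivity failures to the `target_unreachable` response is the sketch below. The error code, the 503 status, and the `host`, `port`, and `request_id` context come from this PR; the `to_target_unreachable` function, the chosen exception types, and the message text are assumptions for illustration.

```python
import socket
from typing import Any, Optional

# Assumption: these exception types stand in for "expected" connectivity failures.
CONNECTIVITY_ERRORS = (ConnectionError, TimeoutError, socket.timeout, socket.gaierror)


def to_target_unreachable(
    exc: BaseException,
    host: str,
    port: int,
    request_id: Optional[str] = None,
) -> Optional[tuple[int, dict[str, Any]]]:
    """Map an expected connectivity failure to (status, payload).

    Returns None for any other exception so it can take the generic error path.
    """
    if not isinstance(exc, CONNECTIVITY_ERRORS):
        return None
    payload = {
        "code": "target_unreachable",
        "message": f"Target {host}:{port} is not reachable.",
        "details": {"host": host, "port": port, "error_class": type(exc).__name__},
        "request_id": request_id,
    }
    return 503, payload
```

A UI can then switch to the Target Offline state whenever `code` equals `target_unreachable`, and fall back to generic error rendering otherwise.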

Outcome

  • No noisy generic user-facing failures for expected target-down conditions.
  • Clear and actionable runtime messaging.

NX-104 — Migration Safety CI Gate

Implemented

  • Added workflow: .github/workflows/migration-safety.yml
  • New CI job validates migration roundtrip:
    • alembic upgrade head
    • alembic downgrade -1
    • alembic upgrade head
  • Added schema consistency check using pg_dump --schema-only + diff.
  • Fixed a false-positive schema diff by filtering pg_dump lines whose tokens differ between dumps:
    • \restrict
    • \unrestrict
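The workflow itself uses `sed` on the `pg_dump` output; the Python sketch below only illustrates the same idea of dropping the `\restrict` and `\unrestrict` lines before comparing two schema dumps. The function names and the sample dump text in the usage note are hypothetical.

```python
import difflib

# Lines starting with these prefixes carry a token that changes per dump.
DYNAMIC_PREFIXES = ("\\restrict", "\\unrestrict")


def normalize_dump(text: str) -> list[str]:
    """Split a schema dump into lines and drop the dynamic token lines."""
    return [line for line in text.splitlines() if not line.startswith(DYNAMIC_PREFIXES)]


def schema_diff(before: str, after: str) -> list[str]:
    """Return unified-diff lines; an empty list means the schemas match."""
    return list(
        difflib.unified_diff(
            normalize_dump(before),
            normalize_dump(after),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )
```

With this filter, two dumps that differ only in their `\restrict` token compare as equal, while a real schema change after the upgrade/downgrade/upgrade roundtrip still produces a non-empty diff that a CI step can fail on.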

Outcome

  • Unsafe migration downgrade/re-upgrade behavior now fails CI.
  • Schema drift after roundtrip is detected before merge.

Acceptance Criteria Mapping

  • NX-101: Consistent JSON API error structure; frontend parsing reliability.
  • NX-102: No log flood, healthy target polling continues, automatic recovery.
  • NX-103: No generic 500-style runtime UX for expected target connectivity failures.
  • NX-104: Migration safety job exists, validates roundtrip + schema consistency, blocks on failure.

Deployment / Ops Notes

  • No special rollout sequence required beyond standard deploy.
  • No additional manual DB migration steps required for this PR itself.
  • Set the migration safety job as a required status check in branch protection to enforce merge blocking.
nessi added 7 commits 2026-02-14 15:26:14 +00:00
Introduced standardized error response formats for API errors, including middleware for consistent request IDs and exception handlers. Updated the frontend to parse and process these error responses, and documented the error format in the README for reference.
Replaced all inline error messages with the standardized `api_error` helper for consistent error response formatting. This improves clarity, maintainability, and ensures uniform error structures across the application. Updated logging for collector failures to include error class and switched to warning level for target unreachable scenarios.
Introduced an exponential backoff mechanism with a configurable base, max delay, and jitter factor to handle retries for target failures. This improves resilience by reducing the load during repeated failures and avoids synchronized retry storms. Additionally, stale target cleanup logic has been implemented to prevent unnecessary state retention.
Previously, the loop did not consider the time spent on `collect_once`, potentially causing delays. By adjusting the sleep duration dynamically, the poll interval remains consistent as intended.
Introduced a mechanism to detect and handle when a target is unreachable, including a detailed offline state message with host and port information. Updated the UI to display a card notifying users of the target's offline status and styled the card accordingly in CSS.
[NX-104 Issue] Add migration safety CI workflow
Some checks failed
Migration Safety / Alembic upgrade/downgrade safety (pull_request) Failing after 30s
PostgreSQL Compatibility Matrix / PG14 smoke (pull_request) Successful in 9s
PostgreSQL Compatibility Matrix / PG15 smoke (pull_request) Successful in 7s
PostgreSQL Compatibility Matrix / PG16 smoke (pull_request) Successful in 7s
PostgreSQL Compatibility Matrix / PG17 smoke (pull_request) Successful in 8s
PostgreSQL Compatibility Matrix / PG18 smoke (pull_request) Successful in 7s
cbe1cf26fa
Introduces a GitHub Actions workflow to ensure Alembic migrations are safe and reversible. The workflow validates schema consistency by testing upgrade and downgrade operations and comparing schemas before and after the roundtrip.
[NX-104 Issue] Filter out restrict/unrestrict lines in schema comparison.
All checks were successful
Migration Safety / Alembic upgrade/downgrade safety (pull_request) Successful in 22s
PostgreSQL Compatibility Matrix / PG14 smoke (pull_request) Successful in 7s
PostgreSQL Compatibility Matrix / PG15 smoke (pull_request) Successful in 7s
PostgreSQL Compatibility Matrix / PG16 smoke (pull_request) Successful in 8s
PostgreSQL Compatibility Matrix / PG17 smoke (pull_request) Successful in 7s
PostgreSQL Compatibility Matrix / PG18 smoke (pull_request) Successful in 7s
6de3100615
Updated the pg_dump commands in the migration-safety workflow to use `sed` for removing restrict/unrestrict lines. This ensures consistent schema comparison by ignoring irrelevant metadata.
nessi merged commit f614eb1cf8 into main 2026-02-14 15:28:44 +00:00
Reference: nessi/NexaPG#32