NX-10x: Reliability & API hardening (NX-101, NX-102, NX-103, NX-104) #31

Closed
nessi wants to merge 0 commits from development into main

Included Changes

NX-101: Consistent API error format

  • Introduced a shared error shape:
    • code
    • message
    • details
    • request_id
  • Added request-id middleware + centralized exception handling.
  • Replaced ad-hoc HTTPException(detail="...") usage in key routes with structured error payloads.
  • Updated frontend API client parsing to reliably consume structured backend errors.
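
The shared error shape can be illustrated with a small standard-library sketch. The commit log mentions an `api_error` helper, but its real signature is not shown in this PR, so the signature below, the `ApiError` dataclass, and `new_request_id` are assumptions made for illustration only.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ApiError:
    """Hypothetical container for the four fields of the shared error shape."""
    code: str                 # stable, machine-readable identifier
    message: str              # human-readable summary
    request_id: str           # correlates the response with server logs
    details: Dict[str, Any] = field(default_factory=dict)


def new_request_id() -> str:
    """Assumed request-id format: a random 32-character hex string."""
    return uuid.uuid4().hex


def api_error(code: str, message: str,
              details: Optional[Dict[str, Any]] = None,
              request_id: Optional[str] = None) -> Dict[str, Any]:
    """Build an error payload with exactly the fields code, message, details, request_id."""
    err = ApiError(
        code=code,
        message=message,
        request_id=request_id or new_request_id(),
        details=details or {},
    )
    return asdict(err)
```

In a real service the request id would presumably come from the request-id middleware rather than being generated inside the helper; the fallback here only keeps the sketch self-contained.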

NX-102: Collector stability under repeated target failures

  • Added per-target exponential backoff with jitter.
  • Added per-target error log throttling (retained where it already existed), which prevents log floods while a target is down.
  • Added target recovery logging context (after_failures, downtime_seconds).
  • Made the collector loop cadence-aware: it sleeps for poll_interval - elapsed rather than the full interval, which reduces drift and freshness flapping.
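
A minimal sketch of per-target backoff with jitter and log throttling, under stated assumptions: the function and class names, the default values, and the symmetric jitter formula are all hypothetical, since the PR only says the base, max delay, and jitter factor are configurable.

```python
import random
import time


def backoff_delay(failures, base=1.0, max_delay=60.0, jitter=0.2, rng=random.random):
    """Seconds to wait before retrying a target that failed `failures` times in a row."""
    if failures <= 0:
        return 0.0
    # Cap the exponent so very long outages cannot overflow the float math.
    exponent = min(failures - 1, 32)
    delay = min(max_delay, base * (2 ** exponent))
    # Scale by a factor in [1 - jitter, 1 + jitter] so targets do not retry in lockstep.
    jittered = delay * (1 + jitter * (2 * rng() - 1))
    return min(max_delay, max(0.0, jittered))


class LogThrottle:
    """Allow at most one error log line per target every `interval` seconds."""

    def __init__(self, interval=300.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self._last_logged = {}  # target id -> time of the last emitted log line

    def should_log(self, target_id):
        now = self.clock()
        last = self._last_logged.get(target_id)
        if last is not None and now - last < self.interval:
            return False
        self._last_logged[target_id] = now
        return True
```

Because both pieces keep state per target, a failing target backs off and goes quiet without affecting how healthy targets are polled or logged.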

NX-103: Runtime connectivity UX hardening

  • Connectivity failures now surface as an explicit target_unreachable error with HTTP status 503.
  • Target Detail UI now shows a clean Target Offline state with actionable guidance.
  • Raw runtime exception text is no longer exposed to users for expected network or target-down scenarios.
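
One way to picture the mapping from connectivity failures to `target_unreachable`/503 is the sketch below. The set of exception types treated as "expected", the function name, and the fallback `internal_error` code are assumptions; the PR does not list which exceptions the backend actually catches.

```python
import socket

# Assumed set of failures that count as "expected" connectivity outages.
EXPECTED_CONNECTIVITY_ERRORS = (ConnectionError, TimeoutError, socket.gaierror)


def to_error_response(exc, host, port, request_id):
    """Return (status_code, payload) without leaking the raw exception text."""
    if isinstance(exc, EXPECTED_CONNECTIVITY_ERRORS):
        return 503, {
            "code": "target_unreachable",
            "message": f"Target {host}:{port} is offline or unreachable.",
            "details": {"host": host, "port": port},
            "request_id": request_id,
        }
    # Anything else is unexpected; "internal_error" is a hypothetical code.
    return 500, {
        "code": "internal_error",
        "message": "Unexpected server error.",
        "details": {},
        "request_id": request_id,
    }
```

The raw exception would still be logged server-side alongside the request id, so operators can trace a failure that the user only sees as a Target Offline state.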

NX-104: Alembic migration safety in CI

  • Added workflow: .github/workflows/migration-safety.yml
  • CI job performs:
    • alembic upgrade head
    • alembic downgrade -1
    • alembic upgrade head
  • Validates schema consistency by running pg_dump --schema-only before and after the roundtrip and comparing the two dumps with diff.
  • Designed to be used as a required branch protection check.
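
The roundtrip performed by the CI job can be sketched in Python as follows. This is not the workflow itself (which is a GitHub Actions YAML file not shown here); the function names and the injectable `run` parameter are assumptions, and `difflib` stands in for the `diff` command.

```python
import difflib
import subprocess


def dump_schema(database_url, run=subprocess.run):
    """Capture the schema-only dump of the database as text."""
    result = run(
        ["pg_dump", "--schema-only", f"--dbname={database_url}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def schema_diff(before, after):
    """Return unified-diff lines; an empty list means the schemas match."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(), "before", "after", lineterm="",
    ))


def check_roundtrip(database_url, run=subprocess.run):
    """Upgrade, snapshot, downgrade one revision, re-upgrade, snapshot, compare."""
    run(["alembic", "upgrade", "head"], check=True)
    before = dump_schema(database_url, run)
    run(["alembic", "downgrade", "-1"], check=True)
    run(["alembic", "upgrade", "head"], check=True)
    after = dump_schema(database_url, run)
    return schema_diff(before, after)
```

With `check=True`, a failing downgrade or re-upgrade raises `subprocess.CalledProcessError`, which matches the intent that the check blocks on a failed roundtrip.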

Acceptance Criteria Mapping

  • NX-101: All API errors now follow a consistent JSON shape; the frontend can parse them reliably.
  • NX-102: No log flood on down targets; healthy targets continue polling; recovery is automatic.
  • NX-103: No generic user-facing runtime failure for expected target connectivity outages.
  • NX-104: Migration safety check exists and blocks on failed downgrade/re-upgrade.

Migration / Deployment Notes

  • No manual DB migration steps beyond normal startup migration flow.
  • For branch protection, set the migration-safety job as a required status check.
  • Safe to deploy as standard backend/frontend rollout.

Suggested Validation After Merge

  1. Stop one monitored target DB and confirm:
    • no collector log flood
    • target offline state is shown clearly in UI
  2. Restore target DB and confirm automatic recovery.
  3. Trigger CI and verify migration-safety job passes.
  4. Confirm structured API errors include request_id and stable code values.
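
Step 4 could be checked against a captured error response with a small helper like the one below. It is a hedged sketch: the helper names are hypothetical, and "stable code values" is interpreted here only as "a non-empty string", since the PR does not enumerate the allowed codes.

```python
REQUIRED_FIELDS = ("code", "message", "details", "request_id")


def missing_error_fields(payload):
    """List the required fields that are absent from an error payload."""
    return [name for name in REQUIRED_FIELDS if name not in payload]


def looks_like_structured_error(payload):
    """True if the payload has all four fields plus a non-empty code and request_id."""
    if not isinstance(payload, dict) or missing_error_fields(payload):
        return False
    code_ok = isinstance(payload["code"], str) and payload["code"] != ""
    request_id_ok = isinstance(payload["request_id"], str) and payload["request_id"] != ""
    return code_ok and request_id_ok
```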
nessi added this to the v1.0 - Stability, Reliability & Security (P0) milestone 2026-02-14 15:20:02 +00:00
nessi added the P0 label 2026-02-14 15:20:02 +00:00
nessi added 6 commits 2026-02-14 15:20:02 +00:00
Introduced standardized error response formats for API errors, including middleware for consistent request IDs and exception handlers. Updated the frontend to parse and process these error responses, and documented the error format in the README for reference.
Replaced all inline error messages with the standardized `api_error` helper for consistent error response formatting. This improves clarity and maintainability and ensures uniform error structures across the application. Updated collector failure logging to include the error class, and switched to warning level for target-unreachable scenarios.
Introduced an exponential backoff mechanism with a configurable base, max delay, and jitter factor to handle retries for target failures. This improves resilience by reducing the load during repeated failures and avoids synchronized retry storms. Additionally, stale target cleanup logic has been implemented to prevent unnecessary state retention.
Previously, the loop did not account for the time spent in `collect_once`, so each cycle could run longer than the configured poll interval. The sleep duration is now adjusted dynamically so that the poll interval stays consistent.
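
The adjustment described in this commit can be sketched as below. Only `collect_once` and the `poll_interval - elapsed` expression come from the PR; `remaining_sleep`, `run_loop`, the `iterations` bound, and the injectable `clock`/`sleep` parameters are assumptions added so the sketch can be exercised without real waiting.

```python
import time


def remaining_sleep(poll_interval, elapsed):
    """Sleep only for what is left of the interval; never a negative duration."""
    return max(0.0, poll_interval - elapsed)


def run_loop(collect_once, poll_interval, iterations,
             clock=time.monotonic, sleep=time.sleep):
    """Run `collect_once` on a fixed cadence for a bounded number of cycles."""
    for _ in range(iterations):
        started = clock()
        collect_once()
        # Subtract the time the collection took so the cadence does not drift.
        sleep(remaining_sleep(poll_interval, clock() - started))
```

If a collection takes longer than the interval, the loop sleeps for zero seconds and starts the next cycle immediately instead of falling further behind.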
Introduced a mechanism to detect and handle when a target is unreachable, including a detailed offline state message with host and port information. Updated the UI to display a card notifying users of the target's offline status and styled the card accordingly in CSS.
[NX-104 Issue] Add migration safety CI workflow
Some checks failed
Migration Safety / Alembic upgrade/downgrade safety (pull_request) Failing after 30s
PostgreSQL Compatibility Matrix / PG14 smoke (pull_request) Successful in 9s
PostgreSQL Compatibility Matrix / PG15 smoke (pull_request) Successful in 7s
PostgreSQL Compatibility Matrix / PG16 smoke (pull_request) Successful in 7s
PostgreSQL Compatibility Matrix / PG17 smoke (pull_request) Successful in 8s
PostgreSQL Compatibility Matrix / PG18 smoke (pull_request) Successful in 7s
cbe1cf26fa
Introduces a GitHub Actions workflow to ensure Alembic migrations are safe and reversible. The workflow validates schema consistency by testing upgrade and downgrade operations and comparing schemas before and after the roundtrip.
nessi added 1 commit 2026-02-14 15:23:09 +00:00
[NX-104 Issue] Filter out restrict/unrestrict lines in schema comparison.
All checks were successful
Migration Safety / Alembic upgrade/downgrade safety (pull_request) Successful in 22s
PostgreSQL Compatibility Matrix / PG14 smoke (pull_request) Successful in 7s
PostgreSQL Compatibility Matrix / PG15 smoke (pull_request) Successful in 7s
PostgreSQL Compatibility Matrix / PG16 smoke (pull_request) Successful in 8s
PostgreSQL Compatibility Matrix / PG17 smoke (pull_request) Successful in 7s
PostgreSQL Compatibility Matrix / PG18 smoke (pull_request) Successful in 7s
6de3100615
Updated the pg_dump commands in the migration-safety workflow to use `sed` for removing restrict/unrestrict lines. This ensures consistent schema comparison by ignoring irrelevant metadata.
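
The workflow does this filtering with `sed`; an equivalent filter in Python might look like the sketch below. It assumes the lines in question begin with `\restrict` or `\unrestrict` and differ between dumps, which is inferred from the commit message rather than shown in the PR, and the function name is hypothetical.

```python
def strip_restrict_lines(dump):
    """Drop \\restrict and \\unrestrict lines so two dumps can be compared fairly."""
    kept = [
        line for line in dump.splitlines()
        if not line.startswith(("\\restrict", "\\unrestrict"))
    ]
    return "\n".join(kept)
```

Applying the same filter to both the before and after dumps keeps the comparison focused on actual schema objects.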
nessi closed this pull request 2026-02-14 15:23:14 +00:00
Reference: nessi/NexaPG#31