Infra Maintenance & Support

Your Infrastructure.
Our Responsibility.
Always On.

We act as your dedicated infrastructure team — 24×7 monitoring, incident response, patching, capacity planning, and cost optimisation on an ongoing retainer. You build features. We keep the lights on.

What's Included

Everything Your Infra Needs to Stay Healthy

A retainer engagement covers every dimension of infrastructure operations — not just firefighting when things break.

📊

24×7 Monitoring

Prometheus + Grafana dashboards with AlertManager. Every metric, every service, every pod watched around the clock. Automated alerts before issues become outages.

Prometheus, Grafana, AlertManager, PagerDuty

🚨

Incident Response

On-call engineer available 24×7 for critical incidents. Defined SLA response times. Full incident postmortems with root cause analysis and prevention steps.

On-call SLA, RCA Reports, Runbooks, War Room

🔄

Patch Management

OS security patches, Kubernetes version upgrades, base image updates, and dependency patches — all tested on staging first, then applied to production with zero downtime.

OS Patching, K8s Upgrades, AMI Updates, CVE Fixes

📈

Capacity Planning

Monthly review of resource utilisation trends. Proactive scaling recommendations before you hit limits. Right-sizing over-provisioned resources to control AWS costs.

Resource Trends, Scale Planning, Cost Reports, Right-sizing
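One way to turn a utilisation trend into a scaling recommendation is to fit a straight line to recent usage and project when it crosses a limit. The sketch below is illustrative only: the function name `days_until_limit`, the daily sampling interval, and the sample figures are assumptions, and a real review would also account for seasonality and bursts.

```python
from typing import Optional, Sequence


def days_until_limit(daily_usage: Sequence[float], limit: float) -> Optional[float]:
    """Estimate days until usage reaches `limit` from a least-squares linear trend.

    Returns None when the trend is flat or falling (no projected breach) and
    0.0 when the fitted value for the latest day is already at or above the limit.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    # Ordinary least squares with x = 0, 1, ..., n-1 (one sample per day).
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        return None

    fitted_today = intercept + slope * (n - 1)
    if fitted_today >= limit:
        return 0.0
    return (limit - fitted_today) / slope


# Hypothetical disk usage in GB over five days, against a 100 GB volume.
print(days_until_limit([50, 52, 54, 56, 58], limit=100))
```

With these sample numbers, usage grows by 2 GB per day from 58 GB, so the projection is 21 days of headroom.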

💾

Backup & DR

Automated RDS snapshots, EBS backups, and Velero for Kubernetes state. Regular restore drills to verify backup integrity. Documented DR runbooks with tested RTO/RPO.

RDS Snapshots, Velero, DR Drills, RTO/RPO
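A restore drill can be scored against RTO and RPO targets with simple timestamp arithmetic: the achieved RPO is the gap between the last restorable backup and the failure, and the achieved RTO is the time from failure to restored service. The sketch below assumes hypothetical names (`RestoreDrill`, `evaluate_drill`) and made-up timestamps and targets; it is not tied to any specific backup tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RestoreDrill:
    last_backup_at: datetime  # most recent backup that could be restored
    failure_at: datetime      # simulated failure time
    restored_at: datetime     # when the restored service passed health checks


def evaluate_drill(drill: RestoreDrill, rpo_target: timedelta, rto_target: timedelta) -> dict:
    """Compare the achieved RPO and RTO of a drill against their targets."""
    achieved_rpo = drill.failure_at - drill.last_backup_at  # window of data loss
    achieved_rto = drill.restored_at - drill.failure_at     # downtime
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= rpo_target,
        "rto_met": achieved_rto <= rto_target,
    }


# Hypothetical drill: backup at 02:00, failure at 02:45, service back at 03:30.
drill = RestoreDrill(
    last_backup_at=datetime(2024, 1, 10, 2, 0),
    failure_at=datetime(2024, 1, 10, 2, 45),
    restored_at=datetime(2024, 1, 10, 3, 30),
)
result = evaluate_drill(drill, rpo_target=timedelta(hours=1), rto_target=timedelta(minutes=30))
print(result["rpo_met"], result["rto_met"])
```

In this example the 45-minute data-loss window fits a one-hour RPO target, while the 45-minute recovery misses a 30-minute RTO target, which is the kind of gap a drill is meant to surface.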

💰

Cost Optimisation

Monthly AWS cost review: Reserved Instance recommendations, unused resource cleanup, Savings Plan analysis, and budget alerts. On average, clients save 15–25% on ongoing costs.

RI Recommendations, Budget Alerts, Savings Plans, Cleanup
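A budget alert is often based on projected rather than actual spend. As a minimal sketch, assuming a simple run-rate projection (the function names, the budget, and the figures below are hypothetical and not taken from any AWS API), month-to-date spend can be extrapolated to month end and compared with the budget:

```python
import calendar
from datetime import date


def projected_month_end_spend(spend_to_date: float, today: date) -> float:
    """Extrapolate month-to-date spend to the full month at the current daily run rate."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def should_alert(spend_to_date: float, today: date, monthly_budget: float) -> bool:
    """Return True when the projected month-end spend exceeds the budget."""
    return projected_month_end_spend(spend_to_date, today) > monthly_budget


# Hypothetical figures: 4,000 spent by 10 April against a 10,000 monthly budget.
today = date(2024, 4, 10)
print(projected_month_end_spend(4000, today))
print(should_alert(4000, today, monthly_budget=10000))
```

A straight run rate ignores one-off charges and mid-month changes, so treat the projection as an early warning signal and not as a forecast.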

Support Plans

Choose the Right Level of Coverage

Three tiers of managed infrastructure support — from essential monitoring to full dedicated engineering.

Essential

Monitoring + Alerts

For teams that want visibility and email alerts but handle incidents themselves.

✓ 24×7 Prometheus + Grafana monitoring
✓ AlertManager + email/Slack alerts
✓ Monthly health report
– On-call incident response (not included)
– Patch management (not included)

How We Handle Production Incidents

A defined, repeatable incident response process so every outage is handled calmly and systematically.

Detect

AlertManager fires within 60 seconds of threshold breach. On-call engineer paged via PagerDuty or Slack immediately.

Respond

Engineer acknowledges within SLA. Immediate triage — is it affecting users? Can we mitigate now? War room opened if P1.

Resolve

Mitigation applied — rollback, scale-up, failover, or fix deployed. Service restored. You're updated throughout via Slack.

Postmortem

Blameless postmortem written within 48 hours — root cause, timeline, impact, and concrete prevention steps.

FAQ

Support Questions

What does '24×7 support' actually mean?

On the Growth and Enterprise plans, it means an on-call engineer can be reached and will respond to critical incidents at any hour of the day or night, including weekends and holidays, within a defined SLA response time. On the Essential plan, 24×7 refers to automated monitoring only; human response is within business hours.

What counts as a 'critical incident'?

A P1 critical incident is anything actively impacting production users — site down, API failing, data unavailable, payment processing broken, or a security breach in progress. P2 is degraded performance or a time-sensitive bug. P3 is anything that can wait until business hours.
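The severity definitions above can be expressed as a small triage rule. This is an illustrative sketch only: the `Severity` enum, the `classify_incident` function, and its flag names are assumptions made for the example, not part of any tooling described here.

```python
from enum import Enum


class Severity(Enum):
    P1 = "critical: actively impacting production users"
    P2 = "degraded performance or a time-sensitive bug"
    P3 = "can wait until business hours"


def classify_incident(
    *,
    production_users_impacted: bool = False,
    security_breach_in_progress: bool = False,
    degraded_performance: bool = False,
    time_sensitive_bug: bool = False,
) -> Severity:
    """Map triage answers to a severity level, checking the most severe first."""
    if production_users_impacted or security_breach_in_progress:
        return Severity.P1
    if degraded_performance or time_sensitive_bug:
        return Severity.P2
    return Severity.P3


# Example: the API is failing for live users.
print(classify_incident(production_users_impacted=True).name)
# Example: pages are slow, but nothing is down.
print(classify_incident(degraded_performance=True).name)
```

Checking P1 conditions first means an incident that is both degraded and user-impacting is always escalated to the higher severity.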

Do you manage infrastructure you didn't build?

Yes. We do an onboarding audit to understand your existing infrastructure, document it, set up monitoring, and then take it over on a support retainer. We've onboarded dozens of legacy environments this way.

How are we kept informed during an incident?

You get a dedicated Slack channel for all infrastructure communication. During incidents, we post updates every 15–30 minutes. For Enterprise clients, we join your existing incident management tool (OpsGenie, PagerDuty, or Jira).
