Observability Specialist
Posted yesterday
Job details
| Company | Deel |
| Category | IT |
| Workload | 100% |
| Home office | 100% remote |
| Location | Remote |
Job description
Key Responsibilities
Design, implement, and maintain scalable observability solutions for cloud-native environments
Own monitoring across AWS and Kubernetes (EKS) environments, covering clusters and workloads
Operate and maintain self-hosted monitoring stacks (e.g., Prometheus, Grafana, Mimir, Loki, Tempo)
Manage and optimize DataDog (metrics, logs, APM, alerts, cost monitoring)
Improve observability architecture to support high availability, scalability, and fault tolerance
Implement monitoring cost optimization strategies (log/trace sampling, retention policies, storage optimization)
Automate observability infrastructure using Infrastructure as Code (Terraform, Helm, etc.)
Integrate monitoring and alerting into CI/CD pipelines (GitHub Actions is an advantage)
Support capacity planning and performance tuning initiatives
Collaborate with DevOps, SRE, and Engineering teams to embed observability best practices
Drive continuous improvement of monitoring standards, tooling, and reliability practices
Required Skills & Experience
5+ years of hands-on experience in monitoring / observability engineering within cloud-native environments
Strong experience with AWS services
5+ years of hands-on experience working with Kubernetes
Solid knowledge of Kubernetes monitoring: metrics, logs, and traces for clusters and workloads, plus alerting, SLOs, SLIs, and dashboards
Proven experience operating and maintaining self-hosted monitoring stacks (Prometheus, Grafana, Mimir, Loki, and Tempo are an advantage)
Experience designing or improving observability architectures at scale
Experience with DataDog (metrics, logs, APM, alerts, and cost monitoring)
Strong understanding of high availability, scalability, and fault-tolerant architectures
Experience with monitoring cost optimization, including log and trace sampling strategies and storage and retention optimization
Ability to automate monitoring tasks using Infrastructure as Code and scripting (Terraform, Helm, etc.)
Familiarity with CI/CD pipelines and integrating monitoring into deployment workflows (GitHub Actions is an advantage)
Experience with capacity planning and performance tuning
Soft Skills
Strong problem-solving and analytical skills
Ability to work independently and take ownership of complex systems
Good communication skills and the ability to collaborate with DevOps, SRE, and other teams
Proactive mindset with a focus on continuous improvement
Benefits
- Stock grant opportunities dependent on your role, employment status and location
- Additional perks and benefits based on your employment status and country
- The flexibility of remote work, including optional WeWork access