Observability Specialist

Posted yesterday

Key Responsibilities

Design, implement, and maintain scalable observability solutions for cloud-native environments
Own monitoring across AWS and Kubernetes (EKS) environments, covering clusters and workloads
Operate and maintain self-hosted monitoring stacks (e.g., Prometheus, Grafana, Mimir, Loki, Tempo)
Manage and optimize Datadog (metrics, logs, APM, alerts, cost monitoring)
Improve observability architecture to support high availability, scalability, and fault tolerance
Implement monitoring cost optimization strategies (log/trace sampling, retention policies, storage optimization)
Automate observability infrastructure using Infrastructure as Code (Terraform, Helm, etc.)
Integrate monitoring and alerting into CI/CD pipelines (GitHub Actions is an advantage)
Support capacity planning and performance tuning initiatives
Collaborate with DevOps, SRE, and Engineering teams to embed observability best practices
Drive continuous improvement of monitoring standards, tooling, and reliability practices

Required Skills & Experience

5+ years of hands-on experience in monitoring / observability engineering within cloud-native environments
Strong experience with AWS services
5+ years of hands-on experience working with Kubernetes
Solid knowledge of Kubernetes monitoring, including metrics, logs, and traces for clusters and workloads, as well as alerting, SLOs, SLIs, and dashboards
Proven experience operating and maintaining self-hosted monitoring stacks (Prometheus, Grafana, Mimir, Loki, Tempo are an advantage)
Experience designing or improving observability architectures at scale
Experience with Datadog (metrics, logs, APM, alerts, and cost monitoring)
Strong understanding of high availability, scalability, and fault-tolerant architectures
Experience with monitoring cost optimization, including log and trace sampling strategies and storage and retention optimization
Ability to automate monitoring tasks using Infrastructure as Code and scripting (Terraform, Helm, etc.)
Familiarity with CI/CD pipelines and integrating monitoring into deployment workflows (GitHub Actions is an advantage)
Experience with capacity planning and performance tuning

Soft Skills

Strong problem-solving and analytical skills
Ability to work independently and take ownership of complex systems
Good communication skills and the ability to collaborate with DevOps, SRE, and other teams
Proactive mindset with a focus on continuous improvement

Benefits

Stock grant opportunities dependent on your role, employment status and location
Additional perks and benefits based on your employment status and country
The flexibility of remote work, including optional WeWork access

Apply directly on Deel's website.
