Staff Site Reliability Engineer I
17 days old
Job details
| Company | Remote |
| Category | IT |
| Workload | 100% |
| Home office | 100% remote |
| Benefits | Flexible working hours, varied training opportunities, profit sharing |
| Location | Remote |
Job description
What this job can offer you
As a Staff SRE at Remote, you will own the technical direction of our SRE platform, shaping its architecture, reliability strategy, and long-term evolution. This is a leadership role as much as a technical one: you'll drive platform-wide initiatives, set the reliability bar for engineering teams across the organisation, and be a force multiplier for the engineers around you.
A key part of this role is identifying and leading opportunities to leverage AI: from reducing operational toil to enabling engineering teams to build, ship, and operate software more effectively. You'll work with a high degree of autonomy, translating technical risks into business impact and aligning with Engineering Managers, Team Leads, and Product teams to ensure reliability and engineering efficiency are built into everything we do.
What you bring
Technical
- 8+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering
- Deep expertise in Kubernetes: operating, designing, and scaling production clusters
- Proven experience designing and managing cloud infrastructure on AWS (or other cloud providers) at scale
- Strong infrastructure-as-code practice with Terraform
- Experience defining and operating reliability frameworks: SLOs, SLIs, error budgets, alerting strategies
- Solid observability background: Datadog, Grafana/Prometheus, or similar
- Proficiency with CI/CD platforms (GitLab CI, GitHub Actions, or similar) and deployment automation
- Comfortable with Bash and scripting for automation; broader programming skills are a plus
- Experience with container tooling (Docker) and the broader ecosystem around it
- Curiosity and practical experience applying AI tools to infrastructure, operations, or developer tooling: whether through AI-assisted automation, LLM-powered workflows, or intelligent observability
Leadership & behavioural
- Proven track record of driving platform-wide technical initiatives and influencing engineering direction without formal authority
- Strong communicator: able to tailor messaging to technical and non-technical audiences, write clearly, and align stakeholders across teams
- Self-directed: able to identify what needs attention, define the path forward, and execute with minimal supervision
- Experience mentoring senior engineers and creating space for others to lead and grow
- Comfortable navigating ambiguity and translating vague requirements into concrete solutions
- Approaches technical problems with a business lens and understands the cost and value of engineering decisions
Nice to have
- Excellent communication and interpersonal skills
- Holistic debugging skills
- Security knowledge and capabilities from a defensive and offensive standpoint
Key Responsibilities
- Own the technical direction of Remote's SRE/Platform domain: its architecture, tooling, and long-term roadmap
- Define and drive the reliability strategy across the platform: SLOs/SLIs, error budgets, observability, and incident management maturity
- Lead complex, cross-team infrastructure initiatives from discovery through delivery, delegating effectively and keeping projects aligned with business goals
- Identify and lead AI enablement initiatives across the engineering organisation, exploring where AI can reduce operational overhead, accelerate development workflows, improve incident response, and unlock new capabilities for engineering teams
- Drive AI-powered automation for platform operations: intelligent alerting, automated incident triage, self-healing infrastructure, and AI-assisted runbooks, reducing toil and freeing engineers to focus on higher-leverage work
- Contribute to capacity planning and cost-efficiency of Remote's infrastructure
- Mentor senior engineers, raising the technical bar through code reviews, design feedback, and hands-on guidance
- Collaborate with the Security team on platform hardening, threat mitigation, and compliance
- Be a steward of engineering quality across the SRE team, championing best practices, managing technical debt deliberately, and raising standards over time
- Contribute to hiring, onboarding, and continuously improving how the SRE team operates
Benefits
- work from anywhere
- flexible paid time off
- flexible working hours (we are async)
- 16 weeks paid parental leave
- mental health support services
- stock options
- learning budget
- home office budget & IT equipment
- budget for local in-person social events or co-working spaces