Senior Site Reliability Engineer
Our client is looking for a Senior Site Reliability Engineer to join the team and help build, automate, and secure the infrastructure that powers a cutting-edge cyber range platform. You will work across the full breadth of the SRE practice, spanning traditional site reliability, DevOps, and DevSecOps, and support deployments across self-hosted data centers, customer-provided hardware, and pre-packaged appliance environments. Collaboration is key: you’ll partner with engineering teams across the organization, contribute to infrastructure planning, and mentor junior team members. This role balances hands-on delivery with long-term automation thinking and requires someone who builds well-engineered tooling at scale rather than relying on manual fixes.
Who you are:
- You bring strong software engineering skills beyond scripting: you write production-quality code and build maintainable tooling
- You think in terms of reliability and operability, always looking to automate and improve
- You enjoy helping other engineers solve problems and can context-switch between deep project work and support requests
- You’re security-minded, with a practical approach to hardening infrastructure and deployments
- You’re comfortable operating in complex, multi-environment setups, including on-premises and air-gapped deployments
- You build trust across teams through reliability, follow-through, and clear communication
- You’re open to feedback and passionate about sharing knowledge to raise the team’s overall capability
What you’ll be doing:
- Designing and building infrastructure automation for consistent, repeatable deployments across SimSpace-hosted, customer-provided, and appliance environments
- Developing and maintaining CI/CD pipelines using GitHub Actions and ArgoCD to improve build reliability and developer experience
- Managing and evolving Kubernetes-based infrastructure, including application packaging and deployment workflows using Grafana Tanka and Kustomize
- Building and maintaining observability tooling using the Grafana stack for monitoring, alerting, logging, and dashboards
- Identifying and resolving performance and reliability issues, including pod scaling, resource allocation tuning, and latency bottleneck analysis
- Hardening deployment pipelines and runtime environments through container security, network segmentation, image scanning, and vulnerability management
- Serving as a hands-on infrastructure partner to engineering teams across the organization
- Contributing to incident response through a light on-call rotation and driving post-incident improvements
- Mentoring junior and mid-level SRE team members
Languages and tools we use:
- GitHub Actions, ArgoCD, Kubernetes, Grafana stack, Grafana Tanka, Kustomize, VMware
Requirements:
- 5–7 years of experience in site reliability, DevOps, or infrastructure engineering
- Hands-on experience with Kubernetes in production, including deployment tooling, cluster operations, and performance tuning
- Solid experience building and maintaining CI/CD pipelines, preferably with GitHub Actions, ArgoCD, or similar GitOps tooling
- Practical understanding of infrastructure-as-code principles and configuration management
- Experience with observability and monitoring tools, preferably the Grafana stack
- Security-minded approach to infrastructure, including container and network security
- Working knowledge of VMware virtualization environments
- Strong written and verbal communication skills
Nice to have:
- Experience delivering software to customer-managed, on-premises, or air-gapped environments
- Familiarity with compliance-driven or security-hardened deployment environments
- Experience with vulnerability scanning tools and security automation
- Background in cybersecurity or networking contexts