Site Reliability Engineer – Cloud Services
Aspect
Job Description
GENERAL SCOPE & SUMMARY
- The Site Reliability Engineer is responsible for automation and infrastructure buildout, including deploying Kubernetes clusters and migrating workloads from traditional VMs into containers in a PCI-compliant environment.
- This involves bringing applications onto a CI/CD platform, ensuring uptime, and identifying service level objectives.
- This position will work closely with the R&D team to build and run large-scale, massively distributed, fault-tolerant systems through mechanisms like automation, and to evolve those systems by pushing for changes that improve reliability and velocity.
PRIMARY ROLE & RESPONSIBILITIES
- Support the implementation and design of containerized systems utilizing CI/CD and Kubernetes
- Deploy and administer Kubernetes clusters in multi-cloud environments
- Configure Linux systems and software packages
- Collaborate with the team to capture procedural knowledge of the tasks to be automated
- Automate operational tasks and assist in the transition to service ownership models
- Collaborate across project teams to simplify and improve software lifecycle processes
- Manage and maintain infrastructure as code
- Provision and configure cloud assets using scripts, APIs, CLIs, and management consoles
- Conduct Production Readiness Reviews (PRRs) to determine the reliability of systems before release
- Participate in postmortem reviews
Qualifications
REQUIREMENTS
- 3 years of recent, practical experience working with Kubernetes
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems
- Ability to debug and optimize scripts and to automate routine tasks
- Ideally, a systematic problem-solving approach coupled with strong communication skills and a sense of ownership and drive
- Cloud experience: AWS, Azure, and other cloud platforms
- Experience with Cassandra, Argo CD, ChartMuseum, and ChatOps
- Experience with Terraform, Ansible, Vault, and Jenkins administration
- Experience with Linux operating systems and network engineering
- Experience with monitoring systems such as Datadog and Prometheus
- Experience with Continuous Delivery
- Ability to define Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
- Bachelor’s degree or equivalent level of experience