Site Reliability Engineer – Cloud Services

Aspect

Job Description
GENERAL SCOPE & SUMMARY

The Site Reliability Engineer is responsible for automation and infrastructure buildout, including deploying Kubernetes clusters and migrating workloads from traditional VMs into containers in a PCI-compliant environment.
This involves working to get applications into a CI/CD platform, ensuring uptime, and identifying service level objectives.
This position works closely with the R&D team to build and run large-scale, massively distributed, fault-tolerant systems through mechanisms such as automation, and to evolve those systems by pushing for changes that improve reliability and velocity.

PRIMARY ROLE & RESPONSIBILITIES

Support the implementation and design of containerized systems utilizing CI/CD and Kubernetes
Deploy and administer Kubernetes clusters in multi-cloud environments
Configure Linux systems and software packages
Collaborate with the team to capture procedural knowledge of the tasks to be automated
Automate operational tasks and assist in the transition to service ownership models
Collaborate across project teams to simplify and improve software lifecycle processes
Manage and maintain infrastructure as code
Provision and configure cloud assets using scripts, APIs, CLIs, and management consoles
Conduct Production Readiness Reviews (PRRs) to determine the reliability of systems before release
Participate in postmortem reviews

Qualifications

REQUIREMENTS

3 years of recent, practical Kubernetes experience
Expertise in designing, analyzing, and troubleshooting large-scale distributed systems
Ability to debug and optimize scripts and automate routine tasks
Ideally, a systematic problem-solving approach coupled with strong communication skills and a sense of ownership and drive
Cloud experience: AWS, Azure, and other cloud platforms
Experience with Cassandra, Argo CD, ChartMuseum, and ChatOps
Experience with Terraform, Ansible, Vault, and Jenkins administration
Experience with Linux operating systems and network engineering
Experience with monitoring systems such as Datadog and Prometheus
Experience with Continuous Delivery
Ability to define Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
Bachelor’s degree or equivalent level of experience