Site Reliability Engineer – Cloud Services

Aspect

Job Description
GENERAL SCOPE & SUMMARY

  • The Site Reliability Engineer is responsible for automation and infrastructure buildout, including deploying Kubernetes clusters and migrating workloads from traditional VMs into containers in a PCI-compliant environment.
  • This involves working to get applications into a CI/CD platform, ensuring uptime, and identifying service level objectives.
  • This position works closely with the R&D team to build and run large-scale, massively distributed, fault-tolerant systems through mechanisms such as automation, and to evolve those systems by pushing for changes that improve reliability and velocity.

PRIMARY ROLE & RESPONSIBILITIES

  • Support the design and implementation of containerized systems using CI/CD and Kubernetes
  • Deploy and administer Kubernetes clusters in multi-cloud environments
  • Configure Linux systems and software packages
  • Collaborate with the team to capture procedural knowledge of the tasks to be automated
  • Automate operational tasks and assist in the transition to service ownership models
  • Collaborate across project teams to simplify and improve software lifecycle processes
  • Manage and maintain infrastructure as code
  • Provision and configure cloud assets using scripts, APIs, CLIs, and management consoles
  • Conduct Production Readiness Reviews (PRRs) to determine the reliability of systems before release
  • Participate in postmortem reviews

Qualifications

REQUIREMENTS

  • At least 3 years of recent, practical experience working with Kubernetes
  • Expertise in designing, analyzing, and troubleshooting large-scale distributed systems
  • Ability to debug and optimize scripts and to automate routine tasks
  • Ideally, a systematic problem-solving approach coupled with strong communication skills and a sense of ownership and drive
  • Cloud experience: AWS, Azure, and other cloud platforms
  • Experience with Cassandra, Argo CD, ChartMuseum, and ChatOps
  • Experience with Terraform, Ansible, Vault, and Jenkins administration
  • Experience with Linux operating systems and network engineering
  • Experience with monitoring systems such as Datadog and Prometheus
  • Experience with Continuous Delivery
  • Ability to define Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
  • Bachelor’s degree or equivalent level of experience