Sr. Site Reliability Engineer
As a Sr. Site Reliability Engineer, you will be an integral member of a dynamic SRE team that continuously improves our AWS cloud deployment platform with an “automation first” approach, in support of our rapid expansion.
· Lead team initiatives to continuously refine our AWS deployment practices for improved reliability, repeatability, and security. You’ll create and contribute to plans and collaborate with other DevOps team members. These high-visibility initiatives will help increase service levels, lower costs, and deliver features more quickly.
· Design effective monitoring and alerting (for conditions such as application errors and high memory usage) and log aggregation approaches (to quickly access logs for troubleshooting or to generate reports for trend analysis). Work closely with business stakeholders to proactively notify them of issues and communicate metrics, using tools including AWS CloudWatch, Datadog, and ClearData.
· Write code and scripts to automate provisioning of AWS services and to configure services, using tools and languages including AWS CLI / API, Terraform, Ansible, Chef, Python, Bash, and Git.
· Configure build pipelines to support automated testing and deployments using tools including Jenkins, CircleCI, and AWS CodeDeploy. You’ll configure these pipelines for specific products and help optimize them for performance and scalability.
· Help refine, implement, and verify DevSecOps security practices (including regular security patching, minimum-permissions accounts and policies, and encrypt-everything) in compliance with Health IT, government, and other standards and regulations, using tools like SonarQube and Veracode to analyze and verify compliance.
· Document and diagram deployment-specific aspects of architectures and environments, working closely with Software Engineers, Software Engineers in Test, and others in DevOps.
· Troubleshoot issues in production and other environments, applying debugging and problem-solving techniques (e.g., log analysis, non-invasive tests), working closely with development and product teams.
· Suggest improvements to deployment patterns and practices based on lessons learned from past deployments and production issues; collaborate with the DevOps team to implement them.
· Promote a DevOps culture, including building relationships with other technical and business teams.
· Work closely with InterOps to deploy and configure the platform to on-board clinics.
· Work closely with the Engineering-Data team to automate deployment and configuration of infrastructure to support the rollout of data products/projects.
· Maintain system and data security at a high standard so that the confidentiality, integrity, and availability of the applications are not compromised.
· Ability to automate away manual interactions and a passion for helping developers write code that works
· A strong understanding of Linux administration including Bash scripting
· Networking expertise, including VPCs, SDNs (e.g., Amazon / Azure), VLANs, routers, and firewalls
· Familiarity with at least one IaC / CM tool such as Terraform, Ansible, Chef, or Puppet
· Familiarity with at least one code build / deploy tool such as Jenkins or CircleCI
· Familiarity with DB setup, configuration and monitoring
· A bachelor's degree in science, technology, engineering, or a similar field is required.
· Ability to enable capabilities through a blend of process and technology
· 6+ years of AWS administration experience / training, including provisioning EC2 instances, VPCs, Elastic Beanstalk, Lambda functions, RDS Aurora Server/serverless databases, S3 storage, IAM security, ECS containers, and CloudWatch metrics & logs
· 5+ years of experience developing and / or deploying serverless functions using AWS Lambda, Azure Functions, or Google Cloud Functions
· Experience developing and / or deploying Docker containers on ECS/EKS or Kubernetes
· Experience automating infrastructure provisioning to enable a complete application ecosystem on demand
· 7+ years of experience with SQL; adept at using RDS-PostgreSQL or another DBMS
· Experience with monitoring / alerting tools such as New Relic, Grafana, Prometheus, Sysdig
· Experience with log aggregation tools such as Datadog, ELK, Splunk