A Site Reliability Engineer (SRE) combines software engineering practices with operational expertise to ensure systems are reliable, scalable, and performant. SREs define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to balance rapid feature deployment against system stability. Using automation, infrastructure as code (IaC), monitoring, and incident response frameworks, they proactively identify and resolve issues, optimize resource usage, and reduce downtime. By continually improving reliability through automation, process refinement, and collaboration with development teams, SREs help organizations deliver a dependable user experience while managing system complexity and operational risk.
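To make the relationship between an SLO, an SLI, and an error budget concrete, here is a minimal sketch in Python. It assumes a simple request-based availability SLI (successful requests divided by total requests); the function name `error_budget`, its parameters, and the sample numbers are hypothetical and chosen only for illustration, not taken from any particular team's tooling.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget usage for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (that is, 99.9% of requests succeed).
    All names here are illustrative assumptions, not a standard API.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the measured fraction of requests that succeeded.
    sli = (total_requests - failed_requests) / total_requests

    # Error budget: the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests

    # Fraction of the budget still unspent; negative means the SLO was missed.
    remaining_fraction = (allowed_failures - failed_requests) / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_remaining_fraction": remaining_fraction,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    for key, value in report.items():
        print(f"{key}: {value}")
```

In this hypothetical window the SLO tolerates roughly 1,000 failed requests, so 250 failures leave about three quarters of the budget unspent. A team might treat a healthy remaining budget as room to ship changes faster, and a depleted one as a signal to prioritize stability work.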
Common Tasks and Duties:
- System Performance and Reliability: Enhancing system performance and reliability through software engineering practices.
- Monitoring and Observability: Implementing and managing monitoring, observability, and logging tools to ensure system health and performance.
- Collaboration: Working with clients, contractors, and internal teams to manage projects from planning to completion.
- Maintenance and Optimization: Providing engineering support for maintenance and assisting site maintenance with cost-effective solutions that minimize production impacts.
Essential Skills and Experience:
- Software Development: Strong coding skills, particularly in C# and the .NET ecosystem.
- Cloud Platforms: Experience with cloud services, especially Azure.
- Monitoring Tools: Proficiency with monitoring and logging tools such as Datadog, Prometheus, Grafana, and Splunk.
- Linux Systems: Strong experience with Linux operating systems.
- Project Management: Ability to lead and manage projects, collaborating effectively with various stakeholders.
Sample Job Listings:
- Lead Site Reliability Engineer (Sydney, NSW):
- Responsibilities: Lead the SRE team to continuously improve the reliability of ABC’s audience-facing platforms and services on web, mobile, and streaming media.
- Requirements: Experience with the latest DevOps engineering technologies. The listing also highlights a flexible working environment and the opportunity to make a difference at a national broadcaster.
- Site Reliability Engineer (Melbourne, VIC):
- Responsibilities: Apply software engineering experience to enhance system performance and reliability.
- Requirements: Development background with strong coding skills, experience with monitoring tools like Prometheus, Grafana, and Splunk, and strong Linux experience.
- Site Reliability Engineer (Perth, WA):
- Responsibilities: Join the team building the world’s largest radio telescope, providing engineering support for maintenance and assisting site maintenance with cost-effective solutions.
- Requirements: None stated in the listing, which instead highlights a collaborative, flexible, innovative workplace that values your career.
- Reliability Engineer (Mackay, QLD):
- Responsibilities: Enhance equipment reliability and performance, analyze failure modes, and implement maintenance strategies.
- Requirements: Experience in reliability engineering within the mining sector, strong analytical skills, and proficiency in maintenance management systems.
- Reliability Engineer (Perth Airport, WA):
- Responsibilities: Provide engineering support for maintenance, assist site maintenance with cost-effective solutions, and minimize production impacts.
- Requirements: Background in mechanical engineering, experience in maintenance strategies, and strong problem-solving skills.
- Reliability Engineer (Perth, WA):
- Responsibilities: Set up preventative maintenance strategies for a large fleet, analyze equipment performance, and implement reliability improvements.
- Requirements: Experience in reliability engineering, knowledge of mining equipment, and strong analytical abilities.
- Site Reliability Engineer (Melbourne, VIC):
- Responsibilities: Apply software engineering experience to enhance system performance and reliability, work with monitoring tools, and manage Linux-based systems.
- Requirements: Development background with strong coding skills, experience with monitoring tools such as Prometheus, Grafana, and Splunk, and strong Linux experience.