Job Description

Purpose of the role

To apply software engineering techniques, automation, and best practices in incident response to ensure the reliability, availability, and scalability of systems, platforms, and technology.

Accountabilities

  • Availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
  • Resolution and analysis of, and response to, system outages and disruptions, and implementation of measures to prevent similar incidents from recurring.
  • Development of tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
  • Monitoring and optimisation of system performance and resource usage, identification and resolution of bottlenecks, and implementation of best practices for performance tuning.
  • Collaboration with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and close cooperation with other teams to ensure smooth and efficient operations.
  • Staying informed of industry technology trends and innovations, and actively contributing to the organisation's technology communities to foster a culture of technical excellence and growth.

Vice President Expectations

  • To contribute to or set strategy, drive requirements and make recommendations for change. Plan resources, budgets, and policies; manage and maintain policies/processes; deliver continuous improvements and escalate breaches of policies/procedures.
  • If the position has leadership responsibilities, People Leaders are expected to demonstrate a clear set of leadership behaviours to create an environment for colleagues to thrive and deliver to a consistently excellent standard. The four LEAD behaviours are: L – Listen and be authentic, E – Energise and inspire, A – Align across the enterprise, D – Develop others.
  • OR, for an individual contributor, they will be a subject matter expert within their own discipline and will guide technical direction. They will lead collaborative, multi-year assignments, guide team members through structured assignments, and identify the need for the inclusion of other areas of specialisation to complete assignments. They will train, guide and coach less experienced specialists and provide information affecting long-term profits, organisational risks and strategic decisions.
  • Advise key stakeholders, including functional leadership teams and senior management on functional and cross functional areas of impact and alignment.
  • Manage and mitigate risks through assessment, in support of the control and governance agenda.
  • Demonstrate leadership and accountability for managing risk and strengthening controls in relation to the work your team does.
  • Demonstrate comprehensive understanding of the organisation's functions to contribute to achieving the goals of the business.
  • Collaborate with other areas of work, including business-aligned support areas, to keep up to speed with business activity and business strategies.
  • Create solutions based on sophisticated analytical thought comparing and selecting complex alternatives. In-depth analysis with interpretative thinking will be required to define problems and develop innovative solutions.
  • Adopt and include the outcomes of extensive research in problem solving processes.
  • Seek out, build and maintain trusting relationships and partnerships with internal and external stakeholders in order to accomplish key business objectives, using influencing and negotiating skills to achieve outcomes.

All colleagues will be expected to demonstrate the Barclays Values of Respect, Integrity, Service, Excellence and Stewardship – our moral compass, helping us do what we believe is right. They will also be expected to demonstrate the Barclays Mindset – to Empower, Challenge and Drive – the operating manual for how we behave.

Step into the role of Lead Site Reliability Engineer (SRE) at Barclays, where you will be a senior technical expert responsible for driving end-to-end resilience, reliability, and scalability across our mission-critical virtual platform. This role focuses on ensuring systems are designed for fault tolerance, observability, and operational excellence.

You will perform deep technical reviews, troubleshoot complex issues, and define patterns for resiliency by design. As a hands-on engineer, you will collaborate with development and production support teams, advocate chaos engineering, and build a culture of designing for failure. This position requires strong technical breadth across infrastructure, applications, networks, databases, and integrations, combined with expertise in modern reliability engineering practices.

Key responsibilities:

  • Reliability Engineering: Drive strategies to improve reliability, maintainability, and scalability across platform components.
  • Architecture and Design Review: Conduct deep technical assessments of system architectures, identifying risks and recommending improvements for fault tolerance and disaster recovery.
  • Observability & Monitoring: Design and implement full-stack observability solutions, including metrics, logging, distributed tracing, and alerting.
  • Incident Management & Root Cause Analysis: Act as a senior escalation point for production incidents, lead RCA, and implement permanent fixes to prevent recurrence.
  • Chaos Engineering & Failure Testing: Advocate and implement chaos engineering principles to validate system resilience under real-world failure scenarios.
  • Automation & Tooling: Develop automation for failover, capacity management, and self-healing mechanisms to reduce operational risk.
  • Continuous Improvement: Analyse service risk assessments and production incidents to identify systemic issues and drive long-term improvements.

This role is based in Knutsford, with a hybrid working model requiring a minimum of 2/3 days per week in the office.

Some other highly valued skills may include:

  • Understanding of cloud solutions, preferably VMware products.
  • Exposure to coding in Python.

You may be assessed on the key critical skills relevant for success in role, such as risk and controls, change and transformation, business acumen, strategic thinking and digital and technology, as well as job-specific technical skills.

To be successful as a Lead Site Reliability Engineer (SRE), you should have experience with:

  • Technical Expertise: Proven experience building and operating fault-tolerant, highly available systems at scale.
  • Architecture & Design: Strong knowledge of distributed systems, resiliency patterns (circuit breakers, retries, failover), and disaster recovery strategies.
  • Problem-Solving: Ability to troubleshoot complex technical issues across distributed systems and perform deep root cause analysis.
  • Collaboration & Influence: Skilled at working with development, operations, and architecture teams to embed reliability into design and delivery.
Barclays
