FDM is a global business and technology consultancy seeking an SRE Lead to support a major global financial services organisation as it establishes its first formal Site Reliability Engineering function. This is initially a 12-month contract with the potential to become permanent, and it is a hybrid role based in Bristol.
This role offers a unique opportunity to build SRE almost from first principles within a large, complex enterprise environment. The successful candidate will lead the transition from largely reactive production support toward a proactive, engineering‑led reliability model, while influencing both legacy platforms and a new, standards‑driven future environment.
You will act as the founding SRE leader, setting the vision, operating model, and priorities for the function, while driving improvements in service stability, resilience, observability, and overall operational maturity. You will lead the organisation’s shift from reactive firefighting to data‑driven, preventative reliability practices, and influence platform and application design so that reliability is engineered in from the outset rather than addressed after issues arise.
Responsibilities:
- Define and embed a scalable SRE operating model aligned to organisational culture and maturity, with clear roles, responsibilities, and ways of working across SRE, platform, infrastructure, and application teams.
- Build and grow the SRE capability, shaping team structures, backlog priorities, and operating rhythms to improve reliability across both legacy environments and a modern, automated target platform.
- Establish meaningful service reliability measurement, implementing Critical User Journeys, SLIs, and SLOs to create baselines and provide end‑to‑end service health visibility.
- Embed SLO‑driven decision making, enabling balanced, data‑led trade‑offs between availability, delivery velocity, and operational risk.
- Reduce operational toil and increase resilience through automation initiatives, including runbook automation, self‑healing approaches, and improved deployment, patching, and recovery strategies.
- Strengthen incident and problem management, enhancing major incident response, coordination, and communication, and driving high‑quality root cause analysis to prevent recurrence.
- Define and implement pragmatic observability standards across logging, metrics, tracing, alerting, and dashboards to reduce noise and improve signal quality.
- Lead stakeholder engagement and organisational change, influencing senior leaders to adopt SRE principles and acting as a trusted authority on reliability and operational excellence.