Senior Site Reliability Engineer - Chandler, AZ

About The Role

This position requires the successful candidate to work on a W2 directly with FDM. We cannot accept C2C, 1099, or employment sponsorship (e.g. H-1B) for this position.

FDM is a global business and technology consultancy delivering client- and industry-driven solutions through our 5 core specialist Practices: Software Engineering, Data & Analytics, IT Operations, Change & Transformation, and Risk, Regulation & Compliance.

FDM is seeking a Senior Site Reliability Engineer located in Chandler, AZ to support a project in the financial sector. Involvement in this project is initially anticipated to last 24 months but may be extended.

This role is hybrid, with a requirement to be in the office 3 days per week.

About You

Role Description:

FDM is seeking a Site Reliability Engineer. This role requires a strong IT professional focused on establishing and improving monitoring to measure end-to-end performance and end-user availability of systems using a suite of common monitoring tools. The engineer will interface with business partners and operations teams to develop business and technical monitoring requirements. The role is primarily responsible for supporting production or operations of critical applications. The engineer will ensure each application’s operational readiness by evaluating its performance, reliability, scale, resiliency, and observability, and will be responsible for identifying issues in production, triaging them, and partnering with other engineers on the team to identify the root cause. The candidate should possess strong analytical ability in solving IT problems and work toward automation and the elimination of system and process bottlenecks.

Responsibilities:

As part of the SRE team, perform full-stack triaging of alerts and engage other engineers to identify the root cause of application performance and stability issues.
Work with stakeholders such as product owners to define service level objectives (SLOs) for application features and services.
Track performance against SLOs in partnership with development teams or other stakeholders, and ensure systems continue to meet SLOs over time.
Design and develop dashboards and reports to communicate key metrics.
Identify opportunities to improve alerting posture and create/update alerts accordingly.
Work closely with the Engineering team to understand application architecture, perform single-point-of-failure analysis, and create scenarios for testing the resiliency of the application.
Create or derive the NFR/workload model and ensure performance and resiliency are considered early in the SDLC.
Execute performance and chaos tests, and analyze the results using APM and other tools to identify performance and stability issues.
Document findings, analysis, and results, and present them to stakeholders.
Perform analytics on previous incidents to understand root causes and use automation to reduce the probability and/or impact of problem recurrence.
Demonstrate proficiency with DevOps tools, Jira, ServiceNow, and MS Project, and perform tasks using these tools.

Required Hard and Soft Skills/Experience

8-10 years of information technology experience, with 6+ years working on a DevOps, SRE, or performance engineering team
Experience triaging production issues using APM tools such as Dynatrace, AppDynamics, or New Relic and log aggregation tools such as Splunk or ELK
Technical expertise across multiple technology areas, with the ability to diagnose complex issues spanning many technologies and apply this knowledge to effective monitoring of applications
Strong experience in Java and front-end development (UI and UX) (React JS, Angular)
Experience with Apache/Tomcat middleware and Java/RESTful services frameworks (MuleSoft is a plus)
Backend database experience is a must - Oracle, SQL Server, Hadoop
Strong Python, UNIX, Wintel, and Perl/shell scripting skills
Strong experience working with CI/CD tools - Bitbucket, JFrog Artifactory, Jenkins, Terraform/Packer, Ansible
Experience working with Business and Technical leaders to develop KPIs for application monitoring.
Experience with SRE concepts such as SLIs, SLOs, and error budgets, and working with developers to track and improve them on a continuous basis.
Must be able to provide oral and written discussion of analytical findings using narrative and graphic forms.
Must be able to use qualitative and quantitative analytical skills to assess the effectiveness of the operations.
Ability to identify symptoms that point to opportunities for process improvement.
Analytical, investigative, and organizational skills.
Communication skills, including the ability to craft content for executive-level presentations.

Preferred Skills/Experience

Great soft skills: people and communication skills are essential.
Good proficiency in system, network, security and database operations, protocols, and industry standard technologies.
Experience with tools such as Tanium, Artifactory, BMC TrueSight Orchestration
Experience with command line interfaces (CLI), third-party APIs, and integration.
Experience in server administration with Red Hat Enterprise Linux and Windows Server
Good understanding of developing fault-tolerant solutions and knowledge of horizontal scaling and resiliency/HA
Ability to juggle competing priorities and adapt to changes in project scope.

College degree or higher, or equivalent work experience

Characteristics of Top Performer

Resolve deployment challenges, optimize deployments, and help deliver reliable solutions.
Interact with technical leads and architects to discover solutions that help solve challenges faced by Product Engineering teams.
Be part of an enriching team and solve real Production engineering challenges.
Improve knowledge in the areas of DevOps and cloud engineering by using enterprise tools and contributing to project success.
Provide work breakdown and estimates for tasks on agreed scope and development milestones to meet overall project timelines.
Experience with the Agile/Scrum methodology.
Strong verbal and written communication skills.
Highly detail-oriented.
Self-motivated, with the ability to work independently and as part of a team.
Strong willingness and comfort in taking on and challenging development approaches.
Strong analytical and communication skills, with the ability to work effectively with both technical and non-technical resources.
Must have strong debugging and troubleshooting skills.
Assisting the team in instrumenting code for system

About Us

About FDM

FDM powers the people behind tech and innovation. We spot trends, find top talent, and help businesses stay ahead.

With 35+ years of experience, we coach, mentor, and launch fresh thinkers from diverse backgrounds into world-class careers. Partnering with top global companies, we deliver the right talent at the right time—while guiding our people toward exponential growth.

🌍 Global impact – 18 centers across North America, APAC, the UK, and Europe
🚀 25,000+ careers launched – and counting
🤝 300+ trusted client partners

Committed to Diversity, Equity and Inclusion

Tech careers should be for everyone. With 75+ nationalities represented, FDM thrives on diversity, fuels innovation through unique perspectives, and celebrates success together. As an Equal Opportunity Employer and FTSE4Good-listed company, we ensure every qualified applicant gets a fair shot—no barriers, just opportunities.

Additional Considerations

FDM Group, Inc. is registered to operate and hire employees in select states within the US. We will consider employment applications exclusively from candidates who are either residing in one of the following states or willing to relocate to them: Arizona, California, Colorado, Delaware, Florida, Georgia, Illinois, Indiana, Massachusetts, Maryland, Minnesota, North Carolina, New Jersey, New York, Pennsylvania, Tennessee, Texas, Utah, and Virginia.
