As a Site Reliability Engineer, the incumbent will operate as part of the Site Reliability Engineering (SRE) function within the Developer Services department of the Enterprise Architecture organization.
Developer Services is reimagining the way we tackle operational challenges through an engineering mindset. You will have a rare opportunity to be part of a growing, global, cross-functional Site Reliability Engineering team tasked with building SRE from the ground up into a first-class organization. You will be empowered to engineer and own scalable, resilient hybrid cloud solutions (both AWS and on-prem). You will build tooling that enables developers to eliminate toil, increase the observability of their services, and streamline deployments into AWS.
- Requires in-depth knowledge and expertise in their own job discipline and working knowledge of related disciplines
- Leads projects or work streams within broader projects
- Accountable for work of self and sometimes others, provides process and standards advice in area of specialization
- Works independently, receives minimal guidance
- Acts as a resource for colleagues with less experience
- Proactively identifies problems and can present and implement solutions to these problems
- Engages with the TRP Developer Community and regularly assesses the quality of the Developer Services Platform
- Upholds the responsibilities associated with the librarian function for the firm’s Definitive Software Library (DSL). All binary software assets used by the developer community of T. Rowe Price will be hosted in the DSL, and this role covers the proper management of those assets.
- Analyzes, documents, and designs prototypes; implements and maintains products that address business needs and achieve strategic goals. Executes and manages all aspects and phases of testing initiatives for moderate to large-sized system implementations and/or enhancement projects. Regularly and independently interacts with business partners at varying associate and management levels to ensure clarity of the problem or opportunity and to elicit business requirements.
- Works with the SRE Manager and across teams to deliver enterprise observability frameworks that increase observability both within Developer Services and across Global Technology. Continually defines and right-sizes Service Level Objectives for all Unity services to ensure target levels of availability are reached. This includes supporting the defined set of reporting metrics, ensuring services are deployed with metrics out of the box to improve observability across the stack, and assisting the SRE Manager with monitoring and scaling all services and with producing and reporting metrics.
- Participates actively in a regional on-call rotation supporting the Unity SDLC Platform, with a focus on customer support, observability improvements, automation, and actively raising code fixes where possible, always with a clear focus on improving support for the next rotation.
- Works with the SRE Manager and across teams to deliver enterprise automation frameworks that eliminate toil both within Developer Services and across Global Technology. Engineers automation to eliminate the operational challenges of the platform and reach zero-downtime deployment capabilities. An efficient operations team and lead will make a tremendous difference to the customer experience during their adoption of Unity.
PERSONAL ATTRIBUTES / SKILLS / QUALIFICATIONS
- Extensive experience defining and right-sizing Service Level Objectives
- Strong understanding and experience implementing observability tools
- Well-versed in code versioning and management, including Git
- Well-versed in blameless post-mortem practices
- Demonstrable understanding of SDLC methodologies
- Strong interpersonal skills. Must be able to effectively work with people within Technology and outside of the enterprise
- Adaptable and able to learn technology and business quickly
- Excellent written and oral communication skills
- Deep knowledge of AWS resources, networking, security, services, APIs, and billing.
- Engineering SDLC pipelines, including supporting CI/CD-based cloud deployments via Terraform or AWS APIs, Git-based SCMs, and artifact repositories such as Artifactory and Docker Registry at scale (over 50TB of data at over 9000 transactions per second).
- A solid core foundation in infrastructure and systems engineering, including Unix/Linux compute, networking, storage, and monitoring stacks.
- Experience developing microservices in both multi-tenant and dedicated microservice runtime environments. Familiarity with the Open Container Initiative and both the Docker and containerd runtimes.
- Hands-on object-oriented development and/or scripting experience in languages such as Java, Python, and Go.
- For Observability, experience implementing Federated Prometheus, Grafana, and Sentry preferred.
- For Automation, experience with GitLab CI/CD, Terraform, and Ansible AWX preferred.
- For Cloud, AWS experience preferred.
- Contributions to Open Source projects especially related to Observability and Automation.
- Thorough knowledge and experience of building highly available distributed systems, consensus protocols, service discovery, multi-tenancy paradigms, and operating in AWS resiliently at scale with low operational overhead.
- A keen interest in keeping ahead of the technological advances in the SRE space and proven success at incorporating new technology into existing systems.
- Previous experience mentoring and managing small teams, with a desire to grow vertically with us and the potential to build a team of SREs reporting to you. Of course, growing as an Individual Contributor with us is equally appealing!
- A proven champion and evangelist for driving and adopting SRE best practices.
Please email James@mccabebarton.com