Site Reliability Engineer

Product Development - Operations · London, Greater London
Department Product Development - Operations
Employment Type Full time
Minimum Experience Experienced

The role

 

The Product Development team plays a pivotal role in delivering the overall consumer and customer digital experience underpinning the Rightmove business.  We recognise that to deliver the best products and features for our consumers and customers we need to work effectively as a healthy, high performing team.  We work collaboratively across a mix of product and centralised teams, together working towards Rightmove’s strategy. 

 

The SRE Team sits within the Technology Operations and Platform Engineering Group and is responsible for building and facilitating world class application fitness and reliability within the product teams. Alongside this they are responsible for providing operational changes and support for product teams ensuring they can focus on delivering product value to Rightmove Customers and Consumers.

 

The role will report to the Engineering Manager – Operations and will work alongside the Infrastructure, Platforms, DBA and Architecture teams. The SRE team will also be working very closely with Rightmove development teams to help ensure the continued industry leading availability, performance and security of Rightmove’s services.

 

Key responsibilities

 

  • Proactively engage with product teams e.g. attend sprint planning, early design meetings or set up reliability reviews to help them prioritise, plan for, and manage Application Fitness and Reliability (including security)
  • Reduce handoffs and improve flow/lead times within development teams by providing operational/infrastructure support e.g. building new microservices, migrating traffic between old and new microservices, infrastructure capacity management, load balancer changes.
  • Triage and action security related tickets from our managed threat detection service - working with infra/platforms/product or on-call teams to deliver fixes
  • Incident/problem management process ownership and management, ensuring teams and people know what to do, and when, during an incident or production issue
  • Support in-depth analysis of live service problems, always pushing to restore service as the priority
  • Alerting, on-call and incident management tooling and software ownership and management
  • Drive the adoption of SLOs, SLIs and error budgets so teams can balance speed of new features with reliability
  • Strive to continuously improve Service Level KPIs such as MTTD and MTTR
  • Eliminate operational toil & engineering knowledge silos through automation

 

Customer and Consumer focus

 

  • Actively develop a deep understanding of Rightmove, our customers and consumers and how your role supports our goals.
  • Ensure that the actions you take have our customers and consumers at heart; consider internal and external impacts and support our business goals.
  • Proactively contribute to initiatives that aim to improve our customer or consumer experience, both identifying your own contribution and supporting others in theirs.

 

 Collaboration

 

  • Promote integration between the SRE Team and other areas (for example Development teams) and help develop a ‘DevOps’ mindset and culture in the wider ‘technology team’ through automation, measurement and sharing.
  • Help build an environment where all teams have easy access to KPIs, metrics and telemetry relevant to the services they are tasked with delivering and maintaining so that they are empowered to fully ‘own’ and take responsibility for their services within Rightmove.
  • Be a key part of the Technical Operations and Platform Engineering team, working closely with the DBA, Infrastructure, Architecture and Development Teams to ensure Rightmove continues to deliver industry leading levels of availability and performance while helping deliver our business plan and commercial objectives each year.

 

Required competencies

Technical Capabilities

 

  • Deep, low-level debugging/troubleshooting and analytical skills: ability to isolate/identify root causes between network, infrastructure, application, and database stacks
  • Deep understanding of what it takes to reliably monitor and manage web applications and infrastructure e.g. Java, Node.js, Kubernetes, Docker, Linux, Google Cloud Platform
  • Experience working with logging, monitoring, and alerting tools e.g. Nagios/Xymon, Elastic APM, Kibana, Prometheus, Grafana, PagerDuty
  • Experience with Infrastructure-as-Code and automation tooling (e.g. Terraform, Puppet, Ansible, Rundeck)
  • Software development or in-depth scripting experience with languages such as Python, Golang, or Bash
  • Experience working with an Incident Management process and helping others in high pressure situations
  • Excellent understanding of IT security principles (specifically as they apply to web applications) and experience diagnosing and troubleshooting security-related issues
  • Experience using continuous delivery tools and processes in an organisation with multiple delivery teams e.g. GitLab, Jenkins, Bitbucket
  • Knowledge and understanding of operational best practice – ideally in a high traffic web services environment.
  • Experience with agile ways of working e.g. Scrum, Kanban.

 

Engineering and Operational Excellence

 

  • Can make good judgement calls on prioritisation of incidents or urgent and important work. Owns the resolution through to completion, consistently communicates status and escalates any issues promptly
  • Fosters an environment of action bias by picking up the phone, asking for help, escalating and pushing for resolution. Doesn’t let themselves or the team be stuck for days at a time, and does not procrastinate or put off work for another time
  • Takes ownership of delivering work to agreed timelines, anticipates risks, communicates status, manages all dependencies, and works to remove any technical and organisational blockers or obstacles, even when it’s “not your job.”
  • Stays inspired, acquires new knowledge, and innovates in their work
  • Can generate multiple solutions to a problem, weigh up pros and cons of each across technical and non-technical dimensions and can make a recommendation
  • Always tries to simplify processes or technology for scalability, operability and to reduce lead/delivery timelines as much as possible
  • Sees problems as challenges and opportunities, seeks to understand them and ignores boundaries between jobs and departments if necessary, to help resolve them
  • Actively seeks ways to increase software delivery velocity and capacity whilst ensuring we meet agreed service levels for customers and consumers
  • Actively seeks opportunities to make people, process and technology more efficient
  • Actively seeks to improve their own standard of work and that of the team

 

Working with others

 

  • Encourages contributions from everyone and shows an interest in what others have to say
  • Builds long-term relationships with colleagues based on mutual trust and respect
  • Takes ownership of their actions and the effect they have on our business and colleagues
  • Deals openly and honestly with others
  • Supports a culture of accountability & transparency
  • Completely trusted to keep confidences
  • Demonstrates consistency between messaging and actions
  • Works collaboratively and flexibly with others to achieve timely, predictable delivery
  • Goes the extra mile for their teammates by taking an even share of unplanned, project or support work even if the work isn’t desirable
  • Helps to drive an increase in the output of the team as a whole
  • Positively commits to change and helps to lead others through it
  • Is a positive driving force within the team and externally. Treats people with respect and understanding, and considers the situation and point of view of others
  • Engages with others in a positive, proactive and collaborative manner

 

Communication

 

  • Consistently communicates status changes on project or support work
  • Communicates confidently, clearly and concisely to a range of audiences
  • Chooses methods of communication that are appropriate to the situation and the audience
  • Encourages two-way conversation through active listening and questioning
  • Writes clearly and succinctly in a variety of communication settings and styles

 
