Sr. Site Reliability Engineer (SRE)

Company Description

NBCUniversal is one of the world's leading media and entertainment companies. We create world-class content, which we distribute across our portfolio of film, television, and streaming, and bring to life through our theme parks and consumer experiences. We own and operate leading entertainment and news brands, including NBC, NBC News, MSNBC, CNBC, NBC Sports, Telemundo, NBC Local Stations, Bravo, USA Network, and Peacock, our premium ad-supported streaming service. We produce and distribute premier filmed entertainment and programming through Universal Filmed Entertainment Group and Universal Studio Group, and have world-renowned theme parks and attractions through Universal Destinations & Experiences. NBCUniversal is a subsidiary of Comcast Corporation.

Our impact is rooted in improving the communities where our employees, customers, and audiences live and work. We have a rich tradition of giving back and ensuring our employees have the opportunity to serve their communities. We champion an inclusive culture and strive to attract and develop a talented workforce to create and deliver a wide range of content reflecting our world.

Comcast NBCUniversal has announced its intent to create a new publicly traded company (Versant) comprising most of NBCUniversal's cable television networks, including USA Network, CNBC, MSNBC, Oxygen, E!, SYFY and Golf Channel, along with complementary digital assets Fandango, Rotten Tomatoes, GolfNow, GolfPass, and SportsEngine. The well-capitalized company will have significant scale as a pure-play set of assets anchored by leading news, sports and entertainment content. The spin-off is expected to be completed during 2025.

Job Description

As a Principal Site Reliability Engineer (SRE) overseeing our digital application portfolio, you will lead efforts to ensure the reliability, scalability, and performance of the platforms behind our web, mobile, and OTT experiences. You'll work across a diverse ecosystem of products and technologies, helping with architectural decisions, shaping reliability standards, and championing operational excellence at scale.

You will serve as a strategic partner to engineering, product, security, and infrastructure teams, guiding system design for high availability, leading incident response across critical services, and embedding SRE best practices across the software development lifecycle. Your role will include evolving observability frameworks, advancing infrastructure-as-code maturity, and automating tooling to accelerate delivery while maintaining stability.

Success in this role is defined by your ability to influence engineering culture, mentor teams, and drive systemic improvements that raise the bar for operational resilience. You'll take a proactive, data-driven approach to identifying and addressing risks before they impact users. Collaboration across teams, including video engineering, content delivery, data, and customer experience, is key to delivering digital products that are not only innovative but consistently reliable.

What We Value

Site Reliability Engineers are the champions of reliability and customer trust in production. We value engineers who are driven by a desire to deliver the best possible customer experience, ensuring that every interaction across our web, mobile, CTV, and video platforms is fast, seamless, and dependable. We look for systems thinkers who act with urgency, collaborate deeply, and apply a data-driven mindset to everything they do. Curiosity, clear communication, and continuous improvement are at the heart of our culture. As a Principal SRE, you'll lead by example: mentoring others, shaping best practices, and helping us build resilient systems that scale.

Responsibilities:

  • Design and implement tools, processes, and frameworks to proactively monitor, measure, and improve the performance, availability, and reliability of production applications.
  • Define and maintain key Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to uphold system reliability and user experience targets.
  • Evaluate applications and services for production readiness—ensuring they meet operational, security, and customer experience requirements before launch.
  • Establish comprehensive observability practices—including real-time monitoring, alerting, and telemetry—to ensure deep visibility into system health and user impact.
  • Serve as a feedback loop to engineering teams—analyzing production behavior, identifying reliability gaps, and driving architectural and operational improvements.
  • Collaborate with security and infrastructure teams to proactively address vulnerabilities and maintain compliance across production systems.
  • Partner with product and platform teams to ensure operational insights inform development priorities and release strategies.
  • Lead post-incident reviews and foster a culture of continuous learning, improvement, and resilience.
  • Participate in a 24/7 on-call rotation to support critical services and ensure rapid incident response.

Qualifications

Must-Haves:

  • Willingness to work onsite and participate in a 24/7 on-call rotation, including evenings, overnights, weekends, and holidays with minimal notice.
  • Demonstrated experience supporting digital news and content platforms across web, mobile, CTV, and video-rich environments, with a strong focus on performance and user experience.
  • 10+ years of experience managing and optimizing large-scale, high-traffic websites.
  • 10+ years of hands-on experience with application deployment processes and CI/CD pipelines.
  • 5+ years improving performance and reliability for OTT (Connected TV) and mobile applications.
  • 5+ years supporting microservices and multi-tier distributed systems.
  • 5+ years implementing software automation frameworks for reliability and operational efficiency.
  • 5+ years of experience with cloud platforms, including AWS and Google Cloud Platform (GCP).
  • 5+ years working with observability and APM tools such as Datadog, New Relic, AppDynamics, Sysdig, or Zabbix.
  • 3+ years working with reverse proxies like Varnish and Content Delivery Networks (CDNs) such as Akamai.
  • 5+ years scripting with languages such as Bash, Python, Perl, or Groovy.
  • 5+ years using configuration management tools such as Ansible, SaltStack, Chef, or Puppet.
  • 5+ years configuring and managing application servers (e.g., Tomcat, NGINX, Apache).
  • 5+ years of extensive experience with load and performance testing tools/frameworks such as JMeter, k6, or similar.
  • Hands-on experience using tools like Charles Proxy or Fiddler to triage and debug issues with web apps, mobile apps, and OTT devices.
  • High-level understanding of video streaming techniques and the ability to triage issues with mobile and OTT streaming applications.
  • 3+ years using performance validation tools such as Selenium, TestNG, or equivalent to drive improvements in production.

Preferred Qualifications:

  • 3+ years implementing and monitoring application/infrastructure security controls, including WAFs, site shields, and other perimeter protections.
  • 3+ years applying code and infrastructure security practices, including vulnerability remediation and secure deployment pipelines.
  • Relevant certifications in Performance Engineering or Site Reliability Engineering (SRE) are a plus.

Hybrid: This position has been designated as hybrid, generally contributing from the office a minimum of three days per week.

What we'll offer:

At CNBC Headquarters in Englewood Cliffs, NJ, you'll have access to great perks and amenities:

  • Sweat it out -- Free onsite fitness center with state-of-the-art equipment, plus daily group classes 
  • Eat up -- Gourmet cafeteria with daily specials plus soup and salad bars 
  • Extras -- Dry cleaning, shoe shining and sneak peeks  

Don't have a car? No problem! We offer free shuttle transportation to and from multiple locations in Manhattan, Brooklyn, Hoboken and Jersey City.

This position is eligible for company-sponsored benefits, including medical, dental and vision insurance, 401(k), paid leave, tuition reimbursement, and a variety of other discounts and perks. Learn more about the benefits offered by NBCUniversal by visiting the Benefits page of the Careers website.

Salary Range: $155,000 - $175,000


Information:

  • Company: NBCUniversal
  • Position: Sr. Site Reliability Engineer (SRE)
  • Location: Englewood Cliffs, NJ
  • Country: US

Post Date : 2025-07-25 | Expired Date : 2025-08-24