Site Reliability Engineer
Worlds
Location: Hybrid (a mix of remote work and on-site work at the Worlds Plano HQ), with preference given to candidates near the Dallas-Fort Worth Metroplex.
Reporting to: Leader of Client Delight
The Client Delight team is responsible for comprehensive service delivery and ensuring ultimate customer satisfaction, encompassing Implementation, Operations, Security, Customer Support, Change Management, and IT. This role involves close collaboration with Worlds Engineering & Development, Sales & Solutions Architecture, and other cross-functional teams to provide holistic management of the Worlds customer experience.
About Worlds: Worlds is an AI Platform that enhances visibility and automates physical operations by applying AI across existing camera networks. Our end-to-end solution enables enterprises to model, train, and build automation into their physical environments, helping them develop applications that measure, detect, and track objects in real-time, impacting efficiency, safety, and security. Learn more at worlds.io.
Job Summary: The Site Reliability Engineer (SRE) is a critical hands-on role responsible for the deployment, monitoring, and operational support of the Worlds AI platform for our customers. The SRE ensures the reliability, scalability, and security of customer solutions, acting as the primary technical resource for implementation and incident management. This role requires a blend of cloud infrastructure expertise, automation skills, and a passion for customer success, ensuring our Fortune 500 clientele receive best-in-class service and support.
Key Responsibilities:
- Solution Implementation: Deploy, configure, and update customer solution environments in Azure Kubernetes Service (AKS) and other cloud platforms (AWS, GCP, private cloud), using infrastructure-as-code and deployment tooling such as Bicep and Helm charts.
- Custom Solution Integration: Work with our Forward Deployment Engineering team on custom development and integration of the Worlds app within the customer's operation. Provide monitoring guidance, support documentation, and solution health dashboards as required to manage the custom solution.
- Monitoring & Alerting: Implement, tune, and manage monitoring and alerting solutions using Prometheus and Grafana to meet customer SLAs and ensure optimal performance. Collaborate with Core Engineering to define and integrate application telemetry.
- Incident Management & Support: Provide production support and lead incident management processes following ITIL guidelines. Troubleshoot and resolve issues, escalating tool-related issues to DevOps and Worlds app stack issues (functionality or performance) to Core Engineering, while building the knowledge needed to reduce escalations over time.
- Knowledge Management: Develop and maintain comprehensive customer runbooks in Confluence, documenting unique solution architectures and return-to-service procedures to ensure operational readiness.
- System & Performance Testing: Test new configurations, including performance and load testing, to validate solution stability and scalability.
- Security & Compliance: Adhere strictly to Worlds’ Acceptable Use Policy (AUP) and Access Control Policy, operating with the principle of least privilege to ensure the security and compliance of customer environments.
- Customer Communication: Serve as a key technical point of contact for customers, communicating effectively on project status, incidents, and operational performance.
Qualifications & Experience:
- Networking (5+ years): Deep experience configuring and troubleshooting TCP/IP networks, including subnetting, routing, firewalls, and VPN solutions (OpenVPN, WireGuard).
- Linux Administration (3+ years): Proficient in building, troubleshooting, and managing Linux servers, including remote access, service verification, and log analysis.
- Cloud Administration (2+ years): Demonstrable experience managing cloud solutions in Azure (required); familiarity with AWS or Google Cloud is a bonus. Expertise in containerized solutions (Docker, Kubernetes), IaaS (VMs), DNS, IAM, and logging services is essential.
- Automation & Scripting: Experience with configuration management and container deployment tools such as Ansible, Docker, Kubernetes, and Helm is highly preferred. Proficiency in scripting with Bash and Python for automation is required.
- Database: Ability to write and execute basic SQL SELECT queries for troubleshooting and data verification.
- IT Service Management: Experience with ITSM frameworks (ITIL) and tools (e.g., Jira Service Management) for incident and problem management.
- AI/ML: Familiarity with Artificial Intelligence (AI) and Machine Learning (ML) concepts is a plus; Worlds will provide training on our specific platform.
Personal Qualities:
- Passionate about delivering Client Delight and taking ownership of the customer experience.
- Ability to thrive in a fast-paced startup environment, iterating quickly on solutions and processes while driving the maturation of operations for security and efficiency.
- A proactive and collaborative mindset, with excellent problem-solving and communication skills.
Perks and Benefits:
- 100% employer-paid medical coverage for employees and dependents.
- Comprehensive benefits including dental, vision, 401(k), and disability.
- Flexible PTO policy.
- Employee stock options.
Qualified candidates should send a cover letter and resume to careers@worlds.io.
The above statements are intended to describe the general nature and level of work performed by employees assigned to this job. They are not intended to be an exhaustive list of all duties, responsibilities, and qualifications.
Join us at Worlds and help shape the future of industrial operations with cutting-edge technology!