About this position
We’re a Portland-based team! This role is mostly remote, but candidates must be located in or near Portland, Oregon, to facilitate occasional in-office collaboration.
Cloud Operations is a fast-paced team responsible for ensuring the reliability, security, and performance of IDX’s 24x7x365 SaaS platform.
As a Production Support Engineer, you will develop deep product and platform knowledge and apply your technical expertise to investigate, triage, and resolve production incidents and system defects. You will play a critical role in maintaining and improving the availability and performance of services our clients rely on, using monitoring, analysis, and automation to drive measurable improvements.
In this role, you will perform detailed defect analysis to assess impact and urgency, leverage specialized tools to support root cause investigations, and partner closely with Engineering and other teams to drive resolution. You will also contribute to continuous improvement efforts by streamlining manual processes through automation and cloud technologies.
You’ll be part of a highly collaborative team that responds to live incidents, proactively monitors business-critical systems, and operates with a strong sense of ownership and accountability. Together, we’re focused on protecting our clients from identity risk while building a company and workplace we’re proud of.
Role and Responsibilities
Monitoring and Alerting
- Proactively monitor production systems to ensure environment health, stability, and availability using AWS-native and third-party monitoring tools (e.g., CloudWatch, Dynatrace).
- Respond to and triage major production incidents, gathering logs, metrics, and other data to validate and reproduce issues and assess their impact.
- Drive incident investigations by providing clear summaries, validating issues across platforms, and escalating to appropriate teams for resolution.
Defect Management
- Assess, log, categorize, and track system defects in accordance with established defect management processes and standard operating procedures.
- Monitor defect status through resolution and communicate progress to stakeholders.
- Use specialized tools and analysis to provide actionable insights that support root cause identification and ongoing platform improvements.
Communications and Reporting
- Communicate incident and defect analysis to Management, Information Security, Client Services, and Engineering.
- Support outage communications, including drafting and distributing updates for internal teams, executive leadership, and external customers as appropriate.
- Contribute to Root Cause Analysis (RCA) documentation, detailing incident timelines, impact, contributing factors, and corrective actions.
- Generate and maintain service availability and performance metrics for reporting and trend analysis.
Engineering Collaboration and Automation
- Partner closely with Engineering to troubleshoot and resolve production issues.
- Develop and maintain automation to improve reliability, reduce manual effort, and ensure repeatable operational outcomes.
- Address Information Security–related requests through manual intervention or Infrastructure as Code (Terraform).
- Participate in code and infrastructure reviews with a focus on security, reliability, and cost efficiency.
Operational Support
- Monitor service request queues and manage support tickets in alignment with established SLAs.
- Document systems, processes, and operational procedures to ensure clarity and knowledge sharing.
- Use diagnostic tools to identify root causes and implement effective resolutions.
- Perform additional duties as assigned to support the stability and success of the platform.
Qualifications
- 4+ years of progressive experience in SaaS operations, application administration, or technical support roles supporting production systems.
- Bachelor’s degree in a technical field or equivalent hands-on experience supporting SaaS technologies in AWS.
- Technical Skills: Hands-on experience with AWS cloud services, Azure DevOps, Linux-based systems, containerized workloads (Docker), application monitoring tools, Infrastructure as Code (Terraform), scripting, and Git.
- Availability & Teamwork: Willingness and ability to participate in after-hours incident response and on-call rotations. Flexible, dependable, and comfortable collaborating across teams.
- Performance & Ownership: Self-motivated and results-driven, with the ability to apply specialized knowledge to solve complex operational problems.
- Attention to Detail: Able to work independently while maintaining strong communication and a collaborative presence within the team.
IDX is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other legally protected characteristic.