EOP - System Reliability Engineer - TS/SCI Required

cFocus Software Incorporated | Washington, District of Columbia, United States | Full-Time | Engineering

About this position

cFocus Software seeks a System Reliability Engineer to join our program supporting the Executive Office of the President. This position is remote. This position requires a TS/SCI clearance.

Qualifications:
  • 5+ years of experience and a Bachelor's degree in Computer Programming, Science, Engineering, or a related technical discipline, or the equivalent combination of education, technical training, or work/military experience, including:
  • 3+ years of related systems programming experience
  • Experience maintaining an operational environment and using monitoring tools and dashboard interfaces (e.g., Kibana, Grafana)
  • Experience working with container images and platforms (Kubernetes/Docker)
  • Strong understanding of DevOps and software/application development processes
  • Understanding of GitLab, Jenkins, ArgoCD, and other DevOps/Continuous Integration tools for Kubernetes
  • Understanding of microservice design and architectural pattern best practices
  • Understanding of Python, Bash, and Shell scripting
  • Knowledge of network technologies, common infrastructure components, load balancers, firewalls, and virtual and physical infrastructure design
  • Problem-solving and troubleshooting skills
  • Communication and interpersonal skills
  • Must possess excellent time management skills and the drive to work unsupervised
  • Experience deploying to on-premises/data center infrastructure
  • Experience using Jira and Confluence on a daily basis
  • Experience building processes for deploying to a Kubernetes-based environment using GitLab and Helm
  • Understanding of access management and security groups (e.g., IAM, S3 buckets, SSH, VPN)
  • Ability to write and run unit and functional tests
  • Technical Skills: Proficiency in programming languages (such as Python, Go, or Bash) is essential for scripting and automation tasks. Knowledge of Linux/Unix systems is also crucial, as SREs often work in these environments.
  • Problem-Solving: Analytical and problem-solving skills are necessary to diagnose and resolve complex system issues effectively.
  • Understanding of SRE Principles: Familiarity with key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets is important for measuring and maintaining system reliability.
  • Reliability and Availability: SRE practices help ensure that services are consistently available and reliable, which is critical for user satisfaction and business success.
  • Scalability: SREs implement strategies that allow systems to scale efficiently as demand increases, ensuring that performance remains optimal even under heavy load.
  • Cost Management: By optimizing resource usage and reducing downtime, SREs contribute to cost savings for organizations.
  • Programming and Scripting: Proficiency in languages like Python, Go, or Ruby is crucial for automating tasks and managing infrastructure.
  • Operating Systems: A strong understanding of Linux/Unix systems is essential for troubleshooting and managing servers.
  • Cloud Computing: Familiarity with cloud platforms like AWS, Azure, or Google Cloud is vital for deploying and managing applications in distributed environments.
  • Containers & Orchestration: Understanding containerization tools like Docker and managing containerized workloads with Kubernetes is crucial for cloud-native applications.
  • Monitoring and Logging: Proficiency in tools like Prometheus, Grafana, or the ELK Stack (Elasticsearch, Logstash, and Kibana) is necessary for tracking metrics, setting up alerts, and analyzing logs.
  • Networking: Knowledge of networking protocols and configurations is essential for maintaining system health and performance.
  • Configuration Management: Skills in managing and maintaining system configurations are critical for ensuring system reliability.
  • Incident Response: Ability to respond quickly and effectively to incidents, including documenting and learning from them.
  • Security Best Practices: Understanding security protocols and best practices to protect systems from vulnerabilities.
  • These skills are essential for SREs to maintain high availability and performance, balancing the demands of development and operations.
  • Support is required during core business hours, 8 a.m. to 5 p.m., Monday through Friday.
  • On-call availability on evenings or weekends as needed for outages, application upgrades, security patches, or other unplanned activities.

Duties:
  • Monitor system health, availability, and performance using centralized monitoring and logging tools.
  • Administer accounts (role-based access and rights).
  • Manage accessibility to the application through EOP’s authentication systems.
  • Manage the workflow templates to ensure consistent and predictable task flows.
  • Configure new workflows or adjust existing ones based on user requests, while adhering to EOP template standards.
  • Maintain configurations and configurable fields for users and workflows.
  • Maintain the test environment to mimic production and conduct test and evaluation in the environment prior to deployments.
  • Design and maintain secure, reliable backups, ensuring High Availability (HA) and resiliency.
  • Develop a Disaster Recovery (DR) or Incident Response (IR) plan for specific applications and services in the event of a disaster or unexpected downtime.
  • Maintain unique instances that support various offices.
  • Configure and support integrations with complementary systems.
  • Establish and improve system monitoring while maintaining established security protocols within development, test, and production systems.
  • Architect, build, and maintain on-premises and/or cloud infrastructure to support team and customer initiatives.
  • Maintain and improve existing infrastructure (build out autoscaling, support new services, optimize for cost efficiencies/authentication/search, etc.).
  • Administer production, staging and development environments.
  • Manage and aggregate server logs and monitor for security and system related incidents.
  • Monitor and analyze system performance, such as server load and resource usage.
  • Maintain and improve existing build and deployment processes using CI/CD tools.
  • Apply configuration management disciplines to maintain software revisions, security patches, hardening, and documentation.
  • Enforce best practices for security and reliability, and drive security initiatives, like access control and vulnerability testing.
  • Maintain up-to-date documentation of designs/configurations, ensuring team members have continuity of recurring tasks.
  • Maintain status of operations at all times: perform after-action reporting on all outages and work with engineering teams to determine the root cause and solution. Present findings to management for prioritization and tasking.
  • Create and determine required metrics for dashboards and service health.
  • Follow up on engineering tasks for operational solutions and validate completion.
  • Manage operational readiness board – present at weekly meetings and determine if development services are ready for automation based on best practices and maintainability.
  • Track and ensure routine operations maintenance tasks are completed in a timely manner.
  • Align to the customer's strategies for configuration of workflows, without compromising the integrity of the workflow tool and templates.
  • Build, maintain, and utilize the customer's enterprise Development, Security, and Operations (DevSecOps) pipeline.
  • Work with other service providers to support areas of common interest.
  • On-call support may be required.
