i4DM

Sys/Cloud Admin/Incident Response Engineer

Posted 2 Days Ago
Remote
Hiring Remotely in USA
Mid level
Provide hands-on cloud and system administration, monitoring, and 24x7 operational support. Detect, triage, and escalate incidents, coordinate cross-team response, restore services, maintain observability, automate routine tasks, and contribute to post-incident reviews and continuous improvement within a VA mission environment.
The summary above was generated by AI
Description

About Our Team 

Our employees thrive in a culture that is fast-paced, collaborative, and ego-free, where innovation and teamwork are encouraged at every level. We provide Federal agencies with immediate access to highly skilled professionals who understand complex mission challenges and deliver efficient, scalable solutions. By continuously investing in talent, technology, and specialized capabilities, we maintain expert teams prepared to support evolving Federal missions through tailored technical solutions and modern service delivery approaches. 

We value diverse perspectives and strive to attract talent from all backgrounds. We are seeking professionals who are passionate about technology, mission success, and solving complex operational challenges with creativity and purpose. If you enjoy expanding your technical expertise while supporting impactful Federal initiatives, you will thrive within our organization. Veterans and military spouses are strongly encouraged to apply and bring their valuable experience to our team. 


About the Role 

We are seeking an experienced and highly motivated Sys/Cloud Admin/Incident Response Engineer to support enterprise monitoring operations, incident detection, response activities, and operational situational awareness for a mission-critical platform within the Department of Veterans Affairs (VA) environment. 

In this role, you will provide hands-on administration and operational support to help ensure monitoring and incident management processes effectively sustain system reliability, operational continuity, and rapid restoration of services across a large-scale, 24x7 enterprise healthcare platform. 

You will work closely with the Monitoring & Incident Management Manager, Program Manager, Technical Directors, DevSecOps & SRE teams, and VA stakeholders to identify, escalate, communicate, and help resolve incidents in alignment with strict service-level expectations and operational standards. 


RESPONSIBILITIES 

Monitoring, Administration & Operational Support 

  • Administer, monitor, and support cloud and platform services, virtual infrastructure, and hosted applications to maintain system health, availability, and performance. 
  • Configure, tune, and maintain monitoring, logging, and alerting solutions to improve visibility across infrastructure, applications, and service dependencies. 
  • Validate alert accuracy, reduce noise, and help ensure operational issues are detected proactively through effective observability practices. 
  • Perform routine system administration tasks such as environment checks, service restarts, access support, patch coordination, and operational maintenance activities. 

Incident Response & Service Restoration 

  • Monitor incident queues and system alerts, perform initial triage, document impact, and execute defined escalation procedures for incidents affecting mission-critical services. 
  • Participate in major incident response activities, including troubleshooting, log review, coordination with engineering teams, and support for service restoration efforts. 
  • Follow incident response playbooks, severity models, and communication protocols to support timely resolution and accurate status reporting. 
  • Document incident timelines, actions taken, recovery steps, and supporting evidence to enable post-incident review and continuous improvement. 

Operational Coordination & Stakeholder Support 

  • Support coordination during operational events by working across infrastructure, application, DevSecOps, SRE, and service management teams. 
  • Provide clear, timely updates on incident status, service impact, troubleshooting progress, and recovery actions to internal stakeholders. 
  • Escalate issues appropriately based on impact, urgency, and established operational procedures. 
  • Maintain accurate operational records in ticketing, incident, and knowledge management systems. 

Observability, Automation & Continuous Improvement 

  • Partner with engineers and platform teams to improve dashboards, alerts, runbooks, and operational procedures supporting reliable service delivery. 
  • Identify recurring operational issues, alert gaps, and system weaknesses, and recommend practical improvements to reduce incident frequency and response time. 
  • Support automation efforts for routine operational tasks, alert correlation, remediation workflows, and incident response activities where applicable. 
  • Contribute to post-incident reviews, root cause analysis activities, and implementation of corrective or preventive actions. 

Reporting, Compliance & Operational Readiness 

  • Help maintain operational reporting on incidents, system health, availability, and response metrics to support service-level objectives and operational reviews. 
  • Ensure incident records, escalation paths, standard operating procedures, and response documentation remain current and usable. 
  • Support compliance with operational policies, security requirements, and change management practices in cloud and enterprise environments. 
  • Participate in on-call or after-hours operational support, as required, in a 24x7 mission-driven environment. 

Requirements

QUALIFICATIONS 

  • Bachelor’s degree in Information Technology, Computer Science, Engineering, Cybersecurity, or a related field; equivalent relevant experience may be considered. 
  • 3+ years of experience in systems administration, cloud operations, site reliability, network operations, incident response, or enterprise production support roles. 
  • Hands-on experience supporting Windows and/or Linux server environments, cloud-hosted infrastructure, and enterprise application platforms. 
  • Experience with monitoring, logging, and observability tools used to detect, investigate, and troubleshoot service disruptions. 
  • Working knowledge of incident management processes, ticketing workflows, escalation practices, and service restoration procedures in ITIL-aligned environments. 
  • Ability to analyze logs, alerts, and system behavior to support troubleshooting and rapid issue resolution. 
  • Strong written and verbal communication skills, with the ability to document incidents and coordinate effectively across technical and non-technical stakeholders. 
  • Ability to work in a 24x7, SLA-driven environment and participate in operational response activities under time-sensitive conditions. 
  • Candidates must be eligible to obtain and maintain a Public Trust clearance. 

PREFERRED QUALIFICATIONS 

  • Experience supporting VA or other Federal Government environments, including familiarity with operational reporting, service management, and compliance expectations. 
  • Experience with cloud and platform technologies such as AWS, Azure, Kubernetes, container platforms, virtualization, or hybrid infrastructure. 
  • Familiarity with enterprise monitoring and observability platforms such as Splunk, Dynatrace, CloudWatch, Azure Monitor, Grafana, or similar tools. 
  • Experience using scripting or automation tools such as PowerShell, Python, Bash, or infrastructure automation frameworks to streamline operational tasks. 
  • Exposure to DevSecOps, Site Reliability Engineering (SRE), SAFe Agile, or modern incident response and post-incident review practices. 
  • Relevant certifications such as AWS Certified SysOps Administrator, Azure Administrator Associate, CompTIA Security+, ITIL Foundation, Splunk, or similar credentials. 
