Healthcare · HIPAA
February 10, 2026 · 10 min read

Deploying HIPAA-Compliant AI: What an Air-Gapped LLM Architecture Actually Looks Like

Moneeb Abbas

AI Systems Architect

Healthcare organizations are under pressure to adopt AI. They are also under legal obligation to ensure Protected Health Information (PHI) never leaves their control. These two facts create a specific engineering problem: you need an LLM that works offline, passes a compliance review, and does not require your clinical staff to get a degree in DevOps to operate.

What HIPAA Actually Requires for AI Systems

HIPAA does not prohibit AI in healthcare. It does require specific controls around any system that processes or stores PHI. For an LLM deployment, the relevant requirements are:

  • Data never leaves your covered entity boundary without a signed Business Associate Agreement (BAA), and standard API access from AI providers typically does not come with one by default
  • All access to PHI must be logged with user identity, timestamp, and the data accessed
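  • Encryption at rest and in transit. The Security Rule treats encryption as an addressable safeguard and does not name an algorithm; AES-256 at rest and TLS 1.2+ in transit are the usual baseline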
  • Access controls limiting which staff can query which patient data
  • Breach notification procedures if PHI is exposed

The key implication: if you are sending clinical notes to OpenAI, Anthropic, or any cloud AI provider's standard API without a BAA, you are likely in violation. Some providers offer HIPAA-eligible tiers with BAAs — but for many healthcare organizations, the risk appetite for any third-party data processing is zero.

Warning: A BAA allocates compliance obligations contractually; it does not technically prevent PHI from transiting a third-party network. Organizations with strict data residency requirements often find that 'HIPAA-eligible cloud' is not sufficient.

The Air-Gapped Architecture

An air-gapped LLM deployment means the model runs on hardware you control, in your facility or private cloud, with no outbound network calls to inference APIs during operation. Here is the architecture we deployed for a clinical documentation system:

  • On-premises server: 2x NVIDIA RTX 4090 (or 1x A40 for production-grade deployments) — sufficient for 70B models with quantization
  • vLLM: Serving engine — runs the LLM locally, exposes an OpenAI-compatible API on the local network only
  • No internet egress: Firewall rules block all outbound traffic from the inference server; model weights are loaded once at setup
  • Audit logger: Every request and response is logged to an on-prem database with user identity and timestamp — satisfies HIPAA access log requirements
  • TLS termination: NGINX proxy handles TLS on the internal network; all PHI encrypted in transit even within the facility
  • Role-based access: Staff authenticate against Active Directory; the proxy enforces which endpoints each role can access
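
To illustrate how the audit logger and the role-based access check fit together, here is a minimal Python sketch. It is not the code from the deployment described above: the role names, the ROLE_ENDPOINTS table, the audit_log schema, and the use of SQLite are all assumptions chosen to keep the example self-contained. It also stores only a SHA-256 digest of the request, whereas the logger described above records the full request and response.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

# Hypothetical role-to-endpoint policy. A real deployment would derive this
# from Active Directory group membership rather than a hard-coded table.
ROLE_ENDPOINTS = {
    "nurse": {"/v1/chat/completions"},
    "admin": {"/v1/chat/completions", "/v1/models"},
}


def open_audit_db(path=":memory:"):
    """Create the audit table if it does not exist and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audit_log (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               ts_utc TEXT NOT NULL,
               user_id TEXT NOT NULL,
               role TEXT NOT NULL,
               endpoint TEXT NOT NULL,
               allowed INTEGER NOT NULL,
               request_sha256 TEXT NOT NULL
           )"""
    )
    return conn


def authorize_and_log(conn, user_id, role, endpoint, request_body):
    """Log the access attempt (allowed or denied) and return the decision."""
    allowed = endpoint in ROLE_ENDPOINTS.get(role, set())
    digest = hashlib.sha256(request_body.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT INTO audit_log "
        "(ts_utc, user_id, role, endpoint, allowed, request_sha256) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), user_id, role,
         endpoint, int(allowed), digest),
    )
    conn.commit()
    return allowed
```

Denied attempts are written to the log as well, since a reviewer will usually want to see who tried to reach an endpoint, not only who succeeded.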

Model Selection for Healthcare

Not every open-weight model is suitable for clinical use. The factors that matter:

  • Instruction following accuracy: Clinical staff will phrase queries in many ways — the model must be robust to informal language and medical abbreviations
  • Hallucination rate: In healthcare, a confident wrong answer is dangerous. We prioritized models with lower hallucination rates on medical benchmarks (MedQA, MedMCQA) over raw performance on general-purpose benchmarks
  • Context window: Clinical notes and discharge summaries can be long. A minimum 32K context window is required; 128K is preferable
  • License: Must allow commercial deployment and not require sharing fine-tune weights

For the deployment we built, Llama 3.1 70B (instruction-tuned) passed all criteria. We applied 4-bit GPTQ quantization to reduce the VRAM requirement to fit within the hardware budget without meaningful accuracy degradation on the medical task suite we tested against.
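
The arithmetic behind the quantization choice can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not a measurement from the deployment: it counts weights only, treats 70B as exactly 70 billion parameters, and ignores the per-group scales and zero points that GPTQ stores alongside the 4-bit weights, so real usage will be somewhat higher. The 24 GB figure is the VRAM of a single RTX 4090.

```python
def weight_memory_gb(n_params_billion, bits_per_weight):
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8


fp16_gb = weight_memory_gb(70, 16)  # unquantized half-precision weights
int4_gb = weight_memory_gb(70, 4)   # 4-bit quantized weights

total_vram_gb = 2 * 24              # two RTX 4090 cards at 24 GB each
headroom_gb = total_vram_gb - int4_gb

print(f"FP16 weights:  {fp16_gb:.0f} GB")
print(f"4-bit weights: {int4_gb:.0f} GB")
print(f"Headroom on {total_vram_gb} GB: {headroom_gb:.0f} GB")
```

Under these assumptions the half-precision weights need about 140 GB and the 4-bit weights about 35 GB, leaving roughly 13 GB of the 48 GB for the KV cache, activations, and quantization metadata. That headroom is what limits how much of a long context window can be used concurrently.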

The Compliance Checklist That Passed the Audit

  1. Written system design document describing data flows, storage locations, and network topology, provided to the compliance officer
  2. PHI data flow diagram showing that no PHI leaves the on-premises network boundary
  3. Encryption attestation: AES-256 at rest (full-disk encryption on inference server), TLS 1.3 in transit
  4. Access log schema and retention policy (minimum 6 years per HIPAA requirement)
  5. User access control policy: role assignments, access review cadence, offboarding procedure
  6. Incident response plan: steps to take if a breach is detected, notification timelines
  7. Staff training documentation: what the AI system does, what PHI it can access, how to report issues
  8. Vendor assessment: only open-source components with auditable code; no third-party SaaS in the data path

Tip: Compliance reviewers care as much about documentation as they do about technical controls. A well-architected system without documentation will fail a HIPAA audit as easily as a poorly-architected one.
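
The six-year retention figure in the checklist is easy to get subtly wrong when it is turned into a purge job. As a minimal sketch (the function names and the choice to round a February 29 record forward to March 1 are assumptions, not part of the audited system), the cutoff logic could look like this:

```python
from datetime import date

RETENTION_YEARS = 6  # minimum retention period from the checklist above


def retention_expiry(created: date, years: int = RETENTION_YEARS) -> date:
    """Earliest date on which a record created on `created` may be purged."""
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # Created on Feb 29 and the target year is not a leap year:
        # round forward to Mar 1 so the record is never purged early.
        return created.replace(year=created.year + years, month=3, day=1)


def may_purge(created: date, today: date) -> bool:
    """True only once the full retention period has elapsed."""
    return today >= retention_expiry(created)
```

Rounding forward rather than backward errs on the side of keeping a record a day longer than required, which is the safer failure mode for an audit log.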

Operational Considerations

Deploying on-premises means owning the operational overhead that cloud providers normally absorb. For healthcare organizations considering this path:

  • Model updates: Plan a quarterly model evaluation cycle. You need a process for testing a new model version before promotion to production.
  • Hardware maintenance: GPU servers require more attention than cloud instances. Work with your IT team to define SLAs and failure procedures.
  • Backup and recovery: The model weights can be re-downloaded outside the air gap and transferred in again, but the audit logs and configuration must be backed up and recoverable.
  • Monitoring: Set up alerting for server health, GPU utilization, and inference latency. Grafana + Prometheus works well for this.
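
For the latency part of monitoring, the alert condition itself is simple. In practice it would live in a Prometheus alerting rule; the Python sketch below only illustrates the logic. The 5-second threshold and the function names are hypothetical and should be replaced with values taken from your own baseline.

```python
import statistics


def p95(latencies_s):
    """95th-percentile latency in seconds; needs at least two samples."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(latencies_s, n=20)[18]


def latency_alert(latencies_s, threshold_s=5.0):
    """True when the 95th-percentile latency exceeds the threshold."""
    return p95(latencies_s) > threshold_s
```

Alerting on a high percentile rather than the mean keeps a handful of slow requests on long clinical notes from being averaged away.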

The system we deployed passed its compliance review on the first submission. In the clinical setting, it reduced documentation time for nurses by an average of 22 minutes per shift — the outcome the organization was looking for. The compliance overhead paid for itself quickly.
