Enterprise-grade file uploads, part 3: the security layer
Third article in a series on what "enterprise-grade" actually means for user file uploads.
The first two articles in this series fixed the architecture. Part 1 moved file bytes off the API; part 2 made the permission-granting service stateless and elastic. With that foundation, the API can't really hurt you — it never touches user bytes, and its blast radius is bounded by what you sign for.
But the bytes are now in your object store, and at some point someone has to look at them. Every interesting attack against a file upload pipeline happens between the moment S3 receives the upload and the moment another user downloads it. This article is about that interval — what attackers actually try, how each defense works, and the pipeline shape that makes "we accept user uploads" a survivable design decision rather than a perpetual incident factory.
We'll pay special attention to two threats that get glossed over in the AWS tutorials: malware (the textbook one, harder than it looks) and decompression bombs (the one nobody plans for until it eats their worker fleet).
What attackers actually want
A useful threat model for user-file uploads, ranked by frequency in production breaches:
1. Serve malware to other users. Upload an `.exe` or `.docm` and get the download link distributed via your own product. Your domain reputation is part of the payload.
2. Stored XSS. Upload a malicious SVG, get it rendered inline in another user's session, exfiltrate cookies or perform actions on behalf of the victim.
3. DoS the processing pipeline. Upload a 10 KB PNG that decompresses to 50 GB, kill the worker. Repeat. Cheap to send, expensive to absorb.
4. SSRF via document formats. SVG, PDF, or OOXML files with external references that your worker dutifully fetches — from inside your VPC, with whatever credentials your worker has.
5. Tenant-boundary violation. Upload to your own folder, but somehow get the file readable from someone else's account. Or write to someone else's folder by tampering with the upload key.
6. Exfiltrate via filename, metadata, or path. PII or secrets embedded in a filename that ends up in your logs or in a third-party tool.
7. Cryptocurrency miner installation. If your worker runs untrusted user code (e.g., you process arbitrary scripts), uploads become a compute-theft vector.
8. Legal liability vehicle. CSAM, copyright-infringing material, illegal content. The legal team's concern more than the security team's, but the technical control surface overlaps.
This article addresses 1-6 directly. 7 is mostly about not running user code in the worker (don't do it). 8 is content moderation, which is a separate discipline.
The architectural pillar: two buckets
The single most important pattern, before any individual defense:
upload (signed URL)
┌────────────────────────┐
│ ▼
┌─────────────┐ ┌────────────────┐
Client │ Browser / │ │ quarantine │
│ mobile │ │ bucket │ ◄── ACL: workers only,
└─────────────┘ │ (write-only) │ no public read,
└────┬───────────┘ short TTL on objects
│
│ event (S3 / GCS Pub/Sub)
▼
┌─────────────┐
│ Worker │ validate → scan → strip → check
│ (Lambda / │
│ Cloud Run) │
└────┬────────┘
│
pass ┌──────┴──────┐ fail
▼ ▼
┌───────────┐ ┌────────────┐
│ Production│ │ Forensics │ ◄── locked-down,
│ bucket │ │ bucket │ access via incident
│ (served) │ │ (retained) │ response process
└───────────┘   └────────────┘

The properties this gives you:
- The download surface is never the upload surface. A user can never directly access what they just uploaded — only what your worker promoted to production. Even a perfect bypass of every validator in the worker still doesn't allow direct serve, because the production bucket is fed by a different path.
- Failures are quarantined, not deleted. If a worker flags a file, you keep it for incident response, you don't drop it. Attackers don't get to delete evidence by uploading once and walking away.
- Workers don't have write access to production. The IAM model is: workers can read quarantine, write production. They can't write quarantine (nothing writes there except the signed-URL flow). They can read production (for re-processing). Tight.
- The quarantine bucket has lifecycle rules. Objects that aren't promoted within N hours auto-delete. Stale quarantine never becomes a forgotten attack vector.
The naive "upload directly to the production bucket" pattern with an "is_ready" column in the database has the same idea but loses the bucket-level separation. A misconfigured signed URL, a leaked credential, or a race in your "is_ready" check, and the unprocessed file is suddenly servable. Two buckets give you a hard wall instead of a software flag.
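The worker's IAM shape described above can be sketched as an S3 policy. This is illustrative only — bucket names, Sids, and the exact action list are hypothetical and depend on your account layout:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    { "Sid": "ReadQuarantine",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::acme-quarantine/*" },
    { "Sid": "WriteProductionAndForensics",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": ["arn:aws:s3:::acme-production/*",
                   "arn:aws:s3:::acme-forensics/*"] },
    { "Sid": "ReadProductionForReprocessing",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::acme-production/*" }
  ]
}
```

Note what's absent: no `s3:PutObject` on quarantine (only the signed-URL flow writes there) and no `s3:DeleteObject` anywhere — the worker can't destroy evidence even if compromised.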
The validation pipeline
Once the worker has the file, the order of checks matters. Cheap-and-bounded first, expensive-and-unbounded later:
1. Filename / path policy ← microseconds, in worker
2. Size check ← microseconds
3. Magic-byte content-type ← milliseconds (read first 4-8 KB)
4. Format-specific structural check ← milliseconds (parse headers)
5. Decompression bomb check ← milliseconds (declared dimensions)
6. Malware scan ← seconds (whole-file ClamAV/GuardDuty)
7. Sanitization (strip EXIF, etc.) ← seconds, can be parallel with 6
8. Variant generation             ← seconds-to-minutes (separate stage)

If any step fails: stop, move the file to forensics, log a structured event, return. Don't continue processing a file you've already flagged.
The rest of this article walks each step in detail.
Steps 1-2: filename and size
The filename is not metadata, it is an attack surface. Sanitize it as if it came from a determined attacker, because eventually it will:
- Never use it as the object key. Use a UUID or content hash. Keep the original filename in a database column for display purposes only.
- Reject filenames with null bytes (`\x00`), control characters, path separators (`/`, `\`), parent references (`..`), and reserved Windows names (`CON`, `PRN`, `AUX`, `NUL`, `COM1`-`COM9`, `LPT1`-`LPT9`, with or without extensions).
- Cap length at 255 bytes after UTF-8 encoding.
- Normalize Unicode (NFC) before comparison. Don't compare raw bytes.
- Be aware of Unicode RTL override attacks: `invoice<U+202E>gpj.exe` displays as `invoiceexe.jpg`. Strip or reject the Unicode bidirectional control characters (U+202A–U+202E, U+2066–U+2069).
Size: enforce declared size at sign time (S3 POST policy content-length-range) and re-verify in the worker (the actual object size from S3 metadata). A mismatch means tampering somewhere and is itself a security event.
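A minimal sanitizer covering the rules above might look like the following sketch (the exact rejection policy is illustrative — some teams strip rather than reject):

```python
import re
import unicodedata
import uuid

# Unicode bidirectional-control characters used in RTL-override attacks
BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069\u200e\u200f]")
RESERVED_WINDOWS = {"CON", "PRN", "AUX", "NUL"} \
    | {f"COM{i}" for i in range(1, 10)} \
    | {f"LPT{i}" for i in range(1, 10)}

def sanitize_display_name(name: str) -> str:
    """Clean the user-supplied filename for display. Never use it as a key."""
    name = unicodedata.normalize("NFC", name)
    if "\x00" in name or "/" in name or "\\" in name or ".." in name:
        raise ValueError("filename contains path or null-byte characters")
    if BIDI_CONTROLS.search(name):
        raise ValueError("filename contains bidi control characters")
    if name.split(".")[0].upper() in RESERVED_WINDOWS:
        raise ValueError("reserved Windows device name")
    if len(name.encode("utf-8")) > 255:
        raise ValueError("filename too long after UTF-8 encoding")
    # Drop any remaining control characters (category Cc)
    return "".join(ch for ch in name if unicodedata.category(ch) != "Cc")

def object_key(tenant_id: str) -> str:
    """The object key is always server-generated, never user-derived."""
    return f"{tenant_id}/{uuid.uuid4()}"
```

The split between `sanitize_display_name` and `object_key` is the point: the user's string only ever reaches a display column, and the storage path is entirely yours.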
Step 3: magic-byte content-type verification
Never trust the Content-Type the client declared. Open the first 4-8 KB of the file and identify it by content. In practice this is one of:
- `libmagic` (the library behind `file(1)`) — battle-tested, accurate, available everywhere
- Go: `net/http.DetectContentType` for basic cases, `h2non/filetype` or `gabriel-vasile/mimetype` for richer detection
- Python: `python-magic`
- Node: `file-type`
The check has two parts:
1. Detected type matches declared type. Reject mismatches outright. A "PDF" that's actually a Windows PE binary is a clear attack.
2. Detected type is in the allow-list for this upload kind. Avatar uploads accept only `image/jpeg`, `image/png`, `image/webp`. KYC documents additionally accept `application/pdf`. Invoices accept `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` plus PDF. Be explicit; reject everything else.
The deny-list version of this check (block known-bad types) is wrong. Attackers will always find a new type you didn't think to block. Allow-lists are the only viable shape.
Steps 4-5: structural checks and decompression bombs
This is where the AWS tutorial ends and reality begins.
A decompression bomb is a small file that, when processed normally, consumes enormous resources. The classic example: a 42 KB .zip that contains nested zips totaling 4.5 PB (42.zip). Modern variants:
PNG bombs. A PNG file declares its dimensions in a fixed-position IHDR chunk near the start of the file. The image data is zlib-compressed. A malicious PNG can declare 30,000 × 30,000 pixels and compress to 10 KB. When Pillow loads it: 30,000 × 30,000 × 4 bytes = 3.6 GB of memory. Worker OOM, instance reaped, repeat for $0.0001 per attack.
Defense — read the dimensions before decoding pixels, and reject above a policy threshold:
```python
# Python / Pillow
from PIL import Image
import io

# Lower the default decompression-bomb threshold (default is ~89M pixels)
Image.MAX_IMAGE_PIXELS = 25_000_000  # 25 megapixels

def safe_load(raw_bytes: bytes) -> Image.Image:
    img = Image.open(io.BytesIO(raw_bytes))
    img.verify()  # parse headers without decoding pixel data
    # Re-open after verify() (it consumes the stream)
    img = Image.open(io.BytesIO(raw_bytes))
    w, h = img.size
    if w * h > Image.MAX_IMAGE_PIXELS:
        raise ValueError("image dimensions exceed policy")
    img.load()  # now safe to decode
    return img
```

Better yet, use libvips instead of Pillow/ImageMagick for resize work. libvips processes images in streaming tiles rather than decoding the full bitmap into memory, with an order-of-magnitude lower memory ceiling. libvips 8.13+ also has an "untrusted operations" mode that disables risky format features at runtime.
GIF bombs. Same idea, different format. GIF can declare a logical screen and individual frame dimensions; a multi-frame GIF can declare hundreds of huge frames.
Zip bombs. Both the classic recursive variety and the non-recursive 4.5 PB single-layer bomb. Defense: never decompress blindly. Stream-extract with a max_decompressed_size cap and abort the moment it's exceeded.
XML billion-laughs / quadratic blowup. SVG, DOCX, XLSX, PPTX are all XML inside a zip. An entity expansion like:
```xml
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
  ...
]>
<lolz>&lol9;</lolz>
```

expands a ~1 KB file to roughly 3 GB during parsing. Defense: disable DOCTYPE / external entity processing in every XML parser. In Python that's `defusedxml`. In Go, `encoding/xml` doesn't expand custom entities by default, but if you use a different XML library, check it. Reject XML with a `<!DOCTYPE` declaration outright unless you have a specific reason to allow it.
PDF bombs. Recursive object references, deeply nested arrays, oversized images embedded. Defense: parse with a library that has resource limits (pdfminer.six with maxpages, or qpdf in a sandbox with ulimit for memory/CPU).
Office docs. Same XML issues plus embedded OLE objects, plus macros, plus the format is a zip so all zip-bomb defenses apply.
Polyglot files. A file that's valid as multiple formats simultaneously (a .png that's also a valid .html, a .pdf that's also a .jar). Magic-byte detection sees one type; the browser, server-side renderer, or downstream consumer sees another. Defense: re-encode to a canonical format. If a user uploads a "PNG," your worker decodes it and re-encodes a clean PNG. The polyglot property is destroyed by the round trip.
Step 6: malware scanning
This is the most-skipped step and the most reputation-damaging when it goes wrong. Three viable shapes in 2026:
Option A — ClamAV in a Lambda / Cloud Run function
The serverless ClamAV pattern is well-documented and works. Architecture:
S3 quarantine → S3 event → SQS → Lambda (ClamAV) → scan result tag
│
│ definitions
▼
EFS / S3 with daily update
via EventBridge cron

Implementation notes that matter:
- ClamAV's virus definitions are ~400 MB and change daily. Bake them into the Lambda image at deploy time and you'll be scanning with stale signatures within a week.
- Mount EFS to the Lambda, or use S3 as the definition store with a separate "refresh-definitions" Lambda triggered every 6-24 hours via EventBridge.
- Lambda's `/tmp` is ephemeral; download the object to scan, scan, delete. Cap at 10 GB (Lambda's max ephemeral storage).
- For files larger than 10 GB, stream-scan via ClamAV's clamd socket — but at that point a separate VM running clamd is often simpler than Lambda.
- ClamAV is signature-based. It catches known malware. It does not catch novel malware or behavioral threats.
AWS provides a CDK construct that wires this up correctly. Use it if you're on AWS — it's not worth rebuilding.
Option B — AWS GuardDuty Malware Protection / GCP equivalents
GuardDuty Malware Protection for S3 is the managed alternative — you turn it on, it scans new objects, tags them with the result. Costs more per GB scanned than DIY ClamAV but you don't operate it.
On GCP, the equivalent is Google Cloud Storage Malware Scanner (part of Security Command Center) or third parties like Cloud Storage Security (which works on both AWS and GCP).
Use the managed option unless you have a specific reason not to. The reasons not to: regulatory requirement for specific scanning engines, very high throughput where managed pricing gets prohibitive, air-gapped environments.
Option C — multi-engine commercial scanning
OPSWAT MetaDefender, VirusTotal API, and similar services run uploads through dozens of scanning engines. The detection rate is materially higher than ClamAV alone. The cost is materially higher too — typically $0.001-$0.01 per scan.
For high-value content (KYC documents, signed contracts, anything you'd have to apologize for if it served malware), multi-engine scanning is worth the cost. For avatar images, ClamAV or GuardDuty is enough.
What scanning doesn't catch
Be honest about the ceiling of antivirus scanning:
- Novel malware (no signature yet) — scan tomorrow, it'll be detected, but it's too late if you already served it. Mitigation: scan again on download for high-value content, accept some staleness for the rest.
- Targeted malware (specifically written for your customers' environments) — won't have a signature.
- Macros that aren't malware-flagged but still leak data via `mhtml:` redirects, external references, etc. AV doesn't look at this.
- Steganographically hidden payloads in images. The image scans clean because the payload isn't executable from the image alone.
The right framing: AV scanning is necessary but not sufficient. It's one layer.
Step 7: format-specific sanitization
After scan, before promotion to production, run format-specific sanitization:
Images:
- Strip EXIF (contains GPS coordinates, device IDs, timestamps — PII)
- Strip ICC profiles unless you specifically need them
- Re-encode to canonical format. This destroys polyglots, removes hidden chunks, normalizes dimensions
- For SVGs that you accept: see next section
SVGs:
SVGs are XML with embedded HTML and JavaScript capabilities. The list of recent CVEs is sobering: CVE-2026-25648 (Traccar), CVE-2026-33172 (Statamic), CVE-2026-29924 (Grav), all 2026, all SVG-upload stored XSS or XXE.
Hard rules:
- Never serve user-uploaded SVG with `Content-Type: image/svg+xml`. Convert to PNG via libvips/rsvg-convert and serve the PNG. The SVG-as-image use case (icon libraries, logos) is solved.
- If you must serve SVG inline (e.g., theming features for premium users), sanitize with a library that strips `<script>`, `<foreignObject>`, event handlers (`on*`), external references (`href`, `xlink:href` to non-`data:` URIs), and CSS expressions. Use `defusedxml` in Python or DOMPurify-equivalent server-side libraries — and audit it quarterly, because bypasses keep being found.
- Disable DOCTYPE in your SVG parser. Without this, XXE attacks work even on otherwise-sanitized SVGs.
- Set `Content-Security-Policy: default-src 'none'` on the served file even if you trust the sanitizer.
If you can avoid accepting SVG, do.
PDFs:
- Strip embedded JavaScript (`pdfcpu` or `qpdf` can do this)
- Strip embedded files / attachments
- Reject encrypted PDFs (you can't scan them; ask the user to upload unencrypted)
- Re-encode through `qpdf --linearize` to canonicalize
Office documents (DOCX, XLSX, PPTX):
- Reject anything with macros (`.docm`, `.xlsm`, `.pptm` extensions, and check for the macro bit in the DOCX zip contents — extension is suggestive, not authoritative)
- Reject OLE objects and embedded files
- Strip external references (linked images, data connections)
Video / audio:
- Re-encode through ffmpeg with a known profile (`-vcodec h264 -acodec aac -movflags +faststart` and a fixed bitrate ceiling)
- Resource limits via cgroups or container quotas — ffmpeg has historically been a source of memory-corruption CVEs
The principle in all of these: canonicalize, don't sanitize. Decoding and re-encoding a file in a known-good pipeline destroys most embedded threats automatically, because the threat depended on file-format ambiguity that the canonical encoder doesn't reproduce.
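For images, the canonicalization round trip is short. Shown here with Pillow (assumed available; libvips is the better engine at scale, per the earlier section):

```python
import io
from PIL import Image

def canonicalize_image(raw: bytes) -> bytes:
    """Decode and re-encode. EXIF, ancillary chunks, trailing data, and
    any polyglot structure are dropped, because the fresh encoder writes
    only what the decoder understood as pixels."""
    img = Image.open(io.BytesIO(raw))
    out = io.BytesIO()
    img.save(out, format="PNG")  # no exif= passed, so none is written
    return out.getvalue()
```

Anything appended after the image data — a ZIP, an HTML payload, a hidden chunk — simply doesn't survive the trip.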
Signed URL hygiene
The signed URL is a bearer token. Treat it accordingly:
- Short expiry. 5-15 minutes for uploads, ≤ 1 hour for downloads. Refresh on demand, don't pre-issue.
- Never log them. Strip them from access logs, error logs, analytics tags. They appear in `Referer` headers on outbound requests from rendered files; configure `Referrer-Policy: no-referrer` on the served HTML.
- Bind to the user where the platform supports it. S3 doesn't natively bind a presigned URL to a user, but you can include a one-shot UUID in the URL path and check it server-side on the redirect.
- Cache-Control: private, no-store on responses that contain signed URLs. Otherwise CDN edges, browser cache, or shared proxies hold them past their useful life.
- Audit issuance. Log who got which signed URL, valid until when, for which object. When a leaked URL shows up in a referer or a customer's screenshot, you can trace the issuance.
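The "checked server-side on the redirect" idea can be sketched as an HMAC-signed token bound to user and object, with stdlib only (`SECRET` and the TTL are illustrative; a true one-shot token additionally needs a used-token store to mark redemptions):

```python
import hashlib
import hmac
import time

SECRET = b"rotate-me"  # illustrative; load from a secret manager

def issue_token(user_id: str, object_key: str, ttl: int = 600) -> str:
    """Token binds (user, object, expiry); nothing else to leak."""
    expires = int(time.time()) + ttl
    msg = f"{user_id}:{object_key}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{expires}.{sig}"

def verify_token(user_id: str, object_key: str, token: str) -> bool:
    try:
        expires_s, sig = token.split(".", 1)
        expires = int(expires_s)
    except ValueError:
        return False
    if time.time() > expires:
        return False
    msg = f"{user_id}:{object_key}:{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

The server validates the token before issuing the actual storage redirect, so a leaked URL is useless for any other user or object.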
Bucket-level controls (the floor)
Below the application logic, the storage configuration is your safety net:
- S3 Block Public Access at the account level, not just per-bucket. The number of breaches caused by a single misconfigured bucket policy allowing `Principal: *` is uncountable. Block at the account level and any future bucket inherits it.
- GCS Uniform Bucket-Level Access. Same idea; prevents per-object ACL leakage.
- Encryption at rest with customer-managed KMS keys for anything sensitive (KYC, contracts, anything PII). Because you control the key, you get instant cryptographic erasure on deletion — destroy the key and the data is unrecoverable even if the bucket isn't wiped.
- Object versioning + lifecycle rules + delete markers. Defends against accidental deletion and ransomware that targets storage. Combine with Object Lock in compliance mode for retention-mandated content.
- CloudTrail / GCS Audit Logs on every bucket, shipped to a separate account/project that the same IAM principal cannot tamper with.
- Access analyzer (AWS) or equivalent (GCP Security Command Center) — automated detection of public exposure or unusual access patterns.
If the application layer ever breaks, these controls keep the breach blast radius small.
Quotas, rate limits, and abuse
The whole pipeline assumes you actually want to process each upload. At scale, attackers will try to overwhelm it. Defenses:
- Per-user / per-tenant upload rate limits, enforced at the sign-URL step. Not just count — also aggregate bytes per hour/day.
- Per-IP rate limits for unauthenticated uploads (signup avatars, contact forms). Cloudflare / API Gateway / Cloud Armor at the edge.
- CAPTCHA on suspicious patterns — first-time accounts uploading dozens of files in the first hour, geographically anomalous signups.
- Worker concurrency caps with backpressure. If the scan queue depth grows, slow down sign-URL issuance — better to refuse new uploads than to fall further behind and serve hours-old scan results.
- Cost alerting on the scan engine. GuardDuty / multi-engine scanning bills per scan. An attacker uploading a million tiny PDFs is more painful as a bill than as a security event.
Audit trail and lawful deletion
For each upload, you'll eventually want to answer:
- Who uploaded it, from what IP, with what session, at what time?
- What was the magic-byte type, the declared type, the scan result?
- When was it promoted? When was it served? To whom?
- When was it deleted? Was the deletion propagated to versions? To backups? To the KMS key?
Every step in the worker pipeline should emit a structured event with the file's persistent ID. Don't log the file content, don't log the signed URL, do log everything else.
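A structured pipeline event is one JSON line keyed by the file's persistent ID, with nothing sensitive in it. Field names here are illustrative:

```python
import json
import logging
import time

log = logging.getLogger("upload-pipeline")

def emit_event(file_id: str, step: str, outcome: str, **details) -> str:
    """One JSON line per pipeline step. No signed URLs, no file content,
    no raw filename outside its dedicated display field."""
    event = {
        "ts": time.time(),
        "file_id": file_id,
        "step": step,        # e.g. "magic_byte_check"
        "outcome": outcome,  # "pass" | "fail"
        **details,
    }
    line = json.dumps(event, sort_keys=True)
    log.info(line)
    return line
```

When a scan result is questioned months later, these lines are how you prove what ran, in what order, with what verdict.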
GDPR-grade deletion (and CCPA, and equivalents):
- Soft delete in the database (audit retention)
- Bucket object delete + version cleanup
- KMS key access removal for the user's keys (if you scope per-user)
- Async job to verify deletion propagated through replicas, backups, CDN cache
- Re-confirm at 30 days; if not propagated, page someone
The harder part is making sure derived artifacts (thumbnails, transcoded variants, OCR text, ML embeddings) are also deleted. Track parentage in the database: every derived object has a parent_file_id, and deletion cascades.
The "actually safe" pipeline, end to end
Putting it all together, this is what a defensible upload pipeline looks like in 2026:
1. Client → API: "I want to upload, here's my metadata"
2. API: authenticate → authorize → check quotas → write `pending` row →
sign POST policy (size cap, content-type prefix, key prefix locked)
3. Client → quarantine bucket: PUT with policy constraints (S3 enforces)
4. Quarantine bucket → event → SQS / Pub/Sub
5. Worker (Lambda / Cloud Run) picks up event:
a. Read object metadata (size, declared type)
b. Sanitize filename
c. Read first 8 KB, detect magic-byte type → reject mismatch
d. Format-specific structural check → reject malformed
e. Decompression bomb check (declared dimensions, etc.)
f. Malware scan (GuardDuty / ClamAV / multi-engine)
g. Format-specific sanitization (strip EXIF, re-encode, etc.)
h. Promote to production bucket with computed name (UUID)
i. Update DB row: `ready`
6. On any failure in 5a-g: move to forensics bucket, DB row → `quarantined`,
structured event → SIEM
7. Lifecycle rule on quarantine: delete unpromoted objects after 24 hours
8. Lifecycle rule on forensics: retain 90 days, then transition to Glacier

The pipeline has the property that any single defense can be bypassed and the file still doesn't reach a user. Defense in depth is the entire game — no single check is the whole story.
What I'd actually do today, in order
If you're starting from a working upload pipeline that lacks most of this:
1. Two-bucket pattern. Even before any scanning, separate the upload destination from the serving destination. This single change makes everything else viable.
2. Block public access at the account level. Free, prevents the dumbest class of breach.
3. Magic-byte content-type verification in the worker. Hours of work, eliminates a large class of attack.
4. Filename sanitization, UUID object keys. Half a day. Eliminates path traversal entirely.
5. Decompression bomb checks for the image formats you accept. Half a day. Eliminates the cheapest DoS.
6. Enable GuardDuty Malware Protection / Cloud Storage Security. Hours of work, managed solution, catches the textbook 99%.
7. Stop serving user SVGs as `image/svg+xml`. Rasterize to PNG. Eliminates a CVE category.
8. Signed URL audit logging + short expiry. Half a day. Buys investigation capability for when something goes wrong.
9. Quotas + rate limits at the sign-URL step. A day or two. Caps your attack-cost exposure.
10. Lifecycle rules on the quarantine bucket. One line of config. Closes the abandoned-uploads attack surface.
The 80/20 is steps 1-3 and 6. Most production breaches involve at least one of "served from upload destination," "no content-type verification," and "no malware scanning."
What's coming next
The remaining articles in this series will dig into:
- Multipart and resumable uploads in the browser — Uppy, tus, the edge cases.
- Variant generation: thumbnails, transcoding, OCR — the worker layer in cost depth.
- Signed download URLs, CDN integration, access control — serving at scale with the API out of the path.
- Lifecycle, retention, and GDPR deletion — the full deletion story.
- Observability and forensics — what to log, how to find a leaked URL, how to prove a file was scanned.
Series
- Direct-to-S3 uploads — moving file bytes off the API.
- Serverless at the front door — running the API on Lambda / Cloud Run.
- The security layer (this article) — what attackers actually try and how to defend.
- Multipart and resumable uploads in the browser. (next)
- Variant generation: thumbnails, transcoding, OCR.
- Signed download URLs, CDN integration, access control.
- Lifecycle, retention, and GDPR-compliant deletion.
- Observability and forensics for file pipelines.
References
- AWS CDK Serverless ClamAV construct
- Virus scan S3 buckets with a serverless ClamAV CDK construct — AWS Developer Tools Blog
- Amazon GuardDuty Malware Protection for S3
- Pillow decompression bomb protection — official issue thread
- libvips 8.13 untrusted operations mode
- 42.zip — classic zip bomb
- Non-recursive zip bomb — David Fifield
- SVG Unveiled: Understanding XXE Vulnerabilities — OPSWAT
- CVE-2026-25648: Traccar SVG XSS
- CVE-2026-33172: Statamic SVG sanitization bypass
- CVE-2026-29924: Grav SVG XXE
- OWASP File Upload Cheat Sheet
- defusedxml: XML bomb protection for Python
- gabriel-vasile/mimetype: Go MIME detection by magic bytes