Enterprise-grade file uploads, part 4: event-driven architecture for small, mobile teams
Fourth article in a series on what "enterprise-grade" actually means for user file uploads. This one steps out from the file-upload pipeline specifically to talk about the architectural pattern underneath it.
The first three articles in this series quietly built an event-driven system without naming it. In part 3 the diagram had S3 fire an event when a file landed in the quarantine bucket; a worker picked it up; if the file passed, it moved to production. Adding virus scanning, EXIF stripping, thumbnail generation, OCR — none of those required changing the upload API. Each was a new subscriber on the same event.
That property — adding features without changing the code that triggers them — is the single largest reason small teams and mobile teams inside big organizations should default to event-driven architecture. This article makes that case explicitly, with the file-upload pipeline as the running example.
It's not a deep dive on Kafka tuning or CQRS — there are plenty of those. It's a working argument that a 3-person team building a product inside a 300-person company gets more leverage from EDA than from any other single architectural choice, and a startup that adopts it early avoids most of the rewrites that come at series A.
The team shapes this article is about
Two team shapes get the most value from EDA:
The startup engineering team. 3-10 engineers. The product is one or two services today, will be ten in eighteen months. You don't have a platform team. You don't have time to refactor things later. Every architectural decision has to survive growth.
The mobile team inside a big organization. A 4-8 person product team operating inside a larger company. You don't own the platform. You don't own the auth service. You don't own the billing service. You do own a specific business capability — KYC, payments, file handling, fraud — and you ship it. Your velocity is constrained by everything you have to coordinate.
These look different from the outside but face the same problem: you can't move fast if your code is synchronously coupled to other teams' code. Every synchronous call is a tax — a deploy window you have to align with, a contract you have to negotiate, a failure mode you don't control, a rollback you can't trigger.
Event-driven architecture is the most direct technical answer to "stop blocking my deploy on someone else."
The naive coupling problem, in code
Let's start with how this looks when a team writes the obvious code. KYC service needs to know when a user uploads a document:
```go
// In the upload API
func (h *Handler) finalizeUpload(c echo.Context) error {
	file := writeToDB(...)
	storage.Promote(file)

	// tell the KYC service
	resp, err := http.Post("https://kyc-api/v1/documents", ...)
	if err != nil {
		// ...what now?
	}

	// tell the audit service
	resp2, err := http.Post("https://audit-api/v1/events", ...)
	if err != nil {
		// ...and now?
	}

	// tell the ML service to start embedding
	resp3, err := http.Post("https://ml-api/v1/embed", ...)
	if err != nil {
		// ...still?
	}

	return c.JSON(200, file)
}
```

Reading this in the small, every line is reasonable. Reading it from outside, every line is a problem:
- Latency stacks. Three HTTP calls before the user gets a response. Each is 50-200 ms on a good day.
- Failure semantics are undefined. If `kyc-api` is down, do you fail the upload? Retry? Buffer? Tell the user something went wrong? Each downstream gets its own answer in the same handler.
- The upload API owns the dependency graph. Adding a new consumer (fraud detection, billing event, analytics pipeline) means adding code here and re-deploying the upload service.
- Coupled deploy cycles. The KYC team renames their endpoint. Your upload API breaks. You discover it from your users.
- Coordination overhead. You can't change the upload service's response shape without checking with three other teams, because three other teams' code paths depend on what comes back from this handler.
The team-level effect: every feature you ship requires you to touch other teams' code. Or worse: you ship a feature that requires other teams to change their code in lockstep with yours. Cross-team Jira tickets become the bottleneck.
The same flow, event-driven
```go
// In the upload API
func (h *Handler) finalizeUpload(c echo.Context) error {
	file := writeToDB(...)
	storage.Promote(file)

	bus.Publish("file.promoted", FilePromotedEvent{
		FileID: file.ID,
		UserID: file.UserID,
		Kind:   file.UploadKind, // "avatar" / "kyc_doc" / "invoice"
		Mime:   file.DetectedMime,
		Bytes:  file.Size,
		Path:   file.Path,
		At:     time.Now().UTC(),
	})

	return c.JSON(200, file)
}
```

That's it. One publish. KYC, audit, ML, fraud, billing, analytics — none of them are mentioned in this handler. Each consumes `file.promoted` independently:
```go
// In the KYC service (a different team, a different repo, a different deploy)
func (h *KYCHandler) OnFilePromoted(e FilePromotedEvent) error {
	if e.Kind != "kyc_doc" {
		return nil // not for me
	}
	return h.scheduleReview(e.FileID, e.UserID)
}
```

The shape of the change matters more than the syntax. The upload team stopped knowing about the KYC team. The KYC team stopped depending on the upload team's HTTP API surface. They share one thing: the contract of `file.promoted`. As long as both teams respect that contract, neither team can break the other.
The benefits, ranked by what actually changes for the team
1. Independent deploys
The single most important property. The upload team can ship at 2 PM Tuesday. The KYC team can ship at 11 AM Wednesday. Nobody coordinates. The platform team's calendar is not your bottleneck.
In the synchronous world, "deploying" includes "making sure everyone we call is up." In the event-driven world, deploying is shipping a producer; consumers process the events when they're ready, including catching up after their own downtime.
For a small team in a big org, this is the difference between shipping in a sprint and shipping in a quarter. The deploy window stops being a meeting.
2. Adding features without touching old code
Over the course of this series, the file-upload codebase has gone through:
- Add virus scanning → new subscriber on `file.uploaded` (or `file.promoted`)
- Add image resizing → new subscriber
- Add EXIF stripping → new subscriber
- Add OCR for KYC documents → new subscriber
- Add ML embedding for search → new subscriber
- Add audit log → new subscriber
The upload API has not been touched for any of these. The upload API's code at the end of this list is essentially identical to the upload API's code at the beginning. The codebase stops accumulating cross-cutting changes.
This is the property that makes EDA scale to organizations. You can add the 17th feature without making the 17th change to the central producer. Compare to the synchronous version, which would have 17 HTTP calls in the handler by now, each with its own timeout, retry, and failure mode.
3. Failures become local
When the ML embedding service is down in the synchronous version, the upload API has to decide: fail the upload? Skip embedding? Queue it? Each handler accumulates its own answer to "what if downstream X is down right now?" — and the answers are inconsistent because they were written by different people on different days.
When the ML embedding service is down in the event-driven version, the events accumulate in the ML service's queue. The upload API doesn't know and doesn't care. When ML comes back up, it drains the queue and catches up. The user got their successful upload response three days ago and never knew anything was wrong.
The system's failure modes are localized to the consumer that's having the problem. The dead-letter queue is your visibility into what's broken; the rest of the pipeline keeps running.
4. The event contract is the API
Once you commit to publishing `file.promoted`, the event becomes the API. It's:
- Discoverable. A schema registry (or just an `events/` directory in a shared repo) lists every event in the system. A new team joins, reads the directory, knows what's available.
- Versioned explicitly. `file.promoted.v1`, `file.promoted.v2`. You can deprecate slowly with consumers that subscribe to both. No "we coordinated a coordinated migration on a coordinated Tuesday."
- Testable. A test that sends a `file.promoted` event and asserts the KYC service updated its DB is a clean contract test. No HTTP mocks of the upload service's response shape.
- Observable. Every event is a log line you can replay. The audit trail of "what happened in the system today" is the event log.
REST APIs have all of these in theory, but in practice teams treat REST endpoints as implementation details that can be changed without notice and rarely document them well. Events have the cultural property that they're an explicit promise to other teams. This sounds soft but it matters a lot when you're trying to ship.
5. The org chart and the architecture align
Conway's Law: organizations design systems that mirror their communication structure. A startup with a 5-person team that talks every day naturally produces a monolith. A 30-person company with five product squads naturally produces five services that are too coupled because the teams are too coordinated.
The inverse Conway maneuver says: design the team structure you want, and the architecture will follow. Event-driven architecture is the technical substrate that makes this maneuver actually work. Each team owns:
- The services that produce events for its bounded context
- The events themselves (the contract)
- Some consumers of other teams' events
And nothing else. Cross-team communication is the event contract, not Slack pings asking "can you add a field to your response?"
For a small team in a big org, this is liberating: you only need to coordinate with other teams at the point where you commit to an event schema. Daily standups stay inside the team. Cross-team meetings happen for new event contracts, not for routine work.
Where this concretely shows up in the file pipeline
Here's the event flow this codebase already implements (with one new event added for illustration):
```
┌────────────┐  POST /signurl   ┌─────────────────┐
│   client   │ ───────────────► │ API (Cloud Run) │  generates signed URL,
└────────────┘                  └────────┬────────┘  writes pending row,
      │                                  │           publishes nothing yet
      │ PUT (bytes)                      ▼
      │                            ┌──────────┐
      └──────────────────────────► │ pending  │
                                   │ in DB    │
                                   └────┬─────┘
                                        │ S3 event on object create
                                        ▼
                          ┌──────────────────────┐
                          │   file.uploaded.v1   │  ← bus topic
                          └──────────┬───────────┘
                                     │
          ┌──────────────────────────┼──────────────────────────┐
          ▼                          ▼                          ▼
   ┌────────────┐             ┌────────────┐             ┌────────────┐
   │ Validation │             │ Variant    │             │ Audit      │
   │ + scan     │             │ generation │             │ logger     │
   │ (Cloud Run)│             │ (resize    │             │            │
   └─────┬──────┘             │  worker)   │             └────────────┘
         │ on pass            └─────┬──────┘
         ▼                          ▼
┌──────────────────────┐   ┌──────────────────────┐
│   file.promoted.v1   │   │ file.variant.        │
└──────────┬───────────┘   │ generated.v1         │
           │               └──────────┬───────────┘
           ├──► KYC service           │
           ├──► ML embedding (new!)   ├──► CDN cache warmup
           ├──► billing event         └──► email notification
           └──► search indexer
```

The lines on the right are the ones the upload team didn't write. The KYC team subscribed. The ML team subscribed. The billing team subscribed. The upload team's contribution is: the event, and the guarantee that it fires reliably.
Adding "tell the user via email when their KYC document is approved" is a third team's work: subscribe to a future kyc.approved.v1 event from the KYC team, send an email. The upload team is not consulted. The KYC team is not consulted on the email service's deploy cadence. Three teams, zero coordination overhead.
The patterns you need (and the ones you don't)
You can write a long book about event-driven architecture. For a small team starting today, you need exactly four patterns:
Pub/sub as the primitive
A topic (`file.promoted.v1`). Producers publish. Consumers subscribe. The broker handles delivery, ordering (within a partition / key), and persistence. That's it.
Use a managed broker — Google Cloud Pub/Sub, AWS SNS+SQS, Azure Service Bus, NATS-as-a-service. Do not stand up Kafka yourself. Self-hosted Kafka is a job for a team, not a pattern; if you don't have that team, use a managed offering or a simpler broker.
If you're already on GCP: Pub/Sub. If AWS: SNS for fan-out, SQS for delivery to each consumer. Cloud Run / Lambda subscribers are HTTP-pushed from the broker.
The outbox pattern
The single subtle bug in event-driven systems: a service updates the DB and publishes an event, and those two operations aren't atomic. Service crashes between the two → DB updated, event not published → consumers never know. Or event published, DB rollback → consumers told about state that doesn't exist.
The outbox pattern fixes this:
```sql
BEGIN;
UPDATE files SET status = 'promoted' WHERE id = $1;
INSERT INTO event_outbox (topic, payload) VALUES ('file.promoted.v1', $2);
COMMIT;
```

A separate process polls `event_outbox`, publishes events to the broker, and marks them sent. The DB transaction is the atomic unit. The publisher is at-least-once (idempotent consumers handle duplicates, more below).
Implement this from day one. Retrofitting after a production incident is a tax you pay several times over.
Idempotent consumers
Every event will be delivered at least once. Sometimes twice. Sometimes the same event in two different invocations of the same consumer because the broker's ack didn't reach back in time. Your consumers must be safe to re-run.
The standard pattern: every event has a unique `event_id`. The consumer's first action is `INSERT INTO processed_events (event_id) VALUES ($1) ON CONFLICT DO NOTHING RETURNING true`. If no row comes back, skip — the event was already processed. Otherwise, process it.
For consumers that produce side-effects on external systems (email, payment), use natural idempotency keys (the email's idempotency-key header, the payment's request-id) so retries don't double-send.
Dead-letter queues and correlation IDs
Every consumer needs:
- A retry policy with exponential backoff (most managed brokers have this built in)
- A dead-letter queue for events that fail repeatedly — these need human attention, not infinite retry
- A correlation ID propagated from the original request through every event in the chain. When an event ends up in the DLQ, you can trace it back to the user request that started it
OpenTelemetry's traceparent works. So does a simple correlation_id string in your event envelope. The form matters less than the discipline of always propagating it.
That's the entire pattern set you need to ship. Event sourcing, CQRS, full saga orchestrators — skip them until you have a specific problem they solve. Most teams never need them.
The honest tradeoffs
The pros are real. So are the cons. Be eyes-open:
- Debugging is harder. No stack trace across services. A failure in the ML embedding service is invisible to the upload team unless you build a tracing system. Plan for distributed tracing (OpenTelemetry to Honeycomb / Datadog / your tool of choice) from day one. Don't ship without correlation IDs.
- Eventual consistency reaches the UI. "I uploaded my KYC, why does the screen still say 'not started'?" The user expects synchronous feedback. You'll need either optimistic UI ("we got your upload, processing"), polling, or a websocket/SSE channel for the "ready" event. None of these are hard, but they're non-trivial frontend work you have to plan for.
- Schema evolution requires discipline. You can add fields. You can't remove fields without versioning. You can't change a field's meaning ever. Teach the team this on day one, before the first breaking change.
- "Where is the source of truth?" becomes a real question. The DB has the current state; the event log has the history; consumers have their own views. New engineers find this confusing for a few weeks.
- Sagas are not free. Multi-step workflows that need rollback (e.g., "create user → charge card → provision resources → if any fails, undo the previous steps") need a saga orchestrator (Temporal, AWS Step Functions, GCP Workflows). For simple linear flows, you don't need this. For complex business processes, you will eventually.
- Observability cost is real. Tracing, metrics per topic, DLQ alerting, event sample logging — this is infrastructure you have to build. Budget the engineering time.
- Storage cost: the event log grows. Either retain limited (7-30 days, the default for most managed brokers) or archive to cold storage for replay. Decide before your bill surprises you.
The total honest take: EDA trades synchronous coupling for asynchronous operability. You stop fighting deploy-window battles and start fighting tracing-and-DLQ battles. The second set is much smaller and much more controllable than the first.
What to do on day one (and what to skip)
For a startup or a mobile team starting fresh:
Do:
- Pick a managed pub/sub. Cloud Pub/Sub, SNS+SQS, EventBridge, NATS-as-a-service. One topic per business event.
- Implement the outbox pattern. From day one, in your shared service template.
- Standardize an event envelope. `event_id`, `event_type`, `event_version`, `correlation_id`, `created_at`, `payload`. Same envelope across all teams.
- Propagate correlation IDs. Every HTTP request gets one; it goes into every event the request produces; it goes into every event those events produce.
- Wire up OpenTelemetry. Trace requests across services. Set up Honeycomb / Tempo / Jaeger — pick one, doesn't matter which.
- DLQs from day one. Every consumer subscription has a DLQ. Alert on DLQ depth.
- Document events in a shared repo. `events/file/promoted.v1.json` (JSON Schema), one event per file. The repo is the source of truth for what events exist.
- Idempotency tables. Every consumer has a `processed_events` table indexed by `event_id`.
Skip until you actually need it:
- Event sourcing (full event log as source of truth). Most products don't need it.
- CQRS (split read and write models). Useful for specific high-scale problems, not the default.
- Self-hosted Kafka. Managed offerings are great in 2026.
- Saga orchestrators. Add when you have a real multi-step rollback case.
- Schema registry as a service. A repo with JSON Schema files is enough until you have >5 producing teams.
The total day-one investment is maybe two weeks of one engineer's time. The payoff compounds every quarter after that.
For mobile teams inside big orgs specifically
A few things that matter more inside a larger company:
- Get an organization-wide event bus, or convince yourself the platform team can provide one. Without a shared bus, every team builds its own pub/sub, and the integration story devolves to HTTP again. Push for this.
- Negotiate the event contract once, then ship. When you need data from another team, ask "do you publish an event when X happens?" If yes, subscribe. If no, ask them to add it. The contract is small and explicit; the negotiation is short.
- Don't try to consume events you weren't told about. Treating other teams' internal queues as your API is a way to make enemies and break the moment they refactor. Stick to events explicitly declared as public.
- Be a generous publisher. Publish events even before you have consumers. "When my service creates a thing, I'll fire `thing.created`" is essentially free to add and saves the next team a quarter of work.
- Track event consumers explicitly. If your team publishes events, know who consumes them — not for permission, for empathy. Breaking changes affect them.
The shape of the problem inside a big org is: you have less control of the surrounding systems, less budget for tooling, less leverage to demand changes. EDA reduces the surface where you need that leverage.
What this means for the file-upload pipeline
The codebase already publishes events implicitly through GCS object events. The next layer of work is to publish business events rather than only technical events:
- `file.uploaded.v1` — the technical event from object storage (already exists)
- `file.promoted.v1` — file passed all validation, is available for serving (new, semantically richer)
- `file.quarantined.v1` — file failed validation, here's why (new, useful for ops)
- `file.deleted.v1` — user-initiated or lifecycle-driven deletion (new, GDPR-relevant)
The shift from technical (storage events) to business (domain events) is what makes the events useful to other teams. The KYC team doesn't care that GCS uploaded an object. They care that a KYC document was uploaded, validated, and is ready to review.
What's coming next
The remaining articles return to file-upload-specific concerns, now with the event-driven substrate as a given:
- Multipart and resumable uploads in the browser — Uppy, tus, the edge cases that bite.
- Variant generation: thumbnails, transcoding, OCR — the worker layer at depth.
- Signed download URLs, CDN integration, access control — serving at scale.
- Lifecycle, retention, and GDPR deletion — full deletion story across the event chain.
- Observability and forensics — the tracing and DLQ tooling this article promised.
Series
- Direct-to-S3 uploads — moving file bytes off the API.
- Serverless at the front door — running the API on Lambda / Cloud Run.
- The security layer — what attackers actually try and how to defend.
- Event-driven architecture for small, mobile teams (this article) — the substrate underneath the rest.
- Multipart and resumable uploads in the browser. (next)
- Variant generation: thumbnails, transcoding, OCR.
- Signed download URLs, CDN integration, access control.
- Lifecycle, retention, and GDPR-compliant deletion.
- Observability and forensics for file pipelines.
References
- Conway's Law — Martin Fowler
- Event-Driven Architecture in 2026: Patterns, Tools, and When to Use It — Encore Cloud
- Best practices for implementing event-driven architectures in your organization — AWS Architecture Blog
- Patterns in Event-Driven Architectures — IBM
- Event-Driven Architecture and the Outbox Pattern — Rod Shokrian
- Event-driven architecture and Conway's law — EDA Visuals
- The Ultimate Guide to Event-Driven Architecture Patterns — Solace
- Inverse Conway Maneuver — Team Topologies
- OpenTelemetry: traceparent propagation
- Microservices Pattern: Event-driven architecture — microservices.io