Next.js wasn't built for a firehose. So I moved the data path to Go.
FLARE delivers low-latency market data, orderbooks, candlesticks and time-and-sales to trading-floor users. The initial version routed real-time data through Next.js API routes. It worked until volume hit production levels — then the single-threaded Node.js event loop got starved by the constant message firehose, blocking everything else on the server. UI requests timed out. Charts froze.
- Diagnosed: Node's single-threaded model + JSON parse cost on high-frequency feeds = event-loop starvation.
- Moved the entire data path to a Go server (`golang-bullflare-ui-server`) with goroutines + channels for true parallelism.
- Each data domain (market data, orderbook depth, candlesticks, time-and-sales, symbol messages, positions) got its own WebSocket endpoint backed by goroutines reading from ClickHouse.
- Next.js kept what it's good at: rendering the UI and proxying user-facing requests.
- Result: charts stay live under production load, and the Next.js event loop no longer blocks on data work.
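A minimal sketch of the goroutines-plus-channels idea, not the production code. The `Tick` type, the `Hub`, and the drop-when-full policy are assumptions made for illustration; the real message schema and backpressure strategy may differ. The point it shows is that JSON decoding runs in its own goroutine and a slow subscriber cannot stall the publisher.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Tick is a hypothetical market-data message; the real schema is not shown in this write-up.
type Tick struct {
	Symbol string  `json:"symbol"`
	Price  float64 `json:"price"`
}

// Hub fans decoded ticks out to per-client buffered channels.
type Hub struct {
	mu   sync.RWMutex
	subs map[chan Tick]struct{}
}

func NewHub() *Hub { return &Hub{subs: make(map[chan Tick]struct{})} }

// Subscribe registers one client with a buffer of buf ticks.
func (h *Hub) Subscribe(buf int) chan Tick {
	ch := make(chan Tick, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// Publish sends t to every subscriber without blocking and
// returns how many subscribers were skipped because their buffer was full.
func (h *Hub) Publish(t Tick) (dropped int) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for ch := range h.subs {
		select {
		case ch <- t:
		default: // assumed policy: drop rather than block the feed
			dropped++
		}
	}
	return dropped
}

// Decode parses raw JSON frames and publishes them; it is meant to run in its own goroutine.
func Decode(raw <-chan []byte, h *Hub, wg *sync.WaitGroup) {
	defer wg.Done()
	for b := range raw {
		var t Tick
		if err := json.Unmarshal(b, &t); err != nil {
			continue // skip malformed frames
		}
		h.Publish(t)
	}
}

func main() {
	h := NewHub()
	sub := h.Subscribe(8)
	raw := make(chan []byte)

	var wg sync.WaitGroup
	wg.Add(1)
	go Decode(raw, h, &wg)

	raw <- []byte(`{"symbol":"ES","price":5000.25}`)
	close(raw)
	wg.Wait()

	fmt.Println(<-sub)
}
```

In a real server each WebSocket connection would own one subscriber channel and a writer goroutine draining it.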
Go + ClickHouse + gRPC + a dashboard built for traders.
The Go server (`gorilla/mux`, `gorilla/websocket`, `clickhouse-go/v2`, `golang-jwt`) exposes parallel HTTP + WebSocket endpoints — `/api/marketData/*` and `/api/marketData/*/ws` — backed by ClickHouse for the time-series side. The Next.js frontend (React 19, Turbopack, Zustand, SWR, lightweight-charts + ECharts) consumes those streams and renders a drag-and-drop dashboard (`@dnd-kit`) of resizable panels (`react-resizable-panels`). User-defined filter groups are persisted server-side per user.
- Market data, orderbook depth, candlestick, time-and-sales — each one a typed WebSocket handler in Go.
- ClickHouse for analytical queries; gRPC for downstream service calls; JWT auth in the middleware.
- Frontend: lightweight-charts (TradingView), ECharts, drag-and-drop layouts, saved filter groups, CSV export.
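To make "JWT auth in the middleware" concrete, here is a rough sketch using only the standard library. The `Verifier` function type, `RequireAuth`, `bearerToken`, and the example route are all hypothetical names; in the real server the verification step would be done with `golang-jwt`, which is deliberately left out here so the example stays self-contained.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// userKey is a private context key for the authenticated user ID.
type userKey struct{}

// Verifier is a hypothetical hook: it validates a raw token and returns the user ID.
// A real implementation would parse and verify a JWT here.
type Verifier func(token string) (userID string, err error)

// bearerToken extracts the token from an "Authorization: Bearer <token>" header value.
func bearerToken(header string) (string, bool) {
	tok, ok := strings.CutPrefix(header, "Bearer ")
	return tok, ok && tok != ""
}

// RequireAuth rejects requests without a valid token and
// stores the user ID in the request context for downstream handlers.
func RequireAuth(verify Verifier, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tok, ok := bearerToken(r.Header.Get("Authorization"))
		if !ok {
			http.Error(w, "missing bearer token", http.StatusUnauthorized)
			return
		}
		userID, err := verify(tok)
		if err != nil {
			http.Error(w, "invalid token", http.StatusUnauthorized)
			return
		}
		ctx := context.WithValue(r.Context(), userKey{}, userID)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	// Stand-in verifier for the demo; not a real JWT check.
	verify := func(tok string) (string, error) {
		if tok == "demo" {
			return "user-1", nil
		}
		return "", errors.New("bad token")
	}
	h := RequireAuth(verify, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "hello %v", r.Context().Value(userKey{}))
	}))

	// The path below is an illustrative guess at one route under /api/marketData/.
	req := httptest.NewRequest(http.MethodGet, "/api/marketData/orderbook", nil)
	req.Header.Set("Authorization", "Bearer demo")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	fmt.Println(rec.Code, rec.Body.String())
}
```

Keeping verification behind a function type means the same middleware can wrap both the HTTP and the WebSocket upgrade handlers.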
Real-time risk management around an FPGA core.
NanoShield is a real-time trading risk management system — a Next.js 16 frontend, a Go backend (HTTP + WebSocket + gRPC), and a PostgreSQL schema with composite types, RLS, and NOTIFY/LISTEN triggers. Writes go through an external gRPC service into Postgres; reads come back direct; real-time updates flow through Postgres triggers → Go broadcasters → a WebSocket hub → SWR cache invalidation in the frontend.
- Six Go broadcasters — alerts, orders, positions, limits, health, activity — each one polling Postgres, hashing for change detection, and pushing through a central WebSocket hub.
- Hybrid SWR + WebSocket pattern: SWR for initial fetch + safety-net polling, WebSocket for instant invalidation.
- Custom JWT auth with refresh-token deduplication (shared-promise pattern) so 50 concurrent 401s only refresh once.
- Two Postgres pools — authDB and shieldDB — so auth tokens never compete with risk-data queries for connections.
- Entity hierarchy: Instruments → AccountGroup → Account → User, each with its own limit types as PostgreSQL composite columns.
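The "poll, hash, push only on change" loop from the broadcasters bullet can be sketched roughly as follows. This is an illustration under assumptions, not the NanoShield code: the `Position` row shape, the `Broadcaster` type, and the choice of SHA-256 over a JSON encoding are all mine.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Position is a hypothetical row shape returned by one poll of Postgres.
type Position struct {
	Account string `json:"account"`
	Symbol  string `json:"symbol"`
	Qty     int64  `json:"qty"`
}

// Broadcaster remembers the hash of the last snapshot it pushed.
type Broadcaster struct {
	last [sha256.Size]byte
	seen bool
	push func(payload []byte) // stand-in for sending through the WebSocket hub
}

// Offer hashes the snapshot and pushes it only if it differs from the previous one.
// It reports whether a push happened.
func (b *Broadcaster) Offer(snapshot any) (bool, error) {
	payload, err := json.Marshal(snapshot)
	if err != nil {
		return false, err
	}
	sum := sha256.Sum256(payload)
	if b.seen && sum == b.last {
		return false, nil // unchanged since the last poll: stay quiet
	}
	b.last, b.seen = sum, true
	b.push(payload)
	return true, nil
}

func main() {
	b := &Broadcaster{push: func(p []byte) { fmt.Println("push:", string(p)) }}

	// Three simulated polls; a real broadcaster would run a query on a time.Ticker.
	rows := []Position{{Account: "A1", Symbol: "ES", Qty: 5}}
	b.Offer(rows) // first snapshot is always pushed
	b.Offer(rows) // identical snapshot is suppressed
	rows[0].Qty = 6
	b.Offer(rows) // changed snapshot is pushed
}
```

Hashing the encoded snapshot keeps the comparison cheap and means clients only hear about a domain when something in it actually changed.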
A 19 GB node process got killed. I had to prove it wasn't us.
One afternoon the kernel killed a 19 GB `node` process on the Shield server. The first instinct on a team is to blame the Next.js service — it's the obvious node process. I wrote up an incident investigation: pulled `journalctl` for the window, audited service memory after recovery, checked socket counts, and looked for files modified at `14:19:37`. Two files lined up: `vivado_lab.log` and `vivado_lab.jou` — modified the same minute as the kill. Vivado (Xilinx FPGA tooling) was running on the same box.
- Inspected steady-state UI processes: ~100 MB RSS. No persistent leak.
- Verified socket counts (`ss -tanp | grep :3000`) — 63, normal for the workload.
- Ran a time-bounded `find -newermt` search to surface which files changed in the OOM window.
- Wrote a precise message to the FPGA team with the exact commands they could run to confirm. Added systemd `MemoryMax` limits and a memory-watch logger so that next time we'd have a smoking gun.
- Conclusion: strong evidence the OOM was an FPGA toolchain spike, not the UI. Wrote it up in a 12-page postmortem so the decision was reproducible.
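For a sense of what a memory-watch logger can look like, here is a small sketch. It is an assumption about the approach, not the tool that was deployed: it reads the `VmRSS` line from `/proc/<pid>/status` on Linux and prints one sample. The `parseVmRSS` helper is a hypothetical name.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseVmRSS extracts the resident set size in kB from the text of /proc/<pid>/status.
func parseVmRSS(status string) (int64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("VmRSS not found")
}

func main() {
	// Samples this process for the demo; a real watcher would loop over the
	// PIDs of interest on a timer and append timestamped lines to a log.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status (Linux only):", err)
		return
	}
	kb, err := parseVmRSS(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("pid=%d rss_kb=%d\n", os.Getpid(), kb)
}
```

A periodic log of per-process RSS is what turns "something used 19 GB" into "this PID grew from X to Y in this window".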
Cross-functional, document-first.
Hardware engineers, quants, and designers all sit at the same table. The product moves fast because we write design docs first — RBAC, alerts architecture, position checks, limits page, OOM postmortem — and code second. Most of my value here is being the person who can talk across all three teams and translate intent into a system contract.
built by one person · case study written by the same person