The Data Race Hiding Behind Correct Atomics
I thought the metrics collector was clean. The previous post
explained why atomic.Pointer was the right choice for the snapshot cache --
lock-free reads, atomic increments, no mutexes on the hot path. I was feeling
good about it. Then I ran go test -race while working on something else
entirely, and the terminal lit up with data race warnings. Not one. Five.
The Circular Buffer: Time-Windowed Request Counts
I added the metrics page because I wanted to see what was happening on the site in real time. Most of the numbers were straightforward -- uptime, memory, goroutine count -- but I felt like something was missing. A short history of requests per minute. Just a nice little indicator of ongoing activity. The design is simple: an array of length 5 that we rotate through and increment on every request to keep a small rolling window.
Three Goroutines, One Plain int
Simple enough, right? But there are a few pieces at play. First, we need to actually record the requests. The middleware is added globally so that every request increments immediately.
func (c *Collector) IncrementRequests() {
    atomic.AddUint64(&c.requestCount, 1)
    atomic.AddUint64(&c.recentRequestCounter[c.requestIndex], 1)
}
Then we need to pull the data together and shove it onto the cached snapshot.
This happens once a second and is visible on the /metrics tab.
func (c *Collector) RecentRequestCount() uint64 {
    var n uint64
    for i := 0; i < len(c.recentRequestCounter); i++ {
        n += c.recentRequestCounter[i]
    }
    return n
}
func (c *Collector) OrderedRecentRequests() []uint64 {
    size := len(c.recentRequestCounter)
    data := make([]uint64, 0, size)
    for i := range size {
        data = append(data, c.recentRequestCounter[(c.requestIndex+i+1)%size])
    }
    return data
}
And finally, the piece that rotates the window forward every minute and zeros out the new bucket.
func (c *Collector) pruneOldestRequestCount() {
    ticker := time.NewTicker(1 * time.Minute)
    for range ticker.C {
        c.requestIndex = (c.requestIndex + 1) % len(c.recentRequestCounter)
        c.recentRequestCounter[c.requestIndex] = 0
    }
}
At this point, we've introduced the race condition. It might not be immediately apparent, but it's there.
Why It "Works" Despite the Race
This isn't a mission-critical metric on a high-traffic production system. It's
a personal blog. When pruneOldestRequestCount bumps the index, there's nothing
protecting it -- we have no idea whether anyone else is reading or writing it at
that exact moment. The worst case? A request or two lands in the wrong bucket,
maybe gets blown away by the zero-out. Nobody's losing money over it. But it
goes against my standards, and it's exactly the kind of thing that
bites you when you stop paying attention.
go test -race: Catching What Code Review Missed
Here's where Go's tooling saved me. The race detector watches memory accesses at runtime and flags anything that isn't properly synchronized. Running it is dead simple:
go test -race ./internal/metrics/
The catch is that it only flags races that are actually exercised during the test. You need a test that hammers all the concurrent code paths at the same time. Here's the one that caught it:
func TestConcurrentAllPaths(t *testing.T) {
    c := newTestCollector()
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            c.ConnOpen()
            defer c.ConnClose()
            c.TouchSession(fmt.Sprintf("sess-%d", i%10))
            c.IncrementRequests()
            c.RecordDuration(time.Duration(i+1) * time.Millisecond)
        }(i)
    }
    for i := 0; i < 50; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            _ = c.snapShot()
            _ = c.RecentRequestCount()
            _ = c.OrderedRecentRequests()
            c.ComputeResponseStats()
        }()
    }
    wg.Wait()
}
100 goroutines simulating request handlers, 50 simulating snapshot readers.
The race detector immediately flagged requestIndex -- a bare int being read
and written from different goroutines with zero synchronization.
The Fix: More Than Just the Write Side
My first instinct was to slap atomic.StoreInt64 on the write in the prune
method and call it done:
// First attempt -- I thought this was enough
func (c *Collector) pruneOldestRequestCount() {
    ticker := time.NewTicker(1 * time.Minute)
    for range ticker.C {
        next := (c.requestIndex + 1) % int64(len(c.recentRequestCounter))
        atomic.StoreInt64(&c.requestIndex, next)
        c.recentRequestCounter[c.requestIndex] = 0
    }
}
It wasn't enough. This still has two problems -- the read of c.requestIndex
in the modulo calculation is bare (no atomic load), and the zero-out re-reads
c.requestIndex instead of just using next. But the bigger issue? Every reader of
requestIndex was also doing bare reads. I'd made the write safe and left the
reads wide open. An atomic store doesn't help if the other side doesn't do an
atomic load. Both sides have to participate.
So how do you stop yourself from making half the fix? You change the type. I
switched from a bare int64 with free-function atomics to atomic.Uint64,
which makes bare access impossible -- every read goes through .Load(), every
write through .Store(). You literally can't forget:
// Before: nothing stops you from bare access
requestIndex int64
_ = c.requestIndex // compiles fine, still a race

// After: the compiler won't let you
requestIndex atomic.Uint64
_ = c.requestIndex.Load() // only way to get the value
The prune method becomes:
func (c *Collector) pruneOldestRequestCount() {
    ticker := time.NewTicker(1 * time.Minute)
    for range ticker.C {
        next := (c.requestIndex.Load() + 1) % uint64(len(c.recentRequestCounter))
        // Zero the incoming bucket atomically -- IncrementRequests writes
        // these elements with atomic.AddUint64, so the write side here has
        // to pair up too -- then publish the new index.
        atomic.StoreUint64(&c.recentRequestCounter[next], 0)
        c.requestIndex.Store(next)
    }
}
And every reader now goes through .Load():
func (c *Collector) OrderedRecentRequests() []uint64 {
    size := uint64(len(c.recentRequestCounter))
    data := make([]uint64, 0, size)
    curIndex := c.requestIndex.Load()
    for i := range size {
        ix := (curIndex + i + 1) % size
        data = append(data, atomic.LoadUint64(&c.recentRequestCounter[ix]))
    }
    return data
}
Down the Rabbit Hole
I ran the test again. More races. The requestIndex race was just the one I
could see by reading the code -- the detector found the same pattern hiding in
three more places.
snapShot() was reading c.requestCount as a bare field, even though
IncrementRequests writes it with atomic.AddUint64. Same deal with the
recentRequestCounter slice elements -- written atomically in IncrementRequests,
but RecentRequestCount() and OrderedRecentRequests() were just reading them
bare. Every one of those needed an atomic.LoadUint64.
The sneakiest one was the durations ring buffer. RecordDuration writes
time.Duration values into a fixed-size array on every request, and
ComputeResponseStats iterates over that same array to compute averages.
Two goroutines, same memory, no synchronization. Since time.Duration is just
int64 under the hood, I changed the array to [4096]atomic.Int64 and
convert on the way in and out:
// Write
func (c *Collector) RecordDuration(d time.Duration) {
    i := c.durationIndex.Add(1)
    c.durations[i%uint64(len(c.durations))].Store(int64(d))
}

// Read
for i := range c.durations {
    d := time.Duration(c.durations[i].Load())
    if d > 0 {
        buf = append(buf, d)
        sum += d
        cnt++
    }
}
After all of that, go test -race passes clean.
What I Took Away
Every single one of these races had the same shape: an atomic write paired with a bare read. It's easy to think "I made the write atomic, so it's safe" -- but that's only half the handshake. If the reader doesn't participate, it's still a race.
The atomic.Uint64 and atomic.Int64 types exist specifically to make this
mistake impossible. When the type itself is atomic, there's no way to do a bare
read. The compiler catches what I missed.
I found one race by reading the code. The detector found four more. Write the
test first, run it with -race, and let the tooling tell you what you're wrong
about. That's its own lesson.
Now these metrics are accurate -- but they disappear every time the server restarts. Persisting them is the next problem to solve, and a post of its own.