
Cloudflare broke its logging-as-a-service service, causing customer data loss

Software snafu took five minutes to roll back. The mess it made took hours to clean up


Cloudflare has admitted that it broke its own logging-as-a-service service with a bad software update, and that customer data was lost as a result.

The network-taming firm revealed in a Tuesday post that, for roughly 3.5 hours on November 14, its Cloudflare Logs service didn't send the data it collected to customers – and about 55 percent of those logs were lost.

Cloudflare Logs gathers logs generated by Cloudflare's services and sends them to customers who want to analyze them. Cloudflare suggests the logs may prove helpful "for debugging, identifying configuration adjustments, and creating analytics, especially when combined with logs from other sources, such as your application server."

Cloudflare customers often want logs from multiple servers and, as logfiles can be verbose and voluminous, the provider worries that consuming them all could prove overwhelming.

"Imagine the postal service ringing your doorbell once for each letter instead of once for each packet of letters," the post suggests. "With thousands or millions of letters each second, the number of separate transactions that would entail becomes prohibitive."

Cloudflare therefore uses a tool called Logpush to batch logs into bundles of predictable size, then push them to customers at a sensible cadence.
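
To make that concrete, here is a minimal sketch in Go of size-and-time-based batching in the spirit of what Logpush does. The Batcher type, its thresholds, and the sample log line are illustrative assumptions for this article, not Cloudflare's actual code:

    package main

    import (
        "fmt"
        "time"
    )

    // Batcher collects log events and pushes them downstream as bundles,
    // rather than one event at a time.
    type Batcher struct {
        pending  [][]byte
        size     int
        maxBytes int           // flush once the bundle reaches this size
        maxWait  time.Duration // or once this much time has passed
        lastPush time.Time
        push     func(batch [][]byte) // delivers one bundle to the customer's endpoint
    }

    func (b *Batcher) Add(event []byte) {
        b.pending = append(b.pending, event)
        b.size += len(event)
        if b.size >= b.maxBytes || time.Since(b.lastPush) >= b.maxWait {
            b.Flush()
        }
    }

    func (b *Batcher) Flush() {
        if len(b.pending) == 0 {
            return
        }
        b.push(b.pending)
        b.pending, b.size, b.lastPush = nil, 0, time.Now()
    }

    func main() {
        b := &Batcher{
            maxBytes: 4096, // deliberately tiny threshold for demonstration
            maxWait:  30 * time.Second,
            lastPush: time.Now(),
            push: func(batch [][]byte) {
                fmt.Printf("pushed one bundle of %d events\n", len(batch))
            },
        }
        for i := 0; i < 1000; i++ {
            b.Add([]byte(`{"ray":"example","status":200}`)) // hypothetical log line
        }
        b.Flush() // deliver whatever is left
    }

Real bundles are far larger and compressed, but the principle is the postal one from the quote above: deliver a packet of letters, not one letter at a time.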

Logs that Cloudflare provides to customers are prepared by other tools called Logfwdr and Logreceiver.

On November 14, Cloudflare made a change to Logpush, designed to support an additional dataset.

It was a buggy change – it "essentially informed Logfwdr that no customers had logs configured to be pushed."

Cloudflare staff noticed the problem and reverted the change in under five minutes.

But the incident triggered a second bug, in Logfwdr, which meant that under circumstances like the Logpush mess, all log events for all customers would be pushed into the system – instead of only those for customers who had configured a Logpush job.

The resulting flood of info is what caused the outage, and the loss of some logfiles.
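
One way a bug like that can arise is a fallback path that treats an empty customer list as "forward everything" rather than "forward nothing." The following hypothetical Go sketch illustrates the failure mode; the names and logic are assumptions for illustration, not Logfwdr's real code:

    package main

    import "fmt"

    type Event struct {
        CustomerID string
        Line       string
    }

    // forward decides which events continue downstream, given the set of
    // customers that actually have a push job configured.
    func forward(events []Event, configured map[string]bool) []Event {
        // The buggy branch: told that nobody is configured, the forwarder
        // passes every event for every customer instead of passing none.
        if len(configured) == 0 {
            return events
        }
        var out []Event
        for _, e := range events {
            if configured[e.CustomerID] {
                out = append(out, e)
            }
        }
        return out
    }

    func main() {
        events := []Event{
            {CustomerID: "cust-a", Line: "GET / 200"},
            {CustomerID: "cust-b", Line: "GET /login 302"},
            {CustomerID: "cust-c", Line: "POST /api 500"},
        }

        // Normal operation: only the customer with a Logpush job is forwarded.
        fmt.Println(len(forward(events, map[string]bool{"cust-a": true}))) // 1

        // The bad update says nobody is configured – and everything gets through.
        fmt.Println(len(forward(events, map[string]bool{}))) // 3
    }

Multiply that "everything gets through" branch across every customer and every log event per second, and the overload that swamped the pipeline follows.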

Cloudflare has admonished itself for the incident. It conceded it did most of the work to prevent this sort of thing – but didn't quite finish the job. Its post likens the situation to failing to fasten a car seatbelt – the safety systems are built in and work, but they're useless if not employed.

The networking giant will try to avoid this sort of mess in future with automated alerts that mean misconfigurations "will be impossible to miss" – brave words. It also plans extra testing to prepare itself for the impact of datacenter and/or network outages and system overloads. ®
