Recovering from an Audit Collection Outage
If Burrow's detection has been offline for an extended period (scheduled maintenance, an upstream Microsoft outage, or a rare service interruption), there is a gap between when Burrow last collected audit events and now. Burrow handles this automatically, but it is worth understanding what happens behind the scenes so you can spot when something needs your attention.
What "outage" means here
Burrow polls Microsoft's audit feed on a regular cycle. The cycle keeps a checkpoint of "last completed window" so it knows where to pick up on the next pass. An outage is anything that prevents the cycle from running on schedule:
- Smikar carried out scheduled maintenance on the service.
- Microsoft's audit feed had a transient error that delayed collection.
- A rare service interruption paused collection temporarily.
If the outage is short (a few cycles), the next normal pass catches up automatically and nothing further is required. Longer outages trigger explicit catch-up behaviour.
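As a rough illustration of the checkpoint logic described above, the sketch below classifies the gap between the checkpoint and the current time. The cycle length, the "few cycles" threshold, and the function name are assumptions made for the example; they are not Burrow's actual values or internals.

```python
from datetime import datetime, timedelta, timezone

# Assumed values for illustration only; the real cycle length and
# short-gap threshold are not stated in this article.
CYCLE = timedelta(minutes=10)
SHORT_GAP_CYCLES = 3


def classify_gap(checkpoint: datetime, now: datetime) -> str:
    """Classify the time elapsed since the last completed window."""
    gap = now - checkpoint
    if gap <= CYCLE:
        return "on_schedule"   # nothing was missed
    if gap <= CYCLE * SHORT_GAP_CYCLES:
        return "short_gap"     # the next normal pass catches up
    return "catch_up"          # explicit catch-up behaviour applies


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(classify_gap(now - timedelta(hours=6), now))
```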
Automatic catch-up
When Burrow comes back online and sees a gap between its checkpoint and the current time, it enters catch-up mode:
- Instead of waiting for the normal cycle cadence, Burrow pulls 10-minute chunks of historical data back-to-back, with a short pause between each chunk.
- Each chunk completes a full collection → rules-evaluation → alerts-written cycle. Alerts from the historical period land in the dashboard chunk by chunk, so live detection resumes progressively rather than waiting until the entire gap is covered.
- The AI triage step, incident correlation, behavioural profile builder, and email delivery layer continue processing alerts as the chunks come in — so emails for caught-up alerts flow normally.
- Once Burrow is caught up to within one cycle of real time, it resumes the standard cadence.
You do not need to do anything during catch-up. The dashboard's home page shows live counts updating as alerts come in; the Alerts page populates with the historical events.
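The chunking behaviour can be pictured with a small sketch. It splits the gap into back-to-back 10-minute windows and stops once the remaining gap is within one cycle. The cycle length and the function name are hypothetical, and the sketch treats "now" as fixed, whereas a real catch-up run would see the current time keep advancing.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple

CHUNK = timedelta(minutes=10)  # chunk size stated in this article
CYCLE = timedelta(minutes=10)  # assumed length of one normal cycle


def catch_up_windows(
    checkpoint: datetime,
    now: datetime,
    chunk: timedelta = CHUNK,
    cycle: timedelta = CYCLE,
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield (start, end) windows until the gap is within one cycle."""
    start = checkpoint
    while now - start > cycle:
        end = min(start + chunk, now)
        # Each window would go through collection, rules evaluation,
        # and alert writing before the next window begins.
        yield start, end
        start = end


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for start, end in catch_up_windows(now - timedelta(minutes=35), now):
        print(start.isoformat(), "->", end.isoformat())
```

With a 35-minute gap, this yields three windows and leaves the final 5 minutes to the next standard pass.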
The 7-day cap
Microsoft retains audit events for only 7 days. If Burrow's outage lasted longer than that, audit data older than 7 days can no longer be recovered from Microsoft. Burrow logs anything it could not recover as a data-loss event, visible in the dashboard.
For an outage approaching 7 days, raise a support ticket before catch-up completes — the engineering team can advise on the recovery approach and confirm what is recoverable.
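To make the arithmetic of the cap concrete, here is a minimal sketch that works out where recovery can start and how much data falls outside the 7-day retention period. The function name and return shape are illustrative assumptions, not part of Burrow.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

RETENTION = timedelta(days=7)  # retention period stated in this article


def recoverable_range(
    checkpoint: datetime,
    now: datetime,
    retention: timedelta = RETENTION,
) -> Tuple[datetime, timedelta]:
    """Return (recover_from, lost): the earliest recoverable time and
    the span of audit data that is no longer available."""
    oldest_available = now - retention
    recover_from = max(checkpoint, oldest_available)
    lost = recover_from - checkpoint
    return recover_from, lost


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    recover_from, lost = recoverable_range(now - timedelta(days=9), now)
    print("recover from:", recover_from.isoformat(), "lost:", lost)
```

For a 9-day outage, recovery starts 7 days back and the first 2 days are reported as lost.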
Transient errors during catch-up
If individual chunks hit a transient Microsoft-side error (rate limit, 5xx response, network timeout), Burrow retries each one up to three times with backoff. Chunks that recover after retry log as successful; chunks that fail after all retries log as data-loss events. The next chunk continues regardless — one bad chunk does not stop catch-up.
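The retry behaviour can be sketched as follows. The article does not say what kind of backoff Burrow uses, so exponential backoff with a 1-second base delay is an assumption here, as are the TransientError class and the function names.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical stand-in for a rate limit, 5xx response, or timeout."""


def collect_with_retry(
    fetch_chunk: Callable[[], object],
    retries: int = 3,                 # "up to three times", per the article
    base_delay: float = 1.0,          # assumed base delay in seconds
    sleep: Callable[[float], object] = time.sleep,
) -> str:
    """Attempt a chunk once, then retry with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            fetch_chunk()
            return "success"
        except TransientError:
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)
    # All retries exhausted: record the chunk as lost and let the
    # caller move on to the next chunk.
    return "data_loss"
```

A caller would loop over the chunks and record each result, continuing to the next chunk whether the outcome is "success" or "data_loss".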
How to know catch-up is happening
Two signals on the dashboard:
- Home page → Trend chart: the chart shows daily alert counts. If catch-up is running, today's column will be visibly higher than usual as historical alerts back-fill.
- Settings → Diagnostic tab: shows audit-collection progress. The "Audit collection caught up through" timestamp moves forward as chunks complete. When it is within one detection pass of the current time, catch-up is done.
When to raise a ticket
The catch-up flow handles most outages without intervention. Raise a support ticket if:
- The outage was longer than 7 days.
- Catch-up has been running for more than 24 hours and the "Audit collection caught up through" timestamp is not advancing.
- The dashboard's Diagnostic tab shows data-loss events you do not expect.
- You see no alerts at all for an extended period after catch-up should have completed.
In each case, capture the Diagnostic tab's state in a screenshot when you raise the ticket — that is the fastest way to a diagnosis.
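If you track these conditions in your own runbook or monitoring, the checklist above could be expressed along these lines. The CatchUpStatus fields are hypothetical values you would read off the dashboard yourself; Burrow does not expose them under these names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class CatchUpStatus:
    # Hypothetical fields, filled in by hand from the dashboard.
    outage_length: timedelta
    catch_up_running_for: timedelta
    timestamp_advancing: bool
    unexpected_data_loss: bool
    alerts_seen_after_catch_up: bool


def ticket_reasons(status: CatchUpStatus) -> List[str]:
    """Return the reasons, if any, to raise a support ticket."""
    reasons = []
    if status.outage_length > timedelta(days=7):
        reasons.append("outage longer than 7 days")
    if (status.catch_up_running_for > timedelta(hours=24)
            and not status.timestamp_advancing):
        reasons.append("catch-up stalled for more than 24 hours")
    if status.unexpected_data_loss:
        reasons.append("unexpected data-loss events")
    if not status.alerts_seen_after_catch_up:
        reasons.append("no alerts after catch-up")
    return reasons
```

An empty list means none of the conditions apply and catch-up can be left to finish on its own.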
After catch-up — sanity check
When catch-up completes, the historical alerts have been emailed (subject to your normal Notifications gates). Your inbox may have a burst of older alerts. Review them with normal triage — most will be benign by virtue of being old, but anything that looks like an in-progress attack at the time of the outage deserves the standard investigation workflow.
See also
- How an alert flows through Burrow — the normal pipeline that resumes after catch-up.
- Investigating an alert — for the post-catch-up alert review.
Need help? Email support@smikar.com.