Exchange 2013: EventID 15004 (version buckets), mails are not sent or received

Frank Zöchling

10 years ago

I have already received several inquiries about Back Pressure. When Exchange is in the back pressure state, mails are received or sent only slowly, or not at all.

The following situation is often described:

Exchange no longer sends or receives mails and event 15004 is logged, stating that the value for version buckets is too high. The problem may occur only sporadically. After restarting the Exchange services or the server, mails can be sent and received again until the problem recurs at some point.

The resource pressure increased from Normal to High.

The following resources are under pressure:
Version buckets = 8 [High] [Normal=1 Medium=2 High=3]

The following components are disabled due to back pressure:
Inbound mail submission from Hub Transport servers
Inbound mail submission from the Internet
Mail submission from Pickup directory
Mail submission from Replay directory
Mail submission from Mailbox server
Mail delivery to remote domains
Content aggregation
Mail resubmission from the Message Resubmission component
Mail resubmission from the Shadow Redundancy component

Brief explanation:

Exchange processes mails using batches, batch points and version buckets. A message with a large attachment can be split into several batches, which are then processed. Version buckets are a list of transactions from batches that have not yet been written to the transaction logs of the queue database. In essence, they are a transaction log held in the server's memory.

The Back Pressure feature monitors the number of version buckets and batches that currently exist only in the server's memory and intervenes in mail flow if the values become too high.

The most common cause:

Version buckets or batch point values that are too high are caused by mails that are too large for the Exchange server to handle. If, for example, a mail with a 50 MB attachment is received that is to be delivered to several recipients, the values for version buckets and batch points can skyrocket. In this case, Back Pressure intervenes and rejects mails to reduce the load. Exchange then has some time to transfer the changes to the queue database.

Fix the problem:

Probably the simplest solution to the problem is to reduce the maximum size of the messages. There are better ways of exchanging large data than sending it by e-mail. Unfortunately, this is not always possible. So if large mails have to be received, it must be ensured that the peak load caused by large mails can be absorbed. Storage then quickly becomes a bottleneck. A small calculation example:

Mail size	Number of recipients	Average mails per second	Peak load
5 MB	10	10	500 MB / second
20 MB	5	10	1000 MB / second
50 MB	10	10	5000 MB / second
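
The arithmetic behind the table is simply mail size × number of recipients × mails per second. A minimal PowerShell sketch that reproduces the three rows; the function name Get-PeakLoadMB is made up for illustration, and the result is only a rough upper bound, not a measurement:

```powershell
# Hypothetical helper: rough peak load in MB/s, same formula as the table above
function Get-PeakLoadMB {
    param(
        [double]$MailSizeMB,
        [int]$Recipients,
        [double]$MailsPerSecond
    )
    $MailSizeMB * $Recipients * $MailsPerSecond
}

# The three rows of the table
$rows = @(
    @{ MailSizeMB = 5;  Recipients = 10; MailsPerSecond = 10 },
    @{ MailSizeMB = 20; Recipients = 5;  MailsPerSecond = 10 },
    @{ MailSizeMB = 50; Recipients = 10; MailsPerSecond = 10 }
)
$peakLoads = foreach ($row in $rows) { Get-PeakLoadMB @row }
$peakLoads   # 500, 1000, 5000
```

Plugging in your own mail size limit and typical recipient counts gives a first idea of what the storage would have to absorb.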

The very brave can also adjust the behavior of Back Pressure by increasing the maximum values (NOT RECOMMENDED). To do this, the values for "VersionBucketsHighThreshold", "VersionBucketsMediumThreshold" and "VersionBucketsNormalThreshold" can be set higher in the following file:

C:\Program Files\Microsoft\Exchange Server\V15\Bin\EdgeTransport.exe.config
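
Before changing anything, it can help to look at what is currently set. A read-only sketch, assuming the thresholds are stored as `<add key="..." value="..."/>` entries under `<appSettings>`; Get-VersionBucketSetting is a hypothetical helper name. If nothing is returned, the keys are presumably not set explicitly in the file:

```powershell
# Hypothetical helper: list version bucket settings from an EdgeTransport.exe.config (read-only)
function Get-VersionBucketSetting {
    param([string]$ConfigPath)
    [xml]$config = Get-Content -Path $ConfigPath -Raw
    # Assumption: the thresholds are <add key="..." value="..."/> entries under <appSettings>
    $config.configuration.appSettings.add |
        Where-Object { $_.key -like 'VersionBuckets*' } |
        Select-Object key, value
}

# Only query the file if it exists on this machine
$path = 'C:\Program Files\Microsoft\Exchange Server\V15\Bin\EdgeTransport.exe.config'
if (Test-Path $path) { Get-VersionBucketSetting -ConfigPath $path }
```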

However, it is better to find the bottleneck that is responsible for the high values. I have reproduced the problem in my test environment (I have set the maximum values in the EdgeTransport.exe.config extremely low to reproduce the behavior). At some point I receive the event 15004 with the indication that the version bucket values are too high:

The event occurred at 20:42, so you can start by searching the message tracking logs to find out whether large mails were actually sent during this period:

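# Time window around the event (date format that parses regardless of the system culture)
$start = [datetime]"2014-11-19 20:40"
$end = [datetime]"2014-11-19 20:45"
# List all tracked messages in the window, excluding senders that contain "health" (health mailbox probes)
Get-MessageTrackingLog -Start $start -End $end -ResultSize unlimited | where {$_.sender -notmatch "health"} | select timestamp,sender,recipients,totalbytes,eventid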
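
To make large mails stand out in that output, the result can be filtered by TotalBytes and sorted. A sketch, assuming a size limit of 10 MB, which is an arbitrary value chosen for illustration; Select-LargeMessage is a hypothetical helper, not an Exchange cmdlet:

```powershell
# Hypothetical helper: keep only tracking log entries above a size limit, largest first
function Select-LargeMessage {
    param(
        [Parameter(ValueFromPipeline = $true)]$Entry,
        [long]$MinBytes = 10MB   # assumption: arbitrary limit, adjust as needed
    )
    begin   { $hits = @() }
    process { if ($Entry.TotalBytes -ge $MinBytes) { $hits += $Entry } }
    end     { $hits | Sort-Object -Property TotalBytes -Descending }
}

# Usage in the Exchange Management Shell:
# Get-MessageTrackingLog -Start $start -End $end -ResultSize unlimited | Select-LargeMessage
```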

Here we can actually see a mail that was slightly larger:

I ran Perfmon in the background and it looks like this:

The performance counter can be queried via PowerShell for all Exchange servers, which could also serve as the basis for a small monitoring script:

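# Query the allocated version buckets on every mailbox server
$exchangeserver = Get-MailboxServer

$versionbuckets = foreach ($server in $exchangeserver)
{
	$servername = $server.Name
	# Counter path as named on a German-language server; on an English system the counter is
	# "\MSExchange Database(Information Store)\Version buckets allocated"
	$counter = "\\$servername\msexchange-datenbank(information store)\zugewiesene versionsbuckets"
	$versionbucketscount = (Get-Counter $counter).CounterSamples.CookedValue
	[pscustomobject]@{ Servername = $servername; Versionbuckets = $versionbucketscount }
}
$versionbuckets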
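
For a monitoring script, the counter values could then be compared with thresholds of your choice. A minimal sketch, assuming a plain "value at or above threshold" comparison; the real Back Pressure logic may well differ in detail. Get-VersionBucketLevel is a hypothetical name, and the thresholds in the example are just the lowered test values from the event text above, not Exchange defaults:

```powershell
# Hypothetical helper: map a version bucket count to a level.
# Assumption: plain ">= threshold" comparison, no hysteresis.
function Get-VersionBucketLevel {
    param(
        [double]$Count,
        [double]$MediumThreshold,
        [double]$HighThreshold
    )
    if ($Count -ge $HighThreshold) { 'High' }
    elseif ($Count -ge $MediumThreshold) { 'Medium' }
    else { 'Normal' }
}

# Example with the values from the event text (count 8, Medium=2, High=3)
$level = Get-VersionBucketLevel -Count 8 -MediumThreshold 2 -HighThreshold 3
$level   # High
```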

Unfortunately, there is no universal solution along the lines of "Hey, the disk is too slow" or "Just add more RAM". But once you know the time and the trigger, you can dig deeper into the analysis. Important factors are:

CPU utilization
RAM
Storage latency
Storage utilization (IOPS)

In a virtual environment, there are a few more factors (I can only speak for VMware here):

CPU ready values
Storage latency
HBA queues (with Fibre Channel)
ESX Host utilization

In my experience, the most common cause was the storage. Sometimes it was overloaded HBAs or overloaded network cards for iSCSI, and sometimes the storage itself simply couldn't deliver enough IOPS for peak loads. Heavily overcommitted ESX hosts are also a frequent sight.

It is therefore important to look at what is happening in the infrastructure when the event occurs. Are virus scans running? Is a backup running? And so on.

A good approach is the following: open all relevant consoles that can provide data (storage monitoring, esxtop or vSphere client performance data, Windows Perfmon, LAN switch consoles, FC switch consoles, etc.) and try to reproduce the problem outside business hours. Simply send a 50 MB mail to 5 recipients and see where bottlenecks occur. These can then be eliminated specifically, or the maximum mail size can be reduced...