backup MX: queue/retry times

Need help? Ask here.

Moderator: Moderators

Post Reply
cbrace
Posts: 23
Joined: Sun Dec 31, 2006 5:01 am
Location: Amsterdam
Contact:

backup MX: queue/retry times

Post by cbrace »

Hi,

I started using rollernet last year primarily for backup MX. On the one hand, I found the three-week retention queue helpful when my mailserver went down while I was on holiday. On the other hand, I subsequently discovered that messages sometimes linger for a long time in the rollernet queue even when the primary server is online.

To be more specific: I have found that a certain percentage of messages get routed to the backup MX even when the primary is online -- for whatever reason. They nonetheless tend to get delivered fairly quickly.

In addition, a certain number of messages get routed to backup MX when the primary is temporarily offline, for a few minutes or hours (normally never more than a day).

Alas, in this case, I have seen that it sometimes takes many days or even a week or more for such messages to get delivered. A couple of times such messages have taken three weeks to get delivered.

I would expect something like the following: when a message can't be delivered, the system retries every twenty minutes for four hours. After hours, once an hour, after a day, once a day...

In practice, this doesn't appear to be the case. Any thoughts on this?

Thanks.
RollerNetSupport
Site Admin
Posts: 598
Joined: Wed Nov 17, 2004 10:05 pm
Location: Nevada
Contact:

Re: backup MX: queue/retry times

Post by RollerNetSupport »

cbrace wrote:Hi,

I started using rollernet last year primarily for backup MX. On the one hand, I found the three-week retention queue helpful when my mailserver went down while I was on holiday. On the other hand, I subsequently discovered that messages sometimes linger for a long time in the rollernet queue even when the primary server is online.

To be more specific: I have found that a certain percentage of messages get routed to the backup MX even when the primary is online -- for whatever reason. They nonetheless tend to get delivered fairly quickly.
That's very, very common. Common enough that it's the top entry in the FAQ: https://acc.rollernet.us/help/faq.php

cbrace wrote:In addition, a certain number of messages get routed to backup MX when the primary is temporarily offline, for a few minutes or hours (normally never more than a day).

Alas, in this case, I have seen that it sometimes takes many days or even a week or more for such messages to get delivered. A couple of times such messages have taken three weeks to get delivered.

I would expect something like the following: when a message can't be delivered, the system retries every twenty minutes for four hours. After hours, once an hour, after a day, once a day...

In practice, this doesn't appear to be the case. Any thoughts on this?

Thanks.
If you're familiar with Postfix, the specific settings we currently use in main.cf that control this are:

queue_run_delay = 300s
maximal_backoff_time= 1w
minimal_backoff_time = 15m
maximal_queue_lifetime= 3w

If you aren't familiar with Postfix, this page - http://www.postfix.org/QSHAPE_README.ht ... rred_queue - explains the process deferred messages take.

In the event individual messages appear to be acting outside the realm of normal behavior, you can always send us the queue ID or instance ID of the offending message (complete copies of the message headers are best) and we can attempt to trace it and see what happened.
Technical Support support@rollernet.us
Roller Network LLC
RollerNetSupport
Site Admin
Posts: 598
Joined: Wed Nov 17, 2004 10:05 pm
Location: Nevada
Contact:

Post by RollerNetSupport »

I should also add that that longest delay anything in the queue should see is once per week, as we use "fast_flush_refresh_time = 1w", which is analogous to doing an ERTN once a week. The fast flush feature is also what's used to generate the queue list and queue status display (https://acc.rollernet.us/mail/queue.php) in the account control center.
Technical Support support@rollernet.us
Roller Network LLC
cbrace
Posts: 23
Joined: Sun Dec 31, 2006 5:01 am
Location: Amsterdam
Contact:

Post by cbrace »

Thanks for the very helpful, detailed reply. I have been running Postfix here as my mail server for a year and half, but I am not yet acquainted with all of its intricacies.

In addition to the reasons listed in the FAQ, which all make sense in my case, I am also using the OpenBSD spamd greyslisting daemon, which would certainly cause some smtp hosts to fall back on the backup MX.

Because the long queue times were interfering with work-related email, I had to switch it off, but here is the complete header from a message that was routed via your servers when backup MX was enabled:

Code: Select all

Return-Path: <xxxxxx@worldskills.org>
X-Original-To: cb@lim.nl
Delivered-To: cb@lim.nl
Received: from localhost (localhost [127.0.0.1])
        by venus (Postfix) with ESMTP id 9CAC79B1C
        for <cb@lim.nl>; Thu, 10 Apr 2008 00:22:13 +0200 (CEST)
X-Spam-Flag: NO
X-Spam-Score: 0
X-Spam-Level: 
X-Spam-Status: No, score=0 tagged_above=-9999 required=4
        tests=[HTML_MESSAGE=0.001, SPF_HELO_PASS=-0.001]
Received: from venus ([127.0.0.1])
        by localhost (venus.lim.nl [127.0.0.1]) (amavisd-new, port 10024)
        with ESMTP id KgBz1lPzpirP; Thu, 10 Apr 2008 00:22:08 +0200 (CEST)
Received: from mail.rollernet.us (mail.rollernet.us [208.11.75.2])
        by venus (Postfix) with ESMTP id E217D9AF6
        for <cb@lim.nl>; Thu, 10 Apr 2008 00:21:58 +0200 (CEST)
Received: from mail.rollernet.us (localhost [127.0.0.1])
        by mail.rollernet.us (Postfix) with ESMTP id 2FACB5853E45
        for <cb@lim.nl>; Thu, 13 Mar 2008 07:14:06 -0700 (PDT)
Received: from wic001ef.exchange.server-login.com (wic001ef.exchange.server-login.com [202.148.129.65])
        by mail.rollernet.us (Postfix) with ESMTP
        for <cb@lim.nl>; Thu, 13 Mar 2008 07:14:05 -0700 (PDT)
Received: from WIC002EBCLV2.exchange.server-login.com ([202.148.131.8] RDNS failed) by wic001ef.exchange.server-login.com with Microsoft SMTPSVC(6.0.3790.3959);
         Thu, 13 Mar 2008 23:42:11 +1000
x-mimeole: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: multipart/alternative;
        boundary="----_=_NextPart_001_01C8850F.C082215B"
Subject: RE: short text xxxxx
Date: Thu, 13 Mar 2008 23:40:08 +1000
Message-ID: <BA798460F961934C9FD3EC23E22B229B010ADBD9@WIC002EBCLV2.exchange.server-login.com>
In-Reply-To: <aec9371b0803130637i610f4cakd339dd2696032c11@mail.gmail.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: short text xxxxxx
Thread-Index: AciFD2Urw7mt4jPST6isIhn4mu6HxgACJO/w
From: "xxxxxx" <xxxxx@worldskills.org>
To: "Colin Brace" <cb@lim.nl>
Cc: "xxxxxx" <xxxxx@worldskills.org>
X-OriginalArrivalTime: 13 Mar 2008 13:42:11.0453 (UTC) FILETIME=[0A0B1AD0:01C88510]
X-Rollernet-Spf: tagging disabled
X-Rollernet-Abuse: Processed by Roller Network Mail Services. See our abuse policy at http://rollernet.us/abuse.php
X-Rollernet-Tracking: Tracking ID 1338.47d936ad.c44d3.0
As you can see, mail.rollernet.us received the message on 13 March. But my server, venus.lim.nl, didn't get it until 10 April.

During this period, venus.lim.nl was online, albeit with the kind of sporadic outages mentioned in your FAQ (ie, according to my logs, my dyn IP, which is configured to have a TTL of 300s, changed twice).

If I understand your information correctly, your postfix installation should have been trying to redeliver the message least every five minutes, but in any case no less than once a week. In this instance, that seems not to be have been the case; or possibly, it tried minimally only three times, over a period of three weeks, at the very moments when my server was momentarily inaccessible.

Obviously I am just guessing. Do you have any ideas what might have happened here?

Thanks.
RollerNetSupport
Site Admin
Posts: 598
Joined: Wed Nov 17, 2004 10:05 pm
Location: Nevada
Contact:

Post by RollerNetSupport »

Here's the errors from our logs for that message. The message was very tiny (10107 bytes, just under 10k). The server was trying, but it was unable to complete the transaction. The end-of-data (EOD) timeout is controlled by the smtp_data_done_timeoutparameter, which we have set to 600 seconds (the default). This means that when we send EOD, the other end has 600 seconds to respond before we give up and close the connection.

Code: Select all

*************************** 1. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.40.195.38
    status: deferred (lost connection with lim.nl[78.40.195.38] while sending end of data -- message may be sent more than once)
 timestamp: 2008-03-13 07:20:31
*************************** 2. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.40.195.38
    status: deferred (lost connection with lim.nl[78.40.195.38] while sending end of data -- message may be sent more than once)
 timestamp: 2008-03-13 07:46:19
*************************** 3. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.40.195.38
    status: deferred (lost connection with lim.nl[78.40.195.38] while sending end of data -- message may be sent more than once)
 timestamp: 2008-03-13 08:26:19
*************************** 4. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.40.195.38
    status: deferred (lost connection with lim.nl[78.40.195.38] while sending end of data -- message may be sent more than once)
 timestamp: 2008-03-13 09:46:20
*************************** 5. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.40.195.38
    status: deferred (lost connection with lim.nl[78.40.195.38] while sending end of data -- message may be sent more than once)
 timestamp: 2008-03-13 12:26:19
*************************** 6. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.40.195.38
    status: deferred (lost connection with lim.nl[78.40.195.38] while sending end of data -- message may be sent more than once)
 timestamp: 2008-03-13 17:46:18
*************************** 7. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.40.195.38
    status: deferred (lost connection with lim.nl[78.40.195.38] while sending end of data -- message may be sent more than once)
 timestamp: 2008-03-14 04:26:18
*************************** 8. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.40.195.38
    status: deferred (host lim.nl[78.40.195.38] said: 451 Temporary failure, please try again later. (in reply to DATA command))
 timestamp: 2008-03-15 01:42:08
*************************** 9. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.40.195.38
    status: deferred (lost connection with lim.nl[78.40.195.38] while sending end of data -- message may be sent more than once)
 timestamp: 2008-03-16 20:18:19
*************************** 10. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.40.195.38
    status: deferred (lost connection with lim.nl[78.40.195.38] while sending end of data -- message may be sent more than once)
 timestamp: 2008-03-19 15:19:31
*************************** 11. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.27.18.102
    status: deferred (lost connection with lim.nl[78.27.18.102] while sending end of data -- message may be sent more than once)
 timestamp: 2008-03-25 23:32:59
*************************** 12. row ***************************
      qnum: 2FACB5853E45
relay_name: lim.nl
  relay_ip: 78.27.18.102
    status: sent (250 2.0.0 Ok: queued as E217D9AF6)
 timestamp: 2008-04-09 15:23:08

Now, as for the retry time, you can see the doubling of backoff times in action looking at the timestamps. While the queue is scanned every 5 minutes for messages, the *minimum* time for the first retry is 15 minutes. If that attempt failed, the time is doubled, and the second attempt will be at 30 minutes. (plus or minus the actual next scan start time and the time it takes to run through the queue, or longer if the destination is "dead"). If that fails, then it's an hour, then 2 hours, then 4 hours all the way up to one week between attempts.

I mentioned "dead" destinations - the postfix qmgr keeps track of delivery failures in memory. So, if there was an attempt already in progress to deliver to 78.40.195.38 that failed and it remembers, postfix won't try other messages to the same destination because it may clog up the queue with "dead" messages.
Technical Support support@rollernet.us
Roller Network LLC
cbrace
Posts: 23
Joined: Sun Dec 31, 2006 5:01 am
Location: Amsterdam
Contact:

Post by cbrace »

Interesting. My first inclination is to think: hey, your server was greytrapped by spamd! But that would be impossible, since spamd only greytraps hosts which have sent mail to invalid addresses in my domains which I have entered myself in its database. My rollernet Valid User Table only passes valid addresses.

A quick search of my Postfix logs reveals that during March and April 341 sessions were initiated by Rollernet servers, of which 325 were by mail.rollernet.us:

Code: Select all

$ zcat maillog.200803* maillog.200804* | grep rollernet.us | grep ": connect" | wc -l 
     341

$ zcat maillog.200803* maillog.200804* | grep mail.rollernet.us | grep ": connect" | wc -l 
     325
However, I then take a look in the spamd logs:

Code: Select all

$ zcat spamd*.bz2 | grep 208.11.75.2 | grep ": connected" | grep spamd | wc -l
    1235
Argghh! mail.rollernet.us has been blocked 1235 times between 3 March and 3 April (no hits for 208.79.241.2). I think this corresponds to the first entry in your log entries (all times UTC+1):

Code: Select all

Mar 13 15:13:25 venus spamd[1055]: 208.11.75.2: connected (2/2), lists: spamd-greytrap
[...]
Mar 13 15:19:48 venus spamd[1055]: 208.11.75.2: disconnected after 383 seconds. lists: spamd-greytrap
(07:20:31 =~ 15:19:48; the clock on my server tends to run a little slow)

The last spamd log entry was on 3 April:

Code: Select all

Apr  3 06:16:21 venus spamd[53987]: 208.11.75.2: connected (3/3), lists: spamd-greytrap
Apr  3 06:22:44 venus spamd[53987]: 208.11.75.2: disconnected after 383 seconds. lists: spamd-greytrap
At which point the entry for 208.11.75.2 appears to have expired (spamd expires greytrapped hosts after 25 days).

Then, on 10 April, mail.rollernet.us connected again and delivered five spooled messages, including the one I indicated above that took three weeks to deliver:

Code: Select all

Apr 10 00:21:58 venus postfix/smtpd[57481]: connect from mail.rollernet.us[208.11.75.2]
Apr 10 00:21:58 venus postfix/smtpd[57483]: connect from mail.rollernet.us[208.11.75.2]
Apr 10 00:21:58 venus postfix/smtpd[57484]: connect from mail.rollernet.us[208.11.75.2]
Apr 10 00:21:58 venus postfix/smtpd[57485]: connect from mail.rollernet.us[208.11.75.2]
Apr 10 00:21:58 venus postfix/smtpd[57486]: connect from mail.rollernet.us[208.11.75.2]
The kicker is, rollernet was able to deliver messages during the time it was greytrapped! For example, here is the log entry for a message on 2 April, before the last spamd log entry:

Code: Select all

Apr  2 03:16:11 venus postfix/smtpd[26053]: connect from mail.rollernet.us[208.11.75.2]
Apr  2 03:16:11 venus postfix/smtpd[26053]: B8B479ACE: client=mail.rollernet.us[208.11.75.2]
Apr  2 03:16:12 venus postfix/cleanup[26021]: B8B479ACE: message-id=<rt-3.4.5-12091-1207099263-1070.192129-23-0@cantv.com.ve>
Apr  2 03:16:12 venus postfix/qmgr[98151]: B8B479ACE: from=<xxxxx@seguridad.cantv.com.ve>, size=8714, nrcpt=2 (queue active)
Apr  2 03:16:12 venus amavis[71702]: (71702-18) ESMTP::10024 /var/amavis/tmp/amavis-20080401T175148-71702: <xxxxx@seguridad.cantv.com.ve> -> <cb@lim.nl> SIZE=8
714 BODY=8BITMIME Received: from venus ([127.0.0.1]) by localhost (venus.lim.nl [127.0.0.1]) (amavisd-new, port 10024) with ESMTP; Wed,  2 Apr 2008 03:16:12 +0200 (CEST)
Apr  2 03:16:12 venus amavis[71702]: (71702-18) Checking: DjtfF3hhs2pr <xxxxx@seguridad.cantv.com.ve> -> <cb@lim.nl>
Apr  2 03:16:12 venus amavis[71702]: (71702-18) p001 1 Content-Type: text/plain, size: 6608 B, name: 
Apr  2 03:16:15 venus postfix/smtpd[26053]: disconnect from mail.rollernet.us[208.11.75.2]

[snip amavis stuff]

Apr  2 03:16:17 venus postfix/virtual[26084]: 981B09AF4: to=<cb@lim.nl>, relay=virtual, delay=0.04, delays=0.03/0/0/0, dsn=2.0.0, status=sent (delivered to maildir)
At this point, I don't know what to say. It appears that I inadvertently allowed mail.rollernet.us to be greytrapped, but the ban was only partial (?!?!?); it was blocked 1235 times and allowed through 325 times. I'll ask around on a openBSD list; if I learn anything, I will let you know.

If you see anything in the above that I don't see, please let me know.

In any case, sorry for the trouble, and thanks for your help.
Post Reply