SendGrid emails getting delayed

Hello,
We are using an emergency alert system to blast out emails during an emergency (or drill). That system uses SendGrid to send the emails. Most of the emails got through perfectly; however, some are delayed. SendGrid is reporting to us that the MX record lookup is timing out. Side note: we are not SendGrid customers, our alerting system is, and they are not using any of our domain names as the sending address. Here are the details:

The “drill started” emails ran from 1:41 to 1:58, with one at 2:13. Total of 171 emails.
The “drill ended” emails ran from 1:48 to 1:51. Total of 171 emails.
I have not found anything useful in the logs or activity so far, but I’m also new to Cloudflare.

Thanks!

What does the organisation behind the alerting system have to say about the issue with their system?

As such, it doesn’t sound like a problem on your domain’s end.

None of these details give any information that helps in troubleshooting email deliveries.

Given your explanation though, it will be between the organisation behind the alerting system and SendGrid to figure out a fix.

What does the organisation behind the alerting system have to say about the issue with their system?

We have been having three-way calls with them since August of last year, trying to get this figured out.

As such, it doesn’t sound like a problem on your domain’s end.

It is also not on their end or on FortiMail’s end. It took us a while to get it narrowed down to our external DNS, and I’m still not convinced that it is the cause. I’m still waiting for them to confirm they are getting the same error, but I’m seeing the exact same issues on my end. Out of over a hundred other districts, we are 1 of 7 that are having this issue. There is a lot of finger-pointing going on.

None of these details give any information that helps in troubleshooting email deliveries.

Yes, the details were just to make sure it was not hitting some sort of rate limit that I did not know about.
Any advice or suggestions would be helpful, especially a log or something that could show them it is not the DNS.
Thanks

Mail took between 1 and 9 minutes to be delivered for 99%+ of the messages, if I am reading this correctly.

What delay are they troubleshooting?

So the first one took 17 minutes to deliver all the emails. To be precise, 7.6% were delayed more than 10 minutes. That is the best it has done so far! We have had more than 200 emails (out of 312) delayed for more than 10 minutes, and several of them took more than an hour.

Are they all going to the same destination domain? Then the delay is highly unlikely to be related to the MX record lookup. Much more likely would be the receiving MTA imposing rate limiting on inbound requests. What do your mail server logs show during the test?
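
One way to gather evidence about the MX lookup itself is to time it repeatedly. Below is a minimal sketch using only the Python standard library. The function names `build_mx_query` and `time_mx_lookup`, the resolver address `1.1.1.1`, and the domain `example.com` are placeholders chosen for illustration, and the result only reflects the path from the machine running it, not what SendGrid's resolvers see.

```python
import socket
import struct
import time


def build_mx_query(domain: str, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet asking for the MX records of `domain`."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: every label is prefixed with its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in domain.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 15 = MX, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 15, 1)


def time_mx_lookup(domain: str, resolver: str = "1.1.1.1", timeout: float = 5.0):
    """Send one MX query over UDP and return (elapsed_seconds, rcode).

    rcode is None when no reply arrived within `timeout` seconds.
    """
    query = build_mx_query(domain)
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (resolver, 53))
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return time.monotonic() - start, None
    elapsed = time.monotonic() - start
    # The low 4 bits of the flags word hold the response code (0 = NOERROR).
    flags = struct.unpack("!H", reply[2:4])[0]
    return elapsed, flags & 0x000F
```

Calling `time_mx_lookup("example.com")` in a loop during a drill, and pointing `resolver` at one of the domain's authoritative name servers instead of a public resolver, would give a simple log of lookup times and timeouts to share with the other parties.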

That was the first thing we worked on. Our mail server is FortiMail, and we have the rules set up so that the alerts bypass the normal limits. It is only taking the emails a second or two to pass through the FortiMail. Last time I heard from SendGrid, the error they were getting was an MX record timeout. So SendGrid is getting a timeout, and FortiMail only has each email for a second or two.

Did they maybe mean that the connection to your mail server timed out and they then tried to connect to a backup MX?

Do you have an address that I can send a few hundred emails to for testing?

We never got the error codes from SendGrid’s side, but from the digging Fortinet did on the FortiMail, we could not see anything wrong on that end. We can use [email protected] as a test email.

Did all 200 emails arrive, and how long did it take from first to last?

THANKS!! That was the kind of test I’ve been needing! We got all 200 in less than 10 seconds. That also removes the FortiMail as a possible issue. It is down to just SendGrid now!
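
For anyone wanting to repeat this kind of burst test, here is a minimal sketch using Python's standard `smtplib`. It is not the tool that was used in this thread. The helper names `build_test_messages` and `send_burst` and the host and addresses in the usage note are hypothetical; substitute a sending server you are allowed to relay through.

```python
import smtplib
import time
from email.message import EmailMessage


def build_test_messages(sender: str, recipient: str, count: int) -> list:
    """Create `count` numbered test messages so gaps are easy to spot on arrival."""
    messages = []
    for i in range(1, count + 1):
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = recipient
        msg["Subject"] = f"Delivery test {i}/{count}"
        msg.set_content(f"Test message {i} of {count}.")
        messages.append(msg)
    return messages


def send_burst(messages, smtp_host: str, port: int = 25) -> float:
    """Send all messages over one SMTP session and return the seconds it took."""
    start = time.monotonic()
    with smtplib.SMTP(smtp_host, port, timeout=30) as smtp:
        for msg in messages:
            smtp.send_message(msg)
    return time.monotonic() - start
```

A run could look like `send_burst(build_test_messages("tester@example.org", "alerts-test@example.net", 200), "mail.example.org")`. The numbered subjects make it easy to check on the receiving side that all messages arrived and how long the first-to-last spread was.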

I’d also ensure:

A. Your hosts for your MX record don’t have a ridiculously low TTL (600 seconds would be the min I’d suggest).
B. Any lower-priority MX records are appropriately configured as well, since network issues could prevent connection to a primary MX, and a failover MX that is having trouble accepting mail can make things worse.

We only have one MX record for each domain. The record was set to “auto”. I moved it to 300 seconds so I know what it is set to. Thanks! Y’all have been awesome!

Postfix, which is one popular mail server daemon, will for example by default retry for the first time after 17 minutes if the receiving mail server previously gave a temporary error code, or if the delivery otherwise failed intermittently (e.g. some sort of connection timeout, as mentioned above).

The timeout could simply be due to temporary / intermittent connection failures between SendGrid’s network and the network of the receiving mail server.
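
To see how a few temporary failures can add up to the delays reported in this thread, here is a small sketch of a simplified retry model. The 17-minute first retry is taken from the post above; the doubling of the wait and the 4000-second cap are assumptions made for illustration, not a statement of how SendGrid's or any particular server's queue actually behaves.

```python
def delay_after_failures(failed_attempts: int,
                         first_retry_s: int = 17 * 60,
                         max_backoff_s: int = 4000) -> int:
    """Total seconds a message waits if its first `failed_attempts` tries fail.

    Simplified model: the wait doubles after each failure, up to a cap.
    """
    total = 0
    wait = first_retry_s
    for _ in range(failed_attempts):
        total += wait
        wait = min(wait * 2, max_backoff_s)
    return total


for failures in range(4):
    minutes = delay_after_failures(failures) / 60
    print(f"{failures} failed attempt(s): delivered after about {minutes:.0f} min")
```

Under these assumptions, one failed attempt already costs 17 minutes and two cost 51 minutes, so a handful of intermittent timeouts would be enough to produce messages that arrive more than an hour late.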

There is, however, another possibility: it could be due to some sort of anti-spam technique used on the receiving mail server.

Nolisting and Greylisting are two popular techniques that can be used on the receiving server in an attempt to reduce “fire-and-forget” spam, i.e. spam from senders that won’t retry the delivery later if they see a temporary failure.

Nolisting would normally work in a firewall by rejecting connections with TCP RESET (or, if not explicitly set as returning TCP RESET, it would likely be returning a timeout as mentioned above).

Greylisting works by sending a temporary rejection code, the very first time it sees a specific combination (typically made of sender email, IP address of mail server and receiver email) for the delivery attempt.

Most often, though, when the receiver is using Greylisting, the combination will be remembered for a while, to reduce the (potentially recurring) impact of delayed messages.
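
The Greylisting behaviour described above can be sketched in a few lines. This is a toy model, not any product's implementation: the class name `Greylist`, the 300-second minimum delay, and the 30-day memory are assumptions chosen for illustration.

```python
class Greylist:
    """Toy greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, min_delay_s: int = 300, remember_s: int = 30 * 24 * 3600):
        self.min_delay_s = min_delay_s    # how long a new triplet must wait
        self.remember_s = remember_s      # how long a known triplet is remembered
        self._seen = {}                   # triplet -> (first_seen, last_seen)

    def check(self, client_ip: str, sender: str, recipient: str, now: float) -> int:
        """Return an SMTP-style code: 451 (try again later) or 250 (accept)."""
        key = (client_ip, sender.lower(), recipient.lower())
        entry = self._seen.get(key)
        if entry is not None and now - entry[1] > self.remember_s:
            entry = None                  # too old: treat the triplet as unknown
        if entry is None:
            self._seen[key] = (now, now)  # first sighting: temporary rejection
            return 451
        first_seen, _ = entry
        self._seen[key] = (first_seen, now)
        if now - first_seen < self.min_delay_s:
            return 451                    # retried too quickly
        return 250                        # sender came back later: accept
```

With this model, the first message from a new triplet is always deferred, and how long it is delayed depends entirely on when the sending server decides to retry.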

My personal advice, though: I would always suggest refraining from using SendGrid. They have a decade-long history of housing systemic spam / phishing / malware attacks without doing anything at all to mitigate the situation.

You don’t have to take my word for that, if you don’t wish to. You can also check the Swiss Government Computer Emergency Response Team’s website:

https://www.govcert.admin.ch/blog/28/the-rise-of-dridex-and-the-role-of-esps

TL;DR: It wouldn’t surprise me if SendGrid is simply getting the cold shoulder from other organisations, due to their reluctance to mitigate the situations they have had.

We have set up our mail filter to bypass the 3 IP addresses SendGrid is using for the alert system. Unfortunately, it is not us using SendGrid. It is the vendor we are using for the alerting, so we have no control over that.

I’m crossing my fingers that it will help you.

Yep, I understood that much, but would just like to let you know that SendGrid’s constant issues are a recurring topic in the email community, and that it might be related, if e.g. FortiGate tries to give them the (well-deserved) cold shoulder.

As mentioned above though, there is also the chance, that it might just be temporary network glitches between the two involved networks, and in that case, no bypass filters will be able to help.
