Adventures with Sendmail

A Story of DevOps and Culture

I know what you’re thinking…

“Sendmail? Why would you choose that?”

Well, I can answer that question: the choice wasn’t mine! The organization I work for has gone through a couple of mergers and demergers over the past 30 years. At one point, one of those organizations managed its own mail server. When the merger happened, responsibility for that mail server was never fully ironed out, and the organization moved to a Google Apps domain, relegating the old mail server to a simple internal SMTP relay.

Well, today that relay had some issues. While using our ITSM tool (which is configured to use this old internal SMTP relay), I received an error that the email could not be sent: our faithful old mail server had reported that it had insufficient disk space to handle the message. Oops.

Monitoring without alerting is just metrics.

How did this happen?

Let’s rewind. I lead a team of System Engineers who embrace DevOps. We strive to have all of our systems configured via Chef and monitored with Sensu. We have introduced all sorts of great practices into our organization: Kanban, standups, Kaizenkais, TDD, CI/CD pipelines, integrations with PagerDuty, StatusPage, Slack, and so on. For the most part, we can tell when one of our services is having (or is about to have) an issue before any users notice, and, if it’s not something catastrophic, fix it before anybody is affected. So how did this core SMTP relay run out of disk space without anybody noticing?

I decided to ask the infrastructure team that manages the VM: were they aware the disk was full? Were they working on it?

It turns out they did not have it configured in their monitoring solution (SolarWinds), but they did have it checking into their Chef organization, running a policy file that basically just managed user accounts. They did find a page for it in their old Big Brother monitoring solution that was glowing red, although nobody was paying attention to it.

Members of my team had historically kept up with the care and feeding of the Sendmail service running on the VM, since we also manage a lot of the other core application services. This led the infrastructure team to generally expect us to be managing the application. However, we did not like the dichotomy of being responsible for the service while the VM and OS (RHEL 5!) were out of our control. Also, did I mention it was Sendmail? If it ever became official that our team was responsible for the service, we were going to spin up a new, fully Chef’ed server and use something like Postfix.

Conquering the Queue

Well, that was a project for another day. Instead, I set out to fix the issue at hand, as a lot of services were pointing at this SMTP relay. The problem was that all of our handy commands for inspecting the mail queue were failing because there were just too many messages stuck in it. /var/spool/mqueue was 13 GB and the maillog was another 2 GB, filling /var to 100% and preventing any more messages from being processed. I quickly moved some zipped maillogs off to a different volume to allow messages to flow again, and then set out to conquer the queue…

Over half a million messages in the queue!

If you are familiar with Sendmail, you know there is a command called mailq which gives you the status of every message in the queue. We also have our own custom command, which we use to sort the mail queue by the number of messages bound for each address:

mailq | sed "s/^[\t ][\t ]*</ </" | grep "^ <" | sed "s/<//" | sed "s/>//" | sort >~/.msgs
uniq --count ~/.msgs | sort -nr

The problem was, with over half a million messages in the queue, these commands would not return, even after running for 30 minutes. There is even a cronjob that checks how many messages are in the queue and sends an email if the count is greater than 500. That cronjob was stacking up on itself. What a mess! After killing the mailq processes, I decided to do it by hand with a simple bash command:

grep "From: " /var/spool/mqueue/qf* | cut -f2- -d':' | sort | uniq -c

Did I mention it had over half a million messages in the queue? The shell just went “Nope!” and threw an “Argument list too long” error.

That was new. At first I thought my grep command was broken. Then I realized grep never even ran: the shell’s expansion of qf* produced more arguments than the kernel will accept. Well, let’s try find and feed the results to grep.
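For the curious, the “too many arguments” failure isn’t grep being picky: the shell expands qf* before exec(), and the kernel rejects the call when the combined arguments and environment exceed ARG_MAX. You can check the limit yourself (it was a fixed 128 KB on kernels of the RHEL 5 era; modern Linux allows much more):

```shell
# Print the kernel's limit on combined exec() argument + environment size
getconf ARG_MAX
```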

find -name 'qf*' -printf '%f\0' | xargs -0 grep "From: " | cut -f2- -d':' | sort | uniq -c

Yeah! That worked, and it returned in under a minute! Over half of the emails were from the mail server itself: MAILER-DAEMON messages. Those were the first to go. A quick toss of all of those messages… and an hour later the queue was down to about 100,000. A few more searches for the biggest offenders, tossing the messages that were undeliverable, and the queue was back down to normal levels, with lots of free space on the volume again.
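Tossing the bounces can be scripted, too. Here’s a hedged sketch that removes queued messages with an empty envelope sender (bounces are sent from <>), deleting each qf control file along with its matching df data file. The S-line match is an assumption: sendmail’s qf format varies between versions, so check a real control file before running anything like this.

```shell
# Sketch only: purge queued bounce messages (empty envelope sender "S<>").
# Runs in a subshell so the caller's working directory is untouched.
purge_bounces() (
  cd "${1:-/var/spool/mqueue}" || exit 1
  find . -maxdepth 1 -name 'qf*' -print0 \
    | xargs -0 -r grep -l '^S<>$' \
    | while IFS= read -r qf; do
        id=${qf#./qf}
        rm -f "qf$id" "df$id"   # control file plus its data file
      done
)
```

Usage would be simply `purge_bounces /var/spool/mqueue`.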

Let’s Chef it with Monitoring

Now to make sure this never happened again: I was going to get this service properly monitored. I spoke with the infrastructure team and got permission to re-bootstrap the server into my team’s Chef org. That way we could manage the node and get monitoring on it. We had already put a lot of work into our Chef baseline: it configures monitoring and sets up a slew of checks, metrics, integrations with Graphite and Grafana, and all that jazz. So it was as simple as running knife bootstrap and adding our baseline to the run list. Or so I thought.

It turns out RHEL 5 is really old, and whoever configured the firewall on the server was very explicit with the outbound rules, so downloading the Chef client or Ruby gems was a no-no. I had to update the OUTPUT chain in iptables, install the latest supported version of the Chef client for RHEL 5 (13.4.24, if you’re wondering), add an attribute to the node so that our baseline wouldn’t try to update Chef, and manually install Sensu with yum --nogpgcheck, because the RHEL 5 release wasn’t signed for some reason. After all that yak shaving, Chef finally converged, and the server was checking into our Chef org, reporting into Sensu, and integrated with PagerDuty for CPU, memory, and disk checks. Phew. Now for the mail queue checks.
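For reference, the firewall change was just a matter of opening outbound web traffic in the OUTPUT chain; something along these lines (illustrative rules requiring root, not the exact ones I used):

```shell
# Illustrative: allow new outbound HTTP/HTTPS so packages and gems can download
iptables -A OUTPUT -p tcp -m multiport --dports 80,443 \
  -m state --state NEW,ESTABLISHED -j ACCEPT
service iptables save  # persist the running rules on RHEL 5
```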

Custom Sensu Checks

I wrote two quick bash checks that I wanted Sensu to call:

mqtotal returns a Graphite-formatted string for the total size of the mail queue:

# Emit a Graphite metric line: <metric path> <value> <unix timestamp>
TIME=$(date +%s)
COUNT=$(mailq | tail -n 1 | awk '{print $3}')
echo "servername.mqcount $COUNT $TIME"

mqcheck can be called by Sensu as a check command to alert based on the queue size (the WARN and CRIT thresholds shown here are examples):

WARN=500; CRIT=2000
COUNT=$(mailq | tail -n 1 | awk '{print $3}')
if [ "$COUNT" -ge "$CRIT" ]; then
  echo "CRITICAL: Mail queue at $COUNT"
  exit 2
elif [ "$COUNT" -ge "$WARN" ]; then
  echo "WARNING: Mail queue at $COUNT"
  exit 1
fi
echo "OK: Mail queue at $COUNT"
exit 0

I then created the checks in Sensu and restarted the service. Jumping into Uchiwa to make sure it was working, I noticed Sensu wasn’t allowed to call mailq… so I created a sudoers file for the sensu user and gave it passwordless sudo for those two commands:

sensu ALL=(ALL) NOPASSWD: /root/bin/mqtotal
sensu ALL=(ALL) NOPASSWD: /root/bin/mqcheck
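For completeness, wiring mqcheck into Sensu classic is just a JSON check definition dropped into /etc/sensu/conf.d. A sketch with illustrative names, interval, and handlers (note the sudo prefix to match the sudoers entries above):

```json
{
  "checks": {
    "mqcheck": {
      "command": "sudo /root/bin/mqcheck",
      "interval": 60,
      "subscribers": ["mail"],
      "handlers": ["default", "pagerduty"]
    }
  }
}
```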

Perfect. Now I could rest easy knowing that we would never be bitten by a full disk or overflowing mail queue again… Plus, now we have fancy metrics to look at. Oooo… Ahhh….

Let me know of your adventures with Sendmail or other Yak Shaving antics in the comments below!

Travis Payton
