Stories from the engine room: I have a big configuration file that is part of a mail setup and is read on every mail delivery (the MTA is exim, and the file contains all the domains the servers are responsible for). It lives on NFS and several servers use it.
Two days ago I migrated that part of the NFS to a new NFS setup that works differently from the old one. Because it is already in production, I cannot change much in the NFS server configuration. The rest has been working fine for months.
When I mounted the two small exports from the new storage, I had problems with certain scripts because they use flock. Those scripts just hung. The big storage, which has been running for months now, uses nolockd, so that issue never came up there. Since I couldn’t really configure locking correctly, because that would have required a restart of the NFS server, I switched to NFSv4.2 (the old setup uses v3, as does the big storage part) because file locking works differently there. The scripts worked and everything was great.
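For illustration, a small lock probe along these lines could show quickly whether flock is usable on a given mount. This is only a minimal sketch, not what I actually ran: `probe_flock` and the file path are hypothetical names, and I have not checked how it behaves on a mount where the kernel itself blocks inside the lock request, in which case even a non-blocking attempt might hang.

```python
import errno
import fcntl
import os
import time


def probe_flock(path, timeout=5.0, interval=0.1):
    """Try to take an exclusive flock on `path` without waiting forever.

    Returns True if the lock was acquired (and released again) within
    `timeout` seconds, False if it stayed contended the whole time.
    """
    deadline = time.monotonic() + timeout
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        while True:
            try:
                # LOCK_NB asks flock() to fail instead of waiting for the lock.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as exc:
                if exc.errno not in (errno.EAGAIN, errno.EACCES):
                    # Any other error (for example ENOLCK) means locking
                    # itself failed, not that someone else holds the lock.
                    raise
                if time.monotonic() >= deadline:
                    return False
                time.sleep(interval)
            else:
                fcntl.flock(fd, fcntl.LOCK_UN)
                return True
    finally:
        os.close(fd)
```

Calling `probe_flock("/some/nfs/mount/probe.lock")` (a made-up path) from a test script would then give a yes/no answer instead of a script that silently hangs.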
Yesterday the mail platform suddenly started misbehaving. Huge numbers of newsletters and spam came in and everything slowed to a crawl. I tried a few things, and after implementing greylisting specifically for mail from googleusercontent, things recovered.
Today the problem came back, and even worse. When I hit ctrl+t (FreeBSD: sends SIGINFO) on a hanging mail delivery, I saw that it was stuck in [nfs]. But the NFS servers were fine: no saturation, no massive IOPS, they were basically bored. So I attached truss to a hanging process, and it hung when it read the big configuration file from the beginning. I changed the setup so that the configuration file now lives locally on each server and gets copied over by a cron job whenever it changes. And suddenly all the problems went away.
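The "copy it over only if it changed" step could look roughly like the sketch below. It is an assumption about how such a cron job might be written, not the script I actually use; `sync_if_changed` and any paths passed to it are hypothetical. The idea is to compare contents first and then swap the new file in with a rename, so a delivery that opens the local copy sees either the old or the new version, never a half-written one.

```python
import filecmp
import os
import shutil
import tempfile


def sync_if_changed(src, dst):
    """Copy `src` over `dst` only when the contents differ.

    Returns True if `dst` was replaced, False if it was already up to date.
    """
    # shallow=False forces a byte-by-byte comparison instead of trusting stat().
    if os.path.exists(dst) and filecmp.cmp(src, dst, shallow=False):
        return False

    dst_dir = os.path.dirname(os.path.abspath(dst))
    # The temp file must live in the same directory (same filesystem) as dst
    # so that os.replace() is an atomic rename.
    fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix=".sync-")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        shutil.copymode(src, tmp)  # mkstemp creates the file with mode 0600
        os.replace(tmp, dst)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return True
```

Run from cron every minute or so, the worst case is that a new domain shows up on a server a little late, which is a much nicer failure mode than every delivery waiting on NFS.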
I have rarely been so happy to solve a problem. That thing was a beast and it had a massive production impact. I have no idea how I could have figured that out earlier.
Especially since I was focused only on the big storage parts and not the small ones, which are barely noticeable. That was probably the last hurdle to bring the project to the finish line.
I already had to migrate an LDAP service as part of that project. It didn’t perform in the new setup, and the problem turned out to be overly restrictive file limits in the new Linux setup compared with the old FreeBSD setup.
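Assuming "file limits" here means the limit on open file descriptors, a quick way to compare two systems is to print that limit on each. The sketch below is illustrative only, with hypothetical helper names `nofile_limits` and `describe`. It reports the limit of the process that runs it, so the value a daemon actually gets from its service manager may differ.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"open files: soft={describe(soft)} hard={describe(hard)}")
```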
Now I just have to migrate a couple of terabytes of mailboxes (around three quarters are already migrated), and then I can clear out servers that take up roughly 24 rack units (spread over several racks) and some ancient cabling. This project took ages…