Not sure what happened with GlusterFS tonight. Around 19:07 EST, the AWS alarm sounded. After rebooting the GlusterFS servers and the web servers, nothing was bringing the sites back online. There was a terrible lag when accessing the GlusterFS file systems and there weren't any glaring errors in the log files.
In an effort to get the sites loading again, I began making an archive of all of the files from the GlusterFS server mount point, planning to copy it onto the main web server (web1). Then I would remove the GlusterFS servers altogether and scale back down to a single, shared server.
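For the record, the archive-and-copy step boils down to something like this (the paths and hostname are placeholders, not the exact commands I ran):

```bash
# Rough sketch of the archive-and-copy step.
# /mnt/glusterfs and web1 are placeholders for the actual mount point and host.

# On the Gluster node: archive everything under the mount point.
tar -czf /tmp/gluster-files.tar.gz -C /mnt/glusterfs .

# Push the archive over to the main web server (web1).
scp /tmp/gluster-files.tar.gz web1:/tmp/
```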
It was certainly baffling what could cause GlusterFS to flake out all of a sudden after running for nearly three years.
Once the archive had been transferred to the web server, I was planning on doing the following:
1. Remove GlusterFS from fstab, so it stops trying to mount.
2. Extract the files from the archive into /mnt/glusterfs/ (this ensures the home directory paths remain the same; see the sketch below).
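Roughly, the cutover on web1 would have looked like this (assuming the Gluster volume was mounted at /mnt/glusterfs and the archive had landed in /tmp; this is a sketch, not a transcript):

```bash
# 1. Comment out the GlusterFS entry in /etc/fstab so it stops trying to mount,
#    and unmount the volume if it's still attached.
sudo sed -i.bak '/glusterfs/ s/^/#/' /etc/fstab
sudo umount /mnt/glusterfs 2>/dev/null || true

# 2. Drop the archived files into /mnt/glusterfs so the home directory paths
#    stay exactly the same, just backed by local disk instead of Gluster.
sudo mkdir -p /mnt/glusterfs
sudo tar -xzf /tmp/gluster-files.tar.gz -C /mnt/glusterfs
```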
When I went to transfer the archive to the web server, the network transfer was going to take three hours or more. I said 'F' this, stopped the Gluster server node I had created the archive with, and upgraded it to an m3.medium instance. After restarting the Gluster node and changing DNS, the performance issue went away. It seems AWS was having a network bottleneck, which was causing the terrible latency. It probably didn't help matters that the two replicated GlusterFS nodes were two different instance types, and one of them was a small.
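The upgrade itself is just a stop/modify/start cycle. With the AWS CLI it would look something like the following (the instance ID is a placeholder, and I'm not claiming these are the exact steps I ran; the DNS change afterward still has to point back at the node's new address):

```bash
# Stop the Gluster node, change its instance type, and start it back up.
# i-0123456789abcdef0 is a placeholder for the node's actual instance ID.

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0

# The instance type can only be changed while the instance is stopped.
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --attribute instanceType \
  --value m3.medium

aws ec2 start-instances --instance-ids i-0123456789abcdef0

# A stop/start usually means a new public IP, hence the DNS change afterward.
```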