This is bizarre and while I have a workaround, I'd prefer a permanent fix.
I have a small group of GPU machines running Ubuntu 14.04 which I am using as workers for a cloud service that runs its jobs in Docker images. I have nvidia-docker installed on all the worker machines, so that Docker has access to the GPUs. The worker machines also function as individual servers that lab members can run experiments on directly (academic environment, the cloud service is experimental, etc.). For the latter purpose, all the machines automount individual user shares over NFS. We recently switched to automount from a static fstab configuration, and I'm still getting used to it -- it's entirely possible there's some obvious issue at play here that I'm not seeing because I'm an automount n00b. Finally, I haven't set anything up for Docker images to be able to access the NFS shares, so in theory there should be no connection... in theory.
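For context, the automount setup is the stock autofs arrangement. A minimal sketch of the kind of map involved (the mount root, map file name, server name, and export path here are illustrative placeholders, not our actual configuration):

```
# /etc/auto.master -- point autofs at a map file for the share root
# (mount root and map path are hypothetical)
/path/to    /etc/auto.shares    --timeout=600

# /etc/auto.shares -- wildcard entry: any name requested under /path/to
# is mounted on demand from the NFS server ("fileserver" is hypothetical)
*    -fstype=nfs,rw,soft,intr    fileserver:/export/home/&
```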
This week one of our lab members reported a `Too many levels of symbolic links` error when attempting to access their share drive from one of the GPU machines. They're not using Docker at all (to their knowledge). There are no questionable symbolic links in their tree (checked via `find -type l`), so it has to be something else getting into a weird state. The mount point looks like this under `ls -l` from the parent directory:
dr-xr-xr-x 2 root root 0 Dec 5 18:38 labmember1
which seems... bad? root:root with mode 555, really? And when you try to browse it, you do indeed get:
$ cd /path/to/labmember1/
-bash: cd: /path/to/labmember1/: Too many levels of symbolic links
The share doesn't seem to actually be mounted -- it does not appear in /etc/mtab, and (predictably) attempts to unmount it manually report:
$ sudo umount /path/to/labmember1/
umount: /path/to/labmember1/: not mounted
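One diagnostic worth noting here: `/etc/mtab` is maintained by userspace and can disagree with the kernel, while `/proc/mounts` reflects the kernel's actual mount table. A quick sketch of checking the latter directly (reusing the placeholder path from above):

```shell
# /etc/mtab is userspace bookkeeping and can go stale; /proc/mounts is
# the kernel's authoritative mount table. Query it directly.
# (/path/to/labmember1 is the placeholder path from the question.)
if grep -qs ' /path/to/labmember1 ' /proc/mounts; then
    echo "kernel says: mounted"
else
    echo "kernel says: not mounted"
fi
```

If the two disagree, that points at stale bookkeeping rather than a real lingering mount.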
Restarting autofs (`service autofs restart`) did nothing.
What I thought was unrelated at the time: docker had been spewing veth interfaces everywhere. This was a machine being actively used as a cloud worker, so I figured it was our cloud software. Now I'm not so sure.
Then the same `Too many levels of symbolic links` failure occurred on another GPU machine, one which has docker/nvidia-docker installed but does not run the cloud worker software. Lo and behold, veth interfaces everywhere, though in far smaller numbers than on the cloud worker machine.
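To put a number on "everywhere", the veth interfaces can be counted straight from sysfs (a generic diagnostic I'd run in this situation, not something from the original report):

```shell
# Each container network namespace gets a host-side veth peer; a count
# far above the number of running containers suggests leaked interfaces.
# grep -c exits nonzero when the count is 0, hence the || true guard.
veth_count=$(ls /sys/class/net 2>/dev/null | grep -c '^veth' || true)
echo "veth interfaces on host: $veth_count"
```

Comparing that count against `docker ps -q | wc -l` gives a rough sense of how many interfaces are orphaned.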
On a whim, I stopped the docker service (`service docker stop`). Magic! The share mounts normally and our lab member can use their stuff again. The share remains in working condition after starting docker back up again.
So I can clearly work around this issue by restarting docker if (when) it happens again, but I'd like to know why it happens in the first place and how to fix it permanently.