At work we needed to create a "doorkeeper" process that accepts incoming network connections and then hands over the connected sockets to one of several, already running, service processes. The service process then talks directly to the client through the network layer. I can't go into detail about why that happens to be the right design for us, so just stipulate that it is what we want.
I was assigned the task of researching how to do the socket handover and write code to implement it. Solutions had to be found for Windows, Linux, and Mac OS X. Possibly later other BSD variants too.
On Windows things turned out to be fairly easy. There's a dedicated system call, WSADuplicateSocket(), for exactly this scenario. This syscall fills in an opaque structure in the doorkeeper process, whose binary content you then transmit to the service process (say, through pipes that you set up when the service process was started). In the service process another system call recreates a copy of the socket for you, and then you're free to close the doorkeeper's socket handle.
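The shape of the Windows transaction is roughly as follows. This is only a sketch: error handling is elided, and send_to_service() stands in for whatever pipe-write mechanism you use to get the bytes across.

```c
#include <winsock2.h>

/* Doorkeeper side: serialize the socket for the service process,
   identified by its process ID. */
void hand_over(SOCKET s, DWORD service_pid)
{
    WSAPROTOCOL_INFO info;
    if (WSADuplicateSocket(s, service_pid, &info) != 0)
        return;                        /* real code would handle the error */
    send_to_service(&info, sizeof info);  /* hypothetical pipe write */
    closesocket(s);                    /* our copy is no longer needed */
}

/* Service side: rebuild a usable socket from the received blob. */
SOCKET receive_handover(const WSAPROTOCOL_INFO *info)
{
    return WSASocket(FROM_PROTOCOL_INFO, FROM_PROTOCOL_INFO,
                     FROM_PROTOCOL_INFO, (LPWSAPROTOCOL_INFO)info, 0, 0);
}
```

The FROM_PROTOCOL_INFO constants tell WSASocket() to take the address family, socket type, and protocol from the structure itself rather than from the first three arguments.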
Things were not quite as straightforward on the Unix side. (At least it works the same way on Linux and Darwin, modulo a few ifdefs.) The way to move an open file descriptor between existing processes is to send it as out-of-band data (aka an "ancillary message") with the SCM_RIGHTS tag on a Unix domain socket (aka an AF_UNIX/PF_UNIX socket). The kernel knows that SCM_RIGHTS is magic and processes its payload of file descriptors such that the receiving process receives descriptors that are valid in its own context, in effect an inter-process dup(2).
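A minimal sketch of both ends of the SCM_RIGHTS transaction might look like this, assuming conn is an already-connected AF_UNIX socket (the helper names send_fd/recv_fd are mine, not any standard API):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one open file descriptor over a connected AF_UNIX socket. */
int send_fd(int conn, int fd)
{
    char byte = 0;                      /* must send at least one real byte */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } cmsg;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cmsg.buf, .msg_controllen = sizeof cmsg.buf,
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type  = SCM_RIGHTS;         /* the "magic" tag */
    c->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(conn, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor; the kernel installs a fresh one in our fd table. */
int recv_fd(int conn)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } cmsg;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cmsg.buf, .msg_controllen = sizeof cmsg.buf,
    };
    if (recvmsg(conn, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

Note that you cannot send a descriptor entirely by itself; at least one byte of ordinary data has to travel alongside the ancillary message.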
Or at least, this was the only way to do it that a few hours of googling disclosed to me. It works, but ... ho boy, Unix domain sockets! In my 20+ years of programming, many of them with a distinctly Unix focus, I've never before encountered a situation where AF_UNIX is the preferred solution. This turns out to be not entirely without cause.
Firstly, Unix domain sockets are sockets, which means that you have to go through all the motions of the client/server-based sockets paradigm we know and love from the TCP/IP world. First the two processes must somehow agree on a rendezvous address. Then one of the processes creates a socket, binds it to your chosen address, calls listen() and accept() on it, then gets a second socket on which the actual SCM_RIGHTS transaction can take place, and finally closes both of its sockets. Meanwhile the other process goes through a socket()–connect()–communicate–close() routine of its own. Many of these calls are potentially blocking, so if you're doing non-blocking I/O and want to schedule unrelated work packets in your thread while all this goes on (which we happen to do), the entire transaction needs to be broken into a fair number of select()-triggered work packets. Not that any of this is difficult to code as such, but it is still about the most complex-to-set-up way of doing IPC Unix offers.
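The full choreography, blocking variant, can be condensed into one self-contained demo. The socket path is made up for illustration, and fork() stands in for the two separate processes; error handling is pared down to keep the shape visible:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

#define SOCK_PATH "/tmp/doorkeeper-demo.sock"   /* hypothetical rendezvous address */

/* One complete rendezvous; returns 0 on success. */
int rendezvous_demo(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, SOCK_PATH, sizeof addr.sun_path - 1);
    unlink(SOCK_PATH);                          /* clear any stale leftover */

    int lst = socket(AF_UNIX, SOCK_STREAM, 0);  /* "server" side: */
    if (lst < 0) return -1;
    if (bind(lst, (struct sockaddr *)&addr, sizeof addr) < 0) return -1;
    if (listen(lst, 1) < 0) return -1;

    pid_t pid = fork();
    if (pid == 0) {                             /* child plays "client" */
        int c = socket(AF_UNIX, SOCK_STREAM, 0);
        connect(c, (struct sockaddr *)&addr, sizeof addr);
        write(c, "x", 1);          /* stand-in for the SCM_RIGHTS transfer */
        close(c);
        _exit(0);
    }
    int conn = accept(lst, NULL, NULL);  /* second socket, for the transfer */
    char b;
    int got = (read(conn, &b, 1) == 1 && b == 'x');
    close(conn);
    close(lst);
    unlink(SOCK_PATH);             /* the directory entry won't go away by itself */
    waitpid(pid, NULL, 0);
    return got ? 0 : -1;
}
```

In the non-blocking version each of socket(), bind(), listen(), accept(), connect(), and the transfer itself becomes its own select()-triggered step.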
Secondly, Unix domain sockets are also files, or more accurately, they are file-system objects. A listening socket has a name and a directory entry (otherwise it cannot be connected to) and an inode. The inode and directory entry are created automatically by the bind() call. This means, among other things, that you have to create your socket in a particular directory, and it had better be one where you have write permission. All of the standard security considerations about temporary files kick in – if you create your socket in /tmp, somebody might conceivably move it out of the way and substitute his own socket before the client process tries to connect, and so forth.
Further, cleaning up after the transaction becomes an issue. The inode for the listening socket will disappear by itself once nobody is using it, as is the nature of inodes. However, the directory entry counts as using the inode, and it does not go away by itself. You need to explicitly unlink(2) it in addition to closing the socket. If you forget to unlink it, it will stick around in the directory, taking up space but being utterly useless because nobody is listening to it and nobody can ever start listening to it except by unlinking it and creating a new one in its place with bind(). In particular, bind() will fail if you try to bind to an existing but unused AF_UNIX socket inode.
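That last point is easy to demonstrate. The sketch below (with a made-up path) binds a socket, closes it without unlinking, and shows that a second bind() to the same name fails with EADDRINUSE until the stale directory entry is removed:

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define STALE_PATH "/tmp/stale-demo.sock"   /* hypothetical path for the demo */

int stale_bind_demo(void)
{
    struct sockaddr_un a = { .sun_family = AF_UNIX };
    strncpy(a.sun_path, STALE_PATH, sizeof a.sun_path - 1);
    unlink(STALE_PATH);                     /* start from a clean slate */

    /* A first "server" binds and exits without unlinking ... */
    int s1 = socket(AF_UNIX, SOCK_STREAM, 0);
    if (bind(s1, (struct sockaddr *)&a, sizeof a) < 0) return -1;
    close(s1);                   /* inode's last user gone; dirent remains */

    /* ... so a second bind to the same name fails ... */
    int s2 = socket(AF_UNIX, SOCK_STREAM, 0);
    if (bind(s2, (struct sockaddr *)&a, sizeof a) == 0) return -1;
    int saved = errno;                      /* EADDRINUSE */

    /* ... until the stale entry is explicitly unlinked. */
    unlink(STALE_PATH);
    int ok = bind(s2, (struct sockaddr *)&a, sizeof a) == 0;
    close(s2);
    unlink(STALE_PATH);
    return (saved == EADDRINUSE && ok) ? 0 : -1;
}
```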
What happens if the server process dies while it listens for the client to connect? The kernel will close the listening socket, but that will not make the directory entry go away. You can register an atexit() handler to do it, but atexit() handlers are ignored if the process dies abnormally (e.g. if it receives a fatal signal). There is essentially no way to make sure that things are properly cleaned up after.
This is a major downside of Unix domain sockets. It is very probably impossible to change, because of the fundamental design choice of the Unix file system model that you can't get from an inode to whatever name(s) in the file system refer to it. Not even the kernel can. But, good reason or not, it means that you either have to accept the risk of leaking uncollected garbage in the file system, or try to use the same pathname for each transaction, unlinking it if it already exists. Unfortunately the latter option conflicts with wanting to run several transfers (or several instances of the doorkeeper process) in parallel.
Access control for the Unix domain socket transaction is another issue. In theory, the permission bits on the inode control who can connect to the listening socket, which would be somewhat more useful if you could set them using fchmod(2) on the socket. But at least on Linux you can't do that. Probably you can use chmod(2) after bind() but before listen(), but you'd need to be sure that no bad guy has any window to change the meaning of the pathname before you chmod it. This sounds a bit too tricky for my taste, so I ended up just checking the EUID of the client process after the connection has been accept()ed. For your reference, the way to do this is getsockopt() with SO_PEERCRED for Linux, and the getpeereid() function for BSDs. The latter uses a socket option of its own internally, but getpeereid() appears to be more portable. Note that on Linux you have the option to check the actual process ID of your conversation partner; BSD only gives you its user and group IDs.
I'm as critical as the next guy about stuff that comes from the Dark Tower of Redmond, but in this particular case the two-call approach that works on Win32 does appear to be significantly less complex than the Unix solution.