2008-06-10

Unix domain socket woes

At work we needed to create a "doorkeeper" process that accepts incoming network connections and then hands over the connected sockets to one of several, already running, service processes. The service process then talks directly to the client through the network layer. I can't go into detail about why that happens to be the right design for us, so just stipulate that it is what we want.

I was assigned the task of researching how to do the socket handover and writing code to implement it. Solutions had to be found for Windows, Linux, and Mac OS X, and possibly other BSD variants later.

On Windows things turned out to be fairly easy. There's a dedicated system call, WSADuplicateSocket(), for exactly this scenario. This syscall fills in an opaque structure in the doorkeeper process, whose binary content you then transmit to the service process (say, through pipes that you set up when the service process was started). In the service process another system call (WSASocket(), given the transmitted structure) recreates a copy of the socket for you, and then you're free to close the doorkeeper's socket handle.

Things were not quite as straightforward on the Unix side. (At least it works the same way on Linux and Darwin, modulo a few ifdefs.) The way to move an open file descriptor between existing processes is to send it as out-of-band data (aka an "ancillary message") with the SCM_RIGHTS tag on a Unix domain socket (aka an AF_UNIX/PF_UNIX socket). The kernel knows that SCM_RIGHTS is magic and processes its payload of file descriptors such that the receiving process receives descriptors that are valid in its own context, in effect an inter-process dup(2).
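
For illustration, here is a minimal sketch of such an SCM_RIGHTS exchange. It is not our production code: the names send_fd(), recv_fd() and demo_fd_passing() are made up for this example, it passes exactly one descriptor per message, and the demo runs both ends in a single process over a socketpair() just to stay short.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer for one descriptor; the union keeps it aligned for cmsghdr. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd_to_send over the connected AF_UNIX socket sock. Returns 0 or -1. */
int send_fd(int sock, int fd_to_send)
{
    char dummy = 'F';   /* one byte of ordinary data carries the ancillary message */
    struct iovec iov;
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    iov.iov_base = &dummy;
    iov.iov_len = 1;
    memset(&ctrl, 0, sizeof ctrl);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock. Returns the new descriptor or -1. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov;
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    iov.iov_base = &dummy;
    iov.iov_len = 1;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Demo: pass the write end of a pipe across a socketpair, then write
 * through the received copy. Returns 0 on success, -1 otherwise. */
int demo_fd_passing(void)
{
    int sv[2], pipefd[2], copy = -1, rc = -1;
    char buf[2] = { 0, 0 };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (pipe(pipefd) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (send_fd(sv[0], pipefd[1]) == 0 && (copy = recv_fd(sv[1])) >= 0) {
        /* copy is a new descriptor number that refers to the same pipe */
        if (copy != pipefd[1] && write(copy, "ok", 2) == 2 &&
            read(pipefd[0], buf, 2) == 2 && memcmp(buf, "ok", 2) == 0)
            rc = 0;
        close(copy);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

In a real handover the sender would call send_fd() with the accepted client socket and then close its own copy, while the receiver keeps the descriptor returned by recv_fd().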

Or at least, this was the only way to do it that a few hours of googling disclosed to me. It works, but ... ho boy, Unix domain sockets! In my 20+ years of programming, many of them with a distinctly Unix focus, I've never before encountered a situation where AF_UNIX is the preferred solution. This turns out to be not entirely without cause.

Firstly, Unix domain sockets are sockets, which means that you have to go through all the motions of the client/server-based sockets paradigm we know and love from the TCP/IP world. First the two processes must somehow agree on a rendezvous address. Then one of the processes creates a socket, binds it to your chosen address, calls listen() and accept() on it, then gets a second socket on which the actual SCM_RIGHTS transaction can take place, and finally closes both of its sockets. Meanwhile the other process goes through a socket()–connect()–communicate–close() routine of its own. Many of these calls are potentially blocking, so if you're doing non-blocking I/O and want to schedule unrelated work packets in your thread while all this goes on (which we happen to do), the entire transaction needs to be broken into a fair number of select()-triggered work packets. Not that any of this is difficult to code as such, but it is still about the most complex-to-set-up way of doing IPC Unix offers.
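
To make the sequence of calls concrete, here is a rough sketch of the rendezvous using plain blocking calls. The helper names (listen_unix(), connect_unix(), demo_rendezvous()) and the path under /tmp are hypothetical, and the demo plays both roles in one process, which relies on connect() returning as soon as the connection is queued on the listening socket (this is how Linux behaves). A non-blocking version would have to split each step into its own select()-driven work packet.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill in an AF_UNIX address; fails if the path does not fit in sun_path. */
static int make_addr(struct sockaddr_un *addr, const char *path)
{
    if (strlen(path) >= sizeof addr->sun_path)
        return -1;
    memset(addr, 0, sizeof *addr);
    addr->sun_family = AF_UNIX;
    strcpy(addr->sun_path, path);
    return 0;
}

/* "Server" side: socket(), bind(), listen(). Returns the socket or -1. */
int listen_unix(const char *path)
{
    struct sockaddr_un addr;
    int s;

    if (make_addr(&addr, path) < 0)
        return -1;
    if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0)
        return -1;
    if (bind(s, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(s, 1) < 0) {
        close(s);
        return -1;
    }
    return s;
}

/* "Client" side: socket(), connect(). Returns the socket or -1. */
int connect_unix(const char *path)
{
    struct sockaddr_un addr;
    int s;

    if (make_addr(&addr, path) < 0)
        return -1;
    if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0)
        return -1;
    if (connect(s, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(s);
        return -1;
    }
    return s;
}

/* Demo: run the whole dance once and move one byte across.
 * Returns 0 on success, -1 otherwise. */
int demo_rendezvous(void)
{
    char dir[] = "/tmp/doorkeeper-XXXXXX";   /* hypothetical location */
    char path[sizeof dir + 16];
    char byte = 0;
    int listener = -1, client = -1, conn = -1, rc = -1;

    if (mkdtemp(dir) == NULL)                /* private directory, mode 0700 */
        return -1;
    snprintf(path, sizeof path, "%s/handover.sock", dir);

    listener = listen_unix(path);
    if (listener >= 0)
        client = connect_unix(path);         /* queued until accept() */
    if (client >= 0)
        conn = accept(listener, NULL, NULL); /* the second server-side socket */
    if (conn >= 0 && write(client, "x", 1) == 1 &&
        read(conn, &byte, 1) == 1 && byte == 'x')
        rc = 0;

    if (conn >= 0)
        close(conn);
    if (client >= 0)
        close(client);
    if (listener >= 0)
        close(listener);
    unlink(path);                            /* the directory entry must be removed by hand */
    rmdir(dir);
    return rc;
}
```

Even in this stripped-down form the server side ends up with two sockets to close, plus a path name to agree on and clean up afterwards.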

Secondly, Unix domain sockets are also files, or more accurately, they are file-system objects. A listening socket has a name and a directory entry (otherwise it cannot be connected to) and an inode. The inode and directory entry are created automatically by the bind() call. This means, among other things, that you have to create your socket in a particular directory, and it had better be one where you have write permission. All of the standard security considerations about temporary files kick in – if you create your socket in /tmp, somebody might conceivably move it out of the way and substitute his own socket before the client process tries to connect, and so forth.

Further, cleaning up after the transaction becomes an issue. The inode for the listening socket will disappear by itself once nobody is using it, as is the nature of inodes. However, the directory entry counts as using the inode, and it does not go away by itself. You need to explicitly unlink(2) it in addition to closing the socket. If you forget to unlink it, it will stick around in the directory, taking up space but being utterly useless because nobody is listening to it and nobody can ever start listening to it except by unlinking it and creating a new one in its place with bind(). In particular, bind() will fail if you try to bind to an existing but unused AF_UNIX socket inode.

What happens if the server process dies while it listens for the client to connect? The kernel will close the listening socket, but that will not make the directory entry go away. You can register an atexit() handler to do it, but atexit() handlers are ignored if the process dies abnormally (e.g. if it receives a fatal signal). There is essentially no way to make sure that things are properly cleaned up after.
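
A small experiment shows the difference between the two ways of dying. This is only a sketch under assumptions of my own choosing: the names sock_path, run_child() and socket_file_survives() are invented, the path lives in a hypothetical scratch directory, and SIGKILL stands in for "any fatal signal". A child process binds a socket, registers an atexit() handler that unlinks it, and then either calls exit() or kills itself; the parent checks whether the file is still there.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

static char sock_path[64];

/* atexit() handler: remove the socket's directory entry. */
static void remove_socket_file(void)
{
    unlink(sock_path);
}

/* Child: bind a socket at sock_path, register the handler, then either
 * exit normally or die from a signal. Never returns. */
static void run_child(int die_by_signal)
{
    struct sockaddr_un addr;
    int s = socket(AF_UNIX, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, sock_path);   /* sock_path is shorter than sun_path */
    if (s < 0 || bind(s, (struct sockaddr *)&addr, sizeof addr) < 0)
        _exit(2);                       /* setup failed */
    atexit(remove_socket_file);
    if (die_by_signal)
        raise(SIGKILL);                 /* dies here; no handlers run */
    exit(0);                            /* normal exit; handlers run */
}

/* Returns 1 if the socket file is still there after the child is gone,
 * 0 if it was cleaned up, -1 on setup failure. */
int socket_file_survives(int die_by_signal)
{
    char dir[] = "/tmp/doorkeeper-XXXXXX";   /* hypothetical location */
    pid_t pid;
    int status, survives;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(sock_path, sizeof sock_path, "%s/leak.sock", dir);

    pid = fork();
    if (pid < 0) {
        rmdir(dir);
        return -1;
    }
    if (pid == 0)
        run_child(die_by_signal);

    if (waitpid(pid, &status, 0) != pid ||
        (WIFEXITED(status) && WEXITSTATUS(status) != 0))
        survives = -1;
    else
        survives = (access(sock_path, F_OK) == 0);

    unlink(sock_path);                  /* the parent tidies up either way */
    rmdir(dir);
    return survives;
}
```

With these definitions, socket_file_survives(0) should report that the file was removed and socket_file_survives(1) that it was left behind.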

This is a major downside of Unix domain sockets. It is very probably impossible to change, because of the fundamental design choice of the Unix file system model that you can't get from an inode to whatever name(s) in the file system that refer to it. Not even the kernel can. But, good reason or not, it means that you either have to accept the risk of leaking uncollected garbage in the file system, or try to use the same pathname for each transaction, unlinking it if it already exists. Unfortunately the latter option conflicts with wanting to run several transfers (or several instances of the doorkeeper process) in parallel.

Access control for the Unix domain socket transaction is another issue. In theory, the permission bits on the inode control who can connect to the listening socket, which would be somewhat more useful if you could set them using fchmod(2) on the socket. But at least on Linux you can't do that. Probably you can use chmod(2) after bind() but before listen(), but you'd need to be sure that no bad guy has any window to change the meaning of the pathname before you chmod it. This sounds a bit too tricky for my taste, so I ended up just checking the EUID of the client process after the connection has been accept()ed. For your reference, the way to do this is getsockopt() with SO_PEERCRED for Linux, and the getpeereid() function for BSDs. The latter uses a socket option of its own internally, but getpeereid() appears to be more portable. Note that on Linux you have the option to check the actual process ID of your conversation partner; BSD only gives you its user and group IDs.
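
As a sketch of that credential check (not the code we shipped), the two variants can be hidden behind one function. The names peer_euid() and peer_is_same_user() are invented for this example, and the platform test is simplified to "Linux versus everything else", on the assumption that everything else provides getpeereid().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* glibc declares struct ucred only with this; must come first */
#endif
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Store the effective UID of the process at the other end of a connected
 * AF_UNIX socket in *uid. Returns 0 on success, -1 on failure. */
int peer_euid(int sock, uid_t *uid)
{
#if defined(__linux__)
    struct ucred cred;              /* also carries cred.pid and cred.gid */
    socklen_t len = sizeof cred;

    if (getsockopt(sock, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0 ||
        len != sizeof cred)
        return -1;
    *uid = cred.uid;
    return 0;
#else
    gid_t gid;                      /* BSD variant: user and group IDs only */

    return getpeereid(sock, uid, &gid);
#endif
}

/* Returns 1 if the peer runs under our own effective UID, 0 if it does
 * not, and -1 if the credentials could not be obtained. */
int peer_is_same_user(int sock)
{
    uid_t uid;

    if (peer_euid(sock, &uid) < 0)
        return -1;
    return uid == geteuid() ? 1 : 0;
}
```

The check would go right after accept(): if peer_is_same_user() does not return 1 for the new connection, close it without sending any descriptors.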

I'm as critical as the next guy about stuff that comes from the Dark Tower of Redmond, but in this particular case the two-click approach that works on Win32 does appear to be significantly less complex than the Unix solution.


(New comments disabled on 2013-06-11 due to persistent comment spam)

18 comments:

  1. Additional fun fact: SCM_RIGHTS and SOL_SOCKET happen to have the same numerical value on Linux, leading to strange portability bugs if you accidentally write one instead of the other ...

  2. Hey, nice article. I'm looking at doing something similar, and this is very helpful.

    Thanks, Ben.

    PS adding to what you said about the dark tower of Redmond, another thing I really like about win32 is the WaitForMultipleObjects function.

  3. You're welcome. :-)

    I've used WaitForMultipleObjects mostly as an interruptible select() replacement, and don't find it too onerous to create a self-pipe in the unix implementation. There are probably some complex synchronization scenarios where the full power of WFMO is handy, but I have not come across them yet.

    Apparently, FUTEX_FD in Linux attempted to let select() be useful in such cases, but unfortunately it ended up Broken As Designed.

  4. If the two processes have common ancestry, you can use socketpair() to create a connected pair of AF_UNIX sockets, which are anonymous (don't have a corresponding inode). That solves most of your problems in one go.

    If you cannot use that approach (because the processes don't have common ancestry), the easiest way to deal with the permission issue is to create a subdirectory with the relevant permissions, and create the socket inside that. That eliminates race conditions, as the subdirectory is created with the correct permissions from the outset.

    As for cleaning up "stale" socket inodes, it's normally preferable to accept that this may happen, and remove any existing inode just before you bind() to the address, rather than trying to ensure that it always gets cleaned up on exit (obviously, you can't do that if the process gets SIGKILL, or if the system gets hard-reset or power-cycled).

  5. > In theory, the permission bits on the inode control who can connect to the listening socket, which would be somewhat more useful if you could set them using fchmod(2) on the socket. But at least on Linux you can't do that.

    I've done just this before, and it worked fine for me in a piece of software handling large volumes of VoIP traffic...

  6. My information that fchmod is a no-op on sockets is second-hand (but came from somebody who claimed he had checked the kernel sources). It is also possible that it works in newer kernels and doesn't in older. If it didn't work on some boxes, would your VoIP app fail in ways that alerted you to the problem, or would it just silently give more access than you think?

  7. Try umask() before bind() if you don't like the permissions the socket is created with. setegid() before bind() can be useful too if you like to muck with groups. Since bind() creates the file exclusively and fails if it already exists, you don't have to worry about someone inserting a rogue socket at the path.

  8. Still, all these very helpful work-arounds posted by commenters only demonstrate the utter inadequacy of sockets as an IPC mechanism.

    One of the original incentives for developing sockets is easy access to the IP protocol. IP is a message-oriented protocol, as (by necessity) everything above it also is. Maintaining the archaic illusion of a byte-pipe on top of this is the largest cause of socket-related complexity, both in terms of _implementing_ a sockets API and in _using_ the sockets API. It, in fact, doesn't matter if you're working with IP or Unix-domain sockets.

    Unfortunately, the Unix community fails to see the need for a genuine messaging API, resulting in numerous technologies that sit on top of sockets intended to emulate one (e.g., CORBA, D-Bus, et al.). Think about it: messages on top of a byte-pipe on top of a message-oriented protocol. What an incredible waste!!

    (NOTE: TCP sockets are, by definition, byte-pipes, as specified by the TCP specification. However, it's far, far, far easier to emulate a byte-pipe on top of an arbitrary stack of messaging protocols than the alternative. Let's assume sockets were implemented "correctly" -- you'd have a byte-pipe on top of a messaging protocol, on top of a messaging protocol. This structure is fundamentally easier to implement and reason about.)

    SysV message queues are a failed experiment -- we understand this implicitly. Unfortunately, instead of fixing a broken implementation of a good idea, history has chosen to implement a great implementation of a really bad idea. This is, in my opinion, positively backwards, as any coder for AmigaOS, PC/GEOS, Win32, BeOS, VMS, or OS/2 can tell you.

    After 35 to 40 years of experience in systems coding, you'd expect folks to finally learn that not *everything* "is a file".

    Oh well -- what would I know? I'm just a nobody.

  9. Samuel A. Falvo II: What are you talking about? You can create SOCK_DGRAM unix sockets.

  10. Trying to change permissions after creating a file creates a race condition. Use the umask and the modes on open to create it atomically with the right permissions.

  11. Why not just have all the server processes listening on the same socket and dump the gatekeeper entirely? That's what Apache does. Just set up your server socket and then fork once for each server process.
    If you're bothered about garbage in the file system, then only create the socket in a tmpfs file system (which disappears when the machine shuts down) and invoke any creating process via a wrapper script that waits for the creating process to finish (or crash) and then cleans up on its behalf.

  12. > WaitForMultipleObjects

    Except that WaitForMultipleObjects does not work on pipes.

    BTW, what's this awful idea of disallowing cut-and-paste and cursor keys in the post comment textarea?!?!??

  13. /* pseudo code to address the file perms issue */
    #include <sys/types.h>
    #include <sys/stat.h>

    mode_t old_mask = umask(0);
    bind(...);
    umask(old_mask);

  14. You're missing "grab a global mutex because the umask is not thread local", and "rewrite entire program, including any third-party libraries, to grab the same mutex every time it may create a file".

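  15. #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>   /* unlink(), close() */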

    #define SOCK_PATH "echo_socket"

    int main(void)
    {
        int s, s2, t, len, rc;
        struct sockaddr_un local, remote;
        char str[100];
        fd_set connected_apps_fd;

        if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) == -1) {
            perror("socket");
            exit(1);
        }

        local.sun_family = AF_UNIX;
        strcpy(local.sun_path, SOCK_PATH);
        unlink(local.sun_path);

        len = strlen(local.sun_path) + sizeof(local.sun_family);

        if (bind(s, (struct sockaddr *)&local, len) == -1) {
            perror("bind");
            exit(1);
        }

        /* if (listen(s, 5) == -1) {
            perror("listen");
            exit(1);
        } */

        while (1)
        {
            printf("Waiting for a connection...\n");
            FD_ZERO(&connected_apps_fd);
            FD_SET(s, &connected_apps_fd);
            rc = select(s + 1, &connected_apps_fd, NULL, NULL, NULL);
            if (rc < 0) {
                printf("%d\n", rc);
                continue;
            }
            if (FD_ISSET(s, &connected_apps_fd))
            {
                int done, n;
                printf("Got connection\n");
                /* t = sizeof(remote);
                if ((s2 = accept(s, (struct sockaddr *)&remote, &t)) == -1) {
                    perror("accept");
                    exit(1);
                }

                printf("Connected.\n");

                done = 0;
                do {
                    n = recv(s2, str, 100, 0);
                    if (n <= 0) {
                        if (n < 0) perror("recv");
                        done = 1;
                    }

                    if (!done)
                        if (send(s2, str, n, 0) < 0) {
                            perror("send");
                            done = 1;
                        }
                } while (!done);

                close(s2);*/
            }
        }

        return 0;
    }

    Output:

    Waiting for a connection...
    Got connection
    Waiting for a connection...
    Got connection
    ... and so on, infinitely ....

    Problem:
    The above is a Unix domain server that uses a select() call to wait for incoming connections. But even without any clients running, select() always succeeds and reports the socket fd as set for reading. I am puzzled: where is my code going wrong, or where is select() getting these events from?

    Please let me know...

    My questions are:
    1> Can we do select() on a Unix domain socket? If yes, where am I going wrong?

    2> What is the difference between select() and listen()?

    Thanks in Advance,
    Hruday

  16. (0) You'd probably get much better answers by posting to some general unix programming forum, rather than commenting on a years-old post on some random guy's rarely used blog.

    (1) Yes, select works on all file descriptors. Your problem is that you have commented out the listen call (and also the accept call). I don't think it is even well defined what "ready to read/write" should mean on a socket that is bound but neither listening nor connected. On your system it appears that the socket counts as "readable" (which makes a certain sense because whatever you want to do with the socket in that state, it is as ready for it as it will ever be). However, it would probably be unwise to depend on this behavior generally.

    (2) Difference? They have hardly anything in common. Perhaps you meant the difference between listen and bind or accept? The best way to think is probably that bind+listen is one logical operation that for historical reasons must be split into two syscalls. The listen call tells the socket stack that you're actually going to call accept later -- as opposed to bind+connect, which allows you explicit control over the local address of an outgoing connection, such as if the machine has several IP addresses and for some reason you care which of them a server will see your connection coming from.

  17. You have to place the accept or receive call inside if (rc > 0).
