Tailing output

It is easy to take tail -f for granted. It has a deceptively small responsibility. But a number of details become apparent when implementing a subset of its functionality.

inotify

Inotify is inherently lossy. Queue overruns are possible, and events can be dropped. There is no opting out of the coalescing of events, and even if there were, the events do not carry enough information to reconstruct the states a file goes through. Concretely, suppose a process is watching some file via inotify to read new bytes as they are written. To this end, it keeps an offset into the file. When inotify signals that the file has been modified, the process calls fstat(2) and checks the file size against its offset to determine if the file has grown. The difference can be read, and the offset updated.
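
The bookkeeping described above can be sketched in Go roughly as follows. This is a minimal illustration, not a full tail implementation: readNew and tailDemo are hypothetical names, the inotify notification is only simulated, and the sketch assumes the file only ever grows.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// readNew returns the bytes appended to f since offset, plus the updated
// offset. It only handles growth: a file that did not grow yields no data.
func readNew(f *os.File, offset int64) ([]byte, int64, error) {
	st, err := f.Stat() // fstat(2) on the open descriptor
	if err != nil {
		return nil, offset, err
	}
	if st.Size() <= offset {
		return nil, offset, nil
	}
	buf := make([]byte, st.Size()-offset)
	n, err := f.ReadAt(buf, offset)
	if err != nil && err != io.EOF {
		return nil, offset, err
	}
	return buf[:n], offset + int64(n), nil
}

// tailDemo appends two lines to a temporary file and, after each write,
// reads the difference as a watcher would after a modification event.
func tailDemo() (string, string, error) {
	f, err := os.CreateTemp("", "tail-demo-*")
	if err != nil {
		return "", "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	var offset int64
	var chunks []string
	for _, line := range []string{"first\n", "second\n"} {
		if _, err := f.WriteString(line); err != nil {
			return "", "", err
		}
		// Pretend an IN_MODIFY event for f has just been read.
		var data []byte
		data, offset, err = readNew(f, offset)
		if err != nil {
			return "", "", err
		}
		chunks = append(chunks, string(data))
	}
	return chunks[0], chunks[1], nil
}

func main() {
	a, b, err := tailDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q %q\n", a, b)
}
```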

When truncate(2) comes into the picture, the file can now be made shorter than it was. It is easy enough to detect this and bring the offset to the new file size, but filesystem matters are fraught with races. If a process truncates a file and then quickly writes to it, it is possible for the second IN_MODIFY event to be queued faster than the watching process can read the first one. In this case, the OS will coalesce the two events, and the point to which a file had been truncated will be lost. Even if the first event is processed in time, the write could still happen between reading from the inotify fd and getting an updated struct stat. There is no consistent way to trace the steps of the end of the file for the offset to end up in the appropriate place; not with inotify alone, at any rate.
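
To make the lost truncation concrete, here is a small Go sketch that replays the sequence without inotify at all: the truncate and the write both happen before the watcher looks at the file, which is what a coalesced event amounts to. The names nextRange and lostTruncation are made up for this illustration, and the offset policy (resync to the new size on shrink) is just the simple one described above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// nextRange decides which byte range a watcher should read, given the
// current file size and its stored offset. A file shorter than the offset
// is treated as truncated and the offset is moved to the new end.
func nextRange(size, offset int64) (from, to int64) {
	if size < offset {
		return size, size // truncated: nothing to read, resync to the end
	}
	return offset, size
}

// lostTruncation returns what the watcher reads when a truncate and a
// larger write are both observed as a single modification.
func lostTruncation() (string, error) {
	dir, err := os.MkdirTemp("", "trunc-demo-*")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	// O_APPEND makes every write land at the current end of the file.
	f, err := os.OpenFile(filepath.Join(dir, "log"),
		os.O_RDWR|os.O_CREATE|os.O_APPEND, 0o600)
	if err != nil {
		return "", err
	}
	defer f.Close()

	if _, err := f.WriteString("0123456789"); err != nil {
		return "", err
	}
	offset := int64(10) // the watcher has already consumed all ten bytes

	// Both steps complete before the watcher calls fstat(2).
	if err := f.Truncate(0); err != nil {
		return "", err
	}
	if _, err := f.WriteString("abcdefghijkl"); err != nil {
		return "", err
	}

	st, err := f.Stat()
	if err != nil {
		return "", err
	}
	from, to := nextRange(st.Size(), offset)
	buf := make([]byte, to-from)
	if _, err := f.ReadAt(buf, from); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := lostTruncation()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("watcher read %q out of 12 new bytes\n", got)
}
```

The file size (12) is larger than the stale offset (10), so the shrink is never noticed and only the last two bytes are reported as new.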

The behavior of “out-only” events is not obvious at first. Consider IN_IGNORED. It is one of the events listed by inotify(7) as potentially being set in the mask field returned by read(2). The Linux Programming Interface mentions that removing a watch causes IN_IGNORED to be generated, and has the event marked in the “Out” column only. What this means is that IN_IGNORED will be delivered regardless of the watch mask, so an application must expect it. Kerrisk spells this out in an LWN article:

“In addition to the various events for which an application may request notification, there are certain events for which inotify always generates automatic notifications. The most notable of these is IN_IGNORED, which is generated whenever inotify ceases to monitor an object.”
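
A quick way to observe this is to watch a file for IN_MODIFY only, remove the watch, and inspect what read(2) returns. The sketch below is Linux-only and uses Go's standard syscall package; ignoredAfterRemove is a hypothetical helper name, and the event parsing assumes the usual layout of a fixed header followed by Len bytes of name.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// ignoredAfterRemove watches a temporary file for IN_MODIFY only, removes
// the watch, and reports whether IN_IGNORED was delivered anyway.
func ignoredAfterRemove() (bool, error) {
	f, err := os.CreateTemp("", "inotify-demo-*")
	if err != nil {
		return false, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	fd, err := syscall.InotifyInit1(syscall.IN_CLOEXEC)
	if err != nil {
		return false, err
	}
	defer syscall.Close(fd)

	// The mask asks for IN_MODIFY and nothing else.
	wd, err := syscall.InotifyAddWatch(fd, f.Name(), syscall.IN_MODIFY)
	if err != nil {
		return false, err
	}
	if _, err := syscall.InotifyRmWatch(fd, uint32(wd)); err != nil {
		return false, err
	}

	buf := make([]byte, 4096)
	n, err := syscall.Read(fd, buf)
	if err != nil {
		return false, err
	}
	// Walk the events in the buffer: fixed-size header, then Len name bytes.
	for off := 0; off+syscall.SizeofInotifyEvent <= n; {
		ev := (*syscall.InotifyEvent)(unsafe.Pointer(&buf[off]))
		if ev.Mask&syscall.IN_IGNORED != 0 {
			return true, nil
		}
		off += syscall.SizeofInotifyEvent + int(ev.Len)
	}
	return false, nil
}

func main() {
	ignored, err := ignoredAfterRemove()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("IN_IGNORED delivered:", ignored)
}
```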

epoll, eventfd

In Go, blocking calls can be made conveniently asynchronous by executing them in a goroutine that sends their results over a channel. The downside is that a goroutine blocked in a syscall cannot be interrupted, or abandoned through a select on channels, until the syscall itself becomes ready or returns an error. A read on a pipe can be unblocked by widowing the pipe (closing its last write end), but there is no such guarantee for reads on an inotify fd.
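
The pattern, and the pipe escape hatch, might look like the following sketch. The names result, readAsync and widowDemo are invented for the example. A select with a timeout lets the caller move on, but the goroutine itself stays parked in Read until the write end is closed.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"time"
)

// result carries the outcome of one blocking read.
type result struct {
	data []byte
	err  error
}

// readAsync performs a single blocking Read in its own goroutine and
// delivers the outcome on the returned channel.
func readAsync(r io.Reader) <-chan result {
	ch := make(chan result, 1)
	go func() {
		buf := make([]byte, 4096)
		n, err := r.Read(buf)
		ch <- result{buf[:n], err}
	}()
	return ch
}

// widowDemo reports whether the reader was still blocked after a timeout,
// and whether closing the write end released it with EOF.
func widowDemo() (stillBlocked, sawEOF bool, err error) {
	r, w, err := os.Pipe()
	if err != nil {
		return false, false, err
	}
	defer r.Close()

	ch := readAsync(r)
	select {
	case res := <-ch:
		// Not expected: nothing has been written yet.
		w.Close()
		return false, res.err == io.EOF, nil
	case <-time.After(50 * time.Millisecond):
		// The select gave up, but the goroutine is still inside Read.
		stillBlocked = true
	}

	// Widow the pipe: with no write end left, the pending read sees EOF.
	if err := w.Close(); err != nil {
		return stillBlocked, false, err
	}
	res := <-ch
	return stillBlocked, res.err == io.EOF, nil
}

func main() {
	blocked, eof, err := widowDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("still blocked after timeout:", blocked)
	fmt.Println("released with EOF after close:", eof)
}
```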

epoll(7) helps with this when combined with eventfd(2). In this arrangement, epoll monitors both the inotify and the eventfd file descriptors. The process can make the eventfd ready by writing to it, thus unblocking the epoll_wait(2) call and allowing the logic to move forward. It effectively acts as a syscall-level channel, with epoll as a select. eventfd is a very handy little interface, simple enough to warrant only a side note in TLPI.
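
A rough, Linux-only sketch of that arrangement using the standard syscall package is shown below. The helper names (newEventfd, wake, waitReady, wakeDemo) are assumptions made for this example, and eventfd is created through the raw eventfd2 syscall number because the syscall package has no dedicated wrapper. The inotify descriptor has no watches here, so only the eventfd can become ready.

```go
package main

import (
	"fmt"
	"syscall"
	"time"
	"unsafe"
)

// newEventfd creates an eventfd whose counter starts at zero.
func newEventfd() (int, error) {
	fd, _, errno := syscall.Syscall(syscall.SYS_EVENTFD2, 0, 0, 0)
	if errno != 0 {
		return -1, errno
	}
	return int(fd), nil
}

// wake adds 1 to the eventfd counter, which makes the descriptor readable.
func wake(efd int) error {
	one := uint64(1)
	// eventfd expects exactly 8 bytes holding a uint64 in host byte order.
	buf := (*[8]byte)(unsafe.Pointer(&one))[:]
	_, err := syscall.Write(efd, buf)
	return err
}

// waitReady blocks in epoll_wait(2) and returns the first ready descriptor.
func waitReady(epfd int) (int, error) {
	events := make([]syscall.EpollEvent, 1)
	for {
		n, err := syscall.EpollWait(epfd, events, -1)
		if err == syscall.EINTR {
			continue
		}
		if err != nil {
			return -1, err
		}
		if n > 0 {
			return int(events[0].Fd), nil
		}
	}
}

// wakeDemo registers an idle inotify fd and an eventfd with epoll, then
// has another goroutine write to the eventfd to unblock the wait.
func wakeDemo() (string, error) {
	ifd, err := syscall.InotifyInit1(0)
	if err != nil {
		return "", err
	}
	defer syscall.Close(ifd)
	efd, err := newEventfd()
	if err != nil {
		return "", err
	}
	defer syscall.Close(efd)
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		return "", err
	}
	defer syscall.Close(epfd)

	for _, fd := range []int{ifd, efd} {
		ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(fd)}
		if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, fd, &ev); err != nil {
			return "", err
		}
	}

	errc := make(chan error, 1)
	go func() {
		time.Sleep(20 * time.Millisecond)
		errc <- wake(efd)
	}()

	ready, err := waitReady(epfd)
	if err != nil {
		return "", err
	}
	if err := <-errc; err != nil {
		return "", err
	}
	if ready == efd {
		return "eventfd", nil
	}
	return "inotify", nil
}

func main() {
	which, err := wakeDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("woken by:", which)
}
```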

Process communication

It is tempting to read commands from stdin and use the shell to redirect it to a FIFO. However, the semantics of, for example, echo 'command arg1 arg2' > /tmp/the-fifo make for a more contrived behavior than is worth dealing with. The shell will open the FIFO, write to it, and close it. If a reader is blocking on (redirected) stdin, it will see the bytes, then see EOF. But it will not go back to blocking on the next read unless there is at least one write descriptor open for the FIFO. Instead, every subsequent read returns EOF immediately, so the reader ends up busy-polling stdin, which is not what one wants. This is precisely the reason why, when setting up a pipe between two processes, the reader closes the write end of the inherited pipe: if it did not, a read(2) would block forever instead of returning EOF, because the reader's own write descriptor would still be open.
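
The EOF behavior can be reproduced in a few lines of Go. This sketch assumes a Unix system where syscall.Mkfifo is available; fifoDemo is a hypothetical name, and the writer goroutine stands in for the shell's open, write, close sequence.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// fifoDemo mimics `echo 'command arg1 arg2' > the-fifo` and returns the
// payload plus how many of the next three reads reported EOF.
func fifoDemo() (string, int, error) {
	dir, err := os.MkdirTemp("", "fifo-demo-*")
	if err != nil {
		return "", 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "the-fifo")
	if err := syscall.Mkfifo(path, 0o600); err != nil {
		return "", 0, err
	}

	// Writer side: open, write, close, like the shell redirection does.
	werr := make(chan error, 1)
	go func() {
		w, err := os.OpenFile(path, os.O_WRONLY, 0)
		if err != nil {
			werr <- err
			return
		}
		_, err = w.WriteString("command arg1 arg2\n")
		if cerr := w.Close(); err == nil {
			err = cerr
		}
		werr <- err
	}()

	r, err := os.Open(path) // blocks until the writer opens its end
	if err != nil {
		return "", 0, err
	}
	defer r.Close()
	if err := <-werr; err != nil { // the writer has written and closed
		return "", 0, err
	}

	buf := make([]byte, 4096)
	n, err := r.Read(buf)
	if err != nil {
		return "", 0, err
	}
	payload := string(buf[:n])

	// With no writer left, these reads return EOF instead of blocking.
	eofs := 0
	for i := 0; i < 3; i++ {
		if _, err := r.Read(buf); err == io.EOF {
			eofs++
		}
	}
	return payload, eofs, nil
}

func main() {
	payload, eofs, err := fifoDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("payload %q, then %d EOF reads in a row\n", payload, eofs)
}
```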

All this is easily avoided with a UNIX domain socket. The lifecycle of the connection, and the separation of the accept(2) and read(2) calls, make the perfect barrier to implement a blocking command reading loop. The setup is only slightly more involved, with three syscalls (socket, bind and listen), but the simplicity when reading more than makes up for that.
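
A minimal sketch of such a loop in Go follows. The names serve, send and socketDemo, the socket file name, and the line-per-command format are all assumptions for the example. Go's net.Listen hides the socket, bind and listen calls behind a single function.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// serve accepts one connection at a time and forwards each line it reads
// to cmds. It returns once the listener has been closed.
func serve(l net.Listener, cmds chan<- string) {
	for {
		conn, err := l.Accept() // blocks until a client connects
		if err != nil {
			close(cmds)
			return
		}
		sc := bufio.NewScanner(conn)
		for sc.Scan() { // stops at EOF, when this client disconnects
			cmds <- sc.Text()
		}
		conn.Close()
	}
}

// send connects, writes a single command line, and disconnects.
func send(path, cmd string) error {
	c, err := net.Dial("unix", path)
	if err != nil {
		return err
	}
	defer c.Close()
	_, err = fmt.Fprintln(c, cmd)
	return err
}

// socketDemo sends two commands over two separate connections and returns
// them in the order the server loop received them.
func socketDemo() ([]string, error) {
	dir, err := os.MkdirTemp("", "sock-demo-*")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "cmd.sock")

	l, err := net.Listen("unix", path) // socket, bind and listen
	if err != nil {
		return nil, err
	}
	defer l.Close()

	cmds := make(chan string)
	go serve(l, cmds)

	var got []string
	for _, c := range []string{"command arg1 arg2", "quit"} {
		if err := send(path, c); err != nil {
			return nil, err
		}
		got = append(got, <-cmds)
	}
	return got, nil
}

func main() {
	got, err := socketDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", got)
}
```

Each client disconnecting ends only the inner read loop. The outer loop then goes back to blocking in Accept, so there is no EOF state to poll on.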

Links