Regarding file permissions: This kind of archaeology (or forensics) is very important. Why? Because it exposes the trial-and-error over a multi-decade evolution. Sometimes trial-and-error is used as a pejorative (brute force hacking), but over the course of decades as technology advances, it is inevitable. The author is careful to point out that much of the issue here was due to scalability, but I think there is something else at work: unknown unknowns. It is impossible to be a 100% defensive software architecture team, and "room to grow" is usually jettisoned because it can lead to sloppy code, or worse, attack vectors. It's such a hard problem, and analyses like these papers are a first step in what I believe will become a full-blown historic discipline of software meta-thought. I say "become" because you can't really do this kind of analysis with 5, 10 or 20 years of history: you need multiple decades, and that is just now upon us.
I think there are applications for this kind of study. It can very clearly feed back into current practices, and possibly even more formalized language syntax that can be defensive and extensible. I would also love to see if this kind of analysis bears out which aspects of various languages (and architectural OS decisions) proved to be the most robust. Like with hemoglobin: it is one of the largest and oldest genes, it is hard to break via mutation, and is shared by every animal with oxygenated blood cells. Something was done right with that design!
Something else that doesn't often get brought up here is that kill(2) itself is an unfixable race condition waiting to happen. It's only safe to use that to signal direct child processes. In Linux, programs should be using the newer pidfd_send_signal syscall in almost every case where they would otherwise use kill(2).
Edit: waitpid is similarly broken and unfixable, for a lot of the same reasons as signals. pidfds and waitid(P_PIDFD, ...) should be replacing most uses of that as well.
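For anyone who wants to see the pidfd calls side by side, here is a minimal sketch in Python (3.9+), assuming a Linux kernel new enough for pidfd_open, pidfd_send_signal and waitid(P_PIDFD, ...). The function name is made up for illustration, and it returns None when the platform or a sandbox refuses the calls.

```python
import os
import signal
import sys


def terminate_via_pidfd():
    """Spawn a sleeping child, signal it through a pidfd, then reap it.

    Returns the number of the signal that killed the child, or None when
    pidfd support is missing (old Python, old kernel, or a sandbox).
    """
    if not (hasattr(os, "pidfd_open")
            and hasattr(os, "P_PIDFD")
            and hasattr(signal, "pidfd_send_signal")):
        return None

    pid = os.posix_spawn(
        sys.executable,
        [sys.executable, "-c", "import time; time.sleep(60)"],
        os.environ,
    )
    pidfd = None
    try:
        # We are the direct parent and have not reaped the child, so the
        # pid cannot have been recycled between spawn and pidfd_open.
        pidfd = os.pidfd_open(pid)
        signal.pidfd_send_signal(pidfd, signal.SIGTERM)
        info = os.waitid(os.P_PIDFD, pidfd, os.WEXITED)
        return info.si_status
    except OSError:
        # Fall back to the classic calls, which are still safe here
        # because the child is ours and has not been reaped yet.
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
        return None
    finally:
        if pidfd is not None:
            os.close(pidfd)
```

The point of the design is that every operation after pidfd_open goes through the file descriptor, so nothing depends on the numeric pid staying valid.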
This is a (rare?) instance where I would say Win32 gives you some remedy over POSIX. On Windows, you can open a handle and deal with that, rather than a pid. Once the handle is open, it isn't subject to this recycling problem.
However if opening a handle based on pid you may want to double-check that the handle matches your expectation before using it, since that would be prone to the same race.
This is actually what pidfd does on Linux: you get a handle to the process that lets you interact with it in that same manner. Once the process exits, the handle stops referring to a live process and all the operations will report an error, even if the pid has been recycled.
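A small sketch of that behaviour, assuming Python 3.9+ on a kernel with pidfd support. The helper name and its string return values are invented for the example; signal 0 is used as an existence check so nothing is actually delivered.

```python
import os
import signal
import sys


def probe_reaped_process():
    """Signal a process through its pidfd after it has been reaped.

    Returns "ESRCH" if the kernel reports that the process is gone,
    "delivered" if the call unexpectedly succeeds, or None when pidfd
    support is unavailable.
    """
    if not (hasattr(os, "pidfd_open")
            and hasattr(signal, "pidfd_send_signal")):
        return None

    pid = os.posix_spawn(
        sys.executable, [sys.executable, "-c", "pass"], os.environ
    )
    try:
        pidfd = os.pidfd_open(pid)
    except OSError:
        os.waitpid(pid, 0)
        return None

    try:
        # After this call the child is reaped and its pid may be reused
        # by any unrelated process.
        os.waitpid(pid, 0)
        try:
            signal.pidfd_send_signal(pidfd, 0)  # 0 = existence check only
        except ProcessLookupError:
            return "ESRCH"
        except OSError:
            return None
        return "delivered"
    finally:
        os.close(pidfd)
```

With kill(2) the same probe against the bare pid could hit whichever process inherited the number; through the pidfd it can only ever refer to the original child.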
Not that it matters much these days, but do note that pidfd is a Linux concept, not a POSIX one. It's also a fairly recent one and it'll be a number of years before many people can use it even still. Windows by comparison has had a solution for quite some time.
Had no idea about this! Thank you.
Now I'm starting to understand why undefined behavior safety is such a walled garden. All sorts of snakes might be lurking beneath.
I've always felt like it should be the case that as long as a pidfd for that process is open, the pid doesn't get recycled, so you could open the pidfd and then use kill safely, then close it later. Means you wouldn't need a whole bunch of new syscalls.
Unfortunately, it seems like this idea was rejected during the introduction of pidfd.
I think the idea with that was it would lead to denial-of-service type situations where some process could leak a bunch of pidfds and then that would cause exhaustion of pids everywhere else.
Not really, most systems should set RLIMIT_NPROC to prevent that. If pidfds held onto the pid, it would create a new denial-of-service that allowed random other processes to keep zombie processes open, and the fix for it would actually allow you to circumvent that limit!
Isn't this easily fixed by counting held processes via pidfd towards RLIMIT_NPROC? You'd get a rlimit error when calling pidfd_open, but it already returns EMFILE for RLIMIT_NOFILE, so that doesn't seem onerous unless you're trying to enforce exact numbers of processes rather than "less than some reasonable limit".
Edit: note that pidfds holding same-uid processes have no effect (that process already belongs to your uid), and multiple processes holding a single foreign process only count once (that single foreign process is counted toward your NPROC, since it will stick around for pidfd even if its user kills it).
I don't think that would necessarily solve it, since the maximum number you can have open is still RLIMIT_NPROC * RLIMIT_NOFILE, right? It seems it would still be a problem as long as it's greater than RLIMIT_NPROC. Edit: I suppose you could fix it as long as you could guarantee that NPROC * NOFILE * maxlogins < kernel.pid_max... but to me this is piling on more workarounds.
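To put rough numbers on that inequality, here is a tiny sketch. The function names and the limit values in the usage note are made up for illustration; real values would come from getrlimit and /proc/sys/kernel/pid_max.

```python
def pinned_pid_bound(nproc, nofile, max_logins):
    """Worst-case pids held alive under the bound described above."""
    return nproc * nofile * max_logins


def fits_in_pid_space(nproc, nofile, max_logins, pid_max):
    """True when the worst case still leaves the pid space unexhausted."""
    return pinned_pid_bound(nproc, nofile, max_logins) < pid_max
```

With hypothetical limits of NPROC=4096, NOFILE=1024 and 10 logins, the bound is 41,943,040, which is ten times a pid_max of 4194304, so the guarantee would only hold with far tighter limits than many systems ship.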
This might be naive of me, but isn't this in a way fixed by systemd service handling? Assuming the process in question is, in fact, handled as a service of course.
The issue is solved in any service manager as long as the service doesn't fork: when you are the parent process, you can ensure that you don't reap the child before sending a signal.
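The parent-side ordering looks roughly like this sketch (Python, with a hypothetical function name and a sleeping child standing in for the service). It assumes SIGCHLD is left at its default disposition, so the child is not reaped behind the parent's back.

```python
import os
import signal
import sys


def stop_direct_child():
    """Signal a direct child first, reap it second.

    Returns the number of the signal that terminated the child, or None
    if it exited some other way.
    """
    pid = os.posix_spawn(
        sys.executable,
        [sys.executable, "-c", "import time; time.sleep(60)"],
        os.environ,
    )
    # Until waitpid() is called, the kernel keeps this pid reserved for
    # our child (running or zombie), so kill() cannot hit a stranger.
    os.kill(pid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else None
```

The guarantee disappears as soon as the pid belongs to a grandchild, because someone else controls when that process gets reaped.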
Once the service forks then it becomes a problem. If you use cgroups it can be solved separately with the cgroup freezer, but there are still some open issues with this in systemd: https://github.com/systemd/systemd/issues/13101
The waitpid() call isn't quite as broken; I think there are only races there if there is any chance other parts of your process call waitpid() on the same pid and it gets reaped twice.
Under normal operation (pid returned from fork, waitpid used in one spot until the child exits, no odd SIGCHLD shenanigans), I don't think waitpid() has any races.
The issue I know of is, you cannot really use waitpid(-1, ...) in the same process as waitpid(<pid>, ...), as it would create a race condition on which call consumes the pid first. You can mitigate it in some programs by doing a lookup after the waitpid(-1, ...), but still it's not really reliable for libraries to spawn child processes and use that function. See for example the notes here: https://developer.gnome.org/glib/stable/glib-The-Main-Event-...
If you need to multiplex a wait for multiple processes, it's safer to use epoll with several pidfds. But you still eventually may need to call waitpid(-1, ...) to reap any forked processes, and there really is no good way around that.
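A hedged sketch of the multiplexing approach, assuming Python 3.9+ and Linux 5.4+ (the selectors module picks epoll on Linux). The function name is hypothetical and the children are throwaway interpreters that exit with the requested codes.

```python
import os
import selectors
import sys


def reap_with_pidfds(exit_codes):
    """Spawn one child per exit code and reap each through its own pidfd.

    Returns {pid: exit_status}, or None when pidfds are unavailable.
    """
    if not (hasattr(os, "pidfd_open") and hasattr(os, "P_PIDFD")):
        return None

    pids = [
        os.posix_spawn(
            sys.executable,
            [sys.executable, "-c", f"raise SystemExit({code})"],
            os.environ,
        )
        for code in exit_codes
    ]
    pidfds = {}
    try:
        for pid in pids:
            pidfds[os.pidfd_open(pid)] = pid
    except OSError:
        # No pidfd support here: clean up and reap the classic way.
        for fd in pidfds:
            os.close(fd)
        for pid in pids:
            os.waitpid(pid, 0)
        return None

    results = {}
    with selectors.DefaultSelector() as selector:
        for fd, pid in pidfds.items():
            selector.register(fd, selectors.EVENT_READ, data=pid)
        while len(results) < len(pids):
            # A pidfd becomes readable once its process has terminated.
            for key, _events in selector.select():
                info = os.waitid(os.P_PIDFD, key.fd, os.WEXITED)
                results[key.data] = info.si_status
                selector.unregister(key.fd)
                os.close(key.fd)
    return results
```

Each wait names exactly one child through its descriptor, so a library using this pattern never consumes an exit status that another part of the program was waiting for.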
I think "unfixable" is in the eye of the beholder. A lot of these design choices make a lot more sense if you believe that complex Unix applications are supposed to be built from multiple loosely-coupled programs that perform exactly one task each, and communicate by piping data to one another. I think the two design decisions discussed in this article only create problems for developers who write programs that do "too much," such that the programs can no longer make best use of the OS facilities.
> Unix signals
Unix signals are an asynchronous best-effort form of out-of-band IPC. Because the programs that make up your Unix application already use pipes for IPC (which are synchronous and reliable), the role of the signal handler in a program would be to either absorb the signal by taking some localized action, or translate the signal into some piped IPC message to other program(s) in the application to consume and handle.
It's been pointed out elsewhere that threads and signals don't play nicely. But that shouldn't be a problem for a multi-program Unix application -- you'd keep the multi-threaded logic in a separate program(s) from the signal-handling logic, and have the signal-handling program forward the multi-threaded program the signal data in-band, via a pipe. For example, you might factor the application into a supervisor program and one or more subordinate programs (which can be multi-threaded), and have the supervisor intercept signals and route the relevant IPC notification to subordinates via pipes.
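As a rough illustration of the "translate the signal into a piped message" step, here is a sketch in Python. The function name is hypothetical, and for brevity one process plays both roles: in the layout described above, the read end of the pipe would belong to a subordinate program.

```python
import os
import signal


def relay_signal_through_pipe(signum=signal.SIGUSR1):
    """Turn one incoming signal into a one-byte message on a pipe.

    Returns the signal number as read back from the pipe.
    """
    read_end, write_end = os.pipe()

    def handler(received, _frame):
        # The handler does no real work: it only forwards the signal
        # number in-band for ordinary, synchronous code to consume.
        os.write(write_end, bytes([received]))

    previous = signal.signal(signum, handler)
    try:
        signal.raise_signal(signum)      # stand-in for an external sender
        return os.read(read_end, 1)[0]   # what the subordinate would read
    finally:
        signal.signal(signum, previous)
        os.close(read_end)
        os.close(write_end)
```

Everything downstream of the pipe sees a normal readable file descriptor, which is why multi-threaded subordinates never have to touch signal handling themselves.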
> Unix permissions
The "one user and one group" model for files stops being so limiting if you can make it so the different programs that make up your application run as different users and groups. For example, a "logger" program in your application would have a separate user/group ID than a "database" program, and in doing so, ensure that the "logger" program can only access log state, and the "database" program can only access database state.
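To make the logger/database split concrete, here is a simplified model of the classic owner/group/other check. It is a sketch that deliberately ignores root, ACLs and capabilities, and the uids, gids and function name are invented for the example.

```python
def classic_unix_access(mode, file_uid, file_gid, proc_uid, proc_gids, want):
    """Model the traditional permission check for one file.

    `want` is a string made of 'r', 'w' and 'x'. Exactly one permission
    class applies: owner, else group, else other.
    """
    if proc_uid == file_uid:
        bits = (mode >> 6) & 0o7
    elif file_gid in proc_gids:
        bits = (mode >> 3) & 0o7
    else:
        bits = mode & 0o7
    needed = sum({"r": 4, "w": 2, "x": 1}[flag] for flag in set(want))
    return bits & needed == needed


# Hypothetical accounts: logger is uid 1001 / gid 2001, database is
# uid 1002 / gid 2002, and the log file is mode 0640 owned by the logger.
LOG_FILE = dict(mode=0o640, file_uid=1001, file_gid=2001)
```

Under these assumed ids, `classic_unix_access(**LOG_FILE, proc_uid=1001, proc_gids={2001}, want="rw")` is true for the logger, while the same call with the database's uid 1002 and gid 2002 is false even for a plain read.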
But that's more working around the problems than fixing them, no? You can't really move all signal handling logic to a dedicated process, because every process can receive signals. And as for databases: every major production database I've seen implements its own authentication and permission scheme, so it can do things like provide ACLs on a more granular (per-table, sometimes per-row) basis.
> But that's more working around the problems than fixing them, no?
I don't think Unix signal behavior is the problem. The problems outlined in the article stem from people using them inappropriately. The new signal syscalls introduced in Linux over the years haven't stopped people from misusing them.
> You can't really move all signal handling logic to a dedicated process, because every process can receive signals.
Processes are not obliged to take action in response to signals. But, they could simply propagate the signal data to the parent process via a pipe file descriptor it inherits. Then, you could place all the signal-handling logic into a supervisor -- the supervisor would get notified via a pipe when one of its descendants receives a signal, and take appropriate action.
> And as for databases: every major production database I've seen implements its own authentication and permission scheme, so it can do things like provide ACLs on a more granular (per-table, sometimes per-row) basis.
No one said an application can't have its own authentication and permission scheme. All I am saying is that if you factor your application into multiple processes running under different system-level user accounts, you can get more mileage out of the Unix permission system than you could otherwise, because the kernel would be able to distinguish individual pieces of your application as having different sets of permissions.
Isn't it the case that something like signals simply needs to be left as is, with a new API for interprocess communication developed alongside?
It seems pretty clear that they're used for way more than originally expected (did threads even exist when signals began?) - and I suspect a number of systems use other communication paths already.
Unix has a plethora of IPC APIs, almost all of which were invented after signals (e.g. sockets).
Signals themselves got new APIs long before signalfd: sigaction and POSIX real-time signals were already a thing, as were POSIX threads, when Linux was invented.
What’s really sad is that multi user system interrupts were long since a solved problem when Unix was developed. I don’t know why that existing body of knowledge wasn’t applied.
Unix was also a skunkworks project running on pretty limited hardware. E.g. /usr/bin was added because the disk volume for /bin ran out of space.
So it used very limited tools, like C with its very simple semantics and a preprocessor instead of a module system, very simple kernel mechanisms, etc.
Do you have a reference on pre-1969 multi user interrupts being solved?
Also, Unix was developed on a ~16kB RAM machine IIRC... Maybe that's the reason?
Sure, here’s Dijkstra on it[1]. The X1[2] was a significantly more limited machine than the PDP-11. I believe that work predates Unix by nearly a decade.
The impression I got from reading the v6 kernel code in the Lions book was that signal handling had been added in as a solution to a few specific problems (like "we're going to kill this process but maybe it should get a chance to clean up first"). If you're thinking about them from that viewpoint then the (now) well-known problems like interruption-of-syscalls and the initial "when you take a signal the handler gets unregistered" don't seem like such a big deal -- after all, the process is going to exit anyway.
Are you talking about kernel threads or user space threads? I believe neither of them was really in Unix very early, but the time of introduction varied.
I vaguely remember that kernel threads were something new in HP-UX at the end of the 1990s.
The question is, was that late? Windows NT had kernel threads from the beginning, so maybe a few years earlier. But then it took NT years to become stable enough to be used in servers, so saying they were generally ahead would not be a correct description.
So if Unix is considered late (according to the GGP) and NT not a comparable competitor, who was really early? If anybody.
There are definitely other problems with signals; for instance you can't fully implement a signal handler in most languages except C, because you can't write async-signal-safe code in them.
This is reasonable in a memory-safe language, but means you need some other way to interrupt a read from stdin with ^C.
A proper inter-process communication mechanism that supports multicast has been proposed to the kernel multiple times and denied every time. So it's probably not going to happen.
The variants that ultimately got corporate support turned out to be hot turds in implementation... and slower than userland implementation once you fixed some glaring bugs in it.
One of the complaints brought up, the 16-bit uid/gid, seems to have been fixed quite nicely in modern Linux systems by adding an additional 16 bits to each. It seems these problems aren't "unfixable" after all.
The idea that ownership and permission bits at the file level are a problem that can be fixed by moving the permissions to the directory level completely hand-waves over the fact that hard links exist in Unix file systems. They need to think a little harder about that Mencken quote.
See the next article in the series "Ghosts of Unix past, part 4: High-maintenance designs"[1]. This specifically addresses how the existence of hard links is elegant in itself, but exports complexity to other parts of the system.
Have you worked with setfacl(1), getfacl(1) recently? The agony they inflict makes me want to die. Do you need log dirs read by a non-root logreader? Are there nested subdirs? What are the defaults? Extra crispy boss-mode: is SELinux on? I think the extended ACLs have taken us further into the weeds, and I think the permission architecture needs to be rethought entirely. It was designed for shared university-type computing resources at a time when 30 profs and researchers shared dirs and commingled a set of files, and daemons were users with their own places to keep things. No longer. The RBAC and inheritance model, I dunno, they may work correctly but they are so fiddly with so many knobs and intersections that you end up front-loading a huge amount of work; nobody wants to do that, nor have I seen it done correctly, with design and intent.
I'm actually fully behind the POSIX permissions model as a solution for this: If you have a group that all needs read/write, no big deal. If you have a group that needs to write and the world reads, no big deal. If you have a group that writes and another group that reads: No big deal, so long as you have a third group that's the union of both groups and can have a multi-level subdirectory (where a/ has 750 and a/b/ has 775). If you have groups that need to read and groups that need to write in a more complicated (or somehow path-specific) problem, you probably need a daemon or setuid program to moderate access, and that's okay.
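The a/ and a/b/ layout can be sketched as below (Python, hypothetical function name). Group ownership is left out because changing it needs membership in the target groups, so the comments only say which group each directory is assumed to belong to.

```python
import os
import stat


def build_shared_tree(root):
    """Create root/a (mode 750) and root/a/b (mode 775); return the modes."""
    outer = os.path.join(root, "a")
    inner = os.path.join(outer, "b")
    os.mkdir(outer)
    os.mkdir(inner)
    # a/ is assumed to be group-owned by the union group (readers plus
    # writers): members may enter and list it, everyone else is stopped.
    os.chmod(outer, 0o750)
    # a/b/ is assumed to be group-owned by the writers group: writers
    # get rwx, while readers fall under "other" and get r-x only.
    os.chmod(inner, 0o775)
    return {
        "a": stat.S_IMODE(os.stat(outer).st_mode),
        "a/b": stat.S_IMODE(os.stat(inner).st_mode),
    }
```

The wide-open "other" bits on a/b/ are only reachable by people who already passed the 750 gate on a/, which is what lets two plain groups express a reader set and a writer set.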
Happy to argue it or simply be told I'm wrong, but I've yet to encounter a not-insane permissions model that I couldn't solve with some "simple" nested groups (that in and of itself is a tooling problem, but a solvable one) and POSIX.
Linux is probably the last major system not supporting NFSv4 ACLs. Windows (obviously), MacOS X, Solaris, FreeBSD - all those support them - for at least a decade now.
POSIX ACLs were something even the POSIX group didn't want to publish and are a horrible, horrible mess once you deal with ACLs beyond individual files.
I spent some... long nights writing a program that ensured ACLs were appropriately applied in complex setup and automatically inherited by new files.
It supported both NFSv4/ZFS format and POSIX ACLs... the former was essentially one line per ACL, the latter involved a very racy "drop all", "apply again recursively in very specific order" for every change in ACLs.
The "systematic" part of things is relatively easy to handle (as the code can be made to handle anything complex) - it's the "user" interface that is harder. A user with root access wants to give access to a given file/directory to a user - this needs to be made easy to do successfully and securely. Too many times I've seen entire web directories 777 because they just wanted it to work.
Commands providing "why user X can't access Y" and recommended solutions can help.
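A very rough sketch of what the core of such a command could look like, in Python. The function is hypothetical, only answers for the process running it (os.access checks the real uid and gid), and does not try to recommend a fix.

```python
import os


def first_blocking_component(path, want=os.R_OK):
    """Return the first path component that blocks access, or None.

    Each directory on the way needs search (x) permission; the target
    itself needs `want`. A missing component is reported as blocking.
    """
    path = os.path.abspath(path)
    parts = path.split(os.sep)
    current = os.sep
    for part in parts[1:-1]:  # every directory leading to the target
        current = os.path.join(current, part)
        if not os.access(current, os.X_OK):
            return current
    return None if os.access(path, want) else path
```

Pointing at the single directory that lacks the search bit is usually more useful than the bare "Permission denied", and it removes some of the temptation to reach for 777.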
Yes, this is a major problem when you introduce ACLs into unix-like systems. A comment in the article mentions the "Richacl" work. A key problem with this work was that even "chmod 777" might not get you out of a situation where an ACL was denying access. It's been over ten years since I've been involved in this; it might have changed.
The POSIX draft ACLs had the same problem, where a chmod might not grant you the permission that you're asking for. Back when Solaris implemented POSIX draft ACLs, they needed to change many user-level interfaces (e.g., the chmod command and the ftp daemon) to have a chmod request work the way end users expected.
Lots of difference between how unfixable design problems are treated in open source vs closed source operating systems. Is this difference good or bad?
I would argue being open source vs closed source doesn't matter. The governance model does, as do the priorities of the project. This isn't to say you need a BDFL or single company running the project to address "unfixable" problems, but they certainly do seem to help.
On a more meta note, open source means a lot of different things. There is actually a lot of nuance in the different styles of open source. Linux vs chromium vs that project that just does source dumps. Whether they accept contributions, accept bug reports/feature requests, allow you to build from source (source dumps often don't include a working build system), have open communication channels, etc can all vary. I hope we have more specific terms for the different styles of open source in the future.
I agree - though open source does allow for a major fork if the users don't agree with the developers (or project leadership) on how it should be fixed.
Project leadership is the most significant aspect - look at Linus's absolute declaration that the kernel can "never break userspace" meaning that once an API is exported to userspace it never gets removed.
This is actually similar to Microsoft's philosophy though theirs is more "business oriented" (nobody will buy Win95 if their DOS and Win 3.1 programs won't work). Another example of this is Knuth's TeX code.
Open source developments seem to lean (in general) more toward "rip it out and replace everything" (see for example internal kernel APIs not exported to userspace) because access to the source means they can fix the things that touch it. Closed source programs more likely just die and get entirely replaced, otherwise they roughly try to keep working as is.
Really enjoyed reading about the struggles of implementing file permissions.
It seems like something that should be so simple, but once you sit down and try to build it you'll realize you have to support so many use cases. I bet if you asked everyone on HN how they'd do it, you'd end up with so many confident answers that also had shortcomings themselves.
Biometrical specimens can be another 5 to 20 years. No threads old now tomorrow I can create data to show easy tasks the year is decade plus one passed part 2 and 1. No threads