Emulating Windows system calls in Linux

mrpippy · on Aug 11, 2020

There was a follow up a few weeks later: https://lwn.net/Articles/826313/

Ericson2314 · on Aug 11, 2020

Let's make this the URL? I think the resolution makes for a more interesting discussion, and it does still talk about the provisional design in brief.

vii · on Aug 12, 2020

The plan described there is to use a signal (SIGSYS) to jump back into the process. This involves multiple kernel/userpace switches as most likely you need some Linux syscalls to emulate the MS Windows one.

I like the idea of using eBPF to run an emulation layer. That has potential to be more efficient and doesn't require the running process to cooperate so closely with the syscall emulation. Then this would probably lead to extensions to eBPF and its tooling which should be useful for other things!

nxc18 · on Aug 11, 2020

This seems like it would be a lot of maintenance effort to get right, considering Windows shuffles their syscall table every release, including releases within the Windows 10 family.

https://j00ru.vexillium.org/syscalls/nt/64/ Has an excellent table of syscalls by version.

gizmo686 · on Aug 11, 2020

If applications are calling these syscalls directly; then there needs to be some form of stability; otherwise, the applications would only work on the particular version of Windows they were compiled for.

kazinator · on Aug 11, 2020

That stability comes from the C interfaces of the Win32 API implemented by a DLL like kernel32.dll.

I think when the article is talking about Windows system calls, it's about the interface between the kernel32.dll and the kernel, not about the Win32 API.

poizan42 · on Aug 11, 2020

The syscall api is in ntdll.dll (and user32.dll for win32k.sys calls). kernel32 does not know the syscall numbers, it just calls functions in ntdll.

my123 · on Aug 11, 2020

Before, it used to be user32.dll/gdi32.dll calling directly while also having tons of logic inside.

Nowadays, it's split up as it should always have been, with the path being user32.dll -> win32u.dll -> win32k.sys.

poizan42 · on Aug 11, 2020

Ah yes forgot about that. The kernel side has also been split up into win32kbase.sys and win32kfull.sys/win32kmin.sys (I think win32kmin may have only been used for the now discontinued Windows 10 Mobile), and win32k.sys just calls into those.

anthk · on Aug 11, 2020

Wine translates those too I think.

frabert · on Aug 11, 2020

The Win32 API (and nowadays the WinRT API as well) is that interface, system calls are explicitly unstable and internal-use-only I believe.

gizmo686 · on Aug 11, 2020

If all windows applications used the APIs like they were supposed to, then there would be no need to emulate the Windows syscalls. The point of this emulation is that some applications (games, apparently) were not using these APIs, and instead performing syscalls directly.

In order for this to be feasable on Windows, there needs to be some stability in the syscalls; even if that stability involves angry MS engineers complaining that major applications are doing something that they were explicitly told not to do.

leeter · on Aug 11, 2020

You'd be surprised, MS doesn't do this randomly. They'd rather break a game if the game is actively in production or the company is around. They only put in shims for applications if the vendor isn't still in business usually or can't get source code. But in quite a few cases they'll just put in a patch binary in the shims that just removes the anti-cheat all together from the binary when loaded. Game still runs... nobody cares.

saagarjha · on Aug 11, 2020

Until the game checks its integrity?

Const-me · on Aug 12, 2020

Files on disk aren’t affected, the OS applies these patches in RAM only.

It’s gonna be very hard to check integrity of code in RAM. The complications include relocation table (applied when loading; see also ASLR) and global variables (modified as the software runs).

Someone · on Aug 11, 2020

It need not be stable. The OS could provide each application with the system call table it expects to see (with most of them getting the standard one)

Determining what table a program expects likely would be a black art, but I don’t think that would stop Microsoft from trying that for programs deemed important for the platform.

freeone3000 · on Aug 12, 2020

The target windows version is part of the standard set of defines set when compiling the program. It's visible in the manifest and the linked c library.

kazinator · on Aug 11, 2020

> If all windows applications used the APIs like they were supposed to, then there would be no need to emulate the Windows syscalls.

The Win32 API's require certain operating system semantics, and not all of that semantics comes nicely from Linux syscalls.

A Win32 API library on Linux benefits from Windows-like syscalls underneath, even if nobody is bypassing that library to get to those syscalls.

Think about it; that is exactly like saying that if programs were just written to the C API described in the POSIX spec, we wouldn't need a Unix-like kernel underneath. While that isn't an outright false statement, it has implications for the nature and quality of the system.

leeter · on Aug 11, 2020

MS has always held that system calls are not guaranteed ABI stable. So any application doing this is subject to breakage at any time. That said I wouldn't be surprised if they are doing something not to dissimilar to this for those processes in kernel mode to emulate for things that are out of support and have no living vendor.

That said if you look at the list linked above some are "stable" meaning that MS probably knows this happens and kept those specific ones stable. But the rule still applies... never do direct system calls on NT.

magicalhippo · on Aug 11, 2020

Despite having coded for Windows for quite some time, I realized I had never actually peeked that low before, so didn't have a good understanding of how system calls really worked.

Found this article[1] which explained it nicely (and hopefully correctly).

[1]: https://www.codeguru.com/cpp/w-p/system/devicedriverdevelopm...

CodesInChaos · on Aug 11, 2020

Since calling syscalls directly instead of through the wrapper dll is unsupported, I wouldn't be surprised if MS prevented syscalls from outside the expected libraries at some point.

OpenBSD already does this to improve security: https://marc.info/?l=openbsd-tech&m=157488907117170&w=2

saagarjha · on Aug 11, 2020

OpenBSD's technique doesn't really stop any actual attacks, nobody really targets syscall instructions in binaries because most of them don't have them anyways (especially on OpenBSD, where as I understand they aren't stable). You're going to ret2libc and this doesn't do anything to fix that.

wahern · on Aug 11, 2020

To ret2libc you first need to break ASLR. The point of the OpenBSD mitigation is to address JIT compiler exploits, which permit circumventing ASLR by injecting code to make syscalls directly.

saagarjha · on Aug 11, 2020

Generally, if you have enough control that you can influence the JIT compiler to emit arbitrary instructions you either already have an address leak or could make one fairly trivially.

guenthert · on Aug 11, 2020

I'm a bit at loss how this improves security or why it is necessary. If the kernel needs an external library as gatekeeper, something fundamentally went wrong.

And why would, say a Lisp compiler use the C library (or to use a more popular example, why would a Java VM do the same)? It might be common practice (to ease FFI/JNI), but making this a requirement seems heading the wrong direction.

CodesInChaos · on Aug 11, 2020

> I'm a bit at loss how this improves security or why it is necessary. If the kernel needs an external library as gatekeeper, something fundamentally went wrong.

It's not about defending the kernel from the user mode process. It's a mitigation to make exploiting usermode RCEs like buffer overflows a bit harder by possibly preventing some forms of ASLR bypass.

MS might do it in part for security, but likely aloo to prevent people from relying on unstable implementation details, which makes changing those details harder for MS because they try to avoid breaking popular applications.

> And why would, say a Lisp compiler use the C library

On unix based systems libc doubles as (part of) the OS's API. BSD in particular has no stable syscall interface, code must go through libc. Its libc doesn't even have a stable ABI across major versions, which is annoying if you're not using C. Non C languages are clearly second class citizens on BSD.

Linux on the other hand does have a stable documented syscall interface, so you are allowed to call the kernel directly, without relying on any OS provided usermode libraries.

On Windows it's similar to BSD, but you have to go through different OS provided libraries (at minimum ntdll, usually the Win32 API). At least these have a stable ABI. On Windows libc is little more than just another dll.

> but making this a requirement seems heading the wrong direction.

I disagree. IMO an OS provided dynamically linked library is a great abstraction boundary. It allows the OS to change its kernel interface or even turn operations that previously needed to transition into the kernel into purely usermode functions.

For example the OS might provide a mutex which in the first version transitions to the kernel on every lock attempt, while later it's replaced by a futex based implementation which only transitions on contented locks.

Or more relevant to the linked article, it allows implementing the same API on a completely different kernel (like Wine does), without increasing the kernel's API surface.

rstuart4133 · on Aug 11, 2020

> IMO an OS provided dynamically linked library is a great abstraction boundary.

In theory, yes that's true. In practice it's a question of whether it actually reduces anybodies work load. If everybody respected the boundary I sure it would reduce Microsofts work load.

But as has been reported repeatedly here, that's isn't how it's worked out in practice. Not only do applications not respect it, Microsoft finds itself in the position of having to ensure some of those applications keep working on newer kernel versions despite the fact they've wilfully violated the rules. So we end up the Windows GUI having to expose with "compatibility levels", and users having to deal with it. That doesn't sound like a win to me.

There are other ways of doing it that wouldn't suffer from that fate. Trampolines (a page the kernel creates in user space that intercepts some syscalls) is one way that Linux uses. I don't think Linux does, but it wouldn't be hard to insist that any syscall come from those pages. Do that and you have your cake (low overhead user space mediating some calls), and get to eat it to (you can safely assume every one uses it).

ChrisSD · on Aug 12, 2020

Except that the Windows applications that do use syscalls directly (i.e. anti-cheat software) are intended to break when something changes. It's part of their way of defending against tampering. It's also part of why they need to be continually updated and why some old games won't work at all on newer OSes until a patch is applied to hack out the anti-cheat engine.

If syscalls were stable, they'd just try to use some other means to obfuscate their OS calls and ensure it's sufficiently fragile as to break easily if something changes.

tl;dr the lack of stability is a feature not a bug for this use case.

trasz · on Aug 12, 2020

Not sure about other BSDs, but FreeBSD does have both stable syscall interface, and stable libc interfaces, which keep ABI both across major, and minor versions. Situation for non-C languages is pretty similar to what you'd expect in Linux, there's nothing second class there.

kazinator · on Aug 11, 2020

It isn't an external gatekeeper. The kernel receives a syscall. The kernel looks at the user space instruction pointer to determine where the sycall came from. The instruction pointer is not in a blessed area; the syscall is declined. That's all internal to the kernel. User space perhaps just arranges it; it informs the kernel about the restricted area. That's essentially the same as any other limit, like setting a limit on stack size, core file size, execution time, whether the stack is executable, ...

saagarjha · on Aug 11, 2020

I believe libc is the only stable way to ask the kernel to do something for you.

kazinator · on Aug 11, 2020

GNU Libc is only stable at the source code level, and tries fairly hard at the binary level with approaches like symbol versioning and whatnot.

Linux syscalls are more stable at the actual binary level. Some "int $0x80" code you wrote in 1996 in x86 assembly language will work today, provided the binary format (like a.out) will load.

The libc which was used then is not even around any more.

saagarjha · on Aug 11, 2020

Ah, I was talking about OpenBSD's kernel interface. Linux is ABI stable at the syscall level, of course–I know it tries to be binary compatible, but I have never actually tested this much.

trasz · on Aug 12, 2020

It's pretty common to run FreeBSD jails with old userspace (say, 8.0) on top of up to date FreeBSD kernel (say, 12.1). This works because to the stable syscall interface.

kyberias · on Aug 11, 2020

> Windows applications are increasingly executing system calls directly rather than going through the API

I didn't know this. What applications are doing this and why?

wtallis · on Aug 11, 2020

Anti-cheating libraries used in games seem to be the primary culprits, and they're doing it this way because their purpose is to break things while resisting the user's attempts to make things work nicely.

dataflow · on Aug 11, 2020

A bit off-topic, but do you happen to know if there is a way on Windows to intercept syscall instructions? (aside from dynamic analysis/recompilation/etc.?) If there isn't, I assume that's why cheat engines do this?

wtallis · on Aug 11, 2020

I think it's inherent to the SYSCALL instruction that control gets transferred directly to the kernel and ring 0, with no possibility of intercepting in userspace. So on Windows or Linux, you need kernel help to redirect/intercept syscalls.

dataflow · on Aug 11, 2020

I asked this as a Windows question rather than an x86 question. I was wondering if there's any Windows API that could help with this. (This includes kernel APIs.)

monocasa · on Aug 11, 2020

The PicoProcess model used by the first WSL worked by intercepting syscalls (among other mechanisms). It wasn't generally available for consumption by developers other than Microsoft, though.

The gnarlier antivirus will do a similar thing by just patching the IDT and syscall MSRs, but that's a bit frowned upon.

leeter · on Aug 11, 2020

Unfortunately WINE can't just take the windows shim mechanism that largely just nukes these from orbit on binary load and removes the check all together. Maybe at some point MS will be more open about that.

Andoryuuta · on Aug 11, 2020

To add to the mentions of anti-cheat software, there are also multiple binary packers/protectors, such as VMProtect[1], which use system calls directly to protect and/or obfuscate themselves.

[1]: https://lifeinhex.com/use-of-syscall-and-sysenter-in-vmprote...

ThrowawayR2 · on Aug 11, 2020

Based on the second post, seems to be anti-cheating code for games.

Ironically, this work might backfire in the long run if unscrupulous players are able to use cheats developed specifically to run under Linux.

asddubs · on Aug 11, 2020

windows may run games better, but linux lets you cheat better

badrabbit · on Aug 11, 2020

Interesting, I would also look into if the usemode-helper function can maybe used to re-map syscalls and proxy them through a wrapper library just like LD_PRELOAD but done at the kernel level after this, every process will load the wrapper lib (not just every glibc process) much like windows programs load user32/kernel32.dll.

Is this feasible or am I misunderstandig the feature?:

https://kernelnewbies.org/KernelProjects/usermode-helper-enh...

https://elixir.bootlin.com/linux/latest/source/security/Kcon...

cryptonector · on Aug 11, 2020

Oh, nice, so start an app that doesn't use libc (e.g., a golang app) using ptrace(2) to control it, inject code to intercept system calls, mark the app's text as requiring syscall dispatch, then stop ptracing. Presto: LD_PRELOAD for libc-non-using apps. Granted, to make this easy to use the interceptor should be able to use normal LD_PRELOAD objects, which means bootstrapping ld.so and libc, which is going to be very difficult to do.

monocasa · on Aug 11, 2020

This is how User Mode Linux and gvisor work more or less (albeit with the 'injected' code running in another process entirely). It's also how a kernel I'm writing works in a port that works like user mode linux (in addition to the ports on native hardware).

The issue is that ptrace is unreasonably slow (orders of magnitude slower than just regular syscalls), so they'd want the fast normal path to not go through ptrace. But they don't have a way to communicate that to the kernel at the moment.

cryptonector · on Aug 11, 2020

And this scheme gets you a fast path.

Ericson2314 · on Aug 11, 2020

If the firsts comment, https://lwn.net/Articles/824494/, is correct, and this can be made secure, I'm excited that this could mean a much more maintainable/easy-to-upstream version of Capsicum/CloudABI.

trasz · on Aug 11, 2020

Not really. Capsicum provides an actual security architecture (capability-based security), as opposed to a ad-hoc bandaid mechanism.

Ericson2314 · on Aug 11, 2020

Yeah to be clear I would rather see a proper implementation. From the LWN thread I thought Linux didn't have a personality system so this might be a stop-gap way to make a lighter weight implementation, but somebody commented that it in fact does (if barely), so that should be used instead.

saagarjha · on Aug 11, 2020

I don't see any reasonable protection you can have against TOCTOUs dealing with memory, since you could just always race for the small window between the check and the syscall being made.

jhallenworld · on Aug 11, 2020

Linux used to be able to run Xenix and SCO UNIX applications- I have not tried this in years, but I'm wondering if the mechanism can be shared such that any OS can be emulated.

rstuart4133 · on Aug 12, 2020

Linux still can: https://sourceforge.net/projects/ibcs-us/

That's my project. I also did the port of iBCS to amd64, which is what ibcs-us is based on.

I moved the support out of the kernel because the rate of change of the internet kernel API's makes maintaining invasive kernel modules like this is just far too much work. Every time I looked at the kernel the way syscalls worked internally changed in some way, and broke ibcs64 as a consequence.

Decades ago Linux had full on support for other syscall interfaces via it's personality mechanism. See personality(2). Most of the functionality has been ripped out as kernel internal API evolved. I'm guessing maintainers noticed the personality stuff wasn't used and so it was easier to delete it than port it to their new shiny API.

At the time I thought that would be a disaster for iBCS. Now after porting ibcs64 to user space, I think personality(2) was the wrong way to do it. Turns out iBCS is mostly glue that emulates foreign syscalls using Linux equivalents, and as ibcs-us demonstrates that kernel space glue runs equally well in user space. It should not be in the kernel, where it bloats things and could introduce security holes.

Once you realise the glue code can run anywhere, there is only one problematic area remaining: directing the foreign syscalls to your user space glue code. Here ibcs-us could really use some kernel could really help to make it both fast and safe, but it's a use case that apparently hasn't crossed the Linux kernel dev's minds.

This is exactly the same problem the Wine developers are having, and are trying to solve in these LWN articles. To me their solution looks like a horrible kludge. That isn't a criticism. I don't see how anything that tries to make minimal changes to the kernel is going to be anything other than a kludge.

If you want to do it properly, you need to add some mechanism to the kernel that lets user space redirect whatever syscall method is in use to a user space trampoline page. There are any number of syscall methods out there: int's, sysenter's, lcalls. The kernel has to provide a way of trapping each and redirecting them to the trampoline, but once they are there they are the emulator's problem. Mostly. The emulator then needs a second mechanism to make the real Linux kernel syscall that isn't redirected. One way to do that is to say "syscalls from within this address range aren't to be emulated". I’m guessing that’s what OpenBSD does now, but replace “aren’t to be emulated” with “are allowed”.

This could be thought of as a way of the kernel providing a way to virtualise it’s own syscall interface. There is no need to stop at one level either, as there there is no good reason one virtualised interface shouldn't end in another trampoline, unbeknownst to it. So a Xenix 286 emulator could be running inside of a Xenix 386 emulator running on Linux. It would be lovely for ibcs-us of course, but LD_PRELOAD's, virtual machines and yes even Wine all want to do the same thing, so it is a broadly applicable use case.

jhallenworld · on Aug 12, 2020

So I remember a big use case for SCO UNIX was small governments (like at the town level). Do you still see such users of iBCS?

Anyway, this seems like a very good idea. It provides a very general light-weight VM mechanism.

rstuart4133 · on Aug 12, 2020

> So I remember a big use case for SCO UNIX was small governments (like at the town level). Do you still see such users of iBCS?

I don't know who they are, which is the normal situation in the open source world. The few I've interacted with seem to use Informix. I, well the company I work for, is the only exception I've met to that.

tzs · on Aug 11, 2020

Were those 286 Xenix applications or 386 Xenix applications?

Running 286 Xenix applications on a 386 or later Unix or Unix-like system is fairly straightforward as far as the 386 kernel is concerned. 286 Xenix used a different syscall mechanism than 386 Unixes did, so you didn't have the problem of the kernel needing to deal with 286 and 386 syscalls ending up in the same place.

My recollection, from being half the two person team that implemented 286 Unix binary compatibility for System V Release 3.2 on 386 at ISC, and then being half the different two person team that did 286 Xenix binary compatibility, was that this is what was required in the kernel:

1. Adding a mechanism to allow a process to set up 16-bit segments as aliases to portions of its address space.

2. Modifying exec to recognize a 286 Unix or Xenix binary, and turn the exec into an exec of /bin/i286emul or /bin/x286emul, with the path to the 286 binary as an argument.

3. I don't remember if we had to do anything special for the 286 system call mechanism, because I don't actually remember what that mechanism was. I don't remember if what the binaries would try to do for a syscall would simply cause a trapable signal, and then the signal handler would recognize the system call and take care of it, of if the system call mechanism used something like a call gate that the 386 code had to set up, and so we had to add a mechanism to allowed the 386 code to do that.

That was almost all the kernel work. Pretty much everything else could be handled in user mode code. When you would run a 286 Xenix binary, that turned into an exec of /bin/x286emul. x286emul would read the Xenix binary, allocate memory for it and load it, map segments to it, and jump to the code, switching the processor to 16-bit mode.

When the Xenix code did a system call and it ended up in the handler in x286emul, for most system calls it was simply a matter of copying arguments from where the 286 call put them to where the corresponding 386 call expected them, making the 386 system call, then putting results in the right place, and returning to the Xenix code.

For some system calls, more work was needed. Signal handling, for instance. If the 286 code wanted to trap a signal, the 386 code would have to set a trap of its own, and its handler would then have to deal with the stack fiddling and such to deliver it to the 286 code.

Speaking of signal handling, that led to the stupidest, most annoying meeting I've ever had to attend.

ISC had done the official 386 port of System V Release 3 under contract from AT&T, and we did the 286 Unix binary compatibility as part of that. Later, there was a deal between AT&T and Sun and Microsoft to make Unixes more compatible, which included Xenix compatibilty, and we were doing the Xenix compatibility as part of that, under contract from Microsoft.

For the 286 binary compatibility, ISC did the whole thing--all the user mode code in i286emul and all the supporting kernel changes. But for Xenix compatibilty, ISC was just doing x286emul. Microsoft was doing any kernel modifications needed, which wasn't very much because the mods already there for i286emul mostly worked find for x286emul.

There was one difference between Xenix and Unix signal handling that we could not fully address in x286emul. We needed kernel support--some sort of per-process flag that would tell it the processes wanted signals to behave like Xenix signals. We asked Microsoft to add such a flag.

A bit later they got back to us and said there was issues with such a flag that could not be settled by email or phone. We had to have an in-person meeting. So me and the other guy working no x286emul had to fly early one morning from Los Angeles to Seattle, take a rental car to Redmond, and attend a meeting with Microsoft.

At the meeting, we introduced ourselves and the half dozen or so Microsoft people introduced themselves, and then they presented this issue that could not be handled by email or phone: should this flag be controlled by an ioctl, or should it be an optional new parameter for the signal() call [1]. We gave our opinion, that was the end of the meeting, and we headed for the airport.

[1] or something like that. I don't remember the exact second option.

jhallenworld · on Aug 11, 2020

Well 386, but 286 compatibility would be nice also. I'm still annoyed that AMD dropped virtual 8086 mode from x86_64- I liked it because Windows ran MS-DOS programs better than MS-DOS ever did and I would still use this facility if it was available in 64-bit operating systems. Virtual x86 mode in Linux was also pretty good, but the graphics modes were better supported in Windows.

a1369209993 · on Aug 11, 2020

To be fair, anything written against a actual 8086 can be brute-force software emulated in faster than real time at this point. It's no excuse, but it's less of a problem than it could be.

jhallenworld · on Aug 12, 2020

This is entirely true, but I can tell you that DOS OrCAD does not work as well as in DOSBoX as it does in Windows-XP. The issue is not performance, but a matter of the variable size video modes available in Windows and not on DOSBoX.

a1369209993 · on Aug 12, 2020

Oh, absolutely; I wasn't saying that a full-featured emulator actually exists, just that the performance difference was huge enough that you could get away with using one as a replacement for amd64's missing virtualization. Video mode support is a operating system interface problem rather than a hardware-can't-run-fast-enough problem.

lizknope · on Aug 11, 2020

https://en.wikipedia.org/wiki/Intel_Binary_Compatibility_Sta...

The Intel Binary Compatibility Standard (iBCS) is a standardized application binary interface (ABI) for Unix operating systems on Intel-386-compatible computers, published by AT&T, Intel and SCO in 1988, and updated in 1990.

jhallenworld · on Aug 11, 2020

Yes, also I remember that SCO UNIX worked but SCO Xenix did not work. But really I wanted Xenix to work.. I seem to remember it worked once, but then stopped working when they implemented the standard. Or something- it was a long time ago.

lizknope · on Aug 11, 2020

I had a friend get WordPerfect for SCO working under Linux. That was around 1996. I found these instructions.

https://www.tldp.org/HOWTO/WordPerfect-5.html

Same friend had a DEC Alpha and got Digital Unix Netscape to run under Linux on the Alpha.

http://users.bart.nl/~geerten/FAQ-17.html