Do we need a link step? (ocallahan.org)
94 points by hasheddan on Jan 1, 2022 | 54 comments



Virgil does not have a link step. It's a whole program compiler that does everything in one step. Hand all your Virgil source files to the compiler in one invocation, get out a fully optimized binary. The compiler is fast enough that trying to cache the result of a single unchanged file isn't necessary--probably more trouble than it's worth, except for huge programs.

Linking was designed when computers literally didn't have enough memory to load the IR of an entire program into memory at once in order to compile it, so programs had to be split into many files and compiled one file at a time. That was primarily for C, but as we know, C's very naive file-inclusion mechanism now results in huge translation units anyway (even worse in C++). It's antiquated and highly inefficient (O(n^2)).

Just for reference, the Virgil compiler, right now the largest Virgil program in existence, is about 43KLOC. It takes about 200MB of memory to parse, verify, transform, optimize, and generate code for it, and about 300 milliseconds of execution time, on one thread. Whole program compilation won't scale to millions of lines, perhaps, but hundreds of thousands are clearly within reach. And I'm OK with that.


Linking was also invented because sources to certain libraries were (and often still are) closed, not available at compile time.

Another point of linking is, well, linking together different source languages, often a high-level language and assembly, or a high-level language and C. In certain areas, libraries written in Fortran are important, too.

A complete language monoculture often is not attainable, even with languages as all-encompassing as C++, Rust, or Java.


Still, in some languages like Turbo Pascal, the linker was part of the compiler, so there was no separate linking step.


It was built-in and called automatically, but since version 4.0, if I remember correctly, linking was indeed a separate step. Units were each compiled into TPUs (or, later, DCUs) and then linked together, possibly with external OBJs.


You could do it both ways: there was tlink.exe for script-based builds, but a linker was also included in the compiler in the IDE. Then again, maybe I need to go have a look at those old manuals.


I did recall it correctly,

> Turbo Pascal's built-in linker automatically removes unused code and data when building an .EXE file.

Turbo Pascal 6.0 Programmers' Guide page 258.


> Linking was also invented because . . .

This is just plain wrong. Linking was, and still is, about assembling a final executable in a format and a way that accurately describes the data within it. For example, kernels tend to need very customized linker scripts in order to place all of the data in the correct spots.

The reason why linking can be hidden in most cases is because building an executable for a specific system is generally straightforward and well specified, needing little to no tweaking or opinions about how it's built. This is not always the case, and linking was certainly not "invented" to work around binaries with no sources...


Theoretically, the compiler can form a complete "statically-linked" executable, and even sidestep libc to use kernel APIs. Golang tends towards this. This may diminish if not eliminate the role of linking.

The case of VM-based languages is entirely outside the purview of the system linker, too.


> Theoretically, the compiler can form a complete "statically-linked" executable, and even sidestep libc to use kernel APIs.

This isn't theoretical, it's exactly how Virgil generates binaries for x86-darwin, x86-linux, and x86-64-linux. The output is just one binary and no C environment. No C libraries for anything. Instead, the Virgil compiler knows the kernel ABI for the target and library code (written in Virgil) implements a thin portable layer directly in terms of syscalls using raw pointers.
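
For anyone curious what "sidestepping libc" looks like in C/C++ terms, here's a minimal sketch for x86-64 Linux (not Virgil's actual library code, just the same idea: talk to the kernel ABI directly and let the compiler produce a self-contained binary):

    // build (assumption): g++ -static -nostdlib -ffreestanding nolibc.cpp -o nolibc
    // No libc, no CRT: the program talks to the x86-64 Linux kernel ABI directly.
    static long raw_syscall(long nr, long a0, long a1, long a2) {
        long ret;
        asm volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a0), "S"(a1), "d"(a2)
                     : "rcx", "r11", "memory");
        return ret;
    }

    extern "C" void _start() {
        const char msg[] = "hello without libc\n";
        raw_syscall(1, 1, (long)msg, sizeof msg - 1);   // write(1, msg, len)
        raw_syscall(60, 0, 0, 0);                       // exit(0)
    }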


Again, that's if the compiler is the only thing contributing object code to the final executable...


> Just for reference, the Virgil compiler, right now the largest Virgil program in existence, is about 43KLOC.

I have worked on projects where a single source file is as large as that, before the preprocessor has had a go at expanding includes.


Wow. What industry is this?



Not the parent, but miniaudio.h is a single C header containing both interface and (ifdef'd) implementation, which is 89k lines long. It horrifies me.


Do you mean this [1]?

[1] https://raw.githubusercontent.com/mackron/miniaudio/master/m...

Holy moley, yuck. It would truly suck to have a project that used that header in more than one file!


Yep, that. Many of the backends are ifdef'd away, but it's still a trainwreck to include an audio interface, mixer, and MP3/FLAC decoder implementation in a single header (you can disable the decoders but not the mixer).


You wouldn't, hopefully. You'd create your own application-specific set of functions where the implementation used this header and included it in exactly one translation unit.
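
The single-header pattern in question looks roughly like this (a sketch with a hypothetical header and macro name, not miniaudio's exact spelling):

    // single_header.h (sketch): declarations always, implementation only on request.
    //   void sh_play(const char* path);
    //   #ifdef SINGLE_HEADER_IMPLEMENTATION
    //   ...tens of thousands of lines of implementation...
    //   #endif

    // audio_impl.cpp -- the ONE translation unit that expands the implementation:
    #define SINGLE_HEADER_IMPLEMENTATION
    #include "single_header.h"

    // everywhere else: a plain #include "single_header.h", declarations only,
    // so the huge implementation is compiled exactly once.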


Sounds like a real failure of API design, TBH.


I work in audio and (so far) primarily use audio libraries to output sound straight to the speakers, rather than having the library perform mixing for me. In this case, you generally write a callback to generate sound, which can be called by the audio library (with the results sent to the speakers) or directly by unit tests (with the resulting audio analyzed in the tests). This callback can often be entirely unaware of which audio library is being used (this may not be possible if your callback needs to react to latency and overruns), and you pass a function pointer and a void* data pointer to the audio library.
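
In C++ terms, that arrangement is roughly the following (a sketch; the names and the registration call are made up, not any particular library's API):

    #include <cmath>

    // The audio library only sees a function pointer plus an opaque userdata pointer.
    using AudioCallback = void (*)(float* out, int frames, void* userdata);

    struct Synth { double phase = 0.0; };

    // Library-agnostic: the audio library can call this from its realtime thread,
    // or a unit test can call it directly and inspect the generated samples.
    void render(float* out, int frames, void* userdata) {
        auto* s = static_cast<Synth*>(userdata);
        for (int i = 0; i < frames; ++i)
            out[i] = 0.1f * static_cast<float>(std::sin(s->phase += 0.05));
    }

    // Registration would then be something like (hypothetical):
    //   lib_set_callback(&render, &my_synth);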

The only real loss of only referencing audio-library-specific types in 1 translation unit is that the class you write to wrap audio-library-specific types must be located behind a pointer (either forward-declare class LibraryType and embed unique_ptr<AudioLibraryType>, or put AudioLibraryType or unique_ptr<AudioLibraryType> in a TU-local subclass of a public interface).
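
Concretely, that "class behind a pointer" wrapper is the usual pimpl/forward-declaration trick (sketch; LibraryDevice stands in for whatever type the audio library actually exposes):

    // speaker.h -- public interface; no audio-library headers leak out of it.
    #include <memory>

    class LibraryDevice;   // forward declaration only; the full type is unknown here

    class Speaker {
    public:
        Speaker();
        ~Speaker();        // defined in speaker.cpp, where LibraryDevice is complete
        void start();
    private:
        std::unique_ptr<LibraryDevice> device_;
    };

    // speaker.cpp is the single translation unit that includes the audio library's
    // header (which defines LibraryDevice) and implements these members.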

Meanwhile Rust lacks the concept of translation units (instead each crate is a single rustc invocation which may sometimes be split into codegen units), and hides the presence/absence and location of compilation firewalls from the user. I don't actually know how to trigger or avoid cascading recompilations when using Rust's cpal audio library.


Not really, just a side effect of tons of code and the lack of an agreed-upon cross-platform specification. I've worked with audio programming before and it's an absolute mess - especially on Windows, which uses archaic and obtuse COM APIs.


Video games. It's often in the "glue" components of the code that this happens in my experience.


Is it possible to hand the Virgil compiler a mix of source and binaries, or does the compiler demand that your whole program exist in source form? I don't doubt that modern compilers running on modern computers can hold the IR of nontrivial programs in RAM at once, but that technical aspect isn't the only reason we build programs out of preexisting binaries: If you want your program to use the tuned algorithms in LAPACK, they're written in Fortran 90. Other libraries are written in tuned assembly language. And even that's ignoring the fact that, sometimes, source isn't available for legal or financial reasons.

https://en.wikipedia.org/wiki/LAPACK


> Is it possible to hand the Virgil compiler a mix of source and binaries

Not currently, no. I've thought about trying a completely transparent behind-the-scenes incremental compilation scheme, basically hiding custom-formatted object-file equivalents somewhere on disk (and/or in memory), checking source file modification times, and stitching them together like a linker would, but I did the cost/benefit analysis and since I am generally working on programs < 50KLOC in size, it's not a problem I have yet. (Relatedly, in my bug/debug/rebug cycle, aka development, I often run my programs in the compiler's built-in interpreter, which only parses and verifies the code, and then interprets IR directly, saving 90% of compilation time).

As for linking with other languages, I think it's considerably easier to have the Virgil compiler generate a ".o" [1] that is relocatable (i.e. linkable) and describe the imports/exports of a Virgil module in the Virgil language [2]. Then this ".o" could be linked in with the existing C/C++ ecosystem, rather than me trying to replicate linking in my compiler. The Virgil compiler really likes being able to nail down static (virtual) addresses; I suppose it can keep doing most of that if it doesn't have to link with too many other object files that constrain its placement choices.

[1] The Virgil compiler can generate either ELF32, ELF64, or MachO binaries, so it's incremental work to set a few flags on sections and work out how to emit relocation entries.

[2] I have a prototype of this for the Wasm target, so that Virgil programs can import unknown functions from modules, and the compiler generates a Wasm module with appropriate imports.


FWIW, in more than 15 years I've never worked [1] on a project that was less than 1M lines. Sometimes one order of magnitude more. Linking exists because whole program compilation doesn't scale well.

[1] well, except for personal projects, but that's not work.


Out of curiosity, what sorts of projects have 10M lines?


I hate using this website[1] as a source, but here are some estimates:

    KDE: 57.9M LOC, 19,402 years of effort
    NetBSD: 50M LOC, 16,822 years of effort
    Linux kernel: 36.6M LOC, 12,369 years of effort
    Google Chrome: 25M LOC, 8,131 years of effort
    Mozilla Firefox: 20.5M LOC, 6,564 years of effort
    FreeBSD: 17.2M LOC, 5,538 years of effort
    GNOME: 15.7M LOC, 4,946 years of effort
    LibreOffice: 9.5M LOC, 2,907 years of effort
    Qt 5: 8.2M LOC, 2,525 years of effort
    OpenJDK: 8.4M LOC, 2,583 years of effort
    IntelliJ IDEA Community: 6.1M LOC, 1,850 years of effort
    GNU Compiler Collection: 7.5M LOC, 2,327 years of effort
    Blender 3D: 2.1M LOC, 612 years of effort
    PostgreSQL: 2.0M LOC, 596 years of effort
    TensorFlow: 2.5M LOC, 724 years of effort
And that is just counting open source/free software. It's not hard to imagine there are plenty of projects out there that have existed for more than 10 years with more than 10M LOC.

- [1] https://www.quora.com/What-are-some-open-source-projects-wit...


The vast majority of these programs are not ordinarily linked in a single DSO though; KDE and the Linux kernel for example are well-known for containing very large numbers of DSOs. I am also skeptical of this methodology: checking the first project, KDE, on the linked source, Open Hub, it says there are 19 million lines of code, not 58. Of those, only 63% are in C++, with 16% in XML, 6% in PHP (?), etc, which would presumably not count towards a linking cap.


No, they have multiple binaries. But they share a lot of core code. Compiling each one separately from scratch would bloat the binary size even more and be pointlessly slower.


The Linux kernel is well north of that. A UE4-based game would be similar in scale to 10M LOC (the engine is 5M or so, plus third-party code, and in my experience the games are often as big as the engine). I'd wager that many, many closed-source projects are far larger than 10M LOC.


I am confused: why is the alternative to linking not of the same complexity order? Can you elaborate?


Sure, the compiler only processes the source of each file once, the IR of each function/method/class once[1] (modulo polymorphic specialization, similar to template expansion), and generates machine code just once. It resolves all addresses just once. Additionally, the Virgil compiler does a reachability analysis fairly early in the compilation pipeline, so code that is not reachable from "main()" or the other entrypoints of the program is pruned away. So huge libraries will only be parsed and verified, not translated or optimized, when building a binary for a given program. Parsing and verification clock in at about 500K-1M LOC/s on a single thread, so having a larger library might only add milliseconds to compilation time.

Linking in C/C++ is pretty efficient in that it is mostly about loading symbol tables and then byte-copying sections together, and then applying relocations. The non-linear part of C/C++ is mostly all in the front end when it must reparse the same header files dozens or hundreds of times over, each time building an AST, verifying the AST, rebuilding the IR, and even repeatedly generating code for the same functions/classes. The linker generally deduplicates functions, sometimes dangerously, like taking the first definition of a given symbol it sees.

[1] Of course, the entire compilation process does involve visiting and transforming the IR in multiple different ways in different passes. The IR is also indexed by different keys, e.g. by class or by method or by type or by signature, so not all passes visit the entire IR. And of course the optimizer might decide to run different passes on a given method based on its characteristics, like whether it has a loop or is very branchy, etc.
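
To make the duplicated-work point above concrete in C/C++ terms (a small sketch):

    // util.h -- included by many .cpp files
    #pragma once
    inline int square(int x) { return x * x; }

    // a.cpp and b.cpp both #include "util.h" and call square(). Each translation
    // unit re-parses the header and emits its own (weak/COMDAT) copy of square();
    // the linker then keeps one copy and throws the rest away -- work done
    // repeatedly per TU, then deduplicated at the end.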


> didn't have enough memory to load the IR of an entire program into memory at once

Still true of C++


Sounds like "hide it from me" rather than "eliminate it". "Do I need to think about the link step?" to which the answer is "rarely if ever" now. Covering it with "make it easy for me" plating would make it harder or impossible for those cases where you do need to get into the guts of things.

I think we need more linkers, and more thoughts like this on why that step is still necessary and how we could do it better. Look at the "all static" docker images and how the Lua FFI was "faster than C"; these are effects of linking and shared libraries being Deep Magic that no one wants to think about; thus we get efforts to work around the problem.

Maybe every executable should do its own final assembly of memory image and fixups and stuff at execution? Maybe we should be executing memory dumps that include zero'd pages already... maybe we need one or the other depending on the circumstances, and no "one size fits" all solution is suitable.


> Maybe every executable should do its own final assembly of memory image and fixups and stuff at execution?

Sounds a lot like dynamic linking, though that does have an extra step of indirection that can result in performance issues, either directly due to the pointer hopping or indirectly due to the inability to inline a dynamic function.

> Maybe we should be executing memory dumps that include zero'd pages already

For global variables, C/C++ include this already. That's why the initialization of `int x;` depends on whether it's a global variable or not. For newly allocated memory, whether on the stack or on the heap, initializing to a default value that is later overwritten would have a runtime cost. (e.g. If it is immediately passed as an out parameter to a function.) However, for global memory the initial value of `int x;` can be set in the program image with no runtime cost, so it is done automatically.
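
A tiny illustration of that distinction (standard C/C++ behavior):

    int g;          // global: zero-initialized "for free" -- it lands in .bss, which
                    // the loader maps as already-zeroed pages, so no runtime store.

    void f() {
        int x;      // local: indeterminate; zeroing it would cost work on every call.
        int y = 0;  // explicit initialization is an actual runtime store.
        (void)x; (void)y;
    }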


>release the object file's space in the final executable, and allocate new space for the new object file.

Except, the size of the object file will inevitably change. So you are stuck reimplementing a mini-filesystem with logic for allocating/freeing blocks within your final executable, or loading the entire old version in memory, and then producing the new one from scratch. Or, you just keep things in separate files, let the OS handle it for you and go sip some champagne on a New Year's Eve...

Really, C/C++ build logic sucks in many ways, but adding an extra level of complexity to the linking process won't make it better. If you want something cool, check out C++20 modules [0] instead. They will make compilation dramatically faster when used properly.

[0] https://en.cppreference.com/w/cpp/language/modules
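
For reference, a minimal C++20 modules sketch (standard syntax; build commands differ a lot between compilers, so treat the file layout as an assumption):

    // math.cppm -- module interface unit, compiled once and reused by importers
    export module math;
    export int add(int a, int b) { return a + b; }

    // main.cpp
    import math;
    int main() { return add(2, 3) == 5 ? 0 : 1; }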


> Except, the size of the object file will inevitably change. So you are stuck reimplementing a mini-filesystem with logic for allocating/freeing blocks within your final executable,

It would be more like malloc/free than a mini-filesystem.

I don't think this is likely to be a big problem because the sizes of object files will often not change very much, we know how to implement malloc/free with reasonably low fragmentation, and it doesn't matter much if there is some wasted space in the final executable, when the goal is "we're going to build an executable that runs "pretty fast", for testing purposes".


The suggested approach of generating stubs is exactly how late binding in the SICL Common Lisp implementation works, as described in [0]. The scheme also allows generating specialised entry points to avoid keyword argument parsing, and the stub/snippet code can also be used for inline caching for generic dispatch.

[0] http://metamodular.com/SICL/call-site-optimization.pdf


OpenBSD does something similar but in a reverse manner. It's called Kernel Address Randomized Link (KARL).

What it does is bundle kernel object files into an archive, which is linked at boot time to form the final kernel image. Note that the link order is randomized each time the kernel boots, since KARL is a security mitigation designed to thwart exploitation attempts.


Most common lisp implementations lack a link step. Compilation is incremental, and function definitions are stored in memory.

If an executable is desired, you just dump the contents of memory to a file.


Though one can compile a file, which one later can load into memory.

Some implementations can't dump memory: ABCL (compiles to JVM), ECL (compiles to C), ...


> Here's an option for function symbols: when A is used for the first time, write a stub for A to the final binary and call that. When a definition for A is seen, patch the stub to jump to the definition

Sorta like the trampolines a linker will insert if it can't reach the code.

Symbol resolution has lots of complicated rules. But I suppose if we are designing a new system we aren't bound to those.

I would suggest that, relative to this new link-less compilation, the baseline link behavior should be considered like a static optimization that trades off static build time against code size and execution performance.
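
One crude way to picture the stub-then-patch idea (and PLT-style lazy binding in general) with plain function pointers (a sketch, not how a real linker implements it):

    #include <cstdio>

    using Fn = void (*)();

    void real_A();                      // the "definition seen later"
    void resolve_A();                   // one-shot resolver behind the stub
    Fn A_slot = &resolve_A;             // the stub: call sites only know this slot

    void call_A() { A_slot(); }         // every call to A goes through the slot

    void resolve_A() {
        A_slot = &real_A;               // "patch the stub to jump to the definition"
        real_A();
    }

    void real_A() { std::puts("A resolved"); }

    int main() { call_A(); call_A(); }  // first call resolves; later calls go direct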


ELF does some of this already - relocations can be resolved at load time when the compiler emits relocations intended to leverage the GOT & PLT. Presumably Mach-O and COFF have an analogous feature.

This is how ELF shared object external references get resolved.

> Update Zig can do something like this.

The mechanism zig uses is indeed leveraging the GOT.


GraalVM's Native Image compiler is an example of a modern compiler that doesn't use a link step, and does keep the entire IR of the program in memory at once in order to do whole-universe analysis. It does take quite a bit of memory - 9 GB to compile a Ruby interpreter for example.


The question is whether we would really get any benefit from this - compilation would take longer by some amount x, which may or may not be less than the time the linking step currently takes.


That's a question worth considering. But fundamentally, it should be faster to write compiler output directly to the final executable than to write it to an object file that is then copied into the final executable.


Is the slow part of linking the "copy all the bytes into the executable" step (in which case avoiding separate-link is a clear win, saving a copy), or is it the "do all the relocations" work, which I think needs to be done anyway ?


I put my question to a friend of mine who works on linkers, and his take was that for a single threaded linker like ld.bfd the copy-bytes part would probably dominate, but that for a multithreaded linker like lld that part trivially parallelizes and so the slow part tends to be elsewhere. He also pointed me at a recent blogpost by the lld maintainer on this topic: https://maskray.me/blog/2021-12-19-why-isnt-ld.lld-faster which I should go and read...


The approach I suggested would do away not only with the intermediate object files but also the "do all the relocations" work.


In Oberon, object files are executable files. Every object file can have subcommands, like the git command has init, add, commit, etc., but it also contains the library functions (like libgit has) in that same object file.

If an object file contains calls to other object files, they get linked during load time, linking is not considered as a separate step.


I think the closest we got to that model is Windows with the COM/.NET interop provided by PowerShell.

Not that other platforms couldn't have it; they just can't decide on a whole-platform development experience.

As for Oberon, it is a bit sad that most compilers developed outside of the project focused on static linking, without that development experience.


This is how things work in Java, or Python, or JavaScript.


How do you make sure you delivered all object files needed in deployment? Missing object files will lead only to runtime exceptions, correct?


Yes that is correct.


> reuse the final executable and keep around its symbol hashtable; when an object file changes, release the object file's space in the final executable, and allocate new space for the new object file.

This sounds unnecessarily complicated



