JVM Anatomy Quarks: Compressed References (shipilev.net)
141 points by luu on March 6, 2019 | 34 comments



Data layout optimizations are a really neglected part of compiler engineering. Performance bottlenecks are now in caching and moving data through the memory hierarchy, rather than ALU throughput. So a compact data representation lets you fit more items in the same amount of cache and move more items with the same amount of bandwidth.

Compressed references are one of the most basic optimizations, and they are routinely done by hand when programmers do performance work. Others include storing data in columnar form (aka the AoS->SoA transformation), altering array layout to be cache-friendly (tiling/swizzling), moving rarely accessed members out of line, and field shrinking based on dynamic range analysis.
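To make the AoS->SoA idea concrete, here is a minimal Java sketch (class and field names are made up): instead of an array of heap objects, each field gets its own primitive array, so a scan over one field touches contiguous memory.

    // AoS: an array of references to scattered heap objects;
    // scanning just x drags whole objects (plus headers) through cache.
    final class ParticleAoS { float x, y, vx, vy; }

    // SoA: one primitive array per field; a scan over x is contiguous.
    final class ParticlesSoA {
        final float[] x, y, vx, vy;
        ParticlesSoA(int n) {
            x = new float[n]; y = new float[n];
            vx = new float[n]; vy = new float[n];
        }
        void step(float dt) {
            for (int i = 0; i < x.length; i++) {
                x[i] += vx[i] * dt;  // sequential, cache- and SIMD-friendly
                y[i] += vy[i] * dt;
            }
        }
    }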

The history of "compressed references", afaict, is reactive on the JVM: people were shocked at memory usage increases when moving from 32-bit to 64-bit machines. So it came about semi-accidentally and is not really indicative of systematic work on data layout optimizations in the JVM...

There is also a term of art in CS for economical use of memory: https://en.wikipedia.org/wiki/Succinct_data_structure and a second one for shrinking data structures based on content entropy: https://en.wikipedia.org/wiki/Compressed_data_structure - this literature deals with more advanced encoding tricks that are rather far from what a compiler could realistically do on its own, and it often makes big tradeoffs, such as making the structure read-only.


Rust does quite a bit of layout optimization[0]. It can do so because its internal ABI, including struct layouts, is unspecified, so the compiler is free to change layouts while still offering the C ABI for interop. They're also keeping their options open for PGO-based layouts.

Of course, those are still compile-time decisions, so compressed pointers are out, since they only work if you restrict yourself to a heap of ~32GB, depending on alignment.

[0] https://github.com/rust-lang/rust/pull/45225


Swift does layout optimization as well, since it similarly has not pinned down module stability.


Can you point out some layout optimizations that are done? Sadly, I could not make out much from your link.


Look at the "Size optimizations implemented so far" section. It's mostly about packing enum discriminants into unused bits and bytes.

Beyond that, Rust has always done field reordering to minimize waste from alignment padding.

Other optimizations are still being planned.


> Data layout optimizations are a really neglected part of compiler engineering.

From what I understand, this is mostly impossible in other popular languages because the compiler just puts the fields of your structs in the order you wrote them in. (You wanted this, right? Here ya go!) JIT'd languages don't have this problem because they can rewrite layouts however they want, based on real-time profiling. AoT languages either have to run first and then compile with the profiling information to have any hope of doing this, or else the author has to break out perf and VTune and do it manually. Both are difficult and time-consuming, which is why it doesn't happen much in practice.

A minor example of this is Protobuf, where the IDL translates to the same field layout in C++. Could the fields be packed into a more cache-friendly layout? Sure, but only on a case-by-case, application-by-application basis. It doesn't work in general, even though proto authors don't care about the precise layout.
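As an aside on the Java case: HotSpot does repack fields at class-load time rather than keeping declaration order, and you can inspect the layout it actually chose with the article author's JOL tool (a sketch; it assumes org.openjdk.jol:jol-core is on the classpath):

    import org.openjdk.jol.info.ClassLayout;

    public class LayoutDemo {
        static class Mixed { byte a; long b; byte c; int d; }

        public static void main(String[] args) {
            // Prints field offsets, alignment gaps, and header size;
            // the order in the printout need not match declaration order.
            System.out.println(ClassLayout.parseClass(Mixed.class).toPrintable());
        }
    }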


JIT'd languages don't help you here. You still need the data layout of a given class to be constant; otherwise, field lookups can't just be a load at an offset. And you can't specialize this on a per-method basis, which is the level JITs tend to work at. You'd have to do it via whole-program optimization, which JITs by and large don't ever come close to doing.

Also, the problem you describe (ABI stability) is orthogonal to whether it's JIT or AOT. Nothing says an AOT language needs to have a defined data layout (see: Rust), and nothing says a JIT'd language is free of that constraint (see the .NET CLR, which defines struct layout as sequential by default for interop with unmanaged code).

It's easier to do data layout optimization in AoT'd languages as a result. Much, much easier. And, in fact, some JIT'd languages have been converted to AOT specifically to do this, such as Unity's HPC#.

EDIT: Also you need value types for these layout optimizations to make any sense at all - which Java doesn't even have.


Yes, these are separate things that don't depend on each other: the language's semantic support for compact layout, the application profiling, and the hot-swapping of implementations. And none of them is really bound to JIT or AOT.

I agree that whole-program optimization is really what you want here. There are similarities to GC, since that also swaps out data from under the program. There are lots of interesting dynamic whole-program optimizations you could imagine a JIT doing, based on emergent program behaviour patterns that might vary from run to run.

Trading off direct field access is one possible strategy; it's just a couple of cheap conditional ALU instructions, after all. Another would be to compile alternate versions of the hot code path in question.


> JIT'd languages don't have this problem because they can rewrite the layouts however they want and based on real time profiling.

Any examples of this?

> AoT languages either have to run first and compile with the profiling information to have hope of doing this...

What prevents one from JITting AoT compiled binaries?


> Any examples of this?

Yes, Java and .NET.

Some JVM implementations, e.g. IBM J9, can save profile information between runs and use it for speeding up code generation, based on the complete history of the executable.

Google also started doing this when they rebooted ART into a runtime mixing an interpreter written in assembly, a JIT, and AOT compilation. Starting with Android P, they even upload profile information back to the store, so that similar devices can get optimized AOT-compiled code faster.

.NET Framework allows some control over saving PGO data, and variants like the WinRT/UWP cloud compiler take it further.

> What prevents one from JITting AoT compiled binaries?

It's relatively complex, versus just starting from bytecode.

Hence mainframes have always taken the approach of either JITting at installation time or using microcoded CPUs, as in the Burroughs, Xerox PARC, and IBM cases.


The IBM J9 JVM was open-sourced and now lives on as the Eclipse OpenJ9 project [1]. We rely on interpreter profiling (block frequency and value profiling) before the JIT gets its hands on the method. Once the JIT has determined the cost/benefit is right for a compile, we run an extended basic-block ordering optimization, which lays out the code by block frequency, calculated from the taken vs. not-taken profiling information on the branch bytecodes given to us by the interpreter.

The persistence between runs you mention is our AOT capability, by which we can cache JIT method compilations on disk and load them across different JVM invocations to greatly speed up startup performance. There is a recent series of blog posts on the AOT technology in OpenJ9 at [2].

[1] https://github.com/eclipse/openj9

[2] https://blog.openj9.org/category/jit/aot/


It'd be great if IBM and Google could contribute that to OpenJDK. Would that collide with Project Panama [1], currently under way? (Or maybe that's a goal of the project?) Also, wasn't that the goal of Azul's ObjectLayout [2]?

If that could make it into the developer's IDE somehow, it'd be awesome. As in: the runtime detects which fields suffer from cache misses, tries a different layout, gathers data, and it all shows up as fields painted shades of blue/red in the IDE. It is one thing to have an optimized JPA object layout, but it is another to only have flat, relevant data in the first place. Sort of like a database pushdown optimization through the developer, if you will.

[1] http://openjdk.java.net/projects/panama/

[2] https://www.azul.com/presentation/objectlayout/


I think you missed that the question was about optimizing data layouts. Any references on that?


I think they do this on the JS side: http://www.jayconrod.com/posts/52/a-tour-of-v8-object-repres...

For example, an array can be specialized to hold just one datatype, or just a couple of fields, based on how it seems to be used, and then de-optimized later if the assumption is broken.


> field shrinking based on dynamic range analysis

Could you expand a bit on this? I can't find any resource about it; the words are overloaded.


Range analysis is a compiler term for exploiting the known range of values that get placed in a field. In a JIT'd language¹ this can be done even when you can't absolutely prove the range, by trapping and deoptimizing the structure if the compiler's range assumption turns out to fail, similar to how JS JITs do specialization and fall back to deopt.

As a concrete example, a tree node reference could be a 16-bit index into a backing table of real tree nodes, similar to how indexed vertex data is represented in graphics code.

¹ You could do it in an AOT language too, but it's easier to think about our common JIT languages as the example case.
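For a concrete (hypothetical) Java sketch of that backing-table idea: node links become 16-bit indices into parallel arrays instead of full references, so each link costs 2 bytes rather than 4 or 8.

    // All names hypothetical; works for trees of up to 65535 nodes.
    final class CompactTree {
        static final char NIL = 0xFFFF;       // sentinel "null" index
        final int[]  value = new int[65535];  // node payloads
        final char[] left  = new char[65535]; // 2-byte child "pointers"
        final char[] right = new char[65535];
        int size = 0;

        char alloc(int v) {                   // returns a node handle
            int i = size++;
            value[i] = v;
            left[i] = NIL;
            right[i] = NIL;
            return (char) i;
        }
    }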


Many thanks.

In your example, you mean the indices are shrunk when you know the size of the table? I knew about specialization and deopt, but never suspected the principle behind vertex indices on the GPU could be applied here.


It's a problem specific to Java, because it uses heap objects for almost everything instead of structs.


The 'heap' in Java is an abstract concept. The JVM doesn't literally need to use heap objects for everything.


In Java, a new doesn't always map to a heap allocation.

It depends on which JVM one is talking about and what JIT is being used.


If you're talking about stack allocation, memory usage will be the same. Stack allocation reduces GC pressure just a little bit, but if your software uses 25GB of RAM, it won't help. There's no JVM optimization that I'm aware of to inline one object inside another.


> If you're talking about stack allocation, memory usage will be the same

The JVM never literally 'stack allocates'. It scalar-replaces objects: object header words, klass pointers, and unused fields disappear, and the remaining fields become just dataflow edges, so they may be folded away, or may end up in registers or on the stack. Additionally, objects which are 'stack allocated' do not need to be aligned as on the heap and do not use GC headroom. All of this does mean a real reduction in memory consumption, but it's likely extremely small in practice.
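For illustration, the classic shape that HotSpot's escape analysis can scalar-replace looks like this (a sketch; whether it actually fires depends on inlining and JIT heuristics):

    final class Geometry {
        static final class Point {
            final int x, y;
            Point(int x, int y) { this.x = x; this.y = y; }
        }

        static int lengthSquared(int a, int b) {
            // p never escapes this method, so after inlining the JIT can
            // scalar-replace it: no header, no klass pointer, no allocation;
            // p.x and p.y become plain values in registers or on the stack.
            Point p = new Point(a, b);
            return p.x * p.x + p.y * p.y;
        }
    }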


The short version is that you are often better off running multiple instances of a JVM application with a heap <32GB when you have the choice between scaling up and scaling out.


That depends on your memory usage. There is an awkward range (32GB-48GB) where the increase in heap size does not give you a proportional increase in available memory, because of the doubling of references. That is where you're better off running multiple instances. But for heaps bigger than ~48GB, running a single JVM is more efficient.
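Rough back-of-envelope for where that break-even sits (the reference fraction here is an assumption, not a measurement):

    // If a fraction r of heap bytes are reference fields, dropping
    // compressed oops doubles that portion, so a compressed heap of
    // C GB holds roughly as much as an uncompressed heap of C*(1+r) GB.
    public class BreakEven {
        public static void main(String[] args) {
            double r = 0.4;            // assumed share of heap that is references
            double compressed = 32.0;  // GB, the compressed-oops ceiling
            double breakEven = compressed * (1 + r);
            System.out.printf("break-even heap ~ %.0f GB%n", breakEven); // ~45 GB
        }
    }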


> The short version is that you are often better off running multiple instances of a JVM application with a heap <32GB when you have the choice between scaling up and scaling out.

We're in an awkward time right now, where many servers have more than 32GB of RAM, but not that much more. Once it becomes routine for servers to have 1TB of memory, this will no longer be much of an issue.


I wonder if using 16 bytes as the alignment would help in those corner cases. Some objects would use up to 33% more space (24 bytes -> 32 bytes), but I think for most objects the increase would not be very noticeable, and one extra bit would allow using up to 64GB of RAM with compressed pointers.
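Schematically, the decode is just a shifted add, so widening the shift by one bit doubles the addressable range (pseudocode for the idea, not a real JVM API):

    // compressed oop -> raw address: base + (oop << shift)
    //   shift = 3 (8-byte alignment):  2^32 * 8  bytes = 32 GB addressable
    //   shift = 4 (16-byte alignment): 2^32 * 16 bytes = 64 GB addressable
    final class OopDecode {
        static long decode(int oop, long heapBase, int shift) {
            return heapBase + ((oop & 0xFFFF_FFFFL) << shift);
        }
    }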


Depends on the use case. Sometimes compressed oops are counterproductive, and references really aren't a significant piece of overhead if you are, say... doing matrix multiplication with very large arrays.


> Once the heap size gets larger than the threshold under which compressed references are working, a surprising thing happens: references suddenly become uncompressed

Am I correct in assuming that this is made possible because HotSpot can re-JIT? So the GC/allocator just warns the JITter, "hey, all of your oops [flags] code is no longer valid," resulting in a re-JIT of everything? Or is there a prologue in every method that checks the oops flags (and triggers a re-JIT)?


No, the compressed-oops mode doesn't change at runtime; that's pretty hard to pull off. You'd need to change object layout dynamically, and unless you want to reformat the whole heap, you'd need tags or regions with different pointer rules. Any of these techniques would have a very substantial runtime overhead.


Nope, CompressedOops is an old technique that depends on the max heap: less than 32GB on 64-bit -> CompressedOops.


"The name underlines the fact that the single post is not enough to form reasonable matter"

What does that bit of legalese mean?


That is an exceedingly awkward sentence --- either due to non-native influence or an attempt to sound fancy. It's really saying something like "the articles refer to each other so you might not understand everything from reading just one."


It's a reference to physical quarks.


I think they're just making a joke, alluding to the fact that single articles here won't tell you a lot about the JVM.



