Is parallel programming hard, and, if so, what can you do about it? (paulmck.livejournal.com)
135 points by matt_d on Jan 13, 2020 | 85 comments



I feel like every commenter here simply read the title and didn't even skim the book. This is a book about extreme low-level optimizations in concurrency. It's a poor title for this kind of book, since it makes it sound like a beginner book, but it's quite the opposite.

It's arguing it's hard to implement the structures and methods used by concurrency APIs, not that it's hard to use them.


It doesn't help that the link goes to a page that links to another page that links to the book, and the first page doesn't make it clear that it's talking about a book.


is there an HTML version?


I believe there's only a PDF version, and the author encourages you to print it. It's a monumental effort to make that kind of book, so it's pretty amazing he released it free.


It's hard to find problems that benefit from parallelism, because most work has sequential elements. Due to Amdahl's law, it's hard to realize significant parallel gains unless the problem is of a special or constrained type.


This seems unnecessarily black and white: in the real world, most parallel solutions have a shared-state/sequential part and a parallel part. Just because a solution has some sequential elements doesn't mean it doesn't benefit from parallelism.


The thing I finally realized after being big on parallel algorithms and such since first learning about them in university is that, honestly, most things don't need to be cleverly parallel.

The majority of tasks are "embarrassingly parallel" as they say. Which is just to mean, there are a lot of standalone tasks that need to be done. For example, any web server is embarrassingly parallel. You don't need to be clever, you just need to have the processing power available to handle them.


Yeah, to be honest, parallelism is great for number crunching.

I used GPU.js for some frontend crunching and it turns out even a builtin GPU (integrated graphics) can get a small upgrade out of moving some data into a more parallel format.


I've only looked at the readme for gpu.js so far, but from that I guess it's doing some sort of transformation from the subset of JS it supports into GLSL, and then running that as a shader in a WebGL context. Presumably there's a not-insignificant overhead both in starting it up and in getting data back from the gpu.js function. Is there a way to "compile" a gpu.js function to remove the translation step? That could be really useful.


Hi! I maintain GPU.js.

>I guess it's doing some sort of transformation

That is exactly correct. From JavaScript to an abstract syntax tree, and then to whatever target language we need. In this case, GLSL.

>Presumably there's a not-significant overhead in both starting it up

That has been the goal, we're really careful to measure it with releases.

>and getting data back from the gpu.js function

That is correct as well; it is about as costly as it would be without GPU.js. However, you can enable pipeline on kernels so they can communicate via textures, which is significantly faster, as we don't need to transfer back to the CPU.

>Is there a way to "compile" a gpu.js function to remove the translation step? That could be really useful.

It is built into every kernel. You simply call `kernel.toString(...args)`.


Parallel programming is not hard per se, it's just that there's so much more to learn (compared to "sequential programming"): threads, processes, IPC, messaging, actors, queues, synchronization primitives (mutexes, semaphores, monitors, condition variables, critical sections, atomics... you name it) along with their gotchas, concurrent containers with their limitations, transactions, distributed computing with its performance implications, new design patterns (e.g. async/await, promises and futures). But hey, programming has never been thought of as easy, right? Yet, speaking of having it hard, it may be enlightening to compare software development with having to design a piece of (digital) hardware, where not only has much always been happening "in parallel", but one must also keep taking into account the physical characteristics of various parts of the circuit.


> Parallel programming is not hard per se, it's just that there's so much more to learn

Parallel programming isn't the hardest part (it is hard, but good teachers help with good abstractions over problems); it's the parallel debugging that is, or at least the second pass over the same codebase.

Being able to write decent parallel code is often easier than debugging a system that behaves inconsistently, with no ability to set up the timing bug the right way around.

It's hard to tell the difference between "I fixed the bug" and "I massively lowered the probability of hitting it."

I spent 7 months chasing an issue that happened on average once in a billion requests (or rather, once every 2-3 days across a web farm). The fix was easy, but debugging it drove me insane slowly over a whole summer.

And after that, there's the efficiency lost to parallelization, which needs enough parallel tracks to actually be faster; more threads are often slower at low thread counts.

As an anecdote, there were enough servers at Yahoo that had SMT turned off because the extra locking introduced by running 2 threads on a core ate up the extra CPU those threads provided.


> Parallel programming isn't the hardest part [...] it is the parallel debugging that is - or at least the second pass over the same codebase.

We need better IDE parallel thread debugging tools. I don’t like how VS won’t let you open a separate Autos/Locals window for each thread, or easily step each thread - setting a breakpoint in a function currently executing in multiple threads is a usability nightmare.


This gives me two questions:

- Did you have a method/tool to debug (from a book, or from chatting with others)?

- Are there still CPU designs dedicated to parallel workloads? (I assume the usual desktop Intel/AMD, no matter how brilliant, may not be without hiccups at high parallelism.)

Oh, one last thing: do you use dedicated compilers for parallelization, or is it something "mainstream" like OpenMP?


I can greatly recommend Mozilla's rr for hunting concurrency bugs, as it allows efficient replay and provides a scheduler that is designed to increase the probability of triggering concurrency bugs.

The issue with x86 is simple: the memory model pretty much requires loads of broadcasts on the SMP interconnect. IIRC Power9 is nice due to (1) a memory model that doesn't require this (much) broadcasting on the interconnect, and (2) remote atomics like "fetch the current 64-bit value pointed to by register1 and atomically add register2 afterwards", which prevent the cache line from bouncing between cores or even sockets. (Power9 scales to about 5-10 kW TDP at 8-20 W TDP per core with 4 threads per core; there are versions that combine core pairs to get 8 threads using the resources of (almost) 2 cores, with the usual overhead and diminishing returns.)

I can't really recommend OpenMP. If the architectural restrictions aren't a problem, you might want to check out Intel's TBB, which uses C++ templates instead of compiler pragmas. Unfortunately you'd ideally need multithreading-aware polyhedral optimization, and _that_ mostly stalled AFAIK.

You'd likely write C++ or now maybe Rust with low-level primitives / intrinsics to build libraries. I don't know of any actually good "automatic" frameworks for general shared-memory parallel programming (There's Chapel, but it seems fairly niche. Still going strong though, by the looks of it.). If you can efficiently express the problem with loop-dataflow, you might want to try timely-dataflow (Rust, does thread/process/node parallelism), as it nicely integrates with normal Rust code.


1) Why would the x86 memory model require more broadcasts?

and

2) I can't find anything about Power9 remote atomics outside of GPU memory. Any pointers?


There are ancient design decisions regarding exclusive locks. Early on, while the interconnect was still a bus, it just used a signal for "locked" (~a mutex); later this physical broadcast had to go because the interconnect switched to a fabric of point-to-point SerDes links.

The remote atomics are a GPU-independent feature on the SMP interconnect. The GPU thing is mostly icing on the cake, by exposing this feature to NVLink GPUs (iirc. Volta+, maybe Pascal+).


Right, so the broadcast is no more (since the original Pentium Pro, IIRC). So why again does x86 require more expensive broadcasts?


> did you have a method/tool to debug (from book, or chatting with others)

Specifically for this issue, I ended up forking valgrind to add multi-threaded watchpoints in code (this was debugging shared memory refcounts, so I had to extend it to log negative refcounts only).

The origin of those patches is here - https://valgrind.org/downloads/variants.html?rjw


impressive :)

great job


> Parallel programming is not hard per se

I'm not sure how you can say that.

Programming is a fairly hard activity, but the general consensus is that by adding parallelism to any given programming task you make that task much, much harder. I don't think you'll find any version of parallel programming that isn't hard.

The situation of parallelism (your commands happening in an unpredictable order, not synchronously) inherently, by itself, per se, makes thinking about a parallel program harder than thinking about a non-parallel one.

Edit: Parallel programming is hard enough that the advised method for most such programs is one or another standard pattern, which is infinitely easier than rolling your own plan but still contains serious gotchas as your program's complexity increases.


I have to second that. Parallel programming is extremely hard to get right and people persistently underestimate it. For example, I've seen many times code that erroneously assumed that values can be read from global memory concurrently if only write access is guarded.

Of course, it depends a lot on what your programming language has to offer. For example, in my experience Ada makes it easier to get parallelism correct than many other languages, primarily because Ada tasks have precisely defined and specified semantics and because Ada's "Rendezvous" makes certain kinds of errors impossible. The same is sometimes claimed about Rust.

In more permissive languages like Go, however, almost every parallel program is wrong in one way or another. For example, it is extremely common for parallel Go libraries to stop in a deadlock. The situation is so bad that I have stopped using concurrent Go libraries, except for the well-implemented standard libraries. Even then, many package authors don't tell you what they do internally and their libraries lock up in unexpected places. For example, I've recently been looking for a parallel tree walker library in Go, and every package I tried stopped with an infinite deadlock on my file system, presumably due to the interplay between error handling and spawning goroutines. Ironically, even many goroutine scheduling libraries I've tried have locked themselves up.


> For example, I've seen many times code that erroneously assumed that values can be read from global memory concurrently if only write access is guarded.

As someone who has made that assumption, what's wrong with it?


If you have two accesses on different threads, at least one of which is a write, then you need a synchronization chain between them. Note that this is true even if the read occurs before the write. The most common synchronization chain is the first thread doing some kind of "release" action (releasing a mutex, or storing to an atomic variable using a release (or stronger) ordering type), followed by the second thread doing some kind of "acquire" action on the thing that was "release"d.

When you are missing the synchronization chain, the issue you can run into is that the value of other memory locations is undefined. The sequence of accesses to any given location does comprise a total order (due to cache coherence), but it is not generally the case that the ordering of accesses to different locations will be agreed on by different threads.
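
A minimal C++ sketch of such a chain, with hypothetical names: `ready` is the thing that gets released/acquired, and `payload` is the other memory location being published through it.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                      // plain, non-atomic data
    std::atomic<bool> ready{false};       // the synchronization flag

    void writer() {
        payload = 42;                                   // plain store
        ready.store(true, std::memory_order_release);   // "release": publishes the store above
    }

    void reader() {
        while (!ready.load(std::memory_order_acquire)) {}  // "acquire": pairs with the release
        assert(payload == 42);   // guaranteed only because of the release/acquire chain
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }

Replace the release/acquire pair with plain accesses and the read of payload becomes a data race, even though each individual access looks harmless.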


I'm not a low level programmer, but can you elaborate on this a bit?

E.g. why would reading a shared mutable value be wrong, in some specific circumstances? In particular, if the value is a word (and the architecture guarantees atomic writes for word-sized values), and if the reader doesn't care if it reads the previous or the new value (e.g. executes a read of a monotonic counter in a loop, so even if I read the current value this time, I'll read the new value next time or after some number of next times), and if the language is sensible (i.e. Java, not C++... although AFAIK even C++ frowns upon "values out of thin air" and they're trying to modify the standard to formally prohibit them... but in general the issue with C++ is the compiler, not the platform if we assume x86).


If you've never had to deal with this stuff before, you probably follow a naive model of memory, wherein all operations are "sequentially consistent": you pick a random thread, and you execute the next memory instruction, and everyone understands that the memory is updated. That model is not how any multicore hardware is actually implemented, because it is slow and often unnecessary.

Now, it is known that, if your program is correctly synchronized, then the existing hardware models are indistinguishable from sequentially consistent execution. This gives rise to the data-race-free model that defines all modern language-level programming models: data races can only exist in the absence of proper synchronization, so we call that behavior undefined [1], and our code goes back to sequential consistency.

The primary issue is that there are two entities playing games with your code to support faster execution: both the compiler and the hardware are making assumptions to speed things up. From the perspective of hardware (i.e., why using "volatile" to stop the compiler from playing games isn't good enough), different processors may have the values in their caches. While cache coherency requires that, at any given point in time, every processor agree on the value of every memory location, there is great leeway to reorder loads and stores so that they may execute in different orders than the instruction sequence implies. In the extreme, it is possible that the load of *p may happen before the load of p itself (this is the Alpha memory model, and comes from the existence of uncoordinated cache banks).

[1] Okay, there's a slight untruth here: the C++11 "relaxed atomics" is effectively a data race as could be observed by hardware, but is not undefined in the C++11 sense. This is where it's important to point out that we have been struggling for over a decade to actually come up with a suitably precise definition for relaxed atomics that we're happy with--it is by no means a solved problem.
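
For the monotonic-counter case asked about upstream, a sketch of how you would express it legally in C++ without a lock: make the counter itself an atomic and use relaxed operations, which is enough as long as the counter is the only thing being communicated (the moment it's meant to publish other data, you're back to acquire/release).

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<unsigned long> hits{0};   // shared monotonic counter

    void worker() {
        for (int i = 0; i < 1000000; ++i)
            hits.fetch_add(1, std::memory_order_relaxed);   // no ordering w.r.t. other data
    }

    void sampler() {
        // May see a slightly stale count, but never a torn or invented value,
        // and no undefined behavior, because the access is atomic.
        std::printf("hits so far: %lu\n", hits.load(std::memory_order_relaxed));
    }

    int main() {
        std::thread t1(worker), t2(sampler);
        t1.join();
        t2.join();
    }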


As a summary to see if I understand you, it sounds like what you're saying is that the CPU will calculate updates to memory, but in the interest of speed and efficiency will not actually write the new value back into RAM for some unspecified period of time?


Both yes and no. If you replace "RAM" in your statement with "cache," then you get a more accurate summary for the situation that gives rise to the issues we're talking about.

But it's also the case that caches avoid writing back into main memory as much as possible, since bandwidth to main memory is quite small. However, cache coherency protocols are used to make all the caches agree on the value, so that committing the value to cache is equivalent to committing it to main memory.


> I don't think you find any version of parallel programming that isn't hard.

Using pure functional programming on an embarrassingly parallel workload comes pretty close.


A trivially parallel workload is trivial in pretty much any language, though.


Still easy to mess up the coordination when you have mutable shared state.


Embarrassingly parallel workloads kinda don't have mutable shared state... that's why they're "embarrassing".


No. "Embarassingly parallel" just means you don't need to share intermediate results.

You still need to create all the parallel threads/processes/fibers/whatever, divide the work between them and collect or merge the results.

And if you work in an environment where state is mutable and shared per default (i.e. your typical shared-memory multithreaded Java/C#/C++/Ruby program), then it's still very easy to mess up somewhere.


>> I don't think you find any version of parallel programming that isn't hard.

Just serialize, pass data to multiple processes, synchronize, and wait on all of them to return their data back to the main process. This is one kind of parallel programming that isn't hard.

But I cheated: I created expensive copies and processes and let the OS handle the nightmare.


That's not a cheat, just a solution with some trade-offs.


I was facetiously referring to not having to use any synchronization primitives, but apparently the meaning is lost in ink.


You really don't need to know all that to do parallel programming, and if you ever are using any significant number of those in one program, you're going to be in a world of pain.

You're much better off just picking a paradigm and sticking with it. I'd say it doesn't matter which one, but there are a few paradigms that are clear winners in my experience: immutability always wins, and Erlang-style messaging seems to work very easily. If you have the opportunity to determine the architecture of a parallel project, most languages can imitate Erlang-style parallelism with built-in libraries. But what I will say is that no choice of paradigm matters so much that you should switch paradigms mid-project.

A word of caution: concurrency and parallelism are not the same thing. Cooperative multithreading is a way of gaining some of the benefits of parallelism in a single process. If you come from an environment where this is the norm (read: JavaScript) it's quite possible that you've never written a line of parallel code. If you try to apply the same paradigms from this sort of environment to parallel programming, they will not work.


>You really don't need to know all that to do parallel programming, and if you ever are using any significant number of those in one program, you're going to be in a world of pain.

I don't entirely disagree, but most environments (or "paradigms") simply hide all that away from the user. Which is generally a good thing. Chances are you _are_ using a lot of those primitives in any given parallel or concurrent program, but you're only aware of one or two of them. For example, a channel for message passing between threads (or tasks in a cooperative multitasking context) might contain a Semaphore and an atomic reference counted pointer which in turn contains atomic integers. See [0] for an example.

And while concurrency and parallelism are not the same thing, you often end up having to consider the same synchronization problems in your code whether it's concurrent or parallel. Cooperative multitasking does involve true parallelism when the scheduler uses multiple threads on a processor with multiple cores. Code is not just being executed concurrently there, it is (possibly, not guaranteed to be fair) executed in parallel.

[0] https://docs.rs/crate/tokio/0.2.9/source/src/sync/mpsc/chan....


> Chances are you _are_ using a lot of those primitives in any given parallel or concurrent program, but you're only aware of one or two of them.

Okay, but I think it's fairly clear from context that "using" in my post meant "directly using" or "using in such a way that you have to understand how they work". There are a bunch of ways Erlang's message queues could work, but I don't need to know which way they work. If they work with locks you could say I'm using locks, but that's clearly not the point of what I'm saying.

> And while concurrency and parallelism are not the same thing, you often end up having to consider the same synchronization problems in your code whether it's concurrent or parallel.

Often, sure. But also, often not, which is why I said "it's quite possible that you've never written a line of parallel code."


Parallel programming is not hard if you follow the "just use a mutex and you avoid race conditions" mantra. However, this book dives deep into what mutexes actually do, what the hardware below is doing, and alternatives to a mutex for a speed increase. This can be anything from lockless algorithms to RCU, which the author of this book wrote for the Linux kernel.

There's a comment above about Haskell and how you don't have to really put any thought into making things parallel. That's great for 99% of cases, but other times you need higher performance where it matters how the concurrency is implemented.


> Parallel programming is not hard if you follow the "just use a mutex and you avoid race conditions" mantra.

At one extreme of the spectrum, you can do everything under a big lock and be effectively single-threaded. At the other extreme, you can use lots of mutexes and fine grained locking and end up with a mess of potential deadlocks and races and "here be dragons" because the original programmer did not anticipate that you sometimes need data not to change for a bit longer than a short and obvious critical section.

What I'm working on at work right now, oh goodness how many races and problems I've found around locking and multi-threading in general. It's a nightmare.


Sometimes it can be a nice way out to quickly lock the data, copy it, and then unlock, if it's acceptable for a task to occasionally use slightly outdated data. If that is not acceptable, there are read-write locks, which do guarantee that the data being used is current and allow for concurrent read access, but come with the drawback that write accesses might have to wait fairly long before the lock can be acquired exclusively.

What I'm saying is, don't limit yourself to Mutexes, there's a whole zoo of synchronization primitives and each of them has a specific use case that it's ideal for.

Recently I've dealt with a lot of synchronization in Rust, and I would say it's been as pleasant as it could be, considering the subject matter. Deadlocks are still possible, but at least the compiler stops you from accidentally unlocking a Mutex too early.
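
A minimal sketch of both ideas, shown in C++ for concreteness (std::shared_mutex is the standard reader-writer lock there), assuming a config map that is read often and written rarely:

    #include <map>
    #include <mutex>
    #include <shared_mutex>
    #include <string>

    std::shared_mutex cfg_lock;
    std::map<std::string, std::string> config;   // shared state, written rarely

    // Copy-then-release: hold the lock only long enough to snapshot the data,
    // then work on the (possibly slightly outdated) copy with no lock held.
    std::map<std::string, std::string> snapshot() {
        std::shared_lock<std::shared_mutex> lock(cfg_lock);  // many readers at once
        return config;
    }

    void update(const std::string& key, const std::string& value) {
        std::unique_lock<std::shared_mutex> lock(cfg_lock);  // exclusive; waits for readers
        config[key] = value;
    }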


Well, parallel programming requires discipline, and a plan. Deadlocks do not occur when locks are always acquired in the same order. The fact that deadlocks occur at all (and I know they are notoriously hard to debug -- I've done it, lol) means that the locking strategy, the adherence to it, or both were flawed.


> Parallel programming is not hard per se

Last time I checked, there was no "best parallel sorting algorithm". It all depends on your architecture (cores, memory, network).


Relativistic factor: being taught that programming == sequential imperative turns your brain myopic, making it even harder to unlearn and to think about concurrent/distributed systems.

Bonus point: to an extent, sequential imperative code is an abstraction layer over the concurrent ICs implementing your computer.


Direct jump to the PDF download page, avoiding the terrible hosting and Russian ads:

https://mirrors.edge.kernel.org/pub/linux/kernel/people/paul...



Thanks!

FWIW, I've used the original post title ("Parallel Programming: December 2019 Update") to focus on the updates (and partially driven by the fear of comments made right after reading just the book title, "Is Parallel Programming Hard, And, If So, What Can You Do About It?", and focusing solely on answering the question while completely ignoring the content... :-]).


Understand the fundamental problems/principles.

Almost all problems and solutions stem from one limitation: only one writer at any given time. Immutable means one writer. Locking means one writer. Splitting a list and processing each half means one writer. The actor model means one writer. All are valid solutions to your problems.

A piece of personal advice: avoid acquiring more than one lock at the same time if possible. The solutions are usually incredibly messy. The dining philosophers problem is a pretty good example: every single problem involving multiple "per object" locks requires a hand-crafted solution, and changing the order in which forks are picked up does not generalize well to other problems.

If you have global locks you may get away with always acquiring them in the same order but if you make a single mistake you're opening yourself up to deadlocks.
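
If you do end up needing two per-object locks, the usual way to make "always the same order" concrete is to pick an arbitrary global order, e.g. by address. A sketch (hypothetical Account type; assumes the two objects are distinct):

    #include <mutex>

    struct Account {
        std::mutex m;
        long balance = 0;
    };

    // Lock the account at the lower address first, so every caller agrees on
    // the order and a wait cycle (deadlock) cannot form. Assumes &a != &b.
    void transfer(Account& a, Account& b, long amount) {
        Account* first  = (&a < &b) ? &a : &b;
        Account* second = (&a < &b) ? &b : &a;
        std::lock_guard<std::mutex> lock1(first->m);
        std::lock_guard<std::mutex> lock2(second->m);
        a.balance -= amount;
        b.balance += amount;
    }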


Sometimes std::lock [1] is a solution: "Locks the given Lockable objects lock1, lock2, ..., lockn using a deadlock avoidance algorithm to avoid deadlock."

Of course, this only works if all the locks you need to take are exposed and not hidden behind some abstraction. Then again, when they are hidden, it's hard to make sure you only take one lock at a time anyway...

[1] https://en.cppreference.com/w/cpp/thread/lock
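
Usage is essentially a one-liner; since C++17 the same deadlock-avoidance algorithm is also wrapped in RAII form as std::scoped_lock:

    #include <mutex>

    std::mutex m1, m2;

    void cpp11_style() {
        std::lock(m1, m2);   // locks both without deadlock, in some internal order
        std::lock_guard<std::mutex> g1(m1, std::adopt_lock);
        std::lock_guard<std::mutex> g2(m2, std::adopt_lock);
        // ... critical section ...
    }

    void cpp17_style() {
        std::scoped_lock guard(m1, m2);   // same avoidance algorithm, one RAII object
        // ... critical section ...
    }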


What about sequential locking/unlocking with consistent locking order for all resources? I was under the impression that this solved the deadlock problem but I haven't yet tried it in practice.


Queues are my advice. The locking is all in the queue operations. Everybody else just takes and posts work items to the queue. Can even use lockless queues.
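
A minimal sketch of that shape in C++: the mutex and condition variable live inside the queue, and workers only ever post() and take() (hypothetical names):

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <utility>

    template <typename T>
    class WorkQueue {
        std::mutex m;
        std::condition_variable cv;
        std::queue<T> items;
    public:
        void post(T item) {
            {
                std::lock_guard<std::mutex> lock(m);
                items.push(std::move(item));
            }
            cv.notify_one();
        }
        T take() {   // blocks until an item is available
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return !items.empty(); });
            T item = std::move(items.front());
            items.pop();
            return item;
        }
    };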


how would you implement, say, a concurrent hash map with queues?


Post service requests to the hashmap queue, get the response on your queue.

But more likely a hashmap is too fine a granularity for queues. If a hashmap can be owned by one process, that would perform better.
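
A sketch of that ownership pattern, reusing the WorkQueue sketched a few comments up: one thread owns the map outright (no locking on the map itself), and callers send requests that carry a promise the owner fulfills. Hypothetical names and protocol:

    #include <future>
    #include <optional>
    #include <string>
    #include <unordered_map>

    struct Request {
        std::string key;
        std::optional<std::string> value;                 // set => put, empty => get
        std::promise<std::optional<std::string>> reply;   // "your queue" for the response
    };

    void map_owner(WorkQueue<Request>& inbox) {
        std::unordered_map<std::string, std::string> map;  // never shared, never locked
        for (;;) {
            Request req = inbox.take();
            if (req.value) {                                // put
                map[req.key] = *req.value;
                req.reply.set_value(std::nullopt);
            } else {                                        // get
                auto it = map.find(req.key);
                req.reply.set_value(it == map.end()
                    ? std::nullopt
                    : std::optional<std::string>(it->second));
            }
        }
    }

A caller builds a Request, keeps the std::future from the promise, posts the request, and blocks (or polls) on the future for the reply. As the parent says, whether this beats a locked or sharded map depends entirely on the granularity of the operations.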


These are good tips but I would also add that if you can avoid shared memory altogether, then you should avoid it.

If you can shard and distribute the data neatly and evenly so that each process handles a completely separate subset, then that is the most reliable long term solution.

With shared memory, it's difficult to control who has access to what parts and when - and that can make it difficult to detect and analyse bottlenecks.


> Avoid acquiring more than one lock at the same time if possible.

If you know you're going to need 2 locks and you acquire both of them at the same time atomically, deadlock is impossible. The trouble happens when you acquire one lock, wait a bit, then acquire a second one. Maybe that's what you meant and I just misread it.


If you are a C# developer: I'm writing an open-source async Job System, very similar to that found in Unity and other modern game engines.

You can track development here: https://github.com/jasonswearingen/godot-csharp-tech

I'm just starting the implementation, but I've done a similar system before so pretty sure I'm not in over my head :)


One version of the parallelism vs. concurrency distinction is that parallelism is the easy part and concurrency is the hard part.

I would say that for concurrency, FRP (or various incremental programming techniques) is essential. The problem is that the space of possible control flows / traces is just too large to characterize by hand. Another problem is that control flow can be bidirectional. If you join together two streams and wish to be push-driven, when you reach the merge you need to be able to pull from the other side. Virtually all of Futures/Tokio, the hot JS thing du jour, etc. fail at this. Even forks are a bit hard to do in manual CPS, as you only have one continuation. Blocking calls fail because of the inflexibility of linear one-input, one-output, but this is better known.


Do you have any links handy discussing the need for and implementation of the push/pull dynamic you're talking about? I've seen passing references before, but I've never come across any substantial discussion of it outside of knowing Conal Elliott has a paper on push/pull FRP.


I don't know of other papers, sadly. I am at the company behind reflex-frp.org/ so it's code + oral tradition for us.


Oh parallelism, the one thing I used to worry about more, and probably should (as far as using GPUs goes)... but also one I don't really sweat so much. Most of the reason, for me, was finding Elixir/Erlang AND their virtual machine BEAM.

Until Elixir and BEAM for me, parallel programming had been difficult because the competing mental models in other languages pretty much make for a textbook fantasy.

I'm sure some folks just hit rage land, "Pardon me, you contemptuously ignorant and vile buffoon, what did you say?" but if you'd be so kind as to let me qualify it...

Parallelism on computers has more or less evolved from things being asynchronous which gave us the effect of parallelism (yay, for concurrency) to real parallelism with the proliferation of multi-core computers. This jump from emulated parallelism to real parallelism hasn't really been too thoughtfully designed into most languages, and for those starting out, depending on your language and runtime/vm, you might be in for quite an odyssey. We live in a world where almost every language you use has NOT been designed with parallelism as a first class problem.

I'm going to try to give an abridged version of my odyssey. Please note I love anything I seem to hate on in this tirade.

First was Ruby. Pure and simple, it has (system) processes (forking), threads, fibers, and libraries implementing actors and event loops :naive-heart-eyes:, but _none_ of them offer real parallelism inside the runtime itself. When one writes code with one of those mental models, it falls apart at the metal (hey, GIL), and eventually you're left scratching your head, disillusioned with the world, realizing "why am I even doing this if my work is CPU bound? It won't matter."

Next up, I knew the JVM had "REAL THREADS!" I needed to get experience with another language and thought "oh hell, yeah, baby, I'm gonna rock the shit outta these 1... 2... 4 CORES"; watch out, CPU-bound tasks. Then came reality creeping in like Santa Claus in July... "Creating threads is expensive? Oh. No, shit?" and "oh wow, so like, I now have between 1 and N threads that I must schedule, and now I'm locking up my system somehow." I was told to apply locks until the burning goes away, but it never does...

And in that drunken haze of threading hell, one night I saw what I thought was a beacon of hope, a JVM-based actor framework named Akka. "I get the actor model, how could this go wrong?" So. Many. Ways. First and foremost was choking this newfangled "thread pool" Akka had given me with my CPU-bound tasks. How does one then gloriously obliterate CPU-bound tasks? Oh shit... one goes right back to where they started... in a separate thread. Or one can do as I did, and GALAXY BRAIN the solution: create a separate pool for my CPU-bound tasks. This also introduces a new gang of friends: "back pressure," "priority," and my favorite, "contention."

Eventually, I kind of accepted fate, knowing that some mental model even if totally contrived was better than no mental model-- right? Still I believed there had to be something better, and I kept reading about how awesome Erlang was for this sort of thing [1] at the time and Elixir had just come out.

Peanut butter meet pickles. This shit was good.

Erlang the language and its runtime/VM, BEAM, were designed to work together to solve a problem space: asynchronous, fault-tolerant, and distributed systems. At that point of realization, I blacked out from astonishment that people could write asynchronous code in a language and runtime designed to do so. Turns out its mental model was, for me, the end-all be-all, because they (language and VM) were, and this is my big point, DESIGNED TOGETHER TO SOLVE THE PROBLEM. On BEAM a "process" can be pre-empted (cue choir of angels) and thus _everything_ is non-blocking. The actor pattern was finally able to hit the metal in my head, and my beard gained substantial majesty.

I write Elixir nearly all day long right now, and while I love it, I wish there were other options (I should probably check out Pony again) for me to use, but honestly, I don't know of any. I'm beyond thankful for Erlang and BEAM since it really helped out my software development game and understanding how parallelism works.

P.S. Anyone know if there is a language designed for running on GPUs that makes programming easier even if it's slightly slower to execute, e.g. the Erlang/OTP/BEAM of GPUs?

[1] BEAM is actually not great at CPU-bound tasks, but that's a different story.


I, too, feel the same way about elixir (but my previous parallel programming experience was only in C). I just figured out how to safely dispatch long running NIFs using zig, so that should be fun!

> P.S. Anyone know if there is a language designed for running on GPUs that makes programming easier even if it's slightly slower to execute, e.g. the Erlang/OTP/BEAM of GPUs?

Try Julia. I wish their GPU model was more actor-ish, but it's really easy to do GPU programming in it. You just take your existing code and re-type-annotate it for the GPU.

If I had a lot of time I would try to make a Julia/Elixir bridge, since I feel like the two languages would play nicely with each other (Julia for computation and Elixir for orchestration).


You should learn about Clojure. Not only does it provide nice concurrency semantics (not actors, though; more granular stuff), it's also functional (in a pragmatic way, like Erlang) and immutable to the core.

tutorial: https://www.braveclojure.com/introduction/

on gpu: https://neanderthal.uncomplicate.org/articles/tutorial_openc...


While we're on Clojure and the GPU, let me shamelessly promote my upcoming books (I'm the author of that old tutorial and the state of GPU dev in Clojure progressed a lot since then):

Deep Learning for Programmers: An Interactive Tutorial with CUDA, OpenCL, DNNL, Java, and Clojure

https://aiprobook.com/deep-learning-for-programmers/

Numerical Linear Algebra for Programmers: An Interactive Tutorial with GPU, CUDA, OpenCL, MKL, Java, and Clojure

https://aiprobook.com/numerical-linear-algebra-for-programme...

Both books are progressing really well, and you can subscribe now to read the drafts and get the complete books when they're ready. There's no middleman, and 100% of the proceeds go toward funding the development of open-source Clojure HPC/GPU/ML/DL libraries.


Interesting writeup!

Do you have any ideas why Akka/JVM is so much worse than Elixir/BEAM? I thought that as long as you stick to Akka (i.e. don't use global mutable state but wrap everything in actors), it should pretty much work the same way... Maybe it has something to do with "on BEAM a process can be pre-empted"? Though AFAIK JVM threads can be as well...


I do! Hopefully it's right, it's been a while since I really hacked on parallelism in the JVM.

BEAM can interrupt (preempt) a running "process" (actor) if it is taking too long at any point, and allow another process (actor) to run for a bit before continuing, this mechanism is how it (BEAM) allows for soft real-time guarantees and for "hundreds of thousands of processes" to be "running" on the system.

The JVM (and subsequently Akka), on the other hand, will preempt for a higher-priority process but will not actually interrupt a running process for the purpose of guaranteeing that all "tasks" or "processes" eventually have a chance to complete. It cooperatively schedules everything, so at some point a long-running task must complete before a new one can run: "ey mate, can I use that core when you're done?"

Example 1: At some point you could have a low priority pool that _never_ gets run because something is always coming in and interrupting it that's more important.

Example 2: Having a queue build up of tasks at the same priority that then causes the end of the queue to start timing out faster than the tasks complete. This is made especially horrible when the queue timing out is higher priority than other queues you have.

Again, on BEAM you never have to worry about a lot of this because everything is non-blocking by default. Maybe a concrete example is: on BEAM you can write a recursive loop/function and calling it won't lock the system. You could have 1000 processes running that same loop, and it still wouldn't lock the system; BEAM just hums along all the same.

Hopefully that helps explain it or lines up with your knowledge of how the JVM works? Again, there are dark parts of the JVM I'm certainly ignorant of, or perhaps some of the implementation has changed since I last grokked it.

With all that said, Akka is still pretty fucking great, and the JVM is faster than BEAM for a lot of tasks. If I remember right Akka is built using pretty much the JVM equivalent of Grand Central Dispatch (libdispatch), or was years ago when I was really into it.

Tangentially, IMHO, GCD, is the only sane way to do parallelism and async in OOP.


Ok, I get it now. Initially I was a bit confused because you got some concepts mixed up, but it makes sense now.

Basically, Akka takes tasks and runs them with a limited number of executors each on its own JVM thread. A task runs until it's done, at which point another task is chosen to be executed. This is akin to GCD queue-based dispatch.

Technically, this has nothing to do with JVM per se - JVM threads are OS threads and thus can be preempted, but by the OS only, so when the thread continues, it's still running the same Akka task! An alternative Akka implementation would run each task in its own (OS/JVM) thread, which would solve the preemption issue, but OS threads are quite expensive (though getting less so recently) so there would be a large(r) overhead. Erlang / BEAM runtime (and Golang, Haskell, and maybe others) solves this by implementing preemptive user-space threads, which bring the best of both worlds (assuming a good implementation) - ability to run millions of tasks (each with tiny overhead) while avoiding starvation (by making tasks preemptible / re-schedule-able).


Yeah, sorry for the vernacular issues. That's probably the worst part of (parallel) programming for me these days, ha!

I guess IMHO, it does have something to do with the JVM because that's how it chose to implement it! :) Not to say it doesn't have its merits.

Finally, yeah, the whole OS scheduler pre-empting thing has me curious these days as I dig more into the "system" itself. I'm sort of wondering, as you pointed out, what sort of things I might be able to get for free that emulate the Erlang way of doing things if I look deep enough.


In theory, every runtime (including the JVM, or even C!) could be implemented with arbitrary user-space preemption, but it's rather expensive and incurs a lot of overhead. So languages/platforms that (to some degree) optimize for performance (definitely C, but also Golang and the JVM) avoid that. In particular, tight loops suffer: to support preemption, basically every loop needs an extra condition to check whether enough time has passed for the task to be interrupted. This turns out to be expensive enough that your language becomes basically useless for any kind of mathematical computation (all vector & matrix math and a lot of statistics is full of loops within loops).

Golang in particular dealt with this issue extensively, not just because of starvation but also because of GC: in order to enable garbage collection of global objects, all threads need to be paused and their local variables & stacks snapshotted. Doing that efficiently is hard. I think they were trying to make threads arbitrarily preemptible with OS signals, but that requires extreme care in the implementation (you need to know where local variables are stored at every instruction!) and is not portable (no signals on Windows).

E.g. https://github.com/golang/proposal/blob/master/design/24543-...
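
To make the cost concrete, the safepoint the runtime has to inject is conceptually something like the check below (a hand-written C++ analogue for illustration, not what Go actually emits):

    #include <atomic>

    std::atomic<bool> preempt_requested{false};   // set by a scheduler/GC thread

    long sum(const long* xs, long n) {
        long total = 0;
        for (long i = 0; i < n; ++i) {
            total += xs[i];
            // The injected check: an extra load and branch on every iteration of
            // what used to be a pure add loop, just so the runtime has a chance
            // to stop the thread at a known point.
            if (preempt_requested.load(std::memory_order_relaxed)) {
                // yield to the scheduler / park at a GC safepoint here
            }
        }
        return total;
    }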


Man, you've brought back memories from school when I tried writing recursive Java code that spawned a new thread at each recursive call. Suddenly, a program that used to execute in seconds pretty much locked up the whole workstation for five or ten minutes before I killed it.


It's hard or easy depending on your model (of parallelism) and what you want to do with it.

From a certain POV computers are always doing things in parallel (8-, 16-, 32-, 64-bits at a time, and so on, eh?)

Has anyone mentioned Pi Calculus yet?

https://en.wikipedia.org/wiki/Pi-calculus


Anecdote: for me the actor model has been the most understandable and useful concurrency primitive I've used. Pi-calculus, which was inspired by the actor model, is similarly elegant.


Parallel programming is not inherently hard; just structure your data to be parallel-friendly, i.e. pay attention to the dependencies among your data.


Parallel programming is not hard if you have already solved all the hard parts.


Someone wiser than me observed, “the thing about parallel programming is that it always ends in tears.”


Haskell.

I'm at almost fifty years of programming (APL was an early highlight; punched cards weren't) in many languages. Haskell isn't the be all / end all of languages, but when I want to add ten lines to a program and have it use every core, I don't know a better choice than Haskell. Nothing else comes close.


There is this halo around functional programming languages being trivial to parallelize, but how they are implemented really matters. Historically, only a few functional languages have been implemented with parallelism in mind (SISAL, pH).

GHC Haskell's choice of on-use evaluation as its non-strict evaluation strategy means that it cannot simply evaluate all function arguments in parallel. On-use means that if you do not use a value it does not get evaluated, which some code relies on. Plus, as the trope goes, Haskell is implemented in a very imperative way, which can lead to some fun surprises when parallelizing aggressively.

There are like three different flavors of parallel programming in Haskell, differing in how much you want to trade off performance for having to know about compilation, memory management and execution details to make it run efficiently in non-trivial parallelism cases.


I came here to say this, except for Clojure. While not as pure as Haskell, it has amazing parallel programming primitives like atoms, pmap, software transactional memory, and more that are extremely easy to use relative to trying to accomplish the same thing in Java etc., in my very humble opinion.


Isn't Software Transactional Memory more for concurrency than parallel computing though?


True. People tend to lump these concepts together. I think people in general are interested in utilizing more cores, and so they typically are interested in STM as well, although frankly I haven't ever used it in production systems. The rest of the parallel/concurrent tools are super useful and extremely easy to use, and you also know assertively that you won't have deadlocks or race conditions, etc.


I used Haskell for a few months. It seemed like a really well designed language. I didn't look too much into the multithreading stuff (does it just work automatically for all Haskell code?).

The main issue I found was the language and libraries are really hard to pick up as a beginner. A library for ruby or python would start off with a few examples showing how the most common use case works and then full api docs for the library. Haskell libraries usually only provide a function name and type definition and if you are lucky you get a one liner telling you what the function does but not how to use it.

This seems not to be an issue for Haskell experts, because "PathMultiPiece ps => (ps -> Writer LiteApp ()) -> Writer LiteApp ()" is all the info they need, but I found it much harder to use because of that.

I also found that many of the libraries were totally inaccessible unless you had an understanding of abstract mathematical constructs. I wanted to do some basic XML parsing in Haskell, and the prerequisite was reading some huge book on the theory of the math behind how the library works. I then switched to Ruby, and Nokogiri had basically the exact code snippet I needed right on its website.


This is a totally valid point. There are now initiatives like https://tech.fpcomplete.com/haskell that gather tutorials for essential stuff, but I think we're far from being beginner-friendly. I don't think there's any other option except for improving libraries little by little, adding well-written tutorials.


Haskell absolutely lacks the soft documentation needed to attract new people. Just about every library could use a "mini tutorial" in their README, at the very least, to allow even novices to quickly bootstrap them and start getting productive without having to understand all the underlying concepts.


To answer the headline, a worthy option is to not play the game. It's another kind of optimization, and often it's better to spend the effort on making your program more understandable and accessible, more correct, simpler, richer in features, easier to change, cheaper in engineering hours, or on getting more iterations of user feedback, etc.


Parallel programming in Python is hard. On Windows, it's hell.

In Matlab it's a breeze. Just put "par" in front of the thing. Done.

If you wanna shell out money to enable that functionality, that is.



