Show HN: Eclipse Deeplearning4j (eclipse.org)
73 points by vonnik on Oct 4, 2017 | 38 comments



For anyone who hasn't checked out Deeplearning4j yet and thinks that Python is the only place where the cool AI stuff happens: you really should check this project out!

Their Gitter channel is absolutely kicking, where lots of the actual devs, including CTO and co-founder Adam Gibson, hang out and /really/ answer /any/ question - I find their responsiveness stunning! (I discussed and reported a minor improvement to the config builder, and it was implemented - literally - the very next day!)

Just recently found the library myself, and I found it very refreshing to be able to utilize my decent Java competence in ML, instead of continuously having to hammer my way through limited Python experience to get things done. (Python also has limited multi-threading capabilities, which I found annoying when wanting to do some larger ETL stuff "inline". Python also gives me a strong feeling of "this is a scripting language!": it is super fast and expressive when your needs are met by the vast set of (native) libraries - but when I just want to do something myself with raw code, things get slower. I also love a properly typed language, which Python obviously isn't!)

Do notice that with DL4J, you still get full-on GPU acceleration via ND4J, which is a "NumPy-style" multi-dimensional array library.
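For a taste, here's a minimal sketch of what ND4J usage looks like. The backend - CPU via nd4j-native or GPU via nd4j-cuda - is picked by which dependency is on the classpath, not by the code itself; the shapes and values here are just made up for illustration:

    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    public class Nd4jDemo {
        public static void main(String[] args) {
            // Create a 3x4 array of uniform random values and a 4x2 array of ones
            INDArray a = Nd4j.rand(3, 4);
            INDArray b = Nd4j.ones(4, 2);

            // Matrix multiply (runs on the GPU if the nd4j-cuda backend is on the classpath)
            INDArray c = a.mmul(b);

            // Elementwise ops broadcast much like NumPy
            INDArray d = c.add(1.0).mul(2.0);

            System.out.println(d);
        }
    }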


Thanks! Upcoming is SameDiff, which will allow a PyTorch-like API on the JVM, as well as our Python interface, Jumpy:

https://github.com/deeplearning4j/jumpy

The hope is to be able to "meet" the Python frameworks and run side-by-side benchmarks that anyone who only knows TF could do.

For those that don't know: DL4J is somewhere in between Keras and TF, flexibility-wise.

Samediff will be a bit "closer" to what people in python see/understand.

The goal there is to import models from different frameworks and run them in production.


I would like to chime in in favor of Python. The ecosystem is just too biased towards Python to try and jump into anything else.

Also looking at https://deeplearning4j.org/keras

Could you have documentation that is helpful for the rest of us who use Linux? Your choice of using Docker is brilliant, and I hope that you make it significantly command-line driven, especially for GPU (nvidia-docker or something). Currently your documentation is extremely dependent on Kitematic and such.


I actually don't think you are my target audience. We aren't targeting end Python users. You have to remember: there are 2 ecosystems - production at big companies, and data scientists here.

I am not targeting data scientists. Granted, we will have Python bindings later, but there's a huge number of data engineers who know Java just fine and need it for production. That's who I'm going after here.

You could try again when we have our Python bindings out there.. for now, just wait, I guess?


Their Gitter channel is absolutely kicking, where lots of the actual devs, including CTO and co-founder Adam Gibson, hang out and /really/ answer /any/ question - I find their responsiveness stunning!

Same here. @agibson and crew are amazing and they've helped me numerous times over the past year or so.


If you prefer Java over Python, some of the other more popular frameworks also have Java bindings. E.g. TensorFlow or CNTK.


I'm not exactly sure I'd call those "bindings". If you want an actual compelling competitor to us because we're not a big name, at least look at MXNet. That's more credible.

It's incredible that folks insist on this still. Neither of those frameworks is even aiming for any level of integration with the JVM.

At least take a look at what makes us different.

A big one being: https://github.com/bytedeco/javacpp

We own the whole stack here and integrate pretty deeply. This has led to some great performance improvements.

We aren't just "java bindings" but offering a lot more in one place than the other frameworks.

Also: I highly doubt you've even looked at the numbers.

We're actually doing pretty well in the rankings: https://twitter.com/fchollet/status/915366704401719296

We're not #1 but we have an actual user base.

I think a lot of our upcoming things - like our Python bindings, where people don't have to "see" Java, and integrations with Spark - will win over some folks at big companies.. maybe not DL researchers, which is also fine. Our upcoming API will be like PyTorch though.

Regardless: I'm not sure the framework will matter long term given that model import and things like ONNX are becoming more common.

Finally, I'll just say this: It doesn't hurt the ecosystem to have more competition and different interests than research.

Every single person who says this usually dismisses us for one silly reason or another, ranging from:

1. Not a "big name" (despite being put in production by a ton of them)

2. Not a lot of research papers (yeah no crap we're not a research framework, we do have publications though!)

3. They see Java bindings and can't tell the difference between a full application suite and a tensor lib with Java bindings generated by SWIG

4. Usually a PhD student at a university who hasn't worked at a large company and doesn't understand my actual target audience


It sounds like you got a bit angry because of my comment? Sorry, I didn't mean it negatively in any way, especially nothing negative about DL4J. I just thought it might be a good contribution to the discussion to have a more complete overview of the DL options in Java, as it was not mentioned here. Of course, there are advantages and disadvantages to each. E.g. TensorFlow might be more well-known, but its Java bindings are lacking a lot of the tools you have in the Python bindings. Also, setting aside the programming language, each framework's underlying design has its advantages and disadvantages.

I know some people / companies who do the research and training in Python with TensorFlow, and in production they use the Java bindings (or C++) for inference. That works quite well.

Also, competition never hurts.


Oh, no problem! I guess that's what I was getting at with my reply: we get dismissed all the time by the folks doing research (understandably - they wouldn't use us).

I guess what I was elaborating on (partially for clarity of other readers who really don't know the difference, partially as a response to your comment here) was the fact that they really are very different things.

My problem with these things being posted as "alternatives" is: deep learning is just 1 part of the suite.

When you want to do anything with these other frameworks, usually python is somewhere in your workflow.

That isn't my only complaint though: DL4J "the deep learning framework" is 2 parts: ND4J (the tensor library, directly comparable to TF/MXNet, ..) and DL4J, the deep learning DSL, which is higher level, like Keras.
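To make the split concrete, a rough sketch of the higher-level DL4J DSL side might look like this - a simple MLP built with the config builder. The layer sizes and hyperparameters are arbitrary, and the exact builder options vary by version:

    import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.layers.DenseLayer;
    import org.deeplearning4j.nn.conf.layers.OutputLayer;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.lossfunctions.LossFunctions;

    public class MlpSketch {
        public static void main(String[] args) {
            // Keras-like declarative config; ND4J does the tensor math underneath
            MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                    .seed(42)
                    .list()
                    .layer(0, new DenseLayer.Builder()
                            .nIn(784).nOut(256)
                            .activation(Activation.RELU)
                            .build())
                    .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                            .nIn(256).nOut(10)
                            .activation(Activation.SOFTMAX)
                            .build())
                    .build();

            MultiLayerNetwork net = new MultiLayerNetwork(conf);
            net.init();
            // net.fit(trainingIterator);  // training would go here
        }
    }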

As for your point about folks using C++/Java bindings: even "production" ends up being nuanced there. TF has different deployment modes, with almost no out-of-the-box tooling for very common deployment scenarios. That includes things like Kafka, Spark, ..

We take an Angular.js-like approach to this. Rather than leaving deployment as an exercise for the reader, you get imports from different frameworks, a clear way of doing everything from ETL to setting up a server, and a way of actually debugging/controlling things from the JVM rather than using some SWIG bindings with a black box.

For example, our tensor library has deep integration with the Java GC, which allows reference collecting, as well as an in-Java memory management system for CPU and GPU - all for handling memory on the GPU.

You need to control those things if you, say, run things on a Tomcat server. You don't get that granularity when just deploying some C++ bindings. There's a reason I emphasize having the full runtime tooling available to you (not just integrations).

So yes: in general, when I see someone just offhand pitching these things, I clarify it. As I said in my previous comment: a whole application suite is a very different approach than a "tensor lib with autodiff and some Java bindings".

People have different use cases and different requirements. I'd at least like a fair comparison side by side though..


The Java-TF binding is so shallow it is absurd, IMHO. Basically, you lose everything you have when coding TF in Python, but yes - you can import models and weights into your JVM and do inference there. If you want to create a graph, then you will have to code at a very low level - most of the abstractions and tools are gone.
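For readers who haven't tried it: inference through the TF Java API looks roughly like the sketch below, and that is about the extent of the abstraction you get. The model path and the tensor names "input" and "output" are placeholders for whatever your exported graph actually uses:

    import org.tensorflow.SavedModelBundle;
    import org.tensorflow.Tensor;

    public class TfJavaInference {
        public static void main(String[] args) {
            // Load a SavedModel exported from Python; "serve" is the standard serving tag
            try (SavedModelBundle bundle = SavedModelBundle.load("/path/to/saved_model", "serve")) {
                float[][] features = {{1f, 2f, 3f, 4f}};
                try (Tensor input = Tensor.create(features);
                     Tensor output = bundle.session().runner()
                             .feed("input", input)   // placeholder name in the graph (assumed)
                             .fetch("output")         // output op name in the graph (assumed)
                             .run().get(0)) {
                    float[][] result = new float[1][10];
                    output.copyTo(result);
                    System.out.println(result[0][0]);
                }
            }
        }
    }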


The interesting thing about this process is timing. You shouldn't really contribute a code base to a foundation until it's mature enough.

I think any group of people developing an open-source project wants it to develop quickly and with healthy governance. Joining any foundation, whether it's Apache or Eclipse or Linux, sends a signal that they're mature, thinking about governance and want to make sure the product develops in a way that agrees with the community.

But sometimes governance can get in the way of speed. What we found talking to Eclipse was that we could get the governance and keep the speed. Which means we'll be able to keep pushing DL4J forward without confusion.

Fwiw, here's the DL4J site: https://deeplearning4j.org/

Here are the repositories: https://github.com/deeplearning4j

And the Gitter community: https://gitter.im/deeplearning4j/deeplearning4j


DL4J strikes me as reasonably mature anyway, although you probably know more about that than I do. :-)

Does this mean anything specific for Skymind as the commercial vendor behind DL4J?


That's nice of you. :-) It's mature now, which is why this was the moment to move it into Eclipse.

It means that DL4J & suite is now vendor neutral. On the Skymind side, we will continue to develop all those open-source projects, so you can expect a lot more cool stuff to come: interpretable models, better ETL, Keras as our Python API, vertical-specific apps for EDA, Robotics...


Very cool. I'm a DL4J fan. I actually did a talk at Tri-JUG a couple of weeks ago, on real-time machine learning and BPM, which featured DL4J as part of the tech stack. I'm also working on a SaaS offering around AI/ML and plan to include support for DL4J at some point. Hopefully at some point I can get to a place where I can make some useful contributions to the project.


Thanks! Please share what you're doing and we'll promote as much as we can.


The amount of work we've done this year with Deeplearning4j on performance has been much higher than previous years. We brought DL4J up to par with community standards while maintaining the advantages of Java. I think what a lot of people don't realize is that a ton of effort has been made toward ETL and integration tooling.

It's very difficult to train on multiple GPUs while maintaining performance of ETL. ETL is a scary hidden bottleneck.

I'm very interested to see how Eclipse can continue to push development. I think the people who will especially benefit from this are devops/production teams operationalizing data science.


As someone who mostly works in the field of data warehousing, ETL has a very specific meaning to me (Extract, Transform, Load). Is this the same thing you're talking about?


Yes, so one of the core libraries within the DL4J project is datavec which is ETL-focused. One key problem that we discovered - and fixed - was that reading and transforming data for training could bottleneck a multi-GPU process. You spend a lot of $$$ on a deep learning computer, but making the library performant enough so that you could load data at the same rate the GPUs could consume it was challenging. This scales to about 4+ GPUs (depending on datatype) and we're building a datavec server so this can scale much larger. There are still good returns if you clean and transform your data and presave to disk, which helps with large machines such as a DGX. However, other bottlenecks still apply (which we are solving right now).

I hope that answers your question. I consider the process of extracting records, transforming them, and loading them for training to be "ETL". I understand ETL also applies to other data consumption.

*I should also note that if you want to use DataVec for ETL but do not wish to train a deep learning model, it is quite useful for columnar data on its own.
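As a concrete illustration of the record-reading side, a minimal DataVec-to-iterator sketch might look like the following. The file name, column layout and batch size are all made up (an Iris-style CSV with the label in column 4 and 3 classes is assumed):

    import java.io.File;

    import org.datavec.api.records.reader.RecordReader;
    import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
    import org.datavec.api.split.FileSplit;
    import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
    import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

    public class CsvEtlSketch {
        public static void main(String[] args) throws Exception {
            // Read the raw CSV, skipping one header line
            RecordReader reader = new CSVRecordReader(1);
            reader.initialize(new FileSplit(new File("iris.csv")));

            // Turn records into minibatches of NDArrays: batch size 64,
            // label in column index 4, 3 output classes
            DataSetIterator iterator = new RecordReaderDataSetIterator(reader, 64, 4, 3);

            // net.fit(iterator);  // a DL4J network would consume this directly
        }
    }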


Why mix ETL and training?

I am using TF, and in my workflow I first do all ETL in some separate process, dump all training/validation data into TFRecord file, and then my training program consumes it. Clear separation of concerns without any performance penalty.

And I can iterate over training logic with various parameters as many times as I want without touching ETL.


Integrations and deployment. We are targeting data engineers with this, not data scientists.

A lot of our user base are people who want to take what the python folks did for creating a feature vector from raw data -> import a model (note this is all in java) and run the training pipeline and test pipeline in the same place.

The goal is to provide an opinionated set of tools for doing this.

Upside here: it's obvious how to both preprocess data and create ndarrays from it. Tighter integration also allows us to make some optimizations under the hood in how memory allocation is handled, and lets us target different data types like images and sound, as well as databases, in the same place.

What I'm guessing here is: you're a data scientist focused on building the models. Someone has to take that code and put it in production. If you work at a startup, production might be "TF Serving". If you're a Fortune 500, you're likely not deploying that. People we've worked with are usually constrained in some way (especially by the JVM).

You aren't our target audience; your colleagues are.

We'll work backwards from that by adding our python interface and the like, but largely we want to solve a cross team concern after models are built.


But what if your training logic needs a change in ETL? Then you have to iterate on the ETL too, so doing it inline makes a lot of sense. Microsoft has done something like this with its SQL Server + R offering. Personally I find the MS approach quite appalling: you have to load your model in a PLSQL script, so the ETL + running the model is seamless, but imagine debugging - and oh, forget about multi-GPU support.


The part of the ETL you'd want to "crystallize" is just the transformation from raw data to feature vector.

Beyond that, you already "change in ETL" when you experiment. I'm not sure how this changes anything.

Out of nowhere you've sprinkled in "multi gpu support" which I'm not sure is relevant here. Do you mean as part of training?

We handle the GPU bits for you. All you do is define your transform logic, it runs on one of our backends like Spark, and then when you go to allocate a tensor - boom, GPU.

There's no special compilation or process needed to make this happen.

DL4J supports multi-GPU training out of the box. All you need to do is use our ParallelWrapper module.

We can also do distributed training with GPUs on Spark (yes, this includes cuDNN).
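For reference, the ParallelWrapper piece is roughly the sketch below; `net` and `trainIterator` stand in for a configured MultiLayerNetwork and a DataSetIterator, and the worker/prefetch numbers are arbitrary:

    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.deeplearning4j.parallelism.ParallelWrapper;
    import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

    public class MultiGpuSketch {
        public static void train(MultiLayerNetwork net, DataSetIterator trainIterator) {
            // Data-parallel training: one copy of the model per worker (typically one per GPU),
            // with parameters averaged every few minibatches
            ParallelWrapper wrapper = new ParallelWrapper.Builder<>(net)
                    .workers(4)                 // e.g. 4 GPUs
                    .prefetchBuffer(24)         // async prefetch to keep the GPUs fed
                    .averagingFrequency(3)      // average params every 3 minibatches per worker
                    .build();

            wrapper.fit(trainIterator);
        }
    }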

You've also for some reason decided to attach an open ended coding library where you can do whatever you want to a database?

Baked in ML in the database servers is already notoriously bad. The whole point of what we're doing is to provide a middle ground.

Data engineers have to do this anyways.


Adam, sorry, I guess I was not clear. What I said was that the MS SQL Server + R strategy of baking ML into PLSQL is appalling, and doing something like what you are doing with DL4J is perhaps the right approach. The multi-GPU support rant was about SQL Server + R, not DL4J.

"You've also for some reason decided to attach an open ended coding library where you can do whatever you want to a database?" I didn't catch this part.


Right so I think we were agreeing that how database servers bake in the ML is a bit weird and not the way to go.

The "open ended coding library" is datavec here. Comparing them here is only really semi valid. The processes there are definitely brittle.


I think he agreed with you..!


Yeah sorry about that..I was a bit confused on how to read it. There was a lot mixed in there. I think we're clear now :D.


Because you might want, as I definitely wanted, to also iterate on the feature engineering, not only on the network and training parameters. Thus, doing ETL "inline" is really cool, and speeds up your iteration.

It is also easy once you have a proper language that supports multi-threading. As long as you can do it at "full speed", with full utilization of the GPU, there is no disadvantage to doing the ETL while training.
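(For the curious: in DL4J terms, one way to get this is to wrap whatever iterator does your "inline" ETL in an async prefetcher, so the feature engineering runs on background threads while the GPU trains on the previous batch. A rough sketch, where `etlIterator` is a hypothetical DataSetIterator doing the feature building:)

    import org.deeplearning4j.datasets.iterator.AsyncDataSetIterator;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

    public class InlineEtlSketch {
        // etlIterator is assumed to be your own DataSetIterator that builds
        // feature vectors from raw data on the fly ("inline" ETL)
        public static void train(MultiLayerNetwork net, DataSetIterator etlIterator) {
            // Prefetch up to 8 minibatches on background threads so the GPU
            // doesn't sit idle waiting on the ETL
            DataSetIterator async = new AsyncDataSetIterator(etlIterator, 8);
            net.fit(async);
        }
    }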


Fwiw, here are the links for our ETL tool DataVec (it vectorizes data, or tensorizes it if you prefer): https://github.com/deeplearning4j/datavec

https://deeplearning4j.org/datavec

The thing to remember is that this is ETL focused on machine learning. It's not any old set of transforms. It's transforms that help us normalize, standardize and finally tensorize various data types, be they images, video, text or time series.
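To give a flavor of what those ML-focused transforms look like, here is a small DataVec TransformProcess sketch; the schema and column names are invented for illustration:

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;

    public class TransformSketch {
        public static void main(String[] args) {
            // Describe the raw columnar data
            Schema inputSchema = new Schema.Builder()
                    .addColumnString("customerId")
                    .addColumnDouble("amount")
                    .addColumnCategorical("channel", "web", "store", "phone")
                    .build();

            // Declare the transformations that take it toward a numeric feature vector
            TransformProcess tp = new TransformProcess.Builder(inputSchema)
                    .removeColumns("customerId")       // drop the identifier
                    .categoricalToOneHot("channel")    // one-hot encode the category
                    .build();

            System.out.println(tp.getFinalSchema());
        }
    }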


This is super, super cool!

Love the work you guys have been putting out. How do you generally see the community mindshare in this space, especially with https://techcrunch.com/2017/02/13/yahoo-supercharges-tensorf... (you guys are also mentioned there)?

You guys run this at scale, but Yahoo's TensorFlowOnSpark does look very enticing with its "Easily migrate all existing TensorFlow programs with <10 lines of code change" punchline


Thanks!

I'm not sure that Yahoo's TF on Spark is fully baked, but maybe its users can tell me more about that. I know it's not commercially supported. That's more a business model distinction, but it has technical consequences. Skymind built the DL4J stack from the ground up, so it's easy for us to maintain and extend, and we give enterprise a way to derisk adoption with a support contract. There is no equivalent offering for TensorFlow outside of GCE, let alone TF on Spark. In the end, it's more useful to think of DL4J as a complement to the Python ecosystem, rather than competing with it. We import TF models through Keras, and soon we'll import them directly to help people productionize the models they train with TF. That's important for large orgs committed to a JVM compute environment.
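For what it's worth, the Keras import path looks roughly like the sketch below on the Java side; "model.h5" stands for whatever HDF5 file was saved from Keras with model.save(...):

    import org.deeplearning4j.nn.modelimport.keras.KerasModelImport;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    public class KerasImportSketch {
        public static void main(String[] args) throws Exception {
            // Import a Keras Sequential model (architecture + weights) saved as HDF5
            MultiLayerNetwork model =
                    KerasModelImport.importKerasSequentialModelAndWeights("model.h5");

            // Run inference on the JVM; the input shape must match what the model expects
            INDArray features = Nd4j.rand(1, 784);
            INDArray prediction = model.output(features);
            System.out.println(prediction);
        }
    }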


Doesn't matter. We don't need to be the #1 framework. We're going to be focused on model import and integrating with the big data ecosystem.

None of the frameworks you're mentioning integrate properly, are really supported by an actual community, or get meaningful updates.

The biggest problem you run in to pretty quickly is maintenance. TF and spark both update quickly.

Every time someone has attempted to do a "TF on Spark", they don't add ETL or proper JVM integrations (proper control of CUDA and memory management from the JVM), and there are usually strange interactions between the JVM and Python runtimes.

Also, please don't ignore the rest of DL4J. We have a whole suite of tools in there. It's not just a matrix lib with autodiff like TF and co.

TF in its own way is adding some of this stuff, like TFRecords and some of their readers, but it's not going to add a lot in the way of things like connectors to Kafka and the rest of the big data ecosystem.

They've basically added HDFS..that was about it.

DL4J is also significantly easier to deploy. It's just a jar file or zip file.. you don't need a blob of C++ just to run a model.
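Concretely, the save/restore story amounts to something like this sketch (file names are placeholders):

    import java.io.File;

    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.deeplearning4j.util.ModelSerializer;

    public class DeploySketch {
        // At training time: persist the whole network (config + params) into a single zip
        public static void save(MultiLayerNetwork net) throws Exception {
            ModelSerializer.writeModel(net, new File("model.zip"), true /* save updater */);
        }

        // In the service: restore it and serve predictions from any JVM process
        public static MultiLayerNetwork load() throws Exception {
            return ModelSerializer.restoreMultiLayerNetwork(new File("model.zip"));
        }
    }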


So, just trying to understand this: does DL4J become the essential middle layer if someone wants to run TensorFlow on Spark seamlessly?

Because it seemed on the surface that you guys replaced tf. If you guys offer a better ecosystem where tf and spark work together (but don't really rip out tf), then it is even more awesome.


Not just TF: PyTorch, Keras, MXNet, .. Like I said: we won't be the #1 framework data scientists use. There's still a ton of churn in that space.

Granted, a ton of it is TF. Google is doing an amazing job now.

We're trying to be a moderate middle ground.

TF and PyTorch, and anything in Python really, tend to be just interfaces.

Core logic still operates in C/C++.

That's great when you want speed, but you end up missing the benefits of the JVM (the tooling is great for monitoring and things folks need at scale).

So we do everything via JNI, where we push the math down in one block with minimal overhead, if any.

That, and if you want to write production code, you don't need to push logic down to C. You can do something JVM-based instead, which gives you Kotlin, Scala, Clojure, ..


I think this aspect of DL4J gets lost in the overall message. For me, this is much more powerful: "use DL4J if you want a seamless experience running TensorFlow/Keras/MXNet on Spark".

Because right now, it looks like it is tensorflow vs dl4j.


Honestly, it's hard for me to care.. they can both compete as well as integrate. No offense here, but model import will become a common thing as things like ONNX become standard.

We play up what's unique about DL4J anyway. It's right in the name. We're squarely focused on the JVM.

We also hedge our bets against spark. Most of our logic doesn't even run on spark.

Spark is just a facilitator..it's not where anything that matters runs. We could just as easily use flink or apex here too. Those are also JVM based streaming engines.

We can't overplay spark because most of our deployments won't even be with spark. We don't even use spark in our own production inference tools. We just deploy as a microservice.


Well, interesting parallel you give here!

I have personally been campaigning for ONNX to merge/leverage Apache Arrow (https://news.ycombinator.com/item?id=15195658). It probably makes sense for the efforts to build on top of each other.

YMMV ;)


Integrating with arrow is on our bucket list as well. We plan on integrating with their tensor data type and format.

I feel like you're trying to cross 2 worlds that shouldn't be crossed with Arrow/ONNX though. You're trying to map a neural net description language to a columnar format.. that doesn't make any sense to me. In our case, our runtime will understand both, but for different reasons.

We will import the format for our neural nets and integrate with arrow for our ETL -> tensor pipelines.


For those wondering why the heck we're Java:

Our tensor library has a python interface in the making: https://www.slideshare.net/agibsonccc/strata-beijing-2017-ju...

Our goal with this will then be to write side-by-side benchmarks with the other libs that people can easily run. We know a lot of folks from Python land won't jump over, and we don't expect them to. The hope is that we're equivalent and can integrate better with a big data cluster.

DL4J is not trying to be TF. When I started DL4J 4 years ago, Theano and Torch were the primary frameworks.

I wrote it for deployment into production apps and for the Hadoop/Spark ecosystem.

We will continue that going forward at the Eclipse Foundation, as well as using DL4J in our product.

Please check out the O'Reilly book as well (currently #2 on Amazon, right next to the Goodfellow book):

https://amazon.com/Deep-Learning-Practitioners-Josh-Patterso...

Email is in my profile if there are any specific questions.



