Just to set expectations: while this package is decently stable and used by a number of people, both commercially and in open source, it is basically in maintenance mode at this point. The intention a few years back was to build a complete analytical database on top of Julia, but as a mostly independent project. That vision didn't quite materialize, partly because not enough people wanted to do distributed analytical work (often a single big box is enough, particularly with a language as fast as Julia), and partly because the main developer left to pursue a PhD. I wouldn't be surprised if it got revived in the future, but for the moment it's basically just the distributed tabular computing part (which is useful if you have that problem, but more limited than the name might imply).
> not enough people wanted to do distributed analytical work (often a single big box is enough, particularly with a language as fast as Julia)
As someone who's in the midst of a plan to migrate off of Apache Spark, this comment strikes me as very poignant. We've discovered that, for our particular purposes, a single-machine implementation in a language with better number-crunching and multithreading chops than Java can easily achieve performance comparable to a Spark cluster, while being a good deal easier to develop and maintain.
I have come to suspect that the big data ecosystem is a castle built on a nice, thick technical foundation composed largely of pointer chasing.
Yeah, just look at kdb+/q performance compared to Spark. The former is orders of magnitude more efficient, and you realize that a cluster is only necessary because the performance and memory usage of the JVM are very poor on numerical workloads. Same goes for Hadoop.
The amount of money and engineering resources thrown at a problem that actually has a pretty simple solution (integrate your db and language, make your program and language runtime fit in L1 i-cache, optimize the hell out of your language) is a little disheartening. But hey, it keeps people employed managing and debugging an unholy mess of DBs, K8s clusters, load balancers, KV stores, and message brokers...
Sure, but there are 12 people in the world who know how to use k and q efficiently, and there is no failure recovery or any way to deal with data that genuinely does not fit in RAM.
That is not true. RAM is cheap (terabytes per node is not uncommon). Moreover, a common pattern for kdb+ databases is to have two parts: a real-time database (rdb) that is typically memory backed and a historical database (hdb) that is serialized to disk (mmap'd for near-RAM speed). In practice, most data fetching is sequential (cache friendly), so you get good performance regardless. kdb+ is commonly used in the financial industry, where data volumes are large.
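To make the hdb point concrete, here is a rough Julia analogue of the mmap'd-column idea (not kdb+ itself; "prices.bin" is a made-up file of raw Float64s):

```julia
using Mmap

# Map a column of prices serialized to disk straight into the address space;
# a sequential scan over it runs at near-RAM speed once the pages are warm.
io = open("prices.bin", "r")
prices = Mmap.mmap(io, Vector{Float64}, filesize(io) ÷ sizeof(Float64))

sum(prices)  # cache-friendly sequential scan straight off the mapping
close(io)
```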
The language itself is extremely minimal, so the learning curve is actually much lower than for most languages (it forces you to write in a "vectorized" style, which is not too different from how numpy/R/MATLAB work). I was able to get up to speed in a couple of days.
Source: Used to work with kdb+ (still do sometimes). Not a shill (I don't even like k/j/q/APL).
I am a kdb user; I am not arguing against it. More power to you if you can afford to endlessly scale kdb vertically. The rest of the world has to deal with distributed systems, because people do not want to pay for kdb or for the machines to run it on.
I dunno man, I think Kx deals with out-of-memory data in the most sane way imaginable if you assume time-oriented problems and understand how computers work (i.e., the latencies of squirting data across the different layers of hardware into the CPU). I have no idea what they're giving people for manuals for setting up different query patterns these days, but the old K3 manuals were great: classics of database engineering and practical computer science. Petabytes before it was routine.
The APL family takes some getting used to, as it's a significantly different paradigm from Julia- and Python-looking languages: arguably as different a paradigm as Lisp. It's also fairly natural if you've done any work with MATLAB, NumPy or other array-oriented packages, but you do have to put in the work.
It doesn't get as much attention here, but Jd is pretty good also. It also has a much lower barrier to entry than Kx if you're building a personal project with it (it's free for personal use). J is a more complex language than K, but I like it better; it's more oriented towards mathematics.
I gave a talk at some kx/kdb user groups[2], and definitely saw more than 12 people there. There were hundreds.
It's not a trivial language to learn: you have to be OK with very terse languages in general, and OK with using the language as a way to mentally model the algorithm, somewhat "mathematically".
I'm not using/working with it any more, haven't in a few years. Most of my recent work is in Julia, by choice.
Two very different philosophies are at work here. Julia provides a sane, more or less classical language to work with, including numerous macros and modules to help drive productivity. k/q provides a fairly modern APL-like experience, leveraging Iverson's idea of a programming language as a "tool for thought".
FWIW, the forces behind k/q (Arthur Whitney, et al) are now working on Shakti[1]. Looks like an interesting project, more of a platform than kdb was.
Bringing this back to JuliaDB, I had been hoping for a very Julia-centric/idiomatic persistence engine as well as an analytical engine. I'm no fan of SQL or its variations. Reading this, I've learned DataFrames is supposed to be that. However, my impression from working with DataFrames has been that it is woefully slow on small data sets (100-ish rows). It cannot handle one of the slightly larger 50k-line tabular data sets that I am importing via Excel sheets.
For that work, I may need to resort to Perl (my fall-back programming/data-munging language). This is important, as Perl has the capability of tying (binding) a variable to an underlying data structure/file/... . This enables trivial persistence in a completely idiomatic manner (update/insert into the tied variable by setting a value). It would be interesting if this capability existed within Julia as well.
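For what it's worth, something tie-like can be faked in a few lines of Julia. This is a naive, hypothetical sketch (TiedDict is a made-up name, and rewriting the whole file on every assignment is obviously not what you'd do for real data sizes):

```julia
using Serialization

# A dict that writes itself back to disk on every assignment, loosely in the
# spirit of Perl's tie: persistence happens as a side effect of setindex!.
struct TiedDict{K,V} <: AbstractDict{K,V}
    path::String
    data::Dict{K,V}
end

function TiedDict{K,V}(path::AbstractString) where {K,V}
    data = isfile(path) ? deserialize(path) : Dict{K,V}()
    return TiedDict{K,V}(path, data)
end

Base.getindex(t::TiedDict, k) = t.data[k]
Base.length(t::TiedDict) = length(t.data)
Base.iterate(t::TiedDict, state...) = iterate(t.data, state...)

function Base.setindex!(t::TiedDict, v, k)
    t.data[k] = v
    serialize(t.path, t.data)  # naive write-through persistence
    return v
end

d = TiedDict{String,Int}("scores.bin")
d["alice"] = 3  # persisted immediately; reopening the file later sees it
```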
Hmm ... could be some of the other tooling around it. Right now I can't get XLSX.jl to talk reliably to DataFrames. I rigged up some tests for 10, 100, 1000, ... points, and performance above about 500 points fell off a cliff.
I should look into this more. Thanks for letting me know it can handle things that size.
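In case it's useful, the path I'd expect to work is roughly the following ("mydata.xlsx" and "Sheet1" are placeholders, and the readtable return convention has changed across XLSX.jl versions, so treat this as a sketch):

```julia
using XLSX, DataFrames

# Read one sheet eagerly into a DataFrame. In older XLSX.jl versions readtable
# returns a (data, column_labels) tuple that gets splatted into DataFrame;
# newer versions return a table object you can pass directly.
df = DataFrame(XLSX.readtable("mydata.xlsx", "Sheet1"; infer_eltypes = true)...)

describe(df)  # 50k rows should be easy for DataFrames itself; if it's slow,
              # the bottleneck is almost certainly the spreadsheet parsing
```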
kdb+/q mmaps everything, so the difference between in-RAM and on-disk is immaterial. You're right that not that many people know k or q, but they're pretty easy to learn and a lot of fun as well!
Also, you can set up clusters using a very nice IPC system. You can seamlessly send over code and data with very little fuss.
If your use case lets "your program and language runtime fit in L1 i-cache", it's not a big data use case. Engineering for big data tools is not the problem; deciding that this is the "problem" is.
Well, there is that. But rewind 10 or 15 years to when Big Data was really taking off, and the world was different in ways that meant your data didn't need to be nearly so big, even after adjusting for hardware advances, to qualify as Big.
Most of the programming platforms and libraries that one might choose instead of Hadoop & friends didn't exist back then, or weren't particularly robust. So there weren't many games in town other than Java. And Java (idiomatic Java, not benchmark-friendly Java) tends to reduce the volume of work you can squeeze out of a given hardware setup by introducing a hefty dose of memory overhead.
Really fun reading on exactly this issue - definitely changed my view on Spark and compute clusters.
People definitely don't realise just how powerful a modern computer is, and how far you can get on a single one.
I often find that data scientists (of the Python persuasion) come into engineering organizations with this mindset and undo a lot of progress. They end up realizing that engineering really does have value and isn't just engineering for engineering's sake, and end up layering on a thousand hacks that replicate what was already there. Simplicity for simplicity's sake is just as dangerous as complexity for complexity's sake. There is a reason major enterprise organizations (LinkedIn, Facebook, etc.) have Java backends. Also, it's not a matter of misappropriation; some of what they do does apply outside the enterprise. You see a ton of Python libraries make this mistake, like Dask saying "it's just pandas that runs on multiple machines". "It's just" is usually a sign that there is some complexity they are ignoring or don't have (that most others will).
Distributed computing for data science benefits not only when we go big, but also when we need to go small with independent nodes.
E.g. a cluster of scalable ARM SBC nodes, or even multi-architecture nodes, could be highly efficient.
I've been exploring this kind of setup for a while now; frameworks like Dask, Ray[1], and Modin can do this with Python to some extent, but they are still finicky, and Dask seemed more stable than the other frameworks for this setup.
I wanted to try a language-level distributed computing setup, but Julia required the same setup (OS/arch/path) replicated on all the nodes the last time I looked at it.
I feel distributed computing hasn't gotten as much love as it deserves in the consumer market, especially since many households now have several computing units (PC, tablet, smartphone, watch, TV, game console). If an interoperable distributed computing layer were fundamentally baked into all modern operating systems, minimising the latency of network, storage, and memory, then we could leverage huge compute power on demand, which is not possible today without investing a lot of money in a single compute unit.
Then again, compute power is a major strategic advantage for the manufacturers. Why would Apple share its A13X for compute with a Snapdragon or an Intel chip?
Good point. I used Spark and DataFrames at my last job just because the data I needed was conveniently available in that environment. Using a high-memory single server seems better on general principles, until your data won't fit.
Spark runs pretty nicely in single-machine mode, and it forces you to structure your logic as a nice clean map-reduce. So I think it's fine to use it for "small" data.
I would add that while JuliaDB is really great for some applications, the data ecosystem is very much coalescing around DataFrames, which has recently caught up with JuliaDB in performance and has a broader, more flexible API. (I'm one of the original authors of JuliaDB, so this is not a knock on the package at all.)
Any plans for DataFrames to support larger-than-memory data, though? I've found JuliaDB super useful for this; it's basically a solid Julia alternative to Dask.
My current thinking is that it would make sense to implement the standard DataFrame abstraction (with all the generic functionality it offers) with a couple of kinds of distributed data frame. One approach is to make a distributed stack of local data frames; another is to make a single data frame where each column is a distributed vector. Not sure which is better. Fortunately, Julia makes it really easy to experiment with these kinds of things. Of course, as soon as you have a system like that, you need a scheduler for the work, at which point you want something like Dask or Dagger (https://github.com/JuliaParallel/Dagger.jl).
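To make the first option concrete, here's a hypothetical sketch (DTable, distribute_df, and dsum are made-up names; a real implementation would go through a proper scheduler rather than raw Futures):

```julia
using Distributed
addprocs(2)
@everywhere using DataFrames

# Option one: a "distributed stack of local data frames" -- one Future per
# worker, each holding an ordinary DataFrame shard.
struct DTable
    chunks::Vector{Future}
end

# Split a DataFrame into roughly equal shards, one per worker.
function distribute_df(df::DataFrame)
    ranges = Iterators.partition(1:nrow(df), cld(nrow(df), nworkers()))
    DTable([remotecall(identity, w, df[r, :]) for (w, r) in zip(workers(), ranges)])
end

# A reduction runs locally on each shard, then combines the partial results.
dsum(t::DTable, col) =
    sum(fetch(remotecall(c -> sum(fetch(c)[!, col]), c.where, c)) for c in t.chunks)

t = distribute_df(DataFrame(x = 1:1_000_000))
dsum(t, :x)  # == sum(1:1_000_000)
```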
We are also quite strong with the more general Tables.jl abstraction.
Tables.jl took us a step away from having lots of types of `AbstractDataFrame`.
Subtyping `AbstractDataFrame` is hard: I have to deal with two packages that do it, and it is a big and not entirely documented interface.
Extending Tables.jl (as DataFrames and JuliaDB do) is more natural, and we continue to build more tools that are table-agnostic, with APIs that DataFrames and a distributed table package can special-case when they can do it more efficiently.
The basic Dask-like functionality is in Dagger.jl; we have plans to move the distributed array functionality out of it and keep just the scheduler there.
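For anyone who hasn't looked at it, a minimal sketch of that Dask-like core in the delayed-graph style (exact API details vary between Dagger versions, and newer ones also have Dagger.@spawn):

```julia
using Distributed
addprocs(2)
@everywhere using Dagger

# Build a lazy task graph; nothing runs until collect(), at which point
# Dagger's scheduler executes the thunks across the available workers.
a = Dagger.delayed(sum)(1:1_000_000)
b = Dagger.delayed(sum)(1:2_000_000)
c = Dagger.delayed(+)(a, b)

collect(c)  # == sum(1:1_000_000) + sum(1:2_000_000)
```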
I'm working on a set of functions to work with many DataFrames (or anything, really) in parallel, and to do it out of core where possible; it's basically like JuliaDB, but with a FileTree abstraction rather than a table abstraction.
> The package currently provides working implementations for in-memory data sources, but will eventually be able to translate queries into e.g. SQL. There is a prototype implementation of such a "query provider" for SQLite in the package, but it is experimental at this point and only works for a very small subset of queries.
Still early days, but sounds like they're working on the same problem.
This is really sad. kdb+ is still really popular in the financial industry (with very few, if any, competitive alternatives, open source or otherwise). Julia has the potential to break through in this area.
How does JuliaDB compare to DataFrames.jl? I know JuliaDB has row indices, which is a very nice feature that DataFrames lacks. Is there any reason I shouldn't use JuliaDB even if I don't need persistence?
I wish the Julia ecosystem was a little more integrated: there are a lot of different competing libraries that ostensibly do the same thing. Python has the advantage that it's obvious what you should use: numpy, Pandas, scipy, statsmodels, matplotlib, etc.
With Julia, it's less clear. Though I think part of the reason is that actually releasing a new scientific computing Python library is incredibly difficult and requires a lot of expertise.
Julia makes it pretty trivial for anyone to contribute a model that has excellent performance. This fragmentation is a common problem among expressive languages.
I think this is changing when you see how the community is organized on GitHub.
There is a Julia-something organization (or a specific name like SciML) for every major application area.
I have rarely seen this level of integration in a language community.
I find the IndexedTables (the underlying dataframe-like structure) model nicer to work with than DataFrames.jl. It's a much narrower API, but it's well designed enough that this is a good thing, imo. The codebase is also a fair bit smaller.
It also uses strongly typed tables (e.g. Table<int, string>, etc.), whereas DataFrames is loosely typed. Again, I think this is a good decision (though it does grate against the Julia JIT's property of "being slow" the first time you run a function on a new type).
Finally, its split into IndexedTables and NDSparse is again a good design decision that I have not seen replicated in any other dataframe library.
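A quick taste of that split, as I remember the API (the column values here are made up):

```julia
using IndexedTables

# A plain, strongly typed row table, keyed on (:city, :year)...
t = table((city = ["NYC", "NYC", "SF"],
           year = [2019, 2020, 2020],
           pop  = [8.4, 8.3, 0.9]); pkey = (:city, :year))

# ...versus NDSparse, which behaves like a sparse N-d array indexed by the
# key columns rather than by row number.
nd = ndsparse((city = ["NYC", "NYC", "SF"], year = [2019, 2020, 2020]),
              (pop = [8.4, 8.3, 0.9],))

nd["SF", 2020]  # look up directly by key
```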
It just seems all around better designed.
On the other hand it is verging on being unmaintained.
Before numpy became the standard there were Numeric and Numarray, with painful rewrites for whoever had committed to those. And there was Scientific Python building upon those (I think it started with Numeric, then later was ported on top of numpy), similar to scipy, but that is more or less dead nowadays. And before matplotlib became the standard, there was a bunch of older plotting libs; I recall using something that was a wrapper around gnuplot, IIRC.
So I think it all mostly has to do with the fact that Julia is much newer and many even foundational areas are still in a state of flux. In a few decades the dust will have mostly settled. Of course, in the areas where people are pushing the envelope, it's good that a lot of approaches are tried out before the world settles on a winner.
I always found it rather ironic that this package, which was developed by Julia Computing, manages to violate two common conventions for package names in the Julia ecosystem:
1) don't put "julia" in the package name
2) don't use abbreviations in the package name
Of course, JuliaDB is probably older than these conventions, but it's amusing nonetheless.
So, it's a Julia implementation of something similar to Pandas/numpy. It's not a "DB" in the sense that Postgres, SQLite, or dbm are "DBs": it's not persistent, and data must fit in RAM.
It's cool that it's pure Julia, so it's instantly portable everywhere Julia runs, and the code is safer than it could be were portions of it written in C.
Difference from DataFrames: it's possible to use it on larger-than-memory data, as long as you have it in separate CSV files, which can be super useful at times!
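The pattern I've used looks roughly like this ("data/" and "data-bin" are placeholder paths, and the keyword names are from memory, so double-check against the JuliaDB docs):

```julia
using Distributed
addprocs(4)
@everywhere using JuliaDB

# Ingest a directory of CSVs in chunks, spilling the parsed binary data to
# disk instead of holding everything in RAM.
t = loadtable("data/"; output = "data-bin", chunks = 4 * nworkers())

# Later sessions can reopen the on-disk table without reparsing the CSVs.
t = load("data-bin")
```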
> This being any faster than Pandas is a huge compliment to Julia the language in general and its JIT compiler in particular.
Indeed. This 'whole stack under the same language' is an important feature to have. You get to reap the advantages of an improved JIT. End to end autodiff is easier. Some of these things are a problem, for example, in PyPy because of the Python <-> C bridge.
I wish this website spent its banner space on "what it is" rather than "star us on GitHub!". I doubly wish this because it's pretty confusing what JuliaDB is! From the name, I expect it to be a database, but the website immediately compares it to Pandas. I think of Pandas as a library for in-memory manipulation and analysis of tabular data, not a database / persistence engine. So is JuliaDB not a database at all, but in fact an alternative to Pandas / DataFrames.jl?
In fact, I've been using Julia for work and following the ecosystem since version 0.4 (we're at 1.5 now), and I'm still not sure what JuliaDB is. No doubt this is mostly due to me not having had reason to look very deeply (and/or not being very perceptive), but it certainly doesn't feel like the marketing copy is giving me any help...
I think it is actually a database, so it seems they haven't committed the cardinal sin of putting "DB" in the name of something that's not a database.
> JuliaDB is a pure Julia analytical database. It makes loading large datasets and playing with them easy and fast. JuliaDB needs to support a number of features: relational database operations, quickly parsing text files, parallel computing, data storage and compression.
> This talk is a bottom-up look at the construction of JuliaDB. We will talk about the scope and implementation of underlying building block packages, namely IndexedTables, TextParse, Dagger, OnlineStats and PooledArrays.
> they haven't committed the cardinal sin of putting "DB" in the name of something that's not a database
It doesn't seem to store data in any way - so it is definitely a data processing engine, but a database without an "INSERT" command feels a little off.
From that point of view, this looks a lot like what the original MapReduce did: the data lives outside it and is referenced as URLs, but the engine itself does processing out-of-core and in-memory for very large datasets.
"It doesn't seem to store data" is a complaint that doesn't really make sense to lodge against a library rather than a standalone program. If your program is using a database library, it's your job to write the line of code that tells the library to load data from a particular file into memory. The library cannot persist as a running process of its own across multiple executions of your program that is using the library. This is as true of SQLite as it is of JuliaDB.
I agree that it's a bit odd to not have a direct analog of SQL's INSERT, but you can definitely add rows to an existing table by making a new one and doing a merge operation.
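For example, something like this (column names made up, and the exact keyword spelling is from memory):

```julia
using JuliaDB

# "INSERT" by construction: build a small table for the new rows and merge.
t        = table((id = [1, 2], x = [10.0, 20.0]); pkey = :id)
new_rows = table((id = [3],    x = [30.0]);       pkey = :id)

t = merge(t, new_rows)  # a new table containing all three rows, ordered by :id
```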
Side note: this is what we tell every startup about talking to HN: Lead with a clear statement of what your company does. If you don't, the discussion will consist of "I can't tell what your company does". Same for open source projects of course.
Arrow is a file format, albeit one designed to allow easy interoperation of different analytical frameworks, but it doesn't provide any computation functionality itself.
JuliaDB is more like Dask, or DataFrames.jl - it provides that computation functionality.