Ditch Excel and Use Julia Data Frames (erik-engheim.medium.com)
54 points by socialdemocrat on Nov 26, 2020 | 27 comments



Julia and Excel are fundamentally different.

In Excel, the logic/analysis part is on the same level as the data. The data and the operations done on it are a single file. A datum is in a cell, and a formula is in a cell. You can move formulae around like you can move data around. You are viewing and operating on the data/logic in the same context. When you open an Excel document, what you see right away is the culmination of all analyses done on the data.

With Julia, the data and logic are separate entities. You have to first import data[1]. The resulting dataframe is created by explicitly running cells/scripts on the imported data. When I view the dataframe, I cannot just point-click-edit it. Any edit I make is a separate line of code to run. Then the whole script is run again over the whole dataframe to get the new result. This is opposed to Excel, which creates dependencies at the cell level and live-updates the spreadsheet with each edit, without reprocessing the whole spreadsheet.
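A minimal sketch of that separation, using the DataFrames package with a made-up two-row table:

```julia
using DataFrames  # assumes the DataFrames package is installed

# Data lives in a table object; logic lives in the script.
df = DataFrame(name = ["pen", "pad"], price = [1.5, 3.0])

# An "edit" is a line of code, not a point-and-click action:
df[1, :price] = 2.0

# Derived results appear by re-running logic over the table:
df.doubled = df.price .* 2
```

Nothing updates until the code is executed again, which is exactly the workflow difference described above.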

Since logic is its own entity (a script) in Julia, it is easier to write longer and more complex processes: things you cannot easily do in a single-line formula parallelized over cells in Excel. Also, since the processing and editing are explicitly spelled out in a script, it is easier to keep track of changes and reason about the logic. So they both have their advantages.

[1]: You can even do that in Excel: import CSVs, databases, etc.


Yes, I think the best approach is to combine tools. But what I think is a big problem is that people insist on stretching Excel far beyond its practical range.

Personally I never use spreadsheets, other than for some simple plotting. I often prefer using a simple CSV viewer instead. I find it more practical to use tools which work natively with CSV, as that is the format one will be using in Julia, Python, R or whatever programming tool you use.


Are there any existing FOSS tools that let you do a similar thing to what Excel does in the data layer but with Julia (or another programming language)?

i.e. I’m in my cell and want to type my Julia right in that cell and watch it live-update when I hit enter, similar to Excel


The closest may be Pluto.jl https://www.youtube.com/watch?v=IAF8DjrQSSk

In Pluto, plots and calculations get bound to data, and when you change that data the effects immediately ripple through.

Personally if I work with table data in Julia, I will use it with something like Table Tool https://apps.apple.com/us/app/table-tool/id1122008420?mt=12

It lets you view and edit a CSV file. Saving and loading CSV files from Julia is very quick anyway. You can use a couple of keystrokes to bring back a previous command.
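The round trip described here might look like this, using the CSV and DataFrames packages (the file name is made up):

```julia
using CSV, DataFrames  # assumes the CSV and DataFrames packages are installed

df = DataFrame(item = ["pen", "pad"], qty = [3, 7])

CSV.write("inventory.csv", df)              # hand the file off to a GUI viewer/editor
df2 = CSV.read("inventory.csv", DataFrame)  # pull the (possibly edited) file back in
```

In the REPL, pressing the up arrow recalls either line, which is the "couple of keystrokes" part.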

There are also some newer ones, such as Tad: https://www.tadviewer.com


What's with "ditch Excel"? Why not "ditch Pandas", which this is much closer to. Better yet, why not just create a tutorial on using Julia without mentioning other tools, and let the reader decide? Julia lets you SUM an entire column! Unbelievable! And you can rename a column? I'm blown away!

Excel has tons of point-and-click capabilities that don't require IMPORTing anything or knowing what an IOBuffer is. When you try to pretend that Julia can replace Excel, you just make yourself look foolish and uninformed.


It is meant as an example for beginners to understand how you work with data in tables with a programming language. You have to walk before you can run.

The advantage of a programming language such as Julia is that you can simply copy-paste code examples to manipulate data just by reading a blog entry, as is the case here.

You cannot do that with Excel. When you begin getting into more advanced Excel functionality, it is not obvious how it works, and learning it requires a very visual approach. You cannot search through a blog and copy some code. You need to look at pictures and videos showing how the GUI is operated.

Alternatively, you read a textual description of what you need to click on. Rather cumbersome.


People have been saying this for over 20 years now. Excel was, is, and will remain a foundational part of computing tasks for programmers and non-programmers alike. It is not by chance that Excel is popular.


A lot of that is also simply momentum. Many people are utterly unaware of the alternatives outside the MS Office bubble. I have met so many people who thought the only program that exists for writing text is MS Word. Seriously. It is not even funny.

But that is how people often get trained early on. You get taught to use MS Word at school and in other contexts and never get exposed to alternative tools or ways of doing things.

Here is a simple anecdote. On two occasions I have had long-term Excel users get stuck on some task, looking for a solution. I asked them to try Apple Numbers and see if they could figure it out. Despite never having used it, they figured out their task in no time.

My point is that MS Excel isn't even all that good a spreadsheet application, yet almost everybody uses it despite better alternatives existing.


I've been writing software for many years. I still use Excel for quick personal finance stuff, as a calculator, for many other random lists, etc.

It takes 5 seconds to launch and you're already productive.

It has its place. Not in a data science pipeline obviously.


Could you give some examples of concrete tasks? Because I use Julia for exactly those same tasks, and I don't get why Excel would be any benefit. Julia launches faster. It is easy to just keep around in a Terminal window all day, and it doesn't consume much in the way of resources.

Any command I have previously typed is quickly accessible. Excel, in contrast, doesn't know what you did previously, nor can it easily bring that kind of history back.


Are you serious?

"Can you give concrete examples for how you use a knife in the kitchen?" It's an absurd question.


I'm deep into the dataframe/SQL way of doing things but I also use Excel a lot.

In my opinion, spreadsheets provide a different paradigm that dataframes do not supplant. In theory, anything in Excel can be done in a programming language (that's obvious). In practice, certain tasks are simply easier to do in Excel.

Excel is essentially a functional reactive calculator on a visual grid. It lets people quickly create models to test ad-hoc ideas. Because it works at a cell-level, ad-hoc calculations are much simpler. Because it is reactive, you can test different scenarios without recompiling or re-running.

Sure, it has its limitations -- which is why multidimensional modeling tools like Quantrix exist. Also it's true that many people use Excel far beyond its intended purposes (I've come across some really complex Excel sheets with VBA that really should be in a SQL database). And yes, the dataframe paradigm is powerful for structured/systematic and reproducible/replayable transformations of data, e.g. column/row operations, aggregations, filtering.

However, the very structure that makes dataframes so powerful and consistent also constrains them.

In a spreadsheet, you can break a formula at specific cells (for exceptions, etc.). You can do that programmatically in a dataframe too, but it's a less natural operation that requires ad-hoc coding, whereas in a spreadsheet you just head over to the cell and change the formula.

Or take operations that are easily done via copy-and-paste, like lagging a column by 2 but only between rows 23-65 and skipping certain rows because they're, say, weekends/holidays. In Excel, you can carry this out by manipulating the data directly, i.e. physically moving cells around. In a dataframe, you have to extract the required rows, apply the lag operator, and then reinsert the lagged data (and reset indices, or do a vertical concatenate), etc.

I manage my personal cash flow and do scenario analyses of different kinds (financial, probabilistic, etc.) in Google Sheets (and not a database or dataframe) precisely because I can introduce exceptions in the table and perform ad-hoc tasks like testing scenarios in real time easily. If I want to check an idea out, say the calculation of airflow in an aircraft cabin, a spreadsheet (and not a dataframe) is what I'd reach for first.
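For illustration, here is roughly what that extract-lag-reinsert dance looks like on a dataframe (the row bounds, lag amount, and column names are all made up):

```julia
using DataFrames  # assumes the DataFrames package is installed

df = DataFrame(x = collect(1:10))
lo, hi, k = 4, 8, 2                         # lag rows 4-8 by 2

sub    = df.x[lo:hi]                        # 1. extract the required rows
lagged = [fill(missing, k); sub[1:end-k]]   # 2. apply the lag

# 3. reinsert into a fresh column, leaving the other rows as missing
df.x_lagged = Vector{Union{Missing,Int}}(missing, nrow(df))
df.x_lagged[lo:hi] = lagged
```

Three explicit steps versus one copy-paste in a spreadsheet, which is the point being made above.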

Navigationally, a spreadsheet also lets you "touch" data and get a feel for it in a way that dataframes don't (in a programming language, you're typically always working with a subset view of a dataframe, e.g. SELECT TOP 1000 * FROM x, or df.head(3) -- these paradigms are the default because of the inherent assumption that all operations have to scale to the largest of datasets). In exploratory data analysis, it's super important to be able to see and feel the data because that's how you discover inconsistencies, sentinel values, exceptions, etc. Spreadsheets give you a low-friction interface (like filter/sort and freeze rows) and a set of powerful functional tools out of the box that generally beats any CSV viewer.


In all honesty, no one (at least, no one in the real, actual world) wants alternatives to things that already work. Give me a standard that's good enough and we'll see how to deal with edge cases. That's what the internet is for, dammit.

Yes, I know a million different, better pieces of software exist for my peculiar task. No, I don't care. I have other things to do in my life than babysitting a damn computer.


What you should care about is using the software that makes your life the easiest, and that is not necessarily what you are currently using.

What you want is easy-to-use software that is flexible and can do what you want to do.

The stuff you learned 10 years ago isn't necessarily that software.


Spreadsheets automatically recalculate on partial updates to their inputs, which automates a lot of the coding that would be involved in data analysis. They are also a portable format that can be universally understood, which is much harder to say about dataframes in a typical business/work setting.


The Pluto notebook for Julia does exactly the same. It creates dependencies between code cells and data automatically. If you change any data or code, that change ripples through the rest of the notebook. So e.g. you could bind a slider to a variable, and if that variable is used in a plot, dragging the slider will modify the plot in real time.

Demonstration of Pluto.jl here:

https://www.youtube.com/watch?v=IAF8DjrQSSk


IMO you're better off integrating Excel into your workflow. You can use Excel to share the results as a report and view them as a report. Then, when you want to do something with the data, just import it into your dataframe. Sometimes people are not ready to learn, and sometimes organizations cannot be convinced to move forward because the cost of training employees might end up higher than the existing costs.


Totally agree. I think people should combine tools more. E.g. I use GUI tools to edit cells in a data frame, but I do the processing in a programming language. Sometimes I may use Apple Numbers to create the visualization. When doing something visual, such as creating plots or deciding colors and labels, a GUI tool is often more practical.

But the actual data processing seems a lot more practical in a programming language, as you get version control, can keep processing steps separate from data, can modularize it, etc.


you can do this with pivot tables. if your database has more than 1m entries you can do some preprocessing with powerquery.

you don't even have to learn a single line of new code, if you've never coded before. you only need to know how to point and click with a mouse.

and the best thing is, you can take whatever you've done and send it to whoever happens to have Excel installed on their machine, which, in a corporate environment, is everybody. no need to set up a new programming environment + dependencies.


Julia is just as easy to install as Excel. There is literally no difference. If you cannot do a one-click install, then how are you supposed to figure out how to make a pivot table?

Of course learning to code is a higher barrier. I don't think anybody challenges that. But as you begin using Excel much beyond the trivial, I will argue it is actually easier to learn how to code.

Code has the advantage that it is very easy to document and explain with text. Explaining how to do things in Excel through text is not easy. You need images, videos, etc., which are not searchable. You cannot copy-paste an image to repeat an operation in Excel. With code, in contrast, you can copy and paste code you find in blogs, help documentation or anywhere else.

Basically, code is much easier to reuse and modify. Once you learn how to code, you can advance your capabilities much quicker.


> If you cannot do a one click install

corporate IT security policies very often block installs of any kind.


An example with Pizza sales data showing how you can use Julia programming language to do common spreadsheet tasks, such as creating a pivot table, plotting and summarizing.
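A hedged sketch of what such a pivot table might look like with DataFrames (the sales figures here are invented, not the article's data): group by the keys, aggregate, then spread one key into columns with `unstack`.

```julia
using DataFrames  # assumes the DataFrames package is installed

sales = DataFrame(
    month = ["Jan", "Jan", "Feb", "Feb"],
    pizza = ["margherita", "pepperoni", "margherita", "pepperoni"],
    sold  = [120, 80, 95, 110],
)

# Aggregate sales per (month, pizza) pair...
by_month = combine(groupby(sales, [:month, :pizza]), :sold => sum => :total)

# ...then pivot: months as rows, one column per pizza type.
pivot = unstack(by_month, :month, :pizza, :total)
```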


A better idea would be to use Excel to arrive at your analysis on a smaller sample of data, and if it's a repetitive thing to do, create a tool with Julia.


What is the benefit of using Julia over R?


A more user-friendly, versatile and higher-performance language. E.g. Julia can, in principle, do everything R, Python, Fortran, Matlab and C can do, all in one language.

You can use R and Python packages from Julia. You can call Fortran and C code from it. Unlike R and Python, you don't need to have a module rewritten in C or Fortran to get high enough performance.
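The C interop is built into the language itself; e.g. a call straight into libc on a typical Unix-like setup, no wrapper package needed:

```julia
# ccall takes the C function, its return type, the argument types, and the arguments.
len = ccall(:strlen, Csize_t, (Cstring,), "hello world")
Int(len)  # back to an ordinary Julia integer
```

The R and Python sides go through separate packages (RCall.jl and PyCall.jl, respectively).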

In fact, Julia will often outperform C and Fortran code. There are researchers who have switched from Fortran to Julia to get a higher-productivity programming language, expecting a performance drop, only to realize they got a 3x performance increase.

The main downside of Julia is that it is a very new language. R and Python are old and hence well known, but Julia is growing rapidly and taking over serious areas. Next-generation climate models are built using Julia, for example.


On performance:

Julia really demonstrates the difference between "theoretically fast" and "actually fast in practice". In theory, maybe I could outperform my Julia program in C with careful thought about cache strategies and so forth, the same way I could outperform my C code with hand-tuned assembly. In practice, the ability to prepend a 'for' loop with 'Threads.@threads' and suddenly have the loop run in parallel - or the ability to declare an array as a 'CuArray' and suddenly have all operations on it run on the GPU - means that it's vastly easier for me to write fast Julia code than fast C code.
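The Threads.@threads part of that can be sketched in a few lines (start Julia with e.g. `julia -t 4` for the loop to actually use several threads; the function name is made up):

```julia
# Each iteration writes to its own slot, so the parallel loop is race-free.
function squares_parallel(n)
    out = zeros(Int, n)
    Threads.@threads for i in 1:n
        out[i] = i^2
    end
    return out
end

squares_parallel(5)  # [1, 4, 9, 16, 25]
```

The GPU case works analogously via the CUDA.jl package's CuArray type, which is a separate install.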


Thanks for explaining!

I've used R for a lot of data visualization (and become familiar with the idiosyncrasies of many packages), so it's nice to know that it's compatible.



