Lesser-Known Python Data Analysis Libraries

_euac · on April 19, 2016

My 2 cents: I would not recommend basing any new work on MRjob. As someone who inherited and has been maintaining a bunch of code that depends on it, the library seems to be barely maintained, support for VPC is only partial and not very well documented, the auditing tools stopped working quite a while ago and tracking the progress/status of EMR jobs is extremely painful (to be fair, this is more of an issue with Elastic MapReduce than MRJob itself.)

I love the concept and ease of development, but I can't shake the feeling that the infrastructure is so shaky it almost amounts to instant technical debt (sorry if this offends anyone, I'm just a dumb customer.)

__derek__ · on April 19, 2016

It looks like mrjob development has been re-started, but there was a disconcerting period (nearly two years) without a release.[1] I used it for rinky-dink projects, and it seemed fragile at the time, so I can understand your inclination to divest from it.

[1]: https://github.com/Yelp/mrjob/releases

irskep · on April 19, 2016

In case anyone's curious, what happened was that Dave (@davidmarin) and I (@irskep), the mrjob maintainers, left Yelp within about a month of each other. (There's no story there, just coincidence.) There was never any momentum with new maintainers, going by the release history.

But now Dave is working on mrjob regularly again, hence the pace of recent improvements.

Grandparent is correct about the second-class support for non-EMR production Hadoop usage. Like any open source project, the code only works well if a major stakeholder invests in improving it. Few non-EMR users spend much time contributing, so the situation doesn't improve.

_euac · on April 19, 2016

Hey guys, for what it's worth, MRJob has given us around 3 years of working (if sometimes clunky) EMR, so thanks for that :)

gdulli · on April 19, 2016

I have the opposite experience with MrJob. Classifying it as an inactive project is demonstrably false. The rest are EMR complaints, I use it on my own Hadoop cluster.

_euac · on April 22, 2016

Just read the comment from one of the creators: https://news.ycombinator.com/item?id=11528776

zfrenchee · on April 19, 2016

Do you know of any good alternatives? Any way to write MapReduces in python?

ymt123 · on April 19, 2016

It's not quite the same (since it doesn't become a Map-Reduce job), but if you're mostly interested in the programming paradigm/scalability, the Python API for Apache Spark might be a good alternative.

tanlermin · on April 19, 2016

Yes! Check out dask: http://www.slideshare.net/continuumio

It's free with a permissive license.

It is also capable of native HDFS integration, Yarn etc and can do more complex and granular parallel patterns than just map reduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for continuum. I just want to see its projects succeed because, as a user, I will benefit.

_dark_matter_ · on April 19, 2016

https://hadoop.apache.org/docs/r1.2.1/streaming.html

pvnick · on April 19, 2016

This is likely the best answer for those who wish to code within the map/reduce paradigm by hand and would prefer to use python.
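
For anyone who hasn't used Hadoop Streaming, here is a rough sketch of what a hand-written word count looks like. It assumes the usual streaming conventions (one record per line, key and value separated by a tab, reducer input already sorted by key). The function names are made up for illustration, and the sorted() call only stands in for Hadoop's shuffle so the example runs locally; in a real job the mapper and reducer would read sys.stdin and print to stdout.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts for each word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the map -> shuffle/sort -> reduce pipeline.
lines = ["the quick brown fox", "the lazy dog"]
shuffled = sorted(mapper(lines))
counts = list(reducer(shuffled))
for record in counts:
    print(record)
```

Every record crosses the process boundary as text, which is the per-element serialization cost other commenters in this thread complain about.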

pwang · on April 21, 2016

BUT WHY

Your performance is going to be complete and utter crap because you're paying for serialization on every single data element.

Dask is higher performance and more pythonic: http://matthewrocklin.com/blog/work/2016/02/22/dask-distribu...

hamilyon2 · on April 19, 2016

Luigi does a decent job. It is relatively easy to start with and powerful enough to do almost anything.

rch · on April 19, 2016

I've been using Luigi for a few months, with no complaints. It supports running Python jobs on Hadoop and Spark, but it's not really a MapReduce framework unto itself.

However http://discoproject.org/ might be worth a look as a standalone alternative.

sitkack · on April 19, 2016

I have used Disco extensively in the past, nothing but good things to say about it. Fast job launch, easy to write, the DFS has been stellar. This was only using Python for job code.

_euac · on April 19, 2016

Unfortunately, no. We are slowly moving away to a streaming infrastructure, so I've been mostly trying to "keep it running" until we are done replacing it. Sorry.

tanlermin · on April 19, 2016

Check out dask: http://www.slideshare.net/continuumio

It's free with a permissive license and actively growing.

It is also capable of native HDFS integration, Yarn etc and can do more complex and granular parallel patterns than just map reduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for continuum. I just want to see its projects succeed because, as a user, I will benefit.

gshulegaard · on April 19, 2016

Andrew Montalenti did a great talk about scaling out Python at Parsely at the last PyData conference: https://www.youtube.com/watch?v=gVBLF0ohcrE

But TBH, after a certain scale you should really be asking whether or not you should be using Python.

mring33621 · on April 19, 2016

Apache Flink? http://www.kdnuggets.com/2015/11/getting-started-python-apac...

elsherbini · on April 19, 2016

My favorite new (to me) tool is snakemake[0], make files with python 3 support. It allows me to both make my workflow and document it in the same place, which is hugely helpful for jumping around between different projects or needing to rerun a pipeline with new data. If interested, I recommend taking a look at this tutorial[1] with lots of different snakemake patterns.

[0] https://bitbucket.org/snakemake/snakemake/wiki/Home

[1] https://github.com/leipzig/SandwichesWithSnakemake

melor · on April 19, 2016

I use histogram.py from https://github.com/bitly/data_hacks all the time...

aub3bhat · on April 19, 2016

It's really cool. I wish it handled missing values (empty string); I will submit a PR soon. Until then, here is the issue: https://github.com/bitly/data_hacks/issues/34

jyotiska · on April 19, 2016

Wow. This looks really cool.

philovivero · on April 19, 2016

Interesting. Looks similar to my version, albeit with a bit fewer features.

https://github.com/philovivero/distribution

Bjartr · on April 20, 2016

I expect if you mentioned a few of the features yours has that the other doesn't you wouldn't have gotten downvoted.

elsherbini · on April 19, 2016

plotly is a fantastic tool for plotting. It has a python API [0], but also works from R, matlab, and Julia. It also has support for pandas dataframes and jupyter notebook[1], which is by far the fastest way I've found to make attractive plots. plotlyjs[2] is a fantastic wrapper around d3. So I can go all the way from plotting something quickly from a dataframe to building a totally custom chart.

[0] https://plot.ly/python/

[1] https://plot.ly/ipython-notebooks/cufflinks/

[2] https://plot.ly/javascript/

qacek · on April 19, 2016

I like plotly as well but I couldn't stand the python api nor cufflinks for that matter so I created my own wrapper. It's not fully featured but it handles 90% of the cases I want.

https://github.com/jwkvam/plotlywrapper

elsherbini · on April 19, 2016

Very nice. I like that each chart method returns the figure, so if I need to do something you didn't implement, the figure is available to edit.

qacek · on April 19, 2016

Thanks, I am happy to accept PRs that expose more functionality.

ced · on April 19, 2016

How does it compare with Bokeh?

elsherbini · on April 19, 2016

I prefer the aesthetic of the defaults in plotly over Bokeh. Also, for most of my tasks I can simply use dataframe.iplot() using the library from [1] above, and I value that simplicity. Lastly, I prefer that plotly is built on top of d3js so I have access to that api if I want to do anything crazy, whereas Bokeh reinvented the wheel a bit with BokehJS.

forgetsusername · on April 19, 2016

This is a great list.

I'm equally excited for all the suggestions sure to appear in the comments (hinthint). I got a ton from this thread last time, even though they weren't data analysis specific:

https://news.ycombinator.com/item?id=10782969

qopp · on April 19, 2016

If you're looking for a simple data pipeline, there's pipeless: https://github.com/andychase/pipeless

Also reparse if you want to parse natural language with regular expressions: https://github.com/andychase/reparse

nxzero · on April 19, 2016

A number of good open source data analysis projects by the primary developer of the NumPy package and his company are listed here:

https://www.continuum.io/open-source-core-modern-software

tanlermin · on April 19, 2016

Check out dask for distributed and out of core parallel programming : http://www.slideshare.net/continuumio

It's free with a permissive license.

It is also capable of native HDFS integration, Yarn etc and can do more complex and granular parallel patterns than just map reduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for continuum. I just want to see its projects succeed because, as a user, I will benefit.

gshulegaard · on April 19, 2016

I think you might be interested by this talk: https://www.youtube.com/watch?v=gVBLF0ohcrE

tanlermin · on April 19, 2016

Thanks...though the whole "GIL being a feature" is sort of a joke.

gshulegaard · on April 22, 2016

I believe he meant it as a joke. At least that's how I interpreted it :)

codenberg · on April 19, 2016

Another option is Agate (http://agate.readthedocs.org) which comes from the journalism community.

teekert · on April 19, 2016

Natsort is a lifesaver when working with filenames numbered by humans (like file1, file2 ... file11); those will be sorted correctly. Beats asking people to "Please add leading 0's oh and when you suspect you will pass 100, add 2 leading 0's."

mzs · on April 19, 2016

I dislike how it changes behavior from release to release. For example, with foo-1.2, is that foo 1.2 or foo -1.2? The default depends on the natsort release, with new routines added to restore the previous behavior.
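
To make the ambiguity concrete, here is a small stdlib-only sketch (it does not use natsort, and both key functions are hypothetical names). The same three names come out in a different order depending on whether the "-" is read as a separator or as a minus sign.

```python
import re

names = ["foo-1.2", "foo-1.10", "foo-1.3"]


def version_key(name):
    """Treat '-' as a separator: compare the digit runs as integers."""
    return [int(n) for n in re.findall(r"\d+", name)]


def signed_key(name):
    """Treat '-' as a minus sign: compare one signed decimal number."""
    return float(re.search(r"-?\d+(?:\.\d+)?", name).group())


as_versions = sorted(names, key=version_key)
as_signed = sorted(names, key=signed_key)

print(as_versions)  # ['foo-1.2', 'foo-1.3', 'foo-1.10']
print(as_signed)    # ['foo-1.3', 'foo-1.2', 'foo-1.10']
```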

herge · on April 19, 2016

FWIW, the sort method (and the sorted built-in) take a 'key' argument, where you can pass a function to use to calculate the key to sort the sequence with. So in your file11 case, you can do:

sorted(files, key=lambda x: int(x[4:]))

, and it will do the right thing.

Although with natsort, you don't have to parse the actual strings yourself.

daveguy · on April 19, 2016

That is a neat trick, but it would be incredibly brittle. Kids, don't try this at home!

solaxun · on April 19, 2016

Pass in an re.match or re.search based function; I would imagine that would be powerful enough to meet most needs.

import re

x = ['foo12901','fooo900','fooooooo980090']

x = sorted(x, key=lambda s: int(re.search(r'\d+', s).group()))

print(x)
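
A slightly more general sketch along the same lines, for names with several digit runs or none at all. The helper name natural_key is made up for illustration. It relies on re.split with one capturing group, which puts the matched digit runs at the odd positions of the result.

```python
import re


def natural_key(name):
    """Split into text and digit runs so numeric parts compare as integers."""
    parts = re.split(r"(\d+)", name)
    # Odd positions hold the captured digit runs, even positions the text between.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


files = ["file11", "file2", "file1", "readme"]
ordered = sorted(files, key=natural_key)
print(ordered)  # ['file1', 'file2', 'file11', 'readme']
```

Because text and numbers always alternate in the same positions, two keys never end up comparing a string against an integer.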

shoyer · on April 19, 2016

+1 this is the right way to build a custom sorting function. The only thing worse than relying on ad-hoc heuristics for processing your data is relying on heuristics that somebody else maintains!

c17r · on April 19, 2016

I'll have to check delorean out, I usually use http://crsmithdev.com/arrow/ for python date manipulation. It works a lot like the javascript library moment.

googletron · on April 19, 2016

Cool, let me know if you have any issues. Thanks.

BrandoElFollito · on April 29, 2016

I use arrow for all my time related operations. I tried delorean once (very quickly) and found out it was missing several elements I needed (which arrow had). Maybe I did not look closely enough; I will try again and be back if there is interest. Thanks.

r0muald · on April 19, 2016

dataset https://dataset.readthedocs.org/en/latest/ is pretty good and for most scenarios as easy as tinydb, but backed by a real SQL database.

xemoka · on April 19, 2016

I've used this with a Flask project before, great module and super easy.

Ruud-v-A · on April 19, 2016

Off topic: I really like the minimalistic approach to your blog. In Minion (my default serif font) it looks better and more readable than the majority of webpages out there.

forgotpwtomain · on April 20, 2016

> delorean

Datetime in python is a really sad state of affairs. I wince every time I have to do it - especially if you've just used ruby/rails recently..

ZenoArrow · on April 19, 2016

tinydb looks like it could be useful, thanks for this.

Whilst this isn't a data analysis library per se, PyOpenCL may be of interest for people doing data analysis work in Python:

https://mathema.tician.de/software/pyopencl/

jbrambleDC · on April 19, 2016

Vincent has not been properly maintained in a year, and has been broken since the release of Vega 2.0.

Semiapies · on April 19, 2016

Yes, this recommendation puzzled me. It's essentially a dead project.

"Vincent is essentially frozen for development right now, and has been for quite a while. The features for the currently targeted version of Vega (1.4) work fine, but it will not work with Vega 2.x releases. Regarding a rewrite, I'm honestly not sure if it's worth the time and effort at this point."

merlincorey · on April 19, 2016

The new project is here: https://github.com/uwdata/ipython-vega-lite

micah_chatt · on April 19, 2016

I'd also add to this list Pandashells https://github.com/robdmc/pandashells - Basically use Pandas in the command line.

diegosouza · on April 19, 2016

https://github.com/turicas/rows is also worth a mention.

metaobject · on April 19, 2016

I've used PrettyTable on a few projects and found it to be very easy to use. Highly recommended!

jventura · on April 19, 2016

Tabulate is also a good alternative, and more recent: https://pypi.python.org/pypi/tabulate

azag0 · on April 19, 2016

I would consider Tabulate much superior to PrettyTable.

metaobject · on April 22, 2016

Looks good! It might be time to update some code ...

TheLogothete · on April 19, 2016

I hear a lot of talk about using python for data analysis. I gave up after trying to find a library to do cross tabs. Is there something to make custom tables in python other than prettytables?

jboynyc · on April 19, 2016

What's wrong with using Pandas? http://pandas.pydata.org/pandas-docs/version/0.17.0/generate...

TheLogothete · on April 19, 2016

Perhaps I should have been more clear. I want to present the results in pdf or html, like the xtables, tables and stargazer packages do in R.

closed · on April 19, 2016

I haven't used xtables or stargazer in a while, but ipython + pandas can display tables as html.

Here is an interesting ipython notebook with some examples:

http://nbviewer.jupyter.org/gist/chris1610/f2f4a2e9181f6ec22...

Notre1 · on April 19, 2016

Oooo, I'm going to have to look at that qgrid widget. I've been frustrated when I had to dump a df (DataFrame) to Excel to browse a large df.

Notre1 · on April 19, 2016

You can easily export any pandas DataFrame to html using the to_html() method. To generate a full web page, you'll probably want a templating engine like Jinja2.

The best demo I've seen for generating a PDF report is on Practical Business Python[1].

Edit: I forgot to mention the new pandas Style[2] feature for generating some impressive looking html tables.

[1] http://pbpython.com/pdf-reports.html

[2] http://pandas.pydata.org/pandas-docs/stable/style.html