
As someone who scaled the database at a company that dealt with over a petabyte of data, here are a few thoughts:

It seems like the biggest issue Etsy had was a lack of strong engineering leadership. They relied on an outside consultant to rewrite their application in a fundamentally flawed way. The lack of eng leadership resulted in both poor technical decisions being made and bad engineering processes for how the app was developed.

Etsy's biggest technical mistake was building an application using a framework that no one understood. This led to the predictable result that when they deployed the application to production, it didn't work. Even if the application had worked, Etsy still would have needed to maintain the Twisted application indefinitely. Maintaining an application written in a framework no one understands sounds like a recipe for disaster. Sooner or later you are going to run into issues that no one will know how to fix.

Process-wise, Etsy made the mistake of not properly de-risking the Twisted application. They only found out that it didn't work when they were deploying it to production, and they made the same mistake a second time when they tried to deploy the replacement. When I'm building a new service, the first bit of code I write tests whether the fundamental design will actually work. Usually this takes only a few days, instead of the months it takes to implement the full service. It sounds like Etsy could have set up a simple Twisted app that did a fraction of what the final version did; if so, they would have found a number of fundamental flaws in the design before spending a ton of time building out the service.
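For illustration, a de-risking spike like that can be tiny. Here's a minimal sketch (assuming Python and Twisted; the protocol and names are hypothetical, not Etsy's actual design) of a service that does just enough request/response work to exercise the fundamental approach before months go into the real build:

    # Hypothetical spike: a bare-bones Twisted TCP service that answers a
    # key lookup with a canned value. Enough to measure latency, concurrency,
    # and deployability of the design; nothing more.
    from twisted.internet import protocol, reactor

    class LookupProtocol(protocol.Protocol):
        def dataReceived(self, data):
            # A real spike might hit the actual datastore here; a canned
            # response is enough to validate the plumbing end to end.
            self.transport.write(b"value-for:" + data.strip() + b"\r\n")

    class LookupFactory(protocol.Factory):
        def buildProtocol(self, addr):
            return LookupProtocol()

    if __name__ == "__main__":
        reactor.listenTCP(8007, LookupFactory())  # port is arbitrary
        reactor.run()

Point the existing PHP code (or a load generator) at something like that for a week and you learn most of what you need to know about whether the design will hold up.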

To be honest, this story shows how a business can succeed even with a bad engineering team. It would be one thing if this sort of incident had killed off Etsy. Instead, Etsy has gone on to become the $5 billion company it is today. I'm not saying engineering doesn't matter. All I'm saying is you can build a pretty successful business even with a bad engineering team.




> It seems like the biggest issue Etsy had was a lack of strong engineering leadership.

Know many engineers at Etsy; this is absolutely the problem and it persists to this day. The engineers are fine and do their best to self-organize but there is no cohesive engineering vision. They complain about this issue over beers with some frequency.


Yes, 100%. It's not so much "how should we architect/build this" as "we don't even know what problems we're solving or what is important" before embarking on large projects/refactors/rewrites. I don't think this is unique to any one org, but many rewrites are done based on tech trends/popularity rather than to actually solve a problem.

Even if the Python rewrite hadn't immediately failed, what would success look like? What would failure look like? How would you measure it?


From a lot of what I've heard there is trouble hiring / retaining line-level managers -- engineering managers are usually the ones thinking about and negotiating roadmaps among multiple teams. When you have developers filling this role ad hoc, it can be difficult for them to context switch from low-level technical concerns to operational business concerns. So the low-level technical concerns are the ones that get focused on rather than being focused on some overarching business-led strategy.

Not picking on Etsy here in particular as lots of companies have this problem. But it's not something people often think about as being crucial for good engineering culture.


_All I'm saying is you can build a pretty successful business even with a bad engineering team._

I completely agree. The most important lesson I’ve learned in the past decade of software development is:

Good product/UX design can save bad engineering, but it doesn’t work the other way around.


100% this.

Tackling a real problem > Having a big addressable market > Good product / UX > Great Engineering.

Strictly.

Great Engineering pays back on a much longer time scale, and becomes an enabler or a breaker only at the critical scaling stage. You can get surprisingly far with a protoduction service and still build something huge (see Twitter).

Great design is atomic and homogenous. Weak technology leadership opens up a vacuum that gets filled by Architecture Astronauts (Etsy), every team doing its own thing technology-wise (SoundCloud), or, at the very worst, some dogmatic homebrew NIH stack (various places I've worked at).

Every unhealthy technology organization looks unhealthy in its own way, but the great ones all look alike: clean, simple, rather homogenous, and logical.


This is only true within certain bounds of engineering quality. Engineering can be bad enough to make it impossible for the business to succeed, because the solution costs more to run than the revenue it generates.


Depends on the type of software. If you're building transactional software where your business takes a cut of a few bucks per user transaction, the software can be tremendously bad before it cuts into revenue, as long as you're not totally busting the user experience.


>> can save bad engineering

Depends on how "bad". I've seen _many_ counterexamples, where no amount of lipstick could fix the underlying flaws. What good is a pretty website if it's slow, or if it doesn't work at all half the time?


I see this fixed all the time by throwing more resources at it.

Case in point, all servers were configured to accept a maximum of 5 connections, so when the load balancer got more than x concurrent requests, they started failing.

Instead of fixing the problem, we can just scale the number of servers (horizontally scalable, yay) until the problem goes away.

Now you have tons of 8GB/4-core machines sitting (almost) idle, and an immense amount of money wasted.
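A sketch of that failure mode (illustrative only, not the poster's actual stack): a hard-coded connection cap of 5 will make a server shed load the moment the balancer sends a modest burst, and the cheap fix is one line of configuration, not more machines.

    # Illustrative only: a server whose listen backlog is hard-coded to 5.
    import socket

    MAX_CONNECTIONS = 5  # the hypothetical misconfiguration

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", 8080))

    # With a backlog of 5, the kernel queues at most ~5 pending connections;
    # under a burst from the load balancer, the rest get dropped or reset.
    server.listen(MAX_CONNECTIONS)

    # The fix is sizing this against expected concurrency, e.g.
    # server.listen(1024), rather than fanning out to more idle machines.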


> Instead of fixing the problem, we can just scale the number of servers (horizontally scalable, yay) until the problem goes away.

This works if the bad engineering causes a linear increase in scaling costs. If the bad engineering causes an N! increase or a 2^N increase in scaling requirements, then there's a good chance all the servers on the planet aren't enough.

I'd say engineering is no easier to fix by throwing money at it than UX is.
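A quick back-of-the-envelope illustration of that point (numbers are made up): with linear cost you can always buy more boxes, but with 2^N or N! growth the bill stops being payable almost immediately.

    # Toy numbers: how many servers you'd need if each box handles 10,000
    # work units, under linear vs. exponential vs. factorial cost growth.
    from math import factorial

    PER_SERVER = 10_000

    def servers_needed(work_units):
        return -(-work_units // PER_SERVER)  # ceiling division

    for n in (10, 20, 30):
        print(f"N={n}: linear={servers_needed(n * 1_000)}, "
              f"2^N={servers_needed(2 ** n)}, "
              f"N!={servers_needed(factorial(n))}")
    # By N=30, 2^N already needs ~107,000 servers, and N! needs more
    # machines than exist on the planet.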


Until you eventually run out of connections on the DB... And all the workarounds that ensue.


Don’t worry, we have a REST service for that too!


Let's say software is "data -> code -> UI".

The two most important things are data and UI, yet the one thing we developers spend all of our time thinking and talking about is code :)


There are plenty of successful applications with bad UIs. Also, business logic is encapsulated in code - if that's borked, no amount of UI will save it.


> There are plenty of successful applications with bad UIs.

Because data >>> UI >> code.

> Also, business logic is encapsulated in code - if that's borked, no amount of UI will save it.

The assumption here is that the bare minimums are fulfilled. If the software doesn't work at all you have none of the above.


> a business can succeed even with a bad engineering team

If you’ve spent any time in and around Etsy the reality of this is a little more complicated. Etsy has had some very impressive people on its technical team throughout and an unusually large number of really talented developers. It’s also suffered from some serious disorganization in its technical path and complicated leadership situations.

It’s the mix of both that produced Etsy. I think with a bunch of “bad” developers they would have been dead long ago. It’s been more like a (often) good engineering team making (often) questionable decisions.


I think a fairer complaint would be "bad engineering leadership", which is basically "bad leadership"...

Someone who cares deeply about the company will understand their core product is flawed and work to find someone to make it better, not ignore the devs.


100% this. I try to remind younger engineers who struggle trying to write perfect code that there are a ton of companies that wrote great code on the way to bankruptcy, and even more companies running unmitigated PHP-based disasters that are very, very successful.


> All I'm saying is you can build a pretty successful business even with a bad engineering team.

I liked your comment.

I would say "bad engineering". Look at all the banking stuff (disclosure: finance/banking lead) and you will immediately recognize this as a reality.

Why? Because there are no KPIs that relate to speed, sustainability, maintenance. There is a distinction between input, output and outcome.

Also, a lot of architecture is done in PowerPoint. Drawing a line in PowerPoint is easy; connecting the services in reality is often anything but trivial.


> Also, a lot of architecture is done in PowerPoint. Drawing a line in PowerPoint is easy; connecting the services in reality is often anything but trivial

From a lot of the "startup stories" I read on HN, I rather gather that often there simply isn't an architectural vision, PowerPoint or otherwise.


> All I'm saying is you can build a pretty successful business even with a bad engineering team.

Many, many years ago, I was asked to take on a new role as the system administrator for a group of advanced-degreed engineers, with their Unix workstations and FEA/CFD software. I set up centralized servers to host their software, a user directory, shared home directories, backup systems, and a Citrix server for Office. It was a tremendous success, and everyone loved the new setup. Upgrades to software became a breeze, people could access their files from anyone else's machines, and most got rid of their secondary PC for just doing email.

The drafting department, with all of their CAD machines, was a different story. Software and data directories were cross-mounted around all of the machines. It was such a mess that, when we had a power outage, it would take them THREE DAYS to get everything going again. This happened a few times a year.

I moved on to another new role to set up a Sun E10K, 3 cabinets of EMC disk, a 384-tape backup machine, and all the stuff that went with this. I was trying to explain the difference in these setups to my (very senior) IT manager, to get a point across about what I was doing. (I might have been arguing about setting up a third, "backup area network," in addition to the second "administrative network," but memory fades.) I got done, and she stared at me like I had a hunchback, and said (in reference to the drafting department), "Yeah, but it WORKS, right?"

You're absolutely right. Senior execs look at the fact that business is improving, think to themselves, hey, we must be doing something right, and move on to other things. However, it is a source of continual disappointment that there is SO MUCH lost opportunity in the decisions that get made about how IT should be done. And, what's more, those bad decisions are being made by people who, by their role and their pay, should really be expected to understand the difference between my two examples, and what the extra overhead is costing.

I mean, if you had two physical production lines in a plant, in series with each other, and one of them couldn't do anything for 3 freaking days after a power outage while the other carried on as if nothing had happened, plant managers would immediately fire a line manager who told them, "Yeah, but it WORKS, right?"

I think these situations endure because the emperor has no clothes. The people who are calling the shots can't understand the technology that MAKES the difference, and don't want their ignorance exposed.


Great insights. I'd also add: the notion of UI/UX and good business covering up engineering is highly dependent on the business space. The further down the call stack you go, the less you're going to get away with bad engineering.

If the product is Digital Ocean or DynamoDB or embedded software installed in Ford F-250s, UI/UX and slick business aren't going to mask bad engineering.


Depends. I've heard that Heroku's backend container networking stack was a total mess (not sure if it still is). However, they seem to have done well despite that because they had good docs, good marketing, and good UX in the form of DevEx.


Yes. I think this Etsy story showcases an interesting split in engineering approach, between being "right" vs being "effective". Perhaps a variant on "worse is better".

Etsy was originally built by two NYU students while they were still in college. It was their first production software project. It grew super fast in a period where the whole industry was still figuring out how to scale web applications; perhaps only Amazon & eBay had actually figured it out by then.

Then VCs got ahold of Etsy and tried to put out the fires by replacing the "kids" who built the thing. This was around 2006, around the time of the Series A funding round, and it predates even Twitter's fail whale.

The founding team wasn't a bad engineering team. It was a young founding team who built a thing that grew like crazy; had they not built it that way, there would have been nothing for a later engineer to come along and rewrite. But because it was crashing and dollars were at stake, and because there was some drama that prevented the founding team from having the time to build a proper scale-up engineering organization, a bunch of people came in and immediately reached for a rewrite.

They had a working thing, on a not-that-uncommon stack, but the new engineers insisted on a rewrite. Why?

Because new engineers always insist on a rewrite. Because, as my friend James Powell once put it, "Good code is code I wrote; bad code is code you wrote."

Python and Twisted. I am a big fan of Python and watched the evolution of Twisted -- the rise in popularity, and then the decline. I can imagine how they came across a group of engineers who were absolutely convinced Twisted was the silver bullet to solve their engineering challenges, at that specific moment in time -- when Twisted was at a peak of popularity.

Just like Django/Rails was the silver bullet in 2011. (Or Scala?) Just like Angular was in 2015 (or Go?). Just like React is now (Or k8s?).

Of course, I am not using the "silver bullet" term accidentally. Tech trends change, but "there is no single development, in either technology or management technique, which by itself promises even one order of magnitude [tenfold] improvement within a decade in productivity, in reliability, in simplicity." (Brooks, 1986)

New tools are just an excuse for a rewrite, and rewrites are the ultimate shiny object to a new engineering team, brought in to "fix" an existing and working system.

As an aside: the founding software team at Etsy didn't get nearly enough credit for building a thing that worked and that grew from nothing. If they had focused on the "right" architecture, none of us would ever have heard of Etsy, and we certainly wouldn't be debating the rewrite. That founding team didn't get enough credit, neither in equity nor in history. But that's another story.


It is hard, as a new employee, not to say "stop everything and let's rewrite this" when something is broken. New engineers aren't the only ones to say it, either, as the craze and popularity of what's hot captivates longer-tenured employees too. If something vibes with the problem you are solving, then it is only natural to want to use it. Proving its usefulness and viability is the part no one really wants to do, as it's a long slog that works against the need to get a promotion. Why not just use it in a new app to prove it? And that becomes the basis for "let's use this everywhere" (aka generalize it!).

It is very nice to try new things, though, because without that we can't move forward; but how much is enough seems to be the troubling part, as investing time learning the "new unproven thing" every year or so is taxing.

> As an aside: the founding software team at Etsy didn't get nearly enough credit for building a thing that worked and that grew from nothing. If they had focused on the "right" architecture, none of us would ever have heard of Etsy, and we certainly wouldn't be debating the rewrite. That founding team didn't get enough credit, neither in equity nor in history. But that's another story.

Very unfortunate reality here.


Indeed. And, to be clear, I don't think rewrites are always wrong. I just think they are very, very dangerous, and need to be treated as such. I wrote about a successful rewrite I presided over in this essay, "Shipping the Second System". But anyone who was on my team when we did that rewrite will tell you: we were far from optimistic about it. We only did it after exhausting ALL other options, and it was a risk-managed project all the way through.

https://amontalenti.com/2019/04/03/shipping-the-second-syste...


Interesting read.

> “the second system is the most dangerous one a man ever designs”

Never thought about it that way. Feature creep, and overcorrecting for what you thought went wrong last time, both get tested here.

> So, all in all, we did several “upgrades to stable” that were actually bleeding edge upgrades in disguise.

I can relate. Being burned by certain upgrades can push one to err on the side of caution even if the previous version had bugs. Software is never finished, as they say.


I'm having a Baader-Meinhof effect; I recently read about this "Second system syndrome", I think in "The Unicorn Project".


> Because new engineers always insist on a rewrite. Because, as my friend James Powell once put it, "Good code is code I wrote; bad code is code you wrote."

Good code is code that is testable and tested; bad code is code that is not tested (and may not even be testable).

As someone who's written too much non-tested (and sometimes non-testable) code, I'm pushing myself to test, and to hold that line, on projects both large and small.

I came into a project last year that had been around for 4 years. It was not only untested, but the original code in place turned out to be somewhat untestable altogether. 16 months later, we're still finding "rock bottom" moments of "holy shit this is garbage". By "garbage" I mean finding out that data people thought was being recorded was being discarded or overwritten, for months. Financial data being simply wrong, and 'correct' data not being recoverable.

My first reaction looking at it was "this needs to be rebuilt". No no no, we were told - it's already working, it just needs a few more features. To the extent that things were "working", they were working by accident. There were no docs or tests to demonstrate that things were working as expected.

The last year has been spent keeping it 'going' while trying to unravel and repair the years of bad data that are continually uncovered every few weeks. "We don't have time to rewrite!" The fact is, probably 60% of the code that was in use before has been rewritten, but it's taken over a year. It would have been faster to do it from scratch, and worry about importing/transforming data as we verified it after the fact.

So... good code is testable and tested. Absent these qualities, if I'm the 'responsible party', I will advocate strongly for a rewrite. If that 'rewrite' is 'in place', people will be told up front everything will take 2-3x longer than they expect, because we're now doing discovery, planning, testing and whatever 'new feature dev' you were asking for.

Part of the problem with this particular system was that it was started as a rewrite, but for the wrong reasons, or... executed wrong. "We can't understand the old code", so they tacked a second framework on in place and started blending the code - but still documented nothing, wrote no tests, and had no repeatable build process. Nothing. So instead of one undocumented system, they had two, but one was in their head, so it was 'OK'. Based on what we inherited, one could only assume they really didn't have any understanding of the domain problems they were tackling, which just compounded every bad decision.
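For what it's worth, "testable and tested" on an inherited system usually starts with characterization tests that pin down what the code does today, before any rewrite. A minimal sketch (pytest assumed; the module and function names are hypothetical, not from the project described above):

    # Hypothetical characterization test: freeze current behavior so the
    # in-place rewrite can be checked against it.
    import pytest
    from billing.legacy import summarize_transactions  # hypothetical module

    def test_summary_keeps_every_recorded_transaction():
        # Guards against the "data silently discarded/overwritten" failure:
        # every input row must be accounted for in the summary.
        rows = [
            {"id": 1, "amount": 10.00},
            {"id": 2, "amount": 25.50},
            {"id": 2, "amount": 25.50},  # duplicate id must not be dropped
        ]
        summary = summarize_transactions(rows)
        assert summary["count"] == 3
        assert summary["total"] == pytest.approx(61.00)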


Well said!


thank you :)


> All I'm saying is you can build a pretty successful business even with a bad engineering team.

A good business model covers a multitude of technical sins.


Twisted is a widely used and well-understood framework that continues to work well and remains an option worth considering even in the era of modern Python 3 asyncio.

I can’t say whether the Twisted consultant’s advice was good or bad re: business needs, but your comment seems very wrong regardless. A solution built on Twisted would be nearly the opposite of something no one understands, and there would be big communities to go to for help, and you wouldn’t even have too much vendor lock-in since many frameworks that are competitors to Twisted could be used with minimal code changes.


You're missing the point. At a Python shop, using Twisted would have been a fine choice. Adding a completely unnecessary Python layer at a PHP shop was not helpful. Yes, there exist people in the world who know Twisted, but they do not tend to work at a place where everything is written in PHP. Adding a centralized choke point and extra network hops in the name of reducing latency was a bad technical strategy, and adding a layer in a language that the existing engineering team didn't know was also a bad idea.



