
I didn't see it mentioned, but why not just use robots.txt? Does Bytespider ignore it?


Yes. It ignores robots.txt.


Not super sure so this felt like a faster way to plug it ASAP.


I'd really love to see more data on whether ergonomic keyboards actually work. From what I've read, the results are mixed. I kind of want to try a split keyboard like the ergodox or Kinesis, but I tend to cross over a fair amount when typing, and I wonder if a split keyboard would be less efficient.

I also overthink the position of frequently used keys like Cmd/Ctrl/Alt (on a Mac, for instance) and what the optimal placement would be, and I feel like there's very little data on this topic.


I find the motion of rotating the hands outward past the neutral position, such as to strike the Enter key on a standard keyboard, to be extremely unnatural and a major source of RSI. I switched over to an ergodox ez out of necessity, and found that moving all frequently used keys to the thumb pads or to layers near the home row was extremely helpful. I think this is because it eliminated those frequent outward flexes and ensured that the wrists remain almost always in a neutral position. I think the health benefits of keeping the wrists neutrally positioned while typing are uncontroversial.


I've been using an Ergodox for a few years now, and the big thing I had to get used to is not cross-typing 'y'. I still do it with a laptop keyboard, but after a week or so I got the hang of it. In the interim, I made the key I would accidentally hit a dead key.

It's a layer key now and I don't hit it accidentally in a typical day.

The ortholinear layout was dead simple for me; I gather that's not true for everyone, but one way or the other, your fingers get used to it all after a couple of weeks.

I don't have any data on them actually working. But I feel a lot better standing, with my upper arms parallel to the ground, and hands shoulder-width apart, wrists slightly supinated. My shoulders stay loose and my neck and back stay straight. Any number of random aches and pains don't happen any more.


On my Kinesis Advantage I can type any letter whilst keeping my wrists straight, hands still and with much less finger movement than a regular keyboard. It completely cured my RSI and it hasn't returned in over 4 years. I realise this is anecdotal, but it's not really a surprise that a better key layout leads to less stretching and contortion and less RSI.


I seem to recall, from about 20 years ago when I got my first Marquardt Mini Ergo, that the company producing them and the magazines testing them mentioned several times that the world record for fastest typing at the time had been set on one. It is a split keyboard.

So the thing to do would be to compare on which devices these records, or yearly championships, have been achieved.

Maybe there are lists? I tried to find them, but got overwhelmed.


One does (it's a southpaw: numeric keypad on the left). A few are TKL (tenkeyless: a full-size keyboard without the numpad).

Actually, as a programmer, I pretty much never use the numeric keypad. But when I start seeing smaller layouts with no arrow keys, Fn keys, or even number keys, I tend to agree: there's a definite trade-off between function and aesthetics. The beauty of custom keyboards is that people get to decide those trade-offs themselves.


As a dev I also have very little use for a numpad. Unless it's the weekend and I'm messing around in Blender, it, along with the traditional home/end cluster and arrow keys, is just a bunch of dead space pushing my mouse way too far to the right. I much prefer the numpad being its own separate thing that can be moved around, but a southpaw setup would be OK too.

The worst thing is when laptops come with numpads, pushing the trackpad off to the left and making it impossible to center my arms while typing. Drives me crazy.


A clarification: these are interviews with people who assemble custom keyboards, I was expecting chats with the people who actually design and produce custom keyboards (like yuktsi, Rama, Wilba, ZealPC, etc...)

Still, very cool to see what people are building. I've just recently fallen down the rabbit hole of custom keyboards, after my Apple Keyboard stopped working. As someone who spends almost half my life at a keyboard, I'm surprised it took this long for me to look into improving the tool I interact with most every day.


> A clarification: these are interviews with people who assemble custom keyboards, I was expecting chats with the people who actually design and produce custom keyboards (like yuktsi, Rama, Wilba, ZealPC, etc...)

At least one of them is with a board designer, the V4N4G0N one with Evan (formerly of TheVan Keyboards)


Some fun facts:

- Applebot was originally written in Go (and uncovered a user agent bug on redirects, revealing it's Go origins to the world, which Russ Cox fixed the next day).

- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

- In it's first week of existence, it nearly took Apple's internal DNS servers offline. It was then modified to do it's own DNS resolution and caching, fond memories...

Source: I worked on the original version.


> It was then modified to do it's own DNS resolution and caching, fond memories...

Unlike other languages, Go bypasses the system's DNS cache and goes directly to the DNS server, which is a root cause of many problems.


This is true but a little misleading. On Windows, Go uses GetAddrInfo and DNSQuery, which do the right thing. But on Linux there are two options: netgo and netcgo -- a pure Go implementation that doesn't know about NSS, and a C wrapper that uses NSS.

Since netgo is faster, by default Go will try its best to determine whether it must use netcgo by parsing /etc/nsswitch.conf, looking at the TLD, reading env variables, etc.

If you're building the code you can force it to use netcgo by adding the netcgo build tag.

If you're an administrator, the least intrusive method I think would be setting LOCALDOMAIN to something (or '' if you can't think of anything), which will force it to use NSS.
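
At runtime you can also force the pure-Go path per resolver. A minimal sketch (the cgo side can only be forced process-wide, e.g. with GODEBUG=netdns=cgo or the netcgo build tag):

    package main

    import (
        "context"
        "fmt"
        "net"
        "time"
    )

    func main() {
        // PreferGo forces the pure-Go resolver for this Resolver,
        // regardless of what the nsswitch.conf heuristics decide.
        r := &net.Resolver{PreferGo: true}

        ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
        defer cancel()

        addrs, err := r.LookupHost(ctx, "example.com")
        if err != nil {
            fmt.Println("lookup failed:", err)
            return
        }
        fmt.Println(addrs)
    }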


Yeah, I've never had to implement my own DNS cache for a language before...

If you're on a system with cgo available, you can use `GODEBUG=netdns=cgo` to avoid making direct DNS requests.

This is the default on macOS, so if it was running on four Mac Pros I wouldn't expect it to be the root cause.


It's possible that wasn't the default setting on Macs back then. I don't know that cgo would be a good choice either, if you're resolving a ton of domains at once. Early versions of Go would create new threads if a goroutine made a cgo call, and an existing thread was not available. I remember this required us to throttle concurrent dial calls, otherwise we'd end up with thousands of threads, and eventually bring the crawler to a halt.
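
The throttle was roughly a semaphore around dial; a minimal sketch of the shape (the limit of 256 is made up, not the real number):

    package main

    import (
        "context"
        "net"
    )

    // dialSem caps the number of dials in flight, which in turn caps
    // the number of threads cgo-based DNS resolution could spawn.
    var dialSem = make(chan struct{}, 256)

    func throttledDial(ctx context.Context, network, addr string) (net.Conn, error) {
        select {
        case dialSem <- struct{}{}: // acquire a slot
        case <-ctx.Done():
            return nil, ctx.Err()
        }
        defer func() { <-dialSem }() // release on return

        var d net.Dialer
        return d.DialContext(ctx, network, addr)
    }

    func main() {
        conn, err := throttledDial(context.Background(), "tcp", "example.com:80")
        if err == nil {
            conn.Close()
        }
    }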

To make DNS resolution really scale, we ended up moving all the DNS caching and resolution directly into Go. Not sure that's how you'd do it today, I'm sure Go has changed a lot. Building your own DNS resolver is actually not so hard with Go, the following were really useful:

https://idea.popcount.org/2013-11-28-how-to-resolve-a-millio...

https://github.com/miekg/dns
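
To give a flavor, a bare-bones lookup with miekg/dns looks something like this sketch (8.8.8.8 is just a stand-in upstream):

    package main

    import (
        "fmt"

        "github.com/miekg/dns"
    )

    func main() {
        c := new(dns.Client)
        m := new(dns.Msg)
        m.SetQuestion(dns.Fqdn("example.com"), dns.TypeA)

        // Talk to a resolver directly, bypassing the system stack entirely.
        r, _, err := c.Exchange(m, "8.8.8.8:53")
        if err != nil {
            fmt.Println("query failed:", err)
            return
        }
        for _, ans := range r.Answer {
            if a, ok := ans.(*dns.A); ok {
                fmt.Println(a.A)
            }
        }
    }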


And Java.

As I understand it, Go and Java are both trying to avoid FFI and calling out to system libs for name resolution.

I tend to always offer a local caching resolver available over a socket.


>- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

Considering the timeline, are those Trash Can Mac Pros? Or was it the old Cheese Grater?


Trash cans :)


>Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

The scale of web stuff sometimes surprises me. 1B web pages sounds like just about the daily web output of humanity? How can you handle this with 4 (fast) computers?


Computers are very fast. We just tend to not notice because today's software is obese.


Yes, let's all run separate web browsers as the application and run our own JavaScript inside our browser. Who cares if there's 5 other "apps" doing exactly the same!

Insanity.


Multiple tabs/browser windows are similar and generally not an issue.


I think they were referring more to apps like Slack and other similar JS/browser-based apps which run separately from the browser. Maybe I'm being generous? Slack is certainly a beast in itself.


Yes, this is precisely what I meant: Electron apps, e.g. Slack, VS Code, Skype, etc., ad nauseam.


Doesn't it depend on a lot of things? For example, you can do just HEAD requests to see if a page has changed since a given timestamp. If not, then there is no need to process it.


For anybody wondering how:

The HTTP HEAD method requests the headers that would be returned if the HEAD request's URL was instead requested with the HTTP GET method. For example, if a URL might produce a large download, a HEAD request could read its Content-Length header to check the filesize without actually downloading the file.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/HE...
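
In Go that's a one-liner via net/http; a quick sketch (the URL is a placeholder):

    package main

    import (
        "fmt"
        "net/http"
    )

    func main() {
        // Only headers come back; no body is transferred.
        resp, err := http.Head("https://example.com/big-file.iso")
        if err != nil {
            fmt.Println(err)
            return
        }
        defer resp.Body.Close()

        // ContentLength is -1 if the server didn't send the header.
        fmt.Println("Content-Length:", resp.ContentLength)
        fmt.Println("Last-Modified:", resp.Header.Get("Last-Modified"))
    }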


Typically you wouldn’t bother with a HEAD request, you’d do a conditional GET request.

When you request a page, the response includes metadata, usually including a Last-Modified timestamp and often including an ETag (entity tag). Then when you make subsequent requests, you can include these in If-Modified-Since and If-None-Match request headers.

If the resource hasn’t changed, then the server responds with 304 Not Modified instead of sending the resource all over again. If the resource has changed, then the server sends it straight away.

Doing it this way means that in the case where the resource has changed, you make one request instead of two, and it also avoids a race condition where the resource changes between the HEAD and the GET requests.
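
A minimal sketch of that flow in Go (assuming etag and lastMod were saved from an earlier response):

    package main

    import (
        "fmt"
        "net/http"
    )

    // fetchIfChanged returns nil if the server reports the resource unchanged.
    func fetchIfChanged(url, etag, lastMod string) (*http.Response, error) {
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        if etag != "" {
            req.Header.Set("If-None-Match", etag)
        }
        if lastMod != "" {
            req.Header.Set("If-Modified-Since", lastMod)
        }

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        if resp.StatusCode == http.StatusNotModified {
            resp.Body.Close()
            return nil, nil // 304: nothing to re-process
        }
        return resp, nil // changed: read the body, save the new ETag/Last-Modified
    }

    func main() {
        resp, err := fetchIfChanged("https://example.com/", "", "")
        if err == nil && resp != nil {
            fmt.Println(resp.Status, resp.Header.Get("Etag"))
            resp.Body.Close()
        }
    }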


Do a lot of random pages return ETags? I've only ever seen them in the AWS docs for boto3.


nginx sends it by default for static files (example: Hacker News [0]); I assume other web servers do too.

[0] https://news.ycombinator.com/y18.gif


I am particularly curious about data storage.

Does it use a traditional relational database or another existing database-like product? Or is it built from scratch, just sitting on top of a file system?


Nope, you don't really need a database. What you need for fast, scalable web crawling is more like key-value storage: a really fast layer (something like RocksDB on SSD) for metadata about URLs, and another layer that can be very slow for storing crawled pages (like Hadoop or Cassandra). In reality, writing directly to Hadoop/Cassandra was too slow (because it was in a remote data center), so it was easier to just write to RAID arrays over Thunderbolt and sync the data periodically as a separate step.
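
Roughly, the shapes involved look like this sketch (hypothetical names, not the actual schema):

    package crawler

    import "time"

    // Per-URL crawl metadata: small records, read and written constantly,
    // so they want a fast local KV store (e.g. RocksDB on SSD).
    type URLMeta struct {
        LastCrawled  time.Time
        ETag         string
        LastModified string
        Failures     int
    }

    type MetaStore interface {
        Get(url string) (URLMeta, bool, error)
        Put(url string, m URLMeta) error
    }

    // Crawled bodies: large, append-mostly, latency-insensitive, so they
    // can land on a slow bulk store and be synced out as a separate step.
    type PageStore interface {
        Append(url string, gzippedBody []byte) error
    }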


Interesting stuff. I've used libcurl to crawl at that kind of pace; is the parsing/indexing separate from that per-day count? Also interested in how you dealt with DNS and/or rate limiting.


I've done similar at a smaller scale. Instead of messing with underlying DNS or other caching in our code, we just dropped a tuned dnsmasq in front as the resolver. The crawler had a separate worker to pre-resolve upcoming hosts, so the cache was mostly hot by the time the crawler was asking.
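
The tuning is mostly a couple of knobs; an illustrative dnsmasq.conf (values made up, not our actual config):

    # Listen locally and skip /etc/resolv.conf; forward to a fixed upstream.
    listen-address=127.0.0.1
    no-resolv
    server=8.8.8.8

    # The defaults are tiny (cache-size defaults to 150 entries,
    # dns-forward-max to 150 concurrent queries); crank them up.
    cache-size=10000
    dns-forward-max=5000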


In my case I was just fetching the home page of all known domain names. The first issue I noticed was ensuring DNS requests were asynchronous. I wanted to rate limit fetching per IPv4 /28 to respect the hosts getting crawled, but couldn't really do that without knowing the IP beforehand (while keeping the crawler busy), so I ended up creating a queue based on IP. I used libunbound. I found that some subnets have hundreds of thousands of sites, and although the crawl starts quickly, you end up rate limited on those.

Also interested, at the higher end of the scale, in how hard/polite you should be with authoritative nameservers, as some of them can rate limit too.
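
The bucketing itself is simple; a sketch with golang.org/x/time/rate (the 1 req/s figure is arbitrary, IPv4 only for brevity):

    package main

    import (
        "fmt"
        "net"
        "sync"

        "golang.org/x/time/rate"
    )

    // subnetLimiters shares one token bucket per /28, so every host in
    // the block draws from the same politeness budget.
    type subnetLimiters struct {
        mu sync.Mutex
        m  map[string]*rate.Limiter
    }

    func (s *subnetLimiters) limiterFor(ip net.IP) *rate.Limiter {
        // Collapse the address to its /28 so the whole block shares a limiter.
        key := ip.To4().Mask(net.CIDRMask(28, 32)).String()

        s.mu.Lock()
        defer s.mu.Unlock()
        lim, ok := s.m[key]
        if !ok {
            lim = rate.NewLimiter(rate.Limit(1), 1) // 1 req/s, burst 1
            s.m[key] = lim
        }
        return lim
    }

    func main() {
        s := &subnetLimiters{m: make(map[string]*rate.Limiter)}
        lim := s.limiterFor(net.ParseIP("203.0.113.7"))
        fmt.Println(lim.Allow()) // true the first time, then rate limited
    }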


Roughly estimating, each Mac Pro could crawl around 3k pages per second.
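
(That's just 1B pages/day ÷ 4 machines ÷ 86,400 s/day ≈ 2,900 pages/s.)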


Which is not possible


Say the average web page is 100kb, and assuming gigabit connection in the office, then that's about a thousand pages per second. If the office switch is on 10gbit that would work out to 4000p/s naively counting. But we're in the same order of magnitude for the speed even on gbit, and we're not accounting for gzip, and the actual average page size might be a bit lower too.


Everything was on 10gigE. The average page size was around 17KB gzipped. Everything's a careful balance between CPU, memory, storage, and message throughput between machines.
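
Back of the envelope: 3,000 pages/s × 17 KB ≈ 50 MB/s ≈ 0.4 Gbit/s per machine, comfortably within 10gigE.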

Apple's corporate network also had incredible bandwidth to the Internet at large. Not sure why, but I assumed it was because their earliest data centers actually ran in office buildings in the vicinity of 1 Infinite Loop.


The average is a lot closer to 25KB IIRC, gzipped


Why not?


Can you share some more details about the current state? Is it still written in Go?


No idea, it's been years since I last worked on it. It was also not the only Go service written at Apple (90% of cloud services at Apple were written in Java), though it may have been the first one used in production.


And I sit here kind of shocked that Apple would use Java for anything, backend or not. I thought Apple had a strong preference for owning its own tech stacks, whether that be ObjC/WebObjects or later Swift...


I think WebObjects was supporting Java even before it came to Apple from NeXT. In the early days, many of Apple's services built with WebObjects even ran on Sun server hardware and Xserves. But nowadays it's all commodity Linux hardware, like you would find in any data center.


WebObjects has been fully Java since version 5 was released in 2001: https://en.wikipedia.org/wiki/WebObjects#WOWODC

Apple's server stack has been primarily Java for about 20 years.


Not sure why you'd be shocked; it's a solid language for enterprise services like Apple offers, and their other languages - C/C++, Objective-C, Swift - aren't well suited to web services.

Great use case for Go though, especially its concurrency features for web crawlers. I reckon Scala could work too, although it's a lot more complicated / clever.


Out of curiosity, why would C or C++ not be good for web services?


I would guess because the input-sanitizing bar is higher for the web; a stack overflow in a program running locally requires the attacker to execute locally, while a use-after-free reachable from port 80 is exposed to a much wider audience.


Some Apple services were written in C/C++. One downside is it's very hard to source engineers across the company who can then work on that code, or for those engineers to go work on other teams.


Apple employs the founder of the Netty project, who has given plenty of open talks about Apple’s use of Netty (which implies Java services). Same is true for Cassandra.


Apple had a very odd obsession with Java right after the NeXT purchase. WebObjects got converted and they tried to do a Java Cocoa. Both were worse than the original.


At the CocoaHeads user group I heard that Ruby has become very popular for their services more recently.


Can you talk more about the specifics? What kind of parsers did you guys use? How about storage? How often did you update pages?


You should check out Manning's "Introduction to Information Retrieval"; it has far more detail about web crawler architecture than I can write in a post, and it served as a blueprint for much of Applebot's early design decisions.


Nice, thanks for the recommendation!

The book is freely available online at https://nlp.stanford.edu/IR-book/information-retrieval-book....


With 1B pages per day I guess you needed 1 Gbit/s connections on each of those machines? Especially if they also wrote back to centralized storage.

I guess there are not many places where you can easily get 4 Gbit/s sustained throughput from a single office (especially with proxy servers and firewalls in front of it). Is that standard at Apple, or did the infrastructure team get involved to provide that kind of bandwidth?


Do you have a timeline of how AppleBot has evolved?


Was that including the ability to render JS-driven, asynchronously loaded pages, including subsequent XHR requests? If so, that's beyond impressive.


Thanks for sharing, mate. Those are amazing insights!


Why did you leave Apple?


Sorry to be pedantic, but your misuse of apostrophes in an otherwise perfect text annoys me.

All three uses of "it's" should be "its".

And I would just write "Mac Pros" instead of "Mac Pro's".


Applebot was built for crawling web pages, to be used for search results in Spotlight and Siri. That user agent might also be used for attachment previews, but the original intent of Applebot was for search indexing.


Apple built their own search engine over 5 years ago, under the Siri / Spotlight umbrella. When people talk about Apple building their own search engine, they generally seem to expect a website dedicated primarily to web page results, but under the covers what powers Apple's Spotlight results is basically a search engine.

The big question would be what Apple would gain from a dedicated website for search results. Would people really switch to it from Google? Why would it be a better delivery mechanism for search results than Spotlight? Not sure the answers to these questions have changed much, from 5 years ago to today.


They have a lock on mobile devices for the western world. They can, by default, pry search revenue away from Google at a large scale. They have the cash reserves to see this done properly. They'd be crazy not to come onto Google's turf. Their hardware lines are sagging, revenue-growth-wise. Services revenue is their biggest growth area. I have no idea why they didn't do this before.


> They have a lock on mobile devices for the western world.

They don't though. There are only a couple countries where they're more popular than Android, and only barely there. There's no country where they have 60% market share.

https://deviceatlas.com/blog/android-v-ios-market-share


Or just like Iran's nuclear program, they can extract more concessions if they are permanently 1-2 years away from deploying a search engine. Google's spot as the default search in Safari already earns Apple billions per year.


It might look like that for people inside the walled garden, but I suggest applying for an exit visa to have a look around the great wide world out there, where you'll quickly find out that it is a whole lot more diverse than the picture painted inside the walls. Not only is it more diverse; those rumours spread about the constant onslaught of viruses on those poor creatures on the other side of that big, safe wall turn out to be untrue, and most people seem to never have even seen one. Even stranger things will start to become clear, like those supposedly poor and oppressed outsiders needing far less money to get around the world than those inside the walls, while at the same time having a larger choice of methods to navigate it. It wouldn't surprise me if you decided to stay; just don't forget to plan this carefully so your identity does not get stuck on the inside, after which you won't be able to message your friends any more. It is a bit tricky, but it is doable; many others have gone that way before.


Google pays Apple a reported $8 billion a year to be the primary search engine on iOS.

Apple's hardware sales were also up slightly in every category year over year.

https://9to5mac.com/2020/07/30/apple-q3-2020-earnings/


Presumably Google are making quite a bit more on that traffic. Apple might just be eyeing those additional profits.


Through advertising? How would that be any better than what we have now?


It might not, I'm just trying to think of their motives.


Alternatively: Would people really switch back to Google if Apple changed the default search engine on iOS?


If Apple Maps has taught us anything, probably not. But Apple would first need to pour an equally large amount of resources into web search, the way it did for Maps.


The pay-off for doing so (ad revenue) is pretty huge. I'm surprised they have held off for so long.


> The phone boots into an operating system known as “Switchboard,” which has a no-nonsense black background and is intended for testing different functionalities on the phone.

I think the article confuses the meaning of "dev-fused" hardware with what OS is actually installed on the phone. When I used to work at Apple, I always understood "dev-fused" to mean a device on which you could install unsigned builds of iOS.

Internally, Apple puts out new builds of iOS daily. The engineers building features on top of iOS need to install these builds, to do their work. A normal iPhone from a store won't take these unsigned builds, hence the need for these dev-fused devices. There are regular builds like what a customer would get, debug builds with lots of logging and debugging checks enabled, and even bare-bones builds like switchboard, for employees who are not UI-disclosed or work in factories. As someone building higher-level iOS features, all my dev-fused devices just ran a normal looking iOS, unlike what the article describes.

> Two people showed Motherboard how to get root access on the phone we used; it was a trivial process that required using the login: “root” and a default password: “alpine.”

Oh boy, that sure brings back memories!


Specifically, developer-fused hardware allows for stuff like setting boot arguments and having them actually get passed to the kernel. Basically, it lets you get in the way of and modify the "chain of trust" that the bootloader → kernel → userland processes normally ensures.


Thanks for clarifying, I figured I was generalizing it a bit.


To be honest, I think the daily builds are signed by B&I as well, so you can install them on production hardware provided you have valid AppleConnect credentials (which I think just authorize the install). You just won't be able to debug the kernel, etc.


Not true, and just more complex in general.

EDIT: I just looked at some of your other comments. I think you mean well and have some impressive knowledge for someone not working on those things, but some of it is also guesswork about very complex details that even internal people can get wrong, so I think publicly claiming conjecture as if it were fact is more misleading than you mean it to be.


I'm mostly basing my comments on my knowledge of what the jailbreak community has made public so mistakes are likely me misremembering or not fully understanding something. Is there something in particular that I got wrong?


Very impressive indeed. And the GP is right about many employees even getting these little details wrong. The answer is definitely a lot more complicated.

As far as I remember, the AppleConnect aspect of it is only if you want to connect to the corp NFS where they have the IPSW. And beyond that, I think I was able to use PurpleRestore on production silicon by switching the device connected to the host at the right point in time, leaving my phone in a really odd state that shocked everyone at the Apple Store I brought it to. They were so confused that I had to explain to them where I worked for them to calm down.


Oh, I had forgotten all about the codename-disclosed, UI-disclosed, bin-disclosed, src-disclosed distinction.

"dis clos urec heck.co rp.a pple.com" is the most paranoid thing ever too. :)


How about PurpleRestore? :)

I binge-read all of luna and the "other" internal wiki back in the day. :)


I love doing this occasionally; it's just really interesting seeing the internal tools.

Are there any videos or screenshots of PurpleRestore and similar tools? I've searched and can only find a single picture and some descriptions.


Same here; that's why I sometimes wish I had saved some screenshots for my own use or even for sharing, but I have a feeling Apple would have hunted me down for it. That's probably why we don't see so many of them in the wild. Even in orientation you'll hear stories about how seriously they take their ability to surprise and delight, with an emphasis on the surprise. :)

The best source I can find was this: https://www.theiphonewiki.com/wiki/Apple_Internal_Apps

With this fascinating discussion of Apple insiders talking about exactly the same apprehension imprinted in their minds: https://www.theiphonewiki.com/wiki/Talk:Apple_Internal_Apps

Here are some things I remember:

The "purple" series of tools are basically for managing dev-fused iPhones https://www.betaarchive.com/imageupload/2017-02/1487521492.o...

I also remember there being two internal wikis for development and having access to both. Maybe one is called luna and the other is just straight out called purple?

You get root on the device simply by authenticating as root with password alpine. Sometimes you'll get your hands on iDevices with weird specs like 3.75GB of RAM etc.

There is also AppleConnect which is Apple's internal single-sign on.

What I find most fascinating is honestly that I'm unable to find recent screenshots of this software. They are all screenshots of really old versions with outdated UI.

Apple must have a special way of taking these down, or of doing offensive SEO and burying them in results, because while I was once able to find search results for "apple luna internal wiki", I am no longer able to.


PurpleRestore will refuse to work unless you have valid AppleConnect credentials, AFAIK.


I loved PurpleRestore and never would have wanted to go back to the iTunes way of managing my device.


Alas, such is the life of those not blessed with Apple Internal tools…


Do you work for Apple?


Nope.


Why don’t they just sign them using a different key?


I've found the same to be true for Uber Eats. If my order is wrong, they give a full refund no questions asked (and a meal is typically $20-25). And orders get messed up a lot (maybe 1 out of 5). Even something as small as missing ketchup, I'll get half off.


Made me wonder how likely it was that "Funded by YC" was one of the criteria used by the analyst to come up with the list.


It has to be. Many people know how to write software; only a few get ongoing advice from the industry's top advisors and have a line of investors out the door waiting to fund them. King-making is a thing, even if only a partial thing.

