OK, that's a good start, but I hope you're devoting a fair amount of effort to improving your decision algorithm. At the moment, it considers tretz.com to be 60% pronounceable, but llmyw.com, qenis.com and domyh.com to be 100%. I can kind of see how a naive algorithm would calculate that, and I'm pretty sure you can tweak it to make it better. Bonus points for using some sort of neural net or Bayesian technique to improve scores, perhaps with a button next to the score to allow people to adjust it.
Here's a Python program that builds a Markov chain model of a corpus (with Laplace smoothing) and then calculates the probability of a domain name being generated by that Markov chain.
from collections import Counter
from math import log

corpus = """training corpus here"""
n = 1  # token size: letters of context the chain remembers

def trans(w):
    # yield (context, next letter) pairs
    return ((w[i:i+n], w[i+n]) for i in range(len(w) - n))

tokens = Counter(t for t, l in trans(corpus))
transitions = Counter(trans(corpus))

def score(w):
    # total log-probability under the chain, with Laplace smoothing
    return sum(log(transitions[t, l] + 1) - log(tokens[t] + 26**n)
               for t, l in trans(w))

# surrounding spaces mark word boundaries
for w in [" llmyw ", " domyh ", " tretz ", " qenis ", " debts "]:
    print(w, ':', score(w))
When I use the European sovereign-debt crisis Wikipedia article as the corpus, I get
Higher numbers are considered more pronounceable, so that's in order of pronounceability. Note that scores of words of different lengths are not comparable.
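If the length issue matters, one way around it is to normalize by the number of transitions. A minimal sketch building on `score` above (the per-transition average is my own tweak, not part of the original program):

def norm_score(w):
    # average log-probability per transition, so words of different
    # lengths land on a roughly comparable scale
    k = len(w) - n
    return score(w) / k if k > 0 else float('-inf')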
The variable `n` sets the number of letters of memory the Markov chain has. If you set `n` higher, you need a bigger corpus. If you set n=2, get a larger corpus, and preprocess it to filter out anything that isn't [a-z]+, it would probably work fairly well (I just copied the article in as is).
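A minimal sketch of that preprocessing step (the corpus.txt filename is my invention; point it at whatever text you have):

import re

def preprocess(text):
    # lowercase, keep only runs of a-z, rejoin with single spaces
    # so spaces still act as word boundaries for the chain
    return ' '.join(re.findall(r'[a-z]+', text.lower()))

corpus = preprocess(open('corpus.txt').read())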
Doing a naive dictionary lookup would also help your algorithm, e.g. limoship.com should rank higher than leonadare.com in my opinion.
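A rough sketch of what such a lookup could look like (the word-list path and the minimum word length of 3 are assumptions):

# assumes a Unix word list; any flat file of words would do
words = {w.strip().lower() for w in open('/usr/share/dict/words')
         if len(w.strip()) >= 3}

def dict_bonus(name):
    # fraction of the name's letters covered by dictionary words
    covered = set()
    for i in range(len(name)):
        for j in range(i + 3, len(name) + 1):
            if name[i:j] in words:
                covered.update(range(i, j))
    return len(covered) / len(name)

print(dict_bonus('limoship'))   # 'limo' + 'ship' should cover all of it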
Cool website, though! I've always wondered how often decent domain names lapse.
I'm fairly sure qenis's high ranking is the result of using Levenshtein distance on dictionary words. You want phonetics-based analysis instead. Then again, qenis isn't that hard to pronounce either, so I don't know.
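For the phonetic side, something like Soundex is the classic starting point. Here's a simplified version (the real algorithm has a couple of extra edge-case rules, and libraries like jellyfish ship proper Soundex and Metaphone implementations):

def soundex(word):
    # first letter kept, consonants mapped to digits, duplicates collapsed
    codes = {c: str(d) for d, letters in enumerate(
        ['bfpv', 'cgjkqsxz', 'dt', 'l', 'mn', 'r'], 1) for c in letters}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0])
    for c in word[1:]:
        d = codes.get(c)
        if d and d != prev:
            out += d
        if c not in 'hw':   # h and w don't break up duplicate codes
            prev = d
    return (out + '000')[:4]

print(soundex('qenis'), soundex('tennis'))  # Q520 vs T520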
LeonaDare.com would be an exact match for a name check, whilst limoship would be a made-up word but pretty good for a local limo-to-port service or a high-class pleasure cruise or something.
I quite like the idea. However, of the roughly 10 five-letter domains I checked, only one was actually available. You should probably use a different, more reliable way of checking availability.
Great site, excellent idea. If you want to make this useful to a slightly larger audience, you could add PageRank and backlink information for the domains. PageRank is simple enough to determine (there are libraries that check it against the Google Toolbar). For backlinks you can easily create a link to reports on majesticseo.com or a similar service.
I think the more important question is: how do you find domains that are about to expire? I don't know what the domain hoarders do, but here is what I do.
I discovered that pool.com maintains a list of domains that are set to expire. I download and filter the list, then email myself the domains that match my requirements (.coms under a certain length, no numbers or other funny characters, maybe .coms containing a specific word). I actually just wrote this script; it had been on my to-do list for over a year. The daily email contains hundreds of domains, so I might have to filter it more.
https://gist.github.com/3914495
Here's my script. It only uses PHP to get tomorrow's date; otherwise it's standard Linux utilities like wget, egrep, unzip, cut, sed...
I have it set up as a daily cron job.
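For anyone who'd rather do the whole thing in Python, a minimal sketch of the same download-and-filter idea (the URL is a placeholder; pool.com's actual list location and file layout will differ):

import datetime, io, re, urllib.request, zipfile

tomorrow = datetime.date.today() + datetime.timedelta(days=1)
url = f'https://example.com/expiring-{tomorrow:%Y-%m-%d}.zip'  # placeholder URL

data = urllib.request.urlopen(url).read()
with zipfile.ZipFile(io.BytesIO(data)) as z:
    # assumes the archive holds one plain-text file, one domain per line
    domains = z.read(z.namelist()[0]).decode().splitlines()

# short, letters-only .coms, mirroring the egrep step
wanted = [d for d in domains if re.fullmatch(r'[a-z]{1,6}\.com', d)]
print('\n'.join(wanted))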
I'm interested in suggestions on how to snipe/reserve/etc. domains as soon as they become available.
Posted the question to Stack Overflow[1]. I debated with myself whether to post there or on Server Fault, but here we are. Maybe it will be closed by the powers that be, as I'm not sure of the focus (I do want a programmable solution, i.e. an API).
I've done a little bit of domaining myself, and the first thing to take away from any list is that recently dropped doesn't always equal available. There are also people running massive lists who perform thousands of buy requests per second across a large number of domain name sellers to ensure they get to purchase a domain as soon as it's released.
In terms of obtaining information, most whois queries can be performed via command-line utilities... so to start you off, here is a good list of whois servers (http://code.google.com/p/whois-servers-list/). Also, check out each service: some allow queries that return only true or false for whether a name is registered, and you can generally make a lot more of these requests than complete lookups (without being IP blacklisted).
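The raw WHOIS protocol is simple enough to speak directly: open TCP port 43 and send the name. A minimal sketch against the .com registry server (the 'No match' check reflects Verisign's response format, so treat it as an assumption to verify):

import socket

def whois(domain, server='whois.verisign-grs.com'):
    # raw WHOIS: connect on port 43, send the query, read until EOF
    with socket.create_connection((server, 43), timeout=10) as s:
        s.sendall((domain + '\r\n').encode())
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b''.join(chunks).decode(errors='replace')

print('No match' in whois('some-hopefully-free-name.com'))  # True if unregistered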
Finally, in terms of building and managing an index, I believe manual crawling is the only option available... start with dictionary terms and work outward.
Some companies, such as GoDaddy, have access to DNS zone files; these are then resold to even more companies. Look at my reply to this article to see where you can get a list for free.
To everyone looking for expiring/dropping domain lists: I've been building an app called dropparser that uses a free source (though I don't see it mentioned in here yet). Some Python code for you:
Great idea!
However, I have a problem: let's say I want all the domains starting with the letter "C" and at most 10 characters long. The site tells me "Limit of 1500 results reached. Narrow your search parameters to view more results", but there is no way for me to narrow my search parameters while still getting all the domains starting with "C" with 10 characters max.
Does anybody have a hint, or is this a bug?
More useful than pronounceability would be spellability. I'd score domains with only a single plausible spelling higher, e.g. twang.com over base.com (which a listener might spell bass.com). To do this, the algorithm would just need to scan for graphemes that share a common phoneme with other graphemes and weight those lower.
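A rough sketch of that idea, using a tiny hand-made table of grapheme swaps that preserve a phoneme (a real version would derive these from a pronunciation dictionary like CMUdict, and the word-list path is an assumption):

# grapheme pairs that can spell the same sound; illustrative, not complete
EQUIV = [('ase', 'ass'), ('ight', 'ite'), ('ph', 'f'), ('ee', 'ea')]

words = {w.strip().lower() for w in open('/usr/share/dict/words')}

def alt_spellings(name):
    # respellings of the name that are themselves words, i.e. ways a
    # listener could plausibly spell it after hearing it
    alts = set()
    for a, b in EQUIV:
        for x, y in ((a, b), (b, a)):
            if x in name:
                alts.add(name.replace(x, y))
    return {w for w in alts if w in words and w != name}

print(alt_spellings('base'))   # likely {'bass'} with the table above
print(alt_spellings('twang'))  # likely empty: only one spelling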
Another way to say this could be, "The site does not render well..." That would take the edge off your criticism, and be a more polite way to help your fellow hacker.