Another statistical analysis of the NYT wedding section, looking at the occurrence frequency of certain characteristics in the NYT wedding announcements relative to their occurrence in the general population: http://www.theatlanticwire.com/entertainment/2011/12/odds-ge....
Given how horizontally expansive RapGenius is trying to be (this is the first time I've seen NewsGenius, but I'm familiar with PoetryGenius et al -- there's an annotated Iliad that's pretty cool), I'm wondering if they'd be better off as a layer or a plugin as opposed to a stand-alone site. I'm much less tempted to visit the site for each individual story that pops up than I would be to peruse the annotations as I browse normally.
Either way -- stuff like this is a delight to read.
My understanding was that RapGenius was always a text annotation platform, with the hip-hop stuff serving as an exemplar, rather than being the actual product.
Always a text annotation platform? That's a bit revisionist from what I've read.
Fred Wilson initially told them "I think lyrics is a very crowded space and almost entirely reliant on Google for traffic" and they admitted "our pitch back then was a bit too lyrics-focused.."
Fair enough! I don't have a citation for my understanding, I just read waaaaay too much HN, and that's what was stuck in my brain. Thanks for clarifying.
A layer is not a good idea from a business point of view in my opinion, given how most websites (Facebook, Twitter with their API changes, Spotify with their new messaging system and so on) are trying to lock you in. Maybe simply build a way to take you back directly to RapGenius, like an annotate-this-on-rg button.
Given that that's the defense people seem to proclaim every time someone mentions that disabling JavaScript is now buried as an arcane config flag in Firefox.
They say: "just use a plug-in", "just use an add-on"...
What they mean to say is: "just don't disable javascript at all... ever."
Most people, especially those who don't know what a plugin is, should emphatically NOT be disabling javascript wholesale and without exceptions in Firefox. And I disagree that it's incredibly hard to google, find NoScript, and follow the directions to install it.
Yes, generally n-gram-based analyses are a huge minefield. Computational linguists do use them, with a lot of caveats and careful analysis of confounding factors.
One simple one that comes to mind here is that you need to analyze to what extent changes over the period of the data set are caused by underlying societal changes, versus changes in the NYT itself; the end result will be a mixture of those two changes, some of which may be magnifying and others offsetting. The 1980 NYT and the 2013 NYT are not the same newspaper, not edited by the same people, not sold to the same readership demographics, and not soliciting the same advertisers, so it's somewhat questionable to treat it as a stable proxy for a social group.
Another common pitfall is language change screwing up all kinds of measures (since n-gram models just work on word counts). For example, if two words are used roughly interchangeably in 1980, but by 1990 one of them has fallen out of usage, and been replaced wholly by the other one, searches for just the one word will look like the word's on an upwards trend, but it would be misleading to infer an increase in the underlying concept over the period. Of course, you can account for this by merging words into equivalence classes (most analyses will do basic stemming and merging of alternate spellings), but you have to be very careful to get all the equivalence classes (which is not a well-defined notion). Just a list of the top words in a year will tend to be some mixture of 1) top concepts; and 2) concepts expressed using only a small number of wording variations, so their count doesn't get diluted.
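To make the equivalence-class point concrete, here is a minimal sketch. The yearly counts and the word mapping are invented for illustration (they are not from the wedding data), and `merge_equivalents` is a hypothetical helper, not part of any particular n-gram tool.

```python
from collections import Counter

# Hypothetical yearly word counts, invented purely for illustration.
counts_by_year = {
    1980: Counter({"physician": 50, "doctor": 50}),
    1990: Counter({"physician": 5, "doctor": 95}),
}

# Hypothetical equivalence classes: each variant maps to one canonical label.
EQUIVALENTS = {"physician": "doctor", "doctor": "doctor"}


def merge_equivalents(counts, mapping):
    """Sum counts of all variants into their canonical label."""
    merged = Counter()
    for word, n in counts.items():
        merged[mapping.get(word, word)] += n
    return merged


# Searching for the single word suggests the concept nearly doubled...
raw_trend = {year: c["doctor"] for year, c in counts_by_year.items()}

# ...while the merged class shows the underlying concept did not move.
merged_trend = {
    year: merge_equivalents(c, EQUIVALENTS)["doctor"]
    for year, c in counts_by_year.items()
}

print(raw_trend)
print(merged_trend)
```

The hard part in practice is the mapping itself, which this sketch simply assumes is already complete.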
Sure it is. 60,000 weddings * 0.02% is an expected number of 12 positive examples, which really isn't much. Assuming a binomial process, n=60000 and p=0.0002 gives a 95% confidence interval of 5.2 to 18.6, which is a really wide range when you want to show trends. I don't know if the percentages are by year, but if they are the issue is even worse.
While the % of Republicans does appear to fall, the % of Democrats in the last year is lower than in the first year, the opposite of the conclusion they want you to draw!
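For anyone who wants to check the interval quoted above for n=60000 and p=0.0002, here is a rough sketch using the normal approximation to the binomial. The function name is made up for illustration, and an exact binomial or Poisson method would give slightly different endpoints.

```python
import math


def binomial_count_interval(n, p, z=1.96):
    """Approximate interval for the number of positives out of n trials.

    Uses the normal approximation: mean n*p, variance n*p*(1-p).
    z=1.96 corresponds to roughly 95% coverage.
    """
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean, mean - z * sd, mean + z * sd


mean, low, high = binomial_count_interval(60_000, 0.0002)
print(f"expected {mean:.1f}, 95% interval {low:.1f} to {high:.1f}")
```

This approximation gives about 12 expected positives with an interval of roughly 5.2 to 18.8, so the count could plausibly triple or fall by half from noise alone.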
If this is like most n-gram analyses, the percentages are of the total corpus, i.e. percentage of words, not articles. So 60,000 articles could be 12,000,000 words and 2400 positives if there are 200 words per article (a SWAG).
>What does the y-axis mean exactly? The y-axis represents the frequency of each phrase, as a percentage of all phrases that contain the same number of words. For example, if you search for from New York, the graph shows the number of times those words appear in exact order, divided by the total number of 3 word phrases in all of the articles
I think doing it at a per-article level makes more sense for an analysis like this, but 0.02% is actually pretty significant when n is on the order of millions.
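Here is a small sketch of the two denominators being compared: the per-phrase frequency described in the quoted y-axis definition, and a per-article frequency. The two sample "articles" are invented, tokenization is a naive whitespace split, and both function names are hypothetical; the site's actual implementation is not known.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_frequency(articles, phrase):
    """Occurrences of the phrase divided by all phrases of the same length."""
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = total = 0
    for text in articles:
        grams = ngrams(text.lower().split(), n)
        total += len(grams)
        hits += sum(1 for g in grams if g == target)
    return hits / total if total else 0.0


def article_frequency(articles, phrase):
    """Fraction of articles that contain the phrase at least once."""
    target = tuple(phrase.lower().split())
    n = len(target)
    containing = sum(
        1 for text in articles if target in ngrams(text.lower().split(), n)
    )
    return containing / len(articles) if articles else 0.0


# Invented sample text, for illustration only.
articles = [
    "the bride is from new york",
    "the groom is from new jersey",
]

per_phrase = phrase_frequency(articles, "from New York")
per_article = article_frequency(articles, "from New York")
print(per_phrase, per_article)
```

The same phrase scores very differently under the two definitions, which is why it matters what the 0.02% is a percentage of.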
So the implication is they took an SEO-friendly subject likely to have plenty of interesting factoids and then went fishing for interesting insights - and wrote a blog post about it. Page three of the Startup-guide-to-SEO-effectiveness.
It could have just barely achieved statistical significance, but it would be hard to draw conclusions from it. Presumably the other 99.98% of families had political preferences, too. We just don't know what they are. And we don't know what caused that tiny percentage to share theirs.
It wouldn't be a bad idea to factor in the number of Democrats vs Republicans holding offices in the area around NYC during that time, either. I know NY state leans Democratic, and Democrats do well in city-level elections. Holding an actual office would probably make you more likely to mention your party.
As you said, caveat 'entertaining' blah, blah. That said...
That's actually a slightly dubious analysis. The question you need to ask is 0.02% of what? In this case, I would guess it means 0.02% of all the words analyzed. As a very simple example, imagine analyzing all the letters in a book. If English were perfectly balanced, we would expect to see each of the 26 letters at 1/26, or about 3.8%, so seeing the letter 'e' appear at, say, 1.0% would be a notable statistical result.
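A quick sketch of the letter example: compute each letter's share of all letters in a text and compare it with the uniform 1/26 baseline. The sample sentence is arbitrary and `letter_shares` is a hypothetical helper; a real test would use a much larger text.

```python
from collections import Counter
import string


def letter_shares(text):
    """Share of each a-z letter among all a-z letters in the text."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: counts[ch] / total for ch in string.ascii_lowercase}


# Baseline if all 26 letters were equally likely: about 3.85%.
uniform = 1 / 26

# Arbitrary sample sentence, for illustration only.
shares = letter_shares("The Times certainly picks and chooses which are printed")

print(f"uniform baseline: {uniform:.2%}")
print(f"share of 'e': {shares['e']:.2%}")
```

The interesting quantity is the gap between a letter's share and the baseline, not the size of the percentage on its own.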
The Times certainly picks and chooses which are printed. That's the whole reason for the interest in the topic. It's a perfect collision between young-adult ambition and old-school establishment vetting.
I understand that, like college admissions, you can hire a wedding planner or consultant who can considerably raise the chances of your wedding being listed.
What is the point of this? Is it some kind of odd "old-school" status thing? The upper class equivalent of being on the 8 o'clock news for a 1 minute interview? Although perhaps this is only for people already investing a huge amount of money into the wedding ceremony?
We used a high-end wedding planner, and when we asked her what to do for the NY Times announcement, she told us to submit via the form on the website. Maybe she pulled some strings behind the scenes, but it seemed to us that she had no pull whatsoever on that front.
I've read these for years and have known people who appeared.
The factors that enter into getting in (from my observation strictly) are a combination of things like:
- parents who live in ny metro
- the parties getting married living in ny metro
- having gone to school in ny metro
- parents or parties getting married working in ny metro
- what the parents do for a living
- any lineage "grandparent governor of NY"
- what the parties getting married do for a living
- school attended as far as perceived impressiveness
- whether any of the parties mentioned has an impressive job or title.
..and so on. That's off the top.
For example, "physician" and "went to school in NY" is probably almost assured to get the announcement printed.
"father a mechanic, mother a homemaker, inlaws are nobodies, parties are cashiers who work at walmart, no college, live in jersey city"[1] and so on either don't get in, don't care to get in, or don't have the drive to even submit a form to get in.
[1] Unless of course one of the parties is related to a famous former politician or some other mitigating factor.
I know someone who hired a planner who helped them get in. I'm not sure how. To be sure, they were somewhat qualified already, but noticeably below the bar.
I've read the wedding announcements on and off since about 1991, but much less these days, because I only get the online edition now.
I don't have any stats, but I know that very few of those who submit are selected. My wife and I received a long form write-up (just under 700 words). The process started with our submission on the website, but then we got a call from a writer a few weeks later. Between talking to us, our parents, and our minister, I'd say he put at least 10 hours into researching and writing.
A day or two before the wedding, he told us that he wrote both a long form piece and a shorter, more typical piece. He wasn't sure which would get published, but he was obviously pushing for the longer piece to get in. It did.
Oddly enough, our write-up isn't included in the Rap Genius dataset. Maybe it's too recent or the longer write-ups aren't included.
Not to be rude or offend you, but, why? You're saying a newspaper spent 10 hours researching and writing a 700-word article about your wedding? Is this a "local flavour" type article, like they might do about assorted residents? Or are your families famous or something? Did you pay for this kind of placement? Perhaps I'm socially inept here, I just feel like I'm missing something.
As far as being in the dataset, surely they aren't analyzing every flavour-style article, just "announcements"? Just like a 2-page life-in-review article on someone famous when they die doesn't really go in the obit section, does it?
I was surprised that we got in and shocked that they put so much time into the article. I can't believe it makes economic sense to put so much into such a short write-up, but I guess it does for The Times.
Together my wife and I check quite a few boxes that the NY Times typically looks for. We aren't famous or all that noteworthy, but we do have an interesting story of how we met. That was the main focus of the article.
They have editorial discretion. I'd be curious to know what the "acceptance rate", so to speak, is... I know for a fact that a couple I know was rejected specifically because one of them was undergoing gender reassignment and the NYT didn't know which personal (gender) pronoun to use for him.
I am very surprised by the prevalence of "was graduated from". I have only rarely heard that in 'real life'. Is the NYT's style guide enforcing this usage?
Agreed. When people say, "I graduated college", they are saying the college graduated from them, not they graduated from college. The correct usage should be, "I graduated from college."
Really interesting - how did you guys download the 60K articles (is there an API I do not know about)? Also - what graphing lib are you using (I see it is not d3)?
You can clearly see the recent tech boom by searching "Google," "Facebook," "Twitter," and "Apple" http://www.weddingcrunchers.com/?q=facebook%2C%20google%2C%2....
The key takeaway here is Google:
Google has raced ahead of establishment NY law firms: http://www.weddingcrunchers.com/?q=wachtell%2C%20cravath%2C%....
Google has also recently overtaken top investment banks: http://www.weddingcrunchers.com/?q=goldman%20sachs%2C%20morg...
Ditto for consulting: http://www.weddingcrunchers.com/?q=mckinsey%2C%20boston%20co....
When do you think Google will start hosting a debutante ball in Chelsea?