Hacker News
Now I am become DOI, destroyer of gatekeeping worlds (thewinnower.com)
73 points by jmnicholson on Feb 20, 2016 | 17 comments



> The injunction against citing DOI-less documents is unfortunate, because people deserve to get credit for the interesting things they say–and it turns out that they have, on rare occasion, been known to say interesting things in formats other than the traditional peer-reviewed journal article.

Ain't that the truth. There's a paper that came out a few years ago which said that new method B is 3x faster than old method A. That's well and good. But there were two widely-used free software projects which had done the same thing, blogged about it, and in one case still had the old code available as a #define option, while the published article was based on a proprietary in-house package.

I pointed out the two previous reports to one of the authors, who had heard of one but felt that, because it wasn't an academic paper, it wasn't worth citing. Grr! Because free software developers really have the time and money to write a paper for the $1,000 per pop open access journal in the field.

I've been reading old literature, from the 1980s and older. There used to be a "Letters to the Editor" section, which was sometimes used to point out or correct these sorts of issues. No more. That same open access journal says my only options are to write a comment in their comments section (which, unlike an old Letter to the Editor, no one reads, is not indexed, and is not citable), or submit an entirely new article, at $1,000.

The other thing a requirement for a DOI does is prevent citing the older literature, which doesn't always have a DOI.

> [Quoting Neuron's author guidelines] "References should include only articles that are published or in press. For references to in press articles, please confirm with the cited journal that the article is in fact accepted and in press and include a DOI number and online publication date. Unpublished data, submitted manuscripts, abstracts, and personal communications should be cited within the text only."

This would make certain publications impossible. For example, in my historical research, I tracked down the paper "Ciphering Structural Formulas – the Zatopleg System", Zator Technical Bulletin No. 59 (1951). This was cited by many chemical information papers in the 1950s and 1960s, and in some patents. However, the only copy I could find was in the Mooers archive at the Charles Babbage Institute. (Mooers founded Zator.) The archive controls the copyright, and lets people make copies for research purposes, so long as the archive/box/folder information is included in the citation.

Which is why if you look at a paper like http://trafficways.org/ascii/ascii.pdf it has citations like:

> Thomas E. Kurtz, letter to Secretary, X3, December 21, 1965, Calvin N. Mooers Papers (CBI 81), Charles Babbage Institute, University of Minnesota, Minneapolis, box 20, folder 1.

Try doing that "within the text only".


My PhD advisor had a paper which was rejected by a few major journals, so he decided to keep it as a preprint and see how many citations it could acquire. It has hundreds now.

These days he'd stick it on arXiv. arXiv doesn't appear to issue DOIs; you're supposed to add the DOI from a real journal after the preprint gets published.


http://zenodo.org will issue a free DOI for any GitHub release you tell it to. :)

See https://guides.github.com/activities/citable-code


I was about to suggest the building of this very sort of thing. Glad to know it already exists.


There's also figshare.com where you can publish data, papers or whatever really. I published code there so my wife could cite it and we do public data releases on there too.

Disclaimer: I work for Digital Science, a parent company.


Rules against citing DOI-less content are easier to understand once you understand a bit more about life within an academic publishing house.

An article with a DOI has more longevity. This is mentioned in the post, but I don't think the author fully grasps the importance of this. Plain URLs break down all the damn time; publishers/journals redesign their sites and migrate between technology partners often. It only takes a few Very Important People to flip out about a missing piece of data or a broken link before a publisher will start to put safeguards into the production workflow to prevent such complaints. Mandating DOIs alongside citations is an easy policy toward that goal. An article with a DOI will almost surely be supported by a technology back-end that can keep that DOI up-to-date, generate manifests for (C)LOCKSS, submit materials to archival services like Portico, etc.

DOI-bound citations are much more efficient throughout the publishing process. Editors can use DOIs to quickly verify and proof citations. If a publisher's workflow requires sending papers to an XML composition vendor for tagging, DOIs can make that process more automated and accurate. Some of the more sophisticated XML vendors can generate all the citations given only the DOIs. This might not sound like much, but it's notoriously easy to make mistakes on citations throughout the editorial process, even with NLP and other more enlightened software in the mix. If you ever see an article with a citation list like this...

* Suzuki, et al. 2004...

* Suzuki, et al. 2006a...

* Suzuki, et al. 2006b...

* Suzuki, et al. 2006d...

* Suzuki, et al. 2006f...

* Suzuki, et al. 2006g...

* Suzuki, et al. 2007...

* Suzuki, et al. 2008b...

...think about all the double/triple/quadruple checking bullshit that goes into making sure all the text and links in that list are accurate, and know that DOIs take most of that headache away. It's substantially more efficient at scale when DOIs are included.
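As a rough illustration of the "citations from DOIs alone" idea, here is a minimal Python sketch that asks the doi.org resolver for a formatted citation through HTTP content negotiation. The function names, the APA style choice, and the timeout are assumptions made for illustration, not a description of any vendor's actual pipeline, and not every DOI registration agency is guaranteed to honour the request.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Public DOI resolver; everything below is a hypothetical helper, not a vendor API.
DOI_RESOLVER = "https://doi.org/"


def build_citation_request(doi, style="apa"):
    """Build (but do not send) a request for a formatted citation of `doi`."""
    url = DOI_RESOLVER + quote(doi, safe="/")
    # The Accept header asks for a bibliography entry instead of the landing page.
    accept = f"text/x-bibliography; style={style}"
    return Request(url, headers={"Accept": accept})


def fetch_citation(doi, style="apa", timeout=10):
    """Send the request and return the citation text (needs network access)."""
    request = build_citation_request(doi, style)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

Calling `fetch_citation("10.1021/c160038a601")` would attempt the lookup for the DOI mentioned elsewhere in this thread; what comes back depends entirely on the metadata the publisher registered.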

Recently I've been thinking about how some of these problems could potentially be solved via new ideas like IPFS. A Handle-based system like DOI might not be as necessary in IPFS (given IPNS), and perhaps you could start creating and citing more diverse sources of content with some confidence in preservation/longevity. It could be a great fit for open access content.


I can understand why a publishing house would find the DOI to be useful. I can understand why a publishing house would want to encourage people to use a DOI.

I do not understand why there are rules against citing DOI-less content, as that would seem to place the needs of the publishing house far above the needs for good scholarship.

Elsewhere in this discussion I gave an example of a citation without a DOI from http://trafficways.org/ascii/ascii.pdf :

> Thomas E. Kurtz, letter to Secretary, X3, December 21, 1965, Calvin N. Mooers Papers (CBI 81), Charles Babbage Institute, University of Minnesota, Minneapolis, box 20, folder 1.

That paper has dozens of DOI-less references, like meeting minutes and memos, which are stored in archives like the CBI or the Smithsonian.

How does a paper like that get published in a journal which has rules against citing DOI-less content?


> I do not understand why there are rules against citing DOI-less content, as that would seem to place the needs of the publishing house far above the needs for good scholarship.

Absolutely. Welcome to the industry!

But the publisher's concerns are not without merit. Your cited reference is useless if nobody can find it in a couple years. I'd argue that any article which does not take steps to preserve the availability and connectivity of its sources is neglecting a critical component of "good scholarship", even though this often has nothing to do with the authors or research methodology. (There are, of course, countless justifiable exceptions to this.)

There are creative solutions though. A good journal would take that PDF in your comment, get permission from the author, and package it up with the article as "supplementary data" or some other auxiliary material. A good publisher will find ways to otherwise publish critical outside materials.

We are stuck at an awkward time in digital scholarship. In the old days, you could cite any published material because everything was on paper, and everything was distributed to a thousand libraries, and you could reasonably expect the availability of everything, and there was no expectation that everything would be available instantly. You can't put any of that confidence in naked web pages; URLs are fatally unreliable for this purpose. DOIs, and confidence in the tech brokers that use them, help. And with time, as more and different types of online content become more reliable, it will become easier to cite anything like the old days.


Welcome to the world of scholarship!

I thought I went out of my way to say the publisher's concerns have merit, so I feel somewhat insulted by your bringing it up as you did.

> Your cited reference is useless if nobody can find it in a couple years.

Nonsense. I told you exactly where to find it. It's in a box in the archives at the Charles Babbage Institute. Even more, I just visited that archive last month, and if I didn't look through that box I looked through its neighboring boxes. You can also request a copy for yourself. The CBI charges $0.25/page.

Do you believe that the Smithsonian and other archives are not a valid part of "preserv[ing] the availability and connectivity of its sources"? Not only did I pay a copy fee, but I gave extra money to the CBI as a gift - does that not count?

> ... if nobody can find it in a couple years

You speak from the point of view of a publishing company.

Many organizations publish besides publishing companies. IBM has (or had) their own corporate journals, both in-house and public. One of the common references in my field is: Tanimoto, T. (17 Nov 1958). "An Elementary Mathematical theory of Classification and Prediction". Internal IBM Technical Report 1957. That's certainly not a peer-reviewed journal!

Google Scholar says it knows of 222 citations to this report. I have a copy that I ordered from the Niedersächsische Staats- und Universitätsbibliothek Göttingen. Up until I changed it a few weeks ago, the relevant Wikipedia entry said the report was 'unavailable' ( https://en.wikipedia.org/w/index.php?title=Jaccard_index&typ... ). Yes, I wonder how many of the authors of those 222+ papers actually read the original paper.

If there were DOIs in the 1950s, do you think this document would be more findable now? Perhaps with IBM, yes. But there are many small companies which also publish their own papers as their own press. (I can provide examples if needed.)

So if my company starts its own press, publishes papers with its own DOIs, then goes under, shuts down the servers, and doesn't bother to transfer copyright to a new provider ... just how will anyone be able to resolve those DOIs and find those documents in a couple years?

> A good journal would take that PDF in your comment, get permission from the author, and package it up with the article as "supplementary data"

Please, yes! I would love all of the journals to take charge of digitizing the world's archival data and make it more accessible! Many of the documents I looked at were untitled, and in cursive, and I can't begin to estimate the cost of indexing all of those properly.

Which journals do this, so I can publish in them? ... Or are there no good journals?

To be more specific, I want to cite pages 61-63 of the Work Book I for Calvin N. Mooers, located in box 13 folder 4 of archive CBI 81. This 1947 entry is a precursor to his 1951 Zatopleg paper and shows that there's no connection to Wheland's 1949 connection matrix proposal. The sketch on p62 shows that he thought it had practical application. The double counter-signature on p63 shows that Mooers likely considered this a patentable idea. For a historian of science, this is quite useful.

Will the journal arrange the scans and the copyright licensing details for me? What happens if they and the archive can't come to an agreement on the copyright terms?

Will they scan only those pages, or the entire notebook? (Since there's more I'll want to cite in the future, it makes sense that they make a bulk request.)

If they scan the entire work book, will they make sure to remove the Social Security numbers of living people and other potentially sensitive information?

How much will this service cost me, vs. a standard fair-use quote that nearly all journals now would accept?

However, that doesn't help with the citation I gave, which is to a letter by Thomas E. Kurtz which is in the CBI archive for Mooers' papers.

The CBI archived Mooers' papers. These include works by Mooers and works by others. They control the copyright to Mooers' own works, but not those of the others. They do not have the ability to let people use Kurtz's letter beyond the research purposes allowed by fair use.

Which means no journal in the world (or at least those parts signatory to the Berne Convention) can do as you suggest. Not even good ones.

Should a paper not be published unless it's possible to do what you want, regarding DOI and some incompletely specified definition of what counts as sufficient permanence?

> We are stuck at an awkward time in digital scholarship

All interesting points to consider, but when haven't we been at an awkward time in publishing for the industry and for scholarship? I mean, I read a paper recently from the 1930s concerning the question of scientific priority when the paper is published only on microfilm. It asked questions like, does a microfilm-only publication really count as a publication?

At https://www.asis.org/Bulletin/Bulletin-50thAnniversary-Watso... you can see the founder of one such microfilm service. People could send in camera-ready auxiliary publications to be microfilmed and published on demand for those who wanted them. If someone sent in a document, which was listed on the ADI list of available documents, does that count as being published?

Replace "auxiliary publication" with "blog" and you'll see it's pretty much the same debate.


>> We are stuck at an awkward time in digital scholarship

I'll add to that. It will still be an awkward time in 20-40 years. Let's say that "Professor X" is a pioneer in a field, and her computer, which has her last 40 years of activity (email, source code, draft documents, downloads, etc.) on it is placed in an archive.

Now I, as an historian of science, go through that blob and figure out the timeline for some event, which is based on recorded IRC sessions, git commit messages, etc.

How do I cite those individual locations in the archival data?


Correct me if I'm wrong, but IPFS intends to be torrent-like in terms of re-distribution, so you can no more guarantee that the article can be found there than with ordinary torrents, which I think are hardly more reliable than a URL.


The difference is that anyone can "pin" any content in IPFS. So if an article cites any IPFS resource, it can also pin those resources regardless of whether or not they're academic, created by the same publisher, etc. IPFS would make it aggressively easy for any publisher to take responsibility for archiving wayward cited materials.
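To make the pinning argument concrete, here is a toy Python sketch of content addressing. It is a simplification under stated assumptions: the `Node` class, `content_id`, and `fetch` are hypothetical names, and a plain SHA-256 hex digest stands in for a real IPFS content identifier (which uses a different encoding). The point it illustrates is only that an address derived from the bytes themselves lets any party keep a copy alive.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Toy content address: a SHA-256 hex digest (real IPFS CIDs differ)."""
    return hashlib.sha256(data).hexdigest()


class Node:
    """Hypothetical participant that keeps whatever content it has pinned."""

    def __init__(self, name):
        self.name = name
        self.pinned = {}

    def pin(self, data: bytes) -> str:
        # Pinning stores the bytes under their own content address.
        cid = content_id(data)
        self.pinned[cid] = data
        return cid

    def get(self, cid):
        return self.pinned.get(cid)


def fetch(cid, nodes):
    """Return the content from the first node that holds a valid copy."""
    for node in nodes:
        data = node.get(cid)
        # Re-hash to confirm the bytes really match the address requested.
        if data is not None and content_id(data) == cid:
            return data
    return None
```

In this model, if an author's node drops a cited document but a publisher's node pinned the same bytes, `fetch` still finds it under the unchanged address; if nobody pinned it, nothing can be recovered, which is the torrent-like risk raised above.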


I see there being two main arguments, the first we agree on, permanence. That's really important when it comes to looking things up.

The second however is not quality. I really don't understand that point at all. The reason it all sounds so ridiculous as they continue to describe it is, I think, because it doesn't make sense. Anyone can get a doi for anything they want really.

No, the second thing is for analysis. DOIs mean you can automatically resolve papers and data. They're unique and unambiguous and easy to detect. Handwritten citations are not always possible to figure out even with a human and certainly not automatically at 100%.

Imagine trying to build something like Google when pages don't have URLs but just describe how a person might go and find what they're looking for manually. That's what people are trying to deal with now.
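A small Python sketch of what "easy to detect" can look like in practice. The regular expression is an assumption: it follows a commonly used pattern for modern DOIs and will miss some legacy forms, and stripping trailing punctuation is a heuristic that could clip a DOI that legitimately ends in one of those characters. The function name is hypothetical.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text):
    """Return the distinct DOIs found in `text`, in order of first appearance."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Drop sentence punctuation glued to the end; DOIs are case-insensitive.
        doi = match.group(0).rstrip(".,;:)").lower()
        if doi not in found:
            found.append(doi)
    return found
```

Run over a free-form archival citation such as the Kurtz letter quoted in this thread, the same function finds nothing, which is the asymmetry being described: the identifier is machine-detectable, the prose citation is not.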


I think you interpreted the essay as an argument for or against DOIs, which it isn't.

The two arguments in the essay deal with the sentence: "I think there are two main reasons for the disdain many people seem to feel at the thought of allowing authors to freely cite DOI-less objects in academic papers."

As you say, "Anyone can get a doi for anything they want really." This essay describes a service for doing exactly that.

The essay argues why "quality" is not relevant, which I think you agree with.

Switching topics, you write that DOIs are "unique and unambiguous and easy to detect."

They are supposed to be unambiguous. In practice, there are errors. Consider 10.1021/c160038a601 available from http://pubs.acs.org/doi/abs/10.1021/c160038a601 . It's a letter to the editor titled "A Utility Analysis for the MCC Topological Screen System" and supposedly by F. Bruce Sanford. If you look at the page you'll see there are two letters to the editor. Hamilton is the actual author of that letter, while Sanford is the author of a different letter to the editor.

They are also supposed to be unique. However, given that "Anyone can get a doi for anything they want really", someone can use multiple DOI services to get multiple DOIs for the same thing.


"The right perspective, I would argue, is to embrace the benefits of technology and seek out new evaluation models that emphasize open, collaborative review by the community as a whole instead of closed pro forma review by two or three semi-randomly selected experts."

Can't agree more with that. The problem is that it's hard to change the system that has already been built around gaming this status quo.

Maybe some sort of blockchain for publications & peer reviews that could decentralize things and break with the current world ruled by a few publishers? Anyway, I have a feeling that any system here could be gamed too...


Worth reading. Interesting perspective and useful information, conveyed in good writing.

I think having the feedback at the bottom of the page say reviews instead of comments was a nice touch.


This is a perfect job for ipfs.



