GitWikXiv follow-up: An open attribution standard?

[Other posts in this series: 1,2,4.]

My GitWikXiv post on making the academic paper universally collaborative got a lot of good comments. In particular, I recommend reading Ivar Martin, who sees a future of academic writing that is very different from what we have now.

Along a slightly more conventional route, the folks working on Authorea made a good case that they have several of the components needed to allow universal collaboration, and they seem to have a bit of traction.[1] I was asked what it would take to solve the remaining problems by my lights, and I sketched a hypothetical way to let Authorea (which is a for-profit company) interface with the arXiv to enable universal collaboration with proper attribution. The key step would be the introduction of an open attribution file standard that could be agreed upon by the academic community, and especially by the arXiv advisory board. (Note the important distinctions between revisions and attribution, and between collaborative tools and central repositories.) Below I am going to give a more fleshed-out version of that hypothetical, but let me first briefly describe the problem.

Authorea, ShareLaTeX, Overleaf, FidusWriter, and their siblings are tools that allow many (possibly geographically distant) authors to collaborate on a single document. The document exists online and authors can typically make edits, add comments, merge versions, etc., all of which are generally tagged with relevant information such as the date and author name. However, all these tools have in common that they are directed toward the same familiar goal: journal articles.

Authorea gestures toward universal collaboration by allowing new authors to request to be added to a project, but it still keeps articles tied down to individual teams, and introduces large frictions for new authors. (Wikipedia would never have gotten off the ground if a new author needed to get approval from a mod before starting to work on each article.) Furthermore, the evolving document is effectively confined to Authorea. There are nice export abilities to journals and the arXiv, but exporting destroys all the Git-style author contribution information, so authors are very unlikely to release their work under a license that allows other authors to modify it.[2] And although I have only heard good things about the Authorea team, it would not be desirable in the long term to have a platform for universal collaboration that is controlled by a for-profit company (or, for that matter, any other organization whose goals could diverge from those of the academic community) unless the content can easily be migrated.

So as a hypothetical solution, imagine that Authorea and its peers collaborate with the arXiv to agree on an open attribution standard. Under this standard, traditional arXiv articles are attributed to their authors like normal, but new submissions can be augmented with an attribution file that lists the authors who own[3] the article as the primary authors, but which also tracks all the small accepted pull requests submitted by other contributing authors. Authorea projects would automatically maintain an attribution record during the collaboration process, containing at minimum the complete author-tagged revision history.[4] The resulting attribution file would be uploaded alongside the main document when submitted to the arXiv. The arXiv consumes the data in this file, and then when I search for an author on the arXiv I find both the articles that they own and the contributions they have made to articles owned by others. Here it makes sense to use the ORCID standard for uniquely identifying authors, which is already understood by the arXiv.
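
To make the idea concrete, here is a minimal sketch of what such an attribution file might contain. Everything in it is an assumption for illustration: the class and field names (`AttributionFile`, `owners`, `revisions`), the JSON layout, and the placeholder identifiers are hypothetical and are not part of any existing arXiv or Authorea format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Revision:
    """One accepted, author-tagged change (hypothetical schema)."""
    orcid: str       # identifier of the person who made the change
    timestamp: str   # ISO 8601 time of the change
    summary: str     # short human-readable description


@dataclass
class AttributionFile:
    """Record uploaded alongside the main document (hypothetical schema)."""
    owners: List[str]                                   # primary authors
    revisions: List[Revision] = field(default_factory=list)

    def contributors(self) -> List[str]:
        """Identifiers with accepted revisions who are not owners."""
        seen: List[str] = []
        for rev in self.revisions:
            if rev.orcid not in self.owners and rev.orcid not in seen:
                seen.append(rev.orcid)
        return seen

    def to_json(self) -> str:
        """Serialize the whole record, nested revisions included."""
        return json.dumps(asdict(self), indent=2)


# Placeholder identifiers only; these do not refer to real researchers.
record = AttributionFile(
    owners=["0000-0000-0000-0001"],
    revisions=[
        Revision("0000-0000-0000-0001", "2015-05-01T12:00:00Z", "first draft"),
        Revision("0000-0000-0000-0002", "2015-05-03T09:30:00Z", "fix sign error"),
    ],
)
print(record.contributors())
print(record.to_json())
```

The split between `owners` and everyone else who appears in `revisions` is what would let a search distinguish articles an author owns from articles they merely contributed to.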

The prospect of defining an open attribution standard raises several issues.

  • Granularity – It seems reasonable that the attribution file would at least keep a record of all changes at the microscopic level, i.e. atomic edits like replacing “the Earth’s diameter is 6,371 km” with “the Earth’s radius is 6,371 km” in the text, or swapping the order of two paragraphs. Now, you need to be careful about what counts as a substantive change (whitespace edits, for example, presumably should not), and defining the atomic edits is somewhat tangled up with the choice of language (TeX vs. a successor). We can look to GitHub for this, although some things won’t translate from code to prose. But a more complicated question is whether to add additional structure to this information. Basic coarse-grained summaries (e.g., “56 total lines written”) can always be generated from the atomic edits, but tagging edits by topic or purpose (e.g., copyediting versus calculation versus key insights) might need to be incorporated into the standard.
  • Subjective contributions – I’m not sure if it’s feasible, but one might want to allow subjective descriptions of the contribution (“came up with 2 of the 5 key ideas”, “did the main calculation”), perhaps controlled by the article owners. This might be something as simple as the paragraph you sometimes find in Nature articles describing which authors did what work. It’s also very feasible for existing arXiv documents to have their “acknowledgements” sections automatically crawled and contributing authors added retroactively.
  • Human editability – It’s probably asking too much to make the attribution file fully human-readable, but we do want to make it possible for someone to fork a document and work on it with a simple text editor or by pressing one button — no Authorea required. Therefore, there should be a dead simple procedure for tweaking an attribution file to represent “Everything before time T was done by Alice, and everything after was done by Bob.”
  • Proper licensing – The licensing is very important. In my utopia, all academic papers should be released under a Creative Commons Attribution/Share-Alike license or something similar. The arXiv has shown that researchers are already comfortable releasing their work for everyone to see on a read-only basis, and it has become a strong enough norm that journal publishers have been forced to accommodate it. I think the jump to also allowing adaptation is smaller, and could be accomplished without coercion. If the arXiv makes such a license the default upon submission (i.e., a “nudge”), and if folk see lots of examples of their work getting improved on in a positive way while they are still given proper credit, then this should be possible.
  • Use the same tools for articles as for code? – GitHub’s revision control tools are designed for code rather than prose or math. Making edits easily understandable and summarizable is a significant challenge. Ideally one would let folks like Authorea tackle these issues, but to some extent they will affect the attribution standard that the arXiv needs to agree to and maintain.
  • Other questions: How to ensure future extensibility? How to accommodate authors’ (very reasonable) inclination to make private tentative changes that shouldn’t be recorded forever? Do the different LaTeX collaborative tools use revision control formats that are incompatible? How much will people actually care about minor contributions?[5]
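
Two of the points in this list, generating coarse-grained summaries from atomic edits and the dead simple “before time T it was Alice, after it was Bob” rule, can be sketched in a few lines. This is only an illustration under assumptions: the `AtomicEdit` record, its fields, and the function names `lines_by_author` and `attribute_by_cutoff` are hypothetical, and counting added lines is just one of many possible summaries.

```python
from collections import Counter
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class AtomicEdit:
    """A single recorded change (hypothetical schema)."""
    author: str        # e.g. an ORCID iD; empty if not yet attributed
    when: datetime
    lines_added: int
    lines_removed: int


def lines_by_author(edits: List[AtomicEdit]) -> Dict[str, int]:
    """Coarse summary ("N total lines written") derived from atomic edits."""
    totals: Counter = Counter()
    for edit in edits:
        totals[edit.author] += edit.lines_added
    return dict(totals)


def attribute_by_cutoff(edits: List[AtomicEdit], cutoff: datetime,
                        before: str, after: str) -> List[AtomicEdit]:
    """Credit every edit strictly before `cutoff` to `before`, the rest to `after`."""
    return [replace(e, author=before if e.when < cutoff else after)
            for e in edits]


# Unattributed edits, as they might come from a plain text editor workflow.
raw = [
    AtomicEdit("", datetime(2015, 4, 1), lines_added=10, lines_removed=0),
    AtomicEdit("", datetime(2015, 6, 1), lines_added=5, lines_removed=2),
    AtomicEdit("", datetime(2015, 4, 15), lines_added=3, lines_removed=1),
]
tagged = attribute_by_cutoff(raw, datetime(2015, 5, 1), "alice", "bob")
print(lines_by_author(tagged))
```

Because the summary is computed from the atomic edits rather than stored, the standard itself would only need to fix the format of the edits; tagging by topic or purpose would be the part that requires extra fields.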

Let’s assume that these issues can be handled, and we can agree to a reasonable attribution standard. (For the sake of argument, consider a launch with a minimal standard consisting of only atomic edits tagged by ORCID, which could be extended in the future.) Now the arXiv can be populated with collaborative documents alongside traditional documents. This has several advantages:

  • The choice of whether to adopt an ownership model or a free-for-all wiki is determined by the individual projects.[6] The arXiv doesn’t have to force a choice on the community, and different choices can be made for different projects. If the attribution standard is designed with enough forward thinking (or at least future extensibility) it would allow new collaboration styles to develop that no one has thought of yet.
  • The overhaul necessary for the arXiv is very modest compared to, say, hosting a wiki. (This is important because arXiv technical resources are currently spread thin.) In fact, the arXiv might not need to do more than (a) change the default license, (b) index the ORCID numbers in the attribution files (which could be uploaded with the article like any other auxiliary files already can be), (c) tweak the search to display collaborative contributions. Complicated protocols for summarizing an author’s many contributions could be handled separately by someone else as long as the arXiv makes the raw information available through an API.
  • For-profit companies like Authorea can make great contributions to the academic process without being given unreasonable control. Competitors can put out alternative collaboration products which use the same attribution file standard, and they could specialize in different project types, in different academic subjects.
  • When attribution/share-alike licensing becomes universal, a researcher is free to fork any paper on the arXiv and take it in a different direction than the original authors are interested in. However, the attribution file information comes along for the ride (or else the researcher violates the license[7]). Authorea could add a simple interface for automatically downloading arXiv documents that you want to fork.
  • Anyone can check someone’s career contributions by just typing their name into the arXiv. This information is publicly available and maintained by an organization (the arXiv) which is trusted.
  • It builds straight upon the arXiv, so we don’t have to fight network effects and convince everyone to start using a different website. People can keep submitting to the arXiv like before.
  • The possibility for a smooth transition means licensing norms can change smoothly too.
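
If the arXiv were to index the ORCID numbers found in attribution files, one cheap safeguard would be to reject mistyped identifiers before indexing them. ORCID iDs end in a check character computed with the ISO 7064 MOD 11-2 algorithm, so this can be done without any network lookup. The sketch below is an illustration, not arXiv’s actual procedure; the function names `orcid_check_char` and `is_valid_orcid` are my own.

```python
def orcid_check_char(base_digits: str) -> str:
    """Check character for the first 15 digits of an ORCID iD (ISO 7064 MOD 11-2)."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """True if `orcid` has 16 characters (hyphens ignored) and a correct check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    base, check = compact[:15], compact[15].upper()
    if not all(c in "0123456789" for c in base):
        return False
    return orcid_check_char(base) == check


# 0000-0002-1825-0097 is the sample iD ORCID uses for a fictitious researcher.
print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```

A checksum only catches typos; it says nothing about whether the iD belongs to the person claiming the contribution, which would still need to be established by whatever tool produced the attribution file.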

Here are Paul Ginsparg’s thoughts on this:

[An attribution file standard would be] a path forward which permits most of the real work to be done via 3rd party overlay. It’s fairly natural for a document to propagate along with its revision/attribution history (as wikipedia pages have). The hard part remains the chicken-and-egg problem, of finding authors who will engage, and benefit from it. But that can’t be assessed until there is some infrastructure up. I would say some natural JSON or XML formatted file would work perfectly well, and wouldn’t anticipate any technical difficulty parsing out the additional “contributing” authors, and making them visible via the search interface, flagged as “contributing” rather than primary. If it goes that direction, I would recommend nailing down a basically finished standard, then passing it through arXiv to see if any minor refinements would make it easier for them to implement (since it will ultimately have to go through the library development team to set up and maintain).
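
As a rough illustration of the parsing step described in this quote, the sketch below reads JSON attribution files and builds a lookup from author to articles, flagged as “primary” or “contributing”. The JSON layout (`article_id`, `owners`, `revisions`, `orcid`), the function name `index_attribution`, and the placeholder identifiers are all assumptions made for the example; no such format has been agreed upon.

```python
import json
from collections import defaultdict
from typing import Dict, List


def index_attribution(files: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """Map each author iD to the articles they own and those they contributed to."""
    index = defaultdict(lambda: {"primary": [], "contributing": []})
    for raw in files:
        record = json.loads(raw)
        article = record["article_id"]
        owners = set(record["owners"])
        for orcid in record["owners"]:
            index[orcid]["primary"].append(article)
        # Anyone with a revision who is not an owner counts as contributing.
        others = {rev["orcid"] for rev in record.get("revisions", [])} - owners
        for orcid in sorted(others):
            index[orcid]["contributing"].append(article)
    return dict(index)


# Two hypothetical attribution files with placeholder identifiers.
file_a = json.dumps({
    "article_id": "example-0001",
    "owners": ["0000-0000-0000-0001"],
    "revisions": [{"orcid": "0000-0000-0000-0001"},
                  {"orcid": "0000-0000-0000-0002"}],
})
file_b = json.dumps({
    "article_id": "example-0002",
    "owners": ["0000-0000-0000-0002"],
    "revisions": [],
})
index = index_attribution([file_a, file_b])
print(index["0000-0000-0000-0002"])
```

A search interface could then show the “primary” list as it does today and append the “contributing” list with a distinct flag.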

For completeness, let me list some issues that this hypothetical system does not attempt to address, but which should be compatible with an attribution file standard:

  • Smarter citations: The post-publication addition of neglected citations (which the authors missed) and backward citations (for works appearing after publication) would be highly desirable. Likewise for citations with semantics (e.g., does the citing work agree or disagree with the cited work? Does it cite for the purpose of attribution, justification, or both?)
  • Language: The TeX typesetting language ought to eventually be overhauled or replaced. (See caveats above about how this might affect the attribution standard.)
  • Open data: Many folks would like to see all the information necessary to reproduce the results of a paper (e.g., code, or raw experimental data) included alongside the document.
  • Overlay commentary: For discussion that is refined enough to be public but not yet ready to be incorporated into a standalone document, there are many possible systems for commenting on papers through some sort of an overlay.[8] Existing examples can be found in SciRate, ThinkLab, and PubPeer.

[Edited 2015-5-11 to better introduce Authorea and explain potential issues.]

Footnotes

  1. More generally, the comments on the post gave me the impression that lots of people are working on tools, but not many people are working on open standards. (This isn’t surprising, since software tools are a lot easier for a handful of people to develop.) It may be that a lot of the social/cultural obstacles (in contrast to technical ones), which we all seem to agree are the most difficult, aren’t actually mental problems so much as coordination problems. In other words, it might have less to do with old researchers being set in their ways than with tragedy-of-the-commons-type obstacles. So maybe there should be more focus on open standards like ORCID, smart citations, data accessibility, and an attribution standard like the one I discuss here.
  2. You could release it under a license which requires linking back to the original version, like Creative Commons share-alike, but this throws out essentially all the attribution information except the authors’ names.
  3. Note that I am using “own” in the sense of wiki/document ownership, and am equating it with the authors who are listed on the abstract page for the article. This is distinct from the arXiv-specific terminology of “own” which has to do with who can modify submissions.
  4. For examples of what a revision (‘commit’) history looks like, see the commit history for a single GitHub file or the diff view of a single revision.
  5. In an extreme example, you could imagine academics settling into a new equilibrium very close to our own with one difference: most minor (i.e., non-primary author) contributions are neglected for the purposes of career advancement except a very small number of “ah ha!” moments where the contributing author has a major insight and quickly attaches it to existing work as a pull request. In this case, the attribution standard would serve only as a way to establish priority for those rare moments, but would otherwise not be as useful (at least for career advancement). Like the showdown in poker, this could have a big impact on the way the field operates even if the actual events are rare. Researchers might assist others much more freely on a day-to-day basis if they know that, should they happen upon a monumental contribution, it will be attributed to them. On the negative side, researchers might flood the literature with minor unimportant suggestions in the hopes that one of those Hail Marys lands.
  6. I conjecture that the optimal strength of ownership will be inversely proportional to how well-accepted the content is. New speculative work will continue to be done by one or a handful of authors (although hopefully these documents will evolve to have less overhead in terms of verbose introductions, etc.). Niche reviews will be maintained by the relevant community with pull requests from outsiders. Large reviews may approach free-for-all wikis.
  7. Technically, I suppose it would take a brand new license, distinct from Creative Commons, to make the requirement to keep the attribution file (rather than merely pointing back to an old version of the document) legally enforceable. However, as long as there is a pointer then it will always be possible in principle to trace things back, and community norms should be sufficient to ensure that essentially all documents really do keep their full attribution file.
  8. Although there are clearly some benefits from centralizing scientific discussion, this seems much less pressing to me than the ability to universally collaborate on articles. I can imagine that multiple competing venues is the optimal solution.