Beyond papers – GitWikXiv

[Other posts in this series: 2,3,4.]

I had the chance to have dinner tonight with Paul Ginsparg of arXiv fame, and he graciously gave me some feedback on a very speculative idea that I’ve been kicking around: augmenting — or even replacing — the current academic article model with collaborative documents.

Even after years of mulling it over, my thoughts on this aren’t fully formed. But I thought I’d share my thinking, however incomplete, after incorporating Paul’s commentary while it is still fresh in my memory. First, let me start with some of the motivating problems as I see them:

  • People still reference papers from 40 years ago for key calculations (not just for historical interest or apportioning credit). These papers often have typesetting so poor that they are hard to read, and they lack machine-readable text, URL links, etc.
  • Getting oriented on a topic often requires reading a dozen or more scattered papers with varying notation, where the key advances (as judged with hindsight) are mixed in with material that is much less important.
  • More specifically, papers sometimes have a small crucial idea that is buried in tangential details having to do with that particular author’s use for the idea, even if the idea has grown way beyond the author.
  • Some authors could contribute the key idea, but others could contribute clarity of thought, or make connections to other fields. In general these people may not know each other, or be able to easily collaborate.
  • There aren’t enough good review articles.[1][2] The gap between textbook and review articles is way too large for anyone looking far from their area of expertise.
  • (Added:) Papers are missing important links (citations) to related work, either because the related work came after publication or because the authors weren’t aware of it. Backwards citation services (“cited by” lists) only fix a small part of this.

The root problem is that academics are writing the same sort of fixed documents, either alone or with a tiny number of collaborators, that they have been writing for centuries.[3] The natural solution is to make academic articles universally collaborative. Experience has shown that allowing many authors to contribute to the same document can let them produce something none were capable of individually. Needless to say, this would be a monumental shift.[4] Let’s list some existing examples we might copy.

  • Wikipedia: This is the stupendously successful experiment that no one in their right mind would have predicted to work. Importantly, it shows that people will sometimes contribute to a collaborative document with no material reward whatsoever. However, the popular accessibility means that the number of readers per author needed is probably much less than in academia.
  • Scholarpedia: Like Wikipedia, but intended for academics. It uses a page ownership model like the defunct Wikipedia cousins Nupedia and Citizendium. Appointed experts own pages in their area of expertise. Other users can submit edits, but they must be approved by the owner. Requires more top-down administration than Wikipedia, and is currently somewhat sparse, but does have some very nice articles by top experts.
  • Knowen: A new upstart whose lofty aim is to be the Wikipedia of scientific knowledge. Claims to cover course notes, introductory texts, and new research all in the same framework (example). At the moment, just a skeleton. (Added: Knowen’s creator Ivar Martin has an illuminating comment below.)
  • Polymath: A “group blog” that has achieved some modest success obtaining novel mathematical results through an online discussion and writing format.
  • The StackExchange family of websites: The best question-and-answer forum on the web. They have demonstrated that properly constructed online norms and discussion format can have a huge positive impact compared to a free-for-all.
  • GitHub: The gold standard in collaborative editing of software, and an important tool for open source. Very detailed tracking of contributions by authors (example), which is useful for setting up incentives.

A big contingent question is whether academic physics should stick with TeX or move to a different document processing standard.[5] After all, some standards are going to be vastly easier than others for tracking changes, merging different versions, etc. In this post I’m just going to put that issue aside. (You can imagine we’re sticking with TeX if you like.)

By looking at the examples above, we can identify at least three distinct models that might be adopted:

  1. Ownership. Papers are posted to the arXiv just like now, with some fixed list of authors. Changes can be suggested (a la GitHub, Scholarpedia, StackExchange) that the original authors can approve or reject.[6]
  2. Forking. Papers are posted to the arXiv just like now, but anyone can make changes and fork the new version as a separate paper without affecting the original posting.
  3. Open-wiki. Papers are constructed from scratch like on Wikipedia. The earliest version of an article need not stand on its own, but might (say) only have an outline. Free-for-all authorship.

In principle, all of these models might co-exist with each other and with traditional papers, although I imagine some would lose out eventually. The ownership method is probably the smallest change and the one that academics would have least difficulty adjusting to.
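
To make the differences between the three models concrete, here is a minimal Python sketch of how a single edit might be handled under each one. Everything in it is hypothetical: the names `Model`, `Paper`, and `submit_edit` are made up for illustration, and the choice to carry the original authors over to a fork (with the editor listed as a contributor) is just one assumption among many possible attribution rules.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Model(Enum):
    OWNERSHIP = "ownership"  # edits need approval from an original author
    FORKING = "forking"      # edits create a separate paper
    OPEN_WIKI = "open-wiki"  # anyone edits the paper in place


@dataclass
class Paper:
    title: str
    authors: List[str]
    text: str
    model: Model
    contributors: List[str] = field(default_factory=list)
    parent: Optional["Paper"] = None  # set only on forks


def submit_edit(paper: Paper, editor: str, new_text: str,
                approved_by: Optional[str] = None) -> Optional[Paper]:
    """Apply an edit under the paper's model (hypothetical rules).

    Returns the paper that now holds the new text, or None if the edit
    is still waiting for an author's approval.
    """
    if paper.model is Model.OWNERSHIP and approved_by not in paper.authors:
        return None  # suggestion stays pending; the posted paper is unchanged
    if paper.model is Model.FORKING:
        # The original posting is untouched; the fork remembers its parent.
        return Paper(paper.title, list(paper.authors), new_text, paper.model,
                     contributors=paper.contributors + [editor], parent=paper)
    # Approved ownership edits and open-wiki edits both change the paper in place.
    paper.text = new_text
    paper.contributors.append(editor)
    return paper


original = Paper("Decoherence review", ["alice"], "v1", Model.OWNERSHIP)
pending = submit_edit(original, "bob", "v2")                        # no approval yet
accepted = submit_edit(original, "bob", "v2", approved_by="alice")  # approved by an author
```

The only thing that distinguishes the models in this sketch is who gets to decide and where the new text ends up, which is why they could plausibly co-exist on the same site.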

The above discussion mostly concentrates on the various technical aspects, but these can all very likely be handled with time and money. Much more challenging are the substantial social obstacles that Paul and I spent a while discussing.

  • Author control. Authors often have enough difficulty getting along with their immediate collaborators, so there will doubtless be some resistance to having their work edited by strangers. This will be entangled with how attribution is done, and whether an ownership, forking, or open-wiki model is adopted.
  • Attribution. Each model will have different possible methods for attributing content to different authors. Are all people who contributed listed as authors, or do the original authors appear in a distinguished position? Are individual changes indexed? At what level of granularity? With the right diffing software, it’s already possible to track arbitrarily small changes to TeX files from many users in a reasonable way.
  • Evaluation. How will others evaluate an author’s contribution when it can now be arbitrarily minor? Are hiring committees going to check which individual lines you added to a document? Many similar issues have not stopped GitHub from assuming a bit of the role of a CV in the software industry.
  • Author incentives. Given the possibilities for evaluation, will authors have enough incentive to contribute meaningfully to collaboratively edited documents? (Could disincentives arise because of academic infighting that are not found in Wikipedia?)
  • Network effects. We can expect that it will be difficult for any collaborative editing forum to get started, since a lot of the potential benefits have strong network effects. (For example, no one is going to value your contributions on such a forum during your job hunt if they haven’t heard of it yet.) This could be mitigated by driving the frictions to editing as low as possible, so that you can capture the natural human urge to improve something even without reward that Wikipedia seems to tap into. One might also leverage the existing clout of the arXiv by convincing its advisory board to endorse or integrate a collaborative forum into the site.
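
On the attribution point above, here is a rough sketch of what line-level tracking of a TeX file could look like, using Python’s standard `difflib` module. It is only an illustration under simplifying assumptions: the function name `attribute_lines`, the contributors, and the sample text are all invented, each revision is assumed to be a full copy of the document, and credit is assigned per line (a real system such as Git’s blame is considerably more careful).

```python
import difflib


def attribute_lines(revisions):
    """Credit each line of the latest text to whoever introduced it.

    `revisions` is a chronological list of (contributor, full_text) pairs.
    Returns a list of (contributor, line) pairs for the final revision.
    """
    credited = []  # (contributor, line) pairs for the revision processed so far
    for contributor, text in revisions:
        new_lines = text.splitlines()
        old_lines = [line for _, line in credited]
        matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
        updated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines keep their earlier attribution.
                updated.extend(credited[i1:i2])
            else:
                # Inserted or replaced lines go to this contributor;
                # for deletions the new-side range is empty, so nothing is added.
                updated.extend((contributor, line) for line in new_lines[j1:j2])
        credited = updated
    return credited


# Hypothetical two-revision history of a tiny TeX file.
revisions = [
    ("alice", "\\section{Intro}\nWe study decoherence.\nThe result follows."),
    ("bob", "\\section{Intro}\nWe study decoherence in open systems.\n"
            "The result follows.\nSee also related work."),
]
blame = attribute_lines(revisions)
for author, line in blame:
    print(f"{author:>6}: {line}")
```

Note that this only answers the mechanical question of who touched which line. Whether a reworded sentence should count as the second contributor’s work, as it does here, is exactly the kind of granularity question raised above.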

There’s also the issue of licensing. Collaborative editing requires that academics release their work under more permissive licenses than they usually do currently. Paul was kind enough to look up how many of the papers submitted to the arXiv were released under each of the four available licenses (absolute numbers, and percentages):

License chosen | 2014 count | 2014 share | 2015 (YTD) count | 2015 (YTD) share
Minimal arXiv distribution | 94416 | 96.8% | 28606 | 96.9%
CC – Attribution | 1662 | 1.7% | 496 | 1.7%
CC – Attribution, non-commercial, share-alike | 1052 | 1.1% | 304 | 1.0%
Public domain | 387 | 0.4% | 117 | 0.4%

The minimal arXiv license, which does not allow for collaborative editing of the original document, clearly dominates. Part of this is because this is the default option and there’s currently not much reason to change it. But part of it is that people are wary of giving up the rights to their hard work, and will default to maintaining as much control as possible unless they have good reason to do otherwise. So this is another non-trivial barrier.[7]
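
For anyone who wants to check the percentages, the shares can be recomputed from the raw counts with a few lines of Python. The counts are copied from the table above; the function name `license_shares` is just something I made up for this sketch.

```python
# Submission counts per license, copied from the table above.
counts = {
    "Minimal arXiv distribution": {"2014": 94416, "2015 (YTD)": 28606},
    "CC - Attribution": {"2014": 1662, "2015 (YTD)": 496},
    "CC - Attribution, non-commercial, share-alike": {"2014": 1052, "2015 (YTD)": 304},
    "Public domain": {"2014": 387, "2015 (YTD)": 117},
}


def license_shares(counts, year):
    """Return each license's percentage share of that year's submissions."""
    total = sum(row[year] for row in counts.values())
    return {name: round(100 * row[year] / total, 1) for name, row in counts.items()}


for year in ("2014", "2015 (YTD)"):
    for name, share in license_shares(counts, year).items():
        print(f"{year:>10}  {share:5.1f}%  {name}")
```

Put differently, only about 3% of submissions in either year chose any of the three licenses that go beyond the minimal one.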

As a closing side note, I was pleased to hear from Paul that he expects the arXiv will eventually include a space on each article’s abstract page where the author can provide links to related videos, papers, course notes, data, etc. This should make it vastly easier for people to link up video abstracts and video lectures to their work, which I am strongly in favor of.

Edit: Tempered certainty about custom links on abstract pages.

Edit 2: Added examples of new citations as useful paper modifications.

Edit 3: Paperbricks is another idea, with video here. Here is jsweojtj’s description:

There is an awful lot of redundancy and wasted effort that goes into most papers, starting with introductions that need to be rewritten every time (when linking to a solid introduction would be both better and less time consuming). Each piece of a full paper (intro, data, analysis, …) could be peer-reviewed and published individually. A full paper could then be built from these paper-bricks. Anyway, recommend reading the paper as it’s well written and clear.

More motivation suggested by pickle27:

  • A paper could rely on a critical reference to build upon and the referenced paper could be disproven down the line but this is not immediately obvious from the paper that used it.
  • Currently it doesn’t seem like any merit is given to researchers who are very good at reviewing papers. Compare this to software where a good code review is celebrated. Editing and cleaning up the state of science should be valued when scientists are looking for work so I think that something along the lines of a GitHub CV for scientists would be valuable.

Edit 4: See Hessam Mehr’s comment below for more possible advantages of collaborative documents.

Edit 5: Many commenters have helpfully reminded me of these existing collaborative editing tools that we could look to as examples: Authorea, ShareLaTeX, Overleaf (formerly WriteLaTeX), and Fidus Writer. See this Nature News article for discussion. However, none of these allow for universal collaboration or a plausible path to credible and permanent attribution, so even if they succeeded they mostly wouldn’t answer the question of what the central repository (or other arXiv successor) would look like. And that’s the key part.

Likewise, see SciRate, ThinkLab, and PubPeer for discussion/annotation of articles, but without distillation toward a refined document. And see Force11 for some interesting but non-committal discussion about how future scientific works should be produced (h/t hyperion2010).

Edit 6: coliveira makes a good point that the initial presentation and discussion of results is not suitable for a wiki, since it will always be necessary to give the author some stable platform upon which to present some of their tentative work, and likewise for the tentative discussion by critics. So some form of ownership will be necessary for some formats. However, this platform could often be much more modest than a full journal article for many of the incremental results currently found in journals.

Edit 7: The Stacks Project is a neat example of a large collaborative document with a dedicated website. (H/t tobilehman.) I can’t assess its mathematical quality, but it looks very professional at a quick glance. This is more evidence, like Authorea and its siblings, that collaborative tools are actually fairly well developed these days. So that’s probably not the sticking point.

Edit 8: Bas Spitters: “With two dozen researchers we collaboratively wrote a 600 page book [the Homotopy Type Theory textbook] in less than half a year using github”. The book and relevant blog post. Amazing. I still don’t understand how they managed to convince people to do this.

Edit 9: See CodaLab for collaborative tools for computational research and CaseText for a crowdsourced forward-citation and annotation system for the law. (H/t graphific and jacech.)

Edit 10: Discrete Analysis is a new arXiv overlay journal launched by Fields medalist Timothy Gowers. (Nature News coverage. H/t Ivar Martin below.) There is also much to be learned from the Stanford Encyclopedia of Philosophy, especially the funding model. Note also that the arXiv administration board is toying with the idea of giving authors the ability to link to external material.

Edit 11: I was recently enlightened to the fact that the Stacks Project isn’t nearly as collaborative as I had thought, with the large majority of the 5000+ page document written by Aise Johan de Jong of Columbia.

Edit 12: Here’s an example of overlay commenting from Fermat’s Library, which highlights a paper every week for community commenting. Fun, but not intended to be scalable.

Edit 13: joelg from HN says this:

I’m working at the MIT Media Lab on PubPub (http://www.pubpub.org), a free platform for totally open publishing designed to solve a lot of these problems:

  • One is peer review, which, as some have already mentioned, needs to be done in an open, ongoing, and interactive forum. Making peer review transparent to both parties (and the public) makes everyone more honest.
  • Another is the incentive of publication itself as the ultimate goal. Instead, we need to think of documents as evolving, growing bodies of knowledge and compilations of ongoing research. Every step of the scientific process is important, yet most of it is flattened and compressed and lost, like most negative results, which are ditched in search of sexy click-bait headliner results.
  • Another is the role of publishers as gatekeepers and arbiters of truth. We need a medium in which anyone can curate a journal, and in which submission, review, and acceptance procedures are consistent and transparent.
  • Another is the nature of the medium itself. It’s 2016, and these dead, flat, static PDFs are functionally identical to the paper they replaced! Insert your favorite Bret Victor/Ted Nelson rant here: we need modern, digitally-native documents that are as rich as the information they contain.
  • Another is reproducibility. We should be able to see the code that transformed the raw dataset, tweak it, and publish our own fork, while automatically keeping the thread of attribution.

Edit 14: Last year Tobias Osborne started tooling around with GitHub to write a collaborative paper on “What is a quantum field state?”.

Footnotes

  1. When the marginal cost of producing a textbook is near zero, the fact that no one gets proper credit for writing good textbooks isn’t so bad simply because you only need one or two good ones, and the audience is huge. Random fluctuations are sufficient for capturing most of the low-hanging fruit. However, as the potential audience shrinks, it becomes more and more important to set the rewards for writing good documents in line with the benefits to the community.
  2. (Conjecture:) Hyperspecialization in technical fields leads to a gradual slowdown in progress as it becomes more and more difficult to learn everything that is needed to go beyond the state of the art. Part of this is just that learning the theory underlying a field requires you to learn all of that theory’s conceptual dependencies. We have been able to mitigate this problem by having researchers specialize and by creating dedicated schools and courses, but this becomes worse past the graduate level because we don’t have enough good incentives for didactic material. Relatedly, certain advancements will come from combining ideas from multiple subfields where it’s not feasible to simultaneously be an expert in all of them. We need better ways to become well versed in a field without reaching expert level.
  3. The arXiv has let people easily release new versions of papers, which is a clear improvement, but this is a minor effect.
  4. This doesn’t necessarily require much top-down coordination. The ultimate goal could be to introduce a new norm in physics that everyone publishes their paper in a form that is free to be modified, just like the early ’90s brought a new norm for releasing work on the arXiv (which previously might have seemed very unlikely).
  5. Personally, I think the ideal place to aim for is a new intermediate language somewhere between TeX and Markdown in complexity. But that’s a post for another day. For now, we could just really use a decent TeX editor.
  6. One could consider an even more modest version of this, where contributors could only submit comments (possibly consisting only of citations), that could be either approved or rejected by the authors, with the option of reply. The current process for in-journal discussion takes several months between replies, at the least. Although having dozens of small back-and-forth replies is tiresome and inefficient for the reader (such as in a free-for-all blog post), it is likely that significant improvements can be made to the traditional journals by allowing somewhat shorter and faster replies electronically. Here, we can learn a lot from StackExchange websites. Ideally, comments on papers could appear alongside the original document at the user’s discretion, perhaps with a voting system to determine which comments are worthwhile for casual readers. Or on each page there could be an adjustable slider that would filter comments made by the authors, commenters approved by the authors, any arXiv user, or the public at large.
  7. There are annoying copyright interactions with any journal an article might potentially be published in. An increase in collaborative licensing would accelerate the (good) move away from closed-access journals. But that also means you can expect serious pushback from existing publishers.

47 Comments

  1. Hey, fantastic analysis on the subject. As a researcher with many published papers, I share many of the concerns you raised and have been working on a model for better distribution of knowledge (including a better business plan to replace the current con systems that publishers have). Please contact me if you are interested in collaboration on this topic (building a new system for academic research).

  2. Hi, Jess. Great article!

    Have you heard of Authorea (https://www.authorea.com)? It’s an online collaborative editor / word processor for science writing and publishing. I noticed it’s not in your list of existing examples. (Disclaimer: I work at Authorea.)

    Authorea has a lot of the features you mentioned: version and authorship tracking (using Git), comments, etc., plus some time-savers like reference formatting and one-click export for arXiv or any journal format. You can write using LaTeX or Markdown (or even a combination if you’re a masochist).

    We also have some nifty stuff like data and code figure attachments, live iPython notebooks, and support for d3.js interactive figures.

    The model on Authorea is mostly “Ownership”, with an “Open” component because public comments are allowed on public articles. Papers aren’t static — authors can update any time, respond to comments, add co-authors, etc.

    Archival features like DOIs, Evaluation and Author Incentives aren’t built yet, but are totally possible in the future.

    Our vision with Authorea is to speed up science by creating an environment where research findings are published, reviewed, and consumed openly and effectively. We love feedback, too. What do you think — is Authorea useful? A step towards the goal you described above?

    • That seems to be similar to https://www.sharelatex.com/. Nice to have public comments though!

      • Hi, Anne! Great point. There is a fairly large difference, though. I may sound like an ad, but I can’t resist pointing out a few ways in which Authorea is different from (I’d say better than) ShareLaTeX.

        ShareLaTeX is basically “LaTeX in a web browser”. Authorea is LaTeX but it’s also designed to make writing easier and to be a publishing platform.

        For example, Authorea has tools to handle the annoying parts of writing like automated reference insert and formatting, and export in different journal formats. You can also make an Authorea document public as a webpage, invite comments, etc.

        They’re also a bit different in business model. ShareLaTeX is restrictive: free accounts get no history, no collaborators, and no GitHub sync.

        In contrast, Authorea free accounts have full change history, unlimited collaborators, and GitHub sync.

    • Off-the-cuff, the interface looks nice and I like that it uses Git. This seems like a good example that one could learn from. However, as you say, it doesn’t allow for universal collaboration. Therefore, for me to be satisfied and Authorea to succeed would require something like Authorea syncing up (presumably through an open API that would allow competitors) with a new collaborative arXiv. But in that hypothetical, the nature of the collaborative arXiv is the real question, and in particular how it would handle Git-like tracking, merging, ownership, incentives, etc. Note that the arXiv already allows updating at any time and the addition of co-authors.

      (Incidentally, the CERN document that the website touts as having 200 authors was essentially written by just two people (Patrick Janot and Frank Zimmerman, with very minor contributions from Alain Blondel). So I don’t think this qualifies as a proof of principle.)

      • Thanks for the great feedback. For universal collaboration, do you mean a document that would be as open as a wiki for anyone to edit?

        You can do this on Authorea right now, actually. It would just have to be by request, so a user would e.g. make a comment asking to be added as a co-author.

        If you submit an issue request (https://www.authorea.com/issues) we could build a special feature to make it simpler to do this. 🙂

        Re: arXiv sync, Authorea currently supports export formatting to over 87 different journal/publisher formats including arXiv. And we’re working on one-click direct submission workflow right now for some publishers. We’d love to include arXiv (or future-arXiv) sync in that if an API were available.

        What new features would Authorea need for you to consider it a suitable candidate GitWikXiv? For example, everything could be made citeable with DOIs for articles and versions.

        Re: the CERN paper, although the final CERN document may show only a couple of contributors, I understand it’s based on earlier drafts that had several hundred co-authors involved. In fact Authorea built several advanced user-management features specifically by request for that and other CERN projects.

        • > For universal collaboration, do you mean a document that would be as open as a wiki for anyone to edit?

          I mean something that allows a stranger to take the current best review article on a topic and improve it in a way that (1) assigns me credit in a way that a hiring committee could check and trust, and (2) makes the improved article freely available. This might be in the form of a “fork me” button augmented with pull requests.

          > What new features would Authorea need for you to consider it a suitable candidate GitWikXiv?

          As it stands, Authorea appears to have two problems by my lights:

          (a) The additional friction of requesting to be added as an author seems modest but I conjecture is fatal. Would I be diluting my credit by approving a new author? If I see something I could quickly improve, how many hours/days do I have to wait for satisfaction? What if I want to take an introduction section from one paper and add it to another?

          (b) The credit assigned for the contribution is controlled by a for-profit company which does not have its incentives permanently aligned with Academia. I am completely fine with pay-for services that add value by assisting individual groups of collaborators to work together, but the central hub to which that work is submitted and which controls the attribution systems needs to be trustable in the long term (like the arXiv).

          So here is one hypothetical: Authorea collaborates with the arXiv to agree to an attribution standard. Traditional arXiv articles are attributed to their authors like normal, but new submissions (from Authorea or a competitor) can be augmented with an attribution file which lists the authors who own the article as the main authors, but which also tracks all the small pull requests accepted by other authors. This attribution file is based on the ORCID standard for identifying authors which is already understood by the arXiv. The arXiv consumes this data, and then when I search for an author on the arXiv I find both the articles that they own and the contributions they have made to articles owned by others.

          Of course, there are lots of interesting questions about how the attribution standard would be defined, but I think here we can look to GitHub.

          I am interested in writing a post exploring this hypothetical. If you send me an email I can run it by you for feedback before I post it.

          > although the final CERN document may show only a couple of contributors, I understand it’s based on earlier drafts that had several hundred co-authors involved.

          Sure, but Authorea is trying to use the long author list as evidence that Authorea is good for collaboration. That isn’t at all convincing if all the difficult collaboration occurred before the document was uploaded to Authorea.

          • Very cool. I’ll send you an email ASAP.

            Just the other day we had a conversation here about a fork/pull request model of paper editing. This would be a great way to make truly collaborative research, especially if it supports inherited partial attribution as you describe.

            Of course one challenge is that non-developers would be completely lost since this paradigm doesn’t exist anywhere else. Implementing such a system in a friendly, usable way for non-technical users would be a big UX challenge.

            And as other people have said, the biggest challenge is getting people to change the way they think about research and papers. 🙂

            BTW this reminds me of the Shepardizing system (http://en.wikipedia.org/wiki/Shepard%27s_Citations) for caselaw. I’ve often wished we could have something similar for science. In that area you may also find CaseText (https://casetext.com/) to be interesting — it’s essentially a crowdsourced forward-citation and annotation system for the law.

            About the CERN paper, sorry I wasn’t clear before: the prep writing was done on Authorea too in private documents before the finished version was assembled.

          • > Of course one challenge is that non-developers would be completely lost since this paradigm doesn’t exist anywhere else. Implementing such a system in a friendly, usable way for non-technical users would be a big UX challenge.

            Totally agree. What I like about this hypothetical is that it lets you guys do what you’re good at — write collaborative software — while allowing easy entry for competitors and keeping the final product (papers with proper attribution) under control of the arXiv. If you guys solve those UX problems better than anyone else, people will pay for it.

  3. Excellent post! I especially agree with the piece about getting oriented on a new topic. I hope academic publishing can change.

  4. I’d suggest integrating more semantics into this system, as in, e.g., Google’s Knowledge Graph or SemanticMediaWiki. This can help with more automated processing years later. Also, adding more interactivity with ipython/webppl notebooks embedded directly in the article and/or comments would be a huge step up for the community. Or even integration with systems like Coq, Agda, Why/Alt-Ergo.

    • (Copying from HN:) I agree. However, I think that the medium of the work is mostly (although obviously not completely) separable from whether papers are collaborative and continuously evolving.

  5. I saw this on the Hacker News front page and got so excited I switched to my PC to read it.

    Reading your blog post really cheered me up. As a PhD student in chemistry, I am constantly advised to read more papers. Yet, I find the process extremely painful especially with newer work.

    * Papers are worded very defensively nowadays to squeeze past the peer review process as effortlessly as possible. A major portion of an experimental chemistry paper discusses why a certain thing is desirable, with applications that the authors list solely under the obligation of “selling” the paper.

    * There is a non-trivial amount of overlap between different papers. Due to the fixed nature of the paper format, the actual content of a new paper is minute. Yet, you have to read the whole thing or risk missing that nugget of information. A system where a corpus of knowledge around a certain object, theory, etc. could be continuously and collaboratively expanded would avoid this sort of inefficiency.

    * Especially in my field, people keep their results from each other until publication for fear of being scooped. In fact, I have pitched the Wikipedia-git idea to my boss on multiple occasions and his biggest reservation is that some big group in China will put 20 post-docs to work around the clock to finish the idea as soon as you publish the gist of it on such a system. I tend to think that this is a good thing. We would get credit for the idea; them for demonstrating it: we do what we do best as a smaller group and other groups demonstrate their experimental/synthetic prowess.

    * As you point out, information is very hard to find definitively. Keyword searches turn up only some of the papers on a particular topic. This is usually not an issue on Wikipedia where everything is semantically linked.

    I think that a git-Wiki system would make things a lot better. Being able to fork means collaboration is more seamless and deliberate. Consensus can be reached in the end, hopefully with a merge, or the state of the art represented honestly as a division in opinion. It’s good to know from a historical point of view how opinion around matters evolved with time.

    More importantly, finding a reasonable attribution and credit/evaluation system would encourage people to contribute without fear of their work going unrewarded. From experience I know chemists, at least at UBC where I work, are very conservative, which is understandable. They need grant money to pay for grad students, chemicals, and what not (it’s a very expensive operation), and they need their work touted as highly as possible to compete for grants. In particular, they need a number to represent their contributions that everyone agrees on (journal impact factors, author H-index, etc.). I believe it wouldn’t be too hard to come up with a credit system. Moreover, we don’t have to get it right at first: the definition can adapt and evolve to one everyone is happy with.

    I will definitely follow your work on this and would be happy to contribute with ideas, code, or a second opinion from a person in a slightly different field than you guys.

    • Strongly agree with all this!

      20 years ago I would have thought that building a consensus on a collaborative document is impossible, but I think Wikipedia shows it’s feasible. (Major caveat: People have stronger personal incentives to distort an academic consensus statement than a Wikipedia article.)

      • Yes, but don’t you think a visible fork in opinion is more useful than a hidden divide? At least then it’s visible what the games in town are and what each has to say. Some of the best pedagogy comes from comparing competing points of view in the past and how/why one turned out to be invalid, etc.

        • I didn’t mean to suggest that one would get a consensus on the scientific questions. I just mean you can plausibly form a useful consensus on the document itself (which might need to describe a scientific disagreement), something that is not at all obvious pre-Wikipedia.

  6. Hi,

    The main problem I have with your ideas is that they seem not to take into consideration the social aspects of research; instead they focus on the technical ones. If you want to get to something useful to more than the casual wiki editor, you need to look at the guys who are doing research full time and consider their interests, the interests of those who pay these full-time scientists and their expensive toys, and the interests of the ones who decide how much money the people who pay the scientists will have…

    None of these guys are paid to get insightful new ideas that will change the world, it doesn’t mean that it never happens, but it is more by accident than by construction. Most of the money in this game is there to get the best place it can “buy” in the “research club”. If you don’t help professional researchers that are paid to play this game to score more points you’ll never get them in your platform.

    In my opinion, for these guys, the main problem with the idea of wiki-like paper editing is that they are paid to *deliver* things, and a wiki article is never “delivered”; it will always be in an incomplete state, waiting for some updates. The second problem is that people in the research world need to deliver *relevant* things. In the research world, relevance is measured by selectivity: if your paper is selected by a hard-to-get-into conference/workshop/journal/editor, then it is relevant. If instead you have collaborative online documents that don’t need to be selected to exist, these articles are not considered relevant.

  7. Great ideas here. Not enough time to read all the comments but I wanted to put in my two cents. I’m just starting out as a PhD but what I find challenging is collaboration among co-authors. Right now we just use MS Word and versions and edits go back and forth.

    The stand-alone value prop could be to allow for simpler online collaboration that can also maintain existing references and citations, as it doesn’t always carry through (like if I’m using EndNote for my citation manager). If authors can use the system for their own papers, it can build up your user-base even if they aren’t comfortable with having outsiders edit their articles.

    ResearchGate I think is moving in this direction, as academics can post papers they work on there.

  8. Matteo Cantiello

    A lot of good ideas. Most of the technology to achieve universal collaboration is available. To me the most difficult part is convincing scholars to collaborate in the open. I believe this will happen, but one has to take baby steps. Research achievements are the metric of the academic world; scientists would only participate in such large-scale, open collaborations if a solid reputation system ensured their contribution was appropriately measured and acknowledged. So one first needs to create an environment where scientists get used to collaborating in the open with their chosen co-authors (for example that’s what we’re trying to facilitate at Authorea). Then, as the confidence in open science builds up, one could start pushing towards universal collaboration. Incidentally I set up an experiment of this kind some time ago: I invited anybody to join a sort of “open collaborative paper” on a very speculative idea https://www.authorea.com/2155 Sort of a proof of concept. Never really pushed this, but I am with you that this should be the way to go.

    • Agree with most of this. However, I don’t yet see why folks will bother to collaborate with chosen co-authors in the open. Is the idea that you first create the credit system for comments by outsiders, and then authors will collaborate in the open in the hopes of getting assistance from comments by credit-seeking outsiders? Or is some movement toward collaboration in the open something you guys at Authorea are empirically seeing, without necessarily having a solid explanation?

      • Matteo Cantiello

        An immediate advantage is open peer review. Here I argue that the peer review process (in its current form) is not sustainable https://www.authorea.com/20770 . Writing in the open and having a powerful commenting system can represent a new trajectory for communal review of documents. And yes, since everything is logged, occasionally outsiders can get credit or even get on board on a paper if their help is substantial. So I think you see my point: This could smoothly progress into a mode of collaboration close to what you envision. The technical steps for doing this jump are minimal for a platform like Authorea. The real challenge is to slowly change the way scholars look at collaboration. Create that mindset. I think the movement we are trying to promote goes in that direction.

        • Sure, although I’m not sure that authors will individually see the correct incentives to pursue open peer review even if, as you argue reasonably, the closed peer review system as a whole is becoming more dysfunctional.

          • Matteo Cantiello

            Yes, it is not completely clear how that will work. One possibility is that some metrics will show undeniably that open collaboration leads to faster, better science (I think we both agree this is true, but demonstrating it is another story). The other forces at play are the general public, funding agencies and universities/institutions: the idea of having science done in the open is clearly appealing to taxpayers, who hate paywalls and constantly hear from politicians about bad practices among scientists trying to secure their funding. Universities might also be the agents pushing for their scholars to do open, transparent, collaborative research if that would secure better funding / better reputation. So I am with you that this is a complicated landscape. I am also with you that we do not exactly know how, but this “revolution” has to happen.

  9. I think one of the problems that is not, or only slightly, discussed here is grant acknowledgments and money. Really, 90% of the papers are published so the main author can put the grant number in the acknowledgement section. Good or bad, it doesn’t matter; it is the fact. So any system that is to be successful in this space should be recognizable by funding agencies (NSF/NIH) and should make it possible to show personal (or group) contributions, preferably with the possibility of a corresponding author (you all know the ego sizes in academia)

  10. How about a two-layer model? The bottom layer is individual/group contributions which can be in somewhat raw/specialist form but with clear attributions, while the top layer can have a collaborative, wikipedia-like model with properly organized material in standard terminology, making it more approachable to a bigger audience.

    • Yes, I think this is the correct direction.

      Note that you’ll need to have clear attribution on the wiki document too to get the correct incentives. (Or do you? If enough people were reading and being influenced by the wiki document, would this alone create enough of the right incentives to get experts to contribute to the wiki document?)

  11. This reminds me of the Algebraic Stacks project: http://stacks.math.columbia.edu/ it’s a big collaboratively edited mathematics textbook, the subject is under active research, and the book’s versioning uses git commits, as the whole thing lives in a git repo.

  12. This is a really good discussion. In addition to all the points you and others have made, I wonder how this would work culturally. Like wikipedia, where sometimes random people make enormous edits then fight back and forth about who is right? Or like collaborative writing in groups where people are respectful and use comments and small changes to more gradually work on a document. It would be interesting to see if we could come up with some standards of behavior for this setting, as well as thinking about how we would incentivize/enforce them. Personally, I think it’s fairly easy to work collaboratively with people I know (either in real life or online, and whether writing or other work), but working collaboratively with people I don’t know is much harder. How to build up a sense of understanding the other collaborators while starting from scratch might require some other channels of communication beyond the simple text collaboration, such as at least IM.

    • Definitely an important facet. Pre-wikipedia, I would have said it’s impossible. However, I think wikipedia has shown that *enormous* groups of people can come to a reasonable consensus, more-or-less and most of the time, on how to structure a document that properly treats a controversy. Furthermore, they have managed to do this without real-world incentives! I am optimistic that a smaller number of PhD scientists with their true identities attached will be able to do even better. My biggest concern is that the real-world career and monetary incentives will induce *bad* dynamics that are not found on wikipedia.

      As has been pointed out by others, wikipedia has the advantage that it need only form a consensus about verifiable information, whereas scientists will need to produce documents and have discussions about disputed things. However, here I think that we can fall back to the ownership model.

      I strongly agree that there need to be channels of communication besides the text under discussion itself. IM is fine, but wikipedia seems to have functioned pretty OK using just the text-based, and rather slow, “talk” section. Perhaps you are right that this will not be sufficient when they are talking about hard scientific problems rather than just editorial concerns.

  13. Hi Jess,

    I enjoyed your analysis and appreciate the amount of work that went into it, I learned a lot.

    I am an active theoretical physicist (have a paper with your PhD advisor :), and as much as any of you have been suffering through the inefficiencies of pretty much all aspects of scientific work — the difficulty of finding relevant prior publications, sifting through them, preparing ms, trying to put a new twist in the introduction (the infamous first lines “Recently, there has been a lot of interest…” ), and of course the review process (both as author and a referee).

    I can closely relate to the PaperBrick process that you highlighted; however, I would like to take it to the extreme. An epiphany that I had, which may seem like a triviality, is that a lot (all?) of our scientific knowledge is hierarchical, i.e. there is a deep sense of scale invariance. The simple meaning of this is that the same topic can be discussed at various levels (kindergarten, Scientific American, Nature/Science, “Specialized Journal”, notes-too-detailed-to-be-published, etc.), each deeper level providing additional “zoom” into the problem. The difficulty, of course, is that such a nested hierarchical structure does not naturally map onto the linear publishing medium of journals and books. However, this is the structure that is optimal for learning.

    This was my motivation to set up knowen.org as a structured alternative to wiki, with some additional bells and whistles (below) to incentivize serious people to use it. The distinguishing feature of knowen is that it is designed to promote structuring and distillation of knowledge. The content is arranged into rather compact nodes, each having one or more logical parents. Going deeper in the hierarchy reveals progressively more details (each node can also have various attachments, such as scanned notes, figures, data, scripts and their outputs). The motivation was quite self-serving since I knew that my personal productivity and the effectiveness of collaborations would be enhanced, which turned out to be the case. The link that you posted is my current unpublished project on quasicrystals, which has more content than will go into a publication, which is in the process of being finalized.

    As you rightly note, at this point the amount of content on the site is rather limited.
    I am actually surprised that you even found it (perhaps through Paul?), since it is still a work in progress. Mainly it contains private collaborative projects, which are not visible to the outside, some Private Visible (can read and comment but not edit), and some lecture notes.

    The “big idea” for public content generation is two-fold. Knowen provides fairly useful work/collaboration features (markdown/Latex, history of changes, simultaneous editing/drafting, ability to comment) which I hope will make it attractive for private collaborations. However, when the collaboration is done with their work, and have published what they wanted to publish, I expect that at least some of the private content will be voluntarily folded into the public hierarchy by assigning proper public parents, where it will be easy to locate by others interested (visibility!). The other mechanism is that people who keep working in their standard ways would use knowen as a way to popularize their work, even if only by putting some small snippet in the right place and adding a ref to arxiv or a journal publication (remember, PaperBricks).

    The attribution is natural since all the changes are tracked (they must have a title of change and can have a detailed description of the change for the less trivial ones), and in fact can be evaluated by other users (now we have implemented a system similar to Stack Overflow). There are some other features that people may find useful, e.g. the ability to follow the activity of selected users, or changes in nodes or subtrees.

    Forking is naturally implemented by the branching structure of the (acyclic) graph structure; even Evolution & Creation can be accommodated as alternative views on the same problem (sorry, Authorea, but I don’t think that git in its bare form will fly with the majority of the scientific community 🙂

    As I mentioned, it is still a work in progress, so if you’d like to see your favorite features in soon, let me know!

    [The brave ones, who read till the end of this enormous comment – I salute you!]

    • Great to hear from you Ivar! I did indeed hear about Knowen from Paul, but I hadn’t known we had both co-authored with Wojciech. Small world!

      > a lot (all?) of our scientific knowledge is hierarchical, i.e. there is a deep sense of scale invariance….The content is arranged into rather compact nodes, each having one or more logical parents.

      I have an important question. On knowen.org, you say “parent” and “child”, so I assume this is a directed graph. And you say “root” so it sounds like a rooted tree where, presumably, the directionality of the edges respect the partial order associated with the root, i.e, all edges point away from the root. (An “arborescence”, wikipedia tells me.) So this sounds like a single preferred hierarchy for all knowledge, with a single sense of direction going from less detail to more detail? Is that right?

      If so, I have a serious worry about this, but I’ll wait for you to confirm this before I expand.

      > However, when the collaboration is done with their work, and have published what they wanted to publish, I expect that at least some of the private content will be voluntarily folded into the public hierarchy by assigning proper public parents, where it will be easy to locate by others interested (visibility!).

      Yes, I thought this was a brilliant maneuver! When I was discussing Knowen with my great friend Godfery Miller, this was the aspect I emphasized the most. If you can build a collaborative tool people already want to use, then you can make a nice frictionless way for them to fold it into the public knowledge base. Wonderful.

      But here’s the big challenge: how will you produce a more useful collaborative product than the 8-person Authorea team, or similar competitors? Have you considered merging with them, if that is at all possible?

      > and in fact can be evaluated by other users (now we have implemented a system similar to Stack Overflow).

      This is very cool! I had no idea.

      > Forking is naturally implemented by the branching structure of the (acyclic) graph structure… (sorry, Authorea, but I don’t think that git in its bare form will fly with the majority of the scientific community 🙂

      Wait, do you mean that git in its bare form won’t have enough structure for powerful forking to be used in collaborative document of all scientific knowledge, or do you mean that the scientific community won’t be able to usefully fork individual documents with just git?

      Cheers!

      • Hi Jess,

        Small world indeed 🙂
        thank you for your very relevant questions. Here is what I think about them:

        > So this sounds like a single preferred hierarchy for all knowledge, with a single sense of direction going from less detail to more detail? Is that right?

        This is a very big question. To be honest I am not sure if such universal hierarchy exists. However, I would approach this issue constructively. Suppose you start working on a project. First you define the general scope, formulate a hypothesis. Then you start going deeper and deeper, exploring different logical branches, once in a while backtracking and reevaluating your hypothesis etc. In the process, the logical structure emerges naturally, even though at times you may need to do some splitting/merging/rewiring of logical connections.

        Knowen does not limit the number of parents that any node can have, which makes it a data structure more general than a tree (technically, a DAG). Cyclic parent-child connections are not allowed, but otherwise it is quite flexible.
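
        A minimal Python sketch of such a structure, assuming only what is described above: any node may have several parents, and a link is refused if it would close a cycle. The class `KnowledgeDAG`, its methods, and the example node names are hypothetical illustrations, not Knowen’s actual implementation.

```python
from collections import defaultdict


class KnowledgeDAG:
    """Hypothetical node graph: many parents per node, no cycles allowed."""

    def __init__(self):
        self.children = defaultdict(set)  # parent -> set of children
        self.parents = defaultdict(set)   # child -> set of parents

    def _reachable(self, start, target):
        """Return True if target can be reached from start via parent->child links."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.children[node])
        return False

    def add_edge(self, parent, child):
        """Link child under parent, refusing any link that would close a cycle."""
        if self._reachable(child, parent):
            raise ValueError(f"{parent} -> {child} would create a cycle")
        self.children[parent].add(child)
        self.parents[child].add(parent)


# Illustrative node names only (assumed, not taken from knowen.org).
dag = KnowledgeDAG()
dag.add_edge("High-Tc", "Materials")
dag.add_edge("High-Tc", "Phenomenology")
dag.add_edge("Materials", "Result")
dag.add_edge("Phenomenology", "Result")  # a second parent is allowed
```

        In this sketch, an attempt such as `dag.add_edge("Result", "High-Tc")` raises `ValueError`, because "Result" is already a descendant of "High-Tc".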

        But having said that, there can be multiple approaches to building hierarchies, and they will appear different, at least at the initial stages. One project that I have high hopes for is High-Temperature superconductivity: lots of information, little consensus. One can structure a hierarchy based on families of materials, or based on phenomenology. I personally think that the latter is more interesting, but crystal growers may disagree. But it is totally ok if several hierarchies begin to grow for the same field. Eventually they can be interconnected, by logical links, or at the minimum by standard hyperlinks. One can build more complex ontologies than parent-child, but I suspect that the amount of ambiguity there will be much higher.

        > But here’s the big challenge: how will you produce a more useful collaborative product than the 8-person Authorea team, or similar competitors? Have you considered merging with them, if that is at all possible?

        Authorea/ShareLatex and Knowen are solving different problems. Authorea is a beautiful tool to simplify writing papers. I was seriously considering using them to produce a paper based on my Knowen notes, whose structure is too “vertical” to be simply dumped into the conventional journal format. While we continue publishing in linear journals, tools like Authorea or ShareLatex will remain valuable. I personally dread constantly rereading 95% finished manuscripts, and any tool that makes that less painful is very welcome. However, at this point the versioning functionality of the tools that I looked at remains less convenient than simply diffing tex or even pdf. That is my problem with git; as powerful as it is, an article is not a computer program, and git does not naturally map onto the paper-writing process.

        Knowen is a tool for prototyping, record keeping, and eventual voluntary “publication” of the distilled new content — with minimal redundancy. In this sense it is very different from arXiv, books or journals.

        My personal view is that accumulation of knowledge should be largely decoupled from the packaging problem. The value of work should not be gauged by whether it got into Nature or Science. Textbooks, reviews, and articles are the means of popularization, and will remain of importance indefinitely; however in my opinion they primarily represent a view of a set of authors, with their preferences and biases, however enlightening or entertaining those may be.

        In the utopian world that I like to envision, there will be a globally vetted repository of knowledge, that one can be confident of. And if one wants to learn about a topic, however general or specialized, one could just start from the Root, and in a few steps would be able to reach the forefront of scientific research, with all the caveats and derivations being also there, just a click away. And if one does discover or even conjecture something new, however small or grand, that knowledge will find its place and would be taken advantage of by the community, with the original attribution naturally preserved.

        • Hi Ivar, thanks for your thoughtful answers!

          > Knowen does not limit the number of parents that any node can have, which makes it a data structure more general than a tree (technically, a DAG). Cyclic parent-child connections are not allowed, but otherwise it is quite flexible….One can structure a hierarchy based on families of materials, or based on phenomenology…. it is totally ok if several hierarchies begin to grow for the same field.

          OK, good. So there can be multiple directed (parent–>child) paths from A to B? I guess it’s then possible to, say, divide animals into both land vs. sea and also into warm vs. cold blooded. Right?

          (Incidentally, is there an easy way to see why it’s so important, presumably for performance reasons, to avoid cycles?)

          > as powerful as it is, an article is not a computer program, and git does not naturally map onto the paper-writing process.

          Do you think this is just because the software field is more mature with regard to revision control, and that revision control software properly designed for prose and math will emerge? Or do you think it’s fundamentally a different problem?

          > My personal view is that accumulation of knowledge should be largely decoupled from the packaging problem… Textbooks, reviews, and articles are the means of popularization,

          What do you think of this argument:

          A sort of wikipedia for research would be very valuable as a separate project, but papers are necessary because they enable a technical conversation. One or more authors present their particular view of a topic, and other researchers join the discussion by writing their own papers. On the other hand, a shared page is an amorphous piece of information that is very difficult to use as a discussion medium. In this sense, wiki articles are much more primitive than traditional research papers.

          Do you agree that author-owned, conversational articles are necessary both for the creation of knowledge (through academic discussion) in addition to the popularization and dissemination of it? (I guess that would put Knowen in the core, with the bleeding edge of research on one side and the popularization of known facts on the other?)

          • Hi Jess,

            thanks for the followup!

            >OK, good. So there can be multiple directed (parent–>child) paths from A to B? I guess it’s then possible to, say, divide animals into both land vs. sea and also into warm vs. cold blooded. Right? (Incidentally, is there an easy way to see why it’s so important, presumably for performance reasons, to avoid cycles?)

            Indeed, that is the point. And each route is in the direction of specialization:

            Animal -> Sea -> whale -> …
            Animal -> Warm-blooded -> whale -> …

            Both routes lead to whales, no matter which one you choose.

            I believe that directed structure is particularly effective, as it “forces” contributors to structure information hierarchically. In the above example, if I want to learn and understand an arbitrarily detailed aspect of whale life or anatomy, I should be able to get there through a small number of “zoom” steps, without the need to read the whole tome on zoology.
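
            The two routes above can be enumerated with a short Python sketch. The helper `all_paths` and the dictionary layout are assumptions made for illustration, not how Knowen stores its nodes. Because the graph has no cycles, every route is finite and the recursion is guaranteed to terminate, which is one simple way to see why forbidding cycles is convenient.

```python
def all_paths(children, start, goal):
    """Yield every parent->child route from start to goal in an acyclic graph."""
    if start == goal:
        yield [start]
        return
    for child in children.get(start, ()):
        for rest in all_paths(children, child, goal):
            yield [start] + rest


# The example hierarchy from the comment above, as parent -> list of children.
children = {
    "Animal": ["Sea", "Warm-blooded"],
    "Sea": ["whale"],
    "Warm-blooded": ["whale"],
}

paths = list(all_paths(children, "Animal", "whale"))
for path in paths:
    print(" -> ".join(path))
# Animal -> Sea -> whale
# Animal -> Warm-blooded -> whale
```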

            > Do you think this is just because the software field is more mature with regard to revision control, and that revision control software properly designed for prose and math will emerge? Or do you think it’s fundamentally a different problem?

            I think that computer code and general knowledge are rather distinct types of information. Code has to compile and work in a predictable fashion ( http://www.explainxkcd.com/wiki/index.php/1513:_Code_Quality ). There is either one version or the other that is running at any given time. In a general scientific discourse, however, we expect to have divergent opinions, all presumed viable until proven otherwise. Thus the “branches” or “forks” should be explicitly visible, with no bias as to which one is “more correct”. I don’t think git in its present form is geared for this purpose.

            > What do you think of this argument:

            “A sort of wikipedia for research would be very valuable as a separate project, but papers are necessary because they enable a technical conversation. One or more authors present their particular view of a topic, and other researchers join the discussion by writing their own papers. On the other hand, a shared page is an amorphous piece of information that is very difficult to use as a discussion medium. In this sense, wiki articles are much more primitive than traditional research papers.”

            I agree with this statement in regards to current status of scientific wikis, Wikipedia included, that I have seen – they are primitive and amorphous, most of the time. However, there are beautifully elaborate online knowledge bases from software world that use wikis. Thus I don’t believe that the assertion is correct in general. We just have to learn to use wiki in a smart structured way.

            > Do you agree that author-owned, conversational articles are necessary both for the creation of knowledge (through academic discussion) in addition to the popularization and dissemination of it? (I guess that would put Knowen in the core, with the bleeding edge of research on one side and the popularization of known facts on the other?)

            There is more than one way to skin a cat. The chances are however that “the best” way will be the most efficient one. I don’t believe that the conversational articles and their processing are efficient. As long as everyone uses that approach, it does not matter. However, when a new more efficient way starts to catch on, everyone will have to adopt it. That is not to say that there isn’t a certain charm in writing manuscripts in ink, or playing chess by mail.

  14. Pingback: Papers: ecosistema privativo vs. dominio publico y creative commons | Virgilio Leonardo Ruilova Castillo

  15. Thank you for the article.

    Please also take a look at http://arximedes.org/. We have built a more intuitive search and the ability to rate papers. There is no reason we could not have a community rating of papers, both pre- and post-publication. Here’s a brief video: http://youtu.be/kj_Q7iqNq8o. We presently cover the Physical Sciences (NASA ADS) and are in the process of integrating biology (PubMed).

    Don’t hesitate to email me: dmitry@arximedes.org.
    We are looking for collaborators and always open to chats and ideas.

    Thanks.

  16. Hi, I thought I might point out the nLab (http://ncatlab.org/nlab/show/HomePage) which does some things for mathematics (in particular category theory-related stuff) that you write above.

  17. You may be interested in the experiment we did at the IAS in Princeton.

    With two dozen researchers we collaboratively wrote a 600 page book in less than half a year using github:

    http://math.andrej.com/2013/06/20/the-hott-book/

    The book is released under CC and has been a big success.

    We even got an ACM “Notable book” award.

    http://homotopytypetheory.org/book/

    • I am very interested in this. Where should I go to read more about the actual social logistics (/ mind control) that went into this? The post by Andrej Bauer makes it sound like these things just happened magically.

      > mathematicians benefit from being a little less possessive about their ideas and results. I know, I know, academic careers depend on proper credit being given and so on, but really those are just the idiosyncrasies of our time. If we can get mathematicians to share half-baked ideas, not to worry who contributed what to a paper, or even who the authors are, then we will reach a new and unimagined level of productivity.

      > Truly open research habitats cannot be obstructed by copyright, profit-grabbing publishers, patents, commercial secrets, and funding schemes that are based on faulty achievement metrics. Unfortunately we are all caught up in a system which suffers from all of these evils. But we made a small step in the right direction by making the book source code freely available under a permissive Creative Commons license.

      • I am not sure what to add to what Andrej wrote. It was a magical time at a magical place. It was clear we were doing something important together, something none of us could have done alone. Moreover, many of us had substantial experience in writing (open source) software and computer proofs. So, we were aware of the issues and knew the tools.
        I’d be happy to explain more, but don’t know what to say explicitly.

        • I haven’t read through all your posts, but I guess you are aware of selectedpapers.net.

        • Why did people contribute? Do you think they just were an unusual bunch who were passionate about this, or could it be replicated? Who led the project, and how did it originally get off the ground? Was it primarily young researchers, old researchers, or a mix? Was there a conference/workshop to kick things off, or did people just start sending emails? Did everyone know each other beforehand? Were most contributors physically at IAS, or did lots of people contribute remotely? Have the contributors felt like they received recognition for this work in applying for jobs later? Will it be repeated? (You don’t have to answer all those of course; I’m just trying to give you a sense of what I’m curious about.)

        • This is very interesting indeed. I am assuming, from Andrej’s blog, that this is the way that he continues to work, using git/GitHub. Do the other participants in the project work this way as well? Did the trend catch on?
