Scope of this blog #1

[Edit] This statement was replaced by a newer one which links to the About page. This is no longer a work blog and is written in my private time.

This blog was formerly called “Talat’s Repositories Blog”. The new name reflects a widening of its scope, not any loss of interest in the original subject area.

Although my current work centres around Application Profiles, working for UKOLN has made it increasingly obvious to me that information science has always been, and must remain, an interdisciplinary area where it is counter-productive to limit oneself purely to a narrow area of research and development. If services and standards are developed without reference to the wider, dynamic, user-facing world of information science, their long-term usability and life cycle in a given community are likely to be limited. As far as metadata developments are concerned, systems other than repositories, such as Current Research Information Systems and publications systems, are arguably of equal if not greater importance than those to which we would usually accord the title “repositories”.

In re-branding and re-purposing this blog slightly, then, the intention is to maintain a focus on repositories and associated systems but also to keep an eye on wider developments in the dissemination of intellectual communications on the Web, through whatever technology may be relevant to its purpose.


DuraSpace

[Edit] This is a legacy post. This blog is no longer connected with my work activities in any way and is written entirely in my own private time.

The recent announcement of the merger [Ed.: press release deleted] of DSpace and Fedora Commons as DuraSpace is potentially a very significant advance in the repositories sector. Although the two platforms will continue to exist as separate entities, they will no doubt collaborate to their mutual benefit in technical development. In addition, new software products such as DuraCloud are described.

I have personally found DSpace to be an effective and flexible platform, although until version 1.5 it was missing some fundamental functionality that meant it was overall an inferior product to EPrints. However, I have always said that if certain issues were sorted out, such as a more granular permissions system, versioning and so on, it is otherwise as good overall as EPrints. (I particularly appreciate how easy it is to see full metadata records in DSpace, from the point of view of research, though this is an entirely trivial technical point – it just happens to suit my work in my present and previous post!)

DSpace is clearly second to EPrints in terms of market penetration, but it is the only other major competitor to enjoy such a sizeable market share. Fedora is third not on its technical merits but largely because it is not a “packaged” product and requires much more customisation. It is evident that both platforms have much to gain from the collaboration. I would bet that the EPrints people may well have cause to worry about their future market dominance, given this development.

I’m particularly interested because Fedora makes much greater use of RDF, a technology that has its supporters and detractors, but which has not brought about the wholesale change to the promised Semantic Web that might have been hoped for. However, one can see the potential application within content management systems such as repositories. One stumbling block seems to be that triple stores are not particularly efficient databases and need significant optimisation efforts before they rival traditional relational databases, a subject on which I am not a great expert at present. (I thank a colleague at UKOLN for educating me on this.) It is particularly interesting, then, to note the reference to efforts to improve the triplestore-based storage layer Mulgara.
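As a rough illustration of the efficiency point, here is a minimal sketch of a naive in-memory triple store in Python. It is not how Mulgara or Fedora work internally; the class name, the identifiers and the predicate labels are all hypothetical. It only shows that, without indexes, every pattern query has to scan every triple, which is the sort of cost that optimisation work is meant to remove.

```python
class NaiveTripleStore:
    """Hypothetical toy store: a bare set of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard.

        This is a linear scan over all triples. A production store would
        normally keep indexes so that it does not have to do this.
        """
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


# Illustrative data only; the identifiers and predicates are made up.
store = NaiveTripleStore()
store.add("item:1", "hasTitle", "Paper A")
store.add("item:1", "isMemberOf", "collection:x")
store.add("item:2", "isMemberOf", "collection:x")

members = store.match(predicate="isMemberOf", obj="collection:x")
```

The appeal for repositories is that relationships between items are stored as data in their own right, so new kinds of relationship need no change to a schema. The price is that queries have to be made fast by other means.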

I’m awaiting further developments with considerable interest, noting the new version DSpace 1.5.2 and recent references to the planned versions 1.6 and 2.0. I wonder how much the repositories community will have changed in a year’s time? Things seem to be moving fast right now.


The collaborative research environment: publications management, CRIS systems and repositories

[Edit] This is a legacy post. This blog is no longer connected with my work activities in any way and is written entirely in my own private time.

Some months ago, I intended to write a post inspired by this post by Chris Rusbridge on CRIS systems. One particular motivation for this was the reference he made to some comments by Stuart Lewis, particularly relating to the Symplectic Publications Management System. This was the subject of an impressive demo in Aberystwyth by Daniel Hook of Symplectic and Imperial College, London, while I was in my previous post as Repository Advisor. We were very interested in CRIS systems as back-end systems for managing research outputs at source. I believe that Stuart had been in touch with Niamh Brennan, who has since been kind enough to give me some general details about the CRIS system that has been produced in-house at Trinity College, Dublin. I should note that I have not seen it in operation.

One reason that I never published the original post is that most commentators in the repository community probably agree that a CRIS is innately a Good Thing in any event, so it needs no detailed endorsement from me. (A further concern is that I do not wish to be seen to be making specific comments on developments at Aberystwyth, which is a matter for them, so I shall confine my remarks to the systems at issue here.) In short, as Stuart’s comments cited by Chris in the above blog make clear, the CRIS does not replace the repository but “sits behind it” and provides content. There are of course a great variety of ways in which this could be achieved in practice, depending on the particular system in question. I will address the definition of a CRIS further below. However, the Symplectic system deserves more detailed commentary here, for reasons that will become clear.

Symplectic Publications Management System

There seems no reason to doubt Symplectic’s obvious competence in publications management, having witnessed how effective this piece of software had already been, albeit having only been tested in a limited number of institutions at the time. It was, to my mind, impressive that Daniel Hook was both frank and receptive about the merits and development potential of his system. Indeed he had no real need for the hard sell because the software had been built “by academics for academics”, so it consequently fulfilled most of the basic requirements already and many of the other features that he was questioned about were already in the pipeline.

However, this post is by no means intended as a wholehearted eulogy directed at the Symplectic system. It was developed in the context of scientific disciplines, and had not at the time been tested in arts and humanities disciplines, although to be fair this point was made openly by Daniel Hook himself. I do not see why it should fail in principle in non-science disciplines, although the take-up may be less enthusiastic in the same way that one finds in repositories. The real problem is the limited coverage of these disciplines by union databases such as Web of Science, though this is not the fault of Symplectic. Their efforts may in time provide one reason to improve those databases, which underlie the core functionality of the system.

Briefly, the system replaces the university’s reporting system. The academic logs on and merely has to choose whether to agree that a new paper in Web of Science belongs to him or her. They are able to declare that it is identical to another version, perhaps in another database, or else they can separate two papers of the same name, e.g. a conference paper and a published paper. They can optionally correct the metadata. Evidently, for papers not found in Web of Science, academics can enter the details of the papers manually, although this process then loses all of its automated advantage over author self-deposit in a repository. For this reason, the system has the greatest advantage in science subjects. The administrator can see whether academics are responding to its suggestions and thus, in disciplines with good database coverage, can gain a good idea of how comprehensively they are archiving content. Finally the system is entirely interoperable with DSpace, so that items may be deposited through either interface (I am less sure about EPrints and Fedora). Permissions can be set and delegated on a fairly granular level, unlike in repositories, so that academics can have control over who can deposit items on their behalf, such as co-authors.
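To make the workflow described above more concrete, here is a minimal sketch in Python of the claim, merge and split steps, together with the response figure an administrator might look at. It is only an illustration under my own assumptions: every class, method and identifier is hypothetical, and none of it is taken from the Symplectic software itself.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    CLAIMED = "claimed"
    REJECTED = "rejected"


@dataclass
class Candidate:
    """A paper suggested to an academic from an external database (hypothetical model)."""
    source_id: str
    title: str
    decision: Decision = Decision.PENDING
    same_as: set = field(default_factory=set)  # ids the academic declared identical


class PublicationList:
    def __init__(self, candidates):
        self.candidates = {c.source_id: c for c in candidates}

    def decide(self, source_id, accept):
        """The academic agrees or disagrees that the paper is theirs."""
        chosen = Decision.CLAIMED if accept else Decision.REJECTED
        self.candidates[source_id].decision = chosen

    def merge(self, a, b):
        """Declare two records to be versions of the same paper."""
        self.candidates[a].same_as.add(b)
        self.candidates[b].same_as.add(a)

    def split(self, a, b):
        """Separate two records that only share a name."""
        self.candidates[a].same_as.discard(b)
        self.candidates[b].same_as.discard(a)

    def response_rate(self):
        """Fraction of suggestions the academic has responded to."""
        if not self.candidates:
            return 0.0
        answered = sum(
            1 for c in self.candidates.values()
            if c.decision is not Decision.PENDING
        )
        return answered / len(self.candidates)


# Illustrative data only.
pubs = PublicationList([
    Candidate("wos:1", "Same title"),
    Candidate("wos:2", "Same title"),
    Candidate("wos:3", "Another paper"),
])
pubs.decide("wos:1", accept=True)
pubs.decide("wos:3", accept=False)
pubs.merge("wos:1", "wos:2")
```

Note that the only relationship the sketch can record between two papers is "identical or not", which is exactly as simple as the versioning model shown in the demo.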

The system provides useful tools to help simplify and automate research reporting that is already mandatory for academics, rather than introducing new obligations and unnecessarily duplicating the deposit/ingest phase. By contrast, repositories require this as an additional process, after research reporting. Of course, certain repositories are used for research reporting, which requires an effective or explicit institutional mandate, but in their present forms they are badly suited to describing the finer details of funding grants and projects. This is handled rather better by Symplectic, but obviously requires manual metadata input.

It may come as no surprise, given my involvement with SWAP, that I should level a criticism at the Symplectic system on the basis of its simple versioning model. Clearly one might wish to be able to say more than whether two papers are identical or not. It was clear that no complex relationships could be described at the time of the demo. I hope that this is addressed in future iterations of the software.

CRIS systems

A further issue is whether the Symplectic system qualifies as a CRIS at all, as pointed out to me by Niamh Brennan. Apparently it does not. There is no indication that it supports the CERIF standard maintained by EuroCRIS, and moreover it does not provide the means to support the creation and versioning of research outputs and related project information from their very inception, which is one of the purposes of a CRIS system. Instead, it only deals with papers after publication, in the same way as a repository.

Whether or not academics, particularly those whose working practices are long-established, would willingly switch over to using a CRIS in this way is perhaps a matter for doubt. It might be possible to compare, for example, mandated and voluntary use of CRIS systems, but at present I have no evidence for the relative feasibility or success of either approach. I hope to be able to see other CRIS systems in action, and that an Open Source platform will become available before long. I suspect that they have considerable potential in managing research and making it freely available. However, it may be naïve to simply assume that they will provide a complete solution for research management. At present, there appear to be three commercial competitors: CONVERIS, PURE and UniCRIS [updated 22/09/2009; the original text stated that the only commercial software seemed to be PURE, given the apparent demise of UniCRIS].

The functionality that these systems highlight that traditional repository platforms lack – with the proviso of course that no two in-house CRIS systems necessarily share the same functionality – is the ability to offer a collaborative environment for research management and research reporting. In the case of a CRIS, this is supposed to extend to the creation of research from its inception. As a repository manager, I recall being asked by several academics why they were unable to manage their own metadata in our repository, considering that responsibility for their own research is a core function of their employment. This is an entirely fair criticism, to which I will return.

It is difficult to see why a CRIS should not share all of the main characteristics of a repository. Both manage the ingest of bitstreams and the creation of metadata. Both repositories and the Symplectic system can be used to expose these materials to the web or alternatively set permissions to view them, and to generate publications lists for authors or research groups. Ultimately these are a set of very similar systems with slightly different but related, complementary purposes.

Implications for repositories

This discussion highlights the conceptual similarity of the current repository platforms to publication platforms, despite the usual claim that they are not publishing research but merely archiving it after publication. In legal and practical terms at least, they are self-evidently re-publishing it by making it publicly available in a further location. (One might speculate that this is the reason for the animosity of certain publishers towards repositories.) After all, to publish ultimately means to make public. But in academic usage, publishing also includes peer review. In my view, it might be a good time to start drawing a clear distinction between these functions.

I have heard Paul Walk refer many times to the “silo effect”, which seems to be at the core of the problem with the present repository platforms. Unlike other websites, the repository is not an interactive site. Having seen the statistics for repository access using Google Analytics, I can report that only a small minority of visitors access the site directly or by referral from university web pages. Virtually all papers are found using Google (only rarely other search engines) and those visitors do not remain on the site after they have found the content. They are largely oblivious to the existence of the repository. Very few users were referred by Intute or OAIster. All of these problems are to some extent unavoidable and well-known, but the repository should at least be more interactive than it is from the point of view of the depositors.

I should note that the larger part of my experience with repositories relates to DSpace, though I am also familiar with EPrints. The My DSpace (user area) function offers the user no functionality other than being able to change their personal details and submit content, and even the latter usually requires manual authorisation by the administrator. EPrints is generally similar. This immediately results in confusion, mistaken error reports and so on from sometimes irate new users, but it is quite logical for the repository manager, who needs to know who is submitting content and generally wants to intervene to give basic initial copyright advice in order to save problems later with incorrect versions being supplied.

There is clearly a role for automating the setting of permissions for deposit on the basis of staff records. Though it presents problems, it can be done routinely for other university software systems (including apparently the Symplectic system). I would be interested to hear whether this has been done successfully in the majority of EPrints and DSpace repositories.

But the issue runs deeper: the only tool that the repository manager has to safeguard copyright liabilities is the ability to monitor submissions. It would be most unwise to set editorial permissions for all authors over their own metadata because there is no function to view recent changes, only recent deposits. Such changes might easily be in breach of copyright, despite the author’s best intentions. (In DSpace up to version 1.5, permissions can only be set by group, not for individual authors.) There is no easy versioning system, such as the sort used in wikis, and any changes to bitstreams can normally only be reversed by recourse to restoring from general back-ups.

It is evident to me that the present situation is not scalable. When the majority or even the plurality of research outputs are available, there will be far too much in each repository for its staff to deal with. A more scalable, collaborative solution is needed before that situation is reached, involving all parties concerned with the production of research in maintaining it on the web.

General recommendations

To summarise, the consequence is that items are “frozen” as soon as the repository manager completes checking for copyright and basic metadata compliance, after which any changes must be requested by email. The following features would need to be in place in order to allow repository managers to implement a more collaborative workflow:

(1) historical versioning control needs to be in place for editors to have the ability to easily roll back records and choose which may be viewed by the public.

(2) in addition to the initial checking step in the workflow, all recent changes to the repository need to be visible to the repository manager. These could be vetted, i.e. they could remain pending until allowed by the repository manager, if desired.

(3) users need to be better identified as authors, even where another individual, e.g. co-author or administrator, has deposited on their behalf.

(4) the permissions system needs to allow granular control over what an individual may change within their own deposits, whether bitstreams or metadata, and within groups in the repository.

(5) bitstreams and links to versions elsewhere on the web should be treated equally in the interface, since the user is concerned with getting the resource, not where it is.

(6) as in the Symplectic system, users should be allowed to delegate those rights to individuals within their workgroup.

(7) as in Symplectic, users should have control over which publications appear in their publications list, and in what order.
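A few of the features listed above, namely versioning with roll-back (1), pending changes visible to the repository manager (2) and delegated rights (6), can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a description of any existing repository platform; all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Revision:
    editor: str
    metadata: dict
    approved: bool = False


@dataclass
class Record:
    """Hypothetical repository record that keeps its full edit history."""
    owner: str
    delegates: set = field(default_factory=set)
    revisions: list = field(default_factory=list)
    public_index: Optional[int] = None  # which revision the public sees

    def delegate(self, granting_user, user):
        """The owner lets someone else (e.g. a co-author) edit on their behalf."""
        if granting_user != self.owner:
            raise PermissionError("only the owner may delegate rights")
        self.delegates.add(user)

    def propose(self, user, metadata):
        """Store a change as a new, unapproved revision; nothing is overwritten."""
        if user != self.owner and user not in self.delegates:
            raise PermissionError(f"{user} may not edit this record")
        self.revisions.append(Revision(user, dict(metadata)))
        return len(self.revisions) - 1

    def pending(self):
        """Revisions the repository manager has not yet vetted."""
        return [i for i, r in enumerate(self.revisions) if not r.approved]

    def approve(self, index):
        """The repository manager vets a revision and makes it public."""
        self.revisions[index].approved = True
        self.public_index = index

    def roll_back(self, index):
        """Show an earlier, already approved revision to the public again."""
        if not self.revisions[index].approved:
            raise ValueError("cannot roll back to an unapproved revision")
        self.public_index = index

    def public_metadata(self):
        if self.public_index is None:
            return None
        return self.revisions[self.public_index].metadata


# Illustrative usage with made-up users.
record = Record(owner="alice")
record.delegate("alice", "bob")
first = record.propose("alice", {"title": "Draft title"})
record.approve(first)
second = record.propose("bob", {"title": "Final title"})
```

At this point the second revision is still pending and the public continues to see the first. The design choice worth noting is that edits never overwrite anything, so the repository manager can vet changes after the event rather than freezing the record.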

It would also be desirable for repositories to use a method similar to that used by Symplectic to compare records to new items appearing in Web of Science. In that way, authors would have an ongoing dialogue with the repository, and an impetus to use it as a collaborative tool. In effect, there should be no conceptual difference between a publications management system and a repository – and, frankly, the former term is instantly meaningful to the user, while the latter means nothing to them at all. Such public relations disasters have led to the present stagnation in repositories. I would suggest that in any working system, the CRIS, the publications management system and the repository ideally need to be modular parts of the same software.

The repository world currently suffers from a dictatorial, top-down management structure that is in considerable part imposed by the design of the current repository platforms. At present, repositories do not meet the standards of collaborative software that we have come to expect from Web 2.0 services. They also seem to be in direct conflict with the traditional responsibility of the academic and/or department for proper research reporting within the institution. The Symplectic Publications Management System and the concept of CRIS systems offer a more collaborative model that can help avoid repeating the same tasks. In addition, they can help record complex grant information, some of it confidential or commercial, that may not be for public release on the Web. Moreover, they offer an interactive workflow with real time-saving benefits for academics. In contrast, self-archiving in repositories merely repeats a task that is mandatory elsewhere, and does not represent joined-up thinking.

Incorporating the already mandatory process of research reporting into such systems would effectively by-pass the need for institutional mandates for self-archiving, since deposit would become part of the existing workflow through new software tools. Progress with such mandates has been crushingly slow, despite the success of the few that have been achieved. However, since all institutions are organised differently, it is simply not the case that any single method for acquiring freely accessible content, however effective when implemented, is the only possible way to success. Collaborative research management is clearly a good idea for this end, as well as for saving time and effort on the part of academics, irrespective of the merits of institutional mandates. It would also be a more scalable, sustainable basis for repositories to work than the present situation. On that basis, I suggest that it may well be a Good Thing.


Not just another repositories blog (hopefully)

[Edit] This is a legacy post. This blog is no longer connected with my work activities in any way and is written entirely in my own private time.

First, a brief statement of intent. I’m very much aware that everybody has a personal blog these days and this is not intended to be another one. The title merely identifies me as the principal culprit, but the subject here is going to be confined to the world of repositories, which might better be described as a particular form of web content discovery. Repositories are by no means the only form of web content discovery, so we must always be clear in our minds why they are different, what they are intended to uniquely achieve and, of course, whether current technological and organisational approaches mean that they are in fact succeeding in doing this. However, it does not serve us well to look at one web solution in isolation, so many other related technologies may be discussed here. If the user doesn’t use a service and/or the content isn’t forthcoming to build that service, we must ask ourselves if the service model itself is valid and useful.

Several people have asked me to keep track of various repository-related conversations, and it seems best in this cutting-edge, collaborative Web 2.0 world to record the results as best I can in the open, for all to comment on. Please remember that this is intended as a discussion forum, not a place for arguments. All random ideas are welcome, however wild, within reason and provided they are on-topic.
