Thursday 17 July 2008

JISC Innovation Forum

Earlier this week, the JISC Innovation Forum took place, with the aim of bringing together projects and programmes to discuss cross-cutting themes and share experiences. I attended the theme on research data, which ran over 3 sessions, each focusing on a different aspect:

Session 1 - Legal and policy issues
This session followed the format of a debate, with Prof Charles Oppenheim arguing for the motion that institutions retain IPR and Mags McGinley arguing that IPR should be waived (with the disclaimer that neither presenter was necessarily representing their own or their institution's views).

Charles argued that institutional ownership encourages data sharing. Curation should be done by those with the necessary skills. Because curation involves copying, it can only be done effectively where the curator knows they are not infringing copyright, so the IPR needs to be owned "nearby". He also explained how publishers are developing an interest in raw data repositories and wish to own the IPR on raw as well as published data. There is a real need to discourage authors from blindly handing over the IPR on raw data. He suggested a model where the author is licensed to use and manipulate data (e.g. deposit in a repository) and has the right to intervene should they feel their reputation is under threat. The main argument focused on preventing unthinking assignment of rights to commercial publishers.

Mags suggested that curation is best done when no-one asserts IPR. There may in fact be no IPR to assert, and she explained that rights are often over-asserted. There is in general a lot of confusion and uncertainty around IPR, which leads to poor curation - Mags suggested the only way to prevent this confusion is to waive IPR altogether. More than ever, data is now the result of collaboration relying on multiple (and often international) sources, so unravelling the rights can be very difficult - there could be many, even hundreds of, owners across many jurisdictions. Mags concluded with the argument that it is easier to share data which is unencumbered by IPR issues and cited the examples of Science Commons and CC0.

A vote at this point resulted in: 5 for the motion supporting institutional ownership; 10 against; 7 abstaining.

A lively discussion followed - here are the highlights:
  • it's important to resolve IPR issues early
  • NERC model - researchers own IPR and NERC licenses it (grant T&Cs)
  • in order to waive your right, you have to assert it first
  • curation is more than just preservation - the whole point is reuse
  • funders have a greater interest in reuse than individual researchers - also have the resources to develop skills and negotiate T&Cs/contracts
  • not just a question of rights but responsibilities too
  • issues of long-term sustainability e.g. AHDS closure
  • incentives to curate - is attribution enough?
  • what is data? discussion covered a range of data, including primary data collected by the researcher, derived data and published results
  • are disciplines too different?
  • duty to place publicly funded research in the public domain? use of embargoes?
  • can we rely on researchers and institutions to curate?
  • "value" of data?
  • curation doesn't necessarily follow ownership - may outsource
  • proposal to change EU law on reuse of publicly funded research - HE now exempt - focuses on ability to commercially exploit - HEIs may have to hand over research data??
And finally, we voted again: this time, 6 for the motion; 14 against; 3 abstaining.

Session 2 - Capacity and skills issues
This session looked at 4 questions:
  1. What are the current data management skills deficits and capacity building possibilities?
  2. What are the longer term requirements and implications for the research community?
  3. What is the value of and possibilities for accrediting data management training programmes?
  4. How might formal education for data management be progressed?
Highlights of discussion:
  • who are we trying to train? How do we reach them? The need for training has to appear on their "radar" - best way to reach researchers is via lab, Vice-Chancellor, Head of School or funding source.
  • training should be badged e.g. "NERC data management training"
  • "JISC" and "DCC" less meaningful to researchers
  • a need to raise awareness of the problem first
  • domain specific vs generic training
  • need to target postgrads and even undergrads to embed good practice early on
  • need to cover entire research lifecycle in training materials
  • how is info literacy delivered in institutions now? can we use this as a vehicle for raising awareness or making early steps?
  • School of Chemistry in Southampton has accredited courses which postgrads must complete - these include an element of data management
  • lack of a career path for "data scientists" is a problem
  • employers increasingly looking for Masters graduates as perceived to be better at info handling
  • new generation of students - have a sharing ethic (web2.0) but not necessarily a sense of structured data management
  • small JISC-funded study to start soon on benefits of data management/sharing
  • can we tap into records management training? a role here for InfoNet?
  • can we learn from museums sector? libraries sector?
  • Centre for eResearch at King's is developing a "Digital Asset Management" course, to run Autumn 09
  • UK Council of Research Repositories has a resource of job descriptions
  • role of data curators in knowledge transfer - amassing an evidence base for commercial exploitation
  • also a need for marketing data resources

Session 3 - Technical and infrastructure issues

This session explored the following questions:

  • what are the main infrastructure challenges in your area?
  • who is addressing them?
  • why are these bodies involved? might others do better?
  • what should be prioritised over the next 5 years?
One of the drivers for addressing technical and infrastructure issues is the sheer volume of data – instruments are generating more and more data – and the volume is growing exponentially. It must be remembered that this isn't just a problem for big science – small datasets need to be managed too, although the problem here is more to do with the variety of data (heterogeneous) than the volume. It was argued that big science has always had the problem of too much data and has to plan experiments to deal with this, e.g. the LHC at CERN disposes of a large percentage of the data collected during experiments. In some areas, e.g. geospatial, data standards have emerged, but it may be a while before other areas develop their own or until existing standards become de facto standards.

Other areas touched on included:
  • the role of the academic and research library
  • roles and responsibilities for data curation
  • how can we anticipate which data will be useful in the future?
  • What is ‘just the right amount of effort’?
  • What are the selection criteria – what value might this data have in the future (who owns it, who’s going to pay for it), and how much effort and money would it take to regenerate this data (e.g. do you have the equipment and skills to replicate it?)
  • not all disciplines are the same therefore one size doesn't fit all
  • what should be kept? data, methodology, workflow, protocol, background info on researcher? How much context is needed?
  • how much of this context metadata can be sourced directly e.g. from proposal?
  • issues of ownership determine what is stored and how
  • what is the purpose of retaining data - reuse or long-term storage? Should a nearline/offline storage model be used? Infrastructure for reuse may be different from that for long-term storage.
  • Should we be supporting publication of open notebook science? (and publishing of failed experiments). What about reuse/sharing if there are commercial gains?
The summing up at the end identified 4 main priority areas for JISC:
  1. within a research environment – can we facilitate data curation using the carrot of sharing systems? (IT systems in the lab)
  2. additional context beyond the metadata
  3. how do we help institutions understand their infrastructural needs
  4. what has to happen with the various dataset systems (Fedora etc.) to help them link with the library and institutional systems

1 comment:

Anonymous said...

The big issue with failed experiments is that scientists are not keen on putting a lot of time into writing reports about them. Using an Open Notebook solves that problem because it does not require additional time to process - scientists keep lab notebooks anyway. But you will never get a peer-reviewed (or even PI reviewed) true Open Notebook getting updated several times a day. There is value in the information there but people need to understand what they are looking at.