UKOLN Informatics Research Group » storage http://irg.ukoln.ac.uk Expertise in digital information management Mon, 09 Dec 2013 15:09:09 +0000 en-US hourly 1 http://wordpress.org/?v=3.5.2 Defining institutional data storage requirements http://www.dcc.ac.uk/blog/defining-institutional-data-storage-requirements?utm_source=rss&utm_medium=rss&utm_campaign=defining-institutional-data-storage-requirements-2 http://www.dcc.ac.uk/blog/defining-institutional-data-storage-requirements#comments Mon, 18 Mar 2013 10:44:55 +0000 jonathan.rans http://blogs.ukoln.ac.uk/irg/?guid=d4f05db38fcdf5fa6f27d5627690e167 Institutions developing infrastructure in support of research data management are engaging with a whole range of issues, both cultural and technical. One that stands out as a clear priority is that of research data storage, both for “live” data, during the active phase of research, and post-project archiving.

TMTOWTDI: Three data repository experiences in UK HE http://www.dcc.ac.uk/blog/tmtowtdi-three-data-repository-experiences-uk-he?utm_source=rss&utm_medium=rss&utm_campaign=tmtowtdi-three-data-repository-experiences-in-uk-he http://www.dcc.ac.uk/blog/tmtowtdi-three-data-repository-experiences-uk-he#comments Wed, 21 Nov 2012 12:36:08 +0000 Monica Duke http://irg.ukoln.ac.uk/?guid=376188f824abad759632b7f39daef523 Three data repository experiences at UK Higher Educational institutions: A report on parallel session 2B on data repositories and storage (group 1) at the JISC Managing Research Data Meeting in Nottingham, 24 October 2012.

This session on data repositories and storage considered the repository systems that support storing, depositing and exposing data at three universities. We heard from the University of Hull, the University of Hertfordshire and the University of Lincoln, where repository systems have been set up or extended through funded projects.

MRD Hack Days: File backup, sync and versioning, or “The Academic Dropbox” http://feedproxy.google.com/~r/Research360/~3/_s21XhJDI-A/?utm_source=rss&utm_medium=rss&utm_campaign=mrd-hack-days-file-backup-sync-and-versioning-or-the-academic-dropbox http://feedproxy.google.com/~r/Research360/~3/_s21XhJDI-A/#comments Fri, 04 May 2012 14:09:19 +0000 Jez Cope http://blogs.bath.ac.uk/research360/?p=239 This post emerged from discussions at the JISC MRD Hack Days, particularly with Joss Winn of the University of Lincoln’s Centre for Educational Research and Development. The event brought together developers and data management experts for two intensive days to discuss and prototype tools for research data management.

Joss has also written a more discursive post about our discussions of file synchronisation, particularly with respect to handling of large files.

For a bit of context, both Joss and I make regular use of Dropbox and Git where appropriate.

The problem

Many researchers store the majority of their live data on local disks, with little or no redundancy, leaving them open to data loss through accident or theft. To solve this problem, we provide research users with high resilience, high performance, high capacity network storage, but in spite of these advantages, they often don’t use it as well as they might.

Another requirement is for easy sharing of files. Most data sharing still takes place via email.

The main reason network storage goes underused is that many researchers do a lot of work on their laptops, in locations where their network connection may be intermittent, slow or completely absent. On the train, on a plane, in a cafe, they need access to some or all of their data wherever they are.

When faced with this problem, many researchers turn to Dropbox because it is easy to use and requires no user interaction beyond the initial setup. However, there are serious issues with using Dropbox to store research data, primarily the fact that confidential data is being stored on servers outside the institution’s control.

What is needed is a tool to transparently synchronise local and network storage, effectively providing an offline cache which provides the convenience and speed of local disk access combined with the resilience of network attached storage.

Desired features

Control over storage locations

For confidential research data, it is highly desirable for all copies to remain under the control of the institution(s) who are responsible for looking after it. Any solution should at least have the option of storing data on a university-run storage service.

Encrypted network transfer

Control of the storage locations is not sufficient on its own. If data is sent over the internet with weak (or worse, non-existent) encryption, it can easily be intercepted by an attacker. Strong encryption should be used to protect all data.

Large file support

Many of the files which researchers routinely work with may be tens or hundreds of megabytes, or in some cases gigabytes or terabytes. Clearly, there’s a limit to this – it’s reasonable to expect that researchers will have to manage files of a gigabyte or more differently. But a suitable solution should at least work well for files of tens or hundreds of megabytes.

Two important factors spring to mind here. First, the user needs feedback about the progress of a sync so that they aren’t surprised when changes they were expecting haven’t propagated yet. Second, the tool needs to gracefully handle a user cancellation or a dropped connection without losing or corrupting data. Ideally, if this happens it should be able to resume where it left off.
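
As a rough illustration of the two factors above, here is a minimal Python sketch of a copy that reports progress and picks up where an interrupted run left off. The function name, the chunk size and the `progress` callback are all assumptions made for this example, and it assumes the source file has not changed since the interrupted attempt; a real tool would check that before resuming.

```python
import os

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary illustrative value


def resumable_copy(src, dest, progress=None, chunk_size=CHUNK_SIZE):
    """Copy src to dest, continuing from any partial copy left by an earlier run.

    Assumes src is unchanged since the interrupted attempt; a real tool
    would verify this (for example with checksums) before resuming.
    """
    total = os.path.getsize(src)
    done = os.path.getsize(dest) if os.path.exists(dest) else 0
    if done > total:
        done = 0  # the partial copy is longer than the source, so start again

    # "r+b" keeps the bytes already copied; "wb" starts from an empty file.
    mode = "r+b" if done else "wb"
    with open(src, "rb") as fin, open(dest, mode) as fout:
        fin.seek(done)
        fout.seek(done)
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            done += len(chunk)
            if progress is not None:
                progress(done, total)  # feedback so the user can see how far it has got
    return done
```

Calling `resumable_copy("results.dat", "/mnt/netstore/results.dat", progress=print)` (both paths hypothetical) would print a running byte count, and rerunning the same call after a dropped connection would continue from the bytes already written.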

Conflict resolution

Once you have two copies of your files, you have two different places to modify them, giving you the possibility of making different changes to the same file prior to synchronising. This becomes even more likely when you are sharing the same files between multiple users.
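
One common way for a tool to tell a genuine conflict from an ordinary update is to remember what each file looked like at the last successful sync and compare both copies against that. The Python sketch below shows the idea; the function names and the returned labels are invented for this example and are not taken from any of the tools discussed here.

```python
import hashlib


def digest(data: bytes) -> str:
    """Fingerprint a file's content so that copies can be compared cheaply."""
    return hashlib.sha256(data).hexdigest()


def classify(base: str, local: str, remote: str) -> str:
    """Decide what to do with one file.

    base   -- digest recorded at the last successful sync
    local  -- digest of the local copy now
    remote -- digest of the network copy now
    """
    if local == remote:
        return "in sync"       # nothing to do (or both sides made the same change)
    if local == base:
        return "pull remote"   # only the network copy changed
    if remote == base:
        return "push local"    # only the local copy changed
    return "conflict"          # both changed differently: ask the user or try a merge
```

Only the final case needs the user's attention; the other three can be handled silently.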

Metadata

Storing metadata along with data for later (perhaps automated) deposit in a repository is a core research data management practice. Metadata tends to be readily available only when the data is created, but is often useful only when the data is finally published or archived. With a dedicated “Academic Dropbox”, it may be possible for users to associate metadata directly with files at creation time, and then keep that metadata with the file throughout its life, through to deposit in an archive.
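
One simple way to keep metadata with a file, sketched here in Python, is a "sidecar" file stored next to the data and synchronised along with it. The `.meta.json` suffix and the field names used below are assumptions made for this example, not an established convention.

```python
import json
from pathlib import Path


def sidecar_path(data_file):
    """Return the sidecar path for a data file, e.g. run1.csv -> run1.csv.meta.json."""
    p = Path(data_file)
    return p.with_name(p.name + ".meta.json")  # hypothetical naming convention


def read_metadata(data_file):
    """Return the stored metadata, or an empty dict if none has been recorded."""
    path = sidecar_path(data_file)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def write_metadata(data_file, **fields):
    """Merge the given fields into the sidecar file and return its path."""
    metadata = read_metadata(data_file)
    metadata.update(fields)
    path = sidecar_path(data_file)
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8")
    return path
```

For instance, `write_metadata("run1.csv", creator="A. Researcher", instrument="spectrometer 2")` records details at creation time that a deposit tool could read back later with `read_metadata("run1.csv")`.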

Existing solutions that might work

Here is a whistlestop tour of some of the options we dug up. I’ve listed the pros and cons as I see them (though feel free to correct/update me in the comments), and I’ve not commented on features for which I couldn’t find enough information to judge.

Unison

http://www.cis.upenn.edu/~bcpierce/unison/

I (Jez) use this daily.

Pros

  • Excellent handling of conflicts (choose which copy to use, or merge the two where possible)
  • Allows synchronisation of arbitrary pairs of folders, so will work with any storage that can be mounted by the user
  • Gracefully handles interruptions of the transfer, and restarts next time from where it left off
  • Uses an rsync-like algorithm to transfer only the parts of files which have changed
  • Can transfer files over an encrypted SSH connection

Cons

  • Requires initial configuration by the user: in Mac or Linux this can only be done by editing configuration files, though the Windows client has a
  • Needs to be explicitly run by the user, which is easy to forget; could be run on a schedule, but this typically requires
  • For large sets of files, scanning for changes can take a long time, particularly over the network

Rsync

http://rsync.samba.org/

Pros

  • Allows synchronisation of arbitrary pairs of folders, so will work with any storage that can be mounted by the user
  • Very flexible, and operates without user interaction, so can be adapted to fit many situations
  • Uses the rsync protocol to transfer only the parts of files which have changed
  • Can transfer files over an encrypted SSH connection

Cons

  • Syncs in one direction only, so full synchronisation requires two runs, one in each direction
  • Command line tool has many complex options, and the available GUIs only go a small way to improve this, so it can be difficult for non-technical users to understand
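
Transferring "only the parts of files which have changed" can be illustrated with a deliberately simplified Python sketch. It compares blocks at fixed offsets only, whereas rsync's real algorithm uses a rolling checksum so that matching blocks are found even when data has been inserted or shifted. All names here are made up for the example.

```python
import hashlib

BLOCK_SIZE = 4  # tiny so the example is easy to follow; real tools use far larger blocks


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of the receiver's existing copy."""
    return [
        hashlib.sha256(data[start:start + block_size]).hexdigest()
        for start in range(0, len(data), block_size)
    ]


def make_delta(new, old_hashes, block_size=BLOCK_SIZE):
    """On the sender: keep only the blocks whose hash differs from the receiver's."""
    changed = {}
    for index, start in enumerate(range(0, len(new), block_size)):
        block = new[start:start + block_size]
        same = index < len(old_hashes) and old_hashes[index] == hashlib.sha256(block).hexdigest()
        if not same:
            changed[index] = block
    return len(new), changed


def apply_delta(old, delta, block_size=BLOCK_SIZE):
    """On the receiver: rebuild the new file from the old copy plus the changed blocks."""
    length, changed = delta
    out = bytearray()
    for index, start in enumerate(range(0, length, block_size)):
        out += changed[index] if index in changed else old[start:start + block_size]
    return bytes(out)
```

The receiver sends `block_hashes(old)` to the sender, the sender replies with `make_delta(new, hashes)`, and only the changed blocks cross the network.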

Git and other distributed version control systems (DVCS)

http://git-scm.com/

Both Joss and I (Jez) use this regularly.

Pros

  • Can transfer files over an encrypted SSH or HTTPS connection
  • Outstanding conflict resolution by intelligent merging of files
  • Support for common software development activities such as branching (e.g. to make experimental changes)

Cons

  • All actions require manual running of commands, either via a command line or a GUI, so requires quite a major change to the user’s workflow
  • Merging only works well for text-based file formats, though it is possible with some work to use alternative merge tools for, say, Word documents
  • Poor handling of large binary files generally, although extensions are available to mitigate this (see below)

Large file extensions to DVCS

E.g. git-bigfiles http://caca.zoy.org/wiki/git-bigfiles, git-media https://github.com/schacon/git-media, git-annex http://git-annex.branchable.com/, mercurial large files extension http://mercurial.selenic.com/wiki/LargefilesExtension, and also Boar https://code.google.com/p/boar/

Boar is a VCS designed specifically to work with large files, while the others are extensions to existing version control systems.

Pros

  • Similar to DVCS (above), but vastly improved handling of large binary files (reduced memory requirement, for example)

Cons

  • Similar to DVCS, but requires additional configuration

SparkleShare

http://sparkleshare.org/

Pros

  • Very little configuration required to achieve similar results to Dropbox
  • Can store data anywhere a Git repository can be placed but there is potential to build alternative storage backends
  • Git features not exposed by the SparkleShare interface can be accessed using other git-based tools

Cons

  • Software is very new and seems unstable (it crashed a few times for me under Mac)

Sharebox

https://github.com/chmduquesne/sharebox-fs

Pros

  • Implemented as a filesystem, so completely transparent to the user once installed
  • Uses git as a backend, so shares many of its advantages, including the ability to transfer data in encrypted form

Cons

  • Still very early in development so difficult to get working and only available on Linux

Oxygen Cloud

https://oxygencloud.com/

Pros

  • Commercial offering with enterprise support available
  • End-to-end strong encryption to protect confidential data
  • Option to use your own (institutional) storage instead of the provided cloud storage
  • Access via iOS and Android smartphones

Cons

  • Enterprise service with commercial pricing

Summary

There’s no simple solution to this, but we now have a whole range of things to try and to suggest that our users try. Who knows, some of them might even work!

Hitachi Content Platform object store arrives http://blogs.bath.ac.uk/research360/2012/03/hitachi-content-platform-object-store-arrives/?utm_source=rss&utm_medium=rss&utm_campaign=hitachi-content-platform-object-store-arrives http://blogs.bath.ac.uk/research360/2012/03/hitachi-content-platform-object-store-arrives/#comments Mon, 12 Mar 2012 14:05:08 +0000 Jez Cope http://blogs.bath.ac.uk/research360/?p=214 HCP object store

Just a little bit of geekery really. The object store that I mentioned the other week has now arrived! The cabinet on the left in the photo is almost entirely disks.

We have one for each data centre, but installing them into our existing infrastructure is a non-trivial task, so it’ll be a while before they’re in service.

Object stores http://blogs.bath.ac.uk/research360/2012/02/object-stores/?utm_source=rss&utm_medium=rss&utm_campaign=object-stores http://blogs.bath.ac.uk/research360/2012/02/object-stores/#comments Mon, 20 Feb 2012 13:08:58 +0000 Jez Cope http://blogs.bath.ac.uk/research360/?p=176 Kitchen Shelves

Although my involvement in Research360 is at the level where technology and people interact, I’m also doing my best to understand how our infrastructure is developing at a much lower level so that I’m in a position to better advise non-technical stakeholders.

Bath University Computing Services (BUCS) are currently in the process of procuring a new file store which works in a very different way to our existing storage systems, and I recently had the opportunity to learn more about it from our Database & Systems Manager, Paul Jordan. Since this is a very new area for me, my apologies to you and him for anything that I’ve got wrong.

Like our existing storage, this will be arranged into tiers, with Tier 1 containing the most expensive storage with the quickest access times, and lower tiers providing slower but cheaper storage. Data will be moved between tiers automatically (and invisibly to users) based on configured policies.
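
To make the idea of policy-driven tiering concrete, here is a small Python sketch of one possible policy, based purely on how long a file has gone without being accessed. The thresholds and tier numbers are hypothetical and are not taken from the system described here.

```python
from datetime import datetime, timedelta

# Hypothetical policy: (minimum idle time, tier), checked from coldest to warmest.
POLICY = [
    (timedelta(days=365), 3),  # untouched for a year: cheapest, slowest tier
    (timedelta(days=30), 2),   # untouched for a month: middle tier
    (timedelta(0), 1),         # anything more recent stays on the fastest tier
]


def choose_tier(last_accessed, now, policy=POLICY):
    """Return the tier a file should live on, given when it was last accessed."""
    idle = now - last_accessed
    for minimum_idle, tier in policy:
        if idle >= minimum_idle:
            return tier
    return policy[-1][1]  # fall back to the warmest tier (e.g. clock skew)
```

A background job could call `choose_tier` for each file and move any whose current tier no longer matches, without users having to do anything.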

Where this new storage differs from our existing systems is that the lowest tier will not be a tape carousel, but an “object store”. Where a traditional file system stores data in an ordered, hierarchical way, an object store stores individual data objects in a flat namespace.
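
The difference can be sketched with a toy in-memory model in Python: every object lives in one flat mapping from an identifier to its data, and anything resembling a folder path is just metadata attached to the object. The class and method names are invented for this illustration and do not describe any particular product.

```python
import uuid


class ToyObjectStore:
    """In-memory stand-in for an object store: one flat mapping from ID to object."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        """Store an object and return its opaque identifier."""
        object_id = uuid.uuid4().hex  # no directories, just a unique key
        self._objects[object_id] = {"data": data, "metadata": dict(metadata)}
        return object_id

    def get(self, object_id):
        """Return the stored bytes for an identifier."""
        return self._objects[object_id]["data"]

    def metadata(self, object_id):
        """Return a copy of the metadata stored alongside the object."""
        return dict(self._objects[object_id]["metadata"])
```

A file-system layer on top would only need to keep a table from paths such as `/projects/run1.csv` to these identifiers.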

The major advantage of this is that much more of the available space on the physical disks can be used to store actual user data: the overhead is much lower than for traditional filesystems. By virtualising storage across a network in a new way, it’s also very much more scalable than anything we currently use, and we could easily grow this to the petabyte level or expand out into the cloud if need be.

Now, most users need never know that their data is stored in an object store, just as they don’t need to know whether the disks were made by Hitachi or Western Digital. An extra layer on top does some translation, allowing you to store files over the network just like any other network-attached storage (NAS). Users can access it via a mapped drive in Windows or an NFS mount.

However, the object store is also accessible directly via a RESTful API over HTTP/HTTPS (in fact, that’s how the NAS layer interacts with it too). Despite being sold as a replacement for tape archival, it’s very quick to access over the network, and authentication of users via LDAP or Active Directory is also built in. In addition to this, an object store can perform other clever functions during or after ingestion, such as transforming data into other formats or making use of metadata.
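
In general terms, talking to a store over a RESTful HTTP API means issuing a PUT to write an object and a GET to read it back. The Python sketch below builds such requests using only the standard library. The base URL and path layout are hypothetical, authentication is left out entirely, and none of this reflects the actual interface of any specific product.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://objectstore.example.ac.uk/rest"  # hypothetical endpoint


def object_url(object_path):
    """Build the URL for an object, percent-encoding unsafe characters."""
    return f"{BASE_URL}/{urllib.parse.quote(object_path.lstrip('/'))}"


def build_put(object_path, data, content_type="application/octet-stream"):
    """Prepare (but do not send) a request that would store data at object_path."""
    return urllib.request.Request(
        object_url(object_path),
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def build_get(object_path):
    """Prepare (but do not send) a request that would fetch the object."""
    return urllib.request.Request(object_url(object_path), method="GET")
```

Passing either request to `urllib.request.urlopen` would send it; a real deployment would also need whatever credentials the store requires.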

It therefore seems like the perfect back-end to a digital repository such as EPrints, DSpace or Fedora. A load of overhead could be cut down by having the repository target the object store directly, rather than doing so via files on a virtual file system using the NAS layer.

Alternatively, if the object store itself is clever enough, it could be used directly as a repository, using only a very thin user interface on top. A SWORD2-compliant interface would open up even more options.

If you’re interested in learning more, there are a number of white papers and other resources available on the Hitachi Content Platform web page.

Are other institutions implementing similar types of storage? Is it possible to integrate a repository with an object store directly via HTTP and if so has it been done?

It would be interesting to hear from anyone else who’s come across anything similar.

Image credit: Kitchen Shelves by John Martinez Pavliga

IDCC11 Session 3B: environmental data http://www.dcc.ac.uk/node/9314?utm_source=rss&utm_medium=rss&utm_campaign=idcc11-session-3b-environmental-data http://www.dcc.ac.uk/node/9314#comments Fri, 09 Dec 2011 12:57:06 +0000 kerry.miller http://irg.ukoln.ac.uk/?guid=6acda243a4bcbc18c996d6037c17fcdb The presentations in this session looked at different aspects of data creation, storage, and sharing in environmental research. A significant theme was the need to encourage data sharing by researchers by providing them with suitable rewards.
