Making Sense of File Systems Through Provenance and Rich Metadata
Published as
Storage Systems Research Center
Technical Report UCSC-SSRC-12-01.
Abstract
Modern high-end computing systems store hundreds of petabytes of data
and have billions of files, as many files as the internet of only a
few years ago. Even modern personal computers store numbers of files
that would be massive for the largest mainframe computers of 40 years
ago. The quantity of data in modern computing has long since
overwhelmed anyone’s ability to manage it manually, and the 40-year-old
tools currently in use for file finding and management are
reaching the limits of scale. In an environment like this, secure,
effective, and efficient search algorithms and automatic file
management become a necessity, not a nicety. Our proposal addresses
the question of how users can quickly find and manage files without
burdening the file system with expensive brute-force searches or requiring
the user to become an expert in query languages. We propose a number
of algorithms to improve file management in a large-scale scientific
computing environment. By collecting new metadata, including file
system provenance, we propose to provide new ranking algorithms which
are efficient and effective on large multi-user file systems. We
intend to reduce the burden of file naming, allowing the system to
generate expressive, unique file names on the fly; we have identified
a statistical property of data that is likely to select meaningful
attributes for file names. Since security is a concern on many
large scientific computing systems, we intend to analyze the security
properties of the proposed ranking algorithms and demonstrate how our
ranking algorithm degrades gracefully from the ideal ranking when applied
in a setting with restrictive security permissions. We will validate
our results using real-world scientific data and provide statistical
analyses of rich metadata and provenance from this data. We will also
validate our ranking and naming algorithms through a series of in situ
user studies. Modern data management must be automatic and scalable,
allowing users and file systems to focus on what each does best. By
exploiting patterns of human behavior, the system can provide faster
searches and more interpretable interfaces to the file system. Data growth
is not expected to level off anytime soon, and file systems must be
ready to handle the load.
BibTeX entry
@techreport{parkerwood-ssrctr-12-01,
author = {Aleatha Parker-Wood and Darrell D. E. Long and Ethan L. Miller
and Margo Seltzer and Daniel Tunkelang},
title = {Making Sense of File Systems Through Provenance and Rich Metadata},
institution = {University of California, Santa Cruz},
number = {UCSC-SSRC-12-01},
month = mar,
year = {2012},
}
Last modified 3 Oct 2012