• tommaguire

    Ok, so in the digital world the distinction that you make between data and metadata gets AWFULLY blurry. If I'm searching for “Angels and ministers of grace defend us!”, is the full-text index of Shakespeare's “Hamlet” the data of the play or metadata of the play? The distinction between data and metadata has more to do with the physical nature of things and the categorization of those physical things. For digital objects, that really fades away.

    In the end you can never hope to capture all the “metadata” in the object, so you are going to be forced to deal with your option 2 anyway.

    To me, it's all just linked-data anyway….

  • Tom,

    thanks… in the end, I'd definitely agree with the “it's all linked data anyway” sentiment, but the method of linkage is the important part. If any of those links “break,” you've effectively “lost” or “isolated” data that, depending on its importance to your business, could be critical. *shrug* Hopefully, I'll be able to dig a bit deeper into Atmos' method of metadata linkage.

    cheers,

    Dave

  • tommaguire

    I agree that lost or isolated data may be important to a business. But what counts as a “broken link”? It could mean:
    1) the domain hosting the link is gone
    2) the network is down
    3) the “linked-data” resource has been destroyed/deleted
    4) the “linked-data” resource has been migrated/moved/archived

    Most of these point to a more interesting question of ownership, not linkage. If you assume you “own” all the data, then you either have control over the above situations or you don't. If you don't have control and you need to mitigate, then the typical strategy is to “copy” the data locally. IMO, that has just as many downsides (perhaps more) as “lost” or “broken” links. Missing data is better than stale/bad/incorrect data that is out of sync with the “authoritative” data.

  • I'm going to posit that there's a level of persistence WITHIN the cloud datastore that may not be “externally” evident, whereby these links (UID to meta db records) are continually checked for validity. Atmos uses a “garbage collect” process that looks at this type of thing (orphaned objects, links, bad/missing UIDs) and scrubs them, thus maintaining a cleaner underlying cFS (cloud file system).
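
    As a rough illustration of that scrub, here's a minimal sketch (my own toy implementation, not Atmos internals): it cross-checks the set of stored object UIDs against the metadata records and flags whatever doesn't line up.

    ```python
    def scrub(object_ids, meta_db):
        """One pass of a garbage-collect-style consistency check.

        object_ids: iterable of UIDs present in the object store
        meta_db:    dict mapping UID -> metadata record
        """
        stored = set(object_ids)
        indexed = set(meta_db)

        orphaned_objects = stored - indexed   # data with no metadata record
        dangling_links = indexed - stored     # metadata pointing at missing data

        for uid in dangling_links:
            del meta_db[uid]                  # drop links that resolve nowhere
        return orphaned_objects               # queue these for repair or deletion
    ```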

    Interesting note on ownership, to which I'd say that there has to be dual ownership: one at the system level (with immutable meta such as creation date, etc.) and one at the user level (mutable, user-generated meta). The meta db then needs to maintain and track two different levels. Policy can affect either, fwiw.
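
    To make the two levels concrete, here's a small sketch (the field names are my own invention, not any Atmos schema): the system-level meta is frozen at creation, while the user-level meta stays freely mutable.

    ```python
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass(frozen=True)
    class SystemMeta:
        """Immutable, system-generated metadata (hypothetical fields)."""
        object_id: str
        created_at: datetime
        size_bytes: int

    @dataclass
    class ObjectMeta:
        """One metadata record tracking both ownership levels."""
        system: SystemMeta                                   # set once by the platform
        user: dict[str, str] = field(default_factory=dict)   # freely mutable

        def set_user_meta(self, key: str, value: str) -> None:
            self.user[key] = value                           # user level: mutable

    record = ObjectMeta(
        system=SystemMeta("obj-123", datetime.now(timezone.utc), 4096),
        user={"project": "q3-campaign"},
    )
    record.set_user_meta("owner", "dave")    # allowed
    # record.system.created_at = ...         # would raise FrozenInstanceError
    ```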

    dave

  • tommaguire

    I think we are talking at several different levels. So let's constrain the conversation about “linked-data” to Atmos cloud storage. There is a resource that represents the content object and there is a resource that represents the “metadata” object. What you are suggesting is that in the cloud storage world you'd like to see the consistency model for the relationship between the metadata and the data ruled by “ACID”-like semantics (perhaps I'm overstating your position).
    I would argue that “linked-data” requires the polar opposite semantics, where the user of the data needs to assume that data will be missing and inaccessible (read: 404). The ownership point I was making before is part of this discussion; can I assume that I can make geographically distributed updates (or even just garbage-collection validations), do I have the rights? Even if I have the rights, can I tolerate the latency?
    IMO, in this world we need to move to “BASE” semantics (Basically Available, Soft state, Eventually consistent). It is more like the way the web works today, and isn't that the point of cloud storage?
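
    A client written for BASE semantics treats a 404 as an expected, possibly temporary state rather than a failure. A minimal sketch (fetch is a hypothetical call returning (status, body), not any real Atmos API):

    ```python
    import time

    def tolerant_read(fetch, uid, retries=3, backoff=0.5):
        """Read under BASE assumptions: absence (a 404) is a normal outcome,
        and a later retry may see the data once replicas converge."""
        for attempt in range(retries):
            status, body = fetch(uid)
            if status == 200:
                return body                           # possibly stale, but available
            if status == 404:
                time.sleep(backoff * (2 ** attempt))  # eventual consistency: wait, retry
                continue
            raise RuntimeError(f"unexpected status {status}")
        return None                                   # caller handles "not there (yet)"
    ```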

  • mtimjones

    Nice food references — making me hungry.

    Having separate metadata (such as in a database) allows the metadata to be queried much more easily than if the metadata is embedded with the data. But as you point out, there are trade-offs all around… Perhaps a hybrid approach is best?

  • Tim,

    Definitely think that hybrid models work, especially as you look to bridge between “classic” block 'n file environments (where metadata is 99% system generated) and object environments where the emphasis is on user-generated content, NOT system meta. (Objects, by nature, obscure the meaning of the underlying file and thus require descriptors.)

    dave

  • Ahh, but you can have your Bacon and Scallops separately, yet not be concerned about database replication if the storage interface supports it. Examples are the XAM standard as well as the new SNIA Cloud Storage standard. Even filesystems with extended attributes let you keep separate metadata around, although they lack the needed query capability.
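
    To illustrate that last point, extended attributes keep metadata glued to the file with no database at all. A minimal sketch using Linux xattrs via Python's os module (Linux-only; the attribute names here are just examples). The trade-off Mark mentions shows up at the end: retrieval is per-file, with no index to query across files.

    ```python
    import os

    path = "recipe.txt"
    with open(path, "w") as f:
        f.write("bacon-wrapped scallops")

    # Metadata rides alongside the file itself -- no separate database.
    # (os.setxattr is Linux-only; unprivileged attrs use the "user." namespace.)
    os.setxattr(path, b"user.author", b"dave")
    os.setxattr(path, b"user.course", b"appetizer")

    # But "querying" means walking files and reading attributes one by one;
    # there is no index to ask "give me every appetizer".
    for name in os.listxattr(path):
        print(name, os.getxattr(path, name))
    ```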

    — mark

  • Pingback: Twitter Trackbacks for Micro-burst: Metadata — Dave Graham's Weblog [flickerdown.com] on Topsy.com

  • As much as I love bacon, you have to separate the object from the metadata. When dealing with auditing, the metadata includes some of that audit information. Well, if I capture that someone changed the object, I have to update the metadata to track that, and if the metadata lives inside the object, that means changing the object itself.
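
    Here's a toy sketch of that separation (an in-memory dict standing in for the metadata database): the audit trail grows in the metadata store while the object bytes stay untouched, which is exactly what you want for immutable/WORM content.

    ```python
    import hashlib, json, time

    audit_log = {}   # object_id -> list of entries (stand-in for a metadata DB)

    def record_change(object_id, payload: bytes, who: str):
        """Append an audit entry in the *separate* metadata store.

        The object bytes are never rewritten; we only fingerprint them,
        so a WORM/immutable object stays byte-identical.
        """
        entry = {
            "who": who,
            "when": time.time(),
            "sha256": hashlib.sha256(payload).hexdigest(),
        }
        audit_log.setdefault(object_id, []).append(entry)

    record_change("obj-123", b"original bytes", who="alice")
    print(json.dumps(audit_log["obj-123"], indent=2))
    ```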

    The challenges of the second model have been encountered for years in the Enterprise Content Management world. How do you keep the metadata in your database in sync with content on the SAN for backups when users access the system all the time? It is a headache, but not an unknown one.

    There are systems that can do this, though. If you just want a good, scalable database to house the metadata in, look at xDB, formerly X-Hive. Conveniently enough, it's owned by EMC.

    -Pie

  • Mark,

    thanks for the comments! I'm very well aware of the XAM standard as well as the (in-progress) SNIA Cloud Storage Standard (I'm a member of EMC's SNIA group, though a silent one) and, frankly, I see great things being done there.

    As I've said elsewhere (the first “micro-burst”), these articles are meant to be more Socratic than anything else, as I'm really not trying to choose one particular position over the other. In cFS terms (cloud file systems), the capability to expand beyond a limited NAS role has to be tied to forward-looking object-based storage (imho, of course). Replicating the metadata is obviously part of that process, and in the Atmos world, we're doing synchronous metadata replication along with background consistency checking for “issues” that can arise from running separate and discrete object/meta repositories.

    as always, keep up the constructive comments!

  • @piewords (Pie),

    Thanks a lot for the feedback! I agree that there's potentially greater flexibility in running meta separate from object and, while I'm not intimately familiar with ECM and its requirements, I'm becoming more aware as time moves on. (Thanks to the likes of Craig Randall and others in Documentum!) 😉 I know we're doing some REALLY cool things with some of the other groups within EMC using Atmos and Atmos Online (that I can't discuss right now… stay tuned!), and I think that these two services (using a common REST API) are just the ticket for keeping meta and data functional, flexible, and powerful!

    cheers,

    Dave

  • OK, now you've just made me hungry. Looking at my shrink-wrapped ham sandwich, I am not inspired/impressed…

    Anyway, to metadata. It's not all about protection schemes, etc.; most of the value from metadata comes from the ability to search. To find or not to find, etc. It's the bit that interests me most, anyway.

    1. Wrapping it around the data implies proprietary formats and thus platform and vendor lock-in. For archive this won't do. Your data should outlive your vendor.

    2. Metadata is a side dish best served cold. OK, if you have metadata in a DB loosely coupled to the data itself, then there will be an inevitable split one day. The old adage of “If you can't find it, you do not have it” rings true here. Lose the DB, lose the data. Potentially.

    3. Metadata objects. Going for a bit of metadata on the side? Then why not tie the metadata to the object so that they will always live on the same archive media (same node, tape, etc.). Build an in-memory DB on each node that looks after search for its own data. Cluster those nodes. Have those nodes do fancy self-healing/failover. Clustered, distributed search.

    4. Filesystem support. File systems like XFS support metadata for files on the FS. Pretty sure ZFS offers the same. Data and metadata tied together using the FS. Open(ish).

    5. Hybrid. Go for a 3 + 4 cocktail. Use some open format (XML, Java class) to store your MD as a file on the FS. Tie the file to the content (GUIDs, etc.). Have that backed up by using an FS that supports MD natively. (See the sketch below.)
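
    A minimal sketch of that 3 + 4 cocktail (the sidecar naming and field layout are my own, purely for illustration): the content and an open XML metadata file land on the same filesystem, tied together by a GUID that's also mirrored into a native xattr.

    ```python
    import os
    import uuid
    import xml.etree.ElementTree as ET

    def write_with_sidecar(path: str, payload: bytes, meta: dict) -> str:
        """Store content plus an open-format (XML) metadata file on the same FS."""
        guid = str(uuid.uuid4())
        with open(path, "wb") as f:
            f.write(payload)

        root = ET.Element("metadata", guid=guid)   # GUID ties sidecar to content
        for key, value in meta.items():
            ET.SubElement(root, "field", name=key).text = value
        ET.ElementTree(root).write(path + ".meta.xml")

        # Optionally mirror the GUID into a native FS attribute (Linux xattr),
        # so either copy can re-find the other after a move or restore.
        os.setxattr(path, b"user.guid", guid.encode())
        return guid

    write_with_sidecar("clip.mov", b"...", {"title": "Act I", "codec": "ProRes"})
    ```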

    Just my musings..

    om_nick

  • Nick,
    definitely love the points you've made! One of the bigger issues I see out there is block/file storage companies who are trying to “trespass” and reinvent themselves as “cloud storage providers” without a thought (or maybe it's just not apparent yet) for how to accomplish this. The beauty of the Atmos product family is that we've tackled some of these proposals (e.g., your point #3) by using a self-healing filesystem that runs consistency checks to ensure appropriate linkage between the meta db and the data. From the ground up, we designed this for scalable, large/small object storage and put a heck of a lot of effort into making that data accessible. 😉

    again, thanks for your comments!

    dave

  • om_nick

    Not simply a problem for the cloud. For those not wanting to reach for the skies, there are storage solutions available now that offer this.

    Glad you tackled metadata, as for us it is becoming as important as the essence itself.

  • I think I also missed commenting on one of the principles driving the use of metadata (which you and others like @piewords have mentioned): search! Metadata is absolutely critical for the ongoing control and (dare I say it?) manipulation of content. By giving you hooks into the underlying object, it lets you operate against an object in ways that you couldn't before.

    Additionally, you're right in saying that this isn't just a cloud problem. I look at the previous work I've done with SAN-based products, and we're so used to looking at data as 01010101 versus a more descriptive method of understanding. I think Tom Maguire made the point earlier that a movie file (for example) is nothing but a file name in a file system until you start describing it and providing characterization against it for your programs. (Since I know you do stuff with Final Cut Pro.) Sorry, my thoughts this morning are a little less collected than yesterday's, but I appreciate the dialogue!

    cheers,

    Dave

  • om_nick

    As well as all the other good things (clustered storage stylee) that MatrixStore does, search is a biggie. Like you say, we tie into Final Cut Server and other MAM/DAM apps to pull out the important stuff so that it can be archived with the data.

  • Dave,

    Sorry for the delay; good post and blog comments!

    I've tried to collate some of my thoughts and comments here :-

    http://grumpystorage.blogspot.com/2009/08/objec

    Cheers

    Ian

  • dave

    Thanks for the read, Ian! Definitely happy to take a look at what you've written!

  • I'm a little late to this conversation, but it's important and one I sure hope continues. Metadata and storage is an area I've been involved in since the late '90s, and I've developed product strategy for content management and index/search software as well. The value of metadata was little understood in storage back in the old days, and it was a challenge to get the concept across. The best way to consider metadata is to use a metaphor we're all familiar with: the library. In fact, it's a good way to really think about storage in general. In the library, books are stored in rows and on shelves with the title showing, based on a certain taxonomy (whole science around this library stuff, ya know). This is analogous to the file system using the folder, sub-folder, and file name. Now, in the library you have the card catalog, which contains basic information (metadata) to help you locate a particular book. Each book contains that basic information as well as an index (additional metadata) that describes in more detail what's in the book. This is an example of the hybrid model you suggest. Let's say there's a flood and the card catalog is lost, but the books remain. The librarian can rebuild the card catalog from the information contained within each book.

    In the digital world, ECM, records management, and other applications store the most important metadata in a database, and the file is stored in some folder. The context of that file in some folder is lost if that database is lost or disappears after a period of years. However, storing more descriptive metadata along with the file allows that context to persist over time without worrying about losing the ability to easily determine the value of a file. It's what I like to call Content in Context. The question of how much metadata to store in an object becomes a question for the organization and its requirements for information management. There are a number of industry standards for metadata that can be used for specific types of content, such as the NBII Biological Metadata Standard, the Content Standard for Geospatial Metadata, and Dublin Core (there are many more). Perhaps only a subset of the metadata standard is required, or maybe all of it. That's a level of flexibility information/records managers should have. A lot of content is going to be kept for decades, and keeping that context alive is critical. The hybrid model makes a lot of sense at this point in time, and I expect new applications will emerge that take advantage of object stores, allowing dynamic views/organization of content/files.
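
    The flooded-library scenario maps neatly onto code. A minimal sketch (the sidecar naming follows the earlier hybrid example; the schema is hypothetical): walk the archive, read each file's sidecar metadata, and regenerate the disposable catalog database from scratch.

    ```python
    import os
    import sqlite3
    import xml.etree.ElementTree as ET

    def rebuild_catalog(archive_root: str, db_path: str = "catalog.db"):
        """Rebuild the 'card catalog' from metadata stored alongside each file.

        Assumes each content file has a sibling <name>.meta.xml sidecar; the
        database is disposable because everything in it can be regenerated
        from the files themselves.
        """
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS catalog (path TEXT, key TEXT, value TEXT)")
        for dirpath, _, filenames in os.walk(archive_root):
            for fname in filenames:
                if not fname.endswith(".meta.xml"):
                    continue
                sidecar = os.path.join(dirpath, fname)
                content = sidecar[: -len(".meta.xml")]   # the file it describes
                for field in ET.parse(sidecar).getroot():
                    db.execute("INSERT INTO catalog VALUES (?, ?, ?)",
                               (content, field.get("name"), field.text))
        db.commit()
        return db
    ```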

    Derek Gascon
    VP Marketing, Caringo, Inc.

  • Pingback: Why Policy is the future of storage — Dave Graham's Weblog

  • Pingback: Towards a Private Cloud Architecture » privatecloud.com