Clarifying the Concept of Metadata

Metadata is a difficult word to define, or so it would appear.  After all, why is it that the best that Wikipedia can do is:

Metadata (meta data, or sometimes metainformation) is “data about data”, of any sort in any media. An item of metadata may describe an individual datum, or content item, or a collection of data including multiple content items and hierarchical levels, for example a database schema.  (Source: Wikipedia)

Ewww. 

OK, to be fair, that’s not the best definition I could find, but I did want to point out one problem with the way that Metadata is perceived in the world of computing.  Metadata is usually defined in some lame way (data about data) and then described through examples, to help people to understand the meaning of the term.  Those examples are often incomplete or even misleading, like the example above or this example from further down in the same Wikipedia article:

In the context of an information system, where the data is the content of the computer files, metadata about an individual data item would typically include the name of the field and its length. Metadata about a collection of data items, a computer file, might typically include the name of the file, the type of file and the name of the data administrator.  (Source: Wikipedia)

What is so bad, you might say, about this example?

After all, it conveys the concept of “data about data” using terms that the reader may find familiar.  That is true, but there are much more powerful, complete, and correct examples available, ones that would provide a richer context for understanding metadata, and perhaps an understanding of how important metadata can be.  In addition, by supplying a small set of metadata elements, the example tells such a small part of the story that it errs by omission.  It would be akin to describing a political leader as “human” and supplying their date of birth.  There is considerably more information that is both useful and simple to collect.

For example, let’s say that someone sends you an e-mail, and in that e-mail is a document.  The name of the document is “Functional Specification for Vista.” You could draw all kinds of conclusions from that bit of metadata (the title). 

Now, you open the document and you find that it is a short document (one page).  On that page is a list of the people, and their business functions, who proofread text submitted to a commercial printing company called Vista Printing.  Or even better, it turns out it is a PowerPoint deck, created by a salesman, that shows pictures of how Vista Printing functions!

After examining the metadata, what was missing from our understanding of this document?

We could (and I argue, should) know a great deal more about an artifact like this one, before we can say that we understand it.  We need to know, at least, who, what, when, where, why, and how. 

  • Who created the artifact?
  • Who is the intended recipient?
  • Who has accessed it?  (And when did they access it, and what process were they performing when they did?)
  • What business process called this artifact into existence?
  • What business outcome was it intended to support?
  • When was it created (date)?
  • When was it created (in what relation to the beginning of the process instance in which it was created)?
  • Where was it created (physical systems used to create it)?
  • Where is its ‘official’ storage location on a network (its address or URL)?
  • Why was it created (name of the process activity that required this artifact as input)?
  • Why was it created (description of the personal objective of the creator in creating it)?
  • How was it created (using what tools and techniques)?
  • How was it created (using what thinking / creative / collaborative process)?
  • How was it created (using what audit / change control / approval process)?
  • How was it paid for (which goes to the motivation of the person who desired its creation)?
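To make the list above a little more concrete, here is a minimal sketch of how a subset of those questions might be held in a single record. The class name, the field names, and the sample values are all my own assumptions for illustration; they are not a standard, and a real repository would choose its own fields.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import List, Optional


@dataclass
class ArtifactMetadata:
    """Hypothetical record answering who/what/when/where/why/how for one artifact."""
    # Who
    created_by: Optional[str] = None
    intended_recipient: Optional[str] = None
    # What
    business_process: Optional[str] = None
    intended_outcome: Optional[str] = None
    # When
    created_at: Optional[datetime] = None
    # Where
    storage_url: Optional[str] = None
    # Why
    requiring_activity: Optional[str] = None
    # How
    tools_used: List[str] = field(default_factory=list)
    approval_process: Optional[str] = None

    def unanswered(self) -> List[str]:
        """Return the names of the questions (fields) that are still empty."""
        return [name for name, value in asdict(self).items()
                if value is None or value == []]


# The slide deck from the example above, described with illustrative values.
deck = ArtifactMetadata(
    created_by="a salesman",
    business_process="sales presentation",
    tools_used=["PowerPoint"],
)
print(deck.unanswered())
```

Listing the unanswered fields is the useful part: it shows how much of the story the title and creator alone leave untold.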

While it is true that the creation date and file name are, technically, metadata, they are far from reasonable examples to help people understand the concept of metadata.

To this end, I’ll suggest an alternate definition: one that I believe is simple, easy to read, and provides a better understanding than ‘data about data.’  It goes like this:

Metadata is the surrounding contextual information required for a person or system to “understand” an element of information in the context for which it was intended. 

Metadata answers fundamental questions about a bit of information, such as who created it, who may access it, what it refers to, why it was created, and how it should be used. 

A sufficient amount of metadata is captured when a consumer of a data element is able to correctly place the information in context, even if that consumer is using the information for a purpose other than the one for which it was originally intended.  The list of fields considered ‘sufficient’ for one purpose may not be sufficient for another.
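One rough way to picture ‘sufficient for one purpose but not another’ is a check of captured fields against what each purpose needs. The purposes, the field names, and the sample document below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical mapping from a consumer's purpose to the metadata fields it needs.
REQUIRED_FIELDS = {
    "retrieval": {"title", "created_by", "business_process"},
    "compliance_audit": {"created_by", "created_at", "approval_process"},
}


def missing_fields(metadata: dict, purpose: str) -> set:
    """Return the fields required for `purpose` that are absent or empty."""
    present = {key for key, value in metadata.items() if value not in (None, "")}
    return REQUIRED_FIELDS[purpose] - present


def is_sufficient(metadata: dict, purpose: str) -> bool:
    """Metadata is sufficient for a purpose when nothing it requires is missing."""
    return not missing_fields(metadata, purpose)


# Illustrative metadata for the document from the earlier example.
doc = {
    "title": "Functional Specification for Vista",
    "created_by": "a salesman",
    "business_process": "sales presentation",
}

print(is_sufficient(doc, "retrieval"))
print(sorted(missing_fields(doc, "compliance_audit")))
```

The same three fields satisfy the retrieval purpose and fall short for the audit, which is the point of the paragraph above.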

Of course, you may not want, or need, all of those questions answered.  It can be difficult to capture everything, and some of those difficult items may not be beneficial for any of the consumers of that information. 

On the other hand, it would be very simple, in many cases, to capture quite a bit of metadata, nearly always more information than we normally capture.  Capturing this information, and using it appropriately, can help in the automation of business processes, the correct categorization of information for retrieval (and use), demonstration of compliance to a standard or business rule, and the maintenance of appropriate information security.

The data that we frequently collect, like the user who created it, or the date it was updated, is not interesting if there is not a business process that ties to that data, either as a producer or consumer.  How many database tables have columns for ‘last modified date?’  How many business processes use that information?

So, whether you are working on a repository of information, or just creating the schema for a database, consider carefully what metadata you want to capture and how you want to capture it.  Ask yourself who, what, when, where, why, and how, both for fields and tables.  Ask these questions for all artifacts, including the documents, source code, and test execution logs.  Then, consider carefully if that information would be useful to a business process somewhere else in your system. 
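As a small sketch of that last question, you could compare the fields you capture against the fields that some business process actually reads or writes. The field and process names here are made up for illustration.

```python
# Hypothetical set of metadata fields that a system currently captures.
CAPTURED_FIELDS = {"created_by", "last_modified_date", "approval_process"}

# Hypothetical map of business processes to the fields they produce or consume.
PROCESS_USAGE = {
    "compliance_audit": {"created_by", "approval_process"},
    "document_retrieval": {"created_by"},
}


def unused_fields(captured: set, usage: dict) -> set:
    """Return captured fields that no business process produces or consumes."""
    used = set().union(*usage.values())
    return captured - used


print(unused_fields(CAPTURED_FIELDS, PROCESS_USAGE))
```

A field that shows up in this result is a candidate for either dropping or, better, finding the process that ought to be using it.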

Capture useful metadata.  You’d be surprised how valuable it can be.

September 17th, 2008 | Enterprise Architecture | 8 Comments

About the Author:

President of Vanguard EA, an Enterprise Architecture consulting firm in Seattle focused on the Pacific coast of the US. Nick has over 30 years of professional experience in management, systems, and technology. He is the co-author of the influential paper "Perspectives on Enterprise Architecture" with Dr. Brian Cameron that effectively defined modern Enterprise Architecture practices, and he is a frequent speaker at public gatherings on Enterprise Architecture and related topics. He coauthored a book on Visual Storytelling with Martin Sykes and Mark West titled "Stories That Move Mountains".

8 Comments

  1. John Cavnar-Johnson September 17, 2008 at 1:05 pm

    Here’s my suggestion for a simpler definition of metadata:

    Metadata is the context needed to understand a particular dataset’s informational content.

  2. Bob McIlree September 17, 2008 at 5:47 pm

    Your definition is a bit too verbose. I like the definition from the Washington DC Government Federated Data Model: "Information that describes the content, quality, condition, origin, and other characteristics of data or other pieces of information."

    And "data about data" usually works very well with business-types in defining metadata for them. They understand that better than any other explanation I’ve heard over many years.

  3. NickMalik September 17, 2008 at 8:46 pm

    Hi Bob,

    The FDM definition is better than the one in Wikipedia, but the problem is that it describes "what it is" without any reference to "what it is for."  For the definition of an item in nature, that would be sufficient, but not so with things that we construct.

    Example: the American Heritage Dictionary defines an automobile as:

    n.   A self-propelled passenger vehicle that usually has four wheels and an internal-combustion engine, used for land transport. Also called motorcar.

    Note that the definition defines "what it is" as well as "what it is used for."  

    The problem with "Data about data" is that it is accurate and completely insufficient.  It works because it describes a concept that is potentially so vague that people can safely ignore it.  

    On the other hand, a definition that refers to understanding, and not simply content, not only requires the reader to recognize that metadata is important, but also allows them to describe instances of metadata that they have seen in other contexts.

    A useful definition for a useful term.

    I will try to craft something shorter.

    —- N

  4. NickMalik September 17, 2008 at 8:55 pm

    Hello John C-J,

    Your definition is good, and considerably shorter.  If I compare it with the first sentence of my suggested definition above, there is only one really distinct difference: the type of understanding involved.

    I have found that different people will (correctly) understand an entire concept in different ways, depending on their viewpoint.  

    I am not talking about the "limited view" problem, aptly described in the famous poem about six blind men describing parts of an elephant as a snake, a tree, a rope, etc.  

    I’m talking about five different people all looking at a fast car.  One sees the mechanics.  Another sees the horsepower and performance characteristics.  Another considers comfort and amenities, while a fourth considers the effect that the car has on casual bystanders who see it.  A fifth thinks about the time it will take to travel on a specific road.

    The concept of "information" fails if we say that the context needed to describe information does not capture the viewpoint of the person who needs to use it.  

    Compromise:

    Metadata is the contextual information needed to understand a particular dataset’s informational content as it was intended to be used.

    How does that sound?

    — N

  5. John Cavnar-Johnson September 24, 2008 at 11:41 am

    (Sorry it took me so long to get back here, but I live in Houston and we’ve had other issues lately.)

    I still like my definition better. To explain why, I’ll compare them word by word:

    Metadata is the

        context

        contextual information

    I value conciseness highly in all forms of writing and in definitions above all. I think the fact that we are defining metadata makes the addition of ‘information’ unnecessary.

    needed to understand a particular dataset’s informational content

        as it was intended to be used

    I think this addition is too limiting. It captures the view of the data/metadata creator, but misleads the consumer. In the hands of an effective analyst, well-chosen metadata will often reveal informational content beyond what the designer intended. BI, data mining, and a whole lot of social science research are based on that fact. Of course, that shouldn’t be used as an excuse to gold-plate your data structures with extra, needless metadata.

  6. Data Modeler September 24, 2008 at 1:36 pm

    Another problem with the "data about data" definition is that it fails to get business users to understand that they need to manage meta data about more than just databases and records.

    For instance, songs on an iPod have meta data, images have meta data, and paper documents have meta data.  Sure, we IT pros understand that digital images are really just data, as are MP3s, but an average business user hears the word "data" and assumes a much smaller subset of his world than we do.

    Karen Lopez

  7. NickMalik September 24, 2008 at 8:47 pm

    Hello John C-J,

    My thoughts go with you and your family and neighbors in Houston as you pick up the pieces.  Having gone through Hurricane Andrew in Miami, I understand and sympathize.

    First: context

    The American Heritage Dictionary defines context as:

    The circumstances in which an event occurs; a setting.

    Information is not implicit in that definition.  Information is not implicit in our definition unless we put it there.  Therefore, "contextual information" is needed.  Remember, when defining a term, the definition needs to stand alone, outside the context (pun intended) of a blog entry or long discussion thread.

    Second: as it was intended

    True: social science can extract information beyond the original intent, but the person SAVING that information cannot foresee that use, nor should they.  The ‘saver’ of metadata has an intent.  The ‘reader’ of metadata must understand their intent to understand the data.

    If that ‘reader’ is then able to infer further information through the addition of context NOT IN EVIDENCE in the metadata, that is an art and a talent, but it is NOT part of the definition of metadata.  It is the definition of analysis.

    Metadata must be understood, first, in the context in which it was intended.  The only way to extract further knowledge is through FIRST understanding that context.  

    Therefore, the context in which it was intended is a prerequisite for understanding.  Understanding is a prerequisite for inference.  To be useful, this concept of ‘prerequisite’ must be included.

    I would not find your definition useful to constrain this meaning, and thereby make the concept of metadata useful.

    I think we should agree to disagree on this point.

    — N

  8. NickMalik September 24, 2008 at 8:47 pm

    Hello Karen,

    Point well made!

    — N
