June 5, 2010

MIME and the Web

I originally wrote this as a blog post and made updates; it is now available as an IETF Internet Draft, for discussion on www-tag@w3.org.

Origins of MIME

MIME was originally invented for email, based on general principles of ‘messaging’ as a foundational architecture. The role of MIME was to extend Internet messaging beyond ASCII-only plain text (to other character sets, images, rich documents, etc.). The basic architecture of complex content messaging is:
  • Message sent from A to B.
  • Message includes some data. Sender A includes standard ‘headers’ giving recipient B enough information to know how sender A intends the message to be interpreted.
  • Recipient B gets the message, reads the ‘headers’ for the data, and uses them as information on how to interpret the data.
MIME is a “tagging and bagging” specification:
  •  tagging: how to label content so the intent of how the content should be interpreted is known
  •  bagging: how to wrap the content so the label is clear, or, if there are multiple parts to a single message, how to combine them.
“MIME types” (renamed “Internet Media Types”) were part of the labeling, the name space of kinds of things. The MIME type registry (“Internet Media Type registry”) is where someone can tell the world what a particular label means, as far as the sender’s intent.
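As a rough illustration of tagging and bagging, here is a minimal sketch using Python's standard `email` package. The addresses, subject, and payload bytes are invented for the example; only the library calls are real.

```python
from email.message import EmailMessage

# Sender A builds the message; every address and payload here is invented.
msg = EmailMessage()
msg["From"] = "a@example.org"
msg["To"] = "b@example.org"
msg["Subject"] = "Menu"

# Tagging: label the text part as text/plain with an explicit charset.
msg.set_content("Café menu attached.", subtype="plain", charset="utf-8")

# Bagging: attaching a second part wraps both in a multipart/mixed container.
placeholder_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8  # signature plus padding, not a real image
msg.add_attachment(placeholder_png, maintype="image", subtype="png", filename="menu.png")

# Recipient B walks the parts and reads each label to decide how to interpret it.
labels = [part.get_content_type() for part in msg.walk()]
charsets = {part.get_content_type(): part.get_content_charset() for part in msg.walk()}
print(labels)
print(charsets["text/plain"])
```

The first label printed is the container ("the bag"); the remaining labels are the tags on the individual parts.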

Introducing MIME into the Web

The original World Wide Web didn’t have MIME tagging and bagging. Everything transferred was HTML.
At the time (’92), other distributed information access systems, including Gopher (a distributed menu system) and WAIS (remote access to document databases), were adding capabilities for accessing many things other than text and hypertext, and the WWW folks were considering type tagging.
It was agreed that HTTP should use MIME as the vocabulary for talking about file types and character sets.
The result was that HTTP 1.0 added the “content-type” header, following (more or less) MIME. Later, for content negotiation, additional uses of this technology (in ‘Accept’ headers) were also added.
The differences between Mail MIME and Web MIME were minor (default charset, requirement for CRLF in plain text). These minor differences have caused a lot of trouble, but that’s another story.

Distributed Extensibility

The real advantage of using MIME to label content was that the web was no longer restricted to a single format. This one addition meant expanding from Global Hypertext to Global Hypermedia:
The Internet currently serves as the backbone for a global hypertext. FTP and email provided a good start, and the gopher, WWW, or WAIS clients and servers make wide area information browsing simple. These systems even interoperate, with email servers talking to FTP servers, WWW clients talking to gopher servers, on and on.
This currently works quite well for text.  But what should WWW clients do as Gopher and WAIS servers begin to serve up pictures, sounds, movies, spreadsheet templates, postscript files, etc.? It would be a shame for each to adopt its own multimedia typing system.
If they all adopt the MIME typing system (and as many other features from MIME as are appropriate), we can step from global hypertext to global hypermedia that much easier.
The fact that HTTP could reliably transport images of different formats allowed NCSA to add <img> to HTML. MIME allowed other document formats (Word, PDF, Postscript) and other kinds of hypermedia, as well as other applications, to be part of the web. MIME was arguably the most important extensibility mechanism in the web.

Not a perfect match

Unfortunately, while the use of MIME for the web added incredible power, things didn't quite match, because the web isn’t quite messaging:
  • Web "messages" are generally HTTP responses to a specific request; this means you know more about the data before you receive it. In particular, the data really does have a ‘name’ (mainly, the URL used to access the data), while in messaging, the messages were anonymous.
  • You would like to know more about the content before you retrieve it. The "tagging" of MIME is often not sufficient to know, for example, "can I interpret this if I retrieve it", because of versioning, capabilities, or dependencies on things like screen size or interaction capabilities of the recipient.
  • Some content isn’t delivered over HTTP (files on a local file system), or there is no opportunity for tagging (data delivered over FTP), and in those cases some other way is needed to determine the file type.
Operating systems used, and continued to evolve to use, different systems to determine the ‘type’ of something, different from the MIME tagging and bagging:
  • ‘magic numbers’: in many contexts, file types could be guessed pretty reliably by looking for headers.
  • Originally Mac OS had a 4 character ‘file type’ and another 4 character ‘creator code’ for file types.
  • Windows evolved to use the “file extension” – 3 letters (and then more) at the end of the file name.
Information about these other ways of determining type (rather than by the label) was gathered for the MIME registry; those registering MIME types are encouraged to also describe ‘magic numbers’, Mac file type, and common file extensions. However, since there was no formal use of that information, the quality of that information in the registry is haphazard.
Finally, while tagging and bagging might be OK for unilateral one-way messaging, on the web you might want to know whether you could handle the data before reading it in and interpreting it, and the MIME types weren't enough to tell.
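Two of the non-MIME typing mechanisms listed above, magic numbers and file extensions, can be sketched as follows. The signature table is a tiny hand-picked sample rather than any registry's data, and the function names are made up for illustration; `mimetypes.guess_type` is the standard library's extension lookup.

```python
import mimetypes

# A handful of well-known leading byte signatures; a real table is far longer.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def guess_by_magic(data: bytes):
    """Return a media type if the data starts with a known signature, else None."""
    for signature, media_type in MAGIC_NUMBERS:
        if data.startswith(signature):
            return media_type
    return None


def guess_by_extension(filename: str):
    """Return the media type the local extension table maps this name to, else None."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type


# The two mechanisms can disagree, for example a PDF saved under a .png name.
data = b"%PDF-1.4\n"
print(guess_by_magic(data))
print(guess_by_extension("scan.png"))
```

When the two guesses differ, as they do here, something has to decide which one wins; that decision is where the reliability problems discussed in the following sections come from.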

The Rules Weren’t Quite Followed

  • Lots of file types aren’t registered (no entry in the IANA registry for those file types).
  • For those that are, the registration is often incomplete or incorrect (people doing the registration didn’t understand ‘magic numbers’).

A Few Bad Things happened

  1. Browser implementors would be liberal in what they accepted, and use file extension and/or magic number or other ‘sniffing’ techniques to decide file type, without assuming content-label was authoritative. This was necessary anyway for files that weren’t delivered by HTTP.
  2. HTTP server implementors and administrators didn’t supply ways of easily associating the ‘intended’ file type label with the file, resulting in files frequently being delivered with a label other than the one they would have chosen if they’d thought about it, and if browsers *had* assumed content-type was authoritative. Some popular servers had default configuration files that treated any unknown type as "text/plain" (plain text in ASCII). Since it didn't matter (the browsers worked anyway), it was hard to get this fixed.
Incorrect senders coupled with liberal readers wind up feeding a vicious cycle based on the robustness principle.

Consequences

The result, alas, is that the web is unreliable, in that
  • servers sending responses to browsers don’t have a good guarantee that the browser won’t “sniff” the content and decide to do something other than treat it as it is labeled,
  • browsers receiving content don’t have a good guarantee that the content isn’t mis-labeled, and
  • intermediaries -- gateways, proxies, caches, and other pieces of the web infrastructure -- don’t have a good way of telling what the conversation means.
This ambiguity and ‘sniffing’ also applies to packaged content in webapps (‘bagging’ but using ZIP rather than MIME multipart).

The Down Side of Extensibility

Extensibility adds great power, and allows the web to evolve without committee approval of every extension. For some (those who want to extend and their clients who want those extensions), this is power! For others (those who are building web components or infrastructure), extensibility is a drawback -- it adds to the unreliability and variability of the web experience. When senders use extensions that recipients aren’t aware of, or that recipients implement incorrectly or incompletely, communication often fails. With messaging, this is a serious problem, although most ‘rich text’ documents are still delivered in multiple forms (using multipart/alternative).
If your job is to support users of a popular browser, however, where each user has installed a different configuration of MIME handlers and extensibility mechanisms, MIME may appear to add unnecessary complexity and variable experience for users of all but the most popular MIME types.

The MIME story applies to charsets

MIME includes provisions not only for file 'types' but also, importantly, for the "character encoding" used by text types: simple US-ASCII, Western European ISO-8859-1, Unicode UTF-8. A similar vicious cycle also happened with character set labels: mislabeled content, happily processed correctly by liberal browsers, encouraged more and more sites to publish text with mislabeled character sets, to the point where browsers feel they *have* to guess the encoding because the label may be wrong.
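To make the charset problem concrete, here is a small sketch of a recipient that prefers the declared charset and guesses only when decoding fails. The function name and the fallback order are assumptions chosen for illustration, not any browser's actual algorithm.

```python
def decode_text(body: bytes, declared_charset=None, fallbacks=("utf-8", "iso-8859-1")):
    """Decode body, preferring the declared charset and guessing only if that fails.

    Returns (text, charset_used). The fallback order is an assumption for
    illustration; iso-8859-1 accepts any byte sequence, so it always succeeds.
    """
    candidates = ([declared_charset] if declared_charset else []) + list(fallbacks)
    for charset in candidates:
        try:
            return body.decode(charset), charset
        except (LookupError, UnicodeDecodeError):
            continue
    raise ValueError("no candidate charset could decode the body")


latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = "café".encode("utf-8")

# Mislabeled as UTF-8: decoding fails, so the recipient falls back to a guess.
print(decode_text(latin1_bytes, "utf-8"))
# Mislabeled as ISO-8859-1: decoding "succeeds" but silently yields garbled text.
print(decode_text(utf8_bytes, "iso-8859-1"))
```

The second case is the troublesome one: a wrong label does not always produce an error, so a recipient cannot tell from decoding alone whether the label was right.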

Embedded, downloaded, launch independent application

MIME is used not only for entire documents ("HTML" vs. "Word" vs. "PDF") but also for embedded components of documents ("JPEG image" vs. "PNG image"). However, the use cases, requirements and likely operational impact of MIME handling are likely different in each case.

Additional Use Cases: Polyglot and Multiview

There are some interesting additional use cases which add to the design requirements:
  •  "Polyglot" documents: A ‘polyglot’ document is data which can be treated as two different Internet Media Types, in the case where the meaning of the data is the same. This is part of a transition strategy to allow content providers (senders) to manage, produce, store, and deliver the same data, but with two different labels, and have it work equivalently with two different kinds of receivers (one of which knows one Internet Media Type, and another which knows a second one). This use case was part of the transition strategy from HTML to an XML-based XHTML, and also a way for a single service to offer both HTML-based and XML-based processing (e.g., the same content useful for news articles and web pages).
  • "Multiview" documents: This use case seems similar but it’s quite different. In this case, the same data has very different meaning when served as two different content-types, but that difference is intentional; for example, the same data served as text/html is a document, and served as an RDFa type is some specific data.

Versioning

Formats and their specifications evolve over time, sometimes compatibly, sometimes not. It is part of the responsibility of the designer of a new version of a file type to try to ensure both forward and backward compatibility: new documents work reasonably (with some fallback) with old viewers; old documents work reasonably with new viewers. In some cases this is accomplished, in others not; in some cases, "works reasonably" is softened to "either works reasonably or gives clear warning about nature of problem (version mismatch)."
In MIME, the 'tag', the Internet Media Type, corresponds to the versioned series. Internet Media Types do not identify a particular version of a file format. Rather, the general idea is that the MIME type identifies the family, and also how you're supposed to otherwise find version information on a per-format basis. Many (most) file formats have an internal version indicator, with the idea that you only need a new MIME type to designate a completely incompatible format. The notion of an “Internet Media Type” is very coarse-grained. The general approach has been that the Media Type definition includes provisions for version indicator(s) embedded in the content itself, which determine more precisely how the data is to be interpreted. That is, the message itself contains further information.
Unfortunately, lots has gone wrong in this scenario as well: processors that ignore version indicators encourage content creators to be careless about supplying correct ones, leading to lots of content with wrong version indicators.
Those updating an existing MIME type registration to account for new versions are admonished to not make previously conforming documents non-conforming. This is harder to enforce than it would seem, because the previous specifications are not always accurate to what the MIME type was used for in practice.
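The split between a coarse-grained label and an in-band version indicator can be sketched with PDF, whose files begin with a `%PDF-x.y` header. The supported-version limit and the function names below are hypothetical, chosen only to show the shape of the check.

```python
import re

# PDF files start with a header such as "%PDF-1.4"; the media type itself
# (application/pdf) says nothing about which version this is.
PDF_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")
MAX_SUPPORTED = (1, 7)  # hypothetical limit of an imaginary viewer


def pdf_version(data: bytes):
    """Return (major, minor) read from the in-band header, or None if absent."""
    match = PDF_HEADER.match(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def can_handle(media_type: str, data: bytes) -> bool:
    """The label selects the family; the content has to be read to learn the version."""
    if media_type != "application/pdf":
        return False
    version = pdf_version(data)
    return version is not None and version <= MAX_SUPPORTED


print(can_handle("application/pdf", b"%PDF-1.4\n"))
print(can_handle("application/pdf", b"%PDF-2.0\n"))
```

Note that the check needs at least the first bytes of the content, which is exactly the "you only find out after retrieving it" limitation described above.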

Content Negotiation

The general idea of content negotiation is that when party A communicates to party B, and the message can be delivered in more than one format (or version, or configuration), there can be some negotiation: some way for A to communicate to B the available options, and for B to accept or indicate preferences.
Content negotiation happens all over. When one fax machine twirps to another when initially connecting, they are negotiating resolution, compression methods and so forth. In Internet mail, which is a one-way communication, the "negotiation" consists of the sender preparing and sending multiple versions of the message, one in text/plain, one in text/html, for example, in increasing order of sender preference. The recipient then chooses the last version it can understand.
HTTP added "Accept" and "Accept-Language" to allow content negotiation in HTTP GET, based on MIME types, and there are other methods explained in the HTTP spec.
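A simplified sketch of server-side negotiation driven by an "Accept" header follows. It handles only media ranges and the `q` parameter, ignores other parameters, and uses made-up function names; it is an illustration of the idea rather than a complete implementation of the HTTP rules.

```python
def parse_accept(header: str):
    """Split an Accept header into (media_range, q) pairs; q defaults to 1.0."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        ranges.append((media_range, q))
    return ranges


def quality(media_type: str, ranges) -> float:
    """Return the q of the most specific range matching media_type (0.0 if none)."""
    type_, _, _subtype = media_type.partition("/")
    best_specificity, best_q = -1, 0.0
    for media_range, q in ranges:
        r_type, _, r_sub = media_range.partition("/")
        if media_range == media_type:
            specificity = 2
        elif r_type == type_ and r_sub == "*":
            specificity = 1
        elif r_type == "*" and r_sub == "*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def negotiate(available, accept_header: str):
    """Pick the available type the client rates highest, or None if none is acceptable."""
    ranges = parse_accept(accept_header)
    best, best_q = None, 0.0
    for media_type in available:
        q = quality(media_type, ranges)
        if q > best_q:
            best, best_q = media_type, q
    return best


print(negotiate(["text/plain", "text/html"], "text/html, text/plain;q=0.5"))
print(negotiate(["application/pdf"], "text/*;q=0.8"))
```

Ties are broken in favor of whichever type the server lists first, which is a design choice of this sketch.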

Fragment identifiers

The web added the notion of being able to address part of some content, and not the whole content, by adding a ‘fragment identifier’ to the URL that addressed the data. Of course, this made sense for the original web with just HTML, but how would it apply to other content? The URL spec glibly noted that “the definition of the fragment identifier meaning depends on the MIME type”, but unfortunately, few of the MIME type definitions included this information, and practices diverged greatly.
If the interpretation of fragment identifiers depends on the MIME type, though, this really crimps the use of fragment identifiers when content negotiation is wanted, since the same URL may then be delivered as different MIME types.
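The dependence of fragment meaning on the media type can be sketched as below. `urldefrag` is the standard library call; the URL, the function name, and the short descriptions of what each type does with a fragment are illustrative assumptions, not quotations from any registration.

```python
from urllib.parse import urldefrag


def describe_fragment(url: str, media_type: str):
    """Split off the fragment and say how it would be read for a given media type."""
    base, fragment = urldefrag(url)
    if not fragment:
        return base, "whole resource"
    if media_type == "text/html":
        meaning = f"element identified by {fragment!r}"
    elif media_type == "application/pdf":
        meaning = f"viewer parameters {fragment!r}"
    else:
        meaning = f"undefined for {media_type}"
    return base, meaning


# One URL, negotiated into two representations, gives two readings of "#page=3".
url = "http://example.org/report#page=3"
print(describe_fragment(url, "text/html"))
print(describe_fragment(url, "application/pdf"))
```

The request sent to the server uses only the part before the `#`, so the server that performs the negotiation never sees the fragment whose meaning its choice changes.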

Where we need to go

Many people are confused about the purpose of MIME in the web, its uses, and the meaning of MIME types. Many W3C specifications, TAG findings and MIME type registrations make what are (IMHO) incorrect assumptions about the meaning and purposes of a MIME type registration.
We need a clear direction on how to make the web more reliable, not less. We need a realistic transition plan from the unreliable web to the more reliable one. Part of this is to encourage senders (web servers) to mean what they say, and encourage recipients (browsers) to give preference to what the senders are sending.
We should try to create specifications for protocols and best practices that will lead the web to more reliable and secure communication. To this end, we give an overall architectural approach to the use of MIME, and then specific specifications for HTTP clients and servers, web browsers in general, and proxies and intermediaries. These should encourage behavior which, on the one hand, continues to work with the already deployed infrastructure (of servers, browsers, and intermediaries), and which, if followed, also improves the operability, reliability and security of the web.

Specific recommendations

(I think I want to see if we can get agreement on the background, problem statement and requirements before sending out any more about possible solutions. However, the following is a partial list of documents that should be reviewed and updated, or new documents written.)

  • update MIME / Internet Media Type registration process (IETF BCP)
    • Allow commenting or easier update; not all MIME type owners need or have all the information the internet needs
    • Be clearer about relationship of 'magic numbers' to sniffing; review MIME types already registered & update.
    • Be clearer about requiring Security Considerations to address risks of sniffing
    • require definition of fragment identifier applicability
    • Perhaps ask the 'applications that use this type' to be clearer about whether the file type is suitable for embedding (plug-in) or as a separate document with auto-launch (MIME handler), or should always be downloaded.
    • Be clearer about file extension use & relationship of file extensions to MIME handlers
  • FTP specifications
    • Do FTP clients also change rules about guessing file types based on the OS of the FTP server?
  • update TAG finding on authoritative metadata
    • is it possible to remove 'authority'?
  • new:  MIME and Internet Media Type section to WebArch
    • based on this memo
  • new: add W3C web architecture material on MIME in HTML to the W3C web site
    • based on this memo
  • update mimesniff / HTML spec on sniffing, versioning, MIME types, charset sniffing
    • Sniffing uses MIME registry
    • all sniffing can a upgrade
    • discourage sniffing unless there is no type label
      • malformed content-type: error
      • no knowledge that given content-type isn't better than guessed content-type
  • update WEBAPPS specs (which ones?)
  • Reconsider other extensibility mechanisms (namespaces, for example): should they use MIME or something like it?
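
One possible reading of the sniffing bullets in the list above (sniff only when there is no label, treat a malformed label as an error, otherwise let the label win) is sketched here. The function names, the exception, the tiny sniffer, and the `application/octet-stream` fallback are all assumptions for illustration, not text from any spec.

```python
import re

# "type/subtype" made of token characters, optionally followed by parameters.
TOKEN = r"[!#$%&'*+.^_`|~0-9A-Za-z-]+"
MEDIA_TYPE_RE = re.compile(rf"^\s*({TOKEN})/({TOKEN})\s*(;.*)?$")


class MalformedContentType(ValueError):
    """Raised when a Content-Type label is present but cannot be parsed."""


def tiny_sniff(body: bytes):
    """Stand-in sniffer that knows two signatures; a real one would use a full table."""
    if body.startswith(b"%PDF-"):
        return "application/pdf"
    if body.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    return None


def effective_type(content_type_header, body: bytes, sniff=tiny_sniff) -> str:
    """Decide the type to act on: the label if present, a guess only if it is absent."""
    if content_type_header is None or not content_type_header.strip():
        return sniff(body) or "application/octet-stream"
    match = MEDIA_TYPE_RE.match(content_type_header)
    if match is None:
        raise MalformedContentType(content_type_header)
    return f"{match.group(1)}/{match.group(2)}".lower()


# The label wins even though the bytes look like a PDF.
print(effective_type("text/html; charset=utf-8", b"%PDF-1.4\n"))
# With no label at all, the recipient is allowed to guess.
print(effective_type(None, b"%PDF-1.4\n"))
```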

http://lists.w3.org/Archives/Public/www-talk/1992SepOct/0035.html
Re: misconceptions about MIME [long]
Larry Masinter (masinter@parc.xerox.com)
Tue, 27 Oct 1992 14:38:18 PST

"If I wish to retrieve the document, say to view it, I might want to choose the available representation that is most appropriate for my purpose. Imagine my dismay to retrieve a 50 megabyte postscript file from an anonymous FTP archive, only to discover that it is in the newly announced Postscript level 4 format, or to try to edit it only to discover that it is in the (upwardly compatible but not parsable by my client) version 44 of Rich Text. In each case, the appropriateness of alternate sources and representations of a document would depend on information that is currently only available in-band.
I believe that MIME was developed in the context of electronic mail, but that the usage patterns in space and time of archives, database services and the like require more careful attention (a) to out-of-band information about format versions, so that you might know, before you retrieve a representation, whether you have the capability of coping with it, and (b) some restriction on those formats which might otherwise be uncontrollable."
http://lists.w3.org/Archives/Public/www-talk/1992SepOct/0056.html
Re: misconceptions about MIME [long]
Larry Masinter (masinter@parc.xerox.com)
Fri, 30 Oct 1992 15:54:56 PST
I propose (once again) that instead of saying 'application/postscript' it say, at a minimum, 'application/postscript 1985' vs 'application/postscript 1994' or whatever you would like to designate as a way to uniquely identify which edition of the Postscript reference manual you are talking about; instead of being identified as 'image/tiff' the files be identified as 'image/tiff 5.0 Class F' vs 'image/tiff 7.0 class QXB'.

2 comments:

  1. I still like Dan Connolly's suggestion to treat a media type as a (relative) URL against some agreed-upon base.

    Then not only would it be much easier to follow your nose to information about the media type, it would also make deciding the syntax for future extensions like the ones you are proposing that much easier. It would have saved us from abominations like the application/xhtml+xml syntax, and could provide layers of fallback richer than just "HTML or otherwise text", as well as easily providing the places to put versioning information.

    It seems like the first step would be to agree on the future form of media types being a relative URI, so that UAs could at least start preparing to accept the syntax, even if they only use it as an identifier.

  2. @Steven, my 1992 suggestion about version numbers in MIME types was based on a pretty incomplete view of language versions, evolution, and compound documents. While some languages have a simple, linear progression XBOL 1, XBOL 2, XBOL 3, languages like HTML evolve all over the place. And with compounds like XHTML+SVG+MathML -- is that one language? Three? How do you version them all simultaneously? Hierarchy only works when you have a hierarchy, and many formats don't. There might be alternatives -- like the IETF 'media features' idea -- or other ways of combining assertions about versions, components etc. of language versions. I alluded to this in the HTML WG ISSUE-4 change proposal to reinstate DOCTYPE. But I don't think using a hierarchy is going to be helpful, since it leaves most hard questions unanswered.
