Masinter's Musings: 09/2014

September 14, 2014

Living Standards: "Who Needs IANA?"

I'm reading about two tussles, which seem completely disconnected, although they are about the same thing, and I'm puzzled why there isn't a connection.

This is about the IANA protocol parameter registries. Over in ianaplan@ietf.org people are worrying about preserving the IANA function and the relationship between IETF and IANA, because it is working well and shouldn't be disturbed (by misplaced US political maneuvering that the long-planned transition from NTIA is somehow giving something away by the administration.)

Meanwhile, over in www-international@w3.org, there's a discussion of the Encodings document, being copied from WHATWG's document of that name into W3C recommendation. See the thread (started by me), about the "false statement".

Living Standards don't need or want registries for most things the web use registries for now: Encodings, MIME types, URL schemes. A Living Standard has an exhaustive list, and if you want to add a new one or change one, you just change the standard. Who needs IANA with its fussy separate set of rules? Who needs any registry really?

So that's the contradiction: why doesn't the web need registries while other applications do? Or is IANAPLAN deluded?

September 9, 2014

The multipart/form-data mess

OK, this is only a tiny mess, in comparison with the URL mess, and I have more hope for this one.

Way back when (1995), I spec'ed a way of doing "file upload" in RFC1867. I got into this because some Xerox printing product in the 90s wanted it, and enough other folks in the web community seemed to want it too. I was happy to find something that a Xerox product actually wanted from Xerox research.

It seemed natural, if you were sending files, to use MIME's methods for doing so, in the hopes that the design constraints were similar and that implementors would already be familiar with email MIME implementations. The original file upload spec was done in IETF because at the time, all of the web, including HTML, was being standardized in the IETF. RFC 1867 was "experimental," which in IETF used to be one way of floating a proposal for new stuff without having to declare it ready.

After some experimentation we wanted to move the spec toward standardization. Part of the process of making the proposal standard was to modularize the specification, so that it wasn't just about uploading files in web pages. Rather, all the stuff about extending forms and names of form fields and so forth went with HTML. And the container, the holder of "form data"-- independent of what kind of form you had or whether it had any files at all -- went into the definition of multipart/form-data (in RFC2388). Now, I don't know if it was "theoretical purity" or just some sense of building things that are general purpose to allow unintended mash-ups, but RFC2388 was pretty general, and HTML 3.2 and HTML 4.0 were being developed by people who were more interested in spec-ing a markup language than a form processing application, so there was a specification gap between RFC 2388 and HTML 4.0 about when and how and what browsers were supposed to do to process a form and produce multipart/form-data.

February of last year (2013) I got a request to find someone to update RFC 2388. After many months of trying to find another volunteer (most declined because of lack of time to deal with the politics) I went ahead and started work: update the spec, investigate what browsers did, make some known changes. See GitHub repo for multipart/form-data and the latest Internet Draft spec.

Now, I admit I got distracted trying to build a test framework for a "test the web forward" kind of automated test, and spent way too much time building what wound up to be a fairly arcane system. But I've updated the document, and recommended its "working group last call". The only problem is that I just made stuff up based on some unvalidated guesswork reported second hand ... there is no working group of people willing to do work. No browser implementor has reviewed the latest drafts that I can tell.

I'm not sure what it takes to actually get technical reviewers who will actually read the document and compare it to one or more implementations to justify the changes in the draft.

Go to it! Review the spec! Make concrete suggestions for change, comments or even better, send GitHub pull requests!

September 7, 2014

The URL mess

(updated 9/8/14)

One of the main inventions of the Web was the URL. And I've gotten stuck trying to help fix up the standards so that they actually work.

The standards around URLs, though, have gotten themselves into an organizational political quandary to the point where it's like many other situations where a polarized power struggle keeps the right thing from happening.

Here's an update to an earlier description of the situation:

URLs were originally defined as ASCII only. Although it was quickly determined that it was desirable to allow non-ASCII characters, shoehorning utf-8 into ASCII-only systems was unacceptable; at the time, Unicode was not so widely deployed, and there were other issues. The tack was taken to leave "URI" alone and define a new protocol element, "IRI"; RFC 3987 published in 2005 (in sync with the RFC 3986 update to the URI definition). (This is a very compressed history of what really happened.)

The IRI-to-URI transformation specified in RFC 3987 had options; it wasn't a deterministic path. The URI-to-IRI transformation was also heuristic, since there was no guarantee that %xx-encoded bytes in the URI were actually meant to be %xx percent-hex-encoded bytes of a utf8 encoding of a Unicode string.

To address issues and to fix URL for HTML5, a new working group was established in IETF in 2009 (The IRI working group). Despite years of development, the group didn't get the attention of those active in WHATWG, W3C or Unicode consortium, and the IRI group was closed in 2014, with the consolation that the documents that were being developed in the IRI working group could be updated as individual submissions or within the "applications area" working group. In particular, one of the IRI working group items was to update the "scheme guidelines and registration process", which is currently under development in IETF's application area.

Independently, the HTML5 specs in WHATWG/W3C defined "Web Address", in an attempt to match what some of the browsers were doing. This definition (mainly a published parsing algorithm) was moved out into a separate WHATWG document called "URL".

The world has also moved on. ICANN has approved non-ascii top level domains, and IDN 2003 and 2008 didn't really address IRI Encoding. Unicode consortium is working on UTS #46.

The big issue is to make the IRI -to-URI transformation non-ambiguous and stable. But I don't know what to do about non-domain-name non-ascii 'authority' fields. There is some evidence that some processors are %xx-hex-encoding the UTF8 of domain names in some circumstances.

There are four umbrella organizations (IETF, W3C, WHATWG, Unicode consortium) and multiple documents, and it's unclear whether there's a trajectory to make them consistent:

IETF

Dave Thaler (mainly) has updated http://tools.ietf.org/html/draft-ietf-appsawg-uri-scheme-reg, which needs comunity review.

The IRI working group closed, but work can continue in the APPS area working group. Documents sitting needing update, abandoned now, are three drafts (iri-3987bis, iri-comparison, iri-bidi-guidelines) intended originally to obsolete RFC 3987.

Other work in IETF that is relevant but I'm not as familiar with is the IDN/IDNA work for internationalizing domain names, since the rules for canonicalization, equivalence, encoding, parsing, and displaying domain names needs to be compatible with the rules for doing those things to URLs that contain domain names.

In addition, there's quite a bit of activity around URNs and library identifiers in the URN working group, work that is ignored by other organizations.

W3C

The W3C has many existing recommendations which reference the IETF URI/IRI specs in various ways (for example, XML has its own restricted/expanded allowed syntax for URL-like-things). The HTML5 spec references something, the TAG seems to be involved, as well as the sysapps working group, I believe. I haven't tracked what's happened in the last few months.

WHATWG

The WHATWG spec is http://url.spec.whatwg.org/ (Anne, Leif). This fits in with the WHATWG principle of focusing on specifying what is important for browsers, so it leaves out many of the topics in the IETF specs. I don't think there is any reference to registration, and (when I checked last) had a fixed set of relative schemes: ftp, file, gopher (a mistake?), http, https, ws, wss, used IDNA 2003 not 2008, and was (perhaps, perhaps not) at odds with IETF specs.

Unicode consortium

Early versions of #46 and I think others recommends translating toAscii and back using punycode ? But it wasn't specific about which schemes.

Conclusion

From a user or developer point of view, it makes no sense for there to be a proliferation of definitions of URL, or a large variety URL syntax categories. Yes, currently there is a proliferation of slightly incompatible implementations. This shouldn't be a competitive feature. Yet the organizations involved have little incentive to incur the overhead of cooperation, especially since there is an ongoing power struggle for legitimacy and control. The same dynamic applies to the Encoding spec, and, to a lesser degree, handling of MIME types (sniffing) and multipart/form-data.

September 6, 2014

On blogging, tweeting, facebooking, emailing

I wanted to try all the social media, just to keep an understanding of how things really work, I say.

And my curiosity satisfied, I 'get' blogging, tweeting, facebook posting, linking in, although I haven't tried pinning and instagramming. And I'm not sure what about.me is about, really, and quora sends me annoying spam which tempts me to read.

Meanwhile, I'm hardly blogging at all; I have lots of topics with something to say. Meanwhile Carol (wife) is blogging about a trip; I supply photo-captions and Internet support.

So I'm going to follow suit, try to blog daily. Blogspot for technical, Facebook for personal, tweet to announce. LinkedIn notice when there's more to read. I want to update my site, too; more on that later.

Masinter's Musings