Masinter's Musings: 03/2010

html5,w3c,authoring conformance

There's some ongoing debate about whether or how the HTML specification should or shouldn't contain "authoring requirements". I thought I'd write about the general need.

In general, a standard is a (specification of an) interface, across which multiple kinds of agents can communicate. The usual reason for "conformance classes" in protocol specifications is to improve interoperability between agents of the various classes, as an extension of the Robustness principle (or Postel's Law): Be conservative in what you do; be liberal in what you accept from others.

As an engineering principle, robustness predates software by a long shot. There is a standard for railroad tracks, and a separate standard for rail car wheels. The railroad track standard says how wide a track should be, and gives some tolerances which should be kept within. Similarly, a rail car wheel standard would define how far apart the rail car wheels should be, and give restrictions and guidelines on the center of gravity of the car, guidelines on how the flanges of a rail car wheel should be constructed and some tolerances of those. The goal of the standard and the tolerances is to avoid derailment. The rules for the rails and the rules for the rail car wheels go together.

How does this map to the current HTML debate? The current HTML spec contains requirements for HTML interpreting agents (browsers, or the rails); the discussion is around the standards for HTML documents (rail cars) and for the behavior of document producers as well.

The goal of interoperability is to keep the trains on the rails -- to insure that web pages presented are received and interpreted as expected. Following the robustness principle, It is reasonable to have a conservative standard for HTML documents (and authoring tools), and for that standard to be more conservative then would be required if every implementation (rail system) exactly followed all of the standard. Document conformance, recommended document practice, and backward compatible advisory components are all reasonable targets for specification.

Of course, designing a good set of conservative guidelines involves negotiating set of tradeoffs. Is there only one "conservative" guideline? Are there several, depending on the importance of backward compatibility with older browsers? Should the fact that some content was previously conforming have weight? Those are still the open questions.

But conservative guidelines for content and authoring tools are important, and not (as originally noted in the bug report cited above) "ideology" or "loyalty".

w3c,html5,representation,resource,angels,pins,uri,url,iri

Everyone knows what a URL is, right? It's just a simple thing -- a kind of pointer to something. You'd think a standard around URLs would be easy, compared to something as complicated as a document description and application environment.

Except that every little bit of what URLs are, how they work, how to use them and find them, has endless complexities and is a source of disagreement! No wonder web standards are hard.

Trying to understand all of the issues around URLs winds up pushing into philosophy of language, knowledge representation and AI systems, security and spoofing, metadata, the way in which academic citations are made, the requirements for gracefully moving to IPv6, the upgrade of the domain name system to allow non-English domain names (that use letters other than A-Z), and so on and so forth.

This post starts to explore a few of the issues, why they're hard, and why we're still arguing about them. I'll expand it or blog more over time; comments appreciated.

Sometimes, when you're dealing with a hard subject in a committee, you wind up with a "bike shed" discussions (see Parkinson's Law of Triviality). In the case of URLs, I think the "bike shed" is and has been the terminology -- people argue about what to call things, rather than harder things to talk about, like how it works and what the requirements are. "Just rename the spec to be X, and it's fine, I'm ok if you call it an X spec, I just don't like you calling it a Y."

"Universal" vs. "Uniform"

Perhaps it was hubris to envision a world all connected by links, that led to a vision of the World Wide Web where there was not just an identifier, a resource identifier, but a Universal resource identifier. Universal: it could be used for any situation when you wanted to identify something. Now, something that is universal actually needs to work in all of the situations that might need an identifier (whatever that is) for identifying a resource (whatever that is).

But in the history of computer science, there have been lots and lots of different kinds of identifiers used for reference; in the history of language, you might think that any "noun phrase" is a kind of identifier for something. So the requirements for a Universal resource identifier are pretty extensive, hard to sort out. The group charted with standardizing these URI things -- bringing it from a vision to a standard -- (which I chaired), first tried to tackle the problem of not really wanting to solve that universal problem. How could you identify anything? So, in the great bike shed tradition, after hundreds (perhaps thousands) of emails on the subject, "Universal" was changed to "Uniform". Something that is "Uniform" just has to work the same wherever it's used, but if it's "Universal", it would have to be good for *every* kind of resource identification, which is ambitious but (alas) impossible.

"Identifiers" become "Names" and "Locators"

One of the great advances in distributed system design was the introduction of a kind of network design that separated naming, addressing, and routing (host name, IP address, which gateway to send it to.) Here at the application layer (that is, the web resting on top of the internet), this distinction has been recapitulated: is a URL the "name" of something, which gives you a hint as to how you get at it, is it really the "address" of something, telling you where on the Internet it is, but it's up to you to figure out how to get there, or is it really a "route" to something, telling you what steps you have to take to get there?

And as we all know, these separations are somewhat artificial. Will the post office deliver a letter addressed to "Hixie, Google"? Don't we use names as addresses and routes as names? Since one can be used for the other, are they really different?

They're certainly different in the library and information management world: knowing what a document IS (in caps) is a lot different from knowing where you could find it.

We spent a lot of time on this topic, finally coming up with the decision: the space of Uniform Resource Identifiers (URIs) would be split: there would be Uniform Resource Names ( URNs) and the rest. The rest of these things would be called URLs, "Uniform Resource Locators". URNs have their own requirements [RFC1737], registration procedures, and syntax, but would fit inside the overall URI space: URNs started with "urn:" and every other kind of URI didn't.

Given the ambiguity of the URN/URL divide, that URNs that can be used as locators (using [Dynamic Delegation Discovery System]) and URLs can and are used as names without any "resolution" protocol in mind, maybe trying to call them different things was hopeless. URL, URI, IRI, I'm not a stickler, at least in prose. (In formal specifications, that's another story.)

Do Resources Exist?

Through all of this, the idea of a "resource" is still one of the Great Mysteries of the web. What is a resource? How can you tell whether two resources are the same resource? Are there different kinds of resources (Information Resources vs. non-Information Resources)?

This Great Mystery has eluded being pinned down since the beginning, to the point where, well, I thought it was fitting to think of Resources as Angels and URIs as Pins.

The W3C TAG (without me) spent quite a bit of time discussing this and related issues. TAG httpRange-14 addressed "what is the range of the HTTP function". Mathematically, a function f is a mapping from one set (the domain) into another set (the range). That is, if you write f(x) = y, x is in the domain, y is in the range, and f is the mapping. Presumably, for "f" standards for the act of "locating" done by the HTTP protocol, while the domain (what x is chosen from) is the set of "all URIs", while the range is the set(?) of all "resources".

Of course, this kind of assumes that you know what these things are. But what exactly is a "resource" and what does it mean to "identify" one?

It's interesting to look at the philosophy of language; if go back to Frege and the problem of Reference, you might ask whether if you put "the morning star" and "the evening star" as two addresses for "Venus", would it make sense to ask whether these "locators" identified the "same" resource? There's only one Venus, right? So you' could have many Pins in the single Angel.

Related questions come up when you ask what is the Resource identified by a single URI, just thinking about the boundaries: is "http://larry.masinter.net" an identifier for me or for my web page? Does it refer to the page at the current moment, or whenever you get around to looking at it? Does it identify just the HTML of that page, the HTML and its images, that page and everything on the whole site I might reference? The answer to most of these questions might be "yes"; that is, any kind of reference is naturally ambiguous. There are many Resources that might be identified by a single URI, and many Angels can dance on the point [!] of a single Pin.

"Resource" vs. "Representation"

Let's restrict the range of things we're talking about at the moment to URLs that start with "http:" or "ftp:" or a few others, and even ones that are typically used on the web in hyperlinked applications.

I click on a link, and my computer asks some other computer something and gets something back. What do I get? Is it "the resource" that was identified by that URL? It usually has something to do with that resource, but of course, when you press hard, there are URLs that don't return anything (you just POST to them, for example) and URLs that can return more than one thing, depending on the situation (using content negotiation or knowing something about it.)

In another bike shed moment, the decision was to try to separate out the notion of "resource" and "representation".

In a lot of situations, the difference between a "file" and "the name of the file plus the location of the file plus the contents of the file plus any file system data about the file" .... well, they're the same. So if you think of URLs like file names and clicking on a link like accessing a file, the "resource" and the "representation" don't really matter.

But for those for whom the distinction does matter, the idea of just tossing the distinction, well, it's annoying.

"Hyperlinks vs. Ontologies"

URLs are used extensively in hyperlinked applications--not just HTML, but other formats and protocols too (Flash, PDF, Silverlight, SVG, etc.). Hyperlinked applications such as web browsers use URLs to identify which network resource should be contacted during the interaction with the hypertext and how that contact is to be made.

But URLs have also been used for other purposes. For example,

XML applications use URIs in xmlns attributes to identify namespaces.
Semantic Web applications use URIs to denote concepts. I use "denote" as term related to, but different from, "identify".
Metadata applications use URLs to identify both the data that the metadata is about.

At one time, I was trying to argue that the notion of a "concept" doesn't really form a "set" in the mathematical sense, because there wasn't a clear idea of "equality".

(more to come.... about model theory, internationalization and the discussions around IRI, the difference between "the presentation of an IRI" and "an IRI", the process of registering new URI schemes, other systems such as XRI and DOI and Handle systems, and...

... a truly universal "tdb" scheme, which can be used to identify anything. Well, anything you can think of. Well, anything you can think of and write down what you're thinking.)

Masinter's Musings

March 24, 2010

Ozymandias' URI

March 22, 2010

Browsers are rails, web sites are trains (authoring conformance requirements)

March 13, 2010

Resources are Angels; URLs are Pins

Medley Interlisp Project, by Larry Masinter et al.

Links