Masinter's Musings: 05/2019

Topic: URL

Dear Dr Masinter,

I am a French software engineer and have a technical question about RFC 3986, Uniform Resource Identifier (URI): Generic Syntax.

What do you mean exactly by “hierarchical data” in this paragraph?

“The path component contains data, usually organized in hierarchical form, that, along with data in the non-hierarchical query component (Section 3.4), serves to identify a resource within the scope of the URI’s scheme and naming authority (if any).”

The word "usually" in RFC 3986 indicates that the text is explanatory rather than normative, so the terms used may not be precise.

For arbitrary URIs, interpretation of "/" and the hierarchy it establishes depends on the scheme. For http: and https: URIs, the interpretation depends on the server implementation. While many HTTP servers map the path hierarchy to a hierarchy of file names in the server's file system, such mappings are not usually one-to-one, and might vary depending on server configuration, case sensitivity or interpretation of unnormalized Unicode strings. For web clients, the WHATWG URL spec defines how web clients parse and interpret URLs (which correspond roughly to RFC 3987 IRIs).

For an idea of what was intended, a search for "hierarchical data" yields a Wikipedia entry for "Hierarchical data model" that has relevant examples and shows that there are many data models in use.

When I look at a directed graph of resources (where resources are the vertices and links between them are the edges), I don’t know how to decide that a particular link between two resources A and B is “hierarchical” or “non hierarchical”. Should it be just antisymmetric (A → B, but not B → A)? Should it be more restrictive, for instance the inclusion relationship (A ⊃ B)? Or another property?

The choice of data model is up to your application, and you can decide for each application which model matches your use.

For instance, which of the following directed graphs of resources should I consider as having “hierarchical” links and which should I consider as having “non hierarchical” links?

1. animalia → mammalia → dog
2. Elisabeth II → Charles → William
3. 1989 → 01 → 31
4. Pink Floyd → The Dark Side of the Moon → Money

If we use antisymmetry as the criterion for being “hierarchical”, the links of all directed graphs will be “hierarchical”.
If we use inclusion as the criterion for being “hierarchical”, only the links of 1 will be “hierarchical”, since in 2 a child has two parents, in 3 a month is shared by every year and in 4 an album can be created by several artists.
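
The two candidate criteria can be written down as checks on a list of links. A minimal Python sketch, reading "inclusion" as "every node has at most one parent" (that reading, the function names, and the extra 1990 link are assumptions made for illustration):

```python
def is_antisymmetric(edges):
    """True if no two distinct nodes are linked in both directions."""
    edge_set = set(edges)
    return all((b, a) not in edge_set for a, b in edge_set if a != b)


def has_single_parent(edges):
    """True if every node has at most one incoming link (tree-like)."""
    parents = {}
    for parent, child in set(edges):
        parents.setdefault(child, set()).add(parent)
    return all(len(p) <= 1 for p in parents.values())


# Graph 1 from the list above.
taxonomy = [("animalia", "mammalia"), ("mammalia", "dog")]

# Graph 3, with a second (hypothetical) year added to show the shared month.
calendar = [("1989", "01"), ("1990", "01"), ("01", "31")]

for name, graph in [("taxonomy", taxonomy), ("calendar", calendar)]:
    print(name, is_antisymmetric(graph), has_single_parent(graph))
```

Both graphs pass the antisymmetry check, but only the taxonomy passes the single-parent check, which matches the distinction drawn above.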

This information is needed to know if we should use the path component:

1. /animalia/mammalia/dog
2. /elisabeth-ii/charles/william
3. /1989/01/31
4. /pink-floyd/the-dark-side-of-the-moon/money

or the query component:

1. ?kingdom=animalia&class=mammalia&species=dog
2. ?grandparent=elisabeth-ii&parent=charles&child=william
3. ?year=1989&month=01&day=31
4. ?artist=pink-floyd&album=the-dark-side-of-the-moon&song=money

to identify a resource.

Which one of these you use depends not on one case but on the entire set of data you're trying to organize.
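
Either form can carry the same three values; what differs is whether order or names give them meaning. A small sketch using Python's standard urllib.parse (the host example.org and the helper names are assumptions, not part of the question):

```python
from urllib.parse import urlsplit, parse_qs


def path_segments(uri):
    """Return the non-empty path segments of a URI, in order."""
    return [seg for seg in urlsplit(uri).path.split("/") if seg]


def query_fields(uri):
    """Return the query parameters of a URI as a name -> first-value dict."""
    params = parse_qs(urlsplit(uri).query)
    return {name: values[0] for name, values in params.items()}


path_form = "http://example.org/1989/01/31"
query_form = "http://example.org/?year=1989&month=01&day=31"

print(path_segments(path_form))   # ['1989', '01', '31']
print(query_fields(query_form))   # {'year': '1989', 'month': '01', 'day': '31'}
```

In the path form the meaning of each value comes only from its position, so the ordering has to hold across the whole data set; in the query form each value is labelled and the order carries no meaning.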

In case 1, if you're identifying properties of groups of organisms by their taxonomic designation, the namespace is by definition hierarchical, established and managed by complex rules to resolve non-hierarchical conflicts, overlaps and disagreements. But the taxonomy of rankings has 7 or 8 levels (depending on how you count), not 3, and your example says "dog" without making clear whether you mean Canis lupus familiaris or the entire family Canidae. The inclusion relation is not hierarchical.

In case 2, you might have a hierarchy if you restricted the relationship to "heir apparent of".
In case 3, many use the hierarchy to organize the data such that 1989/01/31 identifies a resource from the 31st of January of the year 1989.
In case 4, many music organizers sort files by inventing a new category of "artist" for the album artist, using "A, B, C" for albums with multiple artists and "Various Artists" for "albums" that have different artists for each song.

People often invent hierarchies as a way of managing access control. WebDAV includes operations based on hierarchies.

Sometimes what is desired is a combination of different data models mixed, with the hierarchical model a "default" view, and the query parameters used to search the space and redirect to a canonical hierarchical URI.
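
One way to picture that combination is a function that turns a query into the canonical path a server would redirect to. This is only a sketch under assumptions: the level names come from the music example in the question, and the function name and the rule of rejecting skipped levels are mine.

```python
from urllib.parse import urlsplit, parse_qs, quote

# Hypothetical ordering of levels, taken from the music example.
LEVELS = ["artist", "album", "song"]


def canonical_path(uri):
    """Map a query-style URI to a hierarchical path.

    Returns None when the query names no level or skips one
    (for example a song without its album).
    """
    params = parse_qs(urlsplit(uri).query)
    segments = []
    for level in LEVELS:
        if level not in params:
            break
        # Percent-encode so a value containing "/" cannot add a level.
        segments.append(quote(params[level][0], safe=""))
    named = [level for level in LEVELS if level in params]
    if not segments or len(segments) != len(named):
        return None
    return "/" + "/".join(segments)


print(canonical_path("?artist=pink-floyd&album=the-dark-side-of-the-moon&song=money"))
print(canonical_path("?song=money"))
```

A server using this approach would answer the first request with a redirect to the returned path and treat the second as a search, since a song name alone does not pick out one place in the hierarchy.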

If you're defining the space for an API, the JSON-HAL design pattern adds a layer of indirection by using link relations rather than URL patterns. This is especially useful in applications where the hierarchical pattern is used for high-performance web servers, which use the URL syntax for optimized load-balanced servers with multiple patterns.
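
To make the indirection concrete, here is a sketch of a HAL-style document and a client that looks up links by relation name instead of building paths itself. The document content, the relation names "artist" and "track", and the helper follow are hypothetical; only the "_links" and "href" structure comes from HAL.

```python
import json

# Hypothetical HAL-style response for the album resource.
album_doc = json.loads("""
{
  "title": "The Dark Side of the Moon",
  "_links": {
    "self":   {"href": "/pink-floyd/the-dark-side-of-the-moon"},
    "artist": {"href": "/pink-floyd"},
    "track":  [
      {"href": "/pink-floyd/the-dark-side-of-the-moon/money", "title": "Money"}
    ]
  }
}
""")


def follow(doc, rel):
    """Return the href values the document advertises for a link relation."""
    links = doc.get("_links", {}).get(rel, [])
    if isinstance(links, dict):  # a relation may hold one link or a list
        links = [links]
    return [link["href"] for link in links]


print(follow(album_doc, "artist"))
print(follow(album_doc, "track"))
```

Because the client only depends on the relation names, the server remains free to change the URL layout behind them.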

Yours faithfully,

XXXX XXXX (name redacted)

Topic: preservation

Reading about the National Archives budget woes: perhaps as digital material is becoming more prevalent, the volume of paper documents isn't growing?

What are the unique requirements of the National Archives for online storage of digital material?
Unlike most business archives, the lifetime of archived documents is measured in centuries.
Archives must be secured against both accidental and intentional loss for that entire lifetime, and unauthorized revelation must be prevented for at least decades.

For public records, LOCKSS (Lots of Copies Keeps Stuff Safe) might be a solution. Distribute copies to each state or region.

But for confidential records, the more copies, the more likely it is the information will be revealed.

But there is an approach worth investigating: Secret Sharing, where each state could maintain a separate secure facility under its own control.

This would allow some resilience to meddling, take-down, or premature release of information unless a large number of states agree.
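
As a rough illustration of the idea, here is a sketch of Shamir's threshold scheme, one well-known form of secret sharing; I am not claiming it is the right scheme for an archive. The prime, the 50 shares, and the threshold of 26 are assumptions, and in practice the shared secret would more likely be an encryption key than the records themselves.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def split_secret(secret, n_shares, threshold):
    """Split an integer into n_shares points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= threshold <= n_shares:
        raise ValueError("threshold must be between 1 and n_shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule, modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0 to get the secret back."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


# One share per state (assumed 50), with a majority of 26 needed to recover.
shares = split_secret(123456789, n_shares=50, threshold=26)
print(recover_secret(shares[:26]) == 123456789)
```

Fewer than the threshold number of shares reveal nothing about the secret, so no small group of facilities can release or reconstruct it alone, and the loss of some facilities does not destroy it. The modular inverse via pow(den, -1, PRIME) needs Python 3.8 or later.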

May 9, 2019

On the nature of hierarchical URLs

Topic: URL

May 8, 2019

Using Secret Sharing for National Archives

Topic: preservation

Medley Interlisp Project, by Larry Masinter et al.