May 9, 2019

On the nature of hierarchical URLs

Topic: URL 

Dear Dr Masinter, 
I am a French software engineer and have a technical question about RFC 3986, Uniform Resource Identifier (URI): Generic Syntax. 
What do you mean exactly by “hierarchical data” in this paragraph? 
“The path component contains data, usually organized in hierarchical form, that, along with data in the non-hierarchical query component (Section 3.4), serves to identify a resource within the scope of the URI’s scheme and naming authority (if any).” 
The word "usually" in RFC 3986 indicates that the text is explanatory rather than normative, so the terms used may not be precise.

For arbitrary URIs, the interpretation of "/" and the hierarchy it establishes depends on the scheme. For http: and https: URIs, the interpretation depends on the server implementation. While many HTTP servers map the path hierarchy to a hierarchy of file names in the server's file system, such mappings are not usually 1-1, and might vary depending on server configuration, case sensitivity or interpretation of unnormalized Unicode strings. For web clients, the WHATWG URL spec defines how web clients parse and interpret URLs (which correspond roughly to RFC 3987 IRIs).

For an idea of what was intended, a search for "hierarchical data" yields a Wikipedia entry for "Hierarchical data model" that has relevant examples and shows that there are many data models in use.
When I look at a directed graph of resources (where resources are the vertices and links between them are the edges), I don’t know how to decide that a particular link between two resources A and B is “hierarchical” or “non hierarchical”. Should it be just antisymmetric (A → B, but not B → A)? Should it be more restrictive, for instance the inclusion relationship (A ⊃ B)? Or another property?
The choice of data model is up to your application, and you can decide for each application which model matches your use.
For instance, which of the following directed graphs of resources should I consider as having “hierarchical” links and which should I consider as having “non hierarchical” links? 
1. animalia → mammalia → dog
2. Elisabeth II → Charles → William
3. 1989 → 01 → 31
4. Pink Floyd → The Dark Side of the Moon → Money
If we use antisymmetry as the criterion for being “hierarchical”, the links of all directed graphs will be “hierarchical”.
If we use inclusion as the criterion for being “hierarchical”, only the links of 1 will be “hierarchical”, since in 2 a child has two parents, in 3 a month is shared by every year and in 4 an album can be created by several artists.
This information is needed to know if we should use the path component: 
1. /animalia/mammalia/dog
2. /elisabeth-ii/charles/william
3. /1989/01/31
4. /pink-floyd/the-dark-side-of-the-moon/money
or the query component: 
1. ?kingdom=animalia&class=mammalia&species=dog
2. ?grandparent=elisabeth-ii&parent=charles&child=william
3. ?year=1989&month=01&day=31
4. ?artist=pink-floyd&album=the-dark-side-of-the-moon&song=money
to identify a resource. 
Which one of these you use depends not on one case but on the entire set of data you're trying to organize.
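As a rough illustration of the practical difference between the two forms, here is a minimal Python sketch using the standard `urllib.parse` module. The helper names (`as_path`, `as_query`, `path_segments`, `query_fields`) are made up for this example. The path form carries order but no field names; the query form carries field names but no required order.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlsplit

def as_path(segments):
    """Build a path reference; meaning comes from segment position."""
    return "/" + "/".join(quote(s, safe="") for s in segments)

def as_query(fields):
    """Build a query reference; meaning comes from field names."""
    return "?" + urlencode(fields)

def path_segments(ref):
    """Split a path reference back into its ordered segments."""
    return [unquote(s) for s in urlsplit(ref).path.split("/") if s]

def query_fields(ref):
    """Parse a query reference into a name -> value mapping (first value only)."""
    return {k: v[0] for k, v in parse_qs(urlsplit(ref).query).items()}

record = {"year": "1989", "month": "01", "day": "31"}
path_ref = as_path(record.values())
query_ref = as_query(record)
```

Reordering the query fields leaves `query_fields` unchanged, while reordering path segments identifies something else; that is one way to see whether a data set really has the ordering a path implies.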

In case 1, if you're identifying properties of groups of organisms by their taxonomic designation, the namespace is by definition hierarchical, established and managed by complex rules to resolve non-hierarchical conflicts, overlaps and disagreements. But the taxonomy of rankings has 7 or 8 levels (depending on how you count), not 3, and your example says "dog" without making clear whether you mean Canis lupus familiaris or the entire family of Canidae. The inclusion relation is not hierarchical.

In case 2, you might have a hierarchy if you restricted the relationship to "heir apparent of".
In case 3, many use the hierarchy to organize the data such that 1989/01/31 is used to identify a resource from the 31st of January of the year 1989.
In case 4, many music organizers sort out files by inventing a new category of "artist" for the album-artist and using "A, B, C" for albums where there are multiple artists, and "Various Artists" for "albums" that have varieties of artists for each song.

People often invent hierarchies as a way of managing access control. WebDAV includes operations based on hierarchies.

Sometimes what is desired is a combination of different data models mixed, with the hierarchical model a "default" view, and the query parameters used to search the space and redirect to a canonical hierarchical URI.
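A minimal sketch of that combination, in Python, might look like the following. The level order in `LEVELS` and the function name `canonical_path` are assumptions for illustration; a real server would presumably answer the query form with an HTTP redirect to the path this function returns.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Hypothetical level order for the music example (case 4).
LEVELS = ("artist", "album", "song")

def canonical_path(ref):
    """Map a query-style reference to a canonical hierarchical path.

    Levels are taken in LEVELS order; the walk stops at the first
    missing level, since deeper levels cannot be placed without it.
    """
    fields = parse_qs(urlsplit(ref).query)
    segments = []
    for level in LEVELS:
        values = fields.get(level)
        if not values:
            break
        segments.append(quote(values[0], safe=""))
    return "/" + "/".join(segments)
```

For example, `canonical_path("?song=money&artist=pink-floyd&album=the-dark-side-of-the-moon")` yields the path form from the question regardless of the order of the query fields.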

If you're defining the space for an API, the JSON-HAL design pattern adds a layer of indirection using link relations rather than URL patterns, especially in applications where the hierarchical pattern is used for high-performance web servers which use the URL syntax for optimized load-balanced servers with multiple patterns.
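To make the indirection concrete, here is a small Python sketch of a HAL-style representation and a client helper that follows link relations instead of assembling URLs. The relation names (`artist`, `track`), the hrefs, and the `follow` helper are hypothetical; only the `_links` / `href` structure comes from the HAL convention.

```python
# Hypothetical HAL-style representation of an album resource.
album = {
    "_links": {
        "self": {"href": "/pink-floyd/the-dark-side-of-the-moon"},
        "artist": {"href": "/pink-floyd"},
        # A relation may map to a list of link objects.
        "track": [
            {"href": "/pink-floyd/the-dark-side-of-the-moon/money",
             "title": "Money"},
        ],
    },
    "title": "The Dark Side of the Moon",
}

def follow(resource, rel):
    """Return the hrefs advertised for a link relation.

    The client never builds a URL from a pattern; it only reads
    what the server put under `_links`.
    """
    link = resource["_links"][rel]
    links = link if isinstance(link, list) else [link]
    return [item["href"] for item in links]
```

With this arrangement the server is free to change its URL layout, as long as the link relations keep their meaning.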

Yours faithfully,
XXXX XXXX (name redacted)

May 8, 2019

Using Secret Sharing for National Archives

Topic: preservation

Reading about the National Archives budget woes: perhaps as digital material is becoming more prevalent, the volume of paper documents isn't growing?

What are the unique requirements of the National Archives for online storage of digital material?
Unlike most business archives, the lifetime of archived documents is measured in centuries.
The security of archives from both accidental and intentional loss is for that lifetime; unauthorized revelation must be prevented for at least decades.

For public records, LOCKSS (Lots of Copies Keeps Stuff Safe) might be a solution.  Distribute copies to each state or region.

But for confidential records, the more copies, the more likely it is the information will be revealed.

There is, however, an approach worth investigating: Secret Sharing, where each State could maintain a separate secure facility under its own control.

This would allow some resilience to meddling, take-down, or premature release of information unless a large number of states agree.
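As a rough sketch of how such a threshold scheme could work, here is a toy Python version of Shamir's secret sharing, one well-known form of secret sharing (the post does not say which scheme is meant). The prime, the function names, and the 3-of-5 parameters in the usage note are assumptions for illustration; a real archive would share an encryption key rather than the records themselves, and would need a vetted implementation.

```python
import secrets

# Assumption: the Mersenne prime 2**127 - 1 is a large enough field
# for a toy secret; all arithmetic is done modulo this prime.
PRIME = 2**127 - 1

def split_secret(secret, threshold, shares):
    """Split `secret` into `shares` points; any `threshold` recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold - 1 whose constant term
    # is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME)
                         for _ in range(threshold - 1)]

    def evaluate(x):
        # Horner's rule, modulo PRIME.
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]

def recover_secret(points):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

For instance, `split_secret(key, 3, 5)` would give five facilities one share each, and `recover_secret` applied to any three of those shares returns `key`; fewer than three shares reveal nothing about it.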

April 9, 2019

The Paperless Office and the Horseless Carriage

Topic: preservation

You know the story of the horseless carriage with a buggy-whip holder (in case you needed to put a horse in front of it). But what we wound up with is a wide variety of forms: motorcycle, automobile, tank, train, etc. The transition was accompanied by corresponding developments in infrastructure.

When Xerox started its Palo Alto Research Center nearly 50 years ago, part of its mission was the Paperless Office -- a world where work was done without the use of paper.

We're still in the middle of a long transition to paperless processes in billing, statements, advertising, news, receipts, insurance, medical records, government, textbooks, and fiction, with documents still playing a major role in data-based activities. In most cases, these processes are turned from using paper documents to electronic ones, with PDF being an important carrier because of its ability to straddle the divide between the paper and electronic world.

In many of these processes, we're only seeing the beginnings of a transition to another phase, of data connections, where electronic documents and email are being supplanted by sharing data, and documents are generated only on demand for those who are outside of the data-centric roles.

HTML may have been originally designed as a document format for scientific papers at CERN, but its primary thrust in the last decade has been as a way of delivering applications to consumers.

The problem of the digital dark age is not so much technological obsolescence as it is that there are no document by-products of work done; this is not something that can be solved by new kinds of archival documents. In the records management community, documents must have or carry their own context which allows auditing of past behavior by examining the documents the process left behind.

We need some better ways of straddling the document and data world such that data archives are produced in a way that allows them to be audited, redacted, and processed without having all of the original context. Most data channels are, for efficiency reasons, context free. Metadata (embedded or supplemental) is a document-centric way of supplying context, but it isn't enough.

January 10, 2019

The Internet is a WMD (Weapon of Mass Delusion)

Some thoughts on the Internet as the root problem, based on this Washington Post Op-Ed.

When I was a kid, I remember reading about Robert Oppenheimer's work on the atomic bomb and his thought that scientists developing technologies that could be weaponized had some responsibility to counter their misuse. It was probably during my phase of reading biographies.

If the Internet, the Web, social media, analytics and targeted advertising have been weaponized, turned into weapons of mass delusion, what are we doing to effectively counter this threat?

I see some focus on decentralization, but that doesn't seem to mitigate the problem.
Where should this be addressed: IETF, ISOC, ICANN, W3C, ACM, or somewhere else?

It seems like this isn't a problem that can be solved by fixing a handful of sites, or by applying sanctions to individuals and companies. How can we counter the susceptibility of mass communication systems to this kind of manipulation and the continuing escalating arms race of hacks and prevention hacks?

Pseudonymous postings seem to be essential, with an increasing sophistication of operational techniques.