December 8, 2019

I've been Wikipedia'd!

At some point I had the silly idea that I should be listed in Wikipedia. Now, like the monkey's paw or the Midas touch, my wish has turned into a curse: I discovered that it happened, and my Wikipedia page is full of nonsense. It is hard to find a sentence that doesn't have an error or two or three. And each error correction requires four things, including citable third-party proof that the change is justified.

Is this typical? To see so many errors in Wikipedia articles?

  • All of the Interlisp work was at Xerox. Although I was listed as a student at Stanford and didn't get my PhD until 1980, I was working at Xerox full-time after 1976.
  • I had nothing to do with Interlisp-Jericho.
  • There wasn't a port of Interlisp to the VAX. There was an effort to build one, and I wrote a document trying to scope out how much work would need to be done. That document wasn't to "document the port".
  • My work at Stanford was on the Dendral project as an employee (my Alternative Service), not as a student. The program was in Lisp.
  • My work on document management was almost all at Xerox, not for Adobe. I didn't do "pioneering work on the PDF format" (for anyone).
  • I remained an employee of Xerox PARC, becoming a "Principal Scientist", but never had the title "Chief Scientist" and never reported to "Xerox AI Systems".
  • I wasn't "instrumental in the development of the PDF MIME type" (I helped publish it at best.)
  • My work on internet standards through IETF and W3C was over many years, between Xerox, AT&T Labs and Adobe. But it was mainly a volunteer effort on my part.
  • Internet standards are not published in "peer reviewed journals"; they are reviewed, but for different reasons than peer-reviewed journals.
  • I never worked on Apache. I never collaborated with Nick Kew or Kim Veltman or anyone else on any book.
  • The footnote references don't correspond very well to the topics discussed.

July 1, 2019

Why these odd anonymous comments?

I've been getting comments on my blog (hosted by Blogger) that puzzle me.
  • You have brought up a very wonderful details, thanks for the post
  • Appreciate it for helping out, excellent info
  • This site definitely has all the information I needed about this subject and didn't know who to ask.
  • Hi Dear, are you actually visiting this web page on a regular basis, if so afterward you will definitely obtain fastidious know-how.
  • Excellent post. I was checking continuously this blog and I am impressed! Extremely helpful info particularly the last part :) I care for such information much. I was seeking this certain info for a very long time.  Thank you and best of luck.
  • You have made some good points there. I checked on the web for additional information about the issue and found most individuals will go along with your views on this site.
  • I am in fact pleased to glance at this webpage posts which contains lots of useful data, thanks for providing these kinds of statistics.

It wasn't until I had gotten 2 or 3 that I went from "Publish" to "Delete" to "Mark as spam" (the three options offered by Blogger).

The things that distinguish these comments:

  • They exist. I rarely get comments. These happen once or twice a month.
  • They are always Anonymous.
  • They always have bad English grammar.
  • They don't fit any known category of spam; they don't promote anything or link anywhere.
  • They are flattering
  • They contain nothing that would associate them with the content of the blog post they are commenting on.
Here are some theories:
  • This is a data communication method, using steganography based on grammatical or punctuation or word choices.
  • There is some grad student running experiments / training an AI etc. based on human spam detection
  • Someone is running a background check for blogs that auto-publish anonymous comments.
Any better theories? Leave a comment 😀

May 9, 2019

On the nature of hierarchical URLs

Topic: URL 

Dear Dr Masinter, 
I am a French software engineer and have a technical question about RFC 3986, Uniform Resource Identifier (URI): Generic Syntax. 
What do you mean exactly by “hierarchical data” in this paragraph? 
“The path component contains data, usually organized in hierarchical form, that, along with data in the non-hierarchical query component (Section 3.4), serves to identify a resource within the scope of the URI’s scheme and naming authority (if any).” 
The language "usually" in RFC 3986 indicates the text isn't normative, but rather explanatory. So the terms used may not be precise.

For arbitrary URIs, interpretation of "/" and the hierarchy it establishes depends on the scheme. For http: and https: URIs, the interpretation depends on the server implementation. While many HTTP servers map the path hierarchy to the hierarchy of file names in the server's file system, such mappings are not usually 1-1, and might vary depending on server configuration, case sensitivity or interpretation of unnormalized Unicode strings. For web clients, the WHATWG URL spec defines how web clients parse and interpret URLs (which correspond roughly to RFC 3987 IRIs).
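
One place where the "/" hierarchy has a meaning that doesn't depend on the server is relative reference resolution (RFC 3986, Section 5). A small sketch using Python's standard urllib.parse; the host example.org and the date-style path are placeholders, not anything the RFC prescribes:

```python
from urllib.parse import urljoin

base = "http://example.org/1989/01/31"  # placeholder host and path

# A relative reference is resolved against the "/" hierarchy of the base path.
sibling = urljoin(base, "30")        # replaces the last path segment
cousin = urljoin(base, "../02/14")   # ".." climbs one level before descending

print(sibling)  # http://example.org/1989/01/30
print(cousin)   # http://example.org/1989/02/14
```

Nothing comparable exists for the query component: a client can't say "one level up" from a set of name=value pairs.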

For an idea of what was intended, a search for "hierarchical data" yields a Wikipedia entry for "Hierarchical data model" that has relevant examples and shows that there are many data models in use.
When I look at a directed graph of resources (where resources are the vertices and links between them are the edges), I don’t know how to decide that a particular link between two resources A and B is “hierarchical” or “non hierarchical”. Should it be just antisymmetric (A → B, but not B → A)? Should it be more restrictive, for instance the inclusion relationship (A ⊃ B)? Or another property?
The choice of data model is up to your application, and you can decide for each application which model matches your use.
For instance, which of the following directed graphs of resources should I consider as having “hierarchical” links and which should I consider as having “non hierarchical” links? 
1. animalia → mammalia → dog
2. Elisabeth II → Charles → William
3. 1989 → 01 → 31
4. Pink Floyd → The Dark Side of the Moon → Money
If we use antisymmetry as the criterion for being “hierarchical”, the links of all directed graphs will be “hierarchical”.
If we use inclusion as the criterion for being “hierarchical”, only the links of 1 will be “hierarchical”, since in 2 a child has two parents, in 3 a month is shared by every year and in 4 an album can be created by several artists.
This information is needed to know if we should use the path component: 
1. /animalia/mammalia/dog
2. /elisabeth-ii/charles/william
3. /1989/01/31
4. /pink-floyd/the-dark-side-of-the-moon/money
or the query component: 
1. ?kingdom=animalia&class=mammalia&species=dog
2. ?grandparent=elisabeth-ii&parent=charles&child=william
3. ?year=1989&month=01&day=31
4. ?artist=pink-floyd&album=the-dark-side-of-the-moon&song=money
to identify a resource. 
Which one of these you use depends not on one case but on the entire set of data you're trying to organize.
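
To make the difference concrete, here is a rough sketch of how a generic client sees the two forms, using Python's standard urllib.parse. The helper names path_segments and query_params are made up for this illustration, and example.org is a placeholder host:

```python
from urllib.parse import urlsplit, parse_qs

def path_segments(url):
    """Return the non-empty "/"-separated segments of the URL's path, in order."""
    return [seg for seg in urlsplit(url).path.split("/") if seg]

def query_params(url):
    """Return the query as a dict of name -> first value given for that name."""
    return {name: values[0] for name, values in parse_qs(urlsplit(url).query).items()}

by_path = path_segments("http://example.org/1989/01/31")
by_query = query_params("http://example.org/?year=1989&month=01&day=31")

print(by_path)   # ['1989', '01', '31']
print(by_query)  # {'year': '1989', 'month': '01', 'day': '31'}
```

The path gives you ordered, unnamed levels; the query gives you named values with no implied nesting. Which of those matches your data is the modeling decision.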

In case 1, if you're identifying properties of groups of organisms by their taxonomic designation, the namespace is by definition hierarchical, established and managed by complex rules to resolve non-hierarchical conflicts, overlaps and disagreements. But the taxonomy of rankings has 7 or 8 levels (depending on how you count), not 3. Your example says "dog", and it's not clear if you mean Canis lupus familiaris or the entire family of Canidae. The inclusion relation is not hierarchical.

In case 2, you might have a hierarchy if you restricted the relationship to "heir apparent of".
In case 3, many use the hierarchy to organize the data such that 1989/01/31 is used to identify a resource from the 31st of January of the year 1989.
In case 4, many music organizers sort out files by inventing a new category of "artist" for the album-artist and using "A, B, C" for albums where there are multiple artists, and "Various Artists" for "albums" that have varieties of artists for each song.

People often invent hierarchies as a way of managing access control. WebDAV includes operations based on hierarchies.

Sometimes what is desired is a combination of different data models mixed, with the hierarchical model a "default" view, and the query parameters used to search the space and redirect to a canonical hierarchical URI.
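
A minimal sketch of that combination, assuming a date-style space with the hypothetical levels year, month and day. A query that names an unbroken prefix of the levels maps to a canonical path (which a server could answer with a redirect); anything else is treated as a real search:

```python
from urllib.parse import parse_qs, quote

LEVELS = ("year", "month", "day")  # hypothetical hierarchy, outermost level first

def canonical_path(query):
    """Return the canonical hierarchical path for a query string, or None
    if the query does not name an unbroken prefix of LEVELS."""
    params = {name: values[0] for name, values in parse_qs(query).items()}
    segments = []
    for name in LEVELS:
        if name not in params:
            break
        segments.append(quote(params.pop(name), safe=""))
    if params or not segments:
        # Leftover or missing parameters: no single canonical URI applies.
        return None
    return "/" + "/".join(segments)

print(canonical_path("year=1989&month=01&day=31"))  # /1989/01/31
print(canonical_path("day=31"))                     # None
```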

If you're defining the space for an API, the JSON-HAL design pattern adds a layer of indirection using link relations rather than URL patterns, especially in applications where the hierarchical pattern is used for high-performance web servers which use the URL syntax for optimized load-balanced servers with multiple patterns.
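
As a rough illustration of that indirection: a HAL resource carries a "_links" object keyed by link relation, and the client follows a relation instead of assembling a URL from a pattern. The relation names ("album", "artist") and the href values below are invented for the example:

```python
import json

# A HAL-style representation of one song. Only "_links", "self" and "href"
# come from HAL itself; everything else here is made up for illustration.
song = json.loads("""
{
  "title": "Money",
  "_links": {
    "self":   {"href": "/songs/8841"},
    "album":  {"href": "/albums/the-dark-side-of-the-moon"},
    "artist": {"href": "/artists/pink-floyd"}
  }
}
""")

def follow(resource, rel):
    """Return the target URI for a link relation on a HAL resource."""
    return resource["_links"][rel]["href"]

print(follow(song, "album"))  # /albums/the-dark-side-of-the-moon
```

The server is then free to change its URL layout without breaking clients, since clients only depend on the relation names.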

 Yours faithfully,
XXXX XXXX (name redacted)

May 8, 2019

Using Secret Sharing for National Archives

Topic: preservation

Reading about the National Archives budget woes: perhaps as digital material is becoming more prevalent, the volume of paper documents isn't growing?

What are the unique requirements of the National Archives for online storage of digital material?
Unlike most business archives, the lifetime of archived documents is measured in centuries.
The security of archives from both accidental and intentional loss is for that lifetime; unauthorized revelation must be prevented for at least decades.

For public records, LOCKSS (Lots of Copies Keeps Stuff Safe) might be a solution.  Distribute copies to each state or region.

But for confidential records, the more copies, the more likely it is the information will be revealed.

But there is an approach worth investigating, using Secret Sharing where each State could maintain a separate secure facility under its own control.

This would allow some resilience to meddling, take-down, or premature release of information unless a large number of states agree.
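
For the curious, here is a toy sketch of one well-known threshold scheme, Shamir's secret sharing. I'm not claiming it is the right scheme for an archive; the function names, the choice of prime, and the "5 holders, any 3 can recover" numbers are all assumptions for illustration. It needs Python 3.8 or later for the modular inverse:

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this

def split_secret(secret, threshold, shares):
    """Split an integer secret into `shares` points on a random polynomial
    of degree threshold-1 whose constant term is the secret."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        y = 0
        for c in reversed(coeffs):  # Horner's rule, mod PRIME
            y = (y * x + c) % PRIME
        return y

    return [(x, evaluate(x)) for x in range(1, shares + 1)]

def recover_secret(points):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

state_shares = split_secret(123456789, threshold=3, shares=5)
print(recover_secret(state_shares[:3]))  # 123456789
```

In practice one would presumably encrypt the records and split only the key this way, so each holder stores a small share rather than a copy of the material.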

April 9, 2019

The Paperless Office and the Horseless Carriage

Topic: preservation

You know the story of the horseless carriage with a buggy-whip holder (in case you needed to put a horse in front of it). But what we wound up with is a wide variety of forms: motorcycle, automobile, tank, train, etc.  The transition was accompanied by corresponding developments in infrastructure.

When Xerox started its Palo Alto Research Center nearly 50 years ago, part of its mission was the Paperless Office -- a world where work was done without the use of paper.

We're still in the middle of a long transition to paperless processes in billing, statements, advertising, news, receipts, insurance, medical records, government, textbooks, and fiction, with documents still playing a major role in data-based activities. In most cases, these processes are being converted from paper documents to electronic ones, with PDF being an important carrier because of its ability to straddle the divide between the paper and electronic world.

In many of these processes, we're only seeing the beginnings of a transition to another phase, of data connections, where electronic documents and email are being supplanted by shared data, and documents are generated only on demand by those who are outside of the data-centric roles.

HTML may have been originally designed as a document format for scientific papers at CERN, but its primary thrust in the last decade has been as a way of delivering applications to consumers.

The problem of the digital dark age is not so much technological obsolescence as it is that there are no document by-products of work done; this is not something that can be solved by new kinds of archival documents. In the records management community, documents must have or carry their own context which allows auditing of past behavior by examining the documents the process left behind.

We need some better ways of straddling the document and data world such that data archives are produced in a way that allows them to be audited, redacted, and processed without having all of the original context. Most data channels are, for efficiency reasons, context free. Metadata (embedded or supplemental) is a document-centric way of supplying context, but it isn't enough.

January 10, 2019

The Internet is a WMD (Weapon of Mass Delusion)

Some thoughts on the Internet as the root problem, based on this Washington Post Op-Ed.

When I was a kid, I remember reading about Robert Oppenheimer's work on the atomic bomb and his thought that scientists developing technologies that could be weaponized had some responsibility to counter their misuse. It was probably during my phase of reading biographies.

If the Internet, the Web, social media, analytics and targeted advertising have been weaponized, turned into weapons of mass delusion, what are we doing to effectively counter this threat?

I see some focus on decentralization, but that doesn't seem to mitigate the problem.
Where should this be taken up: IETF, ISOC, ICANN, W3C, ACM, somewhere else?

It seems like this isn't a problem that can be solved by fixing a handful of sites or by applying sanctions to individuals and companies. How can we counter the susceptibility of mass communication systems to this kind of manipulation and the continually escalating arms race of hacks and prevention hacks?

Pseudonymous postings seem to be essential, with an increasing sophistication of operational techniques.