September 10, 2013

HTTP/2.0 worries

I tried to explain HTTP/2.0 in my previous post. This post notes some nagging worries about HTTP/2.0 going forward. Maybe these are nonsense, but ... tell me why I'm wrong ....

Faster is better, but faster for whom?

It should be no surprise that using software is more pleasant when it responds more quickly.  But the effect is pronounced and the difference between "usable" and "just frustrating".  For the web, the critical time is between when the user clicks on a link and the results are legible and useful. Studies (and others) show that improving page load time has a significant effect on the use of web sites.  And a primary component of web speed is the network speed: not just the bandwidth but, for the web, the latency. Much of the world doesn't have high-speed Internet, and the web is often close to unusable.

The problem is -- faster for whom? In general, when optimizing something, one makes changes that speed up common cases, even if making uncommon cases more expensive. Unfortunately, different communities can disagree about what is "common", depending on their perspective.

Clearly, connection multiplexing helps sites that host all of their data at a single server more than it helps sites that open connection to multiple systems.

It should be a good thing that the protocol designers are basing optimizations by measuring the results on real web sites and real data. But the data being used risks a bias; so far little of the data used has been itself published and results reproduced. Decisions in the working group are being made based on limited data, and often are not reproducible or auditable.

Flow control at multiple layers can interfere

This isn't the first time there's been an attempt to revise HTTP/1.1; the HTTP-NG effort also tried. One of the difficulties with HTTP-NG was that there was some interaction between TCP flow control and the framing of messages at the application layer, resulting in latency spikes.  And those working with SPDY report that SPDY isn't effective without server "prioritization", which I understand to be predictively deciding which resources the client will  need first, and returning their content chunks with higher priority for being sent sooner. While some servers have added such facilities for prioritization and prediction, those mechanisms are unreported and proprietary.

Forking  

While HTTP/2.0 started with SPDY, SPDY development development continues independently of HTTP/2.0. While the intention is to roll good ideas from SPDY into HTTP/2.0, there still remains the risk that the projects will fork. Whether the possibility of forking is itself positive or negative is itself controversial, but I think the bar should be higher.

Encryption everywhere 

There is a long-running and still unresolved debate around the guidelines for using, mandating, requiring use of, or implementation of encryption, in both HTTP/1.1 and HTTP/2.0. It's clear that HTTP/2.0 changes the cost of multiple encrypted connections to the same host significantly, thus reducing the overhead of using encryption everywhere: Normally, setting up an encrypted channel is relatively slow, requiring a lot more network round trips to establish. With multiplexing, the setup cost only happens once, so encrypting everything is less of a problem.

But there are a few reasons why that might not actually be ideal. For example, there is also a large market for devices which monitor, adjust, redirect or otherwise interact with unencrypted HTTP traffic; a company might scan and block some kinds of information on its corporate net. Encryption everywhere will have a serious impact for sites that have these interception devices, for better or worse. And adding encryption in a situation where the traffic is already protected is less than ideal, adding unnecessary overhead.

In any case, encryption everywhere might be more feasible with HTTP/2.0 than HTTP/1.1 because of the lower overhead, but it doesn't promise any significant advantage for privacy per se.

Need realistic measurement data

To insure that HTTP/2.0 is good enough to completely replace HTTP 1.1, it's necessary to insure that HTTP/2.0 is better in all cases. We do not have agreement or reproducable ways of measuring performance and impact in a wide variety of realistic configurations of bandwidth and latency. Measurement is crucial, lest we introduce changes which make things worse in unanticipated situations, or wind up with protocol changes that only help the use cases important to those who attend the meetings regularly and not the unrepresented.

Why HTTP/2.0? A Perspective

When setting up for the HTTP meeting in Hamburg, I was asked, reasonably enough, what the group is doing, why it was important, and my prognosis for its success.  It was hard to explain, so I thought I'd try to write up my take "why HTTP/2.0?"  Corrections, additions welcome.

HTTP Started Simple

The HyperText Transfer Protocol when first proposed was a very simple network protocol, much simpler than FTP (File Transfer Protocol), and quite similar to Gopher. Basically, the protocol is layered on the Transport Control Protocol (TCP)  which sets up bi-directional reliable streams of data. HTTP/0.9 expected one TCP connection per user click to get a new document. When the user clicks a link, it takes the URL of the link (which contains the host, port, and path of the link) and
  1. Using DNS, client get the IP address of the server in the URL
  2. opens a TCP connection to that server's address on the port named in the URL
  3. client writes "GET" and the path of the URL onto the connection
  4. the server responds with HTML for the page
  5. the client reads the HTML and displays it
  6. the connection is closed
Simple HTTP was adequate, judging by latency and bandwidth, as the overhead of HTTP/0.9 was minimal; the only overhead is the time to look up the DNS name and set up the TCP connection. 

Growing Complexity

HTTP got lots more complicated; changes were reflected in a series of specifications, initially with HTTP/1.0, and subsequently HTTP/1.1. Evolution has been lengthy, painstaking work; a second edition of the HTTP/1.1 specification (in six parts, only now nearing completion) has been under development for 8 years. 

Adding Headers

HTTP/1.0 request and response (steps 3 and 4 above) added headers: fields and values that modified the meaning of requests and responses. Headers were added to support a wide variety of additional use cases, e.g., adding a "Content-Type" header to allow images and  other kinds of content, a "Content-Transfer-Encoding" header and others to allow optional compression, quite a number of headers for support of caching and cache maintenance, a "DNT" header to express user privacy preferences.

While each header has its uses and justification, and many are optional, headers add both size and complexity to every HTTP request. When HTTP headers get big, there is more chance of delay (e.g., the request no longer fits in a single packet), and the same header information gets repeated.

Many More Requests per Web Page

The use of HTTP changed, as web expressiveness increased. Initially NCSA Mosaic led by supporting embedded  images in web pages, doing this by using a separate URL and HTTP request for each image.  Over time, more elements also have been set up as separate cachable resources, such as style sheets, JavaScript and fonts. Presently, the average popular web home page makes over 40 HTTP requests 

HTTP is stateless

Neither client nor server need to allocate memory or remember anything from one request/response to the next. This is an important characteristic of the web that allows highly popular web sites to serve many independent clients simultaneously, because the server need not allocate and manage memory for each client.  Headers must be repeatedly sent, to maintain the stateless nature of the protocol.

Congestion and Flow Control

 Flow control in TCP, like traffic metering lights, throttles a sender's output to match the receivers capability to read. Using many simultaneous connections does not work well, because the streams use the same routers and bridges which must manage the streams independently, but the TCP flow control algorithms do not, cannot, take into account the other traffic on the other connections. Also, setting up a new connection potentially involves additional latency, and opening encrypted connections is even slower since it requires more round-trips of communication of information.

Starting HTTP/2.0

While these problems were well-recognized quite a while ago, work on optimizing HTTP labeled "HTTP-NG" (next generation) foundered. But more recent work (and deployment) by Google on a protocol called SPDY shows that, at least in some circumstances, HTTP can be replaced with something which can improve page load time. SPDY is already widely deployed, but there is an advantage in making it a standard, at least to get review by those using HTTP for other applications. The IETF working group finishing the HTTP/1.1 second edition ("HTTPbis") has been rechartered to develop HTTP/2.0 which addresses performance problems. The group decided to start with (a subset of) SPDY and make changes from there.

HTTP/2.0 builds on HTTP/1.1; for the most part, it is not a reduction of the complexity of HTTP, but rather adds new features primarily for performance.

Header Compression

The obvious thing to do to reduce the size of something is to try to compress it, and HTTP headers compress well. But the goal is not just to speed transmission, it's also to reduce parse time of the headers. The header compression method is undergoing significant changes.

Connection multiplexing

One way to insure coordinated flow control and avoid causing network congestion is to "multiplex" a single connection. That is, rather than open 40 connections, only open one per destination. A site that serves all of its images and style sheets and JavaScript libraries on the same host could send the data for the page over the same connection. The only issue is how to coordinate independent requests and responses which can either be produced or consumed in chunks.

Push vs. Pull

A "push" is when the server sends a response that hadn't been asked for. HTTP semantics are strictly request followed by response, and one of the reasons why HTTP was considered OK to let out through a firewall that filtered out incoming requests.  When the server can "push" some content to clients even when the client didn't explicitly request it, it is "server push".  Push in HTTP/2.0 uses a promise "A is what you would get if you asked for B", that is, a promise of the result of a potential pull. The HTTP/2.0 semantics are developed in such a way that these "push" requests look like they are responses to requests not made yet, so it is called a "push promise".  Making use of this capability requires redesigning the web site and server to make proper use of this capability.

With this background, I can now talk about some of the ways HTTP/2.0 can go wrong. Coming up!

September 6, 2013

HTTP meeting in Hamburg

I was going to do a trip report about the HTTPbis meeting August 5-7 at the Adobe Hamburg office, but wound up writing up a longer essay about HTTP/2.0 (which I will post soon, promise.) So, to post the photo:

It was great to have so many knowledgeable implementors working on live interoperability: 30 people from around the industry and around the world came, including participants from Adobe, Akamai, Canon, Google, Microsoft, Mozilla, Twitter, and many others representing browsers, servers, proxies and other intermediaries.
It's good the standard development is being driven by implementation and testing. While testing across the Internet is feasible, meeting face-to-face helped with establishing coordination on the standard.
I do have some concerns about things that might go wrong, which I'll also post soon.

July 21, 2013

Linking and the Law

Ashok Malhotra and I (with help from a few friends) wrote a short blog post  "Linking and the Law" as a follow-on of the W3C TAG note Publishing and Linking on the Web (which Ashok and I helped with after its original work by Jeni Tennison and Dan Appelquist.)

Now, we wanted to make this a joint publication, but ... where to host it? Here, Ashok's personal blog, Adobe's, the W3C?

Well, rather than including the post here (copying the material) and in lieu of real transclusion, I'm linking to Ashok's blog: see "Linking and the Law".

Following this: the problems identified in Governance and Web Architecture are visible here:
  1. Regulation doesn't match technology
  2. Regulations conflict because of technology mis-match
  3. Jurisdiction is local, the Internet is global
These principles reflect the difficulties for Internet governance ahead. The debates on managing and regulating the Internet are getting more heated. The most serious difficulty for Internet regulation is the risk that the regulation won't actually make sense with the technology (as we're seeing with Do Not Track).
The second most serious problem is that standards for what is or isn't OK to do will vary widely across communities to the extent that user created content cannot be reasonably vetted for general distribution.

April 2, 2013

Safe and Secure Internet

The Orlando IETF meeting was sponsored by Comcast/NBC Universal. IETF sponsors get to give a talk on Thursday afternoon of IETF week, and the talk was a panel, "A Safe, Secure, Scalable Internet".

What I thought was interesting was the scope of what the speaker's definition of "Safe" and "Secure", and the mismatch to the technologies and methods being considered. "Safety" included "letting my kids surf the web without coming across pornography or being subject to bullying", while the methods they were talking about were things like site blocking by IP address or routing.

This seems like a oomplete mismatch. If bullying happens because harassers facebook post nasty pictures which they label with the victim's name, this problem cannot be addressed by IP-address blocking. "Looking in the wrong end of the telescope."

I'm not sure there's a single right answer, but we have to define the question correctly.

March 25, 2013

Standardizing JSON

Update 4/2/2013: in an email to the IETF JSON mailing list, Barry Leiba (Applications Area director in IETF) noted that discussions had started with ECMA and ECMA TC 39 to reach agreement on where JSON will be standardized, before continuing with the chartering of an IETF working group.

JSON (JavaScript Object Notation) is a text representation for data interchange. It is derived from the JavaScript scripting language for representing data structures and arrays. Although derived from JavaScript, it is language-independent, with parsers available for many programming languages.

JSON is often used for serializing and transmitting structured data over a network connection. It is commonly used to transmit data between a server and web application, serving as an alternative to XML.

JSON was originally specified by Doug Crockford in RFC 4627, an "Informational" RFC.  IETF specifications known as RFCs come in lots of flavors: an "Informational" RFC isn't a standard that has gone through careful review, while a "standards track" RFC is.

An increasing number of other IETF documents want to specify a reference to JSON, and the IETF rules generally require references to other documents that are the same or higher levels of stability. For this reason and a few others, the IETF is starting a JSON working group (mailing list) to update RFC 4627.

The JavaScript language itself is standardized by a different committee (TC-39) in a different standards organization (ECMA).  For various reasons, the standard is called "ECMAScript" rather than JavaScript.  TC 39 published ECMAScript 5.1, and are working on ECMAScript 6, with a plan to be done in the same time frame as the IETF work.

The W3C  also is developing standards that use JSON and need a stable specification.

Risk of divergence

Unfortunately, there is a possibility of (minor) divergence between the two specifications without coordination, either formally (organizational liaison) or informally, e.g., by making sure there are participants who work in both committees.

There is a formal liaison between IETF and W3C. There is currently no also a formal liaison between W3C and ECMA (and a mailing list, public-script-coord@w3.org ). There is no formal liaison between TC39/ECMA and IETF.

Having multiple conflicting specifications for JSON would be bad. While some want to avoid the overhead of a formal liaison, there needs to be explicit assignment of responsibility. I'm in favor of a formal liaison as well as informal coordination. I think it makes sense for IETF to specify the "normative" definition of JSON, while ECMA TC-39's ECMAScript 6.0 and W3C specs should all point to the new IETF spec.

JSON vs. XML

JSON is often considered as an alternative to XML as a way of passing language-independent data structures as part of network protocols.

In the IETF, BCP 70 (also known as RFC 3470"Guidelines for the Use of Extensible Markup Language (XML) within IETF Protocols" gives guidelines for use of XML in network protocols. However, this published in 2003. (I was a co-author with Marshall Rose and Scott Hollenbeck.)

But of course these guidelines don't answer the question many have: When people want to pass data structures between applications in network protocols, do they use XML or JSON and when? What is the rough consensus of the community? Is it a choice? What are the alternatives and considerations? (Fashion? deployment? expressiveness? extensibility?) 

This is a critical bit of web architecture that needs attention. The community needs guidelines for understanding the competing benefits and costs of XML vs. JSON.  If there's interest, I'd like to see an update to BCP 70 which covers JSON as well as XML.