Did you ever wonder why URLs look like they do?
With a bit of common sense, I found it not too hard to come up with reasonable explanations. Please note that all of this is speculation, but I hope it’s right because it’s just the way to go (please write me if you can contribute).
Remember, back before the invention of the web, there were no URLs at all. FTP sites were shared as free-text and transferred to the command-line client using copy and paste (at most). Then, you had to go to the directory you wanted and issue a fetch.
TimBL’s genius idea was to find a way to address everything on the Internet. He didn’t want to limit it to the web, but also wanted to be able to address FTP, GopherSpace and some even more obscure systems that were in use back then.
Consider this URL:
http://foo.org/bar/baz
Since URLs were meant to be uniform, there had to be a way to determine the protocol to use, and the protocol had to be separated from the host name in some way: a colon is the common way to say “this is that”, and that’s why it is http:.
Host names like foo.org, of course, existed long before. DNS was invented in 1983, and was a reasonable thing to build on. However, it was often criticized for being “the wrong way ’round”, with the most significant part of the hierarchy last (the top-level domain org). Having a Unix background, Tim decided to keep the path hierarchy like in Unix, and that’s why it is /bar/baz. It works well for FTP, too.
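To make the pieces concrete, here is a minimal sketch that takes the example URL apart with Python’s standard urllib.parse module. The helper name dissect and the keys of the dictionary it returns are made up for this illustration; only urlsplit comes from the library. It also reverses the host name to show the “wrong way ’round” ordering next to the Unix-style path.

```python
from urllib.parse import urlsplit


def dissect(url):
    """Split a URL into scheme, host and path (illustrative helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        # Everything before the first colon: the protocol to use.
        "scheme": parts.scheme,
        # The DNS name, written least significant label first.
        "host": host,
        # The same labels, most significant first (org, then foo).
        "host_labels_top_down": host.split(".")[::-1] if host else [],
        # Unix-style hierarchy, most significant segment first.
        "path_segments": [s for s in parts.path.split("/") if s],
    }


print(dissect("http://foo.org/bar/baz"))
```

Reading the host labels top-down and then the path segments gives one hierarchy from most to least significant: org, foo, bar, baz.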
Remember, when HTTP/0.9 was the state of the art, there was only GET. (Actually, the initial design shows some other methods.)
However, Tim quickly discovered serving static data all day was boring: there had to be a way to, for example, make a search page.
And there was. Hands up, who remembers <ISINDEX>? Add this older-than-the-stones tag to your HTML and the browser will automatically place a text box and a button to let you run a search.
How is the data transferred? Having only GET, the data has to go into the URL. Which character would you use to delimit the path and the query? Indeed, a question mark.
<ISINDEX> directly specifies the query string, so in the beginning, it was more common to see URLs like http://goo.org/search?meaning+of+life. Later, when forms were added, there had to be a way to specify field names (what could be more obvious than = for this purpose? Maybe :…), and a way to separate multiple fields (& sounded like a good idea, but tells enough about the state of SGML parsing back in the old days).
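The two query styles can be sketched with Python’s urllib.parse. This is only an illustration: the function names isindex_keywords and form_fields, and the field names q and lang in the second URL, are invented for the example and do not come from any real site.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def isindex_keywords(url):
    """ISINDEX style: the query is a bare list of keywords joined by '+'."""
    query = urlsplit(url).query
    # Split on the literal '+' first, then undo percent-escapes per keyword.
    return [unquote(word) for word in query.split("+") if word]


def form_fields(url):
    """Form style: name=value pairs separated by '&'."""
    # parse_qs maps each field name to the list of values it was given.
    return parse_qs(urlsplit(url).query)


print(isindex_keywords("http://goo.org/search?meaning+of+life"))
print(form_fields("http://goo.org/search?q=meaning+of+life&lang=en"))
```

In both styles the question mark does the same job: everything to its left is the path, everything to its right is the query.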
Whitespace in URLs was a problem! How can you know where the URL is over? CGI used + as a surrogate for space, but then, how do you transmit a +? Obviously, URL escaping had to be %69%6e%76%65%6e%74%65%64. This is where things got ugly… for example, you really can’t tell if a URL-escaped string has already been escaped. Big fun. Why percent encoding? Probably because the percent sign was not used yet?
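Here is a small sketch of the escaping rules using Python’s urllib.parse, including the “already escaped?” trap. The sample strings are arbitrary and chosen only for the demonstration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# The escaped word from the paragraph above decodes to plain ASCII.
decoded = unquote("%69%6e%76%65%6e%74%65%64")
print(decoded)  # invented

# '+' stands in for a space, so a literal '+' has to be escaped as %2B.
form_value = quote_plus("1 + 1")
print(form_value)  # 1+%2B+1
round_trip = unquote_plus(form_value)

# The ugly part: escaping is not idempotent. A '%' becomes %25, and
# escaping the result again turns that into %2525.
once = quote("100%")
twice = quote(once)
print(once, twice)  # 100%25 100%2525
```

Given only the string 100%25, there is no way to tell whether it is the escaped form of 100% or a raw string that still needs escaping. The code has to know which one it is holding.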
Finally, I admit, I have a problem. I can’t figure out at all why there is a // between the scheme and the hostname. I suppose it was meant for something special (e.g. some URIs/URNs don’t have it), but what is it good for in HTTP URLs?
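This doesn’t answer the question, but it does show what the // means to a parser today. A minimal sketch with Python’s urllib.parse follows; the helper name has_host_part and the mailto and urn examples are my own picks for illustration.

```python
from urllib.parse import urlsplit


def has_host_part(url):
    """True if the URL carries a '//host' section after the scheme."""
    # urlsplit only fills in netloc when the part after the scheme
    # starts with '//'.
    return bool(urlsplit(url).netloc)


for url in (
    "http://foo.org/bar/baz",
    "mailto:someone@example.org",
    "urn:isbn:0451450523",
):
    print(url, has_host_part(url))
```

At least for this parser, the // is what says “a host name follows”. Schemes without a host simply leave it out.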
But apart from that, I think the design of URLs was perfectly reasonable, wasn’t it?