Introduction

Note:
Some history and philosophy follow. If you are looking for specific technical details, you may want to go directly to Architecture or API.

What it is all about

The key SiteXML thesis is the following:

When HTML was developed, the web was thought of as a network of static pages linked together with hyperlinks, and so HTML became a standard for a web page. Today the Internet is thought of as a network of sites, but we still do not have a standard for a website.

Internet publishing is based on the HTML standard, developed around 1990 by Tim Berners-Lee and colleagues. Today, HTML is only an illusion on the server side. Yes, there is HTML in your browser when you request a page from a server, but there are no HTML pages on the server. This came about for historical reasons, mostly because of the browser wars, which took place on Earth some years after the first site was published. While browser vendors struggled to make the Internet look better in their browsers, enforcing and polishing HTML and other client-side standards, server-side developers distorted the initial simple and clear idea. Where is HTML's simplicity now? If you look at a website's source code on a server, you will be terrified by how complex and difficult it is: a weird collection of files, snippets, and source code; various programming languages, databases, and exotic homebrew frameworks, all difficult to combine.

What have we lost by ignoring HTML principles on the server side? First of all, simplicity, of course. Let's dig further. The fundamental principles of HTML, which is still the basis of the modern Internet, by the way, were:

  • focus on valuable content, rather than styling and interactivity;
  • free, accessible, and simple enough to be edited by anyone who has something to say;
  • platform-independent: if you simply copy a site from Unix to Mac or PC or whatever, it will work! This is what is called a standard.

Sites made with pure HTML had both advantages and disadvantages. Advantages:

  • fast loading;
  • simple and readable code;
  • easy to create and edit with plain text editors (mostly free or preinstalled);
  • no programming knowledge required;
  • content-driven file structure;
  • direct relationship between file structure and URL: web pages are easy to locate, create, edit, and publish.
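
The last point can be illustrated with a minimal sketch. It assumes a conventional static-site layout in which a URL path is simply a path below the site root and a directory URL resolves to index.html. The function name url_to_file and the index_name default are illustrative assumptions, not part of any standard.

```python
from pathlib import PurePosixPath


def url_to_file(url_path: str, index_name: str = "index.html") -> str:
    """Map a URL path to a file path relative to the site root (hypothetical helper)."""
    # Drop empty, "." and ".." segments so a URL can never point outside the site root.
    parts = [p for p in url_path.split("/") if p not in ("", ".", "..")]
    path = PurePosixPath(*parts) if parts else PurePosixPath()
    # A URL ending in "/" (or the bare root) refers to a directory's index page.
    if url_path.endswith("/") or not parts:
        path = path / index_name
    return str(path)


print(url_to_file("/products/widget.html"))  # products/widget.html
print(url_to_file("/products/"))             # products/index.html
print(url_to_file("/"))                      # index.html
```

With this kind of mapping, publishing a page amounts to saving a file in the right folder; no routing table or database lookup is involved.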

The pure HTML approach also had some disadvantages:

  • poor styling;
  • poor interactivity.

The modern 'heavy' backend approach has largely overcome these disadvantages, but it has also lost most of the old-school advantages.

Twenty years ago we could not know what the web would be today. Today is the time of highly interactive pages with rich authoring functionality, which is very different from the Internet of the early 1990s. The difference is that now the Internet is thought of as a network of sites, but we still do not have a standard for a website.

The Internet today is based on HTML as if the web were still a network of pages, but in practice all the HTML principles have been dropped as if they were useless:

  • Webservers only pretend to have HTML pages. Even simple static sites do not in fact have anything even similar to HTML pages! What is more, it has become 'mauvais ton' to build sites with plain HTML. How could the beautiful HTML, which is the foundation of the Internet, become a sign of bad style?!
  • Simple static pages are served by overly powerful site engines (mostly scripted, and therefore slow) and CMSs that generate too much server, client, and bandwidth load per page.
  • The backend is completely platform-dependent: you cannot just copy your site from Unix to Windows, and you can hardly even copy it from one Unix to another. In most cases that will not work!
  • Sites are nearly impossible for non-programmers to maintain. Publishers depend on developers who
    • choose the server configuration;
    • code their own functions;
    • are the only people (or among a limited few) who can control and update the site they made.
  • It is quite difficult to create and keep track of a site's components.
  • We depend on intermediate tools like scripting engines, databases, modules, and templates.

All this makes setting up and running a website quite difficult for someone who has something valuable to say but is not a programmer.

Summing up, we have come to a point where we can create very interactive sites on the Internet, but at the cost of the fundamental web principles. SiteXML tries to solve this problem; read on.

Problem Statement

The goal of this work is to revisit HTML principles from the perspective of the modern Internet. We want to return to pure HTML and keep it simple, yet highly interactive. Let us try to break this general task down into more discrete goals. They are:

  1. Suggest a standard for the file structure of websites to ensure that:
    • sites are platform- and developer-independent, at least at the content level;
    • different kinds of developers depend on each other as little as possible, and their work (design, modules, rich editors) is better reused throughout the Internet. (As a consequence, web usability improves: similar functions that look similar on different sites lead to predictable UI behavior, and that is good usability.)
  2. Introduce a client-server interaction protocol, STP (Site Transfer Protocol),
    • to help frontend and backend developers work more independently, making Internet development more effective through better reuse of components.
  3. Make the file structure of sites more readable, accessible, and editable:
    • separate content into PURE HTML, as if a site really were a collection of HTML pages;
    • make sites, especially content, editable with both rich editors and text editors, thus giving full control over content and easy access for updating it.
  4. And, last but not least, keep it all as simple as pure HTML, yet as powerful as modern websites. This is the essence of our idea.
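
To picture goal 3, here is a minimal sketch, assuming content is stored as plain HTML files in nested folders. The function outline and the sample file names are hypothetical and are not taken from the SiteXML specification. The point is only that a site outline can be read straight off the file structure.

```python
from typing import Dict, Iterable


def outline(html_files: Iterable[str]) -> Dict:
    """Build a nested dict that mirrors the folder layout of the given files."""
    tree: Dict = {}
    for rel_path in sorted(html_files):
        node = tree
        # Everything before the last "/" is a chain of folders; the rest is the page.
        *folders, filename = rel_path.split("/")
        for folder in folders:
            node = node.setdefault(folder, {})
        node[filename] = rel_path
    return tree


# Hypothetical site content: three plain HTML files.
site = outline(["index.html", "products/index.html", "products/widget.html"])
print(site["products"]["widget.html"])  # products/widget.html
```

An editor, a navigation menu, or a sitemap could all be driven by such a structure without any intermediate storage.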

The Principles

We have spoken about the what; now let us speak about the how.

The solution to the problem must meet the following criteria, which are very similar to the HTML principles:

  1. Clear file structure:
    • Focus on content;
    • Content is pure HTML files;
    • File structure reflects content structure.

  2. Cover 90% of site needs:
    • Focus on content-oriented sites;
    • Reuse interactivity that repeats from site to site: forums, feedback, comments, etc.

  3. Platform-independent:
    • Webserver-integrated: you can copy a SiteXML site to any platform and it should start working automatically (at least at the content level);
    • Any modern and popular CMS should be able to run SiteXML sites;
    • Any back-end scripting language must be able to support SiteXML, for obsolete servers or where specific customization is needed;
    • Any browser should be able to run a SiteXML site on its own (without back-end-driven functions, of course).

  4. Avoid intermediate software that makes our sites more platform-dependent:
    • no databases;
    • no frameworks;
    • and no scripting engines when possible.

  5. Component-independent, to keep it as light and swift as possible, yet infinitely flexible and extensible:
    • you should be free to choose your own components to run the site:
      • server-side engine: from server-integrated to your favorite CMS, a stand-alone scripting engine, or your own custom engine;
      • client editing tool: from none at all to rich editors or CMSs;
      • reusable themes and modules.

  6. Follow good usability principles:
    • Ajax browsing: why load the whole page every time? Each time we request a page, we should load only the changeable parts, that is, the content portions defined in the layout, e.g. a banner on top, an ad column on the right, main content in the middle;
    • Unchangeable portions should stay untouched while browsing; there is absolutely no need to load them with every page: styles and layout, navigation, JavaScript, images, etc.;
    • In-place content authoring should become a standard: it should be as easy as working with modern text processors.
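
The Ajax-browsing principle can be sketched as follows. The sketch assumes each page is described as a mapping from a layout slot (banner, sidebar, main) to the content file that fills it. This representation, the slot names, and the function changed_portions are illustrative assumptions, not the SiteXML format.

```python
from typing import Dict


def changed_portions(current: Dict[str, str], target: Dict[str, str]) -> Dict[str, str]:
    """Return only the layout slots whose content differs on the target page."""
    return {slot: src for slot, src in target.items() if current.get(slot) != src}


# Two hypothetical pages that share a banner and an ad column.
home = {"banner": "banners/main.html", "sidebar": "ads/default.html", "main": "home.html"}
about = {"banner": "banners/main.html", "sidebar": "ads/default.html", "main": "about.html"}

# Navigating from home to about only needs the main content to be fetched.
print(changed_portions(home, about))  # {'main': 'about.html'}
```

Slots that exist on the current page but not on the target page are not reported here; a fuller version would also have to tell the client to clear them.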

Server integration. Webservers should have native support for SiteXML.

  • One SiteXML engine per server, integrated into the webserver; no CMS, no site engine, and no other executable script or database in the SiteXML site directory. The site directory holds only site-specific content.
  • Better maintenance:
    • the server admin maintains only one instance of the site engine, which serves all sites on the server.
  • More reliability:
    • sites do not depend on custom site engines, CMS developers, programmers, or different hosting environments.
  • Fewer files in the site directory:
    • easier file manipulation, backup, and recovery.
  • Faster:
    • no need to load scripting engines;
    • the site engine is compiled into the webserver;
    • better site caching.
  • Easy support for hosting providers:
    • all they do is add a webserver module that supports SiteXML, and voilà! Your favorite hosting company supports SiteXML!
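
The idea of one engine serving every site on a server can be sketched as below. This is only an illustration under stated assumptions: the host-to-directory table, the directory paths, and the function resolve are hypothetical, and a real webserver module would be written against that server's own module API.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def resolve(host: str, url_path: str, sites: Dict[str, str]) -> Optional[str]:
    """Pick the content directory for a host and return the file to serve."""
    # Strip an optional port and normalise case: "Example.org:8080" -> "example.org".
    site_dir = sites.get(host.split(":")[0].lower())
    if site_dir is None:
        return None  # unknown site; a real server would answer with an error page
    # Drop empty, "." and ".." segments so requests stay inside the site directory.
    parts = [p for p in url_path.split("/") if p not in ("", ".", "..")]
    if url_path.endswith("/") or not parts:
        parts.append("index.html")
    return str(PurePosixPath(site_dir, *parts))


# Hypothetical table: each site directory contains content only, no engine code.
SITES = {"example.org": "/srv/sites/example", "blog.example.org": "/srv/sites/blog"}

print(resolve("Example.org:8080", "/docs/", SITES))  # /srv/sites/example/docs/index.html
print(resolve("unknown.test", "/", SITES))           # None
```

Because the engine lives in one place, adding a site means adding a content directory and one table entry, which is the maintenance benefit the list above describes.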