AXDC – Kurt Cagle, XSLT2 and Saxon.Net

Kurt will describe a .Net implementation of XSLT2. He runs Metaphorical Web Publishing, has authored 16 books (including one on SVG), and has a blog.

Despite the fact that they have 56,000 employees, Microsoft has yet to implement this new standard.

Saxon.Net is the first implementation of XSLT2 for .Net.

XSLT is crucial in a number of roles: document conversion, business logic processing, macro code generation, message routing, data mapping and transfer, and so forth.

As of 2004, XSLT is now supported on most platforms, and is on its way to becoming pervasive. Every procedural language has some XSLT support. There are also firmware implementations. A very common way to do processing, on any system, anywhere.

How many of you have used XSLT? 95% of those in the room raised their hands.

Problems with XSLT: Verbose, recursion needed far too much, tree fragments, lack of extensibility, string handling is poor, grouping (breaking a document into sections) is awkward or impossible, and there are sorting limitations (strings, numbers, and ISO dates only).

Processors include MSXML, System.Xml.Xsl, Xalan (Java), Saxon (Java), LibXSLT/XSLTProc (C++, Linux), and Sablotron (PHP/Windows & Linux).

Saxon 6.5 is the earliest and most compliant XSLT1 processor, written by Michael Kay. It includes some namespace extensions. Saxon 8.* is the XSLT2 implementation.

EXSLT is an extension to XSLT; open source extension movement, extended math support, invocable XSLT functions, regular expressions, randomization, multiple output documents. It is (informally) XSLT 1.1.

XSLT 1.1 + EXSLT => XSLT 2.0. XSLT 1.1 problems too great to fix and EXSLT is a stopgap.

So, XSLT2 is where it’s at. Under development for the last 2 years, at least 1 more year to go due to some changes to XPath and XQuery. Definitely a moving target.

Trees vs. forests. Forest is a living organism very unlike its trees.

XPath 1.0 data model has lots of flaws — node list confusion, something about document(), no generalized enumerators, no set operations. Therefore XSLT2 is dependent on XPath2, which implements sequences — linear collection of objects (not nodes, and not necessarily homogeneous).

XPath2 and XSLT2 both support regular expressions. Token() and join() functions are new. Regex test(), matches(), and replace() functions.
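
For readers who have not seen the XPath2 string functions, here is a rough sketch of what they do, written with Python's re module as a stand-in. The function names in the comments (fn:tokenize, fn:matches, fn:replace, fn:string-join) are the ones I know from the XPath 2.0 function library, not necessarily the names on Kurt's slides, and the sample string is invented.

```python
import re

# Made-up sample data: three ISO dates with mixed separators.
text = "2005-10-20,2005-10-21;2005-10-22"

# Rough analogue of fn:tokenize($text, '[,;]'): split on a regex.
tokens = re.split(r"[,;]", text)

# Rough analogue of fn:matches($token, '^\d{4}-\d{2}-\d{2}$').
all_dates = all(re.search(r"^\d{4}-\d{2}-\d{2}$", t) for t in tokens)

# Rough analogue of fn:replace($text, '(\d{4})-(\d{2})-(\d{2})', '$3/$2/$1').
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

# Rough analogue of fn:string-join($tokens, ' | ').
joined = " | ".join(tokens)
```

In XSLT 1.0 each of these would need a recursive template or an extension function, which is the verbosity complaint from earlier in the talk.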

Input serialization — the new doc() function can pull text in from the outside world and create a tree from it.

Conditional expressions.

Saxon.Net is a .Net port of the Java code, translated using the IKVM tool.

Early 2006 XPath2 and XSLT2 complete; Saxon.Net ready to run.

AXDC – Daniel Cazzulino, Schematron

Schematron is a language for expression of XML validation rules and constraints. Rule-based, tree patterns, using XPath as language.

Reference implementation in XSLT. A .Net version is at schematron.net. Versions for Perl and PHP also available.

Language is simple, consisting of asserts (absence of pattern), reports (presence of patterns), rules, patterns, schemas, and phase (for workflow).
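
The assert/report distinction is easy to mix up, so here is a minimal sketch of the semantics in Python. This is not Schematron itself and not the schematron.net API: the run_rule helper, the sample order document, and the messages are all hypothetical, and ElementTree only supports a small subset of XPath, whereas real Schematron tests are full XPath expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document to validate.
doc = ET.fromstring(
    "<order>"
    "<item><price>5</price></item>"
    "<item><name>widget</name></item>"
    "<item><price>7</price><backordered/></item>"
    "</order>"
)

def run_rule(root, context, asserts=(), reports=()):
    """Apply one rule: every (test, message) pair is checked per context node."""
    messages = []
    for node in root.findall(context):
        for test, message in asserts:
            if node.find(test) is None:       # assert fires when the pattern is absent
                messages.append("assert: " + message)
        for test, message in reports:
            if node.find(test) is not None:   # report fires when the pattern is present
                messages.append("report: " + message)
    return messages

messages = run_rule(
    doc,
    context=".//item",
    asserts=[("price", "item must have a price")],
    reports=[("backordered", "item is backordered")],
)
```

The second item trips the assert (no price) and the third trips the report (backordered is present); the first item produces nothing.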

AXDC – Sam Ruby, IBM

Sam is reprising a talk he did several years ago.

Most people learned HTML via “View Source”. What’s the downside to this?

Must be willing to seek and see the truth. Look at the messages and actually understand them (The Matrix).

Focus on identity, is x equal to y?

Unicode, letter “a” in two different fonts. Yes, they are both U+0061.

“a” vs. “A”, not the same.

Look at encodings, 0x41 all the way down.

Attractive nuisance — hazardous object which can be expected to attract children to investigate or play.

As applied to Unicode and RSS. People don’t worry about encodings, they do the simplest thing, take the default, strcat strings from various encodings, leave it up to the browser to handle the resulting mess (unfortunately, it often did).

Two more A’s, U+0041, U+0391 (Greek Alpha).

Four things look like a lowercase i (U+0069, U+2139, U+2148, U+2170).

Imagine the confusion that would result if these were allowed in domain names and URLs!

Two glyphs could look the same yet have different encodings.
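
A quick way to see what is really behind a glyph is to ask for its Unicode name. A small sketch using Python's unicodedata module with the code points Sam listed; the describe helper is my own.

```python
import unicodedata

def describe(chars):
    """Return (code point, Unicode name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in chars]

# The two capital A lookalikes, then the four "lowercase i" lookalikes.
lookalikes = describe("\u0041\u0391\u0069\u2139\u2148\u2170")
for code_point, name in lookalikes:
    print(code_point, name)

# They may render alike, but they never compare equal.
latin_a_equals_greek_alpha = "\u0041" == "\u0391"
```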

Now get into accented characters. Multiple ways to do “e with accent.” Comparing them will produce different results.
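
A minimal illustration of the accented-character problem in Python: the same visible "é" written two ways compares unequal until both sides are normalized. NFC is used here as one reasonable choice; the talk did not name a particular normalization form.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# A naive comparison says they differ.
naive_equal = precomposed == decomposed

# After normalizing both sides to the same form, they match.
normalized_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
```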

Default encodings. HTML is iso-8859-1. XML is UTF-8. Would be nice if XML required the encoding declaration. Userland RSS feeds still don’t have them. Microsoft has Windows 1252.

Planet RDF will take HTML, run it through iso-8859-1 to utf-8 conversion, producing good utf-8 triples, even if the input isn’t iso-8859-1. This chokes on things like the Euro character.

Footer of msdn2.microsoft.com has this issue, very hard to solve.

Win-1252 differs from 8859-1 in 27 places, such as the Euro symbol and smart quotes. Kills RSS feeds. Most web clients are on Windows, most web servers don’t indicate the encoding. Most browsers have given up and consider 8859-1 and 1252 to be the same.

Google for links to feedvalidator.org, but have bad 1252 characters. Code points are used to generate bad XML entities. Don’t confuse encoding with code point. This is a pervasive problem.
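
The 27 differences are easy to find with Python's own codecs. The sketch below assumes Python's cp1252 codec, which leaves five bytes in the 0x80 to 0x9F range unassigned; other Windows-1252 tables (browsers, for instance) fill those in differently. It also shows the encoding versus code point confusion: byte 0x93 is a left smart quote in 1252, but the character's code point is 8220, so the numeric reference has to be built from the code point, not from the byte.

```python
def cp1252_only_bytes():
    """Map each byte in 0x80-0x9F to the character cp1252 assigns to it."""
    differing = {}
    for value in range(0x80, 0xA0):
        try:
            differing[value] = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # unassigned in Python's cp1252 codec
    return differing

differing = cp1252_only_bytes()
count = len(differing)

# Same byte, two encodings, two different characters.
as_1252 = b"\x93".decode("cp1252")     # left double quotation mark
as_latin1 = b"\x93".decode("latin-1")  # an invisible C1 control character

# The numeric character reference uses the code point, not the byte value.
good_reference = f"&#{ord(as_1252)};"
bad_reference = f"&#{0x93};"
```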

UTF-8 has a nice property, strings can be recognized as such with high probability.
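
The check is about as simple as it sounds: try a strict decode. A sketch in Python, where looks_like_utf8 is my own helper name. Note that pure ASCII also passes, since ASCII is a subset of UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict by default: any bad sequence raises
    except UnicodeDecodeError:
        return False
    return True

sample = "café €"
utf8_ok = looks_like_utf8(sample.encode("utf-8"))
cp1252_ok = looks_like_utf8(sample.encode("cp1252"))
```

Text in a legacy single-byte encoding almost never passes by accident, because a lone high byte is rarely followed by the right continuation bytes.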

Now consider URI’s. Lots and lots of cases. URI spec recently revised; normalization method not yet nailed down. Add Unicode to make it even more interesting. Don’t believe it when they say that URI’s give you queries for free. What encoding is used for the data in the query string?

CLR’s System.Uri.Equals says that all of his examples are the same; using these as XML namespaces require them each to be distinct!
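
I did not capture Sam's example URIs, so the pair below is my own, and the normalize function is only a partial sketch (lowercase the scheme and host, drop the default port, supply an empty path; it ignores userinfo and percent-encoding). It shows the two notions of "same": namespace names are compared character by character, while a URI library may normalize first.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(uri: str) -> str:
    """Very partial URI normalization, for illustration only."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))

a = "http://example.com/ns"
b = "HTTP://EXAMPLE.COM:80/ns"

same_as_namespace = a == b                   # literal string comparison
same_after_normalizing = normalize(a) == normalize(b)
```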

Presumptions about layering are very dangerous. Case in point, HTTP has tons of problems.

GET /index.html HTTP/1.0

This omits the supposedly optional Host: header, which is really required. Leaky DNS abstraction.

Content-type, XML header, meta tag in HEAD, and form post all have a charset or encoding attribute.

Ruby’s Postulate:

Accuracy of metadata inversely proportional to square of distance between data and the metadata.

HTTP Content-Type is a hint, XML declarations ignored by browsers, meta charset typically seen as authoritative, charset attributes on elements aren’t supported.

More layering issues…

Accept-encoding: gzip

Mozilla unzips, starts to parse, http-equiv causes it to restart, issues page reload with bad byte range. This is an example of bad layering, does the byte range apply before or after gzip?

Character references, lots of ways to say the same thing, all the same if you respect all of the rules for XML processing. Broken in lots of RSS feeds, obvious to him and a few others, not to everyone else.

Ampersands in URLs are supposed to be encoded within HTML.

Escaping in XML is broken for character references. There is no way to look at a string of XML and determine if it is escaped or not. Trips up the pros every day.

This is all stated explicitly in Atom; this will allow for better tool support and better diagnostics.

Double-escaped: to put “A&B” in an RSS feed, must say “A&amp;amp;B”.
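
Double escaping is easier to follow in code. A sketch with Python's xml.sax.saxutils: the first escape makes the text safe as HTML, the second makes that HTML safe to carry as text inside an XML feed element. Nothing in the final string says how many times it was escaped, which is exactly the problem.

```python
from xml.sax.saxutils import escape, unescape

text = "A&B"
once = escape(text)    # the HTML form of the text
twice = escape(once)   # that HTML carried as escaped content in the feed

# The consumer has to know to unescape exactly twice to get back.
round_trip = unescape(unescape(twice))
```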

RSS & Atom — can title contain markup, and many other issues. He’ll talk about all of this tomorrow.

WS-*, take the blue pill. Hide it all in tools.

Summary, comparing characters and URIs is surprisingly difficult; it can lead to security holes. Layering is the problem, not the solution.

You won’t find reality in any specification.

Spec authors are responsible for the confusion that they create.

AXDC – Whit Kemmey, XML for Naval Missile Systems

Whit works for the DoD.

Unique application, very long-lived. Data around since the 1950s. Fully redundant: take battle damage and keep going. Survives nuclear explosions, runs underwater.

Naval Surface Warfare Center, Dahlgren, Virginia. Started with Harvard Mark II.

The submarine platform — SSN (ship submersible, nuclear). SSBN (ballistic missile). Software never used for its intended purpose, hopefully never will be. 50% of US warheads are on subs. 24 missiles, each with multiple targetable warheads. Stay hidden and be ready. 560 feet long, 16K tons, crew of 156. SSGN is an SSBN converted to Guided conventional missile. 154 tactical cruise missiles.

Trident missile, 44 feet long, 130K lbs. 4600 mile range, moves at 20K feet per second, cost $30.9 million. Cost to reload one sub is $1B (one billion dollars).

Fire control problem: Fly missile to space, land in a small spot. Moving launcher, ballistic flight (warhead freefalls, must hit with accuracy, through multiple layers of air each with distinct characteristics), re-targetable (targets assigned as needed). Safety, maintainability paramount.

Originally one processor, 1 MB of memory, built specifically for the system. OS built in house so they know exactly what is going on.

Now VxWorks on PowerPC, source code scrubbed, VxWindows (X and Motif), own shell and file system. Two of everything, complete redundancy.

So what about XML?

19-year old running the software, written procedures for everything, rigid checklists. Goal is to leverage system information to better control information flow.

XML in Government, xml.house.gov — collection of DTDs.

Still checklist-based, but now electronic. BALPARS = Ballistic Parameters! Each action logged.

Extensive use of XML for procedure guide. Using XML as a scripting language! Instructions for what to do to carry out task. “Localization is not a big problem for us.” Uses the libxml parser, open source.

Each operational step described in a block of XML, with pre (check first) and post (do after) conditions.
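
Whit did not show the actual schema in a form I could copy down, so everything in the sketch below is hypothetical: the step, pre, action, and post element names, the condition strings, and the run_step helper. It only illustrates the general idea of a procedure step that checks its preconditions, performs an action, then carries out its post actions, logging each one.

```python
import xml.etree.ElementTree as ET

# Hypothetical step definition; not the real format.
STEP_XML = """
<step id="example-1">
  <pre>panel-unlocked</pre>
  <action>run-self-test</action>
  <post>record-result</post>
</step>
"""

def run_step(step_xml, state, log):
    """Run one step if every precondition is in `state`; log each action."""
    step = ET.fromstring(step_xml)
    step_id = step.get("id")
    for pre in step.findall("pre"):
        if pre.text not in state:
            log.append(f"{step_id}: blocked, precondition {pre.text} not met")
            return False
    log.append(f"{step_id}: performed {step.findtext('action')}")
    for post in step.findall("post"):
        log.append(f"{step_id}: post {post.text}")
    return True

log = []
ok = run_step(STEP_XML, {"panel-unlocked"}, log)
blocked = run_step(STEP_XML, set(), log)
```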

Audience member asks, why XML? Value will become apparent over time. Create docs and scripts in one file, manage using CM system, associate docs with step. Do transformations of the XML into hypertext for testing. Also political reasons within the Navy.

No XSL running on the boat, only in the testing environment.

XML written using XML Spy.

They must always have source code, and they must always have the right to recompile it. They have to scrutinize all external software and demonstrate that they know how it works.

AXDC – Don Box and WS-Why?

Intro by Chris Sells, everyone in the audience already knows who Don Box is.

The original title of the talk was “WS-Islands.” He’s going to walk through the various standards in the talk, and attempt to explain and justify each one.

WS-DesertIsland best describes the architecture. If you had to put 5 specs on a zip drive before heading off to a desert island, which specs would you take?

  • XML or lisp, one or the other. Sticking with LISP might have been better (aside to Tim Bray, why didn’t you guys just use LISP? Tim: Uh, uh, uh, Uh. Uh Uh!). All I need is a simple way to write down structured data. Hard to imagine life without one or the other. Both of these systems make the absolute minimum commitment to abstraction; they are just data. With no idea where an XML document has been, can walk through it, process it, munge it, and so forth, without bringing along any code. The data outlives the code in many situations.
  • SOAP, after somewhat of a struggle. SOAP 1.2 part 1 is a pretty reasonable place. Needed better extensibility mechanism other than “throw shit anywhere you want.” SOAP constrains the top-level elements and defines this mechanism, basically a header extensibility model. This model seems to hold up. .Net serializer is adopting this model, all logically SOAP based without a SOAP envelope. This produced some good code synergies. SOAP 1.1, Section 5, went way too far, “the work of the devil, a Linda Blair experience.” Had to prohibit PI’s because processing model was ill-defined — was it aimed at any intermediaries or the ultimate receiver? Also removed DOCTYPE so that they had an atomic unit of transmission — the SOAP envelope. SWA (SOAP With Attachments) went a bit too close to the edge. Mixed content model was horribly naive. All in all, a good place to be. Most of what SOAP gives us is actually pretty good.
  • WS-Addressing, since SOAP contains as little policy as possible. Build simple & reliable mechanisms and layer policy atop that. SOAP over HTTP buried too much addressing in the transport. This made it impossible to store a message and process it later since it got in the way of end-to-end guarantees. Putting it in the SOAP envelope allowed for digital signatures, XPath, and more. This turned out to be very useful. 2 controversial points. First, the Action URI. Message is just data, but since it was sent to an endpoint, want to assign names to semantics instead of just a POST to a proliferation of addresses. Best to capture intent of original sender. Second was the endpoint reference. He notes that this is a platform (extensibility hook) for the future, with lots of stuff to be layered atop. This will require cleanup later on. “I strongly discourage you from reading all of these specs.” Defined specs as “minimum progress required to declare victory.” IBM has figured out how to turn complexity into a business model — IBM Global Services. Specs which embedded the intent and expected usage would never converge. So just agree about what’s on the wire, and spec that. Pretty clear that IBM and Microsoft are the only two elephants contributing to these specs based on his description of how the spec development process works.
  • ws-mex, they forgot to do this earlier — Metadata Exchange! Perhaps will be fixed in 2007, 9 years after the first SOAP spec was written. Don introduced Omri, who owns all of the WS- specs at Microsoft. ws-mex is IProvideClassInfo for services!
  • XSD WSDL, not separate in his mind. Both equally onerous and full of bad, stupid, committee-think within. Relax-NG would be better (and the Amiga too).

So, he could live with the above plus a transport as a basic kernel.
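
To make the "addressing lives in the envelope, not the transport" point concrete, here is a sketch that builds a SOAP 1.2 envelope with WS-Addressing headers using ElementTree. The build_envelope helper, the endpoint, the action URI, and the payload are invented for illustration. The WS-Addressing namespace shown is the one from the later W3C Recommendation, which is an assumption on my part; the drafts current at the time of the talk used a different URI.

```python
import xml.etree.ElementTree as ET

SOAP = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
WSA = "http://www.w3.org/2005/08/addressing"      # assumed WS-Addressing namespace

def build_envelope(to, action, payload):
    """Wrap `payload` in an envelope whose headers say where it goes and why."""
    ET.register_namespace("env", SOAP)
    ET.register_namespace("wsa", WSA)
    envelope = ET.Element(f"{{{SOAP}}}Envelope")
    header = ET.SubElement(envelope, f"{{{SOAP}}}Header")
    ET.SubElement(header, f"{{{WSA}}}To").text = to
    ET.SubElement(header, f"{{{WSA}}}Action").text = action
    body = ET.SubElement(envelope, f"{{{SOAP}}}Body")
    body.append(payload)
    return envelope

# Hypothetical endpoint, action, and payload.
payload = ET.Element("ping")
envelope = build_envelope(
    "http://example.org/service",
    "http://example.org/actions/Ping",
    payload,
)
wire_text = ET.tostring(envelope, encoding="unicode")
```

Because the destination and the intent travel inside the message, it can be stored, signed, queried, and processed later without reference to the HTTP request that carried it.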

If we get to choose more, which ones:

  • ws-security, difficult but necessary in the context of MOM-based systems. SSL and the like don’t work with store and forward systems. Security must go along with the message as part of the data. Web economy requires SSL and TLS. ws-security is the simplest thing that could work, given a bunch of other requirements — XML Dsig and encryption. This is a pain point right now. ws-security 10x slower than SSL!
  • WS-Trust, metadata exchange for WS-Security.
  • WS-Reliable Messaging, little to add over raw TCP unless you are going through relays or intermediaries where things could get reordered or delivered more than once, or dropped. Doesn’t give durability or queuing, just reliable end-to-end messaging. Don’t use it if your transport already does this.
  • ws-policy, first one way too complicated. New one is simpler and even implementable.
  • ws-NewZealand, a lovely place to visit exactly once per lifetime. Here are specs like that. ws-eventing, strictly speaking it is not needed. But there’s a desire to do pub-sub. Microsoft ships a simple version, IBM ships WS-RF, a three volume spec. ws-atomictx, good for the three people building transaction managers. ws-enumeration.
  • ws-IslandOfDoctorMoreau, the ugly stuff. UDDI, bad spec, terrible marketing, horrid overall. Spec is better than WSDL, less crazy-ass features and less non-determinism. Put way too much business in there. ws-businessactivity, bpel4ws. mtom, cleanup after SWA, ws-transfer, published to tempt the RESTafarians.
  • ws-fantasyIsland, stuff he wants. ws-data (isn’t this just XML?, asked someone). SOAP over TCP, being built for Indigo, binary XML, XSD 1.Noah. Retire NG (a WSDL replacement in the spirit of Relax NG)
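
Going back to the WS-Reliable Messaging item in the list above: the problem it addresses (messages reordered, duplicated, or delayed by intermediaries) can be sketched with plain sequence numbers. The InOrderReceiver class below is my own toy, not the WS-ReliableMessaging wire protocol, and it says nothing about acknowledgements or retransmission.

```python
class InOrderReceiver:
    """Deliver each numbered message exactly once, in sequence order."""

    def __init__(self):
        self.next_seq = 1     # next sequence number we are waiting for
        self.pending = {}     # messages that arrived ahead of their turn
        self.delivered = []   # messages handed to the application

    def receive(self, seq, message):
        if seq < self.next_seq or seq in self.pending:
            return  # duplicate: already delivered or already buffered
        self.pending[seq] = message
        # Release every message that is now contiguous with what was delivered.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

receiver = InOrderReceiver()
for seq, message in [(2, "b"), (1, "a"), (2, "b"), (3, "c")]:
    receiver.receive(seq, message)
```

As Don said, none of this gives durability or queuing, and there is no point in doing it if the transport already guarantees ordering end to end.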

AXDC – Lunch

After a quick lunch and some good conversations, I have a few minutes to update the blog…

With the exception of the omnipresent Sam Ruby, the audience at this conference is almost entirely disjoint from the audience at the other recent conferences that I’ve been to. It’s as if the Microsoft, open source, and blog-based worlds are separate and distinct entities.

I was expecting to see Scoble here. So far, he’s a no-show.

AXDC – Cauldwell and Hanselman – Legacy Financial Systems

Patrick Cauldwell and Scott Hanselman described their work to use XML schemas to formalize the interaction contract with the Voyager banking platform. They work for Corillian and claimed that 25% of the online banking activity in the US takes place on their system.

A memorable phrase from this talk was “Brady Bunch Wrox Books”. Oddly enough, the audience didn’t really seem to be familiar with the Brady Bunch and didn’t get the reference. Given that I recently worked with an intern who didn’t know that Charlie’s Angels was once a TV show, this is not too surprising.

“Get rid of code that monkeys can write.” This includes various types of validation that can be written in a declarative fashion and checked using Schematron, Babel (?) or Slang (?). This is expensive code and “not fun.” Automating it lets people focus on the more interesting parts.

For fun, they created a Tivo interface to their mainframe-based platform. This is used for UI development right now.

AXDC – Chris Anderson – “Developers Hate XML”

Chris Anderson didn’t really talk too much about what’s wrong with XML. He used simple, large-font slides and did a lot of hand-waving. The great thing about Chris is that he understands things at a very deep level, and that he can cogently and clearly explain them. He and I worked together when we were both on the VB6 team at Microsoft. He built (and I debugged) a nifty VB tool called the DHTML Page Designer — a weird amalgamation of a web page and downloaded VB code that is now safely dead and buried.

Chris’s big beef with XML is that XML must be processed in isolation, using special purpose tools and languages such as XSLT. In order to use these special things, developers must become domain experts in a rich and complex space that is essentially unrelated to the application itself.

The alternative to this is to use some kind of object model — but as Chris says, this is like using a straw to get to relational data through ADO.

Applications do a lot more than just process XML, and integrated programming models are key.

XML and .Net are not a good match! XML can be characterized by:

  1. Schema
  2. Loose
  3. XML type system

Whereas .Net is more like:

  1. metadata
  2. strong
  3. CLR type system

We are really talking about documents vs. programs — there is a disconnect; no clear way to use one single programming model for both.

XML is an integration technology; it lets you get between two endpoints; it is a plumbing technology like TCP.

Chris Sells says — “It’s about information, not code.” He cited Technorati as an example, and speculated that there was probably not a lot of code inside. A member of the audience made it clear that Technorati has MySQL and Perl inside. Chris was shocked and momentarily nonplussed for some reason.

Finally, Chris moved on to XAML. They had noticed that user expectations for UI continue to rise; they want designers and creative people to build the UI, not the developer. Grey boxes on the screen are no longer sufficient; designers and users want rich UI’s, embedded video, and so forth. Therefore, XAML is for interop between graphic designers and developers, and for integration between applications & ecosystems.

XML is somewhat human legible and very machine legible.

Developers see XAML as a new primary programming model and they believe that the programming model should dictate the markup: Markup == OM. The markup is derived from and reflects the object model.

Chris gave some examples of the XAML declaration for buttons, and showed how XML attributes could be used for simple property settings, and then turned into nested tags to set the same properties in a richer way. For example, making a button red would be done with a property; putting in an embedded graphic or texture would use nested tags.

Question from the audience: Why not SVG? It’s good for doing vector markup, not for element-based UI. Could use SVG + XUL + HTML + some app / event markup. XAML was supposed to be a unified and integrated thing.

The XAML programming model is .Net; this is the platform foundation and the XML is a reflection of that. XML is a data format, not a programming model.

So developers don’t hate XML, they hate the systems that make XML be the heart of the platform — you have to understand namespaces, transformations, and lots of other stuff.

What about other people implementing XAML? Not speaking for Microsoft, Chris hopes there will be a rich ecosystem of tools around XAML; doing it in XML enables this. Building a runtime for something not yet shipped is not a good way to guarantee compatibility.

Chris used the Java Swing and AWT toolkits to make a good point about the difficulties inherent in guaranteeing functional compatibility and pixel-for-pixel correctness across platforms. Avalon not designed to be cross platform.

Applied XML Developer Conference

I got up at 4 AM today, took care of some odds and ends on the Syndic8 server, got dressed, and hopped in my car for a quick (80+ MPH) Southbound dash on I5, then east along the Columbia River Gorge and up into the mountains. The early morning view of the river was spectacular, the road had plenty of twists and turns, and I was half-tempted to bag the conference and keep on driving!

I am now at the Applied XML Developer’s Conference, staying at the Skamania Lodge. I have to come back here as a tourist some time really soon.

I missed Tim Bray’s talk. I heard he bashed Microsoft 10 ways to Sunday; that’s what speakers from Sun are supposed to do, of course.

My plan is to blog each talk as a separate posting today and tomorrow.

Exponential Growth in Feed Count!

Although one month of data certainly isn’t enough to be a trend, the number of feeds in Syndic8 appears to be growing exponentially:

[Chart: feed growth]

We had near-linear growth for the first 2+ years. Early this year the curve started heading up at an ever-increasing rate.

Put another way, Syndic8 users have submitted over 45,000 feeds so far this month. It took 2 years to accumulate the first 45,000 feeds.

I’ve been refining my polling algorithms, encouraging people to use pings, and adding additional hardware. There’s still a lot to be done (by me and by others) to make better use of ETags, delta compression, adaptive polling, and so forth.
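
For anyone writing their own poller, here is a minimal sketch of the two cheapest wins, conditional GET and adaptive polling, in Python. This is not Syndic8's code. The function names, the feed URL, and the interval limits are made up for illustration. A server that supports conditional GET answers 304 Not Modified with no body when the ETag still matches.

```python
import urllib.request

def build_poll_request(url, etag=None, last_modified=None):
    """Build a conditional GET using validators saved from the previous poll."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request

def next_interval(current, changed, minimum=900, maximum=86400):
    """Adaptive polling: poll busy feeds more often, quiet feeds less often."""
    if changed:
        return max(minimum, current // 2)
    return min(maximum, current * 2)

# Hypothetical feed and validator values remembered from the last fetch.
request = build_poll_request(
    "http://example.org/feed.xml",
    etag='"abc123"',
    last_modified="Thu, 20 Oct 2005 12:00:00 GMT",
)
headers = {name.lower(): value for name, value in request.header_items()}
```

Halving and doubling the interval is just one simple policy; the point is that a feed that has not changed in a week does not need to be fetched every hour.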

I also need to simplify and streamline the review process. I made some small changes yesterday, but I need to do a lot more.

Things are about to get really exciting.