Firefox 1.0, not a Moment Too Soon

The Firefox 1.0 Preview Release was announced earlier this week. It now includes an integrated RSS reader, among other features.

Now that Firefox and Mozilla are gaining in popularity, the hackers are poring over the code with a fine-toothed comb, looking for vulnerabilities, a big pile of which were announced earlier this week at Secunia. Upgrade now if you are not using 1.0.

The upgrade from 0.8 went very smoothly. Firefox checked each of my extensions (as previously blogged here) for compatibility, found some updates, and installed them. Looking good!

Real Time Headline View

Joyce Park and Mike Dierken of the mod-pubsub gang have taken the headline feed that I described yesterday and created a browser-based scrolling view at http://mod-pubsub.org/rss_scroller.html.

If your blogging tool pings Syndic8 or Ping-O-Matic, your posts should show up on this page within seconds of posting.

Now, imagine if every aggregator in the world were equipped to listen to this channel (I am assured that it is possible to scale to this level). Every reader would know about new events within seconds. When I have some time I will draw and post a very complete diagram of the action and event flow that would transpire in such a system.

Syndication Scalability

Site operators are starting to get concerned about the scalability of the entire poll-based syndication model. In fact, Microsoft caused a stir within the community when they stopped including complete content in their MSDN feeds. Others (e.g. Robert Scoble) are starting to do the math, and the problem is becoming clear.

All of this talking is great — it certainly puts the issue out in the public. But we have to do something! I have some thoughts, and some working code.

First, some background…

Over 4 years ago I did some consulting work for what was then a tiny Seattle company. After meeting the two founders and listening to them talk for two hours, I walked away thinking “I am not sure what they are doing, but it is going to be cool.” They were talking about level 7 routers, event distribution, internet-wide scaling, real-time notification, and lots more. I was invited to meet with them after they saw an early version of my Headline Viewer news aggregator. At that time (and several times since) we talked about flowing headlines on to the desktop in real time.

I’ve been pursuing that goal ever since.

This embryo of a company was ultimately funded by Kleiner Perkins. The two founders were Adam Rifkin and Rohit Khare, and the company was KnowNow. I should add, as a disclaimer, that I do own a little bit of KnowNow stock as a result of my stint there as Temporary VP of Engineering.

KnowNow went on to build a great product, and they also spun off an open source version of it as mod-pubsub. This Apache module encapsulates all of the core publish, subscribe, and routing functionality of their commercial-grade product.

There is also a public instance of the product running at the same site.

Since early June I have been working to make Syndic8 into a great ping receiver. It now receives and processes a ping every couple of seconds and displays the results in the Pinged Feeds Box. I did all that I could to make the ping processing efficient and lightweight, even going so far as to use a RAM-based MySQL table for some transient data elements.

This week I took the next steps.

First, I made sure that each ping was a legitimate ping. There are two sources of what I will call “bogus” pings. First, some sites, in desperate need of attention, will ping even though the associated feed has not changed. According to some stats that I started tracking yesterday, about 2/3 of the pings are bogus. Second, some people will fine-tune a blog entry after it has been published. This seems to generate some spurious pings.

Second, I figured out what was truly new as of each ping. Because I store all of the XML for every feed in the Syndic8 feed list, it was a very simple matter to parse the old one, parse the new one, and compare them. This results in 0 or more new items (title, link, description, and so forth).
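The parse-and-compare step can be sketched roughly as follows. This is only an illustration in Python, not the Syndic8 code: the function names are made up, items are matched on their link (falling back to the title), and only un-namespaced RSS `item` elements are handled. Matching on the link means an unchanged feed, or a post that was merely fine-tuned, yields zero new items.

```python
import xml.etree.ElementTree as ET


def parse_items(rss_xml):
    """Return the feed's items as a list of dicts (title, link, description)."""
    root = ET.fromstring(rss_xml)
    items = []
    for node in root.iter("item"):
        items.append({
            "title": node.findtext("title", default=""),
            "link": node.findtext("link", default=""),
            "description": node.findtext("description", default=""),
        })
    return items


def item_key(item):
    """Assumption: the link identifies an item; fall back to the title."""
    return item["link"] or item["title"]


def new_items(old_xml, new_xml):
    """Return items that appear in new_xml but not in old_xml, in feed order."""
    seen = {item_key(item) for item in parse_items(old_xml)}
    return [item for item in parse_items(new_xml) if item_key(item) not in seen]
```

An empty result is one plausible way to flag a ping as bogus, since nothing in the feed is new.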

Third, I published the new items to the topic /what/syndic8.com/news/items at mod-pubsub.org. You can see the new items in the Event Introspector application at that site. This is a real-time browser-based application.
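To make the publish and subscribe idea concrete, here is a toy in-process broker in Python. It is not the mod-pubsub API (which is not shown in this post); the `TopicBroker` class and its methods are hypothetical, and only the topic name comes from the text above.

```python
from collections import defaultdict


class TopicBroker:
    """Toy in-process broker: subscribers register a callback per topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register callback to be called with every event published to topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Deliver event to each subscriber of topic; return the delivery count."""
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(event)
        return len(callbacks)


if __name__ == "__main__":
    broker = TopicBroker()
    broker.subscribe("/what/syndic8.com/news/items", print)
    broker.publish("/what/syndic8.com/news/items",
                   {"title": "Hello", "link": "http://example.com/hello"})
```

The point of the pattern is that listeners are told about a new item when it is published, instead of each one polling for it.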

The ping processing within Syndic8 takes around 3 seconds, on average. This is mainly due to the need to actually fetch the feed; the internal processing is cheap, efficient, and scalable. I use a message queue (known as a “System V message queue” when I was a kid) as the asynchronous coupling between the first-stage processing, when the ping is received, and the second stage, when the XML is fetched. I can easily add more processes if the queue length starts to grow. It would not be hard to move this processing off to another machine (or machines) if necessary.
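The two-stage arrangement can be sketched like this. It is a stand-in rather than the real thing: Python's in-process `queue.Queue` and threads take the place of the System V message queue and separate processes, and `run_pipeline`, `fetch`, and `worker_count` are illustrative names.

```python
import queue
import threading


def run_pipeline(pinged_urls, fetch, worker_count=2):
    """Stage 1 enqueues pinged URLs; stage 2 workers fetch them concurrently."""
    pending = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            url = pending.get()
            if url is None:              # sentinel: no more work for this worker
                return
            try:
                outcome = fetch(url)     # the slow step (a network fetch in real life)
            except Exception as exc:     # record the failure, keep the worker alive
                outcome = exc
            with results_lock:
                results.append((url, outcome))

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in workers:
        thread.start()

    # Stage 1: accepting a ping is cheap because it only enqueues the URL.
    for url in pinged_urls:
        pending.put(url)

    # One sentinel per worker, then wait for all of them to finish.
    for _ in workers:
        pending.put(None)
    for thread in workers:
        thread.join()
    return results
```

Raising `worker_count` corresponds to adding more processes when the queue starts to back up.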

The publishing end of this has been running for about 24 hours. So far, so good. Latency is low, system performance is still good. I’m working with some members of the mod-pubsub team to get some demos cooked up. We need to get some aggregator developers to take a look at what we’ve done so far and to figure out what else has to be done. We definitely need to work on categorization and metadata to allow aggregators to listen for changes within topical areas.

I think this could be the start of something big. From the blogger’s point of view I think this is pretty cool. Less than 5 seconds after a post is written and published, it can be present on machines all around the world.

We are now a few steps closer to that goal we settled upon over 4 years ago in that now-condemned building in Seattle, making plans in a room paneled entirely with white boards. It’s been a great journey so far, but the best is undoubtedly yet to come.

Sometimes, you just have to dig in!

I’ve always been happiest when I can dig all the way down to the very bottom of a software problem. For the last couple of years I’ve been writing lots of PHP, where I can remain more or less ignorant of what’s happening internally. However, it is still nice to be able to fire up a low-level debugger and take a look around.

Case in point: I have a daemon (background process) running on the Syndic8 server. It reads URLs from a System V message queue, fetches and parses the page, and looks for interesting items on the page. 99% of the time this process runs fine. Sometimes, however, it would get stuck and I didn’t know why.

It happened again today, and I had a few spare minutes to dig in. So I used ps to get the Process Id, then I invoked GDB like this:

gdb /usr/local/bin/php 10942

I used the where command to get a stack trace. The top entries on the stack looked like this:

#0 0x00b0fc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2

#1 0x00bfbaed in ___newselect_nocancel () from /lib/tls/libc.so.6

The word “select” in the second entry caught my eye, since it is used to wait for an event to occur on a file descriptor. If the select is used without an accompanying timeout, it may very well wait forever.
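The difference a timeout makes is easy to show with a small Python sketch. The helper name `wait_readable` is made up for illustration, and a local socket pair stands in for the connection to a web server.

```python
import select
import socket


def wait_readable(sock, timeout_seconds):
    """Return True if sock becomes readable within timeout_seconds, else False.

    Passing None as the timeout would block until data arrives, which is
    the "wait forever" case when the other side never answers.
    """
    readable, _, _ = select.select([sock], [], [], timeout_seconds)
    return bool(readable)


if __name__ == "__main__":
    ours, theirs = socket.socketpair()
    print(wait_readable(ours, 0.2))   # the peer has sent nothing yet
    theirs.sendall(b"x")
    print(wait_readable(ours, 0.2))   # one byte is now waiting to be read
    ours.close()
    theirs.close()
```

With a timeout, a silent peer costs at most `timeout_seconds`, and the caller can decide whether to retry or give up.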

Up a few more levels on the stack, and I found this:

#5 0x081337ec in php_stream_url_wrap_http_ex (wrapper=0x9c08280, path=0x9ef11fc "http://www.amaamy.com/blog", mode=0xa1f990c "r", options=4, opened_path=0x0, context=0x0, redirect_max=19, header_init=1) at /home/downloads/Apachetoolbox-1.5.70/src/php-4.3.6/ext/standard/http_fopen_wrapper.c:492

Hmmm, fopen, that’s interesting. I never use fopen to connect to web sites; I prefer cURL and the fine-grained control that comes along with it.

So now I had to know what code was making this call. I went up the stack a few more levels until I was inside of the PHP interpreter. From previous experience, this is a function named execute. I got there and then took a look at the local variables:

(gdb) inf loc
calling_symbol_table = (HashTable *) 0x9c6e4bc
original_return_value = (zval **) 0xbff36684
execute_data = {opline = 0x9c71d1c, function_state = {function_symbol_table = 0x9c809e4, function = 0x9c78278, reserved = {0x8184c05, 0x837ec4c, 0x9c6fba4, 0xb}}, fbc = 0x9c78278, ce = 0x0, object = {ptr = 0x0}, Ts = 0xbff34710, original_in_execution = 1 '

After a bit more digging I found the data structure containing the name of the PHP function, and I was all set. It turns out that the fopen call is made from a third party HTML parsing library that I use. Since it is open, I can fix it or work around it.

Now I could have tried to solve this problem any number of ways. I could have added a bunch of print statements, or I could have tried to install a PHP debugger. But I have been using GDB for a long, long time and it was the tool at hand. I spent about 3 minutes debugging and about 20 blogging about it.

Now This is a Data Center

For the past year, Syndic8 has been hosted at BocaCom of Boca Raton, Florida. During this time I’ve been impressed time and time again with the utter dedication of the staff and the reliability of the service. During the recent hurricanes, they took pains to reassure the customers that the data center was built to withstand a category 5 event, and that our servers were safe.

It turns out that they are located in a classic building that once housed IBM’s North American Research Division and is now known as the T-Rex Corporate Data Center. Very secure and very stylish!

It Works!

We made a lot of progress on Saturday. Click to see the picture in high resolution.

First, Stephen finished up the axle:

Steve Axle

Here’s what it looks like. That axle holds the big black sprocket, the wheel, and a disc brake:

Back Wheel

I machined 4 slots on the motor mounting plate. I didn’t get them positioned just right, and we had to procure a new plate from our friendly neighborhood Metal Supermarket. The new plate is thicker, and we are much happier with the result. Here’s Andy showing off the slots:

Andy Motor Mount

Then we mounted the motor to the plate and put the chain on. Stephen and Andy had already mounted the torque converter to the motor. Here are the boys in action:

Boys in Action

Finally, we dragged it outside and started it up. The wheel turned, and there was much rejoicing (the full version of this picture gives you a great view of what we’ve done so far):

Startup

Lots of Bits!

The following picture shows a month’s worth of bandwidth usage on the Syndic8 server:

Syndic8 Bandwidth

In one month, Syndic8 pulled in 139 GB of data and gave out 213 GB, for a total transfer of about 353 GB. Fortunately, my hosting plan at BocaCom includes 500 GB of transfer per month.

The periodic spikes occur when the Syndic8 poller goes out and downloads the latest info from each feed. This happens twice a day for most feeds.

I do need to start using things like ETags and If-Modified-Since headers to reduce the inbound bandwidth a bit. Here’s some info on the right way to do this, and some more here.
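As a rough idea of what a conditional fetch looks like, here is a Python sketch using the standard library. It is not the Syndic8 poller, and the function names are illustrative: the poller would store the ETag and Last-Modified values from the previous fetch and send them back as If-None-Match and If-Modified-Since. A server that supports validators answers 304 Not Modified with no body when the feed is unchanged.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying whatever validators were saved last time."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, etag=None, last_modified=None, timeout=10):
    """Return (body, etag, last_modified); body is None if the server says 304."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:              # unchanged: keep the old validators
            return None, etag, last_modified
        raise
```

For feeds that rarely change, most polls would then transfer only headers instead of the full XML.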