Joseph Smarr

One of the last things I did before leaving Plaxo was to implement PubSubHubbub (PuSH) subscriber support, so that any blogs which ping a PuSH hub will show up almost instantly in pulse after being published. It’s easy to do (you don’t even need a library!), and it significantly improves the user experience while simultaneously reducing server load on your site and the sites whose feeds you’re crawling. At the time, I couldn’t find any good tutorials for how to implement PuSH subscriber support (add a comment if you know of any), so here’s how I did it. (Note: depending on your needs, you might find it useful instead to use a third-party service like Gnip to do this.)

My assumption here is that you’ve already got a database of feeds you’re subscribing to, but that you’re currently just polling them all periodically to look for new content. This tutorial will help you “gracefully upgrade” to support PuSH-enabled blogs without rewriting your fundamental polling infrastructure. At the end, I’ll suggest a more radical approach that is probably better overall if you can afford a bigger rewrite of your crawling engine.

The steps to add PuSH subscriber support are as follows:

Identify PuSH-enabled blogs and extract their hub and topic
Lazily subscribe to PuSH-enabled blogs as you discover them
Verify subscription requests from the hub as you make them
Write an endpoint to receive pings from the hub as new content is published
Get the latest content from updated blogs as you receive pings
Unsubscribe from feeds when they’re deleted from your system

1. Identify PuSH-enabled blogs and extract their hub and topic

When crawling a feed normally, you can look for some extra metadata in the XML that tells you this blog is PuSH-enabled. Specifically, you want to look for two links: the “hub” (the URL of the hub that the blog pings every time it has new content, which you in turn communicate with to subscribe and receive pings when new content is published), and the “self” (the canonical URL of the blog you’re subscribing to, which is referred to as the “topic” you’re going to subscribe to from the hub).

A useful test blog to use while building PuSH subscriber support is http://pubsubhubbub-example-app.appspot.com/, since it lets anyone publish new content. If you view source on that page, you’ll notice the standard RSS auto-discovery tag that tells you where to find the blog’s feed:

<link title="PubSubHubbub example app" type="application/atom+xml" rel="alternate" href="http://pubsubhubbub-example-app.appspot.com/feed" />

And if you view source on http://pubsubhubbub-example-app.appspot.com/feed, you’ll see the two PuSH links advertised underneath the root feed tag:

<link type="application/atom+xml" title="PubSubHubbub example app" rel="self" href="http://pubsubhubbub-example-app.appspot.com/feed" />
<link rel="hub" href="http://pubsubhubbub.appspot.com/" />

You can see that the “self” link is the same as the URL of the feed that you’re already using, and the “hub” link points to the free hub hosted on AppEngine at http://pubsubhubbub.appspot.com/. In both cases, look for a link tag under the root feed tag, match the appropriate rel-value, and extract the corresponding href-value from that link tag. Keep in mind that rel-attributes can have multiple, space-separated values (e.g. rel="self somethingelse"), so split the rel-value on spaces and then look for the specific matching rel-value. Note that the example above is an Atom feed; in RSS feeds, you generally have to look for atom:link tags under the channel tag under the root rss tag, but the rest is the same.

Once you have the hub and self links for this blog (assuming the blog is PuSH-enabled), you’ll want to store the self-href (aka the “topic”) with that feed in your database so you’ll know whether you’ve subscribed to it, and, if so, whether the topic has changed since you last subscribed.
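As a rough illustration of the discovery step, here is a minimal Python sketch that uses only the standard library. The function name find_push_links and the SAMPLE document are made up for this example; a real crawler would plug the same logic into whatever feed parser it already uses.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def find_push_links(feed_xml):
    """Return (hub, topic) from a feed document; either may be None."""
    root = ET.fromstring(feed_xml)
    if root.tag == ATOM_NS + "feed":
        # Atom: link tags sit directly under the root feed tag.
        links = root.findall(ATOM_NS + "link")
    else:
        # RSS: atom:link tags sit under the channel tag under the root rss tag.
        channel = root.find("channel")
        links = channel.findall(ATOM_NS + "link") if channel is not None else []

    hub = topic = None
    for link in links:
        # rel may hold several space-separated values, e.g. rel="self somethingelse".
        rels = link.get("rel", "").split()
        href = link.get("href")
        if not href:
            continue
        if "hub" in rels and hub is None:
            hub = href
        if "self" in rels and topic is None:
            topic = href
    return hub, topic


# Hypothetical sample modeled on the example feed above.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self somethingelse" href="http://pubsubhubbub-example-app.appspot.com/feed" />
  <link rel="hub" href="http://pubsubhubbub.appspot.com/" />
</feed>"""

hub, topic = find_push_links(SAMPLE)
print("hub:", hub)
print("topic:", topic)
```

If either value comes back as None, the feed isn’t advertising PuSH support and you simply keep polling it as before.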

2. Lazily subscribe to PuSH-enabled blogs as you discover them

When you’re crawling a feed and you notice it’s PuSH-enabled, check your feed database to see if you’ve got a stored PuSH-topic for that feed, and if so, whether the current topic is the same as your stored value. If you don’t have any stored topic, or if the current topic is different, you’ll want to talk to that blog’s PuSH hub and initiate a subscription so that you can receive real-time updates when new content is published to that blog. By storing the PuSH-topic per-feed, you can effectively “lazily subscribe” to all PuSH-enabled blogs by continuing to regularly poll and crawl them as you currently do, and adding PuSH subscriptions as you find them. This means you don’t have to do any large one-time migration over to PuSH, and you can automatically keep up as more blogs become PuSH-enabled or change their topics over time. (Depending on your crawling infrastructure, you can either initiate subscriptions as soon as you find the relevant tags, or you can insert an asynchronous job to initiate the subscription so that some other part of your system can handle that later without slowing down your crawlers.)
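The “lazy subscribe” check can be sketched in a few lines of Python. Everything named here is a placeholder: feed_db stands in for your real feed table, subscribe_jobs for whatever asynchronous job queue you have, and note_push_links is a hypothetical hook you’d call at the end of each crawl.

```python
feed_db = {123: {"push_topic": None}}  # hypothetical stand-in for your feed table
subscribe_jobs = []                    # hypothetical stand-in for an async job queue


def note_push_links(feed_id, hub, topic):
    """Call after each crawl; returns True if a subscription job was queued."""
    if not hub or not topic:
        return False  # not PuSH-enabled, so keep polling as usual
    row = feed_db[feed_id]
    if row["push_topic"] == topic:
        return False  # already subscribed to this exact topic
    subscribe_jobs.append({"feed_id": feed_id, "hub": hub, "topic": topic})
    # Stored right away for brevity; you may prefer to store the topic only
    # after the hub has accepted the subscription request.
    row["push_topic"] = topic
    return True
```

Because the check runs on every crawl, a blog that becomes PuSH-enabled later, or changes its topic, gets picked up the next time you poll it.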

To subscribe to a PuSH-enabled blog, just send an HTTP POST to its hub URL and provide the following POST parameters:

hub.callback = [the URL of your endpoint for receiving pings, which we’ll build in step 4]
hub.mode = subscribe
hub.topic = [the self-link / topic of the feed you’re subscribing to, which you extracted in step 1]
hub.verify = async [means the hub will separately call you back to verify this subscription]
hub.verify_token = [a hard-to-guess token associated with this feed, which the hub will echo back to you to prove it’s a real subscription verification]

For the hub.callback URL, it’s probably best to include the internal database ID of the feed you’re subscribing to, so it’s easy to look up that feed when you receive future update pings. Depending on your setup, this might be something like http://yoursite.com/push/update?feed_id=123 or http://yoursite.com/push/update/123. Another advantage of this technique is that it makes it relatively hard to guess what the update URL is for an arbitrary blog, in case an evil site wanted to send you fake updates. If you want even more security, you could put some extra token in the URL that’s different per-feed, or you could use the hub.secret mechanism when subscribing, which will cause the hub to send you a signed verification header with every ping, but that’s beyond the scope of this tutorial.
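For the path-style callback URL, a small Python sketch of building the URL and recovering the feed ID might look like this. The base URL comes from the example above, and both function names are invented for illustration.

```python
import re

CALLBACK_BASE = "http://yoursite.com/push/update"  # example host from the text above


def callback_url_for(feed_id):
    """Build the per-feed callback URL to send as hub.callback."""
    return "%s/%d" % (CALLBACK_BASE, feed_id)


def feed_id_from_path(path):
    """Recover the feed ID from an inbound request path, or None if it doesn't match."""
    match = re.fullmatch(r"/push/update/(\d+)", path)
    return int(match.group(1)) if match else None
```

Returning None for unrecognized paths lets the endpoint reject stray requests before doing any database lookup.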

For the hub.verify_token, the simplest thing would just be to pick a secret word (e.g. “MySekritVerifyToken”) and always use that, but an evil blog could use its own hub and quickly discover that secret. So a better idea is to compute something like the HMAC-SHA1 of the topic URL, keyed with some secret salt you keep internally. This way, the hub.verify_token value is feed-specific, but it’s easy to recompute when you receive the verification.
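Here is one way the per-feed token could be computed in Python with the standard hmac module. SECRET_SALT and both function names are placeholders for this sketch; use your own secret and keep it out of source control.

```python
import hashlib
import hmac

SECRET_SALT = b"replace-with-your-own-secret"  # hypothetical; keep this private


def make_verify_token(topic_url):
    """HMAC-SHA1 of the topic URL, keyed with the internal secret salt."""
    return hmac.new(SECRET_SALT, topic_url.encode("utf-8"), hashlib.sha1).hexdigest()


def is_valid_verify_token(topic_url, token):
    """Recompute the token and compare it in constant time."""
    expected = make_verify_token(topic_url)
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))
```

Since the token is derived from the topic, there is nothing extra to store: you recompute it when the verification request arrives.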

If your subscription request is successful, the hub will respond with an HTTP 202 “Accepted” code, and will then proceed to send you a verification request for this subscription at your specified callback URL.
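Putting the parameters together, a subscription request could be sketched in Python with urllib as follows. The function names are made up for this example, and error handling is left out (urlopen raises an exception on 4xx and 5xx responses, which you’d want to catch and log).

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_subscribe_request(hub_url, topic, callback, verify_token, mode="subscribe"):
    """Build (but do not send) the form-encoded POST to the hub."""
    params = {
        "hub.callback": callback,
        "hub.mode": mode,
        "hub.topic": topic,
        "hub.verify": "async",
        "hub.verify_token": verify_token,
    }
    body = urlencode(params).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(hub_url, data=body, headers=headers, method="POST")


def send_subscribe_request(request, timeout=10):
    """Send the request; True means the hub answered 202 Accepted."""
    with urlopen(request, timeout=timeout) as response:
        return response.status == 202
```

Keeping the build and send steps separate makes it easy to log or unit-test the exact parameters without talking to a real hub.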

3. Verify subscription requests from the hub as you make them

Shortly after you send your subscription request to the hub, it will call you back at the hub.callback URL you specified with an HTTP GET request containing the following query parameters:

hub.mode = subscribe
hub.topic = [the self-link / topic of the URL you requested a subscription for]
hub.challenge = [a random string that you have to echo back in the response body to acknowledge the verification]
hub.verify_token = [the value you sent in hub.verify_token during your subscription request]

Since the endpoint at which you receive this verification request is the same one on which you’ll receive future update pings, your logic has to first look for hub.mode=subscribe. If it’s present, verify that the hub sent the proper hub.verify_token back to you, and then just dump out the hub.challenge value as the response body of your page (with a standard HTTP 200 response code). Now you’re officially subscribed to this feed, and will receive update pings when the blog publishes new content.

Note that hubs may periodically re-verify that you still want a subscription to this feed. So you should make sure that if the hub makes a similar verification request out-of-the-blue in the future, you respond the same way you did the first time, providing you indeed are still interested in that feed. A good way to do this is just to look up the feed every time you get a verification request (remember, you build the feed’s ID into your callback URL), and if you’ve since deleted or otherwise stopped caring about that feed, return an HTTP 404 response instead so the hub will know to stop pinging you with updates.
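The verification logic can be sketched as a small framework-neutral Python function. It assumes you’ve already parsed the query string into a dict, looked up the stored topic for the feed ID in the callback URL (None if the feed is gone), and recomputed the expected token; handle_verification is a hypothetical name.

```python
def handle_verification(params, stored_topic, expected_token):
    """Return (status_code, body) for a hub verification GET request."""
    if stored_topic is None:
        return 404, ""  # feed deleted or no longer wanted: tell the hub to stop
    if params.get("hub.mode") != "subscribe":
        return 404, ""
    if params.get("hub.topic") != stored_topic:
        return 404, ""
    if params.get("hub.verify_token") != expected_token:
        return 404, ""
    challenge = params.get("hub.challenge")
    if challenge is None:
        return 404, ""
    return 200, challenge  # echo the challenge to confirm the subscription
```

Because the function looks the feed up on every call, the same code handles both the initial verification and any later re-verification from the hub.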

4. Write an endpoint to receive pings from the hub as new content is published

Now you’re ready for the pay-out: magically receiving pings from the ether every time the blog you’ve subscribed to has new content! You’ll receive inbound requests to your specified callback URL without any additional query parameters added (i.e. you’ll know it’s a ping and not a verification because there won’t be any hub.mode parameter included). Instead, the new entries of the subscribed feed will be included directly in the POST body of the request, with a request Content-Type of application/atom+xml for Atom feeds and application/rss+xml for RSS feeds. Depending on your programming language of choice, you’ll need to figure out how to extract the raw POST body contents. For instance, in PHP you would fopen the special filename php://input to read it.
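In Python, if your endpoint happens to be a WSGI application (an assumption for this sketch), telling pings from verifications and reading the raw POST body could look like this. read_ping and fake_request are illustrative names only.

```python
import io
from urllib.parse import parse_qs


def read_ping(environ):
    """Return (content_type, raw_body) for a ping, or None for anything else."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    if environ.get("REQUEST_METHOD") != "POST" or "hub.mode" in query:
        return None  # a verification request (step 3), not a content ping
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length)
    # Drop any parameters such as "; charset=utf-8" from the header value.
    content_type = environ.get("CONTENT_TYPE", "").split(";")[0].strip()
    return content_type, body


# A hand-built WSGI environ standing in for a real inbound ping.
fake_request = {
    "REQUEST_METHOD": "POST",
    "QUERY_STRING": "feed_id=123",
    "CONTENT_TYPE": "application/atom+xml",
    "CONTENT_LENGTH": "7",
    "wsgi.input": io.BytesIO(b"<feed/>"),
}
print(read_ping(fake_request))
```

The feed_id query parameter is left alone here; it is your own addition to the callback URL, so its presence doesn’t make the request a verification.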

5. Get the latest content from updated blogs as you receive pings

The ping is really telling you two things: 1) this blog has updated content, and 2) here it is. The advantage of providing the content directly in the ping (a so-called “fat ping”) is that the subscriber doesn’t have to go re-crawl the feed to get the updated content. This is a performance savings, especially when you consider that lots of subscribers may get pings for a new blog post at roughly the same time, and they might otherwise all crawl that blog at once for the new contents (the so-called “thundering herd” problem). It’s also a form of robustness: some blogging systems take a little while to update their feeds when a new post is published (especially large blogging systems that have to propagate changes across multiple data-centers or update caching tiers), so it’s possible you’ll receive a ping before the content is available to crawl directly. For these reasons and more, it’s definitely a best-practice to consume the fat ping directly, rather than just using it as a hint to go crawl the blog again (i.e. treating it as a “light ping”).

That being said, most crawling systems are designed just to poll URLs and look for new data, so it may be easier to start out by taking the “light ping” route. In other words, when you receive a PuSH ping, look up the feed ID from the URL of the request you’re handling, and assuming that feed is still valid, just schedule it to crawl ASAP. That way, you don’t have to change the rest of your crawling infrastructure; you just treat the ping as a hint to crawl now instead of waiting for the next regular polling interval. While sub-optimal, in my experience this works pretty well and is very easy to implement. (It’s certainly a major improvement over just polling with no PuSH support!) If you’re worried about crawling before the new content is in the feed, and you don’t mind giving up a bit of speed, you can schedule your crawler for “in N seconds” instead of ASAP, which in practice will allow a lot of slow-to-update feeds to catch up before you crawl them.
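The “crawl ASAP” versus “crawl in N seconds” choice can be sketched with a tiny in-memory scheduler in Python. This is only an illustration: crawl_queue stands in for whatever scheduling mechanism your crawler already has, and the function names are hypothetical.

```python
import heapq
import time

crawl_queue = []  # min-heap of (due_time, feed_id); stand-in for a real scheduler


def schedule_crawl(feed_id, delay_seconds=0, now=None):
    """Queue a feed for crawling; delay_seconds=0 means ASAP."""
    now = time.time() if now is None else now
    heapq.heappush(crawl_queue, (now + delay_seconds, feed_id))


def pop_due_feeds(now=None):
    """Return the IDs of all feeds whose scheduled time has arrived."""
    now = time.time() if now is None else now
    due = []
    while crawl_queue and crawl_queue[0][0] <= now:
        due.append(heapq.heappop(crawl_queue)[1])
    return due
```

In a ping handler you would call schedule_crawl(feed_id) for the ASAP behavior, or pass a small delay_seconds value to give slow-to-update feeds time to catch up.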

Once you’re ready to handle the fat pings directly, extract the updated feed entries from the POST body of the ping (the payload has essentially the same format as the full feed you’d normally fetch, except it only contains entries for the new content), and ingest them however you normally ingest new blog content. In fact, you can go even further and make PuSH the default way to ingest blog content: change your polling code to act as a “fake PuSH proxy” and emit PuSH-style updates whenever it finds new entries. Then your core feed-ingesting code can just process all your updated entries in the same way, whether they came from a hub or your polling crawlers.
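As a sketch of consuming a fat ping, the following Python function pulls a couple of fields out of each new entry in the POST body. The function name and the choice of fields are assumptions for illustration; in practice you would hand the body to the same feed parser your crawler already uses.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def entries_from_ping(body):
    """Extract a minimal {id, title} dict for each entry in a ping payload."""
    root = ET.fromstring(body)
    if root.tag == ATOM_NS + "feed":
        # Atom payload: entries sit directly under the root feed tag.
        return [
            {"id": e.findtext(ATOM_NS + "id"), "title": e.findtext(ATOM_NS + "title")}
            for e in root.findall(ATOM_NS + "entry")
        ]
    # RSS payload: items sit under channel; fall back to link when guid is absent.
    return [
        {"id": i.findtext("guid") or i.findtext("link"), "title": i.findtext("title")}
        for i in root.findall("channel/item")
    ]
```

The same function could be fed by your polling code as well, which is one simple way to get the single ingestion path described above.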

However you handle the pings, once you find that things are working reliably, you can change the polling interval for PuSH-enabled blogs to be much slower, or even turn it off completely, if you’re not worried about ever missing a ping. In practice, slow polling (e.g. once a day) is probably still a good hedge against the inevitable clogs in the internet’s tubes.

6. Unsubscribe from feeds when they’re deleted from your system

Sometimes users will delete their account on your system or unhook one of their feeds from their account. To be a good citizen, rather than just waiting for the next time the hub sends a subscription verification request to tell it you no longer care about this feed, you should send the hub an unsubscribe request when you know the feed is no longer important to you. The process is identical to subscribing to a feed (as described in steps 2 and 3), except you use “unsubscribe” instead of “subscribe” for the hub.mode values in all cases.
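A short Python sketch of the unsubscribe side follows. The parameter set mirrors the subscribe request with hub.mode switched; the should_confirm helper reflects one possible policy (my assumption, not something the hub requires) of only confirming a verification when it matches what your database says you want.

```python
from urllib.parse import urlencode


def unsubscribe_body(topic, callback, verify_token):
    """Form-encoded POST body for an unsubscribe request to the hub."""
    return urlencode({
        "hub.callback": callback,
        "hub.mode": "unsubscribe",
        "hub.topic": topic,
        "hub.verify": "async",
        "hub.verify_token": verify_token,
    })


def should_confirm(mode, feed_is_active):
    """Decide whether to echo hub.challenge for a verification request."""
    if mode == "subscribe":
        return feed_is_active       # still want updates for this feed
    if mode == "unsubscribe":
        return not feed_is_active   # only confirm once the feed is really gone
    return False
```

With this policy, a forged unsubscribe verification for a feed you still care about is simply refused.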

Testing your implementation

Now that you know all the steps needed to implement PuSH subscriber support, it’s time to test your code in the wild. Probably the easiest way is to hook up that http://pubsubhubbub-example-app.appspot.com/ feed, since you can easily add content to it to test pings, and it’s known to have valid hub-discovery metadata. But you can also practice with any blog that is PuSH-enabled (perhaps your shiny new Google Buzz public posts feed?). In any case, schedule it to be crawled normally, and verify that your crawler correctly extracts the hub-link and self-link and adds the self-link to your feed database.

The first time it finds these links, it should trigger a subscription request. (On subsequent crawls, it shouldn’t try to subscribe again, since the topic URL hasn’t changed.) Verify that you’re sending a request to the hub that includes all the necessary parameters, and verify that it’s sending you back a 202 response. If it’s not working, carefully check that you’re sending all the right parameters.

Next, verify that upon sending a subscription request, you’ll soon get an inbound verification request from the hub. Make sure you detect requests to your callback URL with hub.mode=subscribe, and that you are checking the hub.verify_token value against the value you sent in the subscription request, and then that you’re sending the hub.challenge value as your response body. Unfortunately, it’s usually not easy to inspect the hub directly to confirm that it has properly verified your subscription, but hopefully some hubs will start providing site-specific dashboards to make this process more transparent. In the meantime, the best way to verify that things worked properly is to try making test posts to the blog and looking for incoming pings.

So add a new post on the example blog, or write a real entry on your PuSH-enabled blog of choice, and look in your server logs to make sure a ping came in. Depending on the hub, the ping may come nearly instantaneously or after a few seconds. If you don’t see it after several seconds, something is probably wrong, but try a few posts to make sure you didn’t just miss it. Look at the specific URL that the hub is calling on your site, and verify that it has your feed ID in the URL, and that it does indeed match the feed that just published new content. If you’re using the “light ping” model, check that you scheduled your feed to crawl ASAP. If you’re using the “fat ping” model, check that you correctly ingested the new content that was in the POST body of the ping.

Once everything appears to be working, try un-hooking your test feed (and/or deleting your account) and verify that it triggers you to send an unsubscribe request to the hub, and that you properly handle the subsequent unsubscribe verification request from the hub.

If you’ve gotten this far, congratulations! You are now part of the real-time-web! Your users will thank you for making their content show up more quickly on your site, and the sites that publish those feeds will thank you for not crawling them as often, now that you can just sit back and wait for updates to be PuSH-ed to you. And I and the rest of the community will thank you for supporting open standards for a decentralized social web!

(Thanks to Brett Slatkin for providing feedback on a draft of this post!)

Month: March 2010
