Joseph Smarr » Implementing PubSubHubbub subscriber support: A step-by-step guide

March 2010

Implementing PubSubHubbub subscriber support: A step-by-step guide

One of the last things I did before leaving Plaxo was to implement PubSubHubbub (PuSH) subscriber support, so that any blogs which ping a PuSH hub will show up almost instantly in pulse after being published. It’s easy to do (you don’t even need a library!), and it significantly improves the user experience while simultaneously reducing server load on your site and the sites whose feeds you’re crawling. At the time, I couldn’t find any good tutorials for how to implement PuSH subscriber support (add a comment if you know of any), so here’s how I did it. (Note: depending on your needs, you might find it useful instead to use a third-party service like Gnip to do this.)

My assumption here is that you’ve already got a database of feeds you’re subscribing to, but that you’re currently just polling them all periodically to look for new content. This tutorial will help you “gracefully upgrade” to support PuSH-enabled blogs without rewriting your fundamental polling infrastructure. At the end, I’ll suggest a more radical approach that is probably better overall if you can afford a bigger rewrite of your crawling engine.

The steps to add PuSH subscriber support are as follows:

  1. Identify PuSH-enabled blogs and extract their hub and topic
  2. Lazily subscribe to PuSH-enabled blogs as you discover them
  3. Verify subscription requests from the hub as you make them
  4. Write an endpoint to receive pings from the hub as new content is published
  5. Get the latest content from updated blogs as you receive pings
  6. Unsubscribe from feeds when they’re deleted from your system

1. Identify PuSH-enabled blogs and extract their hub and topic

When crawling a feed normally, you can look for some extra metadata in the XML that tells you this blog is PuSH-enabled. Specifically, you want to look for two links: the “hub” (the URL of the hub that the blog pings every time it has new content, which you in turn communicate with to subscribe and receive pings when new content is published), and the “self” (the canonical URL of the blog you’re subscribing to, which is referred to as the “topic” you’re going to subscribe to from the hub).

A useful test blog to use while building PuSH subscriber support is http://pubsubhubbub-example-app.appspot.com/, since it lets anyone publish new content. If you view source on that page, you’ll notice the standard RSS auto-discovery tag that tells you where to find the blog’s feed:

<link title="PubSubHubbub example app" type="application/atom+xml" rel="alternate" href="http://pubsubhubbub-example-app.appspot.com/feed" />

And if you view source on http://pubsubhubbub-example-app.appspot.com/feed, you’ll see the two PuSH links advertised underneath the root feed tag:

<link type="application/atom+xml" title="PubSubHubbub example app" rel="self" href="http://pubsubhubbub-example-app.appspot.com/feed" />
<link rel="hub" href="http://pubsubhubbub.appspot.com/" />

You can see that the “self” link is the same as the URL of the feed that you’re already using, and the “hub” link is to the free hub being hosted on AppEngine at http://pubsubhubbub.appspot.com/. In both cases, you want to look for a link tag under the root feed tag, match the appropriate rel-value, and then extract the corresponding href-value from that link tag. (Unlike HTML, where a rel-attribute can hold multiple, space-separated values, an Atom link’s rel-attribute holds a single value: either a short registered name such as “self” or a full IRI. See section 4.2.7.2 of the Atom Syndication Format, RFC 4287.) Note that the example above is an Atom feed; in RSS feeds, you generally have to look for atom:link tags under the channel tag under the root rss tag, but the rest is the same.
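As a rough illustration of the extraction described above, here is a minimal Python sketch using only the standard library. The function name find_push_links and the sample feed are my own for illustration; a real crawler would plug the same logic into whatever feed parser it already uses.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def find_push_links(feed_xml):
    """Return (hub_url, self_url) for a feed document; either may be None."""
    root = ET.fromstring(feed_xml)
    # Atom: links sit directly under <feed>. RSS: atom:link sits under <channel>.
    if root.tag == "{%s}feed" % ATOM_NS:
        parent = root
    else:
        parent = root.find("channel")
    found = {"hub": None, "self": None}
    if parent is None:
        return None, None
    for link in parent.findall("{%s}link" % ATOM_NS):
        # Splitting on whitespace is harmless for the usual single-value case.
        rel_values = link.get("rel", "").split()
        for name in found:
            if name in rel_values and found[name] is None:
                found[name] = link.get("href")
    return found["hub"], found["self"]


EXAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>PubSubHubbub example app</title>
  <link type="application/atom+xml" rel="self"
        href="http://pubsubhubbub-example-app.appspot.com/feed" />
  <link rel="hub" href="http://pubsubhubbub.appspot.com/" />
</feed>"""

hub, topic = find_push_links(EXAMPLE_FEED)
print(hub)    # http://pubsubhubbub.appspot.com/
print(topic)  # http://pubsubhubbub-example-app.appspot.com/feed
```

If the function returns None for the hub, the feed is simply not PuSH-enabled and you keep polling it as before.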

Once you have the hub and self links for this blog (assuming the blog is PuSH-enabled), you’ll want to store the self-href (aka the “topic”) with that feed in your database so you’ll know whether you’ve subscribed to it, and, if so, whether the topic has changed since you last subscribed.

2. Lazily subscribe to PuSH-enabled blogs as you discover them

When you’re crawling a feed and you notice it’s PuSH-enabled, check your feed database to see if you’ve got a stored PuSH-topic for that feed, and if so, whether the current topic is the same as your stored value. If you don’t have any stored topic, or if the current topic is different, you’ll want to talk to that blog’s PuSH hub and initiate a subscription so that you can receive real-time updates when new content is published to that blog. By storing the PuSH-topic per-feed, you can effectively “lazily subscribe” to all PuSH-enabled blogs by continuing to regularly poll and crawl them as you currently do, and adding PuSH subscriptions as you find them. This means you don’t have to do any large one-time migration over to PuSH, and you can automatically keep up as more blogs become PuSH-enabled or change their topics over time. (Depending on your crawling infrastructure, you can either initiate subscriptions as soon as you find the relevant tags, or you can insert an asynchronous job to initiate the subscription so that some other part of your system can handle that later without slowing down your crawlers.)
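The lazy-subscribe check can be sketched in a few lines of Python. Everything here is a stand-in: the feed dict plays the role of a row in your feed database, push_topic is a hypothetical column name, and subscribe_jobs is a plain list in place of whatever job queue you use.

```python
def on_feed_crawled(feed, hub_url, topic_url, subscribe_jobs):
    """Queue a subscription if this crawl revealed a new or changed topic.

    `feed` is a dict standing in for a row in the feed database.
    Returns True when a job was queued.
    """
    if hub_url is None or topic_url is None:
        return False  # not PuSH-enabled: keep polling as usual
    if feed.get("push_topic") == topic_url:
        return False  # already subscribed to this exact topic
    subscribe_jobs.append((feed["id"], hub_url, topic_url))
    feed["push_topic"] = topic_url  # remember what we asked for
    return True


feed = {"id": 123, "url": "http://pubsubhubbub-example-app.appspot.com/feed"}
jobs = []
first = on_feed_crawled(feed, "http://pubsubhubbub.appspot.com/", feed["url"], jobs)
second = on_feed_crawled(feed, "http://pubsubhubbub.appspot.com/", feed["url"], jobs)
print(first, second, len(jobs))  # True False 1
```

This sketch records the topic as soon as the job is queued. A stricter variant would wait until the hub has accepted the subscription before storing it, so a failed request gets retried on the next crawl.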

To subscribe to a PuSH-enabled blog, just send an HTTP POST to its hub URL and provide the following POST parameters:

  • hub.callback = [the URL of your endpoint for receiving pings, which we'll build in step 4]
  • hub.mode = subscribe
  • hub.topic = [the self-link / topic of the feed you're subscribing to, which you extracted in step 1]
  • hub.verify = async [means the hub will separately call you back to verify this subscription]
  • hub.verify_token = [a hard-to-guess token associated with this feed, which the hub will echo back to you to prove it's a real subscription verification]

For the hub.callback URL, it’s probably best to include the internal database ID of the feed you’re subscribing to, so it’s easy to look up that feed when you receive future update pings. Depending on your setup, this might be something like http://yoursite.com/push/update?feed_id=123 or http://yoursite.com/push/update/123. Another advantage of this technique is that it makes it relatively hard to guess what the update URL is for an arbitrary blog, in case an evil site wanted to send you fake updates. If you want even more security, you could put some extra token in the URL that’s different per-feed, or you could use the hub.secret mechanism when subscribing, which will cause the hub to send you a signed verification header with every ping, but that’s beyond the scope of this tutorial.

For the hub.verify_token, the simplest thing would just be to pick a secret word (e.g. “MySekritVerifyToken”) and always use that, but an evil blog could use its own hub and quickly discover that secret. So a better idea is to do something like take the HMAC-SHA1 of the topic URL along with some secret salt you keep internally. This way, the hub.verify_token value is feed-specific, but it’s easy to recompute when you receive the verification.
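One way to compute such a token in Python is with the standard hmac module. This is only a sketch: SECRET_SALT and both function names are placeholders of mine, and the salt should of course be a private random value, not the string shown here.

```python
import hashlib
import hmac

# Placeholder: use a long random value kept private to your servers.
SECRET_SALT = b"replace-with-a-private-random-value"


def make_verify_token(topic_url, secret=SECRET_SALT):
    """HMAC-SHA1 of the topic URL, keyed with the internal secret."""
    return hmac.new(secret, topic_url.encode("utf-8"), hashlib.sha1).hexdigest()


def is_valid_verify_token(topic_url, token, secret=SECRET_SALT):
    """Recompute the token for this topic and compare in constant time."""
    expected = make_verify_token(topic_url, secret)
    return hmac.compare_digest(expected.encode("ascii"), token.encode("utf-8"))


token = make_verify_token("http://pubsubhubbub-example-app.appspot.com/feed")
print(len(token))  # 40 hex characters
print(is_valid_verify_token("http://pubsubhubbub-example-app.appspot.com/feed", token))  # True
```

Because the token is derived from the topic, nothing extra has to be stored per feed; you recompute it when the verification request arrives.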

If your subscription request is successful, the hub will respond with an HTTP 202 “Accepted” code, and will then proceed to send you a verification request for this subscription at your specified callback URL.
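Putting the parameters from this step together, the subscription request might be built and sent like this in Python. The function names are mine, the callback URL is the placeholder from the text above, and the verify token is a dummy value.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_subscription_request(hub_url, topic_url, callback_url, verify_token):
    """Build (but do not send) the form-encoded POST for a subscribe request."""
    params = {
        "hub.callback": callback_url,
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.verify": "async",
        "hub.verify_token": verify_token,
    }
    body = urlencode(params).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(hub_url, data=body, headers=headers, method="POST")


def send_subscription_request(request, timeout=10):
    """Send the request; True means the hub answered 202 Accepted."""
    with urlopen(request, timeout=timeout) as response:
        return response.status == 202


request = build_subscription_request(
    "http://pubsubhubbub.appspot.com/",
    "http://pubsubhubbub-example-app.appspot.com/feed",
    "http://yoursite.com/push/update?feed_id=123",
    "dummy-verify-token",
)
print(request.get_method(), request.full_url)
```

Note that urlopen raises urllib.error.HTTPError for 4xx and 5xx responses, so a caller would normally wrap send_subscription_request in a try/except and retry later on failure.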

3. Verify subscription requests from the hub as you make them

Shortly after you send your subscription request to the hub, it will call you back at the hub.callback URL you specified with an HTTP GET request containing the following query parameters:

  • hub.mode = subscribe
  • hub.topic = [the self-link / topic of the URL you requested a subscription for]
  • hub.challenge = [a random string that you have to echo back in the response body to acknowledge verification]
  • hub.verify_token = [the value you sent in hub.verify_token during your subscription request]

Since the endpoint that receives this verification request is the same one that will receive future update pings, your logic has to first look for hub.mode=subscribe. If it’s present, verify that the hub sent the proper hub.verify_token back to you, and then just dump out the hub.challenge value as the response body of your page (with a standard HTTP 200 response code). Now you’re officially subscribed to this feed, and will receive update pings when the blog publishes new content.

Note that hubs may periodically re-verify that you still want a subscription to this feed. So you should make sure that if the hub makes a similar verification request out-of-the-blue in the future, you respond the same way you did the first time, providing you indeed are still interested in that feed. A good way to do this is just to look up the feed every time you get a verification request (remember, you build the feed’s ID into your callback URL), and if you’ve since deleted or otherwise stopped caring about that feed, return an HTTP 404 response instead so the hub will know to stop pinging you with updates.
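The verification logic for this step fits in one small function. The sketch below is framework-neutral Python: params is a dict of the query parameters, feed is whatever your lookup by feed ID returned (None if the feed is gone), and push_topic is a hypothetical field holding the topic you stored in step 1. It returns a status code and response body for your web framework to send.

```python
def handle_verification(params, feed, expected_token):
    """Answer a hub's subscribe verification GET.

    Returns (status_code, response_body).
    """
    if params.get("hub.mode") != "subscribe":
        return 404, ""
    if feed is None:
        return 404, ""  # feed was deleted: tell the hub to stop pinging
    if params.get("hub.topic") != feed["push_topic"]:
        return 404, ""  # not the topic we subscribed to for this feed
    if params.get("hub.verify_token") != expected_token:
        return 404, ""  # not a verification for a request we made
    return 200, params.get("hub.challenge", "")


feed = {"id": 123, "push_topic": "http://pubsubhubbub-example-app.appspot.com/feed"}
params = {
    "hub.mode": "subscribe",
    "hub.topic": "http://pubsubhubbub-example-app.appspot.com/feed",
    "hub.challenge": "abc123",
    "hub.verify_token": "dummy-verify-token",
}
print(handle_verification(params, feed, "dummy-verify-token"))  # (200, 'abc123')
print(handle_verification(params, None, "dummy-verify-token"))  # (404, '')
```

The same function handles periodic re-verification, since the hub’s later requests look just like the first one.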

4. Write an endpoint to receive pings from the hub as new content is published

Now you’re ready for the pay-out–magically receiving pings from the ether every time the blog you’ve subscribed to has new content! You’ll receive inbound requests to your specified callback URL without any additional query parameters added (i.e. you’ll know it’s a ping and not a verification because there won’t be any hub.mode parameter included). Instead, the new entries of the subscribed feed will be included directly in the POST body of the request, with a request Content-Type of application/atom+xml for ATOM feeds and application/rss+xml for RSS feeds. Depending on your programming language of choice, you’ll need to figure out how to extract the raw POST body contents. For instance, in PHP you would fopen the special filename php://input to read it.
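In Python, for example, a WSGI application can tell the two kinds of request apart and read the raw body as follows. This is a sketch with a function name of my own; it assumes a standard WSGI environ dict and simulates one request with an in-memory stream.

```python
import io
from urllib.parse import parse_qs


def classify_callback(environ):
    """Return (kind, content_type, body) for a request to the callback URL."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    if "hub.mode" in query:
        return "verification", None, None  # handled as in step 3
    if environ.get("REQUEST_METHOD") != "POST":
        return "unknown", None, None
    # Read exactly CONTENT_LENGTH bytes of raw POST body.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length)
    return "ping", environ.get("CONTENT_TYPE", ""), body


payload = b"<feed xmlns='http://www.w3.org/2005/Atom'></feed>"
environ = {
    "REQUEST_METHOD": "POST",
    "QUERY_STRING": "feed_id=123",
    "CONTENT_TYPE": "application/atom+xml",
    "CONTENT_LENGTH": str(len(payload)),
    "wsgi.input": io.BytesIO(payload),
}
kind, content_type, body = classify_callback(environ)
print(kind, content_type, len(body))
```

The feed_id query parameter is the one you built into the callback URL yourself, so its presence does not make the request a verification; only hub.mode does.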

5. Get the latest content from updated blogs as you receive pings

The ping is really telling you two things: 1) this blog has updated content, and 2) here it is. The advantage of providing the content directly in the ping (a so-called “fat ping”) is that the subscriber doesn’t have to go re-crawl the feed to get the updated content. This is a performance savings, especially when you consider that lots of subscribers may get pings for a new blog post at roughly the same time, and they might otherwise all crawl that blog at the same time for the new contents (the so-called “thundering herd” problem). It’s also a form of robustness, since some blogging systems take a little while to update their feeds when a new post is published (especially large blogging systems that have to propagate changes across multiple data-centers or update caching tiers), so it’s possible you’ll receive a ping before the content is available to crawl directly. For these reasons and more, it’s definitely a best-practice to consume the fat ping directly, rather than just using it as a hint to go crawl the blog again (i.e. treating it as a “light ping”).

That being said, most crawling systems are designed just to poll URLs and look for new data, so it may be easier to start out by taking the “light ping” route. In other words, when you receive a PuSH ping, look up the feed ID from the URL of the request you’re handling, and assuming that feed is still valid, just schedule it to crawl ASAP. That way, you don’t have to change the rest of your crawling infrastructure; you just treat the ping as a hint to crawl now instead of waiting for the next regular polling interval. While sub-optimal, in my experience this works pretty well and is very easy to implement. (It’s certainly a major improvement over just polling with no PuSH support!) If you’re worried about crawling before the new content is in the feed, and you don’t mind giving up a bit of speed, you can schedule your crawler for “in N seconds” instead of ASAP, which in practice will allow a lot of slow-to-update feeds to catch up before you crawl them.
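Here is a small Python sketch of the “light ping” route with the optional delay. The CrawlSchedule class is a toy in-memory stand-in for your real crawl queue, the 30-second default is an arbitrary choice of mine, and returning 404 for an unknown feed is my assumption (the text above only prescribes 404 for verification requests).

```python
import heapq
import time


class CrawlSchedule:
    """Toy in-memory queue of (due_time, feed_id) pairs."""

    def __init__(self):
        self._heap = []

    def schedule(self, feed_id, delay_seconds=0, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now + delay_seconds, feed_id))

    def due(self, now=None):
        """Pop and return the feed IDs whose crawl time has arrived."""
        now = time.time() if now is None else now
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready


def handle_light_ping(feed_id, known_feed_ids, schedule, delay_seconds=30, now=None):
    """Treat a ping as a hint to crawl soon; returns the HTTP status to send."""
    if feed_id not in known_feed_ids:
        return 404  # assumption: reject pings for feeds we no longer track
    schedule.schedule(feed_id, delay_seconds, now)
    return 200


schedule = CrawlSchedule()
status = handle_light_ping(123, {123, 456}, schedule, delay_seconds=30, now=1000.0)
print(status)                    # 200
print(schedule.due(now=1010.0))  # [] (too early)
print(schedule.due(now=1030.0))  # [123]
```

Passing delay_seconds=0 gives the crawl-ASAP behavior; a larger value trades a little speed for a better chance that the feed has caught up.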

Once you’re ready to handle the fat pings directly, extract the updated feed entries from the POST body of the ping (the payload is essentially an exact version of the full feed you’d normally fetch, except it only contains entries for the new content), and ingest it however you normally ingest new blog content. In fact, you can go even further and make PuSH the default way to ingest blog content–change your polling code to act as a “fake PuSH proxy” and emit PuSH-style updates whenever it finds new entries. Then your core feed-ingesting code can just process all your updated entries in the same way, whether they came from a hub or your polling crawlers.
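As an illustration of consuming a fat ping, the following Python sketch pulls the new entries out of an Atom payload. It handles Atom only and keeps just three fields, both simplifications of mine; the sample payload is made up.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def extract_entries(ping_body):
    """Return a list of dicts, one per <entry> in an Atom fat ping."""
    root = ET.fromstring(ping_body)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        entries.append({
            "id": entry.findtext(ATOM + "id"),
            "title": entry.findtext(ATOM + "title"),
            "updated": entry.findtext(ATOM + "updated"),
        })
    return entries


SAMPLE_PING = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry>
    <id>tag:example.com,2010:post-1</id>
    <title>Hello PuSH</title>
    <updated>2010-03-01T12:00:00Z</updated>
  </entry>
</feed>"""

for item in extract_entries(SAMPLE_PING):
    print(item["id"], item["title"])
```

From here, each dict would go through the same ingestion path as entries found by your polling crawler, ideally de-duplicated by entry ID in case both paths see the same post.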

However you handle the pings, once you find that things are working reliably, you can change the polling interval for PuSH-enabled blogs to be much slower, or even turn it off completely, if you’re not worried about ever missing a ping. In practice, slow polling (e.g. once a day) is probably still a good hedge against the inevitable clogs in the internet’s tubes.

6. Unsubscribe from feeds when they’re deleted from your system

Sometimes users will delete their account on your system or unhook one of their feeds from their account. To be a good citizen, rather than just waiting for the next time the hub sends a subscription verification request to tell it you no longer care about this feed, you should send the hub an unsubscribe request when you know the feed is no longer important to you. The process is identical to subscribing to a feed (as described in steps 2 and 3), except you use “unsubscribe” instead of “subscribe” for the hub.mode values in all cases.
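The unsubscribe side can be sketched in Python along the same lines. Both function names and the feed_is_active flag are illustrative; the idea is that you only confirm an unsubscribe verification for a feed you have actually stopped caring about.

```python
from urllib.parse import urlencode


def build_unsubscribe_body(topic_url, callback_url, verify_token):
    """Form-encoded POST body for an unsubscribe request to the hub."""
    return urlencode({
        "hub.callback": callback_url,
        "hub.mode": "unsubscribe",
        "hub.topic": topic_url,
        "hub.verify": "async",
        "hub.verify_token": verify_token,
    }).encode("ascii")


def handle_unsubscribe_verification(params, feed_is_active, expected_token):
    """Answer the hub's unsubscribe verification GET; returns (status, body)."""
    if params.get("hub.mode") != "unsubscribe":
        return 404, ""
    if feed_is_active:
        return 404, ""  # we still want this feed, so refuse to unsubscribe
    if params.get("hub.verify_token") != expected_token:
        return 404, ""
    return 200, params.get("hub.challenge", "")


body = build_unsubscribe_body(
    "http://pubsubhubbub-example-app.appspot.com/feed",
    "http://yoursite.com/push/update?feed_id=123",
    "dummy-verify-token",
)
params = {"hub.mode": "unsubscribe", "hub.challenge": "xyz",
          "hub.verify_token": "dummy-verify-token"}
print(handle_unsubscribe_verification(params, False, "dummy-verify-token"))  # (200, 'xyz')
```

Since the feed may already be gone from your database when the hub calls back, you need some way to recompute or remember the token for it, which is another reason to derive the token from the topic URL.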

Testing your implementation

Now that you know all the steps needed to implement PuSH subscriber support, it’s time to test your code in the wild. Probably the easiest way is to hook up that http://pubsubhubbub-example-app.appspot.com/ feed, since you can easily add content to it to test pings, and it’s known to have valid hub-discovery metadata. But you can also practice with any blog that is PuSH-enabled (perhaps your shiny new Google Buzz public posts feed?). In any case, schedule it to be crawled normally, and verify that it correctly extracts the hub-link and self-link and adds the self-link to your feed database.

The first time it finds these links, it should trigger a subscription request. (On subsequent crawls, it shouldn’t try to subscribe again, since the topic URL hasn’t changed.) Verify that you’re sending a request to the hub that includes all the necessary parameters, and verify that it’s sending you back a 202 response. If it’s not working, carefully check that you’re sending all the right parameters.

Next, verify that upon sending a subscription request, you’ll soon get an inbound verification request from the hub. Make sure you detect requests to your callback URL with hub.mode=subscribe, and that you are checking the hub.verify_token value against the value you sent in the subscription request, and then that you’re sending the hub.challenge value as your response body. Unfortunately, it’s usually not easy to inspect the hub directly to confirm that it has properly verified your subscription, but hopefully some hubs will start providing site-specific dashboards to make this process more transparent. In the meantime, the best way to verify that things worked properly is to try making test posts to the blog and looking for incoming pings.

So add a new post on the example blog, or write a real entry on your PuSH-enabled blog of choice, and look in your server logs to make sure a ping came in. Depending on the hub, the ping may come nearly instantaneously or after a few seconds. If you don’t see it after several seconds, something is probably wrong, but try a few posts to make sure you didn’t just miss it. Look at the specific URL that the hub is calling on your site, and verify that it has your feed ID in the URL, and that it does indeed match the feed that just published new content. If you’re using the “light ping” model, check that you scheduled your feed to crawl ASAP. If you’re using the “fat ping” model, check that you correctly ingested the new content that was in the POST body of the ping.

Once everything appears to be working, try un-hooking your test feed (and/or deleting your account) and verify that it triggers you to send an unsubscribe request to the hub, and that you properly handle the subsequent unsubscribe verification request from the hub.

If you’ve gotten this far, congratulations! You are now part of the real-time-web! Your users will thank you for making their content show up more quickly on your site, and the sites that publish those feeds will thank you for not crawling them as often, now that you can just sit back and wait for updates to be PuSH-ed to you. And I and the rest of the community will thank you for supporting open standards for a decentralized social web!

(Thanks to Brett Slatkin for providing feedback on a draft of this post!)

  • http://www.ouvre-boite.com Julien

    Joseph, we (superfeedr!) can probably help #Plaxo get any feed in realtime, (PubSubHubbub or not!)… Who should I get in touch with @ Plaxo?

  • http://josephsmarr.com Joseph Smarr

    Julien-I'd ping John McCrea (@johnmccrea or john at plaxo.com) and he can route you appropriately. :)

  • http://twitter.com/robjohnson robjohnson

    And you don't have to worry about re-subscribing either; according to the spec, hubs are required to ask the subscriber if they want to re-subscribe, and that “ask” looks the same as the initial subscription verify.

  • http://www.ouvre-boite.com Julien

    Also, I'd insist on the fact that subscribers should use unique callback urls for each feed (with the inclusion of query params for example). It makes it incredibly easier to debug subscriptions. Also, I wrote a blog post about getting started with PubSubHubbub a few weeks back : http://blog.superfeedr.com/API/pubsubhubbub/get

    It's not as detailed as yours though!

  • http://www.xn--8ws00zhy3a.com/ James Holderness

    May I suggest you update the paragraph where you talk about extracting the “self” and “hub” links in the feed. Parsing the rel attribute for a link in Atom is quite a bit different from HTML (which is essentially what you've described). See section 4.2.7.2 of the Atom Syndication Format (RFC4287).

  • wal

    You mention an interesting point about the “self” link. Did I understand correctly that you subscribe to the self link rather than the feed URL? For example, blogpost blogs tend to have a self link that is different from the feed link. And, LiveJournal feeds don't even have a self link the last time I checked. How do you handle those cases?

  • http://josephsmarr.com Joseph Smarr

    Correct-you must subscribe to the rel=”self” link, which is the official canonical URL for that feed, and is often different than the original feed URL you find, esp on sites like Blogger. I think LJ does also use self-links, but the rule is to fall back on the feed URL itself if it doesn't have a rel=self link in the feed content.

  • tamer212

    great article!
    so you mentioned “hub.verify” is set to async to let the hub know to do the callback separately. what if you set it to “sync”, does this mean you can get a verification right in the response to your request? where in the google docs documentation is this part mentioned?

    thanks.

  • http://josephsmarr.com Joseph Smarr

    No, sending hub.verify=sync just means the hub needs to separately ping you to verify the request *before* it returns a response for your subscribe request. So it basically holds the subscribe connection open while separately pinging you back. Not sure why some subscribers would prefer that, but I think async is a simpler and more robust pattern for most people.

  • tamer212

    well they are both fine, even wordpress, they only support sync even if you ask for async. so basically as a subscriber your system should be ready for both modes.

  • clinicaltrialselect

    Hello Joseph,

    Thank you for taking the time to write this. It has really opened my eyes to some exciting possibilities that could really make a difference in many people's lives.

    I am wondering since we are about to launch a clinical trial screening service if we couldn't use this recipe as a jumping off point to send a back link in response to anyone posting the phrase “clinical trial”? (There are about 100,000 lives at stake each year here in the U.S. that could be spared if we could get people talking to their own doctors about their clinical trial choices. We have no profit motive, just want to get these conversations started and let 100,000 flowers bloom)

    Thanks again for going the extra mile in explaining this.

    Best,

    Etienne Taylor
    CEO
    Clinical Trial Select, Inc.

  • Thomas

    I dont see the reason for this, I mean hell its screwed up a already working system for me…

    now google puts pubsubhubbub in my rss.xml every time I post… and the links break when I include them in a website

  • Ben

    Thanks very much for this write-up, it's helped me a lot.

    I'm using PSHB in my app for a different reason than blogging. I'm using my published feed to display updated contact details data in real time. (ie, a person updates their contact details, and I want my administrators to see the updated data on the page without having to refresh). I've had success publishing and subscribing to my feed, but I'm actually stumped on how to deliver my 'fat-ping' to the screen that my user is looking at.

    I've had no luck searching for this online, and was hoping you might be able to offer some advice. I'm no javascript expert, but I'm surprised there's no examples of this out there.

    I think the process of subscribing and receiving the callbacks is great, however when the hub sends a 'payload' to my server, what is the best way to immediately present that on the screen (in the browser window) to the user? The only thing I can think of is having a polling javascript function that checks if there's new data, but this defeats the purpose doesn't it?

    Hope someone's got some ideas.

  • http://www.ouvre-boite.com Julien

    This doesn't belong to the PubSubHubbub realm :)
    You can do this with stuff like long-polling, where the webpage is constantly connected to the server and can then receive data from there. You can also use stuff like BOSH (with XMPP) or even regular Ajax (but this implies that the page will actually poll the server every X seconds). Good luck!

  • http://www.pikkuvippi.com Pikavippi

    hehe well nice name at least pubbssubbhububbb :D

  • http://www.examcentral.net/pmp/pmp-practice-exam pmp practice test

    thanks for the good insight here, Subscribing will really help. . .thanks

  • Cico3452

    nice
    thank you for this post
    http://www.3d-ohnebrille.net

  • http://www.frivland.com/ Friv

    For what else are you using PSHB? Thanks for this tutorial! Very well written!

  • http://visionfitness.info/ Dave

    Enjoyed reading this thanks!

  • chrisjj2

    J, “A useful test blog to use while building PuSH subscriber support is http://pubsubhubbub-example-app.appspot.com/,” – this link is not working.

  • http://www.ouvre-boite.com Julien

    Indeed, try out http://push-pub.appspot.com which does pretty much the same but works :)

  • chrisjj2

    Looks good -. Thanks J.