Site limbo syndrome

It happens a little more often than I’d like. I build a killer web site for a client; it’s so much better than the site they already have (if they have one at all). I meet with them for their training meeting and show them how to add and edit content. They’re generally happy with how easy it is, and they seem ready to get rocking on their site. And then…

I never hear from them again. And occasionally I’ll go back to the site to see if there’s been any progress on it, but no. And I’ll send them an email or two asking if they need help or if they’re making any progress, and my boss will send the bill for the services which more often than not will get paid, but… the site just won’t progress. It’s in web limbo; fully built, but lacking in content; ready to conquer the world if it just knew what to say.

Sometimes the client comes back and resurrects things to the point where the site is taken live; one such site will likely finally be going live early next week if not sooner. But it seems like maybe as many as a quarter of the sites I build will never see the light of day.

It’s kind of depressing. I’m not really sure if there’s anything I can do about it… Are my training sessions too complicated? Most of the time the clients seem to understand me as I’m doing them… Is it just that the clients find themselves unexpectedly overwhelmed when they come face-to-face with the realization that they will have to do some work on their site, even if that’s what they thought they wanted all along? In my more recent early meetings with clients, before they sign the contract, I’ve been trying to make sure that the clients are absolutely aware of that and its implications; it’s been too soon for results to be conclusive.

Arg.

Fatbluepost

Clearly, I haven’t found much motivation to complete a blog post lately. In my defense, I haven’t been entirely idle. Recently I’ve released two projects related to the excellent Zen starter theme for Drupal: Zenophile lets you create Zen subthemes quickly, and Zen Midnight is a starter theme for themes which will use a light-on-dark color scheme (such as white text on a black background). There’s also been the utilitarian boringness of Menu Clone and Compound Eye (which lets you monitor the status of several Drupal sites at once - or will some day, anyway). Then there’s activities at work; we’ve finally got our own private server (of the virtual sort, for now), so I’ve been having a lot of fun and learning a lot as I’m getting it set up. Linux and Apache? That’s for wussies! We’re running FreeBSD and Lighty! Having full control of HTTP headers and being able to install things like Xcache means that we’re able to serve pages faster than ever before.

On a decidedly less nerdy note, I’ve also created a mini-fansite (it’s only one page long) for one of my favorite new bands, Fatblueman. Check it out here. I’ve included a list of all of their music I could find on YouTube - which is a heck of a lot - so if you’re interested in finding some new candy for your ears, check it out and give some of the vids a listen. You can even download their albums (legally) for free! Give ‘em a try - maybe they’ll become one of your favorite new bands too.

Search results via Ajax & JSON: Yahoo, Live deliver; Google fails it

Warning: This will be the most technical and nerdy article I’ve posted to RGR in quite a while. Those friends of mine who didn’t understand the title of this post might as well go ahead and skip the whole thing.

On Friday, my boss brought up the idea of building a system which we could use to check and track our clients’ sites’ search engine ranking performance for various keywords. He’s been coming up with a lot of these sorts of ideas lately - sometimes I wonder if he realizes he’s only hired one of me - but this idea struck me as particularly interesting. After doing some research on it, I found other tools online which do this task, but they all required payment and registration and other unpleasantness. And yet Google and Yahoo seem to offer their search results via JSON, so how difficult or expensive could this be? So I told my boss this seemed like something we could do, then went home for the weekend. Hey, quittin’ time is quittin’ time…

But once I’d gotten some work done for our clients Monday morning, I got started on it. By lunchtime, I had Yahoo working, and got Google hammered out when I came back. I then looked into Live Search’s options and found them sufficient, so I added support for them too.

Basically, how it works is that you enter a search query in one field and a web address in another. When you submit the form, the system uses Ajax to submit the query to and fetch search results in JSON format from the three engines, then runs a regular expression on the web addresses in the results to see if they match the web address appropriately and reports on the results. It’s pretty slick, and I’d love to release it to the public, but it was made on my boss’s time, so I don’t know if he’d be cool with that…
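Roughly speaking, the matching step boils down to something like the following. This is a simplified sketch rather than the actual code: the names buildMatcher and findRank, and the rule of ignoring the scheme and a leading "www.", are assumptions made up for illustration.

```javascript
// Hypothetical helper: turn a site address into a regular expression that
// matches result URLs on that host, with or without "http(s)://" and "www.".
function buildMatcher(siteAddress) {
  const host = siteAddress
    .replace(/^https?:\/\//i, '') // drop the scheme, if any
    .replace(/^www\./i, '')       // drop a leading "www."
    .replace(/\/.*$/, '');        // drop any path
  // Escape characters that have a special meaning in a regular expression.
  const escaped = host.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^https?://(www\\.)?' + escaped + '(/|$)', 'i');
}

// Return the 1-based rank of the first matching URL, or 0 if none matches.
function findRank(resultUrls, siteAddress) {
  const matcher = buildMatcher(siteAddress);
  const index = resultUrls.findIndex(url => matcher.test(url));
  return index + 1;
}
```

Given the list of result URLs pulled out of an engine's JSON, findRank(urls, 'finetuneyou.com') reports where (or whether) the site shows up.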

Anyway, it was interesting to play with the differences between the three major engines with regards to their support for all this stuff. Long story short: Yahoo and Live Search were great to work with, but Google’s solution just ain’t cuttin’ it. Let’s go into more depth, shall we?

The XSS problem

I started with Yahoo since they seemed the most developer-friendly, for reasons I’ll get into later. My first problem with getting their feeds was… getting their feeds. When I just tried using jQuery’s standard $.ajax() function, I got a cryptic error about Yahoo’s address being illegal or something like that. I kept checking the format of the address I was using for the request, but I couldn’t find anything wrong with it… After doing a bit of searching in the manual sense, I found out that this is just the browsers enforcing the same-origin policy: they’re paranoid about letting a page make Ajax requests to “foreign” servers.

It turns out there’s a workaround, though. Instead of doing a standard Ajax call, what you actually do is inject a new <script> tag into your DOM with an SRC attribute which requests a script on the search service via GET, with a variable specifying a callback function. The service then returns a script which calls the callback with the JSON results. In practice, it works something like this: You add

<script src="http://search.example.com/search?q=searchquery&callback=myCallback"></script>

…to your DOM, and the script that the search service returns looks like:

myCallback({/* the search results as a JS object */});

…which then gets executed. This is a Grade AAA Prime Hack, 100% Certified. But it works and is supported by all three engines equally well.
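In script form, the trick might look something like this minimal sketch. The function names buildJsonpUrl and requestJsonp are made up for illustration, and I'm assuming the service takes its callback name in a parameter called "callback", as in the example above.

```javascript
// Build the request URL, URL-encoding each parameter name and value.
function buildJsonpUrl(baseUrl, params, callbackName) {
  const pairs = Object.keys(params).map(
    key => encodeURIComponent(key) + '=' + encodeURIComponent(params[key])
  );
  pairs.push('callback=' + encodeURIComponent(callbackName));
  return baseUrl + '?' + pairs.join('&');
}

// Inject a <script> element whose src points at the service. `doc` is
// normally the browser's `document`; it is passed in as an argument so the
// function doesn't depend on any globals.
function requestJsonp(doc, url) {
  const script = doc.createElement('script');
  script.src = url;
  doc.getElementsByTagName('head')[0].appendChild(script);
  return script;
}
```

In a browser, you'd define a global myCallback function first, then call requestJsonp(document, buildJsonpUrl('http://search.example.com/search', { q: 'searchquery' }, 'myCallback')). jQuery can also handle this for you if you pass dataType: 'jsonp' to $.ajax().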

Both Yahoo and Live Search are capable of offering results in XML format instead of JSON with the change of a single query parameter - I bet you can figure it out. But, guys… XML is a language for marking up documents, and JSON is a system for serializing data. And we’re working with data. Sorry, but the “let’s use XML for friggin’ everything!” crew annoy me.

Anyway. Let’s look at the services individually.

Yahoo!

Yahoo seems to really be putting a lot of effort into making their various services developer-friendly, and it shows. Check out the Everything YDN page on Yahoo’s Yahoo Developer Network site and check out all the stuff you can play with!

Let’s take a look at a GET query to Yahoo’s servers. Note that I’m going to leave the URLs in these examples unencoded for easier readability; they won’t actually work until you run encodeURIComponent() or something on them.

http://query.yahooapis.com/v1/public/yql?format=json&callback=myCallback&q='select * from search.web(100) where query = "bananas"'

Woah, what the crap? Is that SQL? Nope, that’s Yahoo! Query Language, an SQL-inspired language for querying Yahoo services - including Flickr, Delicious, and so on in addition to just standard web search. Like SQL, you can “select” only certain “fields” from the results, and you can even do WHERE clauses to a certain extent. No sorting, though. It’s pretty trippy. Try the interactive console for some ideas of what it can do.

Notice the parenthesized (100) after search.web? That’s where we tell Yahoo how many results we want back. As far as I can tell, there’s no hard limit to this… I once upped it to 1000, and Yahoo dutifully gave me 1000 results, which is an order of magnitude more than our project really needs. I didn’t bother asking for more, but it seemed like the system was ready to give me more if I asked. Wow.
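Putting the YQL statement and the encoding step together might look like this sketch. The function name buildYqlUrl is made up, I'm assuming the search terms contain no double quotes (which would break the YQL string), and I've left out the single quotes wrapped around the q value in the example URL above.

```javascript
// Build an encoded YQL request URL for a web search.
// resultCount is the number that goes in the parentheses after search.web.
function buildYqlUrl(searchTerms, resultCount, callbackName) {
  const yql = 'select * from search.web(' + resultCount + ')' +
    ' where query = "' + searchTerms + '"';
  return 'http://query.yahooapis.com/v1/public/yql' +
    '?format=json' +
    '&callback=' + encodeURIComponent(callbackName) +
    '&q=' + encodeURIComponent(yql);
}
```

Calling buildYqlUrl('bananas', 100, 'myCallback') gives a URL that's safe to use as a script's src; spaces, equals signs and double quotes in the statement all come out percent-encoded.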

Yahoo’s response looks something like this:

myCallback({
  "query": {
    "count": "10",
    "created": "2009-03-31T05:21:37Z",
    "lang": "en-US",
    "updated": "2009-03-31T05:21:37Z",
    "uri": "http://query.yahooapis.com/v1/yql?q=select+*+from+search.web%2810%29+where+query+%3D+%22bananas%22",
    "diagnostics": {
      "publiclyCallable": "true",
      "url": {
        "execution-time": "285",
        "content": "http://boss.yahooapis.com/ysearch/web/v1/bananas?format=xml&start=0&count=10"
      },
      "user-time": "287",
      "service-time": "285",
      "build-version": "911"
    },
    "results": {
      "result": [
        {
          "abstract": "Information from Wikipedia on this fruit, including its description, world trade, <b>...</b> <b>Bananas</b> are a valuable source of vitamin B6, vitamin C, and potassium. <b>...</b>",
          "clickurl": "http://lrd.yahooapis.com/_ylc=X3oDMTQ4amI4Z25zBF9TAzIwMjMxNTI3MDIEYXBwaWQDb0pfTWdwbklrWW5CMWhTZnFUZEd5TkouTXNxZlNMQmkEY2xpZW50A2Jvc3MEc2VydmljZQNCT1NTBHNsawN0aXRsZQRzcmNwdmlkA3QxM1lUVWdlQXUyM1JXRVZyVEpybXdzS1N6N3VQVW5ScUdFQUFrUTM-/SIG=118hrpqt5/**http%3A//en.wikipedia.org/wiki/Banana",
          "date": "2009/03/19",
          "dispurl": "<b>en.wikipedia.org</b>/wiki/Banana",
          "size": "140417",
          "title": "Banana - Wikipedia, the free encyclopedia",
          "url": "http://en.wikipedia.org/wiki/Banana"
        },
        {
          "abstract": "<b>bananas</b>, fruit, healthy <b>...</b> has proved that just two <b>bananas</b> provide enough energy for a <b>...</b> This is because <b>bananas</b> contain tryptophan, one of the twenty <b>...</b>",
          "clickurl": "http://lrd.yahooapis.com/_ylc=X3oDMTQ4amI4Z25zBF9TAzIwMjMxNTI3MDIEYXBwaWQDb0pfTWdwbklrWW5CMWhTZnFUZEd5TkouTXNxZlNMQmkEY2xpZW50A2Jvc3MEc2VydmljZQNCT1NTBHNsawN0aXRsZQRzcmNwdmlkA3QxM1lUVWdlQXUyM1JXRVZyVEpybXdzS1N6N3VQVW5ScUdFQUFrUTM-/SIG=11c59h9ug/**http%3A//www.finetuneyou.com/Bananas.html",
          "date": "2009/03/22",
          "dispurl": "www.<b>finetuneyou.com</b>/<b>Bananas</b>.html",
          "size": "16271",
          "title": "<b>Bananas</b>",
          "url": "http://www.finetuneyou.com/Bananas.html"
        },
        /* …snip… */
      ]
    }
  }
});

Ah, it’s glorious. Nicely formatted, with all that meta-info… Well, both query.uri and query.diagnostics.url.content are wrong, but oh well, close enough.

Live Search

Live Search, which is what Microsoft is calling its search service this week, was surprisingly forward with its results as well. To read up on Microsoft’s documentation for this, start here. Unlike the other two services, you have to sign up for an API key before you can even make some test queries against the service, but doing so is free, quick and relatively painless. I already have a “Live ID” thanks to my Xbox Live subscription, so I didn’t even have to create a new account. A query looks like this:

http://api.search.live.net/json.aspx?Sources=web&Web.Count=50&JsonType=callback&JsonCallback=myCallback&AppId=0123456789ABCDEF&Query=banana

You can probably guess that I faked in that AppId. (Hey, I don’t want you associatin’ my good ID with whatever sicko queries you’re going to be makin’.) The number of results that we can fetch is set by the Web.Count parameter; through trial and error, I found that it seems to max out at fifty, which was sufficient for our task. (If you need more, Live Search lets you specify an offset parameter to fetch the next “page” of results.) Also note the use of TitleCase all over the place; not only in the query, but as you’re about to see, in the response as well. On a one-to-ten scale of annoyingness, that’s about a four.

if(typeof myCallback == 'function') myCallback({
  "SearchResponse": {
    "Version": "2.1",
    "Query": {
      "SearchTerms":"banana"
    },
    "Web": {
      "Total": 29500000,
      "Offset": 0,
      "Results": [
        {
          "Title": "Banana - Wikipedia, the free encyclopedia",
          "Description": "Banana is the common name for a type of fruit and also the herbaceous plants of the genus Musa which produce this commonly eaten fruit. They are native to the tropical region of ... ",
          "Url": "http:\/\/en.wikipedia.org\/wiki\/Banana",
          "CacheUrl": "http:\/\/cc.msnscache.com\/cache.aspx?q=banana&d=75747133304431&w=5ffa56e8,3e149266",
          "DisplayUrl": "http:\/\/en.wikipedia.org\/wiki\/Banana",
          "DateTime": "2009-03-27T12:31:49Z"
        },
        {
          "Title": "Guide to Bananas - History - Recipes - Nutrition - Banana.com",
          "Description": "Complete Guide to Bananas features the history of bananas, banana recipes, the purchase and storage of bananas, how to grow bananas, medicinal uses of bananas, the nutritional ... ",
          "Url": "http:\/\/www.banana.com\/",
          "CacheUrl": "http:\/\/cc.msnscache.com\/cache.aspx?q=banana&d=75708296684832&w=14416f13,fb67564f",
          "DisplayUrl": "http:\/\/www.banana.com\/",
          "DateTime": "2009-03-22T05:56:53Z"
        },
        /* …snip… */
      ]
    }
  }
} /* pageview_candidate */);

Live Search’s results do something unusual in checking for the existence of the callback function before calling it. I’m not sure I like that - if something goes wrong, raising an exception is often better than failing silently. Hmm.
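To see the difference, here's a tiny sketch that runs two response bodies, one guarded and one not, when the callback doesn't exist. The helper runResponse and the callback name missingCallbackForDemo are made up for the demonstration; new Function() stands in for the browser executing the returned script.

```javascript
// Execute a response body as a script and report what happened.
function runResponse(body) {
  try {
    new Function(body)();
    return 'ok';
  } catch (e) {
    return e.name;
  }
}

// No function named missingCallbackForDemo is defined anywhere.
const unguarded = runResponse('missingCallbackForDemo({});');
const guarded = runResponse(
  "if (typeof missingCallbackForDemo == 'function') missingCallbackForDemo({});"
);

console.log(unguarded); // the unguarded call throws
console.log(guarded);   // the guarded call quietly does nothing
```

With the unguarded style you at least get an error in the console telling you the callback name is wrong; with the guard, a typo in the callback name just means nothing ever happens.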

Google

Google. Google Google Google Google Google… tsk tsk tsk.

Whereas Yahoo is gloriously generous with the data it’s providing us, Google seems downright stingy. Most of their “AJAX Search API” documentation is geared more around the idea of drawing a pretty little search form and pretty little search results on the page, not providing raw data to work with - for info on getting that, you have to read the section annoyingly titled “Flash and other non-Javascript [sic] Environments”, even if you really are working entirely in JavaScript. Additionally, they “ask, but do not require, that each request contains a valid API Key” without providing any information as to just how that API key should be passed to the server. In the data passed back, there’s no explicit search result offset value as there is with the other services’ data; you can find it doing simple arithmetic with other values, but it’s still annoying that you have to do it at all. They also make a lot of demands about preserving the Google branding and such when the results are displayed (which I gloriously ignored since technically we’re not displaying results and really only a couple people in the office are ever going to use this anyway… Perhaps the other services make demands like this too, but are less obnoxious about them).

Worst of all… the number of results you can fetch at once maxes out at eight. Eight! To work around this limitation, I scripted the system to check through the first page of results for a match, and if none is found, to get the next eight, and so on, up to eight times (sixty-four results). The result is that querying Google takes up to eight connections to Google’s server, whereas the rest only take one (possibly two in the case of Live Search if the boss decides the first fifty results aren’t enough). Fail!
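The page-by-page workaround can be sketched like this. It's a simplified outline, not the real code: fetchPage(start, callback) is a hypothetical function that fetches the page of results beginning at offset start and hands an array of result URLs to the callback, and isMatch(url) is a hypothetical test for whether a URL belongs to the client's site.

```javascript
const PAGE_SIZE = 8; // results per request
const MAX_PAGES = 8; // 8 pages x 8 results = 64 results at most

// Walk through the result pages until a match turns up or we run out.
// Calls done(rank, requests): rank is 1-based (0 when nothing matched) and
// requests is how many page fetches it took.
function findGoogleRank(fetchPage, isMatch, done) {
  function tryPage(page) {
    if (page >= MAX_PAGES) {
      done(0, page);
      return;
    }
    const start = page * PAGE_SIZE;
    fetchPage(start, function (urls) {
      const index = urls.findIndex(isMatch);
      if (index !== -1) {
        done(start + index + 1, page + 1); // found it
      } else if (urls.length < PAGE_SIZE) {
        done(0, page + 1);                 // a short page means no more results
      } else {
        tryPage(page + 1);                 // keep digging
      }
    });
  }
  tryPage(0);
}
```

The callback style is there because each page has to arrive via its own script injection; the next request can't start until the previous response has called back.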

Well, anyway. A query:

http://ajax.googleapis.com/ajax/services/search/web?v=1.0&callback=myCallback&rsz=large&q=bananas

Simple enough. The “rsz” parameter is what tells the servers to send us eight results - the default is only four!

myCallback({
  "responseData": {
    "results": [
      {
        "GsearchResultClass": "GwebSearch",
        "unescapedUrl": "http://en.wikipedia.org/wiki/Banana",
        "url": "http://en.wikipedia.org/wiki/Banana",
        "visibleUrl": "en.wikipedia.org",
        "cacheUrl": "http://www.google.com/search?q\u003dcache:Gdi1ltWHn3UJ:en.wikipedia.org",
        "title": "\u003cb\u003eBanana\u003c/b\u003e - Wikipedia, the free encyclopedia",
        "titleNoFormatting": "Banana - Wikipedia, the free encyclopedia",
        "content": "\u003cb\u003eBanana\u003c/b\u003e is the common name for a type of fruit and also the herbaceous plants of   the genus Musa which produce this commonly eaten fruit. \u003cb\u003e...\u003c/b\u003e"
      },
      {
        "GsearchResultClass": "GwebSearch",
        "unescapedUrl": "http://www.bananasinc.org/",
        "url": "http://www.bananasinc.org/",
        "visibleUrl": "www.bananasinc.org",
        "cacheUrl": "http://www.google.com/search?q\u003dcache:paffpacthUcJ:www.bananasinc.org",
        "title": "\u003cb\u003eBANANAS\u003c/b\u003e Home Page",
        "titleNoFormatting": "BANANAS Home Page",
        "content": "\u003cb\u003eBANANAS\u003c/b\u003e specializes in childcare, daycare \u0026amp; babysitting referrals for parents   and childcare providers in Alameda County, California."
      },
      /* …snip… */
    ],
    "cursor": {
      "pages": [
        {
          "start": "0",
          "label": 1
        },
        {
          "start": "8",
          "label": 2
        },
        /* …snip… */
      ],
      "estimatedResultCount": "16400000",
      "currentPageIndex": 0,
      "moreResultsUrl": "http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8\u0026source\u003duds\u0026start\u003d0\u0026hl\u003den\u0026q\u003dbananas"
    }
  },
  "responseDetails": null,
  "responseStatus": 200
});

You can see that Google provides this curious responseData.cursor.pages array which I guess is supposed to be used to build a “Goooooooogle”-like pager for the results. It seems useless, but it sort of comes in handy when a URL matches and I have to calculate parseInt(json.responseData.cursor.pages[json.responseData.cursor.currentPageIndex].start) + i + 1 to find out which result it was. Barf.
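Wrapped up in a function, that calculation looks something like this. The function name googleRank is made up; i is the zero-based index of the matching result within the current page, and the explicit radix of 10 is a small safety addition on top of the expression above.

```javascript
// Work out the overall 1-based rank of the i-th result on the current page.
function googleRank(json, i) {
  const cursor = json.responseData.cursor;
  // "start" arrives as a string, so convert it before doing arithmetic.
  const start = parseInt(cursor.pages[cursor.currentPageIndex].start, 10);
  return start + i + 1;
}
```

So if the match is the third result (i = 2) on the second page (start = "8"), the overall rank is 8 + 2 + 1 = 11.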

So well done to Microsoft and especially Yahoo for making their search data so accessible like this. The potential for building some really great “mash-up”-style apps with search result data is nearly limitless. I know that Google is the go-to search engine for everyone from n00bs to l33ts, but I think developers really should take a second look at what Yahoo offers for them - they’re really doing some great work in terms of developer outreach. I didn’t look into it as deeply, but it seems Microsoft has made some great steps in that direction too.

But all shame upon Google for providing poorly-formatted, restriction-heavy data. Clearly their focus is not upon us, the developers itching to use their data in a sweet new web app. They haven’t had anything to fear from the competition in a while, but if they don’t watch their back, it may come back to bite them…

More deductions, less stress: TurboTax delivers

While searching for help on a vexing programming problem I was having earlier today, Yahoo threw me an ad for Intuit’s TurboTax online service, promising to let me file my taxes for free. I knew the hook of these services - your federal taxes are free, but they’ll charge you a fee to transfer over your data and file your state taxes - but, since I was already ready for a distraction from my programming problem at that point, I took the bait and clicked the ad.

It must be said that I hate doing my taxes. Not so much to see how much the government has leeched out of me - though of course that blows too - but because it involves getting together papers from one place and papers from another place and putting them all together and copying numbers from one piece of paper to another… Paperwork. Empty formality. It’s not my style.

So I played around with TurboTax, seeing if it would be a better experience than filling out a 1024ASDF form. At the end, I’d have to say that it was, and that I was generally satisfied with the experience.

TurboTax works through a series of screens with various questions, most of which accept a yes or no answer. You’ll also have to copy over values from your W-2 and other such forms, but the system does all the math for you. The system will ask you questions about your situation first and only ask you for your name, address and such near the end, which I think is an interesting choice psychologically… Thinking about it, I probably would have been more dissuaded from trying the service if it had asked me for those things up front.

I started out filling forms on the free plan, but, of course, eventually I got the pitch for an upsell. Would I like to use the “Basic Plus” plan for an extra $15 and have it maximize my deductions or some such language? At this point, I still hadn’t given Intuit my credit card number, so I accepted the offer to see if it would suggest anything I missed. Indeed, it did - specifically, the interest I paid on my student loan (which I finally paid off just a couple weeks ago) was deductible, something I didn’t know I could deduct for the past three tax years I’ve had since graduating. Damn! I was also able to deduct the tuition I paid for the classes I took at the local community college, something else I wouldn’t have considered. In the end, my tax burden was lowered by a couple dozen dollars - enough for the premium service to pay for itself, if not spectacularly.

And then there was the pitch I knew was coming - file your state taxes too for $35. Okay, fine. You’ve got a deal.

In the end, it turns out I wasn’t able to file electronically, because apparently they need some security number that was on my 2007 return statement or something like that - I don’t know, exactly. Just that it involved paperwork. So instead TurboTax created a multi-page PDF of the various tax forms with values already filled in, ready to print out and mail along with my federal and state ransoms (no return for me).

In the end, I paid $50 for something I could have done for free - for free, and not enjoyed it one bit. And I know for sure that TurboTax’s suggestions for deductions saved me a few bucks - not quite $50, but the resultant decreased tax fee plus the reduced stress of not having to do the paperwork manually makes me believe I got my money’s worth. I’ll definitely look to use this tool next year, and for those still procrastinating on their taxes, I’d recommend giving TurboTax a look, especially if you hate paperwork too.

Now if you all would have just voted for Huckabee in ‘08, maybe 2009 would have been the year of the FairTax and all this stupid paperwork would be just a bad memory…

The fallacy of trust

I’ve recently been reading Absolute FreeBSD: The Complete Guide to FreeBSD by Michael W Lucas. The book is packed with info about using FreeBSD, mostly from a server perspective; it has a lot of information related to keeping server boxes secure (as it should). In a footnote in Chapter 9, Lucas provides the web address to an article which is no longer available at that address; Archive.org saves the day with a cached version of the article.

Entitled “Reflections on Trusting Trust,” it’s a transcript of a speech given by Ken Thompson, one of the graybeards behind Unix as well as a few other neat technologies including UTF-8. It describes a security breach involving an operating system’s compiler - the program which takes program source code and turns it into an executable program. The speech goes into detail, but to sum it up, the breach works like this.

The attacker modifies the source code of a compiler, adding two new instructions:

  1. If the compiler is compiling the Unix login program, add code to it so that, when the password provided is a certain password, it allows the user full access rights to the computer. Otherwise, behave as normal.

  2. If the compiler is compiling the compiler (another copy of itself), add these two new instructions to the compiler.

The attacker then compiles the code to produce a tainted compiler, then removes the two instructions above from the compiler’s source code to cover their tracks. But it doesn’t matter, because from now on, any copies of the login or compiler programs the compiler creates, or that a compiler created by the compiler creates, etc, will be “tainted.” If the compiler is then distributed in binary form to a wide number of systems - say, as an operating system release - then you’ve suddenly got a wide range of systems out in the wild which one hacker can gain root access to with a single password.
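The two rules above can be acted out with a toy model. Nothing here is real compiler code and none of it comes from Thompson's speech: "source code" is just a plain object, a "compiler" is a function that turns source into something runnable, and the names makeCompiler, hasBackdoorRules and BACKDOOR_PASSWORD are all invented for the illustration.

```javascript
const BACKDOOR_PASSWORD = 'letmein'; // hypothetical secret known to the attacker

// Create a toy compiler. A tainted one carries the two extra rules in its
// "binary", whatever the source it is later given says.
function makeCompiler(tainted) {
  return function compile(source) {
    if (source.program === 'compiler') {
      // Rule 2: a tainted compiler re-adds the rules even to clean source.
      // (An honest compiler only produces a tainted one if the rules are
      // actually present in the source it is given.)
      return makeCompiler(tainted || source.hasBackdoorRules === true);
    }
    if (source.program === 'login') {
      return function login(password) {
        // Rule 1: a tainted compiler slips in an extra accepted password.
        if (tainted && password === BACKDOOR_PASSWORD) return true;
        return password === source.realPassword;
      };
    }
    throw new Error('unknown program: ' + source.program);
  };
}

const honestCompiler = makeCompiler(false);

// The attacker builds once from modified source, then restores clean source.
const attackerBuild = honestCompiler({ program: 'compiler', hasBackdoorRules: true });
const cleanSource = { program: 'compiler', hasBackdoorRules: false };

// Everyone downstream rebuilds from the clean source and still gets taint.
const rebuiltCompiler = attackerBuild(cleanSource);
const taintedLogin = rebuiltCompiler({ program: 'login', realPassword: 'hunter2' });
const cleanLogin = honestCompiler(cleanSource)({ program: 'login', realPassword: 'hunter2' });
```

Here taintedLogin accepts both the real password and the attacker's, even though neither the login source nor the compiler source it was built from mentions the backdoor; cleanLogin, built by a compiler with an honest lineage, accepts only the real one.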

Once I wrapped my head around how the attack works, I was struck by both its simplicity and its practicality. Who’s to say such an attack isn’t already happening, really? Maybe not by a malicious hacker, but by a government interested in keeping tabs on its citizens and/or international neighbors…

Or think of other instructions which could be added to the list. If the compiler is compiling PHP or some other interpreter often used for the deployment of web sites, it could add instructions that, whenever the system accepts a number which looks like a credit card number, it emails that number off to the hacker. This would obviously be a huge breach for online shopping sites, and one that, I imagine, they’d have a very hard time tracking down themselves.

The gist is that it’s impossible to trust any code that you didn’t write yourself. But, of course, it’s impractical to write an entire functioning computer’s code all by yourself, end to end, and would probably cause more problems than it would fix anyway, since often the bugs and security issues you’ve created which are obvious to others can be easily skipped over or “invisible” to yourself. At some point, you just have to trust - or at least hope - that someone else’s code will be safe and sane.

If I were the paranoid type, I might even lose sleep over this…


About RGR

Ray Gun Robot is the personal site of Garrett Albright, a fairly decent web developer living in northern California. Find out more about me or check out some projects I’ve worked on.