Something Vast This Way Comes

The Vast.com (developer) Preview is finally available! If you’ve been wondering what we’ve been up to, here it is, in a nutshell – we are building a search service that extracts classified ads from across the web, structures them, and then makes them available via an open REST API for commercial and non-commercial uses.

A little more detail:

– We are crawling the web and large parts of the blogosphere with a general crawler, similar to the ones operated by Yahoo!, Google, Ask, MSN, and Gigablast.

– The crawler activates forms, and digs deep to find even dynamic data (although it certainly doesn’t fill in any logins and passwords)

– We automatically recognize classifieds listings – currently cars for sale, job postings, and personals profiles – and extract and normalize the surrounding metadata (make, model, price, mileage, salary, location, title, age, gender, etc.).
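To give a feel for what "normalize" means here, a minimal sketch in Python follows. It is not Vast's actual extraction code. The function names and the regular expressions are assumptions made up for illustration, and they handle only simple US-style price and mileage strings.

```python
import re

def normalize_price(text):
    """Return the first dollar amount found in text as an int, or None."""
    match = re.search(r"\$\s*([\d,]+)", text)
    if not match:
        return None
    digits = match.group(1).replace(",", "")
    return int(digits) if digits else None

def normalize_mileage(text):
    """Return a mileage figure such as '82,300 mi' or '45k miles' as an int, or None."""
    match = re.search(r"([\d,]+)\s*(k)?\s*(?:miles|mi)\b", text, re.IGNORECASE)
    if not match:
        return None
    digits = match.group(1).replace(",", "")
    if not digits:
        return None
    value = int(digits)
    # A trailing "k" means thousands, e.g. "45k miles".
    return value * 1000 if match.group(2) else value

def normalize_listing(raw_text):
    """Hypothetical helper: pull a few structured fields out of free-form ad text."""
    return {
        "price": normalize_price(raw_text),
        "mileage": normalize_mileage(raw_text),
    }
```

For example, `normalize_listing("2003 Honda Civic, 82,300 mi., $7,995")` yields a price of 7995 and a mileage of 82300. Real-world ads are far messier than this, which is why recognition at scale is the hard part.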

Currently, we have some of the largest databases anywhere, with over 15 million classified listings across these three categories, automatically extracted and structured with no human oversight, from nearly 50,000 web sites and blogs. (We actually crawled many, many times that number, but these are just the sites that have yielded results to date.)

If you are an end-user, you should be able to search for that hard-to-find listing without having to visit hundreds of sites, and compare cross-site results, with images, sorting, and statistics.

If you are a web-site owner or web developer, we’re offering a no-hassle API to show this data to your visitors, or to mash it up to your heart’s content. You can use it to build a huge destination site or an interesting application, or to supplement the content and listings that you have today. You CAN use it for commercial purposes, and as long as it’s being shown to real end users, there’s NO LIMIT on the number of queries. Everything you see on the site is built on our API, so you should be able to replicate Vast.com on your own site or blog.

If you have a classifieds site or a blog and would like your ads to be included in our results, you shouldn’t have to do anything. Just post like you normally would, and we’ll find you. If we’re not getting your results or not getting them all, drop us a note at help – at – vast – dot – com and we’ll try and fix it.

We’re going to keep this site and the API as open as possible, and like a good net citizen, link directly back to the results. We don’t compete with the people that we crawl by taking direct listings. We don’t rely on explicit tagging. And we do an enormous amount of de-duplication and spam filtering to keep the results clean.
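As a rough illustration of what de-duplication across sites can look like, here is a minimal sketch in Python. It is not the method Vast uses. The field names (`title`, `price`, `location`) and the exact-match fingerprint are assumptions for the example, and a production system would need fuzzier matching.

```python
import hashlib
import re

def fingerprint(listing):
    """Hash a canonical form of the fields that identify a listing."""
    text = " ".join(str(listing.get(key, "")) for key in ("title", "price", "location"))
    # Lowercase and collapse punctuation and whitespace so trivial differences vanish.
    canonical = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def dedupe(listings):
    """Keep the first listing seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for listing in listings:
        fp = fingerprint(listing)
        if fp not in seen:
            seen.add(fp)
            unique.append(listing)
    return unique
```

Two copies of the same ad posted on different sites, differing only in capitalization or punctuation, collapse to a single result, while the source URL of the first copy is kept so the result can still link back.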

Of course, this is a search service, not a listing service, so you can expect that some spam and misclassified results will sneak through. Some links will break due to changes, expirations, and finicky databases that were not designed to be “deep crawled.” In those cases, the cache is your friend. There are also rivers of pornographic content that have to be filtered out, and occasionally we miss some. Please help out by reporting bad results using the links next to each result.

We will be adding more sources, better crawling, improved classification, and many more categories over time – this is just a start. We want to support the web community that wants to take highly-structured content and build applications on top of these massive data flows. When we start making revenue through syndicating this data, we will share it with the developers and sites distributing it via the API.

What more would people like to see? How can we help or improve?

Update: Some coverage of the launch and reviews from TechCrunch, Paul Kedrosky, Peter Rip, and CNet.

Dare Microsoft Kill Google? (updated)

“Say, that’s a nice ad business you got there. It’d be a real shame if something were to happen to it…”

IE7_Privacy.jpg

Sneak peek of IE7 Dialog

Ok, so I Photoshopped this. But, when Microsoft decides that it doesn’t want Google’s revenue stream as much as it wants Google gone, why wouldn’t it do this? And what’s wrong with helping consumers filter out unwanted content on the Internet? 

Update – I’ve created a monster. Abhishek ran off and wrote a Firefox extension to do this. Then again, Customize Google has been around for a while. Actually, this is hitting Mozilla, not Google. You know the Firefox default home page with Google search inside? Well, rumor has it that all of the ad revenue from it goes to the Mozilla Foundation – that’s over a billion dollars to fund their fight against IE 7!

Update 2 – IE 7 breaks AdSense! (Hat tip to Ram).

Fix the Search Interface First

Barry Diller wants Ask.com to grow market share. I’m sure there’s lots to be done on improving distribution deals, the crawler, back-end algorithms, etc., but how about starting with some simple, obvious UI fixes? Here are the results of a search for “Search Engine” on the 4 majors:

jeeves.001.jpg
jeeves.002.jpg
jeeves.003.jpg
jeeves.004.jpg

At 1024×768, the UI differences are glaring. I’ve marked the ads in red and the information of dubious value in blue. The green checks indicate relevant, high-quality results.

To the average surfer, the info in white is all that matters. Note to everyone except Google – fix the UI first – it’s the low-hanging fruit.

How Microsoft can Obliterate Google

Tim Bray has it right. In the future, every site can carry search. All Microsoft has to do is give the revenue from any search to the site carrying it. Two big assumptions:

  • MSN Search has to be as good as Google search (not there yet, but possible)
  • Microsoft is willing to forgo the Google ad revenue stream in exchange for severely crippling Google. Right now, it seems like MS would rather have the revenue stream than eliminate it, but that seems unlikely.