Parsing Google Fast Flip

Jun 22

Introduction

From time to time I spend a little too long on a project I probably shouldn’t. A while back I set out to pull Google News Fast Flip data into my personal homepage (you’ll see several projects from this). For those who are not familiar with Google Fast Flip, it is a service that shows screenshots of news articles from several different sources. Unusually for Google, there isn’t an API for Fast Flip. This would seem to be the end of the story, but if you pay attention to their page you will notice that they pull in the articles via AJAX, so I started there. The base URL is http://fastflip.googlelabs.com/data. If you open this URL in a browser you will get nothing, because it requires you to POST data to it: a query string, a start offset, and a number of posts to pull. So let’s curl that.

The fun part

Before we start here is a quick example of what you can do with the data: Demo


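<?php
// POST a query, a start offset, and a result count to the Fast Flip data URL.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://fastflip.googlelabs.com/data');
// Setting CURLOPT_POSTFIELDS makes cURL send the request as a POST.
curl_setopt($ch, CURLOPT_POSTFIELDS, "q=section:\"Sci/Tech\"&start=0&num=1");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// Return the response body from curl_exec() instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
// Request details (HTTP status code, timing, and so on), handy for debugging.
$response = curl_getinfo($ch);
curl_close($ch);

echo $content;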

This will get you what looks like JSON data, but it will not parse with json_decode(): the strings are single-quoted and contain \x2D-style escapes, neither of which is valid JSON. It looks something like this:


[['06/21/11 00:00','The Atlantic','Forget the Specs, Is the New MacBook Air Going to Be Black?','http://www.theatlantic.com/technology/archive/2011/06/forget\x2Dthe\x2Dspecs\x2Dis\x2Dthe\x2Dnew\x2Dmacbook\x2Dair\x2Dgoing\x2Dto\x2Dbe\x2Dblack/240760/','rLgszGxdkW19xM','Nicholas Jackson','Sci/Tech','','640 956','992 956','http://g1.gstatic.com/news/screenshots/u2yJ66dzwq5X2M\x2Dtiny.png','http://g1.gstatic.com/news/screenshots/u2yJ66dzwq5X2M\x2Dmed.png','http://g3.gstatic.com/news/screenshots/u2yJ66dzwq5X2M.png','1','Nicholas Jackson \x2D Nicholas Jackson is an associate editor at The Atlantic. A former media aggregator for Slate, his writing has also appeared in Encyclopaedia ...','http://www.theatlantic.com/technology/archive/2011/06/forget\x2Dthe\x2Dspecs\x2Dis\x2Dthe\x2Dnew\x2Dmacbook\x2Dair\x2Dgoing\x2Dto\x2Dbe\x2Dblack/240760/','1','3','8797714876515','','','0','']]
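
To see the failure for yourself, a minimal check might look like the following. It uses a shortened, made-up sample in the same format rather than a live response, and the variable names are only for illustration.

```php
<?php
// Shortened, made-up sample in the same format as the response above.
$sample = <<<'EOT'
[['06/21/11 00:00','The Atlantic','http://example.com/macbook\x2Dair']]
EOT;

// Single-quoted strings and \x2D escapes are not valid JSON, so this fails.
$decoded = json_decode($sample);
$failed = ($decoded === null && json_last_error() !== JSON_ERROR_NONE);

echo $failed ? "json_decode() rejected the data\n" : "json_decode() accepted the data\n";
```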

I wasn’t quite sure how best to handle this one, so I wrote this custom parser (thanks to Matt for the help with the regex):


$page = array_map(function ($n) {
return preg_replace("#^\\[*'?(.*)'?\\]*$#", '$1', preg_split("#','#", $n));
}, preg_split('#\],#', $content));

Update:
Thanks to adbe on reddit, I was able to simplify this a bit with the following chunk of code:

$search = array('\'', '\\');
$replacement = array('"', '\\\\');
$content = str_replace($search, $replacement, $content);

From here you can actually json_decode() the content variable which will give you an array.
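
One thing to watch for: because every backslash gets doubled, escapes such as \x2D come out of json_decode() as literal text rather than as a hyphen. The sketch below wraps the replacement above in a function and then converts those escapes back into characters. The function name parse_fastflip() is made up for this example, and it assumes the escapes only ever cover plain ASCII characters, which I have not verified against the live data.

```php
<?php
// Hypothetical helper: turn a raw Fast Flip response into a PHP array.
function parse_fastflip($content)
{
    // Same replacement as above: single quotes become double quotes and
    // every backslash is doubled so that json_decode() accepts the string.
    $json = str_replace(array('\'', '\\'), array('"', '\\\\'), $content);
    $rows = json_decode($json, true);
    if (!is_array($rows)) {
        return array(); // the response could not be parsed
    }

    // After decoding, \xHH escapes are still literal text; turn each one
    // into the character it stands for (assumes ASCII-range values only).
    array_walk_recursive($rows, function (&$value) {
        if (is_string($value)) {
            $value = preg_replace_callback('/\\\\x([0-9A-Fa-f]{2})/', function ($m) {
                return chr(hexdec($m[1]));
            }, $value);
        }
    });

    return $rows;
}

// Made-up sample in the same format as the real response.
$sample = <<<'EOT'
[['forget\x2Dthe\x2Dspecs','it\x27s here']]
EOT;

$rows = parse_fastflip($sample);
echo $rows[0][0], "\n";
```

If a title ever contains an unescaped double quote, the quote replacement would break the JSON, so it is worth checking for an empty result before using the rows.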

What you get from this is an array in which each article is itself an array with indexes 0-22. For me, working out what each field means was the most difficult part. I let this run as a cron job for a few days and just dumped everything to a table. To save you a LOT of trouble, this is what I found (and didn’t):
Update:
Thanks to one of the Google devs for confirming the fields I didn’t know.

0 – Date and time the article was written
1 – The news source
2 – Article title
3 – Link to the article
4 – What I think is a Google internal id (as per the Google dev, this is an internal id)
5 – Author of the article
6 – Category the post is in (check below for a full list)
7 – This field is always blank (as per the Google dev, this isn’t used)
8 – Medium image dimensions
9 – Large image dimensions
10 – Thumbnail image URL
11 – Medium image URL
12 – Large image URL
13 – Always 1 or 0, but I don’t know what it represents (as per the Google dev, this is for showing ads)
14 – A snippet of the content
15 – A link to the article, always exactly the same as number 3
16 – Always 1, but again I don’t know what this field represents
17 – Always 1 or 0, but I don’t know what it represents (as per the Google dev, this is actually the number of likes but might not be used anymore)
18 – A number, but I don’t know what it represents
19 – Always blank
20 – Topics or tags
21 – A number, but I don’t know what it represents (as per the Google dev, this is a personalization score)
22 – Whether the source is Tribune Company or Wenner Media; otherwise blank
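
If you would rather work with names than with bare indexes, something like the following could do the mapping. The key names and the function fastflip_label_row() are my own invention for this sketch, not anything Google uses, and the fields whose meaning is unknown or unused (7, 15, 16, 18, 19) are simply left out.

```php
<?php
// Hypothetical helper: give the documented indexes readable names.
function fastflip_label_row(array $row)
{
    // Index => made-up key name, following the field list above.
    $names = array(
        0 => 'published', 1 => 'source', 2 => 'title', 3 => 'url',
        4 => 'google_id', 5 => 'author', 6 => 'section',
        8 => 'medium_size', 9 => 'large_size',
        10 => 'thumb_url', 11 => 'medium_url', 12 => 'large_url',
        13 => 'show_ads', 14 => 'snippet', 17 => 'likes',
        20 => 'topics', 21 => 'personalization_score', 22 => 'partner',
    );

    $labeled = array();
    foreach ($names as $index => $name) {
        // Missing indexes become null instead of raising a notice.
        $labeled[$name] = isset($row[$index]) ? $row[$index] : null;
    }
    return $labeled;
}

// Made-up row with the same number of fields as a real one.
$row = array_fill(0, 23, '');
$row[1] = 'The Atlantic';
$row[3] = 'http://example.com/article';

$article = fastflip_label_row($row);
echo $article['source'], "\n";
```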

As you can tell, I have a few holes in this that I would love to fill in. If you can help with any of them, I would be glad to hear from you. As for the sections, I use the following array:


$sections = array(
'Politics',
'Business',
'U.S.',
'World',
'Sports',
'Sci/Tech',
'Entertainment',
'Health',
'Opinion',
'Travel',
'Environment'
);
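
To pull every section rather than just one, a loop along these lines might work. This is only a sketch: the three function names are made up, the default of 10 results per request is an assumption, and the POST body simply mirrors the single-section request from earlier in the post.

```php
<?php
// Hypothetical helper: build the POST body for one section, in the same
// format as the single-section request earlier in the post.
function fastflip_post_body($section, $start = 0, $num = 10)
{
    return 'q=section:"' . $section . '"&start=' . (int) $start . '&num=' . (int) $num;
}

// Hypothetical helper: POST one body to the data URL and return the raw response.
function fastflip_post($body)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://fastflip.googlelabs.com/data');
    curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}

// Fetch every section; $fetch is any callable that takes a POST body
// and returns the response for it.
function fastflip_fetch_all(array $sections, $fetch, $num = 10)
{
    $results = array();
    foreach ($sections as $section) {
        $results[$section] = call_user_func($fetch, fastflip_post_body($section, 0, $num));
    }
    return $results;
}
```

With the array above, fastflip_fetch_all($sections, 'fastflip_post') would return the raw response for each section, keyed by section name. Passing the fetcher in as a callable keeps the loop easy to try out with a stub before pointing it at the real URL.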

Conclusion

Now, this is of course just the starter. I actually loop through all of the sections and store the information I want in my HUD database. What I can tell you from the time I’ve had this running is that Google seems to keep the images for about a month and then removes them. I hope this helps someone with a project they are working on, and as always I’m interested in your thoughts and how you think this could be used.

It has been pointed out, and I agree, that this should not be used for any sort of production application. When I said that I hoped it helped someone with a project I meant a personal project. I just wanted to clear this up.

2 comments

  1. Konrad Dzwinel /

    As you mentioned – it’s not an API. That means that the output format, request link, required parameters, etc. may change anytime. Also, Google may block you anytime for messing with their data (which they don’t feel like sharing). It’s not a good idea to build on something so unstable and unpredictable. Google provides news feeds for noncommercial use (http://www.google.com/support/news/bin/answer.py?answer=59255&hl=en) – they sound like a better start for a project.

  2. @Konrad While you are quite right that this could change at any time, that is no different from the RSS feeds they provide. The information in this post is not meant to be used in a production app by any means. As I said, it’s for my own personal use, and as such I don’t see Google making a big deal about it. Even if they do, it’s not going to affect me too much because, as I said, it is just for my personal use; I won’t have any users to answer to. I appreciate your comment though.
