Thursday, 27 June 2013

Fast, Accurate Detection of 100,000 Object Classes on a Single Machine



Humans can distinguish among approximately 10,000 relatively high-level visual categories, but we can discriminate among a much larger set of visual stimuli referred to as features. These features might correspond to object parts, animal limbs, architectural details, landmarks, and other visual patterns we don’t have names for, and it is this larger collection of features we use as a basis with which to reconstruct and explain our day-to-day visual experience. Such features provide the components for more complicated visual stimuli and establish a context essential for us to resolve ambiguous scenes.

Contrary to current practice in computer vision, the explanatory context required to resolve a visual detail may not be entirely local. A flash of red bobbing along the ground might be a child’s toy in the context of a playground or a rooster in the context of a farmyard. It would be useful to have a large number of feature detectors capable of signaling the presence of such features, including detectors for sandboxes, swings, slides, cows, chickens, sheep and farm machinery necessary to establish the context for distinguishing between these two possibilities.

This year’s winner of the CVPR Best Paper Award, co-authored by Googlers Tom Dean, Mark Ruzon, Mark Segal, Jonathon Shlens, Sudheendra Vijayanarasimhan and Jay Yagnik, describes technology that will enable computer vision systems to extract the sort of semantically rich contextual information required to recognize visual categories even when a close examination of the pixels spanning the object in question might not be sufficient for identification in the absence of such contextual clues. Specifically, we consider a basic operation in computer vision that involves determining for each location in an image the degree to which a particular feature is likely to be present in the image at that particular location.

This so-called convolution operator is one of the key operations in computer vision and, more broadly, in all of signal processing. Unfortunately, it is computationally expensive, so researchers either use it sparingly or employ exotic parallel hardware such as GPUs and FPGAs to mitigate the cost. We turn things on their head by showing how one can use fast table lookup — a method called hashing — to trade space for time, replacing the computationally expensive inner loop of the convolution operator — a sequence of multiplications and additions — with a single table lookup, even when millions of convolutions are required.
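To make that trade concrete, here is a minimal, hypothetical sketch in Python of the general idea. The paper itself uses winner-take-all hashing; this sketch substitutes a simpler sign-random-projection (locality-sensitive) hash, and all names and parameters below are illustrative rather than the paper's implementation. The point is the shape of the computation: index every filter once, then at each image location perform a handful of table lookups instead of a multiply-add per filter.

import numpy as np

# Illustrative only: approximate the responses of many filters at one image
# location with hash-table lookups instead of one dot product per filter.
# Assumes filter and patch descriptors are roughly unit-normalized.

def make_hash_bands(dim, num_bands=16, band_bits=8, seed=0):
    """Random projections grouped into bands; each band yields one hash key."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(num_bands, band_bits, dim))

def band_keys(vector, bands):
    """One integer key per band, built from the signs of the projections."""
    bits = (np.einsum('nbd,d->nb', bands, vector) > 0).astype(np.uint32)
    return (bits * (1 << np.arange(bits.shape[1], dtype=np.uint32))).sum(axis=1)

def build_tables(filters, bands):
    """Done once, offline: index every filter under its key in each band."""
    tables = [dict() for _ in range(bands.shape[0])]
    for fid, f in enumerate(filters):
        for b, key in enumerate(band_keys(f, bands)):
            tables[b].setdefault(int(key), []).append(fid)
    return tables

def approximate_matches(patch, tables, bands):
    """Per image location: count band collisions; high counts suggest strong responses."""
    counts = {}
    for b, key in enumerate(band_keys(patch, bands)):
        for fid in tables[b].get(int(key), []):
            counts[fid] = counts.get(fid, 0) + 1
    return counts  # candidate filters, scored by number of matching bands

Only the few candidate filters that collide in many bands need their exact responses computed, which is what makes it feasible to evaluate what would otherwise be millions of convolutions per image.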

We demonstrate the advantages of our approach by scaling object detection from the current state of the art, involving several hundred or at most a few thousand object categories, to 100,000 categories, which would otherwise require more than a million convolutions. Moreover, our demonstration was carried out on a single commodity computer, requiring only a few seconds per image. The basic technology is used in several pieces of Google infrastructure and can be applied to problems outside of computer vision, such as auditory signal processing.

On Wednesday, June 26, the Google engineers responsible for the research were awarded Best Paper at a ceremony at the IEEE Conference on Computer Vision and Pattern Recognition held in Portland, Oregon. The full paper can be found here.

Tuesday, 18 June 2013

Style vs. Substance in Mobile Software

Although we’ve all been talking about mobile computing for years, the smartphone and tablet markets are still very young, and changing rapidly. Many app and web companies are struggling to figure out how mobile works and what makes it different from the more familiar world of websites and personal computers.

The depth of the confusion became clear to me recently when, as part of a research project, I had the chance to watch a huge archive of videos of users trying out mobile-specific apps and websites. The results were surprising. Many users struggled to figure out even basic tasks, and I saw the same design mistakes repeated over and over by different developers.

I’ve written a whitepaper on the findings, with many details and examples (you can download it here).* In this post, I want to highlight the biggest problem I saw in the tests, and what I think it means for all of us.

The most common problem I saw in the tests was users struggling with mobile apps and websites that prioritized beauty over usability. Too often, we as an industry equate an app that looks simple with an app that’s easy to use. Those are two entirely different things. Stripping all the text out of an app and hiding all of the buttons makes for a beautiful demo at TechCrunch, but a horrible user experience for people who are trying to get something done with an app.

We tell ourselves that this is OK, relabeling confusion as “intrigue.” How many times have you seen an expert online say something like this: “Users enjoy the process of discovering new functions in your app as they gradually explore its interface and learn its hidden features”? From watching real people use apps, I can tell you that’s lunacy. What delights most mobile users is getting things done. The only time they want to explore an app’s hidden nuances is if they’re playing. In a game or other entertainment app, cryptic Myst-like interfaces make for an engaging puzzle. In all other apps, puzzlement is a sign of bad design.

Here are three examples of the trouble we're creating for ourselves:

Low contrast. A trend in modern graphic design is the use of low-contrast graphics and text: light gray or blue text on a white background, or dark gray text on a black background. It looks sexy in print and on the web, but causes problems in mobile. Smartphones are often used outdoors, in situations where any screen image is hard to see. Low-contrast items can completely disappear in direct sunlight. Often companies don’t realize that this will be a problem because they test their apps indoors, or do design reviews by projecting screen images in a darkened room.

If you think this is just an isolated problem, check out the weather app in iOS 7. I love the look of that white text that Apple superimposed over a pale blue sky with puffy white clouds. But can you read it? How will it look in the sun?



Cryptic icons. There are a few icons that mean the same thing on all mobile platforms. For example, the magnifying glass means “search” everywhere. But in most cases, the mobile OS players have used icon designs as a point of differentiation. The table below shows some conflicting icon designs in Android and iOS:

The last two examples in the table show similar icons in iOS and Android that have different meanings.

Some developers respond to this diversity by creating separate versions of their mobile app for each OS, with different icons in each version. But users are not as easily segmented. In the tests, I saw cases in which iOS users assumed the Android icon definitions, and vice-versa. The situation is even worse for a mobile web developer, who must use the same UI on all platforms. Which icon set should they use?

When icon designs conflict, they cancel each other out and mean nothing. Many apps are studded with icons that the developers think make sense, but that actually are just tiny meaningless pictures in the eyes of many users.

Missing help. I used to think the ideal mobile app would be so simple that everyone could figure out how to use it intuitively. I now realize that’s a fantasy. The tiny screen and other restrictions of a mobile device make it almost certain that people will sometimes be confused by your app.

When mobile app users get confused, the first thing they do is search in the app for a help function. If help is available and properly structured, the user can usually resolve the problem and get back on task.  Unfortunately, in most mobile apps and websites, help is minimal or totally absent. I don’t know why that is. Maybe developers feel adding help would be an admission that their app is hard to use. But that’s like saying you shouldn’t put seat belts in a car because it implies the car might crash. Plan for trouble and your users will be happier.


What it means. The fixes to these specific problems are straightforward:

—Use high-contrast text (black on white, white on black, or pretty close to it). And test your mobile app or website outdoors, in bright sunlight.
—Label all buttons with text in addition to (or instead of) icons.
—Add context-sensitive help to every screen in your app (the help can be as simple as an overlay saying what you’re supposed to do on this screen and what the buttons do).

The harder part is dealing with the underlying design attitude that created these problems in the first place. I don’t know exactly when we went astray on design. Early websites were horribly cluttered, and in reaction to that we started to see a welcome move toward cleaner and simpler designs online (think of Mint.com, which took a complex subject like personal finance and made it feel accessible). The rise of the iPhone, with Apple’s strong emphasis on design elegance, reinforced this trend. But somewhere along the way, we lost track of the user’s needs. Instead of making things simple, we made them simplistic. We hid features for the sake of hiding them, rather than because the user didn’t need them. And we started designing software that would look beautiful to VCs and other designers, rather than being helpful to our users.

If we’re going to permanently solve the usability problems in mobile, we need to readjust our attitude toward mobile design. The most beautiful app is not the one that looks most striking; it’s the one people can actually use. You should design your app to be usable first, and then make it as pretty as you can.

The highest form of beauty is functionality.

__________

*Full disclosure: In addition to my startup role at zekira.com, I’m working on mobile strategy for UserTesting.com. They’re the leading “talk aloud” user testing service, and they gave me access to their test archive for the whitepaper. I controlled the content of the research and the conclusions. And the company had nothing to do with this blog post; I wrote it because I thought you’d be interested in the findings.

Some Innovative MOOCs



Last summer, we jumped into the world of MOOCs (Massive Open Online Courses) with our own course on search skills, Power Searching. Soon after, we open sourced the platform that we developed to present the course -- Course Builder. A large number of courses have been hosted on Course Builder since, with many more coming soon. As can be seen from the list of courses, our goal is to provide the capability for anyone to create a MOOC. We’ve been pleasantly surprised by the variety of courses and the creativity of the instructors building on Course Builder.

For example, GivingWithPurpose is an innovative MOOC presented by Learning By Giving, one of Doris Buffett’s foundations for “giving it all away.” Instructor Rebecca Riccio, who teaches philanthropy at both Northeastern and Brandeis Universities, feels that reaching thousands of people in a discussion about how to allocate scarce resources to address the needs of our communities has huge potential. “For all the social, cultural, and economic value we derive from the nonprofit sector, we do shockingly little to educate people about why it is so important and what we can do to help it thrive. So while I believe GivingWithPurpose will be successful in its primary goal of teaching students how to give more effectively, in ways that both satisfy their own motivations for giving and support high-performing nonprofit organizations, my second, perhaps more ambitious goal is to promote more informed civic engagement.”

We’ve also hosted MOOCs on evaluating and selecting soccer players, how to search for a job, and how to develop digital learning opportunities for students in public schools. We have many university courses such as Game Theory from Stanford and Information Visualization from Indiana University.

Course Builder’s support of both traditional and non-traditional education opportunities is core to its mission. We’ll continue to build features that help university professors, K12 teachers and anyone else who has something important to teach.

Thursday, 13 June 2013

Excellent Papers for 2012



Googlers across the company actively engage with the scientific community by publishing technical papers, contributing open-source packages, working on standards, introducing new APIs and tools, giving talks and presentations, participating in ongoing technical debates, and much more. Our publications offer technical and algorithmic advances, feature aspects we learn as we develop novel products and services, and shed light on some of the technical challenges we face at Google.

In an effort to highlight some of our work, we periodically select a number of publications to be featured on this blog. We first posted a set of papers on this blog in mid-2010 and discussed them in more detail in subsequent blog postings. In a second round, we highlighted noteworthy papers from the latter half of 2010, and again in 2011. This time we honor influential papers authored or co-authored by Googlers covering all of 2012 -- roughly 6% of our total publications. Choosing is tough, and we may have left out some important papers, so please see the publications list to review the complete group.

In the coming weeks we will be offering a more in-depth look at some of these publications, but here are the summaries:

Algorithms and Theory

Online Matching with Stochastic Rewards
Aranyak Mehta*, Debmalya Panigrahi [FOCS'12]
Online advertising is inherently stochastic: value is realized only if the user clicks on the ad, while the ad platform knows only the probability of the click. This paper is the first to introduce the stochastic nature of the rewards to the rich algorithmic field of online allocations. The core algorithmic problem it formulates is online bipartite matching with stochastic rewards, with known click probabilities. The main result is an online algorithm which obtains a large fraction of the optimal value. The paper also illustrates the difficulty introduced by the stochastic rewards by showing how the problem behaves very differently from the classic (non-stochastic) online matching problem.

Matching with our Eyes Closed
Gagan Goel*, Pushkar Tripathi* [FOCS'12]
In this paper we propose a simple randomized algorithm for finding a matching in a large graph. Unlike most solutions to this problem, our approach does not rely on building large combinatorial structures (like blossoms) but works completely locally. We analyze the performance of our algorithm and show that it does significantly better than the greedy algorithm. In doing so, we improve a celebrated 18-year-old result by Aronson et al.

Simultaneous Approximations for Adversarial and Stochastic Online Budgeted Allocation
Vahab Mirrokni*, Shayan Oveis Gharan, Morteza Zadimoghaddam, [SODA'12]
In this paper, we study online algorithms that simultaneously perform well in worst-case and average-case instances, or equivalently algorithms that perform well in both stochastic and adversarial models at the same time. This is motivated by online allocation of queries to advertisers with budget constraints. Stochastic models are not robust enough to deal with traffic spikes and adversarial models are too pessimistic. While several algorithms have been proposed for these problems, each algorithm was known to perform well in one model and not both, and we present new results for a single algorithm that works well in both models.

Economics and EC

Polyhedral Clinching Auctions and the Adwords Polytope
Gagan Goel*, Vahab Mirrokni*, Renato Paes Leme [STOC'12]
Budgets play a major role in ad auctions where advertisers explicitly declare budget constraints. Very little is known in auctions about satisfying such budget constraints while keeping incentive compatibility and efficiency. The problem becomes even harder in the presence of complex combinatorial constraints over the set of feasible allocations. We present a class of ascending-price auctions addressing this problem for a very general class of (polymatroid) allocation constraints including the AdWords problem with multiple keywords and multiple slots.

HCI

Backtracking Events as Indicators of Usability Problems in Creation-Oriented Applications
David Akers*, Robin Jeffries*, Matthew Simpson*, Terry Winograd [TOCHI '12]
Backtracking events such as undo can be useful automatic indicators of usability problems for creation-oriented applications such as word processors and photo editors. Our paper presents a new cost-effective usability evaluation method based on this insight.

Talking in Circles: Selective Sharing in Google+
Sanjay Kairam, Michael J. Brzozowski*, David Huffaker*, Ed H. Chi*, [CHI'12]
This paper explores why so many people share selectively on Google+: to protect their privacy but also to focus and target their audience. People use Circles to support these goals, organizing contacts by life facet, tie strength, and interest.

Information Retrieval

Online selection of diverse results
Debmalya Panigrahi, Atish Das Sarma, Gagan Aggarwal*, and Andrew Tomkins*, [WSDM'12]
We consider the problem of selecting subsets of items that are simultaneously diverse in multiple dimensions, which arises in the context of recommending interesting content to users. We formally model this optimization problem, identify its key structural characteristics, and use these observations to design an extremely scalable and efficient algorithm. We prove that the algorithm always produces a nearly optimal solution and also perform experiments on real-world data that indicate that the algorithm performs even better in practice than the analytical guarantees.

Machine Learning

Large Scale Distributed Deep Networks
Jeffrey Dean, Greg S. Corrado*, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng, NIPS 2012;
In this paper, we examine several techniques to improve the time to convergence for neural networks and other models trained by gradient-based methods. The paper describes a system we have built that exploits both model-level parallelism (by partitioning the nodes of a large model across multiple machines) and data-level parallelism (by having multiple replicas of a model process different training data and coordinating the application of updates to the model state through a centralized-but-partitioned parameter server system). Our results show that very large neural networks can be trained effectively and quickly on large clusters of machines.
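As a rough illustration of the data-parallel half of that design, here is a toy, hypothetical Python sketch. It is not the system described in the paper: there is no real network, the "parameter server" is a single in-process object, and the replicas take turns rather than running asynchronously across machines. It only shows the basic pull-parameters, compute-gradient, push-update loop.

import numpy as np

class ParameterServer:
    """Toy stand-in for a centralized (in reality, partitioned) parameter store."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def fetch(self):
        return self.w.copy()           # a replica pulls the current parameters

    def push(self, grad):
        self.w -= self.lr * grad       # apply a replica's gradient update

def replica_step(server, x_shard, y_shard):
    """One model replica: pull parameters, compute a local gradient, push it back."""
    w = server.fetch()
    residual = x_shard @ w - y_shard
    grad = x_shard.T @ residual / len(y_shard)   # least-squares gradient on this shard
    server.push(grad)

# Toy usage: two "replicas" work on different halves of the data.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
y = x @ rng.normal(size=5)
ps = ParameterServer(dim=5)
for step in range(200):
    shard = slice(0, 50) if step % 2 == 0 else slice(50, 100)
    replica_step(ps, x[shard], y[shard])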

Open Problem: Better Bounds for Online Logistic Regression
Brendan McMahan* and Matthew Streeter*, COLT/ICML'12 Joint Open Problem Session, JMLR: Workshop and Conference Proceedings.
One of the goals of research at Google is to help point out important open problems--precise questions that are interesting academically but also have important practical ramifications. This open problem is about logistic regression, a widely used algorithm for predicting probabilities (what is the probability an email message is spam, or that a search ad will be clicked). We show that in the simple one-dimensional case, much better results are possible than current theoretical analysis suggests, and we ask whether our results can be generalized to arbitrary logistic regression problems.

Spectral Learning of General Weighted Automata via Constrained Matrix Completion
Borja Balle and Mehryar Mohri*, NIPS 2012.
Learning weighted automata from finite samples drawn from an unknown distribution is a central problem in machine learning and computer science in general, with a variety of applications in text and speech processing, bioinformatics, and other areas. This paper presents a new family of algorithms for tackling this problem for which it proves learning guarantees. The algorithms introduced combine ideas from two different domains: matrix completion and spectral methods.

Machine Translation

Improved Domain Adaptation for Statistical Machine Translation
Wei Wang*, Klaus Macherey*, Wolfgang Macherey*, Franz Och* and Peng Xu*, [AMTA'12]
Research in domain adaptation for machine translation has mostly focused on a single domain. We present a simple and effective domain adaptation infrastructure that makes an MT system with a single translation model capable of providing adapted, close-to-upper-bound domain-specific accuracy while preserving the generic translation accuracy. Large-scale experiments on 20 language pairs for patent and generic domains show the viability of our approach.

Multimedia and Computer Vision

Reconstructing the World's Museums
Jianxiong Xiao and Yasutaka Furukawa*, [ECCV '12]
Virtual navigation and exploration of large indoor environments (e.g., museums) have been so far limited to either blueprint-style 2D maps that lack photo-realistic views of scenes, or ground-level image-to-image transitions, which are immersive but ill-suited for navigation. This paper presents a novel vision-based 3D reconstruction and visualization system to automatically produce clean and well-regularized texture-mapped 3D models for large indoor scenes, from ground-level photographs and 3D laser points. For the first time, we enable users to easily browse a large scale indoor environment from a bird's-eye view, locate specific room interiors, fly into a place of interest, view immersive ground-level panoramas, and zoom out again, all with seamless 3D transitions.

The intervalgram: An audio feature for large-scale melody recognition
Thomas C. Walters*, David Ross*, Richard F. Lyon*, [CMMR'12]
Intervalgrams are small images that summarize the structure of short segments of music by looking at the musical intervals between the notes present in the music. We use them for finding cover songs -- different pieces of music that share the same underlying composition. We do this by comparing 'heatmaps' which look at the similarity between intervalgrams from different pieces of music over time. If we see a strong diagonal line in the heatmap, it's good evidence that the songs are musically similar.

General and Nested Wiberg Minimization
Dennis Strelow*, [CVPR'12]
Eriksson and van den Hengel’s CVPR 2010 paper showed that Wiberg’s least squares matrix factorization, which effectively eliminates one matrix from the factorization problem, could be applied to the harder case of L1 factorization. Our paper generalizes their approach beyond factorization to general nonlinear problems in two sets of variables, like perspective structure-from-motion. We also show that with our generalized method, one Wiberg minimization can be nested inside another, effectively eliminating two of three sets of unknowns, and we demonstrate this idea using projective structure-from-motion.

Calibration-Free Rolling Shutter Removal
Matthias Grundmann*, Vivek Kwatra*, Daniel Castro, Irfan Essa*, International Conference on Computational Photography '12. Best paper.
Mobile phones and current-generation DSLRs contain an electronic rolling shutter, capturing each frame one row of pixels at a time. Consequently, if the camera moves during capture, the image is distorted, with artifacts ranging from shear to wobble. We propose a calibration-free solution, based on a novel parametric mixture model, that corrects these rolling shutter distortions in videos and enables real-time rolling shutter rectification as part of YouTube’s video stabilizer.

Natural Language Processing

Vine Pruning for Efficient Multi-Pass Dependency Parsing
Alexander Rush, Slav Petrov*, The 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL '12), Best Paper Award.
Being able to accurately analyze the grammatical structure of sentences is crucial for language understanding applications such as machine translation or question answering. In this paper we present a method that is up to 200 times faster than existing methods and enables the grammatical analysis of text in large-scale applications. The key idea is to perform the analysis in multiple coarse-to-fine passes, resolving easy ambiguities first and tackling the harder ones later on.

Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure
Oscar Tackstrom, Ryan McDonald*, Jakob Uszkoreit*, North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL '12), Best Student Paper Award.
This paper studies how to build meaningful cross-lingual word clusters, i.e., clusters containing lexical items from two languages that are coherent along some abstract dimension. This is done by coupling distributional statistics learned from huge amounts of language-specific data with constraints generated from parallel corpora. The resulting clusters are used to improve the accuracy of multi-lingual syntactic parsing for languages without any training resources.

Networks

How to Split a Flow
Tzvika Hartman*, Avinatan Hassidim*, Haim Kaplan*, Danny Raz*, Michal Segalov*, [INFOCOM '12]
Decomposing a flow into a small number of paths is a very important task that arises in various network optimization mechanisms. In this paper we develop an approximation algorithm for this problem that has both provable worst-case performance guarantees and good practical behavior.

Deadline-Aware Datacenter TCP (D2TCP)
Balajee Vamanan, Jahangir Hasan*, T. N. Vijaykumar, [SIGCOMM '12]
Some of our most important products like search and ads operate under soft-real-time constraints. They are architected and fine-tuned to return results to users within a few hundred milliseconds. Deadline-Aware Datacenter TCP is a research effort into making the datacenter networks deadline aware, thus improving the performance of such key applications.

Trickle: Rate Limiting YouTube Video Streaming
Monia Ghobadi, Yuchung Cheng*, Ankur Jain*, Matt Mathis* [USENIX '12]
Trickle is a server-side mechanism to stream YouTube video smoothly to reduce burst and buffer-bloat. It paces the video stream by placing an upper bound on TCP’s congestion window based on the streaming rate and the round-trip time. In initial evaluation Trickle reduces the TCP loss rate by up to 43% and the RTT by up to 28%. Given the promising results we are deploying Trickle to all YouTube servers.
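As a back-of-the-envelope illustration of the pacing rule described above (not Trickle's actual server code; the function name and its headroom parameter are hypothetical), the window cap follows directly from the target rate and the measured round-trip time:

def cwnd_clamp(target_rate_bps, rtt_seconds, mss_bytes=1460, headroom=1.2):
    """Upper bound on the congestion window, in packets, for a desired streaming rate."""
    bytes_per_rtt = target_rate_bps / 8.0 * rtt_seconds
    return max(2, int(headroom * bytes_per_rtt / mss_bytes))

# e.g. a 2.5 Mbps stream over a 100 ms path is capped at roughly 25 packets:
print(cwnd_clamp(2_500_000, 0.100))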

Social Systems

Look Who I Found: Understanding the Effects of Sharing Curated Friend Groups
Lujun Fang*, Alex Fabrikant*, Kristen LeFevre*, [Web Science '12]. Best Student Paper award.
In this paper, we studied the impact of the Google+ circle-sharing feature, which allows individual users to share (publicly and privately) pre-curated groups of friends and contacts. We specifically investigated the impact on the growth and structure of the Google+ social network. In the course of the analysis, we identified two natural categories of shared circles ("communities" and "celebrities"). We also observed that the circle-sharing feature is associated with the accelerated densification of community-type circles.

Software Engineering

AddressSanitizer: A Fast Address Sanity Checker
Konstantin Serebryany*, Derek Bruening*, Alexander Potapenko*, Dmitry Vyukov*, [USENIX ATC '12].
The paper “AddressSanitizer: A Fast Address Sanity Checker” describes a dynamic tool that finds memory corruption bugs in C or C++ programs with only a 2x slowdown. The major feature of AddressSanitizer is simplicity -- this is why the tool is very fast.

Speech

Japanese and Korean Voice Search
Mike Schuster*, Kaisuke Nakajima*, IEEE International Conference on Acoustics, Speech, and Signal Processing [ICASSP'12].
"Japanese and Korean voice search" explains in detail how the Android voice search systems for these difficult languages were developed. We describe how to segment statistically to be able to handle infinite vocabularies without out-of-vocabulary words, how to handle the lack of spaces between words for language modeling and dictionary generation, and how to deal best with multiple ambiguities during evaluation scoring of reference transcriptions against hypotheses. The combination of techniques presented led to high quality speech recognition systems--as of 6/2013 Japanese and Korean are #2 and #3 in terms of traffic after the US.

Google's Cross-Dialect Arabic Voice Search
Fadi Biadsy*, Pedro J. Moreno*, Martin Jansche*, IEEE International Conference on Acoustics, Speech, and Signal Processing [ICASSP 2012].
This paper describes Google’s automatic speech recognition systems for recognizing several Arabic dialects spoken in the Middle East, with the potential to reach more than 125 million users. We suggest solutions for challenges specific to Arabic, such as the diacritization problem, where short vowels are not written in Arabic text. We conduct experiments to identify the optimal manner in which acoustic data should be clustered among dialects.

Deep Neural Networks for Acoustic Modeling in Speech Recognition
Geoffrey Hinton*, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew W. Senior*, Vincent Vanhoucke*, Patrick Nguyen, Tara Sainath, Brian Kingsbury, Signal Processing Magazine (2012)
Survey paper on the DNN breakthrough in automatic speech recognition accuracy.

Statistics

Empowering Online Advertisements by Empowering Viewers with the Right to Choose
Max Pashkevich*, Sundar Dorai-Raj*, Melanie Kellar*, Dan Zigmond*, Journal of Advertising Research, vol. 52 (2012).
YouTube’s TrueView in-stream video advertising format (a form of skippable in-stream ads) can improve the online video viewing experience for users without sacrificing advertising value for advertisers or content owners.

Structured Data

Efficient Spatial Sampling of Large Geographical Tables
Anish Das Sarma*, Hongrae Lee*, Hector Gonzalez*, Jayant Madhavan*, Alon Halevy*, [SIGMOD '12].
This paper presents fundamental results for the "thinning problem": determining appropriate samples of data to be shown on specific geographical regions and zoom levels. This problem is widely applicable to cloud-based geographic visualization systems such as Google Maps and Fusion Tables, and the developed algorithms are part of the Fusion Tables backend. The SIGMOD 2012 paper was selected among the best papers of the conference and invited to a special best-papers issue of TODS.

Systems

Spanner: Google's Globally-Distributed Database
James C. Corbett*, Jeffrey Dean*, Michael Epstein*, Andrew Fikes*, Christopher Frost*, JJ Furman*, Sanjay Ghemawat*, Andrey Gubarev*, Christopher Heiser*, Peter Hochschild*, Wilson Hsieh*, Sebastian Kanthak*, Eugene Kogan*, Hongyi Li*, Alexander Lloyd*, Sergey Melnik*, David Mwaura*, David Nagle*, Sean Quinlan*, Rajesh Rao*, Lindsay Rolig*, Dale Woodford*, Yasushi Saito*, Christopher Taylor*, Michal Szymaniak*, Ruth Wang*, [OSDI '12]
This paper shows how a new time API and its implementation can provide the abstraction of tightly synchronized clocks, even on a global scale. We describe how we used this technology to build a globally-distributed database that supports a variety of powerful features: non-blocking reads in the past, lock-free snapshot transactions, and atomic schema changes.

Wednesday, 12 June 2013

Improving Photo Search: A Step Across the Semantic Gap



Last month at Google I/O, we showed a major upgrade to the photos experience: you can now easily search your own photos without having to manually label each and every one of them. This is powered by computer vision and machine learning technology, which uses the visual content of an image to generate searchable tags; combined with other sources such as text tags and EXIF metadata, this enables search across thousands of concepts like flower, food, car, jet ski, or turtle.

For many years Google has offered Image Search over web images; however, searching across photos represents a difficult new challenge. In Image Search there are many pieces of information which can be used for ranking images, for example text from the web or the image filename. However, in the case of photos, there is typically little or no information beyond the pixels in the images themselves. This makes it harder for a computer to identify and categorize what is in a photo. There are some things a computer can do well, like recognize rigid objects and handwritten digits. For other classes of objects, this is a daunting task, because the average toddler is better at understanding what is in a photo than the world’s most powerful computers running state of the art algorithms.

This past October the state of the art seemed to move things a bit closer to toddler performance. A system which used deep learning and convolutional neural networks easily beat out more traditional approaches in the ImageNet computer vision competition designed to test image understanding. The winning team was from Professor Geoffrey Hinton’s group at the University of Toronto.

We built and trained models similar to those from the winning team using software infrastructure for training large-scale neural networks developed at Google in a group started by Jeff Dean and Andrew Ng. When we evaluated these models, we were impressed; on our test set we saw double the average precision when compared to other approaches we had tried. We knew we had found what we needed to make photo searching easier for people using Google. We acquired the rights to the technology and went full speed ahead adapting it to run at large scale on Google’s computers. We took cutting edge research straight out of an academic research lab and launched it, in just a little over six months. You can try it out at photos.google.com.

Why the success now? What is new? Some things are unchanged: we still use convolutional neural networks -- originally developed in the late 1990s by Professor Yann LeCun in the context of software for reading handwritten letters and digits. What is different is that both computers and algorithms have improved significantly. First, bigger and faster computers have made it feasible to train larger neural networks with much larger data. Ten years ago, running neural networks of this complexity would have been a momentous task even on a single image -- now we are able to run them on billions of images. Second, new training techniques have made it possible to train the large deep neural networks necessary for successful image recognition.

We feel it would be interesting to the research community to discuss some of the unique aspects of the system we built and some qualitative observations we had while testing the system.

The first is our label and training set and how it compares to that used in the ImageNet Large Scale Visual Recognition competition. Since we were working on search across photos, we needed an appropriate label set. We came up with a set of about 2000 visual classes based on the most popular labels on Google+ Photos that also seemed to have a visual component, meaning something a human could recognize visually. In contrast, the ImageNet competition has 1000 classes. As in ImageNet, the classes are not text strings but entities; in our case we use Freebase entities, which form the basis of the Knowledge Graph used in Google search. An entity is a way to uniquely identify something in a language-independent way. In English, when we encounter the word “jaguar”, it is hard to determine whether it refers to the animal or the car manufacturer. Entities assign a unique ID to each, removing that ambiguity: in this case “/m/0449p” for the former and “/m/012x34” for the latter. In order to train better classifiers we used more training images per class than ImageNet, 5000 versus 1000. Since we wanted to provide only high-precision labels, we also refined the classes from our initial set of 2000 to the most precise 1100 classes for our launch.
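To make the role of entities concrete, here is a tiny, hypothetical illustration (the two IDs come from the example above; the label structure and names are ours, not the production schema): a label refers to an unambiguous, language-independent ID rather than to the ambiguous word “jaguar”.

# Hypothetical example only: entity IDs disambiguate, names are language-specific.
ENTITY_NAMES = {
    "/m/0449p":  {"en": "jaguar (the animal)", "es": "jaguar (el animal)"},
    "/m/012x34": {"en": "Jaguar (the car manufacturer)", "es": "Jaguar (el fabricante de autos)"},
}

def label_photo(photo_id, entity_id, score):
    """A photo label points at an entity ID, not at a word in any particular language."""
    return {"photo": photo_id, "entity": entity_id, "score": score}

label = label_photo("IMG_0042.jpg", "/m/0449p", 0.87)
print(ENTITY_NAMES[label["entity"]]["en"])   # -> jaguar (the animal)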

During our development process we had many more qualitative observations we felt are worth mentioning:

1) Generalization performance. Even though there was a significant difference in visual appearance between the training and test sets, the network appeared to generalize quite well. To train the system, we used images mined from the web which did not match the typical appearance of personal photos. Images on the web are often used to illustrate a single concept and are carefully composed, so an image of a flower might only be a close-up of a single flower. But personal photos are unstaged and impromptu: a photo of a flower might contain many other things and may not be very carefully composed. So our training set image distribution was not necessarily a good match for the distribution of images we wanted to run the system on, as the examples below illustrate. However, we found that our system trained on web images was able to generalize and perform well on photos.

A typical photo of a flower found on the web.
A typical photo of a flower found in an impromptu photo.

2) Handling of classes with multi-modal appearance. The network seemed to be able to handle classes with multimodal appearance quite well, for example the “car” class contains both exterior and interior views of the car. This was surprising because the final layer is effectively a linear classifier which creates a single dividing plane in a high dimensional space. Since it is a single plane, this type of classifier is often not very good at representing multiple very different concepts.

3) Handling abstract and generic visual concepts. The system was able to do reasonably well on classes that one would think are somewhat abstract and generic. These include "dance", "kiss", and "meal", to name a few. This was interesting because for each of these classes it did not seem that there would be any simple visual clues in the image that would make it easy to recognize this class. It would be difficult to describe them in terms of simple basic visual features like color, texture, and shape.

Photos recognized as containing a meal.
4) Reasonable errors. Unlike other systems we experimented with, the errors which we observed often seemed quite reasonable to people. The mistakes were the type that a person might make - confusing things that look similar. Some people have already noticed this, for example, mistaking a goat for a dog or a millipede for a snake. This is in contrast to other systems which often make errors which seem nonsensical to people, like mistaking a tree for a dog.

Photo of a banana slug mistaken for a snake.
Photo of a donkey mistaken for a dog.

5) Handling very specific visual classes. Some of the classes we have are very specific, like specific types of flowers, for example “hibiscus” or “dahlia”. We were surprised that the system could do well on those. To recognize such specific subclasses, very fine detail is often needed to differentiate between them. So it was surprising that a system that could do well on a full-image concept like “sunsets” could also do well on very specific classes.

Photo recognized as containing a hibiscus flower.
Photo recognized as containing a dahlia flower.
Photo recognized as containing a polar bear.
Photo recognized as containing a grizzly bear.

The resulting computer vision system worked well enough to launch to people as a useful tool to help improve personal photo search, which was a big step forward. So, is computer vision solved? Not by a long shot. Have we gotten computers to see the world as well as people do? The answer is not yet; there’s still a lot of work to do, but we’re closer.

Tuesday, 11 June 2013

2013 Google PhD Fellowships: 5 Years of Supporting the Future of Computer Science



We are extremely excited to announce the 2013 Global Google PhD Fellows. From all around the globe, these 39 PhD students represent the fifth class in the program’s history, a select group recognized by Google researchers and their institutions as some of the most promising young academics in the world. As we welcome the newest class of PhD Fellows, we take a look back at the program’s roots and hear from two past recipients.

In 2009, Google launched its PhD Fellowship Program, created to recognize and support outstanding graduate students pursuing work in computer science, related disciplines or promising research areas. In its inaugural year, 13 United States PhD students were awarded fellowships, drawn from an extremely competitive pool of applicants. The global program now covers Europe, China, India and Australia and continues to draw some of the best young researchers, reflecting Google’s commitment to building strong relations with the academic community.

Among those first recipients of the fellowship award are 2009 PhD Fellow Roxana Geambasu, Visiting Professor in the Computer Science Department at Columbia University, and 2010 European Doctoral Fellow Roland Angst, Visiting Assistant Professor at Stanford University and affiliated with the Max Planck Center for Visual Computing and Communication. As early recipients of the award, Roxana and Roland reflect on the impact that the Google Fellowship program had on their careers.

For Roxana, the fellowship provided the tools and connections that helped lay the foundation for her academic career. She believes industrial fellowship programs are very important, as they give students an opportunity to interact more closely with industry.

“Beyond the financial support, I think that the fellowship impacted my career in many important ways. First, the Google fellowships are regarded as highly competitive, so receiving the award was probably a big plus on my resume when I was interviewing for faculty positions.”

“Second, the award yielded a mentor within Google, Brad Chen, with whom I've kept in touch ever since, as well as opportunities to visit the campus, deliver talks and meet Google engineers. Brad and I continue to meet at conferences and discuss my work, his work and (of late) the work of my students; it’s through that relationship I’m exposed to new people from Google and gain valuable advice about faculty award opportunities.”

Roland Angst credits the award with lightening his teaching load so that he could focus on his research, which ultimately prepared him for his future academic career. Like Roxana, Roland says that the fellowship also gave him the opportunity to establish connections with people working on related topics in industry.

“In my view, programs such as the Google Fellowship Awards represent an important and integral link between industry and universities. Firstly, such programs increase the awareness in the academic world for relevant problems in industry. Secondly, these programs allow the IT industry to express their gratitude for the educational services provided by the universities, on which the IT industry heavily relies."

We welcome the latest recipients of the Global Google PhD Fellowships for 2013 with great excitement and high expectations. We are very happy to support these excellent PhD students, recognized for their incredible innovation, creativity and leadership, and we offer them our sincere congratulations.

Monday, 10 June 2013

Building A Visual Planetary Time Machine



When a societal or scientific issue is highly contested, visual evidence can cut to the core of the debate in a way that words alone cannot — communicating complicated ideas that can be understood by experts and non-experts alike. After all, it took the invention of the optical telescope to overturn the idea that the heavens revolved around the earth.

Last month, Google announced a zoomable and explorable time-lapse view of our planet. This time-lapse Earth enables you to explore the last 29 years of our planet’s history — from the global scale to the local scale, all across the planet. We hope this new visual dataset will ground debates, encourage discovery, and shift perspectives about some of today’s pressing global issues.

This project is a collaboration between Google’s Earth Engine team, Carnegie Mellon University’s CREATE Lab, and TIME Magazine — using nearly a petabyte of historical records from USGS’s and NASA’s Landsat satellites. And in this post, we’d like to give a little insight into the process required to build this time-lapse view of our planet.

Previews of the phenomena visible in these time-lapses.

First we'll describe Google’s Earth Engine system for deriving the time-series imagery. Second, we'll tell you more about CMU’s open-source “Time Machine” software for creating and streaming large, explorable time-series imagery.

Annual Composites: Distilling a Massive Dataset

Google Earth Engine brings together the world's scientific satellite imagery — over a petabyte of multispectral imagery recording over 40 years of history — and makes it available online with tools that scientists, independent researchers, and nations can use to mine this massive warehouse of data to detect changes, map trends and quantify differences on the Earth's surface using Google’s computational infrastructure. Today, the platform is used to monitor the Amazon and to estimate forest carbon in Tanzania, and hundreds of other partners are developing new uses for the technology.

Using Earth Engine, we first built annual global mosaics at a resolution of 30 meters per pixel for each year from 1984 through 2012. We started with a total of 2,068,467 scenes from the Landsat 4, 5, and 7 satellites, comprising 909 terabytes of data. The Earth’s atmosphere is a constantly-shifting sea of clouds, so in order to assemble a seamless cloud-free view of each year we analyzed all the images available at each location and used a simple cloud model to separate out the clouds from the ground. To help correct for atmospheric and seasonal effects, we used an additional 20TB of data from the MODIS MCD43A4 product to build a cloud-free low-resolution model of the Earth over time. We combined all this to produce a statistical estimate of the color of each pixel for every year for which data was available. Producing the final 29 global mosaics took a bit less than a day and consumed approximately 260,000 core-hours of CPU.
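For intuition only, here is a hypothetical Python sketch of per-pixel annual compositing. It is not Earth Engine's actual pipeline: it masks out observations that a cloud model flags as cloudy and reduces what remains to a single robust per-pixel value (a median here, whereas the production system computes its own statistical estimate and atmospheric corrections).

import numpy as np

def annual_composite(observations, cloud_scores, cloud_threshold=0.3):
    """
    observations: (num_scenes, height, width, bands) reflectance values for one year
    cloud_scores: (num_scenes, height, width), 0 = clear, 1 = cloudy
    Returns a (height, width, bands) composite; pixels with no clear view come back
    as NaN -- these are the gaps that interpolation between years has to fill.
    """
    clear = cloud_scores[..., None] < cloud_threshold          # broadcast mask over bands
    masked = np.where(clear, observations.astype(float), np.nan)
    return np.nanmedian(masked, axis=0)                        # per-pixel median of clear scenes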

Some areas of the planet are almost perpetually cloudy, obscuring satellite views. In addition, before the more capable Landsat 7 began operating in 1999, coverage in some areas of the world was sparse, particularly in Asia, for various operational and technological reasons. We wrestled with how best to visualize areas with missing or cloud-obscured images from each year. In the end, after much experimentation, we chose to simply interpolate between valid image years. Other techniques, such as greying out invalid data, created distractingly large artifacts, visually drowning out the valid information. However, the downside with the approach we have taken is that it can be difficult to tell which data is original and which is interpolated. We are exploring the possibility of including a view that allows drilling down into the non-interpolated, original mosaics.

"Time Machine": An HTML5 Time-Series Exploration Tool

Once we had produced the final global images, we adapted the Carnegie Mellon CREATE Lab’s open-source “Time Machine” software, which enables authoring, streaming, and exploring very-high-resolution videos. Time Machine videos take advantage of the power of HTML5 and modern web browsers: they are streamed as multiresolution, overlapping video tiles and displayed in a web page by manipulating the HTML5 <video> tag, in much the same way that Google Maps first demonstrated using the HTML <img> tag.

Examples of zoomable timelapses with hundreds of millions or billions of pixels per frame include documenting plant growth, bee colony collapse, and very-large-scale simulations of the universe. Time-lapse Earth, however, sets a new record for giant videos: each frame of the video is a global Mercator-projected map with a resolution of 30 meters per pixel at the equator, for a total of 1.78 trillion pixels per frame. That’s about a million times larger than a standard HD video stream. In order to scale to such large videos, we needed to integrate Time Machine’s data production pipeline into Earth Engine and the rest of Google’s infrastructure. Encoding the final video tiles consumed approximately 1.4 million core-hours of CPU in Google’s data centers over the course of about a day. For CMU's researchers, this would have been impossible without Google's resources.
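As a quick sanity check on that frame size (our own arithmetic, not from the post): a square Mercator map sampled at 30 meters per pixel at the equator spans the Earth's equatorial circumference, roughly 40,075 km, in each dimension.

equator_m = 40_075_000
pixels_per_side = equator_m / 30              # ~1.34 million pixels on a side
print(pixels_per_side ** 2)                   # ~1.78e12 pixels per frame
print(pixels_per_side ** 2 / (1920 * 1080))   # roughly a million times an HD frame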

Combining all three phases of product generation:
  • Total processing time: 3 days
  • Total CPU usage: 1.8 million core-hours
  • Peak CPU usage: 66,000 simultaneous cores
Destination locations of top 1500 share links, weighted by number of visits.

Time-lapse Earth is powerful because it helps us to access and construct the story of our planet. That story will become richer with each release, as we continue to improve fidelity and add data. The story-teller is everyone — scientists and citizens alike provide the real value by interacting, exploring, layering their knowledge upon the globe, and sharing their insights so that we can all better understand our world.

We are especially proud of the collaboration that made time-lapse Earth possible, and believe it to be an exemplar of how industry, academia, government, and the press can benefit from working together deeply over a period of years. By drawing on the strengths of each member of the collaborative community, Google strives to integrate the world's technical expertise and knowledge in order to tackle innovative and groundbreaking projects. In doing so, it is our goal to deliver an impactful service, one that can put a focus on the dramatic effect we are having on our planet.

Monday, 3 June 2013

The Story Behind Course Builder



Last summer, we ran a MOOC (Massive Open Online Course) on Power Searching. Soon after, we open sourced Course Builder, the platform that we developed on Google technologies to present the course. Since then, we have released four versions of Course Builder, adding features such as user-friendly content development, administrative support, dashboards on student performance and behavior, new assessment types including peer review, accessibility, and internationalization. A large number of courses have been hosted on Course Builder, with many more in the pipeline.

This work started with the observation that we had all the component technology needed to create a platform for delivering a learning experience similar to the MOOCs being offered on Coursera and Udacity. So we set about wiring together these components (YouTube, App Engine, Groups, Apps, Google+ and Hangouts, etc.) to create the first version of Course Builder.

As we talked with faculty and others who wanted to create online learning experiences, we saw an opportunity for Course Builder to play an important role in the MOOC space. Our goal is to provide the capability for anyone to create a MOOC or even an “OOC”. We believe that an online environment can be used for a wide variety of education-related activities beyond just the standard university course. We have implemented a feature set that supports this goal.

Our users include not only colleges and universities, but also non-profits and K12 organizations. We host academic courses such as Information Visualization and Game Theory, as well as short courses including Mapping with Google, Digital Learning in K12, YouTube Creator Academy, and Giving with Purpose. Supporting this diversity in users, content and format is why we created Course Builder.

Hosting the platform on App Engine has provided additional capabilities that are essential for our users, particularly colleges and universities. It’s possible to brand a MOOC any way the user wants. The user also owns the relationship with the student directly, and owns any data they collect to use any way they like. Because Course Builder is open source, it is easy to add customized features. Add to that App Engine’s scalability, self-managed hosting and the extensible component architecture built into Course Builder, and you have a powerful, flexible platform that can support any number of students and any type of content.

We will continue to support this diverse user base, and work to get even more great teachers and innovative learning designers involved in experimenting in this brave new world of online learning. The potential for positive disruption and change is enormous.