
Monday, 7 December 2015

How to Classify Images with TensorFlow



Prior to joining Google, I spent a lot of time trying to get computers to recognize objects in images. At Jetpac my colleagues and I built mustache detectors to recognize bars full of hipsters, blue sky detectors to find pubs with beer gardens, and dog detectors to spot canine-friendly cafes. At first, we used the traditional computer vision approaches that I'd used my whole career, writing a big ball of custom logic to laboriously recognize one object at a time. For example, to spot sky I'd first run a color detection filter over the whole image looking for shades of blue, and then look at the upper third. If it was mostly blue, and the lower portion of the image wasn't, then I'd classify that as probably a photo of the outdoors.
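To give a flavor of what that looked like, here is a toy sketch of such a sky heuristic. The thresholds, the blue test, and the file name are illustrative guesses, not the original Jetpac code:

```python
# A minimal sketch of the kind of hand-tuned heuristic described above:
# call an image "outdoors" if the top third is mostly blue and the bottom isn't.
import numpy as np
from PIL import Image

def looks_like_outdoor_sky(path, blue_fraction=0.5):
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    h = img.shape[0]
    top, bottom = img[: h // 3], img[2 * h // 3 :]

    def blueness(region):
        r, g, b = region[..., 0], region[..., 1], region[..., 2]
        # A pixel counts as "blue" if the blue channel clearly dominates.
        return np.mean((b > r + 20) & (b > g + 20))

    return blueness(top) > blue_fraction and blueness(bottom) < blue_fraction

print(looks_like_outdoor_sky("pub_garden.jpg"))  # hypothetical test image
```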

I'd been an engineer working on vision problems since the late 90s, and the sad truth was that unless you had a research team and plenty of time behind you, this sort of hand-tailored hack was the only way to get usable results. As you can imagine, the results were far from perfect, and each detector I wrote was a custom job that didn't help me with the next thing I needed to recognize. This probably seems laughable to anybody who didn't work in computer vision in the recent past! It's such a primitive way of solving the problem that it sounds like it should have been superseded long ago.

That's why I was so excited when I started to play around with deep learning. It became clear as I tried them out that the latest approaches using convolutional neural networks were producing far better results than my hand-tuned code on similar problems. Not only that, the process of training a detector for a new class of object was much easier. I didn't have to think about what features to detect; I'd just supply a network with new training examples and it would take it from there.

Those experiences converted me into a deep learning enthusiast, and so when Jetpac was acquired and I had the chance to join Google and work with many of the stars of the field, I couldn't resist. What impressed me more than anything was the team's willingness to share their knowledge with the rest of the world.

I'm especially happy that we've just managed to release TensorFlow, our internal machine learning framework, because it gives me a chance to show practical, usable examples of why I'm so convinced deep learning is an essential tool for anybody working with images, speech, or text in ML.

Given my background, my favorite first example is using a deep network to spot objects in an image. One of the early showcases for the new approach to neural networks was an annual competition to recognize 1,000 different classes of objects, from the ImageNet data set, and TensorFlow includes a pre-trained network for that task. If you look inside the examples folder in the source code, you'll see “label_image”, which is a small C++ application for using that network.

The README has the instructions for building TensorFlow on your machine, downloading the binary files defining the network, and compiling the sample code. Once it's all built, just run it with no arguments, and you should see a list of results showing "Military Uniform" at the top. This is running on the default image of Admiral Grace Hopper, and correctly spots her attire.
Image via Wikipedia
After that, try pointing it at your own images using the “--image” command line flag, and you should see a set of labels for each. If you want to know more about what's going on under the hood, the C++ section of the TensorFlow Inception tutorial goes into a lot more detail.
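If you'd rather drive the same pre-trained network from Python, a minimal sketch looks something like the following. It assumes you've downloaded the Inception GraphDef file that the example fetches, and that the graph uses the usual "softmax:0" and "DecodeJpeg/contents:0" tensor names; adjust those if your copy of the graph differs.

```python
# Rough Python equivalent of what label_image does (1.x-era TensorFlow API).
import tensorflow as tf

GRAPH_PATH = "classify_image_graph_def.pb"   # the downloaded network definition
IMAGE_PATH = "grace_hopper.jpg"              # any JPEG you want to classify

# Load the serialized GraphDef and import it into the default graph.
with tf.gfile.FastGFile(GRAPH_PATH, "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
tf.import_graph_def(graph_def, name="")

with tf.Session() as sess:
    softmax = sess.graph.get_tensor_by_name("softmax:0")
    image_data = tf.gfile.FastGFile(IMAGE_PATH, "rb").read()
    # Feed the raw JPEG bytes in; the graph decodes and preprocesses them itself.
    predictions = sess.run(softmax, {"DecodeJpeg/contents:0": image_data})
    top5 = predictions[0].argsort()[-5:][::-1]
    print("Top-5 class indices:", top5)
```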

The only things it will spot are those that are in the original 1,000 ImageNet classes, and it will always try to find something, which can lead to some funny results. There are no people categories, so on portraits you'll often see objects that are associated with people, like seat belts or oxygen masks, or in Lincoln’s case, a bow tie!
Image via U.S. History Images
If the image is poorly lit, then “nematode” is usually the top pick, since most training photos of those are taken in very dim surroundings. It's also not perfect in its identification, with an error rate of 5.6% for getting the right label in the top five results. However, that’s not all that bad considering that Stanford’s Andrej Karpathy found even a person trained at the task could only achieve a slightly better 5.1% error doing the same job manually. We can do even better if we combine the outputs of four trained models into an "ensemble", bringing the error rate down to just 3.5%.

It's unlikely that the set of labels it produces is exactly what you need for your application, so the next step would be to train your own network. That is a much bigger task than running a pre-trained one like this, but one of the things I like about TensorFlow is that it spans the whole lifecycle of a machine learning model, from experimentation, to training, and into production, as this example shows. To get started training, I'd recommend looking at this simple tutorial on recognizing hand-drawn digits from the MNIST data set.
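As a taste of what that tutorial covers, here is a condensed sketch of its simplest model: a single softmax layer trained with gradient descent. It is written against the 1.x-era Python API; the tutorial itself walks through each step in detail.

```python
# Minimal MNIST softmax classifier, condensed from the beginner tutorial.
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 images
y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot digit labels

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.matmul(x, W) + b                        # raw class scores (logits)

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, {x: batch_xs, y_: batch_ys})

    correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    print(sess.run(accuracy, {x: mnist.test.images, y_: mnist.test.labels}))
```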

I hope that sharing this framework will help developers build amazing user experiences we’d never even think of. We’ve been having a massive amount of fun with TensorFlow, and I can’t wait to see what interesting image tools you build using it!

Wednesday, 1 July 2015

DeepDream - a code example for visualizing Neural Networks



Two weeks ago we blogged about a visualization tool designed to help us understand how neural networks work and what each layer has learned. In addition to gaining some insight on how these networks carry out classification tasks, we found that this process also generated some beautiful art.
Top: Input image. Bottom: output image made using a network trained on places by MIT Computer Science and AI Laboratory.
We have seen a lot of interest and received some great questions, from programmers and artists alike, about the details of how these visualizations are made. We have decided to open source the code we used to generate these images in an IPython notebook, so now you can make neural network inspired images yourself!

The code is based on Caffe, uses available open source packages, and is designed to have as few dependencies as possible. To get started, you will need the following (full details in the notebook):


Once you’re set up, you can supply an image and choose which layers in the network to enhance, how many iterations to apply and how far to zoom in. Alternatively, different pre-trained networks can be plugged in.
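To make the mechanics concrete, here is a stripped-down sketch of the core enhancement step, loosely following the released notebook: run the image forward to a chosen layer, set that layer's gradient equal to its own activations so that gradient ascent amplifies whatever the layer already detects, and nudge the input image. It assumes a pycaffe net (for example, one loaded with caffe.Net from the GoogLeNet deploy prototxt and caffemodel) whose input blob is named "data"; the layer name and step size are just common defaults.

```python
# Core "dreaming" step, simplified from the released notebook.
import numpy as np
# Assumes something like:
#   import caffe
#   net = caffe.Net("deploy.prototxt", "bvlc_googlenet.caffemodel", caffe.TEST)
# (file names depend on what you downloaded; see the notebook for exact setup).

def dream_step(net, end="inception_4c/output", step_size=1.5):
    src, dst = net.blobs["data"], net.blobs[end]
    net.forward(end=end)            # activations up to the chosen layer
    dst.diff[:] = dst.data          # "whatever you see there, I want more of it"
    net.backward(start=end)         # gradient of that objective w.r.t. the image
    g = src.diff[0]
    src.data[0] += step_size / np.abs(g).mean() * g   # normalized ascent step

def dream(net, image, end="inception_4c/output", n_iter=10):
    """image: a (channels, height, width) array already in the net's input space."""
    net.blobs["data"].reshape(1, *image.shape)
    net.blobs["data"].data[0] = image
    for _ in range(n_iter):
        dream_step(net, end=end)
    return net.blobs["data"].data[0].copy()
```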

It'll be interesting to see what imagery people are able to generate. If you post images to Google+, Facebook, or Twitter, be sure to tag them with #deepdream so other researchers can check them out too.

Wednesday, 17 June 2015

Inceptionism: Going Deeper into Neural Networks



Update - 13/07/2015
Images in this blog post are licensed by Google Inc. under a Creative Commons Attribution 4.0 International License. However, images based on places by MIT Computer Science and AI Laboratory require additional permissions from MIT for use.

Artificial Neural Networks have spurred remarkable recent progress in image classification and speech recognition. But even though these are very useful tools based on well-known mathematical methods, we actually understand surprisingly little of why certain models work and others don’t. So let’s take a look at some simple techniques for peeking inside these networks.

We train an artificial neural network by showing it millions of training examples and gradually adjusting the network parameters until it gives the classifications we want. The network typically consists of 10-30 stacked layers of artificial neurons. Each image is fed into the input layer, which then talks to the next layer, until eventually the “output” layer is reached. The network’s “answer” comes from this final output layer.

One of the challenges of neural networks is understanding what exactly goes on at each layer. We know that after training, each layer progressively extracts higher and higher-level features of the image, until the final layer essentially makes a decision on what the image shows. For example, the first layer may look for edges or corners. Intermediate layers interpret the basic features to look for overall shapes or components, like a door or a leaf. The final few layers assemble those into complete interpretations—these neurons activate in response to very complex things such as entire buildings or trees.

One way to visualize what goes on is to turn the network upside down and ask it to enhance an input image in such a way as to elicit a particular interpretation. Say you want to know what sort of image would result in “Banana.” Start with an image full of random noise, then gradually tweak the image towards what the neural net considers a banana (see related work in [1], [2], [3], [4]). By itself, that doesn’t work very well, but it does if we impose a prior constraint that the image should have similar statistics to natural images, such as neighboring pixels needing to be correlated.
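A minimal sketch of that "start from noise and ascend toward a class" idea might look like the following. The class index, blob names, and the simple Gaussian blur standing in for the natural-image prior are all illustrative choices, not the exact procedure used to make the figures in this post.

```python
# Toy class visualization by gradient ascent on the input image.
import numpy as np
from scipy.ndimage import gaussian_filter
# Assumes a pycaffe `net` with input blob "data" and a final softmax blob "prob".

def visualize_class(net, class_index, n_iter=200, step_size=1.0, blur_sigma=0.5):
    src, prob = net.blobs["data"], net.blobs["prob"]
    src.data[0] = np.random.normal(0, 10, src.data[0].shape)   # start from noise
    for _ in range(n_iter):
        net.forward()
        prob.diff[:] = 0
        prob.diff[0, class_index] = 1.0        # push up the chosen class score
        net.backward(start="prob")
        g = src.diff[0]
        src.data[0] += step_size / np.abs(g).mean() * g
        # Crude stand-in for the natural-image prior: keep neighboring pixels
        # correlated by lightly blurring the image after each step.
        src.data[0] = gaussian_filter(src.data[0], sigma=(0, blur_sigma, blur_sigma))
    return src.data[0]
```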
So here’s one surprise: neural networks that were trained to discriminate between different kinds of images have quite a bit of the information needed to generate images too. Check out some more examples across different classes:
Why is this important? Well, we train networks by simply showing them many examples of what we want them to learn, hoping they extract the essence of the matter at hand (e.g., a fork needs a handle and 2-4 tines), and learn to ignore what doesn’t matter (a fork can be any shape, size, color or orientation). But how do you check that the network has correctly learned the right features? It can help to visualize the network’s representation of a fork.

Indeed, in some cases, this reveals that the neural net isn’t quite looking for the thing we thought it was. For example, here’s what one neural net we designed thought dumbbells looked like:
There are dumbbells in there alright, but it seems no picture of a dumbbell is complete without a muscular weightlifter there to lift them. In this case, the network failed to completely distill the essence of a dumbbell. Maybe it’s never been shown a dumbbell without an arm holding it. Visualization can help us correct these kinds of training mishaps.

Instead of exactly prescribing which feature we want the network to amplify, we can also let the network make that decision. In this case we simply feed the network an arbitrary image or photo and let the network analyze the picture. We then pick a layer and ask the network to enhance whatever it detected. Each layer of the network deals with features at a different level of abstraction, so the complexity of features we generate depends on which layer we choose to enhance. For example, lower layers tend to produce strokes or simple ornament-like patterns, because those layers are sensitive to basic features such as edges and their orientations.
Left: Original photo by Zachi Evenor. Right: processed by Günther Noack, Software Engineer
Left: Original painting by Georges Seurat. Right: processed images by Matthew McNaughton, Software Engineer
If we choose higher-level layers, which identify more sophisticated features in images, complex features or even whole objects tend to emerge. Again, we just start with an existing image and give it to our neural net. We ask the network: “Whatever you see there, I want more of it!” This creates a feedback loop: if a cloud looks a little bit like a bird, the network will make it look more like a bird. This in turn will make the network recognize the bird even more strongly on the next pass and so forth, until a highly detailed bird appears, seemingly out of nowhere.
The results are intriguing—even a relatively simple neural network can be used to over-interpret an image, much as children enjoy watching clouds and interpreting the random shapes. This network was trained mostly on images of animals, so naturally it tends to interpret shapes as animals. But because the data is stored at such a high abstraction, the results are an interesting remix of these learned features.
Of course, we can do more than cloud watching with this technique. We can apply it to any kind of image. The results vary quite a bit with the kind of image, because the features that are entered bias the network towards certain interpretations. For example, horizon lines tend to get filled with towers and pagodas. Rocks and trees turn into buildings. Birds and insects appear in images of leaves.
The original image influences what kind of objects form in the processed image.
This technique gives us a qualitative sense of the level of abstraction that a particular layer has achieved in its understanding of images. We call this technique “Inceptionism” in reference to the neural net architecture used. See our Inceptionism gallery for more pairs of images and their processed results, plus some cool video animations.

We must go deeper: Iterations

If we apply the algorithm iteratively on its own outputs and apply some zooming after each iteration, we get an endless stream of new impressions, exploring the set of things the network knows about. We can even start this process from a random-noise image, so that the output is purely the product of the neural network, as seen in the following images:
Neural net “dreams”— generated purely from random noise, using a network trained on places by MIT Computer Science and AI Laboratory. See our Inceptionism gallery for hi-res versions of the images above and more (Images marked “Places205-GoogLeNet” were made using this network).
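Schematically, the iteration-plus-zoom loop is just a few lines wrapping whatever enhancement step you use (for example, the dream() sketch shown in the DeepDream post above); the zoom factor and frame count below are arbitrary.

```python
# Iteratively enhance, then zoom in slightly, to generate a stream of frames.
import numpy as np
import scipy.ndimage as nd

def zoom_in(frame, scale=1.05):
    """Zoom a (channels, height, width) image slightly, keeping its size fixed."""
    c, h, w = frame.shape
    zoomed = nd.zoom(frame, (1, scale, scale), order=1)
    top = (zoomed.shape[1] - h) // 2
    left = (zoomed.shape[2] - w) // 2
    return zoomed[:, top:top + h, left:left + w]

def endless_dream(enhance, image, n_frames=100):
    """enhance: any image -> image function, e.g. a DeepDream-style step."""
    frames, frame = [], image
    for _ in range(n_frames):
        frame = zoom_in(enhance(frame))   # enhance, then zoom in a little
        frames.append(frame.copy())
    return frames

# Plumbing check with a no-op "enhancement":
clip = endless_dream(lambda img: img, np.random.rand(3, 224, 224), n_frames=5)
print(len(clip), clip[0].shape)
```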
The techniques presented here help us understand and visualize how neural networks are able to carry out difficult classification tasks, improve network architecture, and check what the network has learned during training. They also make us wonder whether neural networks could become a tool for artists—a new way to remix visual concepts—or perhaps even shed a little light on the roots of the creative process in general.

Sunday, 7 June 2015

Google Computer Vision research at CVPR 2015



Much of the world's data is in the form of visual media. In order to extract meaningful information from multimedia and deliver innovative products, such as Google Photos, Google builds machine-learning systems that are designed to enable computer perception of visual input, in addition to pursuing image and video analysis techniques focused on image/scene reconstruction and understanding.

This week, Boston hosts the 2015 Conference on Computer Vision and Pattern Recognition (CVPR 2015), the premier annual computer vision event comprising the main CVPR conference and several co-located workshops and short courses. As a leader in computer vision research, Google will have a strong presence at CVPR 2015, with many Googlers presenting publications in addition to hosting workshops and tutorials on topics covering image/video annotation and enhancement, 3D analysis and processing, development of semantic similarity measures for visual objects, synthesis of meaningful composites for visualization/browsing of large image/video collections and more.

Learn more about some of our research in the list below (Googlers highlighted in blue). If you are attending CVPR this year, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for hundreds of millions of people. Members of the Jump team will also have a prototype of the camera on display and will be showing videos produced using the Jump system on Google Cardboard.

Tutorials:
Applied Deep Learning for Computer Vision with Torch
Koray Kavukcuoglu, Ronan Collobert, Soumith Chintala

DIY Deep Learning: a Hands-On Tutorial with Caffe
Evan Shelhamer, Jeff Donahue, Yangqing Jia, Jonathan Long, Ross Girshick

ImageNet Large Scale Visual Recognition Challenge Tutorial
Olga Russakovsky, Jonathan Krause, Karen Simonyan, Yangqing Jia, Jia Deng, Alex Berg, Fei-Fei Li

Fast Image Processing With Halide
Jonathan Ragan-Kelley, Andrew Adams, Fredo Durand

Open Source Structure-from-Motion
Matt Leotta, Sameer Agarwal, Frank Dellaert, Pierre Moulon, Vincent Rabaud

Oral Sessions:
Modeling Local and Global Deformations in Deep Learning: Epitomic Convolution, Multiple Instance Learning, and Sliding Window Detection
George Papandreou, Iasonas Kokkinos, Pierre-André Savalle

Going Deeper with Convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich

DynamicFusion: Reconstruction and Tracking of Non-Rigid Scenes in Real-Time
Richard A. Newcombe, Dieter Fox, Steven M. Seitz

Show and Tell: A Neural Image Caption Generator
Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan

Long-Term Recurrent Convolutional Networks for Visual Recognition and Description
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell

Visual Vibrometry: Estimating Material Properties from Small Motion in Video
Abe Davis, Katherine L. Bouman, Justin G. Chen, Michael Rubinstein, Frédo Durand, William T. Freeman

Fast Bilateral-Space Stereo for Synthetic Defocus
Jonathan T. Barron, Andrew Adams, YiChang Shih, Carlos Hernández

Poster Sessions:
Learning Semantic Relationships for Better Action Retrieval in Images
Vignesh Ramanathan, Congcong Li, Jia Deng, Wei Han, Zhen Li, Kunlong Gu, Yang Song, Samy Bengio, Charles Rosenberg, Li Fei-Fei

FaceNet: A Unified Embedding for Face Recognition and Clustering
Florian Schroff, Dmitry Kalenichenko, James Philbin

A Mixed Bag of Emotions: Model, Predict, and Transfer Emotion Distributions
Kuan-Chuan Peng, Tsuhan Chen, Amir Sadovnik, Andrew C. Gallagher

Best-Buddies Similarity for Robust Template Matching
Tali Dekel, Shaul Oron, Michael Rubinstein, Shai Avidan, William T. Freeman

Articulated Motion Discovery Using Pairs of Trajectories
Luca Del Pero, Susanna Ricco, Rahul Sukthankar, Vittorio Ferrari

Reflection Removal Using Ghosting Cues
YiChang Shih, Dilip Krishnan, Frédo Durand, William T. Freeman

P3.5P: Pose Estimation with Unknown Focal Length
Changchang Wu

MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching
Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, Alexander C. Berg

Inferring 3D Layout of Building Facades from a Single Image
Jiyan Pan, Martial Hebert, Takeo Kanade

The Aperture Problem for Refractive Motion
Tianfan Xue, Hossein Mobahi, Frédo Durand, William T. Freeman

Video Magnification in Presence of Large Motions
Mohamed Elgharib, Mohamed Hefeeda, Frédo Durand, William T. Freeman

Robust Video Segment Proposals with Painless Occlusion Handling
Zhengyang Wu, Fuxin Li, Rahul Sukthankar, James M. Rehg

Ontological Supervision for Fine Grained Classification of Street View Storefronts
Yair Movshovitz-Attias, Qian Yu, Martin C. Stumpe, Vinay Shet, Sacha Arnoud, Liron Yatziv

VIP: Finding Important People in Images
Clint Solomon Mathialagan, Andrew C. Gallagher, Dhruv Batra

Fusing Subcategory Probabilities for Texture Classification
Yang Song, Weidong Cai, Qing Li, Fan Zhang

Beyond Short Snippets: Deep Networks for Video Classification
Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici

Workshops:
THUMOS Challenge 2015
Program organizers include: Alexander Gorban, Rahul Sukthankar

DeepVision: Deep Learning in Computer Vision 2015
Invited Speaker: Rahul Sukthankar

Large Scale Visual Commerce (LSVisCom)
Panelist: Luc Vincent

Large-Scale Video Search and Mining (LSVSM)
Invited Speaker and Panelist: Rahul Sukthankar
Program Committee includes: Apostol Natsev

Vision meets Cognition: Functionality, Physics, Intentionality and Causality
Program Organizers include: Peter Battaglia

Big Data Meets Computer Vision: 3rd International Workshop on Large Scale Visual Recognition and Retrieval (BigVision 2015)
Program Organizers include: Samy Bengio
Includes speaker Christian Szegedy - “Scalable approaches for large scale vision”

Observing and Understanding Hands in Action (Hands 2015)
Program Committee includes: Murphy Stein

Fine-Grained Visual Categorization (FGVC3)
Program Organizers include: Anelia Angelova

Large-scale Scene Understanding Challenge (LSUN)
Winners of the Scene Classification Challenge: Julian Ibarz, Christian Szegedy and Vincent Vanhoucke
Winners of the Caption Generation Challenge: Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan

Looking from above: when Earth observation meets vision (EARTHVISION)
Technical Committee includes: Andreas Wendel

Computer Vision in Vehicle Technology: Assisted Driving, Exploration Rovers, Aerial and Underwater Vehicles
Invited Speaker: Andreas Wendel
Program Committee includes: Andreas Wendel

Women in Computer Vision (WiCV)
Invited Speaker: Mei Han

ChaLearn Looking at People (sponsor)

Fine-Grained Visual Categorization (FGVC3) (sponsor)

Wednesday, 8 April 2015

Beyond Short Snippets: Deep Networks for Video Classification



Convolutional Neural Networks (CNNs) have recently shown rapid progress in advancing the state of the art of detecting and classifying objects in static images, automatically learning complex features in pictures without the need for manually annotated features. But what if one wanted not only to identify objects in static images, but also analyze what a video is about? After all, a video isn’t much more than a string of static images linked together in time.

As it turns out, video analysis adds even more information to the object detection and recognition task performed by CNNs, providing a temporal component through which motion and other information can also be used to improve classification. However, analyzing entire videos is challenging from a modeling perspective, because one must model variable length videos with a fixed number of parameters. Not to mention that doing so is computationally very intensive.

In Beyond Short Snippets: Deep Networks for Video Classification, to be presented at the 2015 Computer Vision and Pattern Recognition conference (CVPR 2015), we¹ evaluated two approaches - feature pooling networks and recurrent neural networks (RNNs) - capable of modeling variable length videos with a fixed number of parameters while maintaining a low computational footprint. In doing so, we were able to not only show that learning a high level global description of the video’s temporal evolution is very important for accurate video classification, but that our best networks exhibited significant performance improvements over previously published results on the Sports 1 million dataset (Sports-1M).

In previous work, we employed 3D-convolutions (meaning convolutions over time and space) over short video clips - typically just a few seconds - to learn motion features from raw frames implicitly and then aggregate predictions at the video level. For purposes of video classification, these low-level motion features only marginally outperformed models in which no motion was modeled.

To understand why, consider the following two images which are very similar visually but obtain drastically different scores from a CNN model trained on static images:
Slight differences in object poses/context can change the predicted class/confidence of CNNs trained on static images.
Since each individual video frame forms only a small part of the video’s story, static frames and short video snippets (2-3 seconds) provide incomplete information, and can easily confuse subtle fine-grained distinctions between classes (e.g., Tae Kwon Do vs. Systema) or rely on portions of the video irrelevant to the action of interest.

To get around this frame-by-frame confusion, we used feature pooling networks that independently process each frame and then pool/aggregate the frame-level features over the entire video at various stages. Another approach we took was to utilize an RNN (built from Long Short-Term Memory (LSTM) units) instead of feature pooling, allowing the network itself to decide which parts of the video are important for classification. By sharing parameters through time, both the feature pooling and RNN architectures are able to maintain a constant number of parameters while capturing a global description of the video’s temporal evolution.
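A toy illustration of why pooling over time gives a fixed-size video description regardless of length: however many frames a video has, max-pooling the frame-level CNN features over the time axis yields one vector of the feature dimension, so the classifier on top always has the same number of parameters. (The feature dimension below is just an example.)

```python
import numpy as np

feature_dim = 2048                                # e.g. a CNN's penultimate-layer width
short_clip = np.random.rand(30, feature_dim)      # 30 frames of frame-level features
long_clip = np.random.rand(3000, feature_dim)     # 3000 frames of frame-level features

print(short_clip.max(axis=0).shape)   # (2048,)
print(long_clip.max(axis=0).shape)    # (2048,) -- same size either way
```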

In order to feed the two aggregation approaches, we compute a “pixel-based” CNN model from the raw pixels in the frames of a video. We processed videos for the “pixel-based” CNNs at one frame per second to reduce computational complexity. Of course, at this frame rate implicit motion information is lost.

To compensate, we incorporate explicit motion information in the form of optical flow - the apparent motion of objects across a camera's viewfinder due to the motion of the objects or the motion of the camera. We compute optical flow images over adjacent frames to learn an additional “optical flow” CNN model.
Left: Image used for the pixel-based CNN; Right: Dense optical flow image used for optical flow CNN
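The paper does not prescribe a particular optical flow implementation, but as an illustration, dense flow images of this kind can be computed between adjacent frames with an off-the-shelf method such as OpenCV's Farneback algorithm (the frame filenames and parameter values below are placeholders):

```python
import cv2

# Two consecutive frames extracted from the video (hypothetical filenames).
prev = cv2.cvtColor(cv2.imread("frame_000.jpg"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_001.jpg"), cv2.COLOR_BGR2GRAY)

# flow[y, x] is the (dx, dy) displacement of each pixel between the two frames.
# Arguments after None: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)   # (height, width, 2)
```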
The outputs of the pixel-based and optical-flow-based CNN models are provided as inputs to both the RNN and pooling approaches described earlier. These two approaches then separately aggregate the frame-level predictions from each CNN model and average the results. This allows our video-level prediction to take advantage of both image information and motion information to accurately label videos of similar activities even when the visual content of those videos varies greatly.
Badminton (top 25 videos according to the max-pooling model). Our methods accurately label all 25 videos as badminton despite the variety of scenes in the various videos because they use the entire video’s context for prediction.
We conclude by observing that although very different in concept, the max-pooling and the recurrent neural network methods perform similarly when using both images and optical flow. Currently, these two architectures are the top performers on the Sports-1M dataset. The main difference between the two was that the RNN approach was more robust when using optical flow alone on this dataset. Check out a short video showing some example outputs from the deep convolutional networks presented in our paper.


¹ Research carried out in collaboration with University of Maryland, College Park PhD student Joe Yue-Hei Ng and University of Texas at Austin PhD student Matthew Hausknecht, as part of a Google Software Engineering Internship.

Monday, 8 December 2014

High Quality Object Detection at Scale



Update - 26/02/2015
We recently discovered a bug in the evaluation methodology of our object detector. Consequently, the large numbers we initially reported below are not realistic, because our separately trained context extractor was contaminated with half of the validation set images. Therefore, our initial results were overly optimistic and were not attainable by the methodology described in the paper. Re-evaluating our initial results, we have restricted ourselves to reporting only the single-model results on the other half of the dedicated validation set without retraining the models. With the updated evaluation, we are still able to report the best single-model result on the ILSVRC 2014 detection challenge data set, with 0.43 mAP when combining both Selective Search and MultiBox proposals with our post-classification model. The original draft of our paper "Scalable, High Quality Object Detection" has been updated to reflect this information. We are deeply sorry if our initial reported results caused any confusion in the community. Original post follows below.
-C. Szegedy, S. Reed, D. Erhan, and D. Anguelov

The ILSVRC detection challenge is an influential academic benchmark for measuring the quality of object detection. This summer, the GoogLeNet team reported top results in the 2014 edition of the challenge, with ~2X improvement over the previous year’s best results. However, the quality of our results came at a high computational cost: processing each image took about two minutes on a state-of-the-art workstation.

Naturally, we began to think of how we could both improve the accuracy and reduce the computation time needed. Given the already high quality of previous results like those of GoogLeNet[6], we expected that further improvements to detection quality would be increasingly hard to achieve. In our recent paper Scalable, High Quality Object Detection[7], we detail advances that instead have resulted in an accelerated rate of progress in object detection:
Evolution of detection quality over time. On the y axis is the mean average precision of the best published results at any given time. The blue line shows result using individual models, the red line is multi-model ensembles. Overfeat[8] was the state-of-the-art at end of last year, followed by R-CNN[1] published in May. The later measurement points are the results of our team.[6,7]
As seen in the plot above, the mean average precision has been improved since August from 0.45 to 0.56: a 23% relative gain. The new approach can also match the quality of the former best solution with 140X reduced computational resources.

Most current approaches for object detection employ two phases[1]: in the first phase, some hand-engineered algorithm proposes regions of interest in the image. In the second phase, each proposed region is run through a deep neural network, identifying which proposed patches correspond to an object (and what that object is).

For the first phase, the common wisdom[1,2,3,4] was that it took skillfully crafted code to produce high quality region proposals. This has come with a drawback though: these methods don’t produce reliable scoring for the proposed regions. This forces the second phase to evaluate most of the proposed patches in order to achieve good results.

So we revisited our prior “MultiBox” work[5], in which we let the computer learn to pick the proposals, to see whether we could avoid relying on any of the hand-crafted methods above. Although the MultiBox method, using previous-generation vision network architectures, could not compete with hand-engineered proposal approaches, there were several advantages to relying on machine learning alone. First, the quality of proposals increases with each new improved network architecture or training methodology, without additional programming effort. Second, the regions come with confidence scores, which can be used to trade off running time against quality, as sketched below. Additionally, the implementation is simplified.
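For instance, the running-time/quality trade-off enabled by scored proposals can be as simple as keeping only the top-K proposals by confidence before running the expensive second phase. A small sketch with placeholder boxes and scores:

```python
import numpy as np

def select_proposals(boxes, scores, k):
    """boxes: (N, 4) array, scores: (N,) confidences from the proposal network."""
    order = np.argsort(scores)[::-1]      # highest confidence first
    return boxes[order[:k]]

boxes = np.random.rand(2000, 4)           # stand-ins for proposed regions
scores = np.random.rand(2000)             # stand-ins for learned confidences
for k in (1000, 200, 50):                 # smaller k = faster, possibly lower recall
    print(k, select_proposals(boxes, scores, k).shape)
```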

Once we used new variants of the network architecture introduced in [6], MultiBox also started to perform much better: now we could match the coverage of alternative methods with half as many proposal patches. We also changed our networks to take the context of objects into account, fueling additional quality gains for the second phase. Furthermore, we came up with a new way to train deep networks to learn more robustly even when some objects are not annotated in the training set, which improved both phases.

Besides the significant gains in mean average precision, we can now cut the number of evaluated patches dramatically at a modest loss of quality: the task that used to take the GoogLeNet ensemble (of 6 networks) about two minutes per image on a workstation is now performed in under a second using a single network, without GPUs. If we constrain ourselves to a single category like “dog”, we can now process 50 images/second on the same machine with a more streamlined approach[7] that skips the proposal generation step altogether.

As a core area of research in computer vision, object detection is used for providing strong signals for photo and video search, while high quality detection could prove useful for self-driving cars and automatically generated image captions. We look forward to the continuing research in this field.

References:

[1]  Rich feature hierarchies for accurate object detection and semantic segmentation
by Ross Girshick and Jeff Donahue and Trevor Darrell and Jitendra Malik (CVPR, 2014)

[2]  Prime Object Proposals with Randomized Prim’s Algorithm
by Santiago Manen, Matthieu Guillaumin and Luc Van Gool

[3]  Edge boxes: Locating object proposals from edges
by C. Lawrence Zitnick and Piotr Dollár (ECCV 2014)

[4]  BING: Binarized normed gradients for objectness estimation at 300fps
by Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin and Philip Torr (CVPR 2014)

[5]  Scalable Object Detection using Deep Neural Networks
by Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov

[6]  Going deeper with convolutions
by Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich

[7]  Scalable, high quality object detection
by Christian Szegedy, Scott Reed, Dumitru Erhan and Dragomir Anguelov

[8]  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
by Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus and Yann LeCun


* A PhD student at University of Michigan -- Ann Arbor and Software Engineering Intern at Google

Monday, 17 November 2014

A picture is worth a thousand (coherent) words: building a natural description of images



“Two pizzas sitting on top of a stove top oven”
“A group of people shopping at an outdoor market”
“Best seats in the house”

People can summarize a complex scene in a few words without thinking twice. It’s much more difficult for computers. But we’ve just gotten a bit closer -- we’ve developed a machine-learning system that can automatically produce captions (like the three above) to accurately describe images the first time it sees them. This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images.

Recent research has greatly improved object detection, classification, and labeling. But accurately describing a complex scene requires a deeper representation of what’s going on in the scene, capturing how the various objects relate to one another and translating it all into natural-sounding language.
Automatically captioned: “Two pizzas sitting on top of a stove top oven”
Many efforts to construct computer-generated natural descriptions of images propose combining current state-of-the-art techniques in both computer vision and natural language processing to form a complete image description approach. But what if we instead merged recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human readable sequence of words to describe it?

This idea comes from recent advances in machine translation between languages, where a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German.

Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN’s last layer is used in a final softmax among known classes of objects, assigning a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN’s rich encoding of the image into an RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that the descriptions it produces best match the training descriptions for each image.
The model combines a vision CNN with a language-generating RNN so it can take in an image and generate a fitting natural-language caption.
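To make the data flow concrete, here is a toy, numpy-only sketch of the generation loop: the image encoding is fed to the LSTM first, then words are produced one at a time, with each predicted word fed back in as the next input. The weights here are random, so the output is gibberish; the point is the wiring, not the caption (the real model is trained end-to-end as described above).

```python
import numpy as np

vocab_size, embed_dim, hidden_dim = 1000, 512, 512
rng = np.random.RandomState(0)

W_embed = rng.randn(vocab_size, embed_dim) * 0.01            # word embeddings
W_lstm = rng.randn(embed_dim + hidden_dim, 4 * hidden_dim) * 0.01
W_out = rng.randn(hidden_dim, vocab_size) * 0.01             # hidden -> word scores
START, END = 0, 1                                            # special token ids

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    z = np.concatenate([x, h]).dot(W_lstm)
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def generate_caption(image_embedding, max_len=20):
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    # First, show the LSTM the image encoding (in place of a word embedding).
    h, c = lstm_step(image_embedding, h, c)
    word, caption = START, []
    for _ in range(max_len):
        h, c = lstm_step(W_embed[word], h, c)
        word = int(np.argmax(h.dot(W_out)))   # greedy choice of the next word
        if word == END:
            break
        caption.append(word)
    return caption

cnn_code = rng.randn(embed_dim)   # stands in for the CNN's image encoding
print(generate_caption(cnn_code))
```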
Our experiments with this system on several openly published datasets, including Pascal, Flickr8k, Flickr30k and SBU, show how robust the qualitative results are -- the generated sentences are quite reasonable. It also performs well in quantitative evaluations with the Bilingual Evaluation Understudy (BLEU), a metric used in machine translation to evaluate the quality of generated sentences.
A selection of evaluation results, grouped by human rating.
A picture may be worth a thousand words, but sometimes it’s the words that are most useful -- so it’s important we figure out ways to translate from images to words automatically and accurately. As the datasets suited to learning image descriptions grow and mature, so will the performance of end-to-end approaches like this. We look forward to continuing developments in systems that can read images and generate good natural-language descriptions. To get more details about the framework used to generate descriptions from images, as well as the model evaluation, read the full paper here.

Friday, 5 September 2014

Building a deeper understanding of images



The ImageNet large-scale visual recognition challenge (ILSVRC) is the largest academic challenge in computer vision, held annually to test state-of-the-art technology in image understanding, both in the sense of recognizing objects in images and locating where they are. Participants in the competition include leading academic institutions and industry labs. In 2012 it was won by DNNResearch using the convolutional neural network approach described in the now-seminal paper by Krizhevsky et al.[4]

In this year’s challenge, team GoogLeNet (named in homage to LeNet, Yann LeCun's influential convolutional network) placed first in the classification and detection (with extra training data) tasks, doubling the quality on both tasks over last year's results. The team participated with an open submission, meaning that the exact details of its approach are shared with the wider computer vision community to foster collaboration and accelerate progress in the field.
The competition has three tracks: classification, classification with localization, and detection. The classification track measures an algorithm’s ability to assign correct labels to an image. The classification with localization track is designed to assess how well an algorithm models both the labels of an image and the location of the underlying objects. Finally, the detection challenge is similar, but uses much stricter evaluation criteria. As an additional difficulty, this challenge includes a lot of images with tiny objects which are hard to recognize. Superior performance in the detection challenge requires pushing beyond annotating an image with a “bag of labels” -- a model must be able to describe a complex scene by accurately locating and identifying many objects in it. As examples, the images in this post are actual top-scoring inferences of the GoogLeNet detection model on the validation set of the detection challenge.
This work was a concerted effort by Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Drago Anguelov, Dumitru Erhan, Andrew Rabinovich and myself. Two of the team members -- Wei Liu and Scott Reed -- are PhD students who are a part of the intern program here at Google, and actively participated in the work leading to the submissions. Without their dedication the team could not have won the detection challenge.

This effort was accomplished by using the DistBelief infrastructure, which makes it possible to train neural networks in a distributed manner and rapidly iterate. At the core of the approach is a radically redesigned convolutional network architecture. Its seemingly complex structure (typical incarnations consist of over 100 layers, with a maximum depth of over 20 parameter layers) is based on two insights: the Hebbian principle and scale invariance. As a consequence of a careful balancing act, the depth and width of the network are both increased significantly at the cost of a modest growth in evaluation time. The resulting architecture leads to more than a 10x reduction in the number of parameters compared to most state-of-the-art vision networks, which reduces overfitting during training and allows our system to perform inference with a low memory footprint.
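As a rough illustration of the kind of building block involved, here is a sketch of one "Inception" module in the style of the GoogLeNet architecture described above: parallel 1x1, 3x3 and 5x5 convolutions plus a pooled branch, with cheap 1x1 convolutions reducing channel counts before the costlier filters. The filter counts below are illustrative; the paper lists the exact configuration for each module.

```python
import tensorflow as tf

def inception_module(inputs, f1x1, f3x3_reduce, f3x3, f5x5_reduce, f5x5, f_pool):
    conv = lambda x, filters, k: tf.layers.conv2d(x, filters, k, padding="same",
                                                  activation=tf.nn.relu)
    branch1 = conv(inputs, f1x1, 1)
    branch2 = conv(conv(inputs, f3x3_reduce, 1), f3x3, 3)   # 1x1 reduction, then 3x3
    branch3 = conv(conv(inputs, f5x5_reduce, 1), f5x5, 5)   # 1x1 reduction, then 5x5
    pooled = tf.layers.max_pooling2d(inputs, pool_size=3, strides=1, padding="same")
    branch4 = conv(pooled, f_pool, 1)
    # The branches look at the same input at several scales and are concatenated
    # along the channel dimension.
    return tf.concat([branch1, branch2, branch3, branch4], axis=-1)

images = tf.placeholder(tf.float32, [None, 224, 224, 3])
stem = tf.layers.conv2d(images, 64, 7, strides=2, padding="same", activation=tf.nn.relu)
block = inception_module(stem, 64, 96, 128, 16, 32, 32)
print(block.shape)   # (?, 112, 112, 256)
```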
For the detection challenge, the improved neural network model is used in the sophisticated R-CNN detector by Ross Girshick et al.[2], with additional proposals coming from the MultiBox method[1]. For the classification challenge entry, several ideas from the work of Andrew Howard[3] were incorporated and extended, specifically as they relate to image sampling during training and evaluation. The systems were evaluated both stand-alone and as ensembles (averaging the outputs of up to seven models), and their results were submitted as separate entries for transparency and comparison.

These technological advances will enable even better image understanding on our side and the progress is directly transferable to Google products such as photo search, image search, YouTube, self-driving cars, and any place where it is useful to understand what is in an image as well as where things are.

References:

[1] Erhan, D., Szegedy, C., Toshev, A., and Anguelov, D., "Scalable Object Detection using Deep Neural Networks", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2147-2154.

[2] Girshick, R., Donahue, J., Darrell, T., & Malik, J., "Rich feature hierarchies for accurate object detection and semantic segmentation", arXiv preprint arXiv:1311.2524, 2013.

[3] Howard, A. G., "Some Improvements on Deep Convolutional Neural Network Based Image Classification", arXiv preprint arXiv:1312.5402, 2013.

[4] Krizhevsky, A., Sutskever, I., and Hinton, G., "ImageNet classification with deep convolutional neural networks", Advances in Neural Information Processing Systems, 2012.