Introduction to Convolutional Neural Networks for Computer Vision

The classic method of fine-grained image classification is to first define different locations on the image, for example, the head, foot, or wings of a bird. Then we have to extract features from these locations, and finally, combine these features and use them to complete classification. This type of method features very high accuracy, but it requires a massive dataset and manual tagging of location information. One major trend in fine-grained classification is training without additional supervision information, instead of using only image notes. This method gets represented by the bilinear convolutional neural network (CNN) method.

Bilinear CNN computes the outer-product of convolution descriptors to find the mutual relationships between different dimensions. Because the dimensions of different descriptors correspond to different channels for convolution features, and different channels extract different semantic features, using bilinear operation allows us to capture the relationships between different semantic elements on the input image.

Image captioning is the process of generating a one or two sentence description of an image. The basic idea behind designing an image captioning network gets based on the concept behind machine translation in the field of natural language processing. After replacing the source language encoding network in a machine translator with an image CNN encoding network and extracting the features of the image, we can use the decoder network for the target language to create a text description.

Given an image and a question related to that image, visual question answering aims to answer that question from a selection of candidate answers. The concept is to use CNN to extract features from an image, RNN to extract text features from the text question, then combine the visual and textual features, and finally perform classification using the fully-connected later. The key to this task is figuring out how to connect these two types of features. Methods that directly combine these features transform them into a vector, or add or produce the visual and textual vector by adding or multiplying the elements.

For more information, please go to Deep Dive into Computer Vision with Neural Networks – Part 1.

Related Products

Container Service

Container Service is a high-performance and scalable container application management service that enables you to use Docker and Kubernetes to manage the lifecycle of containerized applications. It is able to run TensorFlow-based AlexNet for deep learning tasks in the computer vision field.

Realtime Compute

Realtime Compute offers a one-stop, high-performance platform that enables real-time big data processing based on Apache Flink. It is widely used in diverse scenarios, such as streaming data processing, offline data processing, and data lake computing. With Realtime Compute, you can process and analyze big data in real time for business insights and decision making.

Community

Introduction to Convolutional Neural Networks for Computer Vision

Related Blog Posts

Deep Dive into Computer Vision with Neural Networks – Part 2

Geoffrey Hinton's Capsule Networks: A Novel Approach to Deep Learning

Related Documentation

Implement image classification by TensorFlow

Run TensorFlow-based AlexNet in Alibaba Cloud Container Service

Related Products

Container Service

Realtime Compute

Read previous post:

Read next post:

Alibaba Clouder

You may also like

Comments

Alibaba Clouder

Related Products

E-MapReduce Service

Platform For AI