# Transfer Learning
In practice, very few people train an entire Convolutional Network from scratch (with random initialization), because it is relatively rare to have a dataset of sufficient size. Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories), and then use the ConvNet either as an initialization or a fixed feature extractor for the task of interest. 

## When and how to fine-tune? 
How do you decide what type of transfer learning you should perform on a new dataset? This is a function of several factors, but the two most important ones are the size of the new dataset (small or big), and its similarity to the original dataset (e.g. ImageNet-like in terms of the content of images and the classes, or very different, such as microscope images). Keeping in mind that ConvNet features are more generic in early layers and more original-dataset-specific in later layers, here are some common rules of thumb for navigating the 4 major scenarios:

- **New dataset is small and similar to original dataset** 
Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.

- **New dataset is large and similar to the original dataset** 
Since we have more data, we can have more confidence that we won't overfit if we were to try to fine-tune through the full network.

- **New dataset is small but very different from the original dataset** 
Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train the linear classifier (e.g. an SVM) on activations from somewhere earlier in the network.

- **New dataset is large and very different from the original dataset**
Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.

([description proudly taken from Stanford](http://cs231n.github.io/transfer-learning/))
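
The four rules of thumb above can be condensed into a small lookup. This is only an illustrative sketch: the function name `suggest_strategy` and its boolean arguments are hypothetical, and the returned strings simply paraphrase the scenarios listed above.

```python
def suggest_strategy(is_large, is_similar):
    """Return the rule-of-thumb strategy for a new dataset (illustrative only)."""
    if not is_large and is_similar:
        return "train a linear classifier on the CNN codes"
    if is_large and is_similar:
        return "fine-tune through the full network"
    if not is_large and not is_similar:
        return "train a linear classifier on activations from earlier layers"
    # large and very different from the original dataset
    return "initialize from pretrained weights and fine-tune the entire network"

# Print the suggestion for each of the four scenarios.
for is_large in (False, True):
    for is_similar in (True, False):
        print(is_large, is_similar, "->", suggest_strategy(is_large, is_similar))
```

In practice "small" and "similar" are judgment calls rather than hard thresholds, so the lookup is a starting point, not a decision procedure.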

In [38]:
import h5py
from PIL import Image
from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential, Model 
from keras.layers import Input, Dropout, Flatten, Dense, GlobalAveragePooling2D
from keras import backend as k 
from keras.callbacks import ModelCheckpoint, LearningRateScheduler, TensorBoard, EarlyStopping
from keras.applications.inception_v3 import preprocess_input

img_width, img_height = 224, 224
train_data_dir = "/home/courses/data/bootcamp/CUB/validation_224x224"
validation_data_dir = "/home/courses/data/bootcamp/CUB/train_224x224"
nb_train_samples = 10610
nb_validation_samples = 1000 
batch_size = 16
epochs = 1

Keras has several built-in models with pre-trained weights. The weights were trained on the ImageNet dataset. We aim to keep the weights of the first layers of the network. These act as low-level feature detectors, while the higher layers integrate dataset-specific characteristics and patterns.

In [3]:
base_model = applications.InceptionV3(weights='imagenet', include_top=False)

## InceptionV3 model architecture

In [4]:
base_model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, None, None, 3) 0                                            
____________________________________________________________________________________________________
conv2d_1 (Conv2D)                (None, None, None, 32 864         input_1[0][0]                    
____________________________________________________________________________________________________
batch_normalization_1 (BatchNorm (None, None, None, 32 96          conv2d_1[0][0]                   
____________________________________________________________________________________________________
activation_1 (Activation)        (None, None, None, 32 0           batch_normalization_1[0][0]      
___________________________________________________________________________________________

Next, we load the data. Augmentation, which can improve the generalization ability of our discriminator (the new classification head), is switched off in the first cell below and switched on in the second.

In [5]:
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    horizontal_flip=False,
    zoom_range = 0,
    width_shift_range = 0,
    height_shift_range=0,
    rotation_range=0)

test_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    horizontal_flip=False,
    zoom_range = 0,
    width_shift_range = 0,
    height_shift_range=0,
    rotation_range=0)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size,
    class_mode = "categorical")

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size,  # use the same batch size as for training
    class_mode = "categorical")

Found 10610 images belonging to 200 classes.
Found 1178 images belonging to 200 classes.


In [6]:
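# Same generators as before, but with augmentation switched on.
# The new settings only take effect once flow_from_directory is called again
# on these objects to rebuild train_generator and validation_generator.
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    horizontal_flip=True,
    zoom_range=0.3,
    width_shift_range=0.3,
    height_shift_range=0.3,
    rotation_range=30)

test_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    horizontal_flip=True,
    zoom_range=0.3,
    width_shift_range=0.3,
    height_shift_range=0.3,
    rotation_range=30)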
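
To make one of these options concrete, here is a minimal NumPy sketch of what a random horizontal flip does to a single image. It is not how Keras implements `horizontal_flip` internally; the helper name `random_horizontal_flip` and the toy 2x3 image are assumptions made for illustration.

```python
import numpy as np

def random_horizontal_flip(image, rng, p=0.5):
    """Mirror a (height, width, channels) image left-to-right with probability p."""
    if rng.random() < p:
        return image[:, ::-1, :]
    return image

rng = np.random.default_rng(0)
image = np.arange(2 * 3 * 1).reshape(2, 3, 1)        # toy image: 2 rows, 3 columns, 1 channel
flipped = random_horizontal_flip(image, rng, p=1.0)  # p=1.0 forces the flip

print(image[:, :, 0])
print(flipped[:, :, 0])
```

The class label is unchanged by the flip, which is why this kind of augmentation can add variety to the training set without new annotations.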

In [18]:
def createBaselineModel(nb_classes, FC_SIZE):
  """ create a pretty simple baseline model
  Args:
    nb_classes: # of classes
    FC_SIZE: # of hidden neurons
  Returns:
    compiled keras model
  """
  inp = Input(shape=(img_height, img_width, 3))
  x = GlobalAveragePooling2D()(inp)  # averages each color channel over the whole image
  x = Dense(FC_SIZE, activation='relu')(x)
  predictions = Dense(nb_classes, activation='softmax')(x)
  model = Model(inputs=inp, outputs=predictions)  # Keras 2 keyword names
  model.compile(optimizer=optimizers.SGD(lr=0.0001, momentum=0.9),
                loss='categorical_crossentropy', metrics=['accuracy'])
  return model

**The following function allows us to stack a custom layer and discriminator on top of the pre-trained model in order to solve our specific task.**

In [19]:
def add_new_discriminator(base_model, nb_classes, FC_SIZE):
  """Add discriminator to our convnet
  Args:
    base_model: keras model excluding top
    nb_classes: # of classes
    FC_SIZE: # of hidden neurons
  Returns:
    new keras model with last layer
  """
  x = base_model.output
  x = GlobalAveragePooling2D()(x)  # collapse the spatial dimensions to one value per channel
  x = Dense(FC_SIZE, activation='relu')(x)
  predictions = Dense(nb_classes, activation='softmax')(x)
  model = Model(inputs=base_model.input, outputs=predictions)  # Keras 2 keyword names
  return model
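
As a rough illustration of what `GlobalAveragePooling2D` contributes to the new head, the NumPy sketch below averages a feature map over its two spatial axes, leaving one value per channel. The array shapes are made up for the example and are much smaller than the real InceptionV3 output.

```python
import numpy as np

# Toy feature map: batch of 2, spatial size 4x4, 3 channels (channels-last layout).
features = np.arange(2 * 4 * 4 * 3, dtype=float).reshape(2, 4, 4, 3)

# Global average pooling: take the mean over the height and width axes.
pooled = features.mean(axis=(1, 2))

print(features.shape, "->", pooled.shape)
```

Because the spatial axes are averaged away, the dense layers that follow see a fixed-length vector regardless of the input image size.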

**The first step in transfer learning is to freeze layers of our base model. Depending on our data, we have to think about which layers should be frozen.**

In [40]:
def setup_to_transfer_learn(model, base_model, trainNoOfFirstLayers=0, trainNoOfLastLayers=0):
  """Freeze the base model except for its first/last N layers and compile the model"""
  noLayers = len(base_model.layers)
  print("Our model consists of " + str(noLayers) + " layers.")

  # make sure that all layers are set to trainable
  for layer in base_model.layers:
    layer.trainable = True

  # freeze everything between the first trainNoOfFirstLayers and the last trainNoOfLastLayers layers
  for layer in base_model.layers[trainNoOfFirstLayers : noLayers-trainNoOfLastLayers]:
    layer.trainable = False

  # compile after changing the trainable flags so that they take effect
  model.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
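
The slice in `setup_to_transfer_learn` is easy to misread, so here is a framework-free sketch of the same indexing. `FakeLayer` and `trainable_indices` are hypothetical stand-ins, not Keras classes; only the slicing logic mirrors the function above.

```python
class FakeLayer:
    """Hypothetical stand-in for a Keras layer; it only carries a trainable flag."""
    def __init__(self):
        self.trainable = True

def trainable_indices(n_layers, train_first=0, train_last=0):
    """Return the indices of layers left trainable after applying the same slice."""
    layers = [FakeLayer() for _ in range(n_layers)]
    for layer in layers[train_first : n_layers - train_last]:
        layer.trainable = False
    return [i for i, layer in enumerate(layers) if layer.trainable]

# 311 layers, last 100 kept trainable: layers 0..210 are frozen.
idx = trainable_indices(311, train_first=0, train_last=100)
print(idx[0], idx[-1], len(idx))

# With both counts at 0, every layer of the base model is frozen.
print(trainable_indices(311))
```

Note that the counts refer to layers of the base model only; the newly added dense layers are not part of `base_model.layers` and stay trainable either way.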

**Create and fit a baseline model**

In [None]:
baseline_model = createBaselineModel(200, 1024)
baseline_model.summary()
baseline_model.fit_generator(
    train_generator,
    steps_per_epoch = nb_train_samples // batch_size,  # Keras 2 counts batches, not samples
    epochs = epochs,
    validation_data = validation_generator,
    validation_steps = nb_validation_samples // batch_size,
    workers=32
)
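
In Keras 2, `fit_generator` measures an epoch in batches (`steps_per_epoch`) rather than in samples. The small worked example below, using the sample counts and batch size defined earlier in this notebook, shows how the two relate. The helper `steps_for` is hypothetical, and whether to round up (include the last partial batch) or down (drop it) is a choice, not a rule.

```python
import math

nb_train_samples = 10610
nb_validation_samples = 1000
batch_size = 16

def steps_for(n_samples, batch_size, include_partial_batch=True):
    """Number of batches needed to pass over n_samples once."""
    if include_partial_batch:
        return math.ceil(n_samples / batch_size)
    return n_samples // batch_size

print(steps_for(nb_train_samples, batch_size))         # 664
print(steps_for(nb_train_samples, batch_size, False))  # 663
print(steps_for(nb_validation_samples, batch_size))    # 63
```

Passing a sample count where a step count is expected would make each epoch roughly `batch_size` times longer than intended.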

After freezing the layers of our base model, we only have to train our new discriminator. [(Yosinski et al., 2014)](https://arxiv.org/abs/1411.1792) argue that the lower layers of a deep neural network architecture typically represent feature detectors that are independent of the current dataset and learning method. Thus, it might not be required to retrain these layers.

In [41]:
model_final = add_new_discriminator(base_model, 200, 1024)
setup_to_transfer_learn(model_final, base_model, 0, 100) # fine-tune the uppermost 100 layers of our base model

Our model consists of 311 layers.




**Let's have a look at the structure of our new, final model.**

In [42]:
model_final.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, None, None, 3) 0                                            
____________________________________________________________________________________________________
conv2d_1 (Conv2D)                (None, None, None, 32 864         input_1[0][0]                    
____________________________________________________________________________________________________
batch_normalization_1 (BatchNorm (None, None, None, 32 96          conv2d_1[0][0]                   
____________________________________________________________________________________________________
activation_1 (Activation)        (None, None, None, 32 0           batch_normalization_1[0][0]      
___________________________________________________________________________________________

In [43]:
# Train the model 
model_final.fit_generator(
    train_generator,
    steps_per_epoch = nb_train_samples // batch_size,  # Keras 2 counts batches, not samples
    epochs = epochs,
    validation_data = validation_generator,
    validation_steps = nb_validation_samples // batch_size)



Epoch 1/1


<keras.callbacks.History at 0x2ab0a4563860>

**Finally, we freeze all the layers of our base model and just train our discriminator.**

In [44]:
model_final_2 = add_new_discriminator(base_model, 200, 1024)
setup_to_transfer_learn(model_final_2, base_model, 0, 0) # don't fine-tune our base model
model_final_2.summary()



Our model consists of 311 layers.
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, None, None, 3) 0                                            
____________________________________________________________________________________________________
conv2d_1 (Conv2D)                (None, None, None, 32 864         input_1[0][0]                    
____________________________________________________________________________________________________
batch_normalization_1 (BatchNorm (None, None, None, 32 96          conv2d_1[0][0]                   
____________________________________________________________________________________________________
activation_1 (Activation)        (None, None, None, 32 0           batch_normalization_1[0][0]      
_________________________________________________________

In [None]:
# Train the model 
model_final_2.fit_generator(
    train_generator,
    steps_per_epoch = nb_train_samples // batch_size,  # Keras 2 counts batches, not samples
    epochs = epochs,
    validation_data = validation_generator,
    validation_steps = nb_validation_samples // batch_size)



Epoch 1/1


<keras.callbacks.History at 0x2ab0a605b048>

# Your turn
Retrain your network while freezing a different number of layers.

- Why would you train the lower layers? In which cases would you do this?
- What influence do you expect this to have on the overall accuracy, and what do you actually observe?