
Question 1

With the following code

import numpy as np

A = np.array([[1, 2], [3, 4]])

B = np.array([[5, 6], [5, 6]])

B is:

A.  A numpy matrix

B.  An ordinary Python list (of lists)

C.  A numpy array

Question 2

With the above code and

A*B

the result is

A.  Concatenation of the two lists

B.  A matrix product matrix([[15, 18], [35, 42]])

C.  An elementwise product of the matrix elements: array([[ 5, 12], [15, 24]])

D.  TypeError: can't multiply sequence by non-int of type 'list'

Hint: you need to be familiar with the difference between elementwise multiplication and matrix multiplication
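
A quick check in the interpreter makes the difference concrete (a minimal sketch reusing the arrays from Question 1):

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [5, 6]])

print(A * B)                        # elementwise product: [[5, 12], [15, 24]]
print(A @ B)                        # matrix product:      [[15, 18], [35, 42]]
print(np.matrix(A) * np.matrix(B))  # with np.matrix objects, * means matrix product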

Question 3

Can deep neural networks be trained in an unsupervised way?

Yes

No

Hint: what is unsupervised learning? -> What is the difference between unsupervised and supervised learning? -> What kinds of algorithms are trained in an unsupervised way? -> Can we apply them to a DNN?

Note: generative models can be considered a kind of self-supervised learning (a form of unsupervised learning).
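
As a concrete illustration of the note above, a deep network can be trained without any labels by reconstructing its own input, as in an autoencoder. The sketch below is a minimal, hypothetical example (layer sizes and data are arbitrary), not part of the original question:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))              # unlabeled data: 100 samples, 8 features

# one-hidden-layer autoencoder (8 -> 3 -> 8) trained only on reconstruction error
W1 = rng.normal(scale=0.1, size=(8, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.1, size=(3, 8)); b2 = np.zeros(8)
lr = 0.01

for step in range(1000):
    H = np.tanh(X @ W1 + b1)               # encoder
    X_hat = H @ W2 + b2                    # decoder (linear output)
    err = (X_hat - X) / len(X)             # gradient of the mean squared reconstruction loss
    dW2 = H.T @ err;  db2 = err.sum(axis=0)
    dH = err @ W2.T * (1 - H**2)           # backprop through tanh
    dW1 = X.T @ dH;   db1 = dH.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
# no labels were used anywhere: the training signal is the input itself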

Question 4

There are exactly six fish tanks in a room of the aquarium. The six tanks contain the following numbers of fish:

x1 = 5, x2 = 5, x3 = 8, x4 = 12, x5 = 15, x6 = 18. The variance of the population is

A.  10.5

B.  24.25

C.  29.1

D.  145.5
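
The arithmetic can be checked directly with numpy; note that each option corresponds to a different quantity (mean, sum of squared deviations, population variance, sample variance):

import numpy as np

x = np.array([5, 5, 8, 12, 15, 18])
print(x.mean())                      # 10.5   -> the mean, not a variance
print(((x - x.mean())**2).sum())     # 145.5  -> sum of squared deviations
print(np.var(x, ddof=0))             # 24.25  -> population variance (divide by n)
print(np.var(x, ddof=1))             # 29.1   -> sample variance (divide by n - 1)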

Question 5

Given s as the input of the Tanh function, when would the Tanh function lead to the vanishing gradient problem?

A.   s ⟶ 1

B.   s ⟶ 0

C.  s ⟶ −∞

D.  s ⟶ e

Hint: go back and check the figure of Tanh, and think about what the vanishing gradient problem is.
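
The hint can be made concrete numerically: the derivative of tanh is 1 − tanh²(s), which is largest at s = 0 and shrinks towards zero as |s| grows (a minimal sketch):

import numpy as np

def tanh_grad(s):
    return 1.0 - np.tanh(s)**2       # d/ds tanh(s)

for s in [0.0, 2.0, 5.0, -5.0, -20.0]:
    print(s, tanh_grad(s))           # 1.0, ~0.07, ~1.8e-4, ~1.8e-4, ~0
# the gradient vanishes as s -> +infinity or s -> -infinity, not near s = 0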

Question 6

Suppose we have trained a neural network (ReLU on the hidden layer) for a 3-category classification problem, in which the weights and bias are

Consider a test example [1, −1, 2, −3]^T.

1)   What is the output of the network?

2)   The ground-truth output is [0, 1, 0]^T. Given the squared loss function ‖y − ŷ‖², what is the prediction error of this test example?

3)   Given a softmax layer before y, what is the output of the softmax layer?

4)   Given a softmax layer before y, what is the final cross-entropy loss of this test example?

Hint: follow the forward propagation to get the output. Note: the final layer also contains the activation function.

1)   [0.12, 0.00, 0.99]^T

2)   0.99

3)   [0.2378, 0.1947, 0.5675]^T

4)   1.6363
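
Since the weight matrices and biases from the original question are not reproduced here, the sketch below only shows the machinery behind the answers above; plugging in the actual parameters (W1, b1, W2, b2, written as placeholders) would reproduce the numbers:

import numpy as np

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

x = np.array([1, -1, 2, -3])             # test example
y_true = np.array([0, 1, 0])             # ground-truth output
# h = relu(W1 @ x + b1)                  # hidden layer (W1, b1 from the question)
# y_hat = relu(W2 @ h + b2)              # output for part 1); the final layer also uses ReLU per the hint
# sq_err = np.sum((y_true - y_hat)**2)   # squared loss for part 2) (a 1/2 factor may apply, depending on the convention)
# p = softmax(y_hat)                     # part 3)
# ce = -np.sum(y_true * np.log(p))       # cross-entropy for part 4)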

Question 7

(a) Can we initialize a deep neural network with all zeros before optimization? What is the reason? What would be a better way to initialize a deep neural network?

(b) There are many popular CNN architectures, such as AlexNet, VGG, GoogLeNet, and ResNet. Choose at least two CNN architectures that you are familiar with. Introduce their major characteristics, and explain how these characteristics lead to performance improvement of the networks.

Hint: (a) if we have a zero-initialized network, that means all weights and biases are zero. -> What happens if we multiply by zero and then add zero? (check week 2's slides, pages 34-46). (b) Explain and compare their model designs (major characteristics), and discuss why these designs lead to performance improvements (check week 7's slides).
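
For part (a), a small experiment makes the issue visible: with all-zero weights and a ReLU hidden layer, every hidden activation is zero, so every weight gradient is zero and the weights never leave zero (a minimal sketch with an arbitrary 2-2-1 network, not taken from the slides):

import numpy as np

x = np.array([1.0, 2.0]); y = 1.0
W1 = np.zeros((2, 2)); b1 = np.zeros(2)     # zero initialization
W2 = np.zeros(2);      b2 = 0.0

h = np.maximum(W1 @ x + b1, 0)              # ReLU hidden layer -> [0, 0]
y_hat = W2 @ h + b2                         # output -> 0
err = y_hat - y                             # gradient of 0.5 * (y_hat - y)**2

dW2 = err * h                               # [0, 0]
dh = err * W2 * (h > 0)                     # [0, 0]: nothing flows back through ReLU(0)
dW1 = np.outer(dh, x)                       # all zeros
print(dW1, dW2)                             # the weights receive no update at all
# even with other activations, identical initial weights give every hidden unit
# identical gradients (the symmetry problem); small random initialization, e.g.
# Xavier or He initialization, breaks this symmetry:
# W1 = np.random.randn(2, 2) * np.sqrt(2 / 2)    # He initialization for ReLU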

Question 8

Consider a deep learning model with multiple hidden layers. Each hidden layer uses a different activation function. The first hidden layer uses the rectified linear unit (ReLU) activation function, the second hidden layer uses the hyperbolic tangent (tanh) activation function, and the third hidden layer uses the sigmoid activation function. The output layer employs the softmax activation function for multi-class classification.

a)  Explain the characteristics of each activation function mentioned above and discuss their advantages and disadvantages in the context of deep learning.

b) Given a dataset with a large number of outliers, which activation function would you choose for the hidden layers, and why? Justify your answer.

c) How would you modify the model's architecture to handle a regression task instead of multi-class classification while maintaining the same activation functions for the hidden layers? Explain the changes you would make and the reasons behind them.

Hint:

a) ReLU: sets negative values to zero and helps mitigate the vanishing gradient problem, but can suffer from the "dying ReLU" issue. Tanh: ranges from −1 to 1 and is zero-centered, but can saturate at extreme values, which leads to vanishing gradients. Sigmoid: squashes values between 0 and 1 and suffers from vanishing gradients for large-magnitude inputs. Softmax: used in multi-class classification; converts values into a probability distribution.

b) Use ReLU for hidden layers with outliers. ReLU is less sensitive to outliers and suppresses their impact.

c) Change output layer to linear activation for regression. It allows direct prediction of continuous values without squashing or normalization.
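
The saturation behavior described in the hint for (a) can be verified by evaluating the gradient of each activation for small and large inputs (a minimal sketch):

import numpy as np

def activation_grads(s):
    relu_grad = float(s > 0)                 # d/ds ReLU(s)
    tanh_grad = 1 - np.tanh(s)**2            # d/ds tanh(s)
    sig = 1 / (1 + np.exp(-s))
    sig_grad = sig * (1 - sig)               # d/ds sigmoid(s)
    return relu_grad, tanh_grad, sig_grad

for s in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(s, activation_grads(s))
# ReLU keeps a gradient of 1 for every positive input (but 0 for negative ones:
# the "dying ReLU" issue); tanh and sigmoid gradients shrink towards 0 as |s|
# grows, which is the saturation / vanishing-gradient behavior from the hint.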

Question 9

In MLPs, dropout is typically applied to the activations of hidden units. How can we extend the idea of dropout to Recurrent Neural Network (RNN) architectures?

Hint:

Dropout is typically applied between the hidden layers of an MLP. During training, a fraction of neurons in each layer is randomly selected and temporarily "dropped out" by setting their outputs to zero. Dropout helps to prevent overfitting by reducing the reliance of the model on specific neurons and promoting the learning of more robust features. Dropout introduces noise during training, which acts as a regularization technique by implicitly averaging over a large number of different network architectures.

RNNs are specifically designed to model sequential data and capture temporal dependencies. Dropout in RNNs needs to be applied carefully to preserve these dependencies. For example, in RNNs, the same dropout mask could be applied across all time steps to maintain consistency and allow the network to learn dependencies across time. In contrast, MLPs do not have inherent sequential structures, so dropout can be applied more straightforwardly to individual layers without considering temporal dependencies.
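
A sketch of that idea (often called variational dropout): sample one dropout mask per sequence and reuse it at every time step instead of re-sampling it each step. The example below uses an arbitrary vanilla RNN cell in numpy and is only an illustration, not a specific library's API:

import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 4, 8                       # sequence length, input size, hidden size
X = rng.normal(size=(T, d_in))
W_xh = rng.normal(scale=0.1, size=(d_in, d_h))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
p_keep = 0.8

# one mask per sequence, shared across ALL time steps (inverted-dropout scaling)
mask = (rng.random(d_h) < p_keep) / p_keep

h = np.zeros(d_h)
for t in range(T):
    h = np.tanh(X[t] @ W_xh + (h * mask) @ W_hh)   # the same mask is reused each step
# re-sampling the mask inside the loop would inject different noise at every time
# step and disrupt the temporal dependencies the recurrent connection is learning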