
QBUS6850 Final Exam Sample and Solution

Instructions:

(A) There will be no Multiple Choice Questions in the final exam, only Short Answer Questions. The number of questions in this sample may be more than the number of questions in the actual final exam. The purpose here is to show you more question types.

(B) Those in Green are instructions, not part of the solutions or answers.

(C) Those in Red are what should be typed in the answer boxes if the question appears in the actual exam.

(D) Those in Blue are working details. You may not need to type them all into the answer boxes, which saves time. However, what you type should reflect your working flow logically and clearly, so that the marker can understand you.

(E) Some notation can easily be typed in plain text, e.g., the superscript in X^T is typed as X^T, and the subscript in X_b is typed as X_b.

A matrix like

X =
1 0 0
1 0 1
1 1 0
1 1 1

may be typed as X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]. This is the Python way, row by row.

Derivatives may also be written in plain text, e.g., dL(W)/dW2.

Solutions

Short Answer Questions

1. In Lecture 8, you studied Deep Learning for graph data. List TWO points where learning algorithms for graph data differ from conventional deep learning algorithms.

Possible Answers:

(1) The data relation (link) information is taken into account.

(2) Data (node) information is aggregated from neighbouring nodes.

(3) Node and link learning tasks are more likely to be semi-supervised.

(4) All the training data (as a graph) must be taken as "one entire" input to the network.

Please use your own words and your own structure to organize the answer better for similar questions in the exam.

2. Describe the major steps of the AdaBoost classification algorithm: (1) calculating the voting power; (2) updating the weight for each example.

Answer: Note that each example has a weight w (at the first iteration each w is 1/N).

(1)   Consider all the given classifiers and list the examples misclassified by each classifier, e.g., the table on slide 22 of Lecture 5;

(2) Find the best classifier in terms of misclassification rate (a classifier's misclassification rate is defined as the sum of the weights of the examples it misclassifies); then the voting power of the best classifier is alpha = (1/2) log((1 - epsilon)/epsilon), where epsilon is the misclassification rate of the best classifier.

(3) For each example, update its weight according to whether it is correctly classified by the best classifier, using the formula from the formula sheet.
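For concreteness, here is a minimal Python sketch of one AdaBoost round, assuming a toy set of pre-built classifiers and made-up labels (not the lecture's example):

import numpy as np

# Toy setup: predictions of M candidate classifiers on N examples.
# preds[m, i] is classifier m's label for example i; y holds true labels (+1/-1).
preds = np.array([[ 1,  1, -1,  1],
                  [ 1, -1,  1,  1],
                  [-1,  1,  1, -1]])
y = np.array([1, 1, 1, -1])
N = len(y)
w = np.ones(N) / N                      # initial weights: 1/N each

# Weighted misclassification rate of each classifier.
errors = np.array([w[preds[m] != y].sum() for m in range(preds.shape[0])])
best = errors.argmin()                  # the "best" classifier this round
eps = errors[best]
alpha = 0.5 * np.log((1 - eps) / eps)   # voting power

# Re-weight examples: up-weight mistakes, down-weight correct ones.
w = w * np.exp(-alpha * y * preds[best])
w = w / w.sum()                         # normalise so weights sum to 1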

3. In the AdaBoost example in the lecture, we choose the "best" classifier from a given set of classifiers in each step. How would you change these steps of picking the "best" classifier into a learning process that learns a classifier?

Possible Answer:

We can regard the process of choosing the best classifier from a given set of classifiers in AdaBoost as a learning process. Thus, in the general case, what we need to do is use a learning algorithm, such as a decision tree, to learn a model according to the data importance (i.e., the weights).

4. Given the following graph with nine nodes v1, ..., v9 (shown as a figure in the original):

(1) What is its adjacency matrix A? What is its adjacency matrix with self-loops, A~ = A + I?

(2) Who are the direct neighbours of node v5?

(3) Who are the 2-hop neighbours (including direct neighbours) of node v1?

Answers:

(1) The adjacency matrix is a 0-1 matrix of size 9x9 (9 nodes), where a 1 at position (i, j) means there is an edge between nodes v_i and v_j. It is

A =
0    0    0    1    1    0    0    0    0
0    0    0    1    1    0    0    0    0
0    0    0    0    0    1    1    0    0
1    1    0    0    1    0    0    1    0
1    1    0    1    0    1    0    1    0
0    0    1    0    1    0    1    0    1
0    0    1    0    0    1    0    0    1
0    0    0    1    1    0    0    0    1
0    0    0    0    0    1    1    1    0

Here is how we get it. For the first row of A, we check node v1 in the graph to see which nodes are its neighbours; they are nodes v4 and v5, so we put a 1 in columns 4 and 5 of the first row. We obtain the other rows of A in the same way.

A~ = A + I =
1    0    0    1    1    0    0    0    0
0    1    0    1    1    0    0    0    0
0    0    1    0    0    1    1    0    0
1    1    0    1    1    0    0    1    0
1    1    0    1    1    1    0    1    0
0    0    1    0    1    1    1    0    1
0    0    1    0    0    1    1    0    1
0    0    0    1    1    0    0    1    1
0    0    0    0    0    1    1    1    1

(2) The direct neighbours of node v5 are v1, v2, v4, v6, v8.

(3) A 2-hop neighbour is a node connected by a path of two edges or by a single edge; here the answer is v2, v4, v5, v6, v8.
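These answers can be checked with a short numpy sketch (assuming the node numbering v1 to v9 used above):

import numpy as np

# Adjacency matrix A from part (1); rows/columns correspond to nodes v1..v9.
A = np.array([
    [0,0,0,1,1,0,0,0,0],
    [0,0,0,1,1,0,0,0,0],
    [0,0,0,0,0,1,1,0,0],
    [1,1,0,0,1,0,0,1,0],
    [1,1,0,1,0,1,0,1,0],
    [0,0,1,0,1,0,1,0,1],
    [0,0,1,0,0,1,0,0,1],
    [0,0,0,1,1,0,0,0,1],
    [0,0,0,0,0,1,1,1,0]])

A_tilde = A + np.eye(9, dtype=int)       # adjacency with self-loops

# Direct neighbours of v5: indices where row 5 (index 4) is non-zero.
print(np.flatnonzero(A[4]) + 1)          # -> [1 2 4 6 8]

# 2-hop neighbourhood of v1 (including direct neighbours):
# (A + A @ A) is non-zero wherever a path of length 1 or 2 exists.
reach2 = np.flatnonzero((A + A @ A)[0])
print(np.setdiff1d(reach2 + 1, [1]))     # -> [2 4 5 6 8], excluding v1 itself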

5. For a three-class classification problem where the input is 4-dimensional, suppose that, after running the Gradient Boosting Method, you have received the following overall model:

F1(x) = ...,     if x1 <= 0.65
        1.104,   if x1 > 0.65

F2(x) = ...,     if x2 <= 2.95
        -0.298,  if x2 > 2.95

F3(x) = ...,     if x1 <= 1.70
        -0.488,  if x1 > 1.70

For a given new input x = [4.7, 3.2, 1.3, 0.2], please predict its class label.

Solution:

First we calculate the Gradient Boosting model outputs:

F1(x) = 1.104; F2(x) = -0.298; F3(x) = -0.488

(You may write it as F1(x)=1.104, F2(x)=-0.298, F3(x)=-0.488 to save time.)

According to the probability definition (soft-max), we have

P1(x) = e^{F1(x)} / (e^{F1(x)} + e^{F2(x)} + e^{F3(x)}) = e^{1.104} / (e^{1.104} + e^{-0.298} + e^{-0.488}) = 0.690
P2(x) = e^{-0.298} / (e^{1.104} + e^{-0.298} + e^{-0.488}) = 0.170
P3(x) = e^{-0.488} / (e^{1.104} + e^{-0.298} + e^{-0.488}) = 0.140

Now this sample is classified as Class 1, since P1(x) is the largest.
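If you want to verify the soft-max step, a couple of lines of Python suffice (a sketch; not required in the exam):

import numpy as np

F = np.array([1.104, -0.298, -0.488])   # F1(x), F2(x), F3(x)
P = np.exp(F) / np.exp(F).sum()         # soft-max probabilities
print(P.round(3))                       # -> [0.69 0.17 0.14]
print(P.argmax() + 1)                   # -> 1, i.e. Class 1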

6.   The following questions are about Gradient Boosting.

(1) Describe the basic Gradient Boosting algorithm for regression with the below squared loss function step by step.

L(y_i, F(x_i)) = (y_i - F(x_i))^2

(2) If we change the loss function to the absolute loss defined as follows

L(y_i, F(x_i)) = |y_i - F(x_i)|

How would you modify the Gradient Boosting algorithm in question (1)?

(3) If we are considering three-class classification, what loss function should be suggested?

(4) In general, for both regression and classification, what is being fitted in each step of Gradient Boosting? Is this a regression problem for both cases?

(5) In the model fitting described in question (4), suppose you only have two models h1(x) and h2(x) to choose from; which model should be chosen?

(6) Suppose in (5) you have chosen h1(x), which involves only feature x1 and is defined as

h1(x) = 0.5,  if x1 <= 0.65
        0.3,  if x1 > 0.65

You continue the Gradient Boosting algorithm one more step and find another tree stump

h2(x) = 0.5,  if x1 <= 0.40
        0.7,  if x1 > 0.40

Your final model is f(x) = h1(x) + h2(x). Write out the mathematical expression of the model f(x).

Answers:

(1)

•   Start with an initial model, for example, a constant model F_0(x) = c (e.g., the mean of the targets).

•   Do the following steps until some criteria are satisfied:

o Calculate the negative gradients

r_i = -dL(y_i, F(x_i))/dF(x_i) = y_i - F(x_i)

o Fit a (simple) model (e.g., a decision tree stump) h(x) to the negative gradients r_i, for i = 1, 2, ..., N

o Construct the new overall model

F_new(x) = F(x) + alpha * h(x), with alpha = 1
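These steps translate almost directly into code. Below is a minimal sketch using scikit-learn's DecisionTreeRegressor as the stump learner; the data here is made up purely for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))               # toy inputs
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

alpha = 1.0                        # step size (alpha = 1 in the basic algorithm)
F = np.full_like(y, y.mean())      # F_0(x) = c, the mean of the targets
stumps = []

for m in range(50):
    r = y - F                                          # negative gradients of squared loss
    h = DecisionTreeRegressor(max_depth=1).fit(X, r)   # fit a stump to the residuals
    stumps.append(h)
    F = F + alpha * h.predict(X)                       # F_new(x) = F(x) + alpha * h(x)

# Prediction for new inputs: initial constant plus the sum of all stumps.
def predict(X_new):
    return y.mean() + alpha * sum(h.predict(X_new) for h in stumps)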

(2) For the given loss function L(y_i, F(x_i)) = |y_i - F(x_i)|, the negative gradient is

r_i = -dL(y_i, F(x_i))/dF(x_i) = sign(y_i - F(x_i)) =  1,  if F(x_i) < y_i
                                                      -1,  if F(x_i) > y_i

So we fit a model h(x) to target values 1 or -1 (the negative gradients) at each data point in each loop.
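In code, the only change from the squared-loss sketch above is the residual line, e.g.:

import numpy as np

y = np.array([3.0, 1.5, 2.2])   # made-up targets
F = np.array([2.5, 2.0, 2.2])   # made-up current model outputs
r = np.sign(y - F)              # +1 where F(x_i) < y_i, -1 where F(x_i) > y_i
# (np.sign returns 0 when F(x_i) = y_i, where the loss is not differentiable.)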

(3) A possible way is to construct a cross-entropy loss function (over one data point),

L = - sum_{k=1}^{3} y_{ik} log P_k(x_i)

where P_1(x), P_2(x), P_3(x) are soft-max functions of the models F_1(x), F_2(x), F_3(x), as below:

P_k(x_i) = e^{F_k(x_i)} / (e^{F_1(x_i)} + e^{F_2(x_i)} + e^{F_3(x_i)}),   k = 1, 2, 3


Thus in modelling we shall produce three models F_1(x), F_2(x), F_3(x). With the cross-entropy loss function, the negative gradient with respect to each model is

r_k(x_i) = -dL/dF_k(x_i) = y_{ik} - P_k(x_i)

(4) In the Gradient Boosting algorithm, for both regression and classification, we fit simple models to the negative gradients of the loss function at each data point. This is a regression problem with a simple model in each GB loop.

(5) As this is a regression problem and only two models are given/allowed, we shall compute the mean squared error between the negative gradients of the loss function and each given model's outputs over all the data, then choose the model with the smaller mean squared error. Note: as only two particular models are given, we do not actually need to perform a regression algorithm.
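A minimal sketch of that comparison, assuming h1 and h2 are hypothetical callables returning each candidate model's outputs:

import numpy as np

def pick_model(X, neg_grad, h1, h2):
    """Return the candidate whose outputs best fit the negative gradients."""
    mse1 = np.mean((neg_grad - h1(X)) ** 2)
    mse2 = np.mean((neg_grad - h2(X)) ** 2)
    return h1 if mse1 < mse2 else h2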

(6) Both stumps are piecewise functions. According to the definitions, when x1 <= 0.40, both h1(x) and h2(x) are 0.5, hence f(x) = 1.0. When 0.40 < x1 <= 0.65, h1(x) = 0.5 and h2(x) = 0.7, hence f(x) = h1(x) + h2(x) = 1.2. Finally, when x1 > 0.65, f(x) = h1(x) + h2(x) = 0.3 + 0.7 = 1.0. So

f(x) = 1.0,  if x1 <= 0.40
       1.2,  if 0.40 < x1 <= 0.65
       1.0,  if x1 > 0.65
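The same piecewise reasoning can be checked with a small Python sketch (hypothetical helper names):

def h1(x1):
    return 0.5 if x1 <= 0.65 else 0.3

def h2(x1):
    return 0.5 if x1 <= 0.40 else 0.7

def f(x1):                       # final model: f(x) = h1(x) + h2(x)
    return h1(x1) + h2(x1)

print(f(0.3), f(0.5), f(0.9))    # -> 1.0 1.2 1.0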