Removing Softmax from the last layer yields better results

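I am working on an NLP task in Keras: translating English sentences into German. The model was not learning at all... but as soon as I removed the softmax from the last layer, it started working! Is this a bug in Keras, or is something else going on?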


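import tensorflow as tf
from tensorflow.keras.activations import tanh
from tensorflow.keras.layers import Concatenate, Dense, Embedding, GRU, Softmax
from tensorflow.keras.optimizers import Adam

optimizer = Adam()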
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,reduction='none')

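def loss_function(real,pred):
  # Padding positions have token id 0; exclude them from the loss
  mask = tf.math.logical_not(tf.math.equal(real,0))
  loss_ = loss_object(real,pred)

  mask = tf.cast(mask,dtype=loss_.dtype)
  loss_ *= mask

  # Mean over the batch (padded positions contribute zeros)
  return tf.reduce_mean(loss_)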

EPOCHS = 20
batch_size = 64

batch_per_epoch = int(train_x1.shape[0] / batch_size)

embed_dim = 256
units = 1024
attention_units = 10

encoder_embed = Embedding(english_vocab_size,embed_dim)
decoder_embed = Embedding(german_vocab_size,embed_dim)

encoder = GRU(units,return_sequences=True,return_state=True,recurrent_initializer='glorot_uniform')
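# train_step unpacks (output, state) and reads dec_output.shape[2], so both flags are needed
decoder = GRU(units,return_sequences=True,return_state=True,recurrent_initializer='glorot_uniform')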

dense = Dense(german_vocab_size)

attention1 = Dense(attention_units)
attention2 = Dense(attention_units)
attention3 = Dense(1)

def train_step(english_input,german_target):
    loss = 0
    
    with tf.GradientTape() as tape:
      enc_output,enc_hidden = encoder(encoder_embed(english_input))

      dec_hidden = enc_hidden

      dec_input = tf.expand_dims([german_tokenizer.word_index['startseq']] * batch_size,1)

      for i in range(1,german_target.shape[1]):
        attention_weights = attention1(enc_output) + attention2(tf.expand_dims(dec_hidden,axis=1))
        attention_weights = tanh(attention_weights)
        attention_weights = attention3(attention_weights)
        attention_weights = Softmax(axis=1)(attention_weights)

        Context_Vector = tf.reduce_sum(enc_output * attention_weights,axis=1)
        Context_Vector = tf.expand_dims(Context_Vector,axis = 1)

        x = decoder_embed(dec_input)

        x = Concatenate(axis=-1)([x,Context_Vector])

        dec_output,dec_hidden = decoder(x)

        output = tf.reshape(dec_output,(-1,dec_output.shape[2]))

        prediction = dense(output)

        loss += loss_function(german_target[:,i],prediction)

        # Teacher forcing: feed the ground-truth token as the next decoder input
        dec_input = tf.expand_dims(german_target[:,i],1)

    batch_loss = (loss / int(german_target.shape[1]))

    variables = encoder_embed.trainable_variables + decoder_embed.trainable_variables + encoder.trainable_variables + decoder.trainable_variables + dense.trainable_variables + attention1.trainable_variables + attention2.trainable_variables + attention3.trainable_variables

    gradients = tape.gradient(loss,variables)

    optimizer.apply_gradients(zip(gradients,variables))

    return batch_loss
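
The attention step inside train_step (the attention1, attention2 and attention3 lines) can be sketched in plain NumPy to make the shapes explicit. This is only an illustration: W1, W2 and v are hypothetical random matrices standing in for the three Dense layers, biases are omitted, and the sizes are made up. Note that the softmax here runs over source positions and is unrelated to the output-layer softmax the question is about.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, src_len, units, attn_units = 2, 5, 8, 4  # illustrative sizes only

enc_output = rng.normal(size=(batch, src_len, units))  # encoder states
dec_hidden = rng.normal(size=(batch, units))           # current decoder state

# Hypothetical stand-ins for attention1, attention2, attention3 (no biases)
W1 = rng.normal(size=(units, attn_units))
W2 = rng.normal(size=(units, attn_units))
v = rng.normal(size=(attn_units, 1))

# Additive score: v^T tanh(W1 h_enc + W2 h_dec), shape (batch, src_len, 1)
score = np.tanh(enc_output @ W1 + (dec_hidden @ W2)[:, None, :]) @ v

# Softmax over the source positions (axis=1), numerically stabilised
score = score - score.max(axis=1, keepdims=True)
weights = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)

# Context vector: attention-weighted sum of encoder states, shape (batch, units)
context = (enc_output * weights).sum(axis=1)

print(weights.shape, context.shape)
```

The context vector is what train_step concatenates with the embedded decoder input before calling the GRU.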

Code summary

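The code takes English sentences and German sentences as input (the German sentences are fed in to implement teacher forcing) and predicts the translated German sentence. The loss function is SparseCategoricalCrossentropy, but the loss at positions where the target is 0 (padding) is masked out. For example, suppose we have the sentence 'StartSeq This is Stackoverflow 0 0 0 0 0 EndSeq' (sentences are zero-padded so that all inputs have the same length). We compute the loss for every word but not for the 0s, which helps the model train better. Note: this model implements Bahdanau attention.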
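
The masking described above can be sketched without any framework. The function below mirrors what loss_function does for one decoding step over a batch; the name masked_mean_loss, the pad_id parameter and the numbers are made up for illustration.

```python
def masked_mean_loss(per_token_loss, targets, pad_id=0):
    """Zero out the loss where the target is padding, then average over the batch."""
    mask = [0.0 if t == pad_id else 1.0 for t in targets]
    masked = [loss * m for loss, m in zip(per_token_loss, mask)]
    return sum(masked) / len(masked)

# One decoding step for a batch of four sentences; two of them are already padding
targets = [5, 9, 0, 0]
per_token_loss = [2.0, 1.0, 3.0, 4.0]

batch_mean = masked_mean_loss(per_token_loss, targets)

# Alternative normalisation (not what the code above does): divide by the
# number of real tokens instead of the batch size
real_tokens = sum(1 for t in targets if t != 0)
real_mean = batch_mean * len(targets) / real_tokens

print(batch_mean, real_mean)
```

Averaging over the whole batch, as tf.reduce_mean does in loss_function, shrinks the loss at steps where most sentences have already ended; dividing by the number of real tokens is a common alternative, shown here only for comparison.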

Question

When I apply softmax to the last layer to get predicted probabilities, the model learns nothing. Without the softmax it learns correctly. Why does this happen?
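
One likely explanation, assuming the loss was still constructed with from_logits=True after the softmax was added: with that flag the loss applies softmax to whatever it receives, so the model's probabilities get a second softmax. Because probabilities lie in [0, 1], the second softmax is close to uniform and the loss cannot drop far below log(vocab size). The framework-free sketch below illustrates this with made-up logits; the helper names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-probability of the target class."""
    return -math.log(probs[target])

logits = [8.0, 0.0, 0.0, 0.0, 0.0]  # a very confident, correct prediction
target = 0

once = softmax(logits)   # what a loss with from_logits=True computes from raw logits
twice = softmax(once)    # what it computes if the model already applied softmax

loss_once = cross_entropy(once, target)
loss_twice = cross_entropy(twice, target)

# Best possible loss under double softmax: even a perfect one-hot output
# only reaches e / (e + vocab - 1) for the target class
vocab = len(logits)
floor = math.log(math.e + vocab - 1) - 1.0

print(loss_once, loss_twice, floor)
```

If this is the cause, there are two consistent setups: keep the final Dense layer linear with from_logits=True (as in the code above), or add the softmax and set from_logits=False.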
