Sequence Models - Week 3 Programming Assignment 1 - Neural Machine Translation with Attention

1. Neural Machine Translation

  • We will build a Neural Machine Translation (NMT) model that translates human-readable dates ("25th of June, 2009") into machine-readable dates ("2009-06-25").

  • The model uses an attention model (attention mechanism).

from keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from keras.layers import RepeatVector, Dense, Activation, Lambda
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.models import load_model, Model
import keras.backend as K
import numpy as np

from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date
from nmt_utils import *
import matplotlib.pyplot as plt
%matplotlib inline

2. Translating human readable dates into machine readable dates

  • The model you build here could be used to translate from one language to another, such as from English to Hindi.

  • However, language translation requires massive datasets and usually takes days of training on GPUs.

  • Instead, we will perform a simpler "date translation" task.

  • The network takes as input a date written in a variety of possible formats (e.g. "the 29th of August 1958", "03/30/1968", "24 JUNE 1987").

  • The network translates them into standardized, machine-readable dates (e.g. "1958-08-29", "1968-03-30", "1987-06-24"), i.e. the YYYY-MM-DD format.

2.1 Dataset

We will train the model on a dataset of 10,000 human-readable dates and their equivalent, standardized, machine-readable dates. Load the dataset:

m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)
print(dataset[:10])

print(human_vocab, len(human_vocab))
print(machine_vocab, len(machine_vocab))
[('tuesday january 31 2006', '2006-01-31'), ('20 oct 1986', '1986-10-20'), ('25 february 2008', '2008-02-25'), ('sunday november 11 1984', '1984-11-11'), ('20.06.72', '1972-06-20'), ('august 9 1983', '1983-08-09'), ('march 30 1993', '1993-03-30'), ('10 jun 2017', '2017-06-10'), ('december 21 2001', '2001-12-21'), ('monday october 11 1999', '1999-10-11')]
{' ': 0, '.': 1, '/': 2, '0': 3, '1': 4, '2': 5, '3': 6, '4': 7, '5': 8, '6': 9, '7': 10, '8': 11, '9': 12, 'a': 13, 'b': 14, 'c': 15, 'd': 16, 'e': 17, 'f': 18, 'g': 19, 'h': 20, 'i': 21, 'j': 22, 'l': 23, 'm': 24, 'n': 25, 'o': 26, 'p': 27, 'r': 28, 's': 29, 't': 30, 'u': 31, 'v': 32, 'w': 33, 'y': 34, '<unk>': 35, '<pad>': 36} 37
{'-': 0, '0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10} 11
  • dataset: a list of tuples of (human-readable date, machine-readable date).

  • human_vocab: a python dictionary mapping all characters used in the human-readable dates to an integer-valued index.

  • machine_vocab: a python dictionary mapping all characters used in the machine-readable dates to an integer-valued index. These indices are not necessarily consistent with those of human_vocab.

  • inv_machine_vocab: the inverse dictionary of machine_vocab, mapping from indices back to characters.
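
For reference, load_dataset (from nmt_utils) builds these pairs with Faker and Babel, which are imported above. The following is only a minimal sketch of how one (human-readable, machine-readable) pair might be generated; the format list and the create_date helper name are illustrative, not the assignment's exact code:

from faker import Faker
from babel.dates import format_date
import random

fake = Faker()
# A few human-readable formats the generator could sample from (illustrative only).
FORMATS = ['short', 'medium', 'long', 'full', 'd MMM YYY', 'd MMMM YYY', 'EEEE d MMMM YYY']

def create_date():
    """Return one (human_readable, machine_readable) date pair."""
    dt = fake.date_object()                         # a random datetime.date
    human = format_date(dt, format=random.choice(FORMATS), locale='en_US').lower().replace(',', '')
    machine = dt.isoformat()                        # 'YYYY-MM-DD'
    return human, machine

print(create_date())   # e.g. ('9 may 1998', '1998-05-09')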

2.2 Preprocessing the data

  • Set Tx = 30

    • We assume Tx is the maximum length of a human-readable date.

    • If we get a longer input, we would have to truncate it.

  • Set Ty = 10

    • "YYYY-MM-DD" is 10 characters long.

Tx = 30
Ty = 10
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)

X.shape: (10000, 30)
Y.shape: (10000, 10)
Xoh.shape: (10000, 30, 37)
Yoh.shape: (10000, 10, 11)

You now have:

  • X: the processed version of the human-readable dates in the training set.

    • Each character is replaced by the index it maps to in human_vocab.

    • Each date is further padded with a special character (<pad>) to ensure a length of Tx.

    • X.shape = (m, Tx) where m is the number of training examples in a batch.

  • Y: the processed version of the machine-readable dates in the training set.

    • Each character is replaced by the index it maps to in machine_vocab.

    • Y.shape = (m, Ty).

  • Xoh: the one-hot version of X.

    • The index of the entry that is "1" in the one-hot vector is the character's index in human_vocab. (If the index is 2, the one-hot version is [0,0,1,0,0,...,0].)

    • Xoh.shape = (m, Tx, len(human_vocab))

  • Yoh: the one-hot version of Y.

    • The index of the entry that is "1" in the one-hot vector is the character's index in machine_vocab.

    • Yoh.shape = (m, Ty, len(machine_vocab)).

    • len(machine_vocab) = 11 since there are 10 digits (0-9) plus the "-" symbol.
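
As a rough sketch, preprocess_data maps each string to vocabulary indices and then one-hot encodes the result. The helper below is only illustrative (the real implementation lives in nmt_utils; names and details may differ):

import numpy as np
from keras.utils import to_categorical

def string_to_int(string, length, vocab):
    """Map each character of `string` to its index in `vocab`, truncating/padding to `length`."""
    string = string.lower().replace(',', '')
    string = string[:length]                                  # truncate inputs longer than `length`
    rep = [vocab.get(ch, vocab.get('<unk>', 0)) for ch in string]
    if '<pad>' in vocab:                                      # machine dates are exactly Ty chars, no padding needed
        rep += [vocab['<pad>']] * (length - len(rep))
    return rep

def preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty):
    X = np.array([string_to_int(h, Tx, human_vocab) for h, m in dataset])
    Y = np.array([string_to_int(m, Ty, machine_vocab) for h, m in dataset])
    Xoh = to_categorical(X, num_classes=len(human_vocab))     # (m, Tx, len(human_vocab))
    Yoh = to_categorical(Y, num_classes=len(machine_vocab))   # (m, Ty, len(machine_vocab))
    return X, Y, Xoh, Yoh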

index = 0
print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
print()
print("Source after preprocessing (indices):", X[index])
print("Target after preprocessing (indices):", Y[index])
print()
print("Source after preprocessing (one-hot):", Xoh[index])   # 每行是一个T_t的输出,输出的是对应相应字符的一个one-hot向量.
print("Target after preprocessing (one-hot):", Yoh[index])
Source date: tuesday january 31 2006
Target date: 2006-01-31

Source after preprocessing (indices): [30 31 17 29 16 13 34  0 22 13 25 31 13 28 34  0  6  4  0  5  3  3  9 36 36
 36 36 36 36 36]
Target after preprocessing (indices): [3 1 1 7 0 1 2 0 4 2]

Source after preprocessing (one-hot): [[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  1.]
 [ 0.  0.  0. ...,  0.  0.  1.]
 [ 0.  0.  0. ...,  0.  0.  1.]]
Target after preprocessing (one-hot): [[ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]]

3. Neural machine translation with attention

  • If you had to translate a book's paragraph from French to English, you would not read the whole paragraph, then close the book and translate.

  • Even during the translation process, you would read/re-read and focus on the parts of the French paragraph that correspond to the parts of the English you are writing down.

  • The attention mechanism tells a neural machine translation model where it should pay attention at any step.

3.1 Attention mechanism

How it works:

  • The diagram on the left shows the attention model.

  • The diagram on the right shows what one "attention" step does to compute the attention variables \(\alpha^{\langle t, t' \rangle}\).

  • The attention variables \(\alpha^{\langle t, t' \rangle}\) are used to compute the context variable \(context^{\langle t \rangle}\) (\(C^{\langle t \rangle}\)) for each output time step (\(t=1, \ldots, T_y\)).

**Figure 1**: Neural machine translation with attention

Here are some properties of the model that you may notice:

Pre-attention and post-attention LSTMs on both sides of the attention mechanism

  • There are two separate LSTMs in this model (see the diagram on the left): the pre-attention and post-attention LSTMs.

  • Pre-attention Bi-LSTM: the bi-directional LSTM at the bottom of the picture, before the attention mechanism.

    • The attention mechanism is the middle part of the left diagram (Attention).

    • The pre-attention Bi-LSTM goes through \(T_x\) time steps.

  • Post-attention LSTM: at the top of the picture, after the attention mechanism.

    • The post-attention LSTM goes through \(T_y\) time steps.

  • The post-attention LSTM passes the hidden state \(s^{\langle t \rangle}\) and cell state \(c^{\langle t \rangle}\) from one time step to the next.

An LSTM has both a hidden state and a cell state

  • In the lecture videos, only a basic RNN was used for the post-attention sequence model.

    • This means that the state captured by that RNN only outputs the hidden state \(s^{\langle t\rangle}\).

  • In this assignment, we use an LSTM instead of a basic RNN.

    • The LSTM therefore has both a hidden state \(s^{\langle t\rangle}\) and a cell state \(c^{\langle t\rangle}\).
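
In Keras, this is what return_state=True gives you. The following is a minimal sketch only; the sizes and names (n_s, context, s0, c0) are illustrative, not the assignment's exact code:

from keras.layers import Input, LSTM

n_s = 64                                      # size of the post-attention LSTM state (illustrative)
context = Input(shape=(1, 64))                # one "context" time step fed into the cell
s0 = Input(shape=(n_s,))                      # initial hidden state s<0>
c0 = Input(shape=(n_s,))                      # initial cell state c<0>

post_attention_lstm = LSTM(n_s, return_state=True)
# Running the cell for one step returns the output plus BOTH states:
s, _, c = post_attention_lstm(context, initial_state=[s0, c0])
# s is the hidden state s<t>, c is the cell state c<t>; both are passed on to the next time step.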

The prediction from the previous time step is not fed in at each time step

  • Unlike previous text-generation examples (such as Dinosaurus in Week 1), in this model the post-attention LSTM at time \(t\) does not take the specific generated prediction \(y^{\langle t-1 \rangle}\) as input.

  • The post-attention LSTM at time \(t\) only takes the hidden state \(s^{\langle t\rangle}\) and cell state \(c^{\langle t\rangle}\) as input.

  • We have designed the model this way because (unlike language generation, where adjacent characters are highly correlated) there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.

Concatenation of the hidden states \(a^{\langle t\rangle}\) from the forward and backward pre-attention LSTMs

  • \(\overrightarrow{a}^{\langle t \rangle}\): hidden state of the forward-direction, pre-attention LSTM.

  • \(\overleftarrow{a}^{\langle t \rangle}\): hidden state of the backward-direction, pre-attention LSTM.

  • \(a^{\langle t \rangle} = [\overrightarrow{a}^{\langle t \rangle}, \overleftarrow{a}^{\langle t \rangle}]\): the concatenation of the activations of both the forward-direction \(\overrightarrow{a}^{\langle t \rangle}\) and backward-direction \(\overleftarrow{a}^{\langle t \rangle}\) pre-attention Bi-LSTM.
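
In Keras, the Bidirectional wrapper produces exactly this concatenation. A minimal sketch, with illustrative sizes (n_a and the input shape are not taken from the assignment code):

from keras.layers import Input, LSTM, Bidirectional
import keras.backend as K

Tx, n_a = 30, 32                                     # n_a = units of each directional LSTM (illustrative)
X = Input(shape=(Tx, 37))                            # (Tx, len(human_vocab)) one-hot inputs

# merge_mode='concat' (the default) concatenates forward and backward activations,
# so each a<t> has 2*n_a features.
a = Bidirectional(LSTM(n_a, return_sequences=True))(X)
print(K.int_shape(a))                                # (None, 30, 64)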

Computing "energies" \(e^{\langle t, t' \rangle}\) as a function of \(s^{\langle t-1 \rangle}\) and \(a^{\langle t' \rangle}\)

  • Recall in the lesson videos "Attention Model", at time 6:45 to 8:16, the definition of "e" as a function of \(s^{\langle t-1 \rangle}\) and \(a^{\langle t' \rangle}\).

    • "e" is called the "energies" variable.

    • \(s^{\langle t-1 \rangle}\) is the hidden state of the post-attention LSTM.

    • \(a^{\langle t' \rangle}\) is the hidden state of the pre-attention LSTM.

    • \(s^{\langle t-1 \rangle}\) and \(a^{\langle t' \rangle}\) are fed into a simple neural network, which learns the function to output \(e^{\langle t, t' \rangle}\).

    • \(e^{\langle t, t' \rangle}\) is then used when computing the attention \(\alpha^{\langle t, t' \rangle}\) that \(y^{\langle t \rangle}\) should pay to \(a^{\langle t' \rangle}\).

  • The diagram on the right uses a RepeatVector node to copy \(s^{\langle t-1 \rangle}\)'s value \(T_x\) times.

  • It then uses Concatenation to concatenate \(s^{\langle t-1 \rangle}\) and \(a^{\langle t' \rangle}\).

  • The concatenation of \(s^{\langle t-1 \rangle}\) and \(a^{\langle t' \rangle}\) is fed into a "Dense" layer, which computes \(e^{\langle t, t' \rangle}\).

  • \(e^{\langle t, t' \rangle}\) is then passed through a softmax to compute \(\alpha^{\langle t, t' \rangle}\).

  • The variable \(e^{\langle t, t' \rangle}\) is not explicitly shown in the diagram, but it sits between the Dense layer and the Softmax layer (in the diagram on the right).

  • How to use RepeatVector and Concatenation in Keras is explained below.
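
Concretely, the softmax mentioned above is taken over the \(T_x\) energies for a given output step \(t\):

\[\alpha^{\langle t, t' \rangle} = \frac{\exp(e^{\langle t, t' \rangle})}{\sum_{t''=1}^{T_x} \exp(e^{\langle t, t'' \rangle})}\]

And here is a minimal Keras sketch of the RepeatVector / Concatenation step described above (shapes and variable names are illustrative):

from keras.layers import Input, RepeatVector, Concatenate

Tx, n_a, n_s = 30, 32, 64                       # illustrative sizes
a = Input(shape=(Tx, 2 * n_a))                  # all pre-attention hidden states a<1>, ..., a<Tx>
s_prev = Input(shape=(n_s,))                    # s<t-1> from the post-attention LSTM

s_rep = RepeatVector(Tx)(s_prev)                # (None, Tx, n_s): copy s<t-1> Tx times
concat = Concatenate(axis=-1)([a, s_rep])       # (None, Tx, 2*n_a + n_s), one row per input time step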

Implementation Details

To implement the neural translator, you will implement two functions: one_step_attention() and model().

one_step_attention

  • The inputs to the one_step_attention at time step \(t\) are:

    • \([a^{<1>},a^{<2>}, ..., a^{<T_x>}]\): all hidden states of the pre-attention Bi-LSTM.

    • \(s^{<t-1>}\): the previous hidden state of the post-attention LSTM.

  • one_step_attention computes:

    • \([\alpha^{<t,1>},\alpha^{<t,2>}, ..., \alpha^{<t,T_x>}]\): the attention weights

    • \(context^{ \langle t \rangle }\): the context vector:

\[context^{<t>} = \sum_{t' = 1}^{T_x} \alpha^{<t,t'>}a^{<t'>}\tag{1} \]
Clarifying 'context' and 'c'
  • In the lecture videos, the context was denoted \(c^{\langle t \rangle}\).

  • In this assignment, we will call the context \(context^{\langle t \rangle}\).

    • This is to avoid confusion with the post-attention LSTM's internal memory cell variable, which is also denoted \(c^{\langle t \rangle}\).

Implementing one_step_attention

Exercise: Implement one_step_attention().

  • The function model() will call the layers in one_step_attention() \(T_y\) times using a for-loop.
  • It is important that all \(T_y\) copies have the same weights.
    • It should not reinitialize the weights every time.
    • In other words, all \(T_y\) steps should have shared weights.
  • Here's how you can implement layers with shareable weights in Keras:
    1. Define the layer objects in a variable scope that is outside of the one_step_attention function. For example, defining the objects as global variables would work.
      • Note that defining these variables inside the scope of the function model would technically work, since model will then call the one_step_attention function. For the purposes of making grading and troubleshooting easier, we are defining these as global variables. Note that the automatic grader will expect these to be global variables as well.
    2. Call these objects when propagating the input.
  • We have defined the layers you need as global variables.
    • Please run the following cells to create them.
    • Please note that the automatic grader expects these global variables with the given variable names. For grading purposes, please do not rename the global variables.
  • Please check the Keras documentation to learn more about these layers. The layers are functions. Below are examples of how to call these functions.
* [RepeatVector()](https://keras.io/layers/core/#repeatvector)  
var_repeated = repeat_layer(var1)
* [Concatenate()](https://keras.io/layers/merge/#concatenate)   
concatenated_vars = concatenate_layer([var1,var2,var3])
* [Dense()](https://keras.io/layers/core/#dense)  
var_out = dense_layer(var_in)
* [Activation()](https://keras.io/layers/core/#activation)  
activation = activation_layer(var_in)  
* [Dot()](https://keras.io/layers/merge/#dot)  
dot_product = dot_layer([var1,var2])
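
Putting these pieces together, here is a minimal sketch of one_step_attention(). It assumes the shared global layers are named repeator, concatenator, densor1, densor2, activator and dotor; check your notebook for the exact names, layer sizes, and the softmax helper it provides (the softmax must be taken over the Tx axis, not the last axis):

import keras.backend as K
from keras.layers import RepeatVector, Concatenate, Dense, Activation, Dot

Tx = 30

def softmax_over_time(x):
    """Softmax over axis=1 (the Tx axis), so the attention weights for each output step sum to 1."""
    e = K.exp(x - K.max(x, axis=1, keepdims=True))
    return e / K.sum(e, axis=1, keepdims=True)

# Shared layer objects defined once as globals, so all Ty steps reuse the same weights.
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation="tanh")        # layer sizes are illustrative
densor2 = Dense(1, activation="relu")
activator = Activation(softmax_over_time, name="attention_weights")
dotor = Dot(axes=1)

def one_step_attention(a, s_prev):
    """Compute the context vector for one output time step.
    a      -- all pre-attention Bi-LSTM hidden states, shape (m, Tx, 2*n_a)
    s_prev -- previous post-attention LSTM hidden state, shape (m, n_s)
    """
    s_prev = repeator(s_prev)                 # (m, Tx, n_s): copy s<t-1> for every input time step
    concat = concatenator([a, s_prev])        # (m, Tx, 2*n_a + n_s)
    e = densor1(concat)                       # intermediate "energies"
    energies = densor2(e)                     # (m, Tx, 1): e<t,t'>
    alphas = activator(energies)              # (m, Tx, 1): attention weights alpha<t,t'>
    context = dotor([alphas, a])              # (m, 1, 2*n_a): weighted sum of the a<t'>
    return context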
