LSTM time series prediction: validation and test loss are lower than training loss

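I am trying to predict the next vehicle speed from the speeds at previous time steps. My current approach is an LSTM neural network for time series forecasting. I have read many tutorials on the topic and have now built my own program to predict vehicle speed. In the current setup, I try to predict the next speed from the previous 20 values.

Data: I have a dataset of about 1000 different .csv files. Each .csv file contains the vehicle speed of a real car trip, measured once per second. The routes differ, but they are all in the same area and from the same driver (me). As a result, every .csv file has a different length.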

A typical drive cycle from my dataset

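Data retrieval and splitting:

I collect the file names of the .csv files and split them into training, validation and test sets. I do it at the file level because the split should happen as early as possible to prevent leakage.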

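import os
import random
import numpy as np

# Keep full paths so the files are found regardless of the working directory.
full_files = [os.path.join(search_path,f) for f in os.listdir(search_path) if f.endswith(".csv") and os.path.isfile(os.path.join(search_path,f))]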
random.shuffle(full_files)
args = [cells,files_batched,None,history_range,future_range,steps_skipped,shifting]

files_len = len(full_files)

train_count = round(files_len * 0.7)
val_count = round(files_len * 0.2)
test_count = round(files_len * 0.1)

train_files = full_files[:train_count]
val_files = full_files[train_count:train_count + val_count]
test_files = full_files[train_count + val_count:train_count + val_count + test_count]

train_data = tuple(map(np.array,zip(*list(data_generator(train_files,*args)))))
val_data = tuple(map(np.array,zip(*list(data_generator(val_files,*args)))))
test_data = tuple(map(np.array,zip(*list(data_generator(test_files,*args)))))

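I wrote a generator, data_generator, which iterates over the files, reads each one, and splits it directly into the input shape I want for the LSTM. future_range is the number of time steps I use as labels, history_range is the number of previous time steps, and cells are the features (columns) I want to extract from each .csv file. I know this may not be the best approach, but it ensures that no window overlaps two different .csv files (which would happen if I read everything first and split afterwards).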

def multivariate_data(data_set,target_vector,start_index,end_index,history_size,target_size,skip_data=1,index_step=1):
    data = []
    labels = []

    start_index = start_index + history_size
    if end_index is None:
        end_index = len(data_set) - target_size

    # Step through the series from the first full history window to end_index.
    for i in range(start_index,end_index,index_step):
        indices = range(i - history_size,i,skip_data)

        data.append(data_set[indices])
        labels.append(target_vector[i:i + target_size])

    return np.array(data),np.array(labels)


def data_generator(file_list,feature_cells,file_batches,sample_batches,history_step,prediction_step,skip_data=1,shift=True):
    # Yields (window, label) pairs file by file, so no window spans two files.
    i = 0
    while True:
        if i >= len(file_list):
            break
        else:
            printProgressBar(i,len(file_list) - 1)  # progress helper defined elsewhere (not shown)
            file_chunk = file_list[i * file_batches:(i + 1) * file_batches]
            for file in file_chunk:
                temp = pd.read_csv(file,usecols=feature_cells,sep=";",header=None)
                norm_values = temp.values

                index_step = 1
                if shift is False:
                    index_step = history_step + 1
                # The first feature column is the prediction target.
                train,label = multivariate_data(norm_values,norm_values[:,0],0,None,history_step,prediction_step,skip_data,index_step)

                if sample_batches is not None:
                    for index in range(0,len(train),sample_batches):
                        batch = train[index: index + sample_batches],label[index: index + sample_batches]
                        if batch[0].shape != (sample_batches,len(feature_cells)):
                            continue
                        yield batch
                else:
                    for index in range(0,len(train)):
                        yield train[index],label[index]
        i += 1
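
To make the shapes produced by this kind of windowing concrete, here is a minimal, self-contained sketch on a toy series. It is not the code above: `make_windows` and the toy `speeds` array are hypothetical names used only for illustration, and the first column is assumed to be the target.

```python
import numpy as np

def make_windows(series, history_size, target_size, step=1):
    """Slice one series into (history, target) pairs within a single file."""
    data, labels = [], []
    # i is the index of the first target value of each window.
    for i in range(history_size, len(series) - target_size + 1, step):
        data.append(series[i - history_size:i])        # all features, past steps
        labels.append(series[i:i + target_size, 0])    # first column, future steps
    return np.array(data), np.array(labels)

# Toy "speed" column: 30 seconds of data, one feature (assumed layout).
speeds = np.arange(30, dtype=float).reshape(-1, 1)
x, y = make_windows(speeds, history_size=20, target_size=1)
print(x.shape, y.shape)
```

The first array has the (samples, time steps, features) layout an LSTM layer expects, and the second holds one row of targets per window.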

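Scaling and shuffling:

Now I fit my MinMaxScaler on the training set only (to prevent leakage) and apply the transformation to the training, validation and test sets. Then I create tensor slices and shuffle the training data.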

scaler = MinMaxScaler(feature_range=scaling)
scaler.fit(train_data[0].reshape(-1,train_data[0].shape[-1]))

train_x = scaler.transform(train_data[0].reshape(-1,train_data[0].shape[-1])).reshape(train_data[0].shape)
train_y = scaler.transform(train_data[1].reshape(-1,train_data[1].shape[-1])).reshape(train_data[1].shape)

val_x = scaler.transform(val_data[0].reshape(-1,val_data[0].shape[-1])).reshape(val_data[0].shape)
val_y = scaler.transform(val_data[1].reshape(-1,val_data[1].shape[-1])).reshape(val_data[1].shape)

test_x = scaler.transform(test_data[0].reshape(-1,test_data[0].shape[-1])).reshape(test_data[0].shape)
test_y = scaler.transform(test_data[1].reshape(-1,test_data[1].shape[-1])).reshape(test_data[1].shape)

train_len = len(train_x)
val_len = len(val_x)
test_len = len(test_x)

train_set = tf.data.Dataset.from_tensor_slices((train_x,train_y))
train_set = train_set.cache().shuffle(train_len).batch(batch_size).repeat()

val_set = tf.data.Dataset.from_tensor_slices((val_x,val_y))
val_set = val_set.batch(batch_size).repeat()

test_set = tf.data.Dataset.from_tensor_slices((test_x,test_y))
test_set = test_set.batch(batch_size).repeat()

Training:

Finally, I create my model and fit it to my data.

train_steps = train_len // batch_size
val_steps = val_len // batch_size
test_steps = test_len // batch_size

model = Sequential()

model.add(LSTM(128,return_sequences=True))
model.add(Dropout(dropout))

model.add(LSTM(64))
model.add(Dropout(dropout))

model.add(Dense(future_range))

early_stopping = EarlyStopping(monitor='val_loss',patience=2,mode='min')
checkpoint = ModelCheckpoint(search_path + "\\model",monitor='loss',verbose=0,save_best_only=True,mode='min')
optimizer = tf.optimizers.Adam(learning_rate=learning_rate)

model.compile(loss=tf.losses.MeanSquaredError(),optimizer=optimizer,metrics=[root_mean_squared_error])
history = model.fit(train_set,validation_data=val_set,epochs=epochs,steps_per_epoch=train_steps,validation_steps=val_steps,callbacks=[early_stopping])

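When I start training, the loss starts at about 0.2 and drops below 0.05 after the first epoch! The validation loss is always lower than the training loss. The loss on the test set is also very low.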

Validation and training loss

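I think it is clear that these results are not "reasonable" for such a NN, since vehicle speed is usually a fairly complex function. I have searched the internet for possible errors, and the only plausible one I found is data leakage. However, I do not think information leaks from the training set into the validation set: I split at the file level and fit the scaler on the training data only.

I also checked this article, but I do not think any of the issues listed there apply: https://www.kdnuggets.com/2017/08/37-reasons-neural-network-not-working.html

Sorry if I have made a silly mistake, but I am new to deep learning and not entirely sure about my approach. What could be the problem?

Edit: I tried changing the model from single-step to multi-step prediction. The shape of the "future" prediction is always the same (always linear and not following the correct shape). The MSE loss on the test set (which is also used for the prediction below) is low, but the RMSE is high. How can that be?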

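Solution

This is more of a discussion question. Are you using heavy dropout? That could well explain what is happening here. Try experimenting with different dropout values. You can also try other regularization techniques.

In the dropout case: because neurons are disabled, some information about each sample is lost, and the subsequent layers try to construct the answer from an incomplete representation. The training loss is higher because the network is artificially made to struggle to give the right answer. During validation, however, all units are available, so the network has its full computational capacity and may therefore perform better than during training.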
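
The effect can be illustrated with a small NumPy sketch. It is a toy setup, not the model from the question: the weights are fixed rather than trained, and `hidden`, `readout` and `forward` are hypothetical names. It only shows that the same weights give a higher loss when dropout is active than when all units are available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1000 samples, 64 hidden activations, and a linear readout that
# reproduces the targets exactly when every unit is available.
hidden = rng.normal(size=(1000, 64))
readout = rng.normal(size=(64, 1))
targets = hidden @ readout

def mse(pred, true):
    return float(np.mean((pred - true) ** 2))

def forward(hidden, readout, rate, training):
    if not training:
        return hidden @ readout                        # inference: all units active
    keep = rng.random(hidden.shape) >= rate            # training: drop units at random
    return (hidden * keep / (1.0 - rate)) @ readout    # inverted dropout scaling

train_mode_loss = mse(forward(hidden, readout, rate=0.5, training=True), targets)
eval_mode_loss = mse(forward(hidden, readout, rate=0.5, training=False), targets)
print(train_mode_loss, eval_mode_loss)
```

With the Keras model from the question, a comparable check would be to call model.evaluate(train_set, steps=train_steps) after training, since evaluation runs with dropout disabled, and compare that number with the validation loss.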

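To summarize the possible reasons:

Regularization is applied during training but not during validation/testing.
Training loss is measured during each epoch, while validation loss is measured after each epoch.
The validation set may be easier than the training set (or there may be leakage). Try cross-validation.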
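
The second reason can be sketched with plain NumPy. This is a toy linear regression, not the LSTM from the question, and all names and values in it are assumptions chosen for illustration. The training loss reported for an epoch is the average of the batch losses collected while the weights were still improving, whereas a loss computed at the end of the epoch uses the final weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression problem: y = 3x + noise.
x = rng.normal(size=(512, 1))
y = 3.0 * x + 0.1 * rng.normal(size=(512, 1))

w = np.zeros((1, 1))      # untrained weight
learning_rate = 0.05
batch_size = 32

batch_losses = []
for start in range(0, len(x), batch_size):
    xb, yb = x[start:start + batch_size], y[start:start + batch_size]
    error = xb @ w - yb
    batch_losses.append(float(np.mean(error ** 2)))     # loss before this update
    w -= learning_rate * 2.0 * xb.T @ error / len(xb)   # one gradient step

reported_train_loss = float(np.mean(batch_losses))      # averaged during the epoch
end_of_epoch_loss = float(np.mean((x @ w - y) ** 2))    # same data, final weights
print(reported_train_loss, end_of_epoch_loss)
```

Even on identical data, the loss averaged during the epoch is higher than the loss measured afterwards, and the gap is largest in the first epochs when the weights change quickly.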

You can find more details on this topic here
