[TensorFlow 2.0] Processing Text Data - the IMDB Dataset

1. Preparing the Data

The goal of the imdb dataset is to predict the sentiment label of a movie review from its text.

The training set contains 20,000 movie reviews and the test set contains 5,000, with positive and negative reviews each making up half.

Text preprocessing is fairly tedious: it involves word segmentation (for Chinese text, not needed in this example), building a vocabulary, converting tokens to integer codes, padding sequences, building a data pipeline, and so on.

There are two common ways to do text preprocessing in TensorFlow. The first uses the Tokenizer vocabulary-building utility in tf.keras.preprocessing together with tf.keras.utils.Sequence to build a text data generator pipeline.

The second uses tf.data.Dataset together with the tf.keras.layers.experimental.preprocessing.TextVectorization preprocessing layer.

The first approach is more involved; a usage example can be found in the following article.

https://zhuanlan.zhihu.com/p/67697840
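For reference, a minimal sketch of that first approach might look like the snippet below; it is not taken from the linked article, the toy texts and labels are placeholders, and the Sequence-based generator part is omitted:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["a good movie", "a terrible movie"]     # placeholder raw reviews
labels = [1, 0]                                  # placeholder sentiment labels

tokenizer = Tokenizer(num_words=10000)           # keep only the most frequent words
tokenizer.fit_on_texts(texts)                    # build the word -> id vocabulary
sequences = tokenizer.texts_to_sequences(texts)  # strings -> lists of integer ids
x = pad_sequences(sequences, maxlen=200)         # pad/truncate to a fixed length
print(x.shape)                                   # (2, 200)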

The second approach is TensorFlow-native and also somewhat simpler.

We use the second approach here.

First, take a look at part of the contents of train.csv:

"0    It really boggles my mind when someone comes across a movie like this and claims it to be one of the worst slasher films out there. This is by far not one of the worst out there"     still not a good movie     but not the worst nonetheless. Go see something like Death Nurse or Blood Lake and then come back to me and tell me if you think the Night Brings Charlie is the worst. The film has decent camera work and editing     which is way more than I can say for many more extremely obscure slasher films.<br /><br />The film doesn't deliver on the on-screen deaths     there's one death where you see his pruning saw rip into a neck     but all other deaths are hardly interesting. But the lack of on-screen graphic violence doesnt mean this isn't a slasher film     just a bad one.<br /><br />The film was obviously intended not to be taken too seriously. The film came in at the end of the second slasher cycle     so it certainly was a reflection on traditional slasher elements     done in a tongue in cheek way. For example     after a kill     Charlie goes to the towns 'welcome sign and marks the population down one less. This is something that can only get a laugh.<br /><br />If you're into slasher films     definitely give this film a watch. It is slightly different than your usual slasher film with possibility of two killers     but not by much. The comedy of the movie is pretty much telling the audience to relax and not take the movie so god darn serious. You may forget the movie     you may remember it. Ill remember it because I love the name.                                                                                                            
"0    Mary Pickford becomes the chieftain of a Scottish clan after the death of her father"     and then has a romance. As fellow commenter Snow Leopard said     the film is rather episodic to begin. Some of it is amusing     such as Pickford whipping her clansmen to church     while some of it is just there. All in all     the story is weak     especially the recycled     contrived romance plot-line and its climax. The transfer is so dark its difficult to appreciate the scenery     but even accounting for that     this doesn't appear to be director Maurice Tourneurs best work. Pickford and Tourneur collaborated once more in the somewhat more accessible 'The Poor Little Rich Girl     typecasting Pickford as a child character.                                                                                                                        0    Well"     at least my theater group did     lol. So of course I remember watching Grease since I was a little girl     while it was never my favorite musical or story     it does still hold a little special place in my heart since its still a lot of fun to watch. I heard horrible things about Grease 2 and that's why I decided to never watch it     but my boyfriend said that it really wasnt all that bad and my friend agreed     so I decided to give it a shot     but I called them up and just laughed. First off the plot is totally stolen from the first one and it wasn't really clever     not to mention they just used the same characters     but with different names and actors. Tell me     how did the Pink Ladies and T-Birds continue years on after the former gangs left? Not to mention the creator face motor cycle enemy     gee     what a striking resemblance to the guys in the first film as well as these T-Birds were just stupid and ridiculous.<br /><br />Another year at Rydell and the music and dancing hasnt stopped. But when a new student who is Sandy's cousin comes into the scene     he is love struck by a pink lady     Stephanie. But she must stick to the code where only Pink Ladies must stick with the T-Birds     so the new student     decides to train as a T-Bird to win her heart. So he dresses up as a rebel motor cycle bandit who can ride well and defeat the evil bikers from easily kicking the T-Birds butts. But will he tell Stephanie who he really is or will she find out on her own? Well     find out for yourself.<br /><br />Grease 2 is like a silly TV show of some sort that didn't work. The gang didnt click as well as the first Grease did     not to mention Frenchy coming back was a bit silly and unbelievable     because I thought that she graduated from Rydell     but apparently she didn't. The songs were not really that catchy; Im glad that Michelle was able to bounce back so fast     but that's probably because she was the only one with talent in this silly little sequel     I wouldnt really recommend this film     other than if you are curious     but I warned you     this is just a pathetic attempt at more money from the famous musical.<br /><br />2/10                                                            1    I must give How She Move a near-perfect rating because the content is truly great. As a previous reviewer commented"     I have no idea how this film has found itself in IMDBs bottom 100 list! Thats absolutely ridiculous! 
Other films--particular those that share the dance theme--can't hold a candle to this one in terms of its combination of top-notch     believable acting     and amazing dance routines.<br /><br />From start to finish the underlying story (this is not just about winning a competition) is very easy to delve into     and surprisingly realistic. None of the main characters in this are 2-dimensional by any means and     by the end of the film     its very easy to feel emotionally invested in them. (And     even if you're not the crying type     you might get a little weepy-eyed before the credits roll.) <br /><br />I definitely recommend this film to dance-lovers and     even more so     to those who can appreciate a poignant and well-acted storyline. How She Move isnt perfect of course (what film is?)     but it's definitely a cut above movies that use pretty faces to hide a half-baked plot and/or characters who lack substance. The actors and settings in this film make for a very realistic ride that is equally enthralling thanks to the amazing talent of the dancers!                                                                                                                    
0    I must say"     when I read the storyline on the back of the case     It sounded really interesting     but when I started to watch the movie seemed boring at first and even more at the end. Some scenes are way too long and the story has not been worked out properly.                                                                                                                                                    
0    i am 13 and i hated this film its the worst film on earth i totally wasted my time watching it and was disappointed with it cause on the cover and on the back the film it looks pretty good"     but i was wrong its bad. but when i saw delta she was totally different and a bad actress and i really didnt know how old the 2 girls was trying to be i was so confused. the film was in some parts confusing and i didn't enjoy it at all but i watched all the film just to see if it was going to get better but it didnt     it was boring    dull and did i say BORING.and i don't think many other people liked it as well as me.boring boring boring                                                                                                                                                    
0    The acting may be okay"     the more u watch this movie     the more u wish you werent     this movie is so horrible     that if I could get a hold of every copy     I would burn them all and not look back     this movie is terrible!!                                                                                                                                        0    I've seen some bad things in my time. A half dead cow trying to get out of waist high mud; a head on collision between two cars; a thousand plates smashing on a kitchen floor; human beings living like animals.<br /><br />But never in my life have I seen anything as bad as The Cat in the Hat.<br /><br />This film is worse than 911"     worse than Hitler     worse than Vllad the Impaler     worse than people who put kittens in microwaves.<br /><br />It is the most disturbing film of all time     easy.<br /><br />I used to think it was a joke     some elaborate joke and that Mike Myers was maybe a high cocaine sniffing drug addled betting junkie who lost a bet or something.<br /><br />I shudder                                                                                                                                            

Each cell contains one piece of review text.
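To inspect the raw file yourself, a quick sketch (assuming the file sits at ./data/imdb/train.csv and, as the pipeline below also assumes, each line holds a 0/1 label and the review text separated by a tab) is:

# print the first two raw lines: "<label>\t<review text>"
with open("./data/imdb/train.csv", encoding="utf-8") as f:
    for _ in range(2):
        print(f.readline()[:200])   # truncate long reviews for readability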

Next, build the training and test datasets:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow.keras import models, layers, preprocessing, optimizers, losses, metrics
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import re, string

train_data_path = "./data/imdb/train.csv"
test_data_path = "./data/imdb/test.csv"

MAX_WORDS = 10000  # consider only the 10000 most frequent words
MAX_LEN = 200      # keep 200 tokens per sample
BATCH_SIZE = 20


# build the data pipeline
def split_line(line):
    arr = tf.strings.split(line, "\t")
    label = tf.expand_dims(tf.cast(tf.strings.to_number(arr[0]), tf.int32), axis=0)
    text = tf.expand_dims(arr[1], axis=0)
    return (text, label)

ds_train_raw = tf.data.TextLineDataset(filenames=[train_data_path]) \
    .map(split_line, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
    .shuffle(buffer_size=1000).batch(BATCH_SIZE) \
    .prefetch(tf.data.experimental.AUTOTUNE)

ds_test_raw = tf.data.TextLineDataset(filenames=[test_data_path]) \
    .map(split_line, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.experimental.AUTOTUNE)


# build the vocabulary
def clean_text(text):
    lowercase = tf.strings.lower(text)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    cleaned_punctuation = tf.strings.regex_replace(
        stripped_html, '[%s]' % re.escape(string.punctuation), '')
    return cleaned_punctuation

vectorize_layer = TextVectorization(
    standardize=clean_text,
    split='whitespace',
    max_tokens=MAX_WORDS - 1,  # one token id is reserved for the padding placeholder
    output_mode='int',
    output_sequence_length=MAX_LEN)

ds_text = ds_train_raw.map(lambda text, label: text)
vectorize_layer.adapt(ds_text)
print(vectorize_layer.get_vocabulary()[0:100])


# encode the words
ds_train = ds_train_raw.map(lambda text, label: (vectorize_layer(text), label)) \
    .prefetch(tf.data.experimental.AUTOTUNE)
ds_test = ds_test_raw.map(lambda text, label: (vectorize_layer(text), label)) \
    .prefetch(tf.data.experimental.AUTOTUNE)

The print statement shows the first 100 words of the learned vocabulary:

[b'the', b'and', b'a', b'of', b'to', b'is', b'in', b'it', b'i', b'this', b'that', b'was', b'as', b'for', b'with', b'movie', b'but', b'film', b'on', b'not', b'you', b'his', b'are', b'have', b'be', b'he', b'one', b'its', b'at', b'all', b'by', b'an', b'they', b'from', b'who', b'so', b'like', b'her', b'just', b'or', b'about', b'has', b'if', b'out', b'some', b'there', b'what', b'good', b'more', b'when', b'very', b'she', b'even', b'my', b'no', b'would', b'up', b'time', b'only', b'which', b'story', b'really', b'their', b'were', b'had', b'see', b'can', b'me', b'than', b'we', b'much', b'well', b'get', b'been', b'will', b'into', b'people', b'also', b'other', b'do', b'bad', b'because', b'great', b'first', b'how', b'him', b'most', b'dont', b'made', b'then', b'them', b'films', b'movies', b'way', b'make', b'could', b'too', b'any', b'after', b'characters']
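Before moving on, it can be worth peeking at one batch of the encoded pipeline to confirm the shapes; this check is a sketch added here (it is not part of the original example) and assumes the ds_train built above:

# take a single batch from the encoded training pipeline and inspect it
for text_batch, label_batch in ds_train.take(1):
    print(text_batch.shape)   # typically (20, 200): BATCH_SIZE x MAX_LEN integer ids
    print(label_batch.shape)  # typically (20, 1): one 0/1 label per review
    print(text_batch[0])      # the integer-encoded first review in the batch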

2. Defining the Model

Keras offers three ways to build a model: the Sequential API for layer-by-layer models, the functional API for arbitrary model topologies, and subclassing the Model base class for fully custom models.

Here we build a custom model by subclassing the Model base class.

# This demonstrates the custom-model style; in practice the Sequential or functional API should be preferred.

tf.keras.backend.clear_session()

class CnnModel(models.Model):
    def __init__(self):
        super(CnnModel, self).__init__()

    def build(self, input_shape):
        self.embedding = layers.Embedding(MAX_WORDS, 7, input_length=MAX_LEN)
        self.conv_1 = layers.Conv1D(16, kernel_size=5, name="conv_1", activation="relu")
        self.pool = layers.MaxPool1D()
        self.conv_2 = layers.Conv1D(128, kernel_size=2, name="conv_2", activation="relu")
        self.flatten = layers.Flatten()
        self.dense = layers.Dense(1, activation="sigmoid")
        super(CnnModel, self).build(input_shape)

    def call(self, x):
        x = self.embedding(x)
        x = self.conv_1(x)
        x = self.pool(x)
        x = self.conv_2(x)
        x = self.flatten(x)
        x = self.dense(x)
        return x

model = CnnModel()
model.build(input_shape=(None, MAX_LEN))
model.summary()

The model structure printed by model.summary() consists of the Embedding layer, the two Conv1D layers with a pooling layer in between, a Flatten layer, and the final Dense output layer.
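For comparison, roughly the same network can be written with the Sequential API mentioned earlier; this version is a sketch added for illustration and is not part of the original example:

from tensorflow.keras import models, layers   # already imported above

model_seq = models.Sequential([
    layers.Embedding(MAX_WORDS, 7, input_length=MAX_LEN),
    layers.Conv1D(16, kernel_size=5, name="conv_1", activation="relu"),
    layers.MaxPool1D(),
    layers.Conv1D(128, kernel_size=2, name="conv_2", activation="relu"),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid")
])
model_seq.summary()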

3. Training the Model

There are usually three ways to train a model: the built-in fit method, the built-in train_on_batch method, and a custom training loop. Here we train the model with a custom training loop.
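For context, training this same model with the first option, the built-in fit method, would look roughly like the sketch below; the optimizer, loss, and metric mirror the custom loop that follows, and this snippet is an illustration rather than part of the original example:

# sketch: equivalent training with the built-in Keras fit method (not used in this example)
model.compile(optimizer=optimizers.Nadam(),
              loss=losses.BinaryCrossentropy(),
              metrics=[metrics.BinaryAccuracy()])
history = model.fit(ds_train, validation_data=ds_test, epochs=6)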

# prints a time-stamped divider line (the +8 hour offset converts UTC to Beijing time)
@tf.function
def printbar():
    ts = tf.timestamp()
    today_ts = ts % (24 * 60 * 60)

    hour = tf.cast(today_ts // 3600 + 8, tf.int32) % tf.constant(24)
    minite = tf.cast((today_ts % 3600) // 60, tf.int32)
    second = tf.cast(tf.floor(today_ts % 60), tf.int32)

    def timeformat(m):
        if tf.strings.length(tf.strings.format("{}", m)) == 1:
            return tf.strings.format("0{}", m)
        else:
            return tf.strings.format("{}", m)

    timestring = tf.strings.join([timeformat(hour), timeformat(minite),
                                  timeformat(second)], separator=":")
    tf.print("==========" * 8, end="")
    tf.print(timestring)


optimizer = optimizers.Nadam()
loss_func = losses.BinaryCrossentropy()

train_loss = metrics.Mean(name='train_loss')
train_metric = metrics.BinaryAccuracy(name='train_accuracy')

valid_loss = metrics.Mean(name='valid_loss')
valid_metric = metrics.BinaryAccuracy(name='valid_accuracy')


@tf.function
def train_step(model, features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = loss_func(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss.update_state(loss)
    train_metric.update_state(labels, predictions)


@tf.function
def valid_step(model, features, labels):
    predictions = model(features, training=False)
    batch_loss = loss_func(labels, predictions)
    valid_loss.update_state(batch_loss)
    valid_metric.update_state(labels, predictions)


def train_model(model, ds_train, ds_valid, epochs):
    for epoch in tf.range(1, epochs + 1):

        for features, labels in ds_train:
            train_step(model, features, labels)

        for features, labels in ds_valid:
            valid_step(model, features, labels)

        # the logs template should be adapted to the metrics actually used
        logs = 'Epoch={},Loss:{},Accuracy:{},Valid Loss:{},Valid Accuracy:{}'

        if epoch % 1 == 0:
            printbar()
            tf.print(tf.strings.format(logs,
                (epoch, train_loss.result(), train_metric.result(),
                 valid_loss.result(), valid_metric.result())))
            tf.print("")

        train_loss.reset_states()
        valid_loss.reset_states()
        train_metric.reset_states()
        valid_metric.reset_states()

train_model(model, ds_train, ds_test, epochs=6)

Training output:

================================================================================14:45:06
Epoch=1,Loss:0.474225521,Accuracy:0.7376,Valid Loss:0.336961836,Valid Accuracy:0.8526

================================================================================14:45:12
Epoch=2,Loss:0.245222151,Accuracy:0.9035,Valid Loss:0.326947063,Valid Accuracy:0.8666

================================================================================14:45:17
Epoch=3,Loss:0.165854618,Accuracy:0.93795,Valid Loss:0.365531504,Valid Accuracy:0.867

================================================================================14:45:23
Epoch=4,Loss:0.104812928,Accuracy:0.96395,Valid Loss:0.448238105,Valid Accuracy:0.861

================================================================================14:45:29
Epoch=5,Loss:0.0595887862,Accuracy:0.98125,Valid Loss:0.602612,Valid Accuracy:0.8624

================================================================================14:45:35
Epoch=6,Loss:0.0318539739,Accuracy:0.9905,Valid Loss:0.762770712,Valid Accuracy:0.8598

4. Evaluating the Model

A model trained through a custom training loop has not been compiled, so the model.evaluate(ds_valid) method cannot be used directly.
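One workaround, shown here only as a sketch and not taken from the original example, is to compile the already-trained model once and then call evaluate; the code below instead reuses the valid_step function defined in the training loop:

# sketch: compile the trained model so that the built-in evaluate method becomes usable
model.compile(loss=losses.BinaryCrossentropy(), metrics=[metrics.BinaryAccuracy()])
val_loss, val_accuracy = model.evaluate(ds_test)
print(val_loss, val_accuracy)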

def evaluate_model(model, ds_valid):
    for features, labels in ds_valid:
        valid_step(model, features, labels)
    logs = 'Valid Loss:{},Valid Accuracy:{}'
    tf.print(tf.strings.format(logs, (valid_loss.result(), valid_metric.result())))

    valid_loss.reset_states()
    valid_metric.reset_states()

evaluate_model(model, ds_test)

5. Using the Model

The following methods are available:

  • model.predict(ds_test)
  • model(x_test)
  • model.call(x_test)
  • model.predict_on_batch(x_test)

model.predict(ds_test) is the recommended choice, since it works on both a Dataset and a Tensor.

model.predict(ds_test)

for x_test, _ in ds_test.take(1):
    print(model(x_test))
    # the following calls are equivalent:
    print(model.call(x_test))
    print(model.predict_on_batch(x_test))

The output:

tf.Tensor(
[[9.9007505e-01]
 [9.9999797e-01]
 [9.9836570e-01]
 [2.6509229e-06]
 [4.7592866e-01]
 [3.7760619e-05]
 [8.0391978e-08]
 [1.6816575e-05]
 [9.9996006e-01]
 [9.9695146e-01]
 [1.0000000e+00]
 [9.9962234e-01]
 [1.9009445e-08]
 [9.7622436e-01]
 [4.4549329e-06]
 [2.8802201e-01]
 [1.0730105e-04]
 [3.8324962e-03]
 [2.2874507e-03]
 [9.9966860e-01]],shape=(20,1),dtype=float32)
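The values above are sigmoid outputs, i.e. probabilities that a review is positive; to turn them into hard 0/1 labels one can threshold at 0.5 (a sketch added here, not part of the original example):

probs = model.predict(ds_test)                # probabilities in [0, 1]
pred_labels = (probs > 0.5).astype("int32")   # 1 = positive review, 0 = negative review
print(pred_labels[:5])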

6. Saving the Model

Saving the model in the native TensorFlow format is recommended.

model.save('./data/tf_model_savedmodel', save_format="tf")
print('export saved model.')

model_loaded = tf.keras.models.load_model('./data/tf_model_savedmodel')
model_loaded.predict(ds_test)

Output:

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: ./data/tf_model_savedmodel/assets
export saved model.
WARNING:tensorflow:No training configuration found in save file,so the model was *not* compiled. Compile it manually.
array([[0.99007505],[0.999998  ],[0.9983657 ],...,[0.9962114 ],[1.        ]],dtype=float32)
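A related option, sketched here with a hypothetical path and not part of the original example, is to save only the weights in checkpoint format and rebuild the architecture before loading them:

# sketch (hypothetical path): save only the weights in TensorFlow checkpoint format
model.save_weights('./data/tf_model_weights.ckpt', save_format='tf')

# to restore, rebuild the same architecture first, then load the weights
model_new = CnnModel()
model_new.build(input_shape=(None, MAX_LEN))
model_new.load_weights('./data/tf_model_weights.ckpt')
model_new.predict(ds_test)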

 

References:

Open-source e-book: https://lyhue1991.github.io/eat_tensorflow2_in_30_days/

GitHub repository: https://github.com/lyhue1991/eat_tensorflow2_in_30_days
