How to use the Wiki Fasttext.vec and Google News Word2vec.bin pre-trained files as the weights of a Keras Embedding layer
I have a function that extracts the pre-trained embeddings from GloVe.txt and loads them as the weights of a Keras Embedding layer, but how do I do the same for the two files given above?
This accepted stackoverflow answer gives me the feeling that a .vec file can be treated as a .txt file, so the same technique we use to extract embeddings from glove.txt might also work for fasttext.vec. Is my understanding correct?
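For what it's worth, the main practical difference is that fastText's .vec files are in word2vec text format and start with a header line ("vocab_size dim"), while GloVe's .txt files have no header. A minimal sketch of a loader that handles both cases (the function name and file path below are only placeholders):

import numpy as np

def load_text_embeddings(path):
    """Load a GloVe-style .txt or a word2vec/fastText-style .vec file into a dict."""
    embeddings_index = {}
    with open(path, encoding='utf8', errors='ignore') as f:
        for i, line in enumerate(f):
            values = line.rstrip().split(' ')
            # a word2vec/fastText header line looks like: "2000000 300"
            if i == 0 and len(values) == 2:
                continue
            embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')
    return embeddings_index

# e.g. load_text_embeddings('wiki-news-300d-1M.vec')  # hypothetical file name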
I went through a lot of blogs and Stack answers to find out how to handle the binary file. I found in this stack answer that the binary .bin file is the MODEL itself, not just the embeddings, and that you can use Gensim to convert the bin file to a text file. I think that text file then holds the embeddings, and we can load it the same way we load GloVe. Is my understanding correct?
Here is the code that does this. I want to know whether I am on the right track, because I could not find a satisfactory answer anywhere.
from numpy import asarray, zeros
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()                                  # Keras Tokenizer()
tokenizer.fit_on_texts(data)                             # data is a list of sentences
vocab_size = len(tokenizer.word_index) + 1               # extra 1 for unknown words
encoded_docs = tokenizer.texts_to_sequences(data)
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')  # max_length is, say, 30

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)   # this will load the binary Word2Vec model
model.save_word2vec_format('GoogleNews-vectors-negative300.txt', binary=False)                 # this will save the VECTORS in a text file. Can I load it using the function below?

def load_embeddings(vocab_size, fitted_tokenizer, emb_file_path, emb_dim=300):
    '''
    It can load GloVe.txt for sure. But is it the right way to load paragram.txt,
    fasttext.vec and word2vec.bin once converted to .txt?
    '''
    embeddings_index = dict()
    with open(emb_file_path) as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    embedding_matrix = zeros((vocab_size, emb_dim))
    for word, i in fitted_tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
My question is: can we load the .vec file directly, and can we load the .bin file with the load_embeddings() function described above?
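As a side note, gensim's KeyedVectors can read both formats directly, so the intermediate .txt conversion is not strictly required. A rough sketch under that assumption (the helper name is hypothetical; note that the word2vec loader expects a header line, so a plain GloVe .txt file would still need a GloVe-style loader like the one above):

import numpy as np
from gensim.models import KeyedVectors

def embedding_matrix_from_keyedvectors(path, fitted_tokenizer, vocab_size, emb_dim=300):
    # .bin -> binary=True; .vec or word2vec-style .txt -> binary=False
    kv = KeyedVectors.load_word2vec_format(path, binary=path.endswith('.bin'))
    embedding_matrix = np.zeros((vocab_size, emb_dim))
    for word, i in fitted_tokenizer.word_index.items():
        if word in kv:  # copy the pre-trained vector when the token is in the embedding vocabulary
            embedding_matrix[i] = kv[word]
    return embedding_matrix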
Solution
I have found the answer. Please correct it if there is any problem.
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


class PreProcess():
    # check: https://stackabuse.com/pythons-classmethod-and-staticmethod-explained/ for @staticmethod use

    @staticmethod  # you don't have to create an object of this class to access this method: PreProcess.preprocess_data()
    def preprocess_data(data: list, max_length: int):
        '''
        Method to parse, tokenize, build the vocab and pad the text data
        args:
            data: list of all the texts, e.g. ['this is text 1', 'this is text 2 of different length']
            max_length: maximum length to consider for an individual text entry in data
        out:
            vocab size, fitted tokenizer object, encoded input text and padded input text
        '''
        tokenizer = Tokenizer()  # set num_words, oov_token arguments depending on your use case
        tokenizer.fit_on_texts(data)
        vocab_size = len(tokenizer.word_index) + 1  # extra 1 for unknown words, which stay all 0s when loading pre-trained embeddings
        encoded_docs = tokenizer.texts_to_sequences(data)
        padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
        return vocab_size, tokenizer, encoded_docs, padded_docs

    @staticmethod
    def load_pretrained_embeddings(fitted_tokenizer, vocab_size: int, emb_file: str, emb_dim: int = 300):
        '''
        All 300D Embeddings: https://www.kaggle.com/reppy4620/embeddings
        '''
        if '.bin' in emb_file:  # a binary file is not the embeddings but the MODEL itself; it could be a fasttext or word2vec model
            model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
            # emb_file = emb_file.replace('.bin', '.txt')  # general-purpose path
            emb_file = './new_emb_file.txt'  # for Kaggle, because you can only save data in the output dir
            model.save_word2vec_format(emb_file, binary=False)

        # open and read the contents of the .txt / .vec file (.vec has the same layout as .txt)
        embeddings_index = dict()
        with open(emb_file, encoding="utf8", errors='ignore') as f:
            for i, line in enumerate(f):  # each line looks like: hello 0.9 0.3 0.5 0.01 0.001 ...
                values = line.split(' ')
                if i == 0 and len(values) == 2:
                    # skip the word2vec-style header line ("vocab_size dim"); GloVe-style files have no header.
                    # Most Kaggle kernels guard this with `if len(line) > 100` instead -- the reason is this
                    # difference between GloVe-style and Word2Vec-style embedding files.
                    # check this link: https://radimrehurek.com/gensim/scripts/glove2word2vec.html
                    continue
                word = values[0]  # first value is "hello"
                coefs = np.asarray(values[1:], dtype='float32')  # everything else is the vector of "hello"
                embeddings_index[word] = coefs

        # create the embedding matrix or Embedding weights based on your data
        embedding_matrix = np.zeros((vocab_size, emb_dim))  # build embeddings based on our vocab size
        for word, i in fitted_tokenizer.word_index.items():  # go through each vocab token one by one
            embedding_vector = embeddings_index.get(word)  # look it up in the loaded embeddings
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector  # if it is present, just copy in the corresponding vector
        return embedding_matrix

    @staticmethod
    def load_ELMO(data):
        pass
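To actually plug the returned matrix into a Keras Embedding layer (the original goal), a usage sketch could look like the following; the corpus, embedding file name and max_length here are only placeholders:

from tensorflow.keras.layers import Embedding

data = ['this is text 1', 'this is text 2 of different length']  # placeholder corpus
max_length = 30

vocab_size, tokenizer, encoded_docs, padded_docs = PreProcess.preprocess_data(data, max_length)
embedding_matrix = PreProcess.load_pretrained_embeddings(
    tokenizer, vocab_size, 'GoogleNews-vectors-negative300.bin', emb_dim=300)

embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=300,
                            weights=[embedding_matrix],
                            input_length=max_length,
                            trainable=False)  # freeze the pre-trained vectors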