Word Embeddings with Python: doc2vec

Implementing doc2vec with Python (and gensim)

Note: this code was written with Python 3.6.1 (+ gensim 2.3.0).

Python implementation and application of doc2vec with gensim

import re
import numpy as np

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from nltk.corpus import gutenberg
from multiprocessing import Pool
from scipy import spatial

Import the training dataset

Import Shakespeare's Hamlet corpus from the nltk library.

sentences = list(gutenberg.sents('shakespeare-hamlet.txt'))   # import the corpus and convert it into a list
print('Type of corpus: ', type(sentences))
print('Length of corpus: ', len(sentences))

Type of corpus:  <class 'list'>
Length of corpus:  3106

print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']
['Actus', 'Primus', '.']
['Fran', '.']

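Preprocess the data

Use the re module to preprocess the data: convert all letters to lowercase and remove punctuation, numbers, and similar tokens.
For the doc2vec model, the input data should be an iterable of TaggedDocument objects. Each TaggedDocument instance holds a list of words and a list of tags, so each document (that is, each sentence or paragraph) should have a unique, identifiable tag.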

# keep only tokens that start with a letter, and lowercase them
for i in range(len(sentences)):
    sentences[i] = [word.lower() for word in sentences[i] if re.match('^[a-zA-Z]+', word)]
print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])

['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare']
['actus', 'primus']
['fran']
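
One detail of the filter above is worth checking: `re.match('^[a-zA-Z]+', word)` only tests the start of a token, so any token that merely begins with a letter is kept. The following is a minimal sketch on a made-up token list (not taken from the corpus) that compares it with `re.fullmatch`, which would be a stricter alternative if purely alphabetic tokens are wanted.

```python
import re

# Hypothetical tokens for illustration; not taken from the Hamlet corpus.
tokens = ['[', 'The', 'Tragedie', '1599', "o'clock", 'a1', ']']

# Filter used in this post: keeps tokens that START with a letter.
prefix_filtered = [w.lower() for w in tokens if re.match('^[a-zA-Z]+', w)]

# Stricter alternative (an assumption about intent): keeps tokens made only of letters.
strict_filtered = [w.lower() for w in tokens if re.fullmatch('[a-zA-Z]+', w)]

print(prefix_filtered)
print(strict_filtered)
```

Which of the two is preferable depends on whether tokens with embedded apostrophes or digits should stay in the vocabulary.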

# convert each sentence into a TaggedDocument with a unique tag
for i in range(len(sentences)):
    sentences[i] = TaggedDocument(words = sentences[i], tags = ['sent{}'.format(i)])
sentences[0]

TaggedDocument(words=['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare'], tags=['sent0'])
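
To make the structure of a tagged document concrete, here is a small sketch that runs without gensim. It uses a `namedtuple` with `words` and `tags` fields as a stand-in for gensim's `TaggedDocument`; the stand-in `TaggedDoc` and the helper name `tag_sentences` are assumptions for illustration only. It also checks that every sentence receives a unique tag.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, so the sketch runs without gensim.
TaggedDoc = namedtuple('TaggedDoc', ['words', 'tags'])

def tag_sentences(token_lists, prefix='sent'):
    """Wrap each token list with one tag built from its position."""
    return [TaggedDoc(words=tokens, tags=['{}{}'.format(prefix, i)])
            for i, tokens in enumerate(token_lists)]

corpus = [['the', 'tragedie', 'of', 'hamlet'], ['actus', 'primus'], ['fran']]
tagged = tag_sentences(corpus)

# Flatten all tags and verify that none is repeated.
all_tags = [tag for doc in tagged for tag in doc.tags]
print(tagged[0])
print(len(all_tags) == len(set(all_tags)))
```

Building the tag from the sentence index is what guarantees uniqueness here; any other scheme would work as long as no two documents share a tag.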

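Create and train the model

Create a doc2vec model and train it on the Hamlet corpus.
Key parameters (radimrehurek./gensim/models/doc2vec.html):
documents: training data (an iterable of TaggedDocument objects)
size: dimension of the embedding space
dm: training algorithm; 1 selects distributed memory (PV-DM), 0 selects distributed bag of words (PV-DBOW)
window: number of context words on each side (with a window of 3, the 3 words to the left and the 3 words to the right are considered)
min_count: minimum number of occurrences a word needs to be included in the vocabulary
iter: number of training iterations
workers: number of worker threads used for training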

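# Pool()._processes gives the default worker count (the number of CPUs)
model = Doc2Vec(documents = sentences, dm = 1, size = 100, min_count = 1, iter = 10, workers = Pool()._processes)
model.init_sims(replace = True)    # normalize the vectors in place to save memory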
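
The `window` and `min_count` parameters can be illustrated in plain Python. The sketch below is a simplified illustration, not gensim's implementation; the helpers `context_words` and `build_vocab` are hypothetical names, and the tiny `docs` list is made up.

```python
from collections import Counter

def context_words(tokens, index, window):
    """Return up to `window` tokens on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def build_vocab(token_lists, min_count):
    """Keep words that occur at least `min_count` times across all documents."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return {word for word, n in counts.items() if n >= min_count}

sentence = ['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare']
print(context_words(sentence, 3, 3))   # neighbors of 'hamlet'
print(context_words(sentence, 0, 3))   # fewer neighbors at the sentence edge

docs = [['the', 'king'], ['the', 'queen'], ['the', 'king', 'speaks']]
print(sorted(build_vocab(docs, min_count=2)))
```

With `min_count = 1`, as used in this post, every word that appears at all stays in the vocabulary, which suits a corpus as small as a single play.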

Save and load the model

A doc2vec model can be saved to and loaded from local disk.
Doing so avoids having to train the model again.

model.save('doc2vec_model')
model = Doc2Vec.load('doc2vec_model')

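Similarity calculation

The similarity between embedded words (that is, their vectors) can be computed with metrics such as cosine similarity.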

model.most_similar('hamlet')

[('horatio', 0.9978846311569214),
 ('queene', 0.9971947073936462),
 ('laertes', 0.9971820116043091),
 ('king', 0.9968599081039429),
 ('mother', 0.9966716170310974),
 ('where', 0.9966292381286621),
 ('deere', 0.9965540170669556),
 ('ophelia', 0.9964221715927124),
 ('very', 0.9963752627372742),
 ('oh', 0.9963476657867432)]

v1 = model['king']
v2 = model['queen']

# define a function that computes cosine similarity between two word vectors
def cosine_similarity(v1, v2):
    return 1 - spatial.distance.cosine(v1, v2)

cosine_similarity(v1, v2)

0.99437165260314941
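
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms, and `spatial.distance.cosine` returns one minus that value, which is why the function above subtracts it from 1. A minimal pure-Python sketch of the same computation follows, using made-up vectors rather than model output; the name `cosine_similarity_plain` is chosen here only to avoid clashing with the function above.

```python
import math

def cosine_similarity_plain(v1, v2):
    """Dot product of v1 and v2 divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

print(cosine_similarity_plain([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
print(cosine_similarity_plain([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity_plain([1.0, 0.0], [-1.0, 0.0]))           # opposite direction
```

The value ranges from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction), so the score near 0.99 reported above means the two word vectors point in almost the same direction.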
