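Using Sklearn for TF-IDF

Yesterday's post covered installing Sklearn and how TF-IDF is computed; today's post shows how to compute TF-IDF with Sklearn.

Whenever we use a framework, the first thing to read is the official documentation, to check whether it covers the algorithm we need. As it happens, the official docs do describe how to use TF-IDF.

So we will follow the official documentation for today's TF-IDF.

First, import the framework. Following the docs, the import looks like this: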

>>> from sklearn.feature_extraction.text import TfidfVectorizer

That imports TfidfVectorizer from Sklearn. Next comes our corpus:

docs = [
    "it is a good day, I like to stay here",
    "I am happy to be here",
    "I am bob",
    "it is sunny today",
    "I have a party today",
    "it is a dog and that is a cat",
    "there are dog and cat on the tree",
    "I study hard this morning",
    "today is a good day",
    "tomorrow will be a good day",
    "I like coffee, I like book and I like apple",
    "I do not like it",
    "I am kitty, I like bob",
    "I do not care who like bob, but I like kitty",
    "It is coffee time, bring your cup",
]

Next, we first need to create a TfidfVectorizer object:

>>> vectorizer = TfidfVectorizer()

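Then feed in the texts. fit_transform returns the TF-IDF matrix (not a raw term-frequency matrix). It is sparse, so printing it lists (document index, feature index) pairs with their weights: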

>>> tf_idf = vectorizer.fit_transform(docs)
>>> print(tf_idf)
  (0, 20)       0.39027579919573646
  (0, 29)       0.4494547871352568
........ (omitted)
  (14, 22)      0.27226932852633273

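Next, look at the array that maps feature integer indices to feature names. (On scikit-learn 1.0 and later, call get_feature_names_out() instead; get_feature_names() was deprecated in 1.0 and removed in 1.2.)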

>>> print(vectorizer.get_feature_names())
['am', 'and', 'apple', 'are', 'be', 'bob', 'book', 'bring', 'but', 'care', 'cat', 'coffee', 'cup', 'day', 'do', 'dog', 'good', 'happy', 'hard', 'have', 'here', 'is', 'it', 'kitty', 'like', 'morning', 'not', 'on', 'party', 'stay', 'study', 'sunny', 'that', 'the', 'there', 'this', 'time', 'to', 'today', 'tomorrow', 'tree', 'who', 'will', 'your']
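
Notice that "I" and "a" never show up in the vocabulary. A minimal sketch of why, assuming the vectorizer's default settings (lowercasing, and a token pattern that only keeps tokens of two or more word characters). The tokenize helper below is a hypothetical stand-in for illustration, not Sklearn's own code:

```python
import re

# Assumption: this mirrors TfidfVectorizer's default token_pattern,
# which only matches tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text, then keep tokens of 2+ word characters."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("it is a good day, I like to stay here")
print(tokens)
# ['it', 'is', 'good', 'day', 'like', 'to', 'stay', 'here']
```

Single-character words and punctuation are dropped before any weighting happens, which is why the query word counts later on only involve the longer words.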

You can also inspect the idf value of each feature:

>>> print('idf: ',[(n,idf) for idf,n in zip(vectorizer.idf_,vectorizer.get_feature_names())])
idf:  [('am', 2.386294361119891), ('and', 2.386294361119891), ('apple', 3.0794415416798357), ('are', 3.0794415416798357), ('be', 2.6739764335716716), ('bob', 2.386294361119891), ('book', 3.0794415416798357), ('bring', 3.0794415416798357), ('but', 3.0794415416798357), ('care', 3.0794415416798357), ('cat', 2.6739764335716716), ('coffee', 2.6739764335716716), ('cup', 3.0794415416798357), ('day', 2.386294361119891), ('do', 2.6739764335716716), ('dog', 2.6739764335716716), ('good', 2.386294361119891), ('happy', 3.0794415416798357), ('hard', 3.0794415416798357), ('have', 3.0794415416798357), ('here', 2.6739764335716716), ('is', 1.9808292530117262), ('it', 1.9808292530117262), ('kitty', 2.6739764335716716), ('like', 1.9808292530117262), ('morning', 3.0794415416798357), ('not', 2.6739764335716716), ('on', 3.0794415416798357), ('party', 3.0794415416798357), ('stay', 3.0794415416798357), ('study', 3.0794415416798357), ('sunny', 3.0794415416798357), ('that', 3.0794415416798357), ('the', 3.0794415416798357), ('there', 3.0794415416798357), ('this', 3.0794415416798357), ('time', 3.0794415416798357), ('to', 2.6739764335716716), ('today', 2.386294361119891), ('tomorrow', 3.0794415416798357), ('tree', 3.0794415416798357), ('who', 3.0794415416798357), ('will', 3.0794415416798357), ('your', 3.0794415416798357)]
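
To see where these numbers come from, here is a rough sketch that recomputes them by hand using only the standard library. It assumes the vectorizer's defaults: smoothed idf, idf = ln((1 + n) / (1 + df)) + 1, raw counts as term frequency, and each row scaled to unit L2 length. The helper names (tokenize, tfidf_row) are made up for this example:

```python
import math
import re
from collections import Counter

docs = [
    "it is a good day, I like to stay here",
    "I am happy to be here",
    "I am bob",
    "it is sunny today",
    "I have a party today",
    "it is a dog and that is a cat",
    "there are dog and cat on the tree",
    "I study hard this morning",
    "today is a good day",
    "tomorrow will be a good day",
    "I like coffee, I like book and I like apple",
    "I do not like it",
    "I am kitty, I like bob",
    "I do not care who like bob, but I like kitty",
    "It is coffee time, bring your cup",
]

def tokenize(text):
    # Assumption: same rule as the vectorizer defaults (lowercase, 2+ char tokens).
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

n_docs = len(docs)
tokenized = [tokenize(d) for d in docs]

# Document frequency: the number of documents containing each term at least once.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_row(tokens):
    """Multiply raw counts by idf, then scale the row to unit L2 length."""
    counts = Counter(tokens)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = tfidf_row(tokenized[0])
print(idf["is"], idf["apple"])     # compare with the idf list above
print(row0["here"], row0["stay"])  # compare with entries (0, 20) and (0, 29)
```

A word like "is" appears in 5 of the 15 sentences, so its idf is low; a word like "apple" appears in only one, so its idf is the highest possible for this corpus.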

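With the TF-IDF matrix in hand, we can put it to use. I am rather lazy, though, and did not want to write the raw code myself, so I went back to the docs and luckily found another usable piece of Sklearn: a function that computes cosine similarity.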

>>> from sklearn.metrics.pairwise import cosine_similarity

Next is the new sentence we want to query with:

>>> q = "I get a coffee cup"

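The vocabulary and idf weights were already learned from the corpus, so this time we call transform rather than fit_transform; the query is mapped into the same feature space, and any word that is not in the vocabulary (such as "get") is ignored. transform expects a list of texts, so we wrap the single sentence in a list: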

>>> qtf_idf = vectorizer.transform([q])
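
A small sketch of what happens to the query, under the same assumed defaults (2+ character tokens, idf weighting, unit L2 length). The idf values are copied from the output printed earlier; query_vector is a hypothetical helper, not a Sklearn function:

```python
import math
import re

# idf values copied from the idf list printed earlier (only the terms needed here).
idf = {"coffee": 2.6739764335716716, "cup": 3.0794415416798357}

def query_vector(text, idf):
    """Build a unit-length TF-IDF vector for text, using already-fitted idf values."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    weights = {}
    for t in tokens:
        # Words never seen during fitting (here "get") have no column, so skip them.
        if t in idf:
            weights[t] = weights.get(t, 0.0) + idf[t]
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

q_vec = query_vector("I get a coffee cup", idf)
print(q_vec)
```

Only "coffee" and "cup" survive, and "cup" gets the larger weight because it is the rarer word in the corpus.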

Then call the function to get the cosine similarity matrix:

>>> res = cosine_similarity(tf_idf, qtf_idf)

Printing it shows a 2-D array with one row per document and a single column for the query:

[[0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.21398863]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.56058105]]
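
For the curious, the last score can be reproduced by hand. This is only a sketch: the cosine helper is my own, the vectors are plain dicts of count times idf, and the idf values are copied from the output printed earlier:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# idf values copied from the idf list printed earlier.
idf = {
    "it": 1.9808292530117262, "is": 1.9808292530117262,
    "coffee": 2.6739764335716716, "time": 3.0794415416798357,
    "bring": 3.0794415416798357, "your": 3.0794415416798357,
    "cup": 3.0794415416798357, "am": 2.386294361119891,
    "bob": 2.386294361119891,
}

# Every term occurs once in these sentences, so the weight is just the idf.
query = {t: idf[t] for t in ["coffee", "cup"]}  # "I get a coffee cup"
doc14 = {t: idf[t] for t in ["it", "is", "coffee", "time", "bring", "your", "cup"]}
doc2 = {t: idf[t] for t in ["am", "bob"]}       # "I am bob"

score14 = cosine(query, doc14)
score2 = cosine(query, doc2)
print(score14, score2)
```

Documents that share no word with the query get exactly 0, which explains all the zeros in the array above.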

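We need to flatten it into a 1-D array, sort it, and take the indices of the three largest values. argsort sorts in ascending order, so the last three indices are the three highest scores, with the best match last: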

>>> res = res.ravel().argsort()[-3:]
>>> print(res)
[13 10 14]
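
Since only two documents have a non-zero score, the third index is just one of the many zero-score ties. One possible way to rank best-first and skip the zeros, sketched in plain Python (top_k is a hypothetical helper, and the scores are copied from the array above):

```python
# Scores copied from the printed array: only documents 10 and 14 are non-zero.
scores = [0.0] * 15
scores[10] = 0.21398863
scores[14] = 0.56058105

def top_k(scores, k=3):
    """Return up to k document indices, best match first, skipping zero scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if scores[i] > 0]

best = top_k(scores)
print(best)
# [14, 10]
```

Whether to drop zero-score results is a design choice; for a search-style use case, returning fewer but relevant hits usually reads better than padding the list.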

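Now we can look at which three sentences are closest to our input q. Keep in mind that the list is ordered from lowest to highest score, and that the first entry (document 13) actually scored 0: only two sentences share any vocabulary with q.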

>>> print("\ntop 3 docs for '{}':\n{}".format(q, [docs[i] for i in res]))
top 3 docs for 'I get a coffee cup':
['I do not care who like bob, but I like kitty', 'I like coffee, I like book and I like apple', 'It is coffee time, bring your cup']

The complete code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "it is a good day, I like to stay here",
    "I am happy to be here",
    "I am bob",
    "it is sunny today",
    "I have a party today",
    "it is a dog and that is a cat",
    "there are dog and cat on the tree",
    "I study hard this morning",
    "today is a good day",
    "tomorrow will be a good day",
    "I like coffee, I like book and I like apple",
    "I do not like it",
    "I am kitty, I like bob",
    "I do not care who like bob, but I like kitty",
    "It is coffee time, bring your cup",
]
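# Learn the vocabulary and idf weights, and build the TF-IDF matrix for the corpus
vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, use it here instead
names = vectorizer.get_feature_names_out()
print('idf: ', [(n, idf) for idf, n in zip(vectorizer.idf_, names)])
print("v2i: ", vectorizer.vocabulary_)
q = "I get a coffee cup"
# Reuse the fitted vocabulary and idf weights for the query
qtf_idf = vectorizer.transform([q])
res = cosine_similarity(tf_idf, qtf_idf)
# Indices of the three highest scores, in ascending order of score
res = res.ravel().argsort()[-3:]
print("\ntop 3 docs for '{}':\n{}".format(q, [docs[i] for i in res]))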
