[python] Keyword extraction with word clouds: using wordcloud, source-code analysis, generating Chinese word clouds, and rewriting the code

2022-09-29 10:35:38

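1. Introduction to word clouds

A word cloud, also called a text cloud or tag cloud, visually highlights the "keywords" that occur most frequently in a text. The keywords are rendered as a colorful, cloud-like image, so the main message of the text can be grasped at a glance. Word clouds are common in blogs, microblogs, and article analysis.

Besides ready-made online tools such as Wordle, Tagxedo, Tagul, and Tagcrowd, a word cloud can also be produced fairly easily in Python with the wordcloud package (see its official site and GitHub project):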

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Read the whole text.
with open('constitution.txt') as f:
    text = f.read()

# Generate a word cloud image
wordcloud = WordCloud().generate(text)

# Display the generated image the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

The generated word cloud looks like this:

An image can also be used as a mask:

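from os import path
import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

d = path.dirname(__file__)  # directory that holds alice_mask.png
stopwords = set(STOPWORDS)  # the package's built-in English stopword list
alice_mask = np.array(Image.open(path.join(d, "alice_mask.png")))
wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask,
               stopwords=stopwords, contour_width=3, contour_color='steelblue')
wc.generate(text)  # text is the string read in the previous example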

2. Installation

pip install wordcloud

Related post: "Word cloud: fixing the error "error: command 'x86_64-linux-gnu-gcc' failed with exit status 1" during pip install wordcloud"

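3. How wordcloud works, based on its source code

Overall, wordcloud does three things:

(1) Text preprocessing

(2) Word-frequency counting

(3) Rendering the high-frequency words in color as an image

As the code above shows, a single call to wordcloud.generate(text) performs all three.

Source code: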

def generate(self, text):

    """Generate wordcloud from text.

    The input "text" is expected to be a natural text. If you pass a sorted

    list of words, words will appear in your output twice. To remove this

    duplication, set ``collocations=False``.

    Alias to generate_from_text.

    Calls process_text and generate_from_frequencies.

    Returns

    -------

    self

    """

    return self.generate_from_text(text)

def generate_from_text(self, text):

    """Generate wordcloud from text.

    The input "text" is expected to be a natural text. If you pass a sorted

    list of words, words will appear in your output twice. To remove this

    duplication, set ``collocations=False``.

    Calls process_text and generate_from_frequencies.

    ..versionchanged:: 1.2.2

        Argument of generate_from_frequencies() is not return of

        process_text() any more.

    Returns

    -------

    self

    """

    words = self.process_text(text)

    self.generate_from_frequencies(words)

    return self

The call order of generate() and generate_from_text() is:

generate(self, text)

=>

self.generate_from_text(text)

=>

words = self.process_text(text)

self.generate_from_frequencies(words)

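Here process_text(text) covers text preprocessing and word-frequency counting, while generate_from_frequencies(words) builds the word cloud from those frequencies.

(1) process_text(text) mainly tokenizes the text and removes noise.

Specifically, it performs the following steps:

Check the text type to choose the regex flags (only relevant for Python 2 unicode strings)
Tokenize by rule: keep word characters (\w, e.g. A-Za-z0-9_) and apostrophes ('), and drop single-character tokens
Remove stopwords
Remove the 's suffix (for English)
Remove pure numbers
Count unigram and bigram frequencies (unigrams_and_bigrams); this step is optional

The result is a dictionary dict(string, int) that maps each token to the number of times it occurs.

A few points here need attention; they are discussed later in this article.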
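The steps listed above can be approximated with the standard library alone. This is a minimal sketch, not the library's code: the function name simple_process_text is made up for illustration, and it leaves out the case merging, plural normalization, and bigram handling that wordcloud performs in process_tokens and unigrams_and_bigrams.

```python
import re
from collections import Counter


def simple_process_text(text, stopwords=(), regexp=r"\w[\w']+"):
    """Simplified stand-in for process_text: tokenize, filter, count."""
    stop = {w.lower() for w in stopwords}
    # Tokenize: one word character followed by at least one more
    # word character or apostrophe, so single characters are dropped.
    words = re.findall(regexp, text)
    # Remove stopwords (case-insensitive).
    words = [w for w in words if w.lower() not in stop]
    # Strip a trailing 's.
    words = [w[:-2] if w.lower().endswith("'s") else w for w in words]
    # Remove pure numbers.
    words = [w for w in words if not w.isdigit()]
    return dict(Counter(words))


counts = simple_process_text(
    "The cat's toy and the dog's toy cost 10 dollars",
    stopwords={"the", "and"},
)
print(counts)
```

In this example "toy" is counted twice, "cat's" and "dog's" are reduced to "cat" and "dog", and "10" is discarded.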

The source code is as follows:

def process_text(self, text):

    """Splits a long text into words, eliminates the stopwords.

    Parameters

    ----------

    text : string

        The text to be processed.

    Returns

    -------

    words : dict (string, int)

        Word tokens with associated frequency.

    ..versionchanged:: 1.2.2

        Changed return type from list of tuples to dict.

    Notes

    -----

    There are better ways to do word tokenization, but I don't want to

    include all those things.

    """

    stopwords = set([i.lower() for i in self.stopwords])

    flags = (re.UNICODE if sys.version < '3' and type(text) is unicode

             else 0)

    regexp = self.regexp if self.regexp is not None else r"\w[\w']+"

    words = re.findall(regexp, text, flags)

    # remove stopwords

    words = [word for word in words if word.lower() not in stopwords]

    # remove 's

    words = [word[:-2] if word.lower().endswith("'s") else word

             for word in words]

    # remove numbers

    words = [word for word in words if not word.isdigit()]

    if self.collocations:

        word_counts = unigrams_and_bigrams(words, self.normalize_plurals)

    else:

        word_counts, _ = process_tokens(words, self.normalize_plurals)

    return word_counts


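(2) generate_from_frequencies(words) mainly lays out the word cloud from the result of the previous step.

Specifically, it performs the following steps:

Sort the word counts and normalize them so that the largest becomes 1, giving word frequencies between 0 and 1
Create the image and determine the initial font_size
Assign self.words_, which records the max_words most frequent words and their normalized frequencies, i.e. dict(token, normalized_frequency)
Draw the grayscale image: the higher the frequency, the larger the font_size; a random number decides whether each word is drawn horizontally or vertically
- if the random number is less than self.prefer_horizontal, the word is horizontal, otherwise vertical;
- if there is not enough space, first try rotating the word, then try shrinking the font
Assign self.layout_, which records the words and frequencies, font sizes, positions, orientations, and colors, i.e. list(zip(frequencies, font_sizes, positions, orientations, colors))

The main purpose of this function is therefore to compute self.layout_, which holds all the information needed to draw the word cloud.

Later, wordcloud.to_file(filename) or plt.imshow(wordcloud) renders the result as an image. to_file() first checks whether self.layout_ has been assigned and raises an error if it has not.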
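The normalization, font sizing, and rotate-then-shrink fallback described above can be sketched without any drawing. This is a simplified illustration under stated assumptions: normalize and plan_layout are made-up names, and the fits callback is a hypothetical stand-in for the library's occupancy-map lookup, which is what actually decides whether a word has room on the canvas.

```python
import random


def normalize(counts, max_words):
    """Sort counts in descending order and scale so the largest is 1.0."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[:max_words]
    top = float(ranked[0][1])
    return [(word, count / top) for word, count in ranked]


def plan_layout(frequencies, font_size, fits, relative_scaling=0.5,
                prefer_horizontal=0.9, min_font_size=4, font_step=1,
                rng=None):
    """Pick a font size and orientation per word (no pixels are drawn)."""
    rng = rng or random.Random(0)
    layout = []
    last_freq = 1.0
    for word, freq in frequencies:
        if relative_scaling != 0:
            # Shrink relative to the previous word's frequency.
            font_size = int(round((relative_scaling * (freq / last_freq)
                                   + (1 - relative_scaling)) * font_size))
        vertical = rng.random() >= prefer_horizontal
        tried_other_orientation = False
        while font_size >= min_font_size and not fits(word, font_size, vertical):
            if not tried_other_orientation and prefer_horizontal < 1:
                # First fallback: try the other orientation.
                vertical = not vertical
                tried_other_orientation = True
            else:
                # Second fallback: make the font smaller.
                font_size -= font_step
                vertical = False
        if font_size < min_font_size:
            break  # no room left for this or any later word
        layout.append((word, freq, font_size, vertical))
        last_freq = freq
    return layout


freqs = normalize({"a": 4, "b": 2, "c": 1}, max_words=3)
layout = plan_layout(freqs, 100, fits=lambda w, s, v: True,
                     prefer_horizontal=1.0)
print(layout)
```

With a fits callback that always succeeds, the font sizes depend only on the ratio between consecutive frequencies, which is the effect of relative_scaling.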

The source code is as follows:

def generate_from_frequencies(self, frequencies, max_font_size=None):

    """Create a word_cloud from words and frequencies.

    Parameters

    ----------

    frequencies : dict from string to float

        A contains words and associated frequency.

    max_font_size : int

        Use this font-size instead of self.max_font_size

    Returns

    -------

    self

    """

    # make sure frequencies are sorted and normalized

    frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)

    if len(frequencies) <= 0:

        raise ValueError("We need at least 1 word to plot a word cloud, "

                         "got %d." % len(frequencies))

    frequencies = frequencies[:self.max_words]

    # largest entry will be 1

    max_frequency = float(frequencies[0][1])

    frequencies = [(word, freq / max_frequency)

                   for word, freq in frequencies]

    if self.random_state is not None:

        random_state = self.random_state

    else:

        random_state = Random()

    if self.mask is not None:

        mask = self.mask

        width = mask.shape[1]

        height = mask.shape[0]

        if mask.dtype.kind == 'f':

            warnings.warn("mask image should be unsigned byte between 0"

                          " and 255. Got a float array")

        if mask.ndim == 2:

            boolean_mask = mask == 255

        elif mask.ndim == 3:

            # if all channels are white, mask out

            boolean_mask = np.all(mask[:, :, :3] == 255, axis=-1)

        else:

            raise ValueError("Got mask of invalid shape: %s"

                             % str(mask.shape))

    else:

        boolean_mask = None

        height, width = self.height, self.width

    occupancy = IntegralOccupancyMap(height, width, boolean_mask)

    # create image

    img_grey = Image.new("L", (width, height))

    draw = ImageDraw.Draw(img_grey)

    img_array = np.asarray(img_grey)

    font_sizes, positions, orientations, colors = [], [], [], []

    last_freq = 1.

    if max_font_size is None:

        # if not provided use default font_size

        max_font_size = self.max_font_size

    if max_font_size is None:

        # figure out a good font size by trying to draw with

        # just the first two words

        if len(frequencies) == 1:

            # we only have one word. We make it big!

            font_size = self.height

        else:

            self.generate_from_frequencies(dict(frequencies[:2]),

                                           max_font_size=self.height)

            # find font sizes

            sizes = [x[1] for x in self.layout_]

            try:

                font_size = int(2 * sizes[0] * sizes[1]

                                / (sizes[0] + sizes[1]))

            # quick fix for if self.layout_ contains less than 2 values

            # on very small images it can be empty

            except IndexError:

                try:

                    font_size = sizes[0]

                except IndexError:

                    raise ValueError('canvas size is too small')

    else:

        font_size = max_font_size

    # we set self.words_ here because we called generate_from_frequencies

    # above... hurray for good design?

    self.words_ = dict(frequencies)

    # start drawing grey image

    for word, freq in frequencies:

        # select the font size

        rs = self.relative_scaling

        if rs != 0:

            font_size = int(round((rs * (freq / float(last_freq))

                                   + (1 - rs)) * font_size))

        if random_state.random() < self.prefer_horizontal:

            orientation = None

        else:

            orientation = Image.ROTATE_90

        tried_other_orientation = False

        while True:

            # try to find a position

            font = ImageFont.truetype(self.font_path, font_size)

            # transpose font optionally

            transposed_font = ImageFont.TransposedFont(

                font, orientation=orientation)

            # get size of resulting text

            box_size = draw.textsize(word, font=transposed_font)

            # find possible places using integral image:

            result = occupancy.sample_position(box_size[1] + self.margin,

                                               box_size[0] + self.margin,

                                               random_state)

            if result is not None or font_size < self.min_font_size:

                # either we found a place or font-size went too small

                break

            # if we didn't find a place, make font smaller

            # but first try to rotate!

            if not tried_other_orientation and self.prefer_horizontal < 1:

                orientation = (Image.ROTATE_90 if orientation is None else

                               Image.ROTATE_90)

                tried_other_orientation = True

            else:

                font_size -= self.font_step

                orientation = None

        if font_size < self.min_font_size:

            # we were unable to draw any more

            break

        x, y = np.array(result) + self.margin // 2

        # actually draw the text

        draw.text((y, x), word, fill="white", font=transposed_font)

        positions.append((x, y))

        orientations.append(orientation)

        font_sizes.append(font_size)

        colors.append(self.color_func(word, font_size=font_size,

                                      position=(x, y),

                                      orientation=orientation,

                                      random_state=random_state,

                                      font_path=self.font_path))

        # recompute integral image

        if self.mask is None:

            img_array = np.asarray(img_grey)

        else:

            img_array = np.asarray(img_grey) + boolean_mask

        # recompute bottom right

        # the order of the cumsum's is important for speed ?!

        occupancy.update(img_array, x, y)

        last_freq = freq

    self.layout_ = list(zip(frequencies, font_sizes, positions,

                            orientations, colors))

    return self


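4. Points to note when applying wordcloud to Chinese text

The wordcloud package was first released by Andreas Mueller as version 1.0.0 on 2015-03-20; at the time of writing, the latest version was 1.4.1, released on 2018-03-13.

English text can be fed to wordcloud directly, but wordcloud alone cannot produce a proper word cloud from Chinese text.

The reason:

English words are separated by spaces, and, as seen in process_text(text) above, the source simply applies a regular expression (r"\w[\w']+" by default):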

In  : re.findall(r"\w[\w']+", "It's Monday today.")

Out: ["It's", 'Monday', 'today']

Chinese, however, generally has no delimiter between words:

In : re.findall(r"\w[\w']+", "今天天气不错，蓝天白云，还有温暖的阳光 哈　哈哈")

Out: ['今天天气不错', '蓝天白云', '还有温暖的阳光', '哈哈']

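As this shows, wordcloud out of the box is designed for English: it strips punctuation (except the apostrophe ') and splits the text into tokens.

For Chinese text, segment the text into words first, join the words with spaces into a single string, and only then pass it to wordcloud.

Also note that, for both English and Chinese, single-character tokens are dropped by default (because regexp = self.regexp if self.regexp is not None else r"\w[\w']+" ). To keep single characters, set the regexp parameter to r"\w[\w']*".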
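The effect of pre-segmenting and of the two regular expressions can be checked with the standard library alone. In this small sketch the token list is a hypothetical segmenter output written by hand; in practice it would come from a Chinese word segmenter.

```python
import re

DEFAULT_REGEXP = r"\w[\w']+"      # wordcloud's default: tokens of 2+ characters
KEEP_SINGLE_REGEXP = r"\w[\w']*"  # variant that also keeps single characters

# Hypothetical output of a Chinese word segmenter for one sentence.
tokens = ["今天", "天气", "不错", "蓝天", "白云", "好"]

# Join the segmented words with spaces before handing them to the regex.
text = " ".join(tokens)

default_tokens = re.findall(DEFAULT_REGEXP, text)
all_tokens = re.findall(KEEP_SINGLE_REGEXP, text)

print(default_tokens)  # the single-character token "好" is dropped
print(all_tokens)      # every segmented token survives
```

Once the words are separated by spaces, the default expression recovers them as individual tokens, and only the second expression retains the single-character word.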

import logging

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud

def generate_wordcloud(text, max_words=200, pic_path=None):
    """
    Generate a word cloud.
    :param text: a string of tokens separated by spaces
    :param max_words: upper limit on the number of words
    :param pic_path: output image path; if omitted, the image is shown instead
    :return: dict mapping each word to its normalized frequency
    """
    # scipy.misc.imread has been removed from SciPy, so read the mask with Pillow
    mk = np.array(Image.open("tuoyuan.jpg"))
    wc = WordCloud(font_path="/usr/share/fonts/myfonts/msyh.ttf", background_color="white", max_words=max_words,
                   mask=mk, width=1000, height=500, max_font_size=100, prefer_horizontal=0.95, collocations=False)
    wc.generate(text=text)
    if pic_path:
        wc.to_file(pic_path)
    else:
        plt.imshow(wc)
        plt.axis("off")
        plt.show()
    return wc.words_

def run_wordcloud(corpus, max_words, pic_path=None):
    text = " ".join([" ".join(line) for line in corpus])   # join the segmented tokens with spaces
    word2weight = generate_wordcloud(text=text, max_words=max_words, pic_path=pic_path)
    word2weight_sorted = sorted(word2weight.items(), key=lambda x: x[1], reverse=True)
    logging.info([(k, float("%.5f" % v)) for k, v in word2weight_sorted])

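5. Rewriting the code

A word cloud is a way to see the key information in a corpus at a glance. In my own work the goal is to extract that key information, and how it is displayed matters much less.

So, having understood how the wordcloud source works, I decided to implement it myself.

On the one hand, this makes the implementation transparent and the results controllable, avoids a third-party library where the efficiency is comparable, and may even improve efficiency.

On the other hand, it allows problems to be handled more flexibly according to the actual situation.

Preprocessing for Chinese can be done together with word segmentation. The steps here are: segmentation and part-of-speech tagging, lowercasing, stopword removal, number removal, single-character removal, and keeping only specified parts of speech.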

import jieba
import jieba.posseg as pseg

class Utils(object):
    def __init__(self, utils_data=None):
        # utils_data: dict with optional keys "user_dict" and "stopwords"
        self.stopwords = self._init_utils(utils_data or {})
        self.pos_save = {
            "n", "an", "Ng", "nr", "ns", "nt", "nz", "vn", "un",  # nouns
            "v", "vg", "vd",  # verbs
            "a", "ag", "ad",  # adjectives
            # j abbreviation, l fixed expression, i idiom, z status word,
            # b distinguishing word, g morpheme, s place word, h prefix
            "j", "l", "i", "z", "b", "g", "s", "h",
            "zg", "eng",
            "x"}  # unknown (custom words)

    def _init_utils(self, utils_data):
        for wd in utils_data.get("user_dict", []):
            jieba.add_word(wd)
        return set(utils_data.get("stopwords", []))

    def _token_filter(self, token):  # drop stopwords, numbers, and single characters
        return token not in self.stopwords and not token.isdigit() and len(token) >= 2

    def _token_filter_with_flag(self, pair_word_flag):  # keep only the specified parts of speech
        return self._token_filter(pair_word_flag.word) and pair_word_flag.flag in self.pos_save

    def cut(self, text):
        return list(filter(self._token_filter, jieba.cut(text.lower())))  # segment; lowercase

    def cut_with_flag(self, text):
        pairs = filter(self._token_filter_with_flag, pseg.cut(text.lower()))  # segment and tag; lowercase
        return [p.word for p in pairs]

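After segmentation and the other preprocessing steps, all that remains is to count each word's occurrences. To keep the output intuitive, it reports raw counts rather than normalized frequencies. The resulting ranking is the same as wordcloud's.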

import logging
from collections import Counter

def word_count(corpus, n_gram=1, n=None):
    counter = Counter()
    if n_gram == 1:
        for line in corpus:
            counter.update(line)
    elif n_gram == 2:
        for line in corpus:
            # ordered bigrams of adjacent tokens
            counter.update(["%s_%s" % (line[idx], line[idx + 1]) for idx in range(len(line) - 1)])
    else:
        logging.error("Invalid value of param n_gram: %s (only 1 or 2 accepted)" % n_gram)
    return counter.most_common(n=n)
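To relate raw counts to the normalized frequencies that wordcloud stores in words_, the counts can be divided by the largest one. A minimal sketch follows; normalized_frequencies is a made-up helper name, and the corpus is a small hand-written example of already segmented sentences.

```python
from collections import Counter


def normalized_frequencies(corpus, max_words=None):
    """Count tokens, keep the top max_words, and scale the largest to 1.0."""
    counter = Counter()
    for line in corpus:
        counter.update(line)
    ranked = counter.most_common(max_words)
    if not ranked:
        return {}
    top = float(ranked[0][1])
    return {word: count / top for word, count in ranked}


# Each inner list is one sentence that has already been segmented.
corpus = [["天气", "不错"], ["天气", "晴朗"], ["天气", "不错", "出门"]]
freqs = normalized_frequencies(corpus)
print(freqs)
```

Dividing by a positive constant does not change the order, which is why ranking by raw count and ranking by normalized frequency agree.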

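It is also possible to count co-occurrences of high-frequency words, or to map high-frequency words and co-occurring pairs back to the sentences that contain them, which helps in generalizing from frequent words to frequent sentence types.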
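One possible way to do this is sketched below, assuming the corpus is a list of already segmented sentences. The function name cooccurrence and its top_n parameter are illustrative and not part of the code above.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(corpus, top_n=2):
    """Count sentence-level co-occurrence among the top_n most frequent words."""
    word_counts = Counter(w for line in corpus for w in line)
    top_words = {w for w, _ in word_counts.most_common(top_n)}

    pair_counts = Counter()         # (word_a, word_b) -> number of sentences
    word2lines = defaultdict(list)  # word -> indices of sentences containing it
    for idx, line in enumerate(corpus):
        # High-frequency words present in this sentence, in a stable order.
        present = sorted(top_words & set(line))
        for w in present:
            word2lines[w].append(idx)
        pair_counts.update(combinations(present, 2))
    return pair_counts, dict(word2lines)


corpus = [["天气", "不错", "出门"], ["天气", "不错"], ["天气", "下雨"], ["不错", "天气"]]
pair_counts, word2lines = cooccurrence(corpus, top_n=2)
print(pair_counts.most_common())
print(word2lines)
```

The word2lines mapping is the reverse index: given a frequent word, it returns the sentences to inspect, and the same idea extends to pairs by intersecting two index lists.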

References:

https://pypi.org/project/wordcloud/

https://github.com/amueller/word_cloud

http://python.jobbole.com/87496/

https://www.jianshu.com/p/ead991a08563

https://blog.csdn.net/qq_34739497/article/details/78285972

https://www.cnblogs.com/sunnyeveryday/p/7043399.html

https://www.cnblogs.com/naraka/p/8992058.html

https://www.cnblogs.com/franklv/p/6995150.html

https://blog.csdn.net/Tang_Chuanlin/article/details/79862505

https://www.cnblogs.com/zjutlitao/archive/2016/08/04/5734876.html
