- Environment: the WordCloud library as installed under Python 3.10.0
- Source of the initializer: the code of WordCloud's __init__ method
The signature of __init__ is as follows:
def __init__(self, font_path=None, width=400, height=200, margin=2,
             ranks_only=None, prefer_horizontal=.9, mask=None, scale=1,
             color_func=None, max_words=200, min_font_size=4,
             stopwords=None, random_state=None, background_color='black',
             max_font_size=None, font_step=1, mode="RGB",
             relative_scaling='auto', regexp=None, collocations=True,
             colormap=None, normalize_plurals=True, contour_width=0,
             contour_color='black', repeat=False,
             include_numbers=False, min_word_length=0, collocation_threshold=30):
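The parameters are:
- font_path: string. Path to the font to use (OTF or TTF). On Linux it defaults to DroidSansMono.ttf in the same directory as wordcloud.py. On another OS, or if that font is missing, point it at a font of your own.
- width: int, default 400. Width of the canvas.
- height: int, default 200. Height of the canvas.
- prefer_horizontal: float, default 0.9. The ratio of attempts to fit a word horizontally rather than vertically (so roughly 0.1 of the attempts are vertical).
- mask: nd-array or None, default None. If None, the cloud is drawn on a canvas of width x height. If not None, it is used as a binary mask: width and height are ignored and the shape of the mask is used instead. Pure white (#FF or #FFFFFF) areas are masked out and receive no words; all other areas are available for drawing. For example, with bg_pic = imread('your_image.png'), the background of the image must be pure white (#FFFFFF), and the cloud takes the shape of the non-white region. One way to prepare such an image is to paste the desired shape onto a pure white canvas in an image editor such as Photoshop and save it.
- contour_width: float, default 0. If mask is not None and contour_width > 0, the contour of the mask is drawn with this width.
- contour_color: color value, default 'black'. Color of the mask contour.
- scale: float, default 1. Scaling between computation and drawing; with 1.5, for example, the output is 1.5 times the canvas in both width and height. For large word-cloud images, using scale instead of a larger canvas is significantly faster, but may give a coarser fit for the words.
- min_font_size: int, default 4. Smallest font size to use. Layout stops when there is no more room at this size.
- font_step: int, default 1. Step size for the font. A step greater than 1 may speed up computation but give a worse fit.
- max_words: number, default 200. Maximum number of words to display.
- stopwords: set of strings or None. Words to exclude. If None, the built-in STOPWORDS list is used. Ignored when using generate_from_frequencies.
- background_color: color value, default "black". Background color of the image.
- max_font_size: int or None, default None. Maximum font size for the largest word. If None, the height of the image is used.
- mode: string, default "RGB". The background is transparent when mode is "RGBA" and background_color is None.
- relative_scaling: float, default 'auto'. How strongly relative word frequency affects font size. With relative_scaling=0, only word ranks are considered. With relative_scaling=1, a word that is twice as frequent is drawn twice as large. To take frequencies into account and not only ranks, a value around 0.5 often looks good. With 'auto', it is set to 0.5 unless repeat is true, in which case it is set to 0.
- color_func: callable, default None. Function that returns the color for each word. Overrides colormap.
- regexp: string or None (optional). Regular expression used to split the input text into tokens in process_text. If None, r"\w[\w']+" is used. Ignored when using generate_from_frequencies.
- collocations: bool, default True. Whether to include collocations (bigrams) of two words. If words appear repeated in the cloud, try setting this to False. Ignored when using generate_from_frequencies.
- colormap: string or matplotlib colormap, default "viridis". Matplotlib colormap from which a color is drawn at random for each word. Ignored if color_func is specified.
- normalize_plurals: bool, default True. Whether to remove the trailing 's' from words. If True and a word appears both with and without a trailing 's', the form with the trailing 's' is removed and its count is added to the form without it, unless the word ends with 'ss'. Ignored when using generate_from_frequencies.
- repeat: bool, default False. Whether to repeat words and phrases until max_words or min_font_size is reached.
- include_numbers: bool, default False. Whether to include numbers as phrases.
- min_word_length: int, default 0. Minimum number of letters a word must have to be included.
- collocation_threshold: int, default 30. Bigrams must have a Dunning likelihood collocation score greater than this value to be counted as bigrams. The default of 30 is arbitrary.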
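The mask and contour parameters described above can be sketched as follows. This is only an illustration under stated assumptions: circle_mask and draw_round_cloud are hypothetical helper names, the mask is built in code instead of being read from an image, and numpy and wordcloud are assumed to be installed (they are imported lazily so the mask helper runs without them).

```python
def circle_mask(size):
    """Build a size x size grid: 0 = drawable, 255 (white) = masked out."""
    center = (size - 1) / 2
    radius = size / 2
    return [
        [0 if (x - center) ** 2 + (y - center) ** 2 <= radius ** 2 else 255
         for x in range(size)]
        for y in range(size)
    ]


def draw_round_cloud(text, size=400):
    """Render `text` inside a circle; needs numpy and wordcloud installed."""
    import numpy as np                # third-party, imported lazily
    from wordcloud import WordCloud   # third-party, imported lazily

    mask = np.array(circle_mask(size), dtype=np.uint8)
    cloud = WordCloud(
        mask=mask,                 # width/height are ignored when a mask is given
        background_color="white",
        contour_width=2,           # > 0 draws the outline of the mask
        contour_color="steelblue",
        max_words=100,
        random_state=42,           # an int seed is wrapped in Random(seed)
    )
    return cloud.generate(text)
```

Calling draw_round_cloud(some_text).to_file("round.png") would then write the image; the file name here is just a placeholder.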
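Other methods defined on WordCloud alongside __init__:
- fit_words(frequencies): create a word cloud from word frequencies
- generate(text): create a word cloud from text
- generate_from_frequencies(frequencies[, …]): create a word cloud from word frequencies
- generate_from_text(text): create a word cloud from text
- process_text(text): split a long text into words and remove stopwords (this targets English; Chinese text has to be segmented first with another library, and the resulting frequencies passed to fit_words(frequencies) above)
- recolor([random_state, color_func, colormap]): recolor the existing output; this is much faster than regenerating the whole word cloud
- to_array(): convert to a numpy array
- to_file(filename): export to a file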
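As a rough illustration of the text-to-frequencies step, here is a simplified stand-in for what process_text does with regexp, stopwords, include_numbers and min_word_length. It is a sketch, not the library's implementation: it skips collocations and plural normalization, and count_tokens and save_cloud are hypothetical helper names. The wordcloud import is lazy so the counting part runs with the standard library alone.

```python
import re
from collections import Counter


def count_tokens(text, stopwords=frozenset(), regexp=r"\w[\w']+",
                 include_numbers=False, min_word_length=0):
    """Simplified stand-in for process_text: tokenize, filter, count."""
    counts = Counter()
    for token in re.findall(regexp, text):
        if token.lower() in stopwords:
            continue                      # stopwords are dropped
        if token.isdigit() and not include_numbers:
            continue                      # pure numbers are dropped by default
        if len(token) < min_word_length:
            continue                      # shorter than the minimum length
        counts[token] += 1
    return counts


def save_cloud(frequencies, filename):
    """Draw a cloud from a word -> count mapping and write it to a file."""
    from wordcloud import WordCloud   # third-party, imported lazily
    WordCloud().generate_from_frequencies(frequencies).to_file(filename)


counts = count_tokens("the cat and the hat; a cat in 2024",
                      stopwords={"the", "and", "in"})
print(counts.most_common())  # [('cat', 2), ('hat', 1)]
```

For Chinese text the same shape applies: segment the text with a separate library, build the word-to-count mapping yourself, and hand it to fit_words or generate_from_frequencies. A font_path pointing at a font with CJK glyphs would most likely be needed as well.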
For reference, the class docstring and the full __init__ function:
r"""Word cloud object for generating and drawing.

Parameters
----------
font_path : string
    Font path to the font that will be used (OTF or TTF).
    Defaults to DroidSansMono path on a Linux machine. If you are on
    another OS or don't have this font, you need to adjust this path.

width : int (default=400)
    Width of the canvas.

height : int (default=200)
    Height of the canvas.

prefer_horizontal : float (default=0.90)
    The ratio of times to try horizontal fitting as opposed to vertical.
    If prefer_horizontal < 1, the algorithm will try rotating the word
    if it doesn't fit. (There is currently no built-in way to get only
    vertical words.)

mask : nd-array or None (default=None)
    If not None, gives a binary mask on where to draw words. If mask is not
    None, width and height will be ignored and the shape of mask will be
    used instead. All white (#FF or #FFFFFF) entries will be considered
    "masked out" while other entries will be free to draw on. [This
    changed in the most recent version!]

contour_width: float (default=0)
    If mask is not None and contour_width > 0, draw the mask contour.

contour_color: color value (default="black")
    Mask contour color.

scale : float (default=1)
    Scaling between computation and drawing. For large word-cloud images,
    using scale instead of larger canvas size is significantly faster, but
    might lead to a coarser fit for the words.

min_font_size : int (default=4)
    Smallest font size to use. Will stop when there is no more room in this
    size.

font_step : int (default=1)
    Step size for the font. font_step > 1 might speed up computation but
    give a worse fit.

max_words : number (default=200)
    The maximum number of words.

stopwords : set of strings or None
    The words that will be eliminated. If None, the built-in STOPWORDS
    list will be used. Ignored if using generate_from_frequencies.

background_color : color value (default="black")
    Background color for the word cloud image.

max_font_size : int or None (default=None)
    Maximum font size for the largest word. If None, height of the image is
    used.

mode : string (default="RGB")
    Transparent background will be generated when mode is "RGBA" and
    background_color is None.

relative_scaling : float (default='auto')
    Importance of relative word frequencies for font-size. With
    relative_scaling=0, only word-ranks are considered. With
    relative_scaling=1, a word that is twice as frequent will have twice
    the size. If you want to consider the word frequencies and not only
    their rank, relative_scaling around .5 often looks good.
    If 'auto' it will be set to 0.5 unless repeat is true, in which
    case it will be set to 0.

    .. versionchanged: 2.0
        Default is now 'auto'.

color_func : callable, default=None
    Callable with parameters word, font_size, position, orientation,
    font_path, random_state that returns a PIL color for each word.
    Overwrites "colormap".
    See colormap for specifying a matplotlib colormap instead.
    To create a word cloud with a single color, use
    ``color_func=lambda *args, **kwargs: "white"``.
    The single color can also be specified using RGB code. For example
    ``color_func=lambda *args, **kwargs: (255,0,0)`` sets color to red.

regexp : string or None (optional)
    Regular expression to split the input text into tokens in process_text.
    If None is specified, ``r"\w[\w']+"`` is used. Ignored if using
    generate_from_frequencies.

collocations : bool, default=True
    Whether to include collocations (bigrams) of two words. Ignored if using
    generate_from_frequencies.

    .. versionadded: 2.0

colormap : string or matplotlib colormap, default="viridis"
    Matplotlib colormap to randomly draw colors from for each word.
    Ignored if "color_func" is specified.

    .. versionadded: 2.0

normalize_plurals : bool, default=True
    Whether to remove trailing 's' from words. If True and a word
    appears with and without a trailing 's', the one with trailing 's'
    is removed and its counts are added to the version without
    trailing 's' -- unless the word ends with 'ss'. Ignored if using
    generate_from_frequencies.

repeat : bool, default=False
    Whether to repeat words and phrases until max_words or min_font_size
    is reached.

include_numbers : bool, default=False
    Whether to include numbers as phrases or not.

min_word_length : int, default=0
    Minimum number of letters a word must have to be included.

collocation_threshold: int, default=30
    Bigrams must have a Dunning likelihood collocation score greater than this
    parameter to be counted as bigrams. Default of 30 is arbitrary.

    See Manning, C.D., Manning, C.D. and Schütze, H., 1999. Foundations of
    Statistical Natural Language Processing. MIT press, p. 162
    https://nlp.stanford.edu/fsnlp/promo/colloc.pdf#page=22

Attributes
----------
``words_`` : dict of string to float
    Word tokens with associated frequency.

    .. versionchanged: 2.0
        ``words_`` is now a dictionary

``layout_`` : list of tuples (string, int, (int, int), int, color))
    Encodes the fitted word cloud. Encodes for each word the string, font
    size, position, orientation and color.

Notes
-----
Larger canvases will make the code significantly slower. If you need a
large word cloud, try a lower canvas size, and set the scale parameter.

The algorithm might give more weight to the ranking of the words
than their actual frequencies, depending on the ``max_font_size`` and the
scaling heuristic.
"""
def __init__(self, font_path=None, width=400, height=200, margin=2,
             ranks_only=None, prefer_horizontal=.9, mask=None, scale=1,
             color_func=None, max_words=200, min_font_size=4,
             stopwords=None, random_state=None, background_color='black',
             max_font_size=None, font_step=1, mode="RGB",
             relative_scaling='auto', regexp=None, collocations=True,
             colormap=None, normalize_plurals=True, contour_width=0,
             contour_color='black', repeat=False,
             include_numbers=False, min_word_length=0, collocation_threshold=30):
    if font_path is None:
        font_path = FONT_PATH
    if color_func is None and colormap is None:
        version = matplotlib.__version__
        if version[0] < "2" and version[2] < "5":
            colormap = "hsv"
        else:
            colormap = "viridis"
    self.colormap = colormap
    self.collocations = collocations
    self.font_path = font_path
    self.width = width
    self.height = height
    self.margin = margin
    self.prefer_horizontal = prefer_horizontal
    self.mask = mask
    self.contour_color = contour_color
    self.contour_width = contour_width
    self.scale = scale
    self.color_func = color_func or colormap_color_func(colormap)
    self.max_words = max_words
    self.stopwords = stopwords if stopwords is not None else STOPWORDS
    self.min_font_size = min_font_size
    self.font_step = font_step
    self.regexp = regexp
    if isinstance(random_state, int):
        random_state = Random(random_state)
    self.random_state = random_state
    self.background_color = background_color
    self.max_font_size = max_font_size
    self.mode = mode
    if relative_scaling == "auto":
        if repeat:
            relative_scaling = 0
        else:
            relative_scaling = .5

    if relative_scaling < 0 or relative_scaling > 1:
        raise ValueError("relative_scaling needs to be "
                         "between 0 and 1, got %f." % relative_scaling)
    self.relative_scaling = relative_scaling
    if ranks_only is not None:
        warnings.warn("ranks_only is deprecated and will be removed as"
                      " it had no effect. Look into relative_scaling.",
                      DeprecationWarning)
    self.normalize_plurals = normalize_plurals
    self.repeat = repeat
    self.include_numbers = include_numbers
    self.min_word_length = min_word_length
    self.collocation_threshold = collocation_threshold
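
The docstring above lists the arguments a color_func receives (word, font_size, position, orientation, font_path, random_state). A minimal sketch of such a callable follows; grey_color_func is a hypothetical name, and the grey range of 60 to 100 percent lightness is an arbitrary choice made for this example.

```python
import random


def grey_color_func(word, font_size, position, orientation,
                    font_path=None, random_state=None, **kwargs):
    """Return a random light-grey HSL color string for each word."""
    # Use the Random instance passed in when there is one, so a seeded
    # random_state gives repeatable colors; fall back to the module otherwise.
    rng = random_state if random_state is not None else random
    lightness = rng.randint(60, 100)
    return f"hsl(0, 0%, {lightness}%)"
```

It could then be passed as WordCloud(color_func=grey_color_func), or applied to an already generated cloud with recolor(color_func=grey_color_func), which avoids laying the words out again.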



