Counting Word Frequencies with Python - 三九宝宝网


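import pandas as pd

# Sample data: each line is an id followed by up to three tab-separated words.
# The tab positions were lost in this copy of the page and are reconstructed here.
a = '''123456\t人性\t尴尬\t啊哈
147852\t哈哈\t不好看
123456\t啊哈
147852\t哈哈
147852\t嗯嗯\t二刷
147852\t略略略\t人性
123456\t尴尬
147963\t人性\t极端
123456\t啊哈
147963\t不好看'''

arr = [[x] for x in a.split('\n')]
df = pd.DataFrame(arr)
# Split each line on tabs into the columns id, a, b, c.
df1 = df[0].str.split('\t', expand=True)
df1.columns = 'id a b c'.split()
# Reshape to long form: one row per (id, column, word).
df2 = df1.set_index('id').stack().reset_index()
# Drop rows that repeat the same id, column and word.
df2.drop_duplicates(inplace=True)
# Count how often each word occurs.
print(df2[0].value_counts())

# The same steps for a file: source file txt.txt, output file txt_out.txt
import pandas as pd

df = pd.read_csv(r'd:/txt.txt', encoding='gbk', header=None)
df1 = df[0].str.split('\t', expand=True)
df1.columns = 'id a b c'.split()
df2 = df1.set_index('id').stack().reset_index()
df2.drop_duplicates(inplace=True)
df2[0].value_counts().to_csv(r'd:/txt_out.txt', header=False, sep='\t')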
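For comparison, here is a minimal sketch of a similar count without pandas. It assumes the intent is to count each word at most once per id, which is slightly stricter than the drop_duplicates call above (that call also distinguishes the column a word appears in). The function name count_words_once_per_id and the short sample list are illustrative, not part of the original answer.

```python
from collections import Counter


def count_words_once_per_id(lines):
    """Count words in tab-separated 'id<TAB>word...' lines, once per (id, word)."""
    seen = set()          # (id, word) pairs that were already counted
    counts = Counter()
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if not fields[0]:
            continue      # skip blank lines
        user_id, words = fields[0], fields[1:]
        for word in words:
            if word and (user_id, word) not in seen:
                seen.add((user_id, word))
                counts[word] += 1
    return counts


# Illustrative sample in the same shape as the data above.
sample = [
    "123456\t人性\t尴尬\t啊哈",
    "147852\t哈哈\t不好看",
    "123456\t啊哈",
    "147963\t人性\t极端",
]
counts = count_words_once_per_id(sample)
for word, n in counts.most_common():
    print('{}\t{}'.format(word, n))
```

A set of (id, word) pairs keeps the de-duplication rule explicit, so it is easy to change if a different rule is wanted.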

How to implement dual word-frequency counting for a short English text in Python

Simple version:

#!/usr/bin/env python3
import re
import jieba
from collections import Counter

fname = 'counttest.txt'
with open(fname) as f:
    s = f.read()
# An English word: letters, optionally one hyphen, then more letters.
pattern = re.compile(r'[a-zA-Z]+\-?[a-zA-Z]*')
english_words = Counter(pattern.findall(s))
# Everything that is not an English word is segmented with jieba.
other_words = Counter(jieba.cut(pattern.sub('', s)))
print('\nEnglish word counts:\n' + '-'*17)
print('\n'.join(['{}: {}'.format(i, j) for i, j in english_words.most_common()]))
print('\nChinese words and symbols:\n' + '-'*19)
print('\n'.join(['{}: {}'.format(i, j) for i, j in other_words.most_common()]))

Complex version:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function, division, unicode_literals
import sys, re, time, os, jieba
from collections import Counter
from datetime import datetime


class WordCounter(object):
    def __init__(self, from_file, to_file=None, coding=None, jieba_cut=None):
        '''Split from_file into as many roughly equal-sized segments as there
        are processes, read them and count word frequencies, then write the
        result to to_file. When to_file is None, print the result to the
        terminal instead. (The code shown here only contains the direct path
        that reads the whole file at once.)

        Args:
            from_file  file to read
            to_file    file to write the result to
            coding     file encoding; by default detected with chardet from
                       the first 10,000 bytes
            jieba_cut  whether to enable jieba segmentation, default None

        How to use:
            w = WordCounter('a.txt', 'b.txt')
            w.run()
        '''
        if not os.path.isfile(from_file):
            raise Exception('No such file: {}'.format(from_file))
        self.f1 = from_file
        self.filesize = os.path.getsize(from_file)
        self.f2 = to_file
        if coding is None:
            try:
                import chardet
            except ImportError:
                os.system('pip install chardet')
                print('-'*70)
                import chardet
            with open(from_file, 'rb') as f:
                coding = chardet.detect(f.read(10000))['encoding']
        self.coding = coding
        self._c = [Counter(), Counter()]
        self.jieba = False
        if jieba_cut is not None:
            self.jieba = True

    def run(self):
        start = time.time()
        if 1:
            self.count_direct(self.f1)
        if self.f2 not in ['None', 'Null', 'none', 'null', None]:
            with open(self.f2, 'wb') as f:
                f.write(self.result.encode(self.coding))
        else:
            print('\nEnglish words:\n' + '-'*15)
            print(self.result)
        cost = '{:.1f}'.format(time.time() - start)
        size = humansize(self.filesize)
        tip = '\nFile size: {}. Cost time: {} seconds'
        # print(tip.format(size, cost))
        self.cost = cost + 's'

    def count_direct(self, from_file):
        '''Read the whole file into memory and count word frequencies.'''
        with open(from_file, 'rb') as f:
            line = f.read()
        # Parse once, then update the English and the "other" counter.
        for counter, parsed in zip(self._c, self.parse(line)):
            counter.update(parsed)

    def parse(self, line):
        '''Parse the bytes read from the file.'''
        text = line.decode(self.coding)
        # A word may be split across two lines: remove the '-' at a line end.
        text = re.sub(r'\-\n', '', text)
        # Pattern that decides what counts as an English word.
        pattern = re.compile(r'[a-zA-Z]+\-?[a-zA-Z]*')
        english_words = pattern.findall(text)
        rest = pattern.sub('', text)
        # Without jieba, count the remaining characters one by one.
        ex = Counter(jieba.cut(rest)) if self.jieba else Counter(rest)
        return Counter(english_words), ex

    def flush(self):
        '''Clear the counting results.'''
        self._c = [Counter(), Counter()]

    @property
    def counter(self):
        '''Return the results as a list of Counter objects.'''
        return self._c

    @property
    def result(self):
        '''Return the results as a string, i.e. the content of the output file.'''
        ss = []
        for c in self._c:
            ss.append(['{}: {}'.format(i, j) for i, j in c.most_common()])
        tip = '\n\nChinese words and symbols:\n' + '-'*15 + '\n'
        return tip.join(['\n'.join(s) for s in ss])


def humansize(size):
    """Convert a file size to a string with a unit.

    >>> humansize(1024) == '1 KB'
    True
    >>> humansize(1000) == '1000 B'
    True
    >>> humansize(1024*1024) == '1 M'
    True
    >>> humansize(1024*1024*1024*2) == '2 G'
    True
    """
    units = ['B', 'KB', 'M', 'G', 'T']
    for unit in units:
        # The source is cut off after "if size"; the rest of this loop is
        # reconstructed from the doctests above and is an assumption.
        if size < 1024 or unit == units[-1]:
            return '{} {}'.format(size, unit)
        size //= 1024
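To make the tokenisation rule concrete, the following sketch applies the same two regular expressions used above (the end-of-line hyphen removal and the English-word pattern) to a short sample string. The helper name english_word_counts and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

# Same pattern as above: letters, at most one hyphen, then more letters.
WORD = re.compile(r'[a-zA-Z]+\-?[a-zA-Z]*')


def english_word_counts(text):
    """Join words hyphenated across a line break, then count English words."""
    text = re.sub(r'\-\n', '', text)
    return Counter(WORD.findall(text))


sample = "A well-known, state-of-the-art exam-\nple. Well-known!"
counts = english_word_counts(sample)
for word, n in counts.most_common():
    print('{}: {}'.format(word, n))
```

Two properties of the pattern show up here. Because it allows only one hyphen per match, "state-of-the-art" is counted as "state-of" and "the-art". Matching is also case-sensitive, so "well-known" and "Well-known" are separate entries; lower-casing the text first would merge them if that is preferred.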

Question: there is one part of this Python code for counting Chinese word frequencies that I don't understand

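First, one concept needs explaining: in GBK encoding, the "length" of one Chinese character is 2 (bytes). This applies to byte strings, as in Python 2; a Python 3 str is indexed by character instead.

str = '中国'  # GBK-encoded

To get the character '中' you need the slice str[0:2], not the index str[0]. Taking z4 as an example (judging from the results below, a pattern that matches four consecutive Chinese characters), the code behaves as follows:

x = '同舟共济与时俱进艰苦奋斗'
i += z4.findall(x)      # returns ['同舟共济', '与时俱进', '艰苦奋斗']
i += z4.findall(x[2:])  # returns ['舟共济与', '时俱进艰']
i += z4.findall(x[4:])  # returns ['共济与时', '俱进艰苦']
i += z4.findall(x[6:])  # returns ['济与时俱', '进艰苦奋']

The goal is to collect every run of four consecutive Chinese characters. Because findall returns non-overlapping matches, the search is repeated with the start shifted by one character (two bytes) at a time.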
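A rough Python 3 equivalent of the shifted searches is sketched below. In Python 3 a str is indexed by character, so the offsets advance by 1 instead of 2. The definition of z4 is an assumption (the original pattern is not shown in the question), and the function name all_four_char_runs is made up for this example.

```python
import re

# Assumed definition: four consecutive characters from the common CJK range.
z4 = re.compile(r'[\u4e00-\u9fff]{4}')


def all_four_char_runs(text):
    """Collect every 4-character Chinese substring, including overlapping ones."""
    found = []
    # findall returns non-overlapping matches, so repeat the search
    # with the start shifted by 0, 1, 2 and 3 characters.
    for offset in range(4):
        found += z4.findall(text[offset:])
    return found


x = '同舟共济与时俱进艰苦奋斗'
runs = all_four_char_runs(x)
print(runs)
```

For this 12-character string the four passes together yield all 12 - 4 + 1 = 9 windows, in the same order as the four findall calls in the answer.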
