What is a good way to find the 20 most frequent words in a passage of text?

wxlcanary 2009-08-27 11:34:12

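The requirements are roughly as follows:
1. Given a passage of English text, remove words such as a, the, and have, which should not be counted.
2. Count the frequency of every remaining word and display the 20 most frequent words together with their counts.
3. If two or more words have the same frequency, list them in alphabetical order.
4. It can be implemented in C# or JavaScript, and the result must be displayed on a web page.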

Example output:

public 10
air 7
method 7
root 7
meeting 6
tea 6
blue 5
sky 2
.....

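I haven't come up with a good way to do this. Could you take a look and tell me whether there is a simple, quick way to do the counting? Many thanks!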


13 replies


perhapstang 2010-05-21


Looking for the answer as well.

wxlcanary 2009-09-01


Many thanks, Flashcom. That really is a quick, easy-to-understand, and practical approach.

纠结的程序猿 2009-08-28


1. Download PilotEdit 2.8: http://topic.csdn.net/u/20090818/22/df665ee5-bd6f-4c6d-84b6-9831217e4e02.html
Copy the text into a new file. Suppose the contents are as follows.
Faster payments and more options increases your cash flow
Certification by RegNow ensures you receive proper credit for driving a sale
EPC product rankings show which products convert and pay
Robust reporting gives you every bit of data possible
Product Data Feed feature lets you house and display content from our catalog
NEW - RegNow Affiliate Site Builder Tool helps you quickly
build an affiliate site or landing page! Learn More, watch the demo video Demo Video
Click here for more info…

2. Use PilotEdit to replace every space with a carriage return/line feed. This gives a file like the one below (the new file is not shown in full).
Faster
payments
and
more
options
increases
your
cash
flow
......
Click
here
for
more
info…

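3. Click the sort button, then click "Find duplicate lines". This gives the result below (in the tool's output, 文件6 is the file name "File 6", the number in brackets is the line number, and 找到N次 means "found N times"; the entries with an empty string are blank lines):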
文件6[10]:
文件6[23]:
文件6[33]:
文件6[43]:
文件6[57]:
文件6[68]:
文件6[84]:
文件6[90]:
- - - - - - - - - - 文件6: 找到8次“”。 - - - - - - - - - -
文件6[13]: RegNow
文件6[60]: RegNow
- - - - - - - - - - 文件6: 找到2次“RegNow”。 - - - - - - - - - -
文件6[3]: and
文件6[31]: and
文件6[51]: and
- - - - - - - - - - 文件6: 找到3次“and”。 - - - - - - - - - -
文件6[19]: for
文件6[87]: for
- - - - - - - - - - 文件6: 找到2次“for”。 - - - - - - - - - -
文件6[4]: more
文件6[88]: more
- - - - - - - - - - 文件6: 找到2次“more”。 - - - - - - - - - -
文件6[15]: you
文件6[37]: you
文件6[49]: you
文件6[66]: you
- - - - - - - - - - 文件6: 找到4次“you”。 - - - - - - - - - -

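4. Now work on the result of step 3.
Click the sort button, choose "Compare strings defined by a regular expression", and enter the following regular expression and target string:
Regular expression: - - - - - - - - - - * 找到*次
Target string: %04
Click "Sort in descending order". This gives the following result: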
- - - - - - - - - - 文件6: 找到8次“”。 - - - - - - - - - -
- - - - - - - - - - 文件6: 找到4次“you”。 - - - - - - - - - -
- - - - - - - - - - 文件6: 找到3次“and”。 - - - - - - - - - -
- - - - - - - - - - 文件6: 找到2次“RegNow”。 - - - - - - - - - -
- - - - - - - - - - 文件6: 找到2次“for”。 - - - - - - - - - -
- - - - - - - - - - 文件6: 找到2次“more”。 - - - - - - - - - -
文件6[10]:
文件6[23]:
文件6[33]:
文件6[43]:
文件6[57]:
文件6[68]:
文件6[84]:
文件6[90]:
文件6[13]: RegNow
文件6[60]: RegNow
文件6[3]: and
文件6[31]: and
文件6[51]: and
文件6[19]: for
文件6[87]: for
文件6[4]: more
文件6[88]: more
文件6[15]: you
文件6[37]: you
文件6[49]: you
文件6[66]: you
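
The manual steps above (one word per line, sort, group the duplicate lines, order the groups by count) can also be sketched in code. The following is only an illustration in JavaScript, not something PilotEdit produces; the function name countBySorting is made up for this sketch, and blank entries are dropped so that the empty "word" seen in step 3 does not show up.

```javascript
// A rough mirror of the manual editor steps, assuming words are separated by whitespace.
function countBySorting(text) {
  // Step 2: split on whitespace; dropping empty strings avoids counting blank lines.
  const words = text.split(/\s+/).filter((w) => w.length > 0);

  // Step 3: sort so that identical words end up next to each other,
  // then collapse each run of identical words into one group.
  words.sort();
  const groups = [];
  for (const w of words) {
    const last = groups[groups.length - 1];
    if (last && last.word === w) {
      last.count += 1;
    } else {
      groups.push({ word: w, count: 1 });
    }
  }

  // Step 4: order the groups by descending count.
  groups.sort((a, b) => b.count - a.count);
  return groups;
}
```

Punctuation stays attached to the words here (for example "info…"), just as it does in the editor-based procedure.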

深海之蓝 2009-08-27


Knapsack algorithm.

小case 2009-08-27


If you want to take this seriously, find a C-based data structures textbook and review it. This problem is very simple.

zhouyanfss 2009-08-27


A naive approach:



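// string content = "article text";
// Requires: using System; using System.Collections.Generic;
string[] a = content.Split(new char[] { '\n', ' ', ',', ';', '.' }, StringSplitOptions.RemoveEmptyEntries);
Dictionary<string, int> counts = new Dictionary<string, int>();
foreach (string tmpStr in a)
{
    // Counting step: n is 0 if the word has not been seen yet
    int n;
    counts.TryGetValue(tmpStr, out n);
    counts[tmpStr] = n + 1;
}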
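
The splitting step is also where the words that should not be counted can be dropped. Here is a minimal sketch in JavaScript, which the question allows; the name tokenize and the STOP_WORDS set are assumptions made for illustration, and the set holds only the three examples given in the question, so a real list would be longer.

```javascript
// Hypothetical stop-word list: only the examples named in the question.
const STOP_WORDS = new Set(["a", "the", "have"]);

// Lower-case the text, pull out runs of letters (allowing inner apostrophes,
// as in "don't"), and drop the stop words.
function tokenize(text, stopWords = STOP_WORDS) {
  const matches = text.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) || [];
  return matches.filter((w) => !stopWords.has(w));
}
```

Matching the words with a regular expression, rather than listing every separator character, means unexpected punctuation does not end up glued to a word. Lower-casing is a design choice made here so that "The" and "the" count as the same word.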

BitCoffee 2009-08-27




        // For reference
        private void button1_Click(object sender, EventArgs e)
        {
            List<KeyValuePair<string, int>> L = cutWord(this.richTextBox1.Text);
            this.richTextBox2.Text = "";
            int count = 0;
            // L is sorted in ascending order, so walk it from the end.
            // i >= 0 so that the entry at index 0 is not skipped.
            for (int i = L.Count - 1; i >= 0; i--)
            {
                if (count > 15)   // stop after 16 entries
                {
                    break;
                }
                count++;
                this.richTextBox2.Text += L[i].Key + "[" + L[i].Value + "]\n";
            }
        }



        // Note: this counts every run of 2 to 4 Chinese characters (\u4e00-\u9fa5),
        // not English words; for English text the tokenizing step would have to change.
        private List<KeyValuePair<string, int>> cutWord(string article)
        {
            Dictionary<string, int> D = new Dictionary<string, int>();
            //if len(escape(x)) /len(x)=6 then isGB=true else isGB=false
            //HttpUtility..::

            // Matches any run of characters outside the Chinese character range
            System.Text.RegularExpressions.Regex Re = new System.Text.RegularExpressions.Regex(@"[^\u4e00-\u9fa5]+");
            for (int l = 2; l <= 4; l++)
            {
                // <= so that the substring ending at the last character is included
                for (int i = 0; i <= article.Length - l; i++)
                {
                    string theWord = article.Substring(i, l);
                    // Keep only substrings made up entirely of Chinese characters
                    if (Re.Replace(theWord, "") == theWord)
                    {
                        if (D.ContainsKey(theWord))
                        {
                            D[theWord]++;
                        }
                        else
                        {
                            D.Add(theWord, 1);
                        }
                    }
                }
            }

            // Keep only the substrings that occur more than once
            List<KeyValuePair<string, int>> L = new List<KeyValuePair<string, int>>();
            foreach (KeyValuePair<string, int> K in D)
            {
                if (K.Value > 1)
                {
                    L.Add(K);
                }
            }

            // Sort ascending by count; for equal counts, shorter keys come first
            L.Sort(delegate(KeyValuePair<string, int> a, KeyValuePair<string, int> b)
            {
                int byCount = a.Value.CompareTo(b.Value);
                if (byCount != 0)
                {
                    return byCount;
                }
                return a.Key.Length.CompareTo(b.Key.Length);
            });
            return L;
        }

robin521 2009-08-27


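Read the characters into a two-dimensional array and compare the entries; give every entry that has duplicates a tag and a count, then sort with bubble sort.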

showjim 2009-08-27


Count with a hash table, then sort, capping the sort at 20 elements.
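
One way to read this suggestion, sketched in JavaScript: count with a Map, then keep a small sorted list that never grows past 20 entries instead of sorting every distinct word. The names topK and ranksBefore are made up for this sketch, and the tie-break by word follows requirement 3 of the question.

```javascript
// Count the words with a hash map, then select the k best entries
// by insertion into a list that is kept sorted and capped at k items.
function topK(words, k = 20) {
  const counts = new Map();
  for (const w of words) {
    counts.set(w, (counts.get(w) || 0) + 1);
  }

  // Higher count ranks first; equal counts fall back to the word itself
  // (plain string comparison, i.e. UTF-16 code unit order).
  const ranksBefore = (a, b) => (a[1] !== b[1] ? a[1] > b[1] : a[0] < b[0]);

  const best = []; // entries are [word, count] pairs
  for (const entry of counts) {
    let i = best.length;
    while (i > 0 && ranksBefore(entry, best[i - 1])) {
      i--;
    }
    if (i < k) {
      best.splice(i, 0, entry);
      if (best.length > k) {
        best.pop(); // drop the entry that fell out of the top k
      }
    }
  }
  return best;
}
```

With the cap at 20, each distinct word costs at most 20 comparisons, so the selection stays roughly linear in the number of distinct words.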

Flashcom 2009-08-27


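Use a Dictionary<string,int>.
To get the words, split the text on spaces, or on every separator character that occurs in it. Then create the dictionary
Dictionary<string,int> list = new Dictionary<string,int>();
and two arrays
string[] words=null;
int[] count=null;
Loop over all the words, using each word as the key in list and its number of occurrences as the value.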



For each word:
if (list.ContainsKey(word))
{
    list[word]++;
}
else
{
    list[word] = 1;
}







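words = new string[list.Count];
count = new int[list.Count];
// CopyTo needs the start index in the destination array
list.Keys.CopyTo(words, 0);
list.Values.CopyTo(count, 0);
// Sort both arrays together, ascending by count
Array.Sort(count, words);
// Reverse both so that the most frequent words come first
Array.Reverse(count);
Array.Reverse(words);
SortWords();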

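This gives a first ordering of the words by frequency. To put words with the same count in alphabetical order, though, you need to write one more recursive function to adjust the result.

private void SortWords()
{
    string temp = "";
    for (int i = 1; i < words.Length; i++)
    {
        if (count[i - 1] == count[i])   // same frequency
        {
            if (String.Compare(words[i - 1], words[i]) > 0)    // compare the two words
            {
                // Swap the two words; the counts are equal, so the count array needs no swap
                temp = words[i - 1];
                words[i - 1] = words[i];
                words[i] = temp;
                SortWords();    // recurse until no more words need swapping
                break;          // the recursive call has done the rest, so leave the loop
            }
        }
    }
}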
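
As an alternative to sorting by count first and then fixing up ties in a second pass, both rules can go into a single comparison. The following is a sketch in JavaScript rather than a rewrite of the code above; rankWords and formatTop are names invented for the illustration, and the input is assumed to be a Map from word to count.

```javascript
// Order [word, count] pairs by descending count, then alphabetically by word.
function rankWords(counts) {
  return [...counts.entries()].sort(
    ([wa, ca], [wb, cb]) => cb - ca || (wa < wb ? -1 : wa > wb ? 1 : 0)
  );
}

// Produce one "word count" line per entry, limited to the first `limit` entries.
function formatTop(counts, limit = 20) {
  return rankWords(counts)
    .slice(0, limit)
    .map(([word, count]) => `${word} ${count}`)
    .join("\n");
}
```

On a web page, the resulting string could be shown by assigning it to the textContent of an element, for example a pre element with a hypothetical id such as "result".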