Harry Potter in One Evening... Sadly, It Didn't Work

Nov 3, 2007   //   by Yue Yu   //   Blogs, Programming  //  13 Comments

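I'm in no mood to grind through homework tonight (we actually have an extra class tomorrow, on a Saturday; our teacher is a sadist. Nobody is coming over for dinner this weekend either, so I've lost my motivation to cook). Digging through the old files on my computer, I found the complete Harry Potter series. I remember downloading the e-books because a certain classmate was crazy about Harry Potter, but I really can't work up any interest in reading children's books. So what to do... So that I can join in the next time the kids chat about Harry, I've decided to spend one evening reading all of Harry Potter.

Of course, as a bachelor of mathematics, a future master of statistics and a potential PhD in statistics, I'm not going to read it word by word. To get the gist of little Harry as fast as possible, I wrote a few lines of code to find which words and short phrases show up most often in the book, and to work out its main theme from those.

I'll use the first book, Harry Potter and the Sorcerer’s Stone, as the example.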

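First, the word count: 78546 words in total, far fewer than I expected. In Chinese, 100,000 characters doesn't amount to much, and this thick book comes in under 80,000 words. English-language authors must have a hard time earning their fees.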
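The post doesn't include its code or say what language it was written in. As a minimal sketch only, assuming Python and a plain-text copy of the book, a word count like the one above might look like this. The `tokenize` helper and its rule for what counts as a word are assumptions for illustration, and the exact total depends on choices like these.

```python
import re

def tokenize(text):
    """Split text into lowercase words.

    Assumption: a word is a run of letters, optionally with internal
    apostrophes (straight or curly), so "didn't" stays one word.
    """
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())

# A short made-up sample stands in for the full text of the book.
sample = "Harry didn't know. Harry shook his head!"
words = tokenize(sample)
print(len(words))  # 7
```

For the real book, `sample` would be replaced by the contents of the text file.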

Next, let's count which words appear most often. First place goes to... "the" -_-||, with 3627 occurrences. Here is the top 20:

Word    Count
the     3627
and     1919
to      1856
a       1688
he      1528
of      1259
harry   1214
was     1186
it      1022
in       965
his      937
you      862
said     793
had      702
i        650
on       635
at       625
that     601
they     597
as       525

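I looked through the top 100 words, and every one is something a primary school pupil would know. Of course, maybe English is just like that, and this doesn't show that a primary school pupil could read Harry Potter; I'd need to run the same comparison on other books when I have time before drawing any conclusion. Still, the book uses only 5658 distinct words, which seems low. As I recall, the CET-6 vocabulary is about 8k words; how big is the GRE vocabulary? Moreover, only 983 words appear 10 or more times, and together they account for 63428 occurrences, about 81% of the 78546 total. At a rough glance they are words like young, ugly and main. In other words, with a vocabulary of just over 1k words you can read 80% of the text. Easy, right?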
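The frequency and coverage figures above can be sketched with a plain counter. This is an illustration under assumptions, not the code behind the post: the function name `vocabulary_stats` is made up here, and "10 or more times" is read as a count of at least 10.

```python
from collections import Counter

def vocabulary_stats(words, min_count=10):
    """Summarise a word list: size, vocabulary, and coverage by frequent words."""
    counts = Counter(words)
    # Occurrence counts of the words that appear at least min_count times.
    frequent = [c for c in counts.values() if c >= min_count]
    return {
        "total_words": len(words),
        "distinct_words": len(counts),
        "frequent_words": len(frequent),
        "coverage": sum(frequent) / len(words) if words else 0.0,
    }

# Toy input; the real input would be the tokenized book.
toy = "the boy the owl the cat a boy a hat".split()
print(Counter(toy).most_common(2))         # the most frequent words
print(vocabulary_stats(toy, min_count=2))

# The same ratio using the figures reported in the post.
post_coverage = 63428 / 78546
print(round(post_coverage, 2))  # 0.81
```

Note that coverage of running words is not the same as comprehension: the remaining fifth of the text is where the rarer, more informative words sit.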

Next, two-word phrases. Here is the top 10:

Phrase    Count
of the    284
in the    262
on the    207
to the    170
out of    142
at the    131
he was    125
It was    115
to be     109
he said   105

Well... that tells us absolutely nothing... Let's move on to four-word phrases. Here is the top 20:

Phrase                       Count
in front of the              11
out of the way               11
the rest of the              11
the end of the               10
how to get past               9
the back of the               9
the three of them             9
he was going to               8
for the first time            7
turned out to be              7
up and down the               7
as though he was              6
at the end of                 6
he said in a                  6
out of the window             6
seven hundred and thirteen    6
the back of his               6
up in the air                 6
Harry shook his head          5
in front of him               5

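Four-word phrases repeat far less often, but we can see that 11 times there is something "in front of the" something... Note that "the three of them" appears 9 times, so the trio evidently does a lot together in this book. Then there are 7 instances of "for the first time"; no idea whose first times those are... "out of the window" shows up 6 times; this is a magical world, after all, so flying out of windows is perfectly normal. "seven hundred and thirteen" shows up 6 times; is the number 713 really that important? Someone who has read the book, please tell me what it is. A password? A birthday? And our Harry "shook his head" 5 times, and 5 times there was something "in front of him". Then again, that is only 5, far fewer than the eighty-one tribulations in Journey to the West.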
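Phrase counts like the ones above are usually obtained by sliding a window of n words across the text. The following is a minimal sketch of that idea, assuming the text has already been split into a list of words; `ngram_counts` is a name chosen here for illustration, and the window deliberately ignores sentence boundaries, which may or may not match what the post's code did.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count every run of n consecutive words in the list."""
    # zip over n shifted copies of the list yields each n-word window once.
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)

# Toy input; the real input would be the tokenized book.
words = "in front of the door in front of the fire".split()
print(ngram_counts(words, 4).most_common(1))  # [('in front of the', 2)]
```

One side effect of a sliding window is that a single repeated passage of m words contributes m - n + 1 overlapping n-word phrases, each with the same count, so long-phrase tables tend to contain shifted copies of one passage.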

On to seven-word phrases:

Phrase                                              Count
for the first time in his life                      3
a few words of what they were                       2
and his work on alchemy with his                    2
be mad ter try an rob it                            2
blood and his work on alchemy with                  2
caught a few words of what they                     2
death is but the next great adventure               2
Dumbledore is particularly famous for his defeat    2
Dunno if he had enough human left                   2
enough human left in him to die                     2
famous for his defeat of the dark                   2
few words of what they were saying                  2
for his defeat of the dark wizard                   2
for the discovery of the twelve uses                2
had enough human left in him to                     2
have a clue what was going on                       2
he caught a few words of what                       2
he had enough human left in him                     2
He was in a very good mood                          2
his work on alchemy with his partner                2

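Almost nothing repeats now. First place this time is "for the first time in his life", with 3 occurrences. Let me make a bold guess: those three are his first meeting, first kiss and first night. Of course, I do know that Harry's first kiss isn't taken in book one; the above is pure fantasy. "his work on alchemy with" shows up in quite a few rows of the table, so here is a bold guess at the plot: someone does his work on alchemy with his partner, to defeat the dark wizard. And our little Harry runs into unprecedented difficulties three times... But that's as far as it goes; having got this far, I find I still haven't grasped the main idea of the book.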

So I made one last attempt, with 15-word sequences:

Phrase                                                                                   Count
was too high to make out and a magnificent marble staircase facing them led to           1
swished and flicked but the feather they were supposed to be sending skyward just lay    1
the edge of a huge chessboard behind the black chessmen which were all taller than       1
house wandering around and thinking about the end of the holidays where he could see     1
that were floating in midair over four long tables where the rest of the students        1
all about the four balls and the positions of the seven players describing famous games  1
When they say every flavor they mean every flavor you know you get all                   1
I was down in the village havin a few drinks an got into a game                          1
Dumbledore when we met him in the entrance hall he already knew he                       1
it grew wider and wider a second later they were facing an archway                       1

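Nothing repeats at all now... and the fragments are too scattered to piece the main idea together, so today's attempt ends in failure... Of course, there are improvements I could still make: dropping words like "a" and "the" when searching for two-word phrases, or pulling out only the first sentence of each paragraph, or going off to read a text mining book to see what good algorithms exist. But it's too late tonight, so that's all for now...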
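The first two improvements mentioned above are easy to sketch. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, both function names are hypothetical, and the sentence splitter is naive (it would cut "Mr. Dursley" after "Mr.").

```python
from collections import Counter
import re

# Assumption: a tiny illustrative stop-word list; a real one would be far longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "at", "and", "was", "it", "he"}

def content_bigrams(words, stop_words=STOP_WORDS):
    """Count adjacent word pairs in which neither word is a stop word."""
    pairs = zip(words, words[1:])
    return Counter(
        f"{a} {b}" for a, b in pairs
        if a not in stop_words and b not in stop_words
    )

def first_sentences(text):
    """Return the first sentence of each blank-line-separated paragraph."""
    result = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # collapse line breaks and extra spaces
        if not para:
            continue
        # Naive rule: a sentence ends at the first . ! or ? followed by
        # whitespace or the end of the paragraph.
        match = re.match(r".+?[.!?](?=\s|$)", para)
        result.append(match.group(0) if match else para)
    return result

words = "harry shook his head and harry shook his hand".split()
print(content_bigrams(words).most_common(2))

text = "Harry woke up. It was dark.\n\nRon laughed! Then he left."
print(first_sentences(text))  # ['Harry woke up.', 'Ron laughed!']
```

Filtering stop words pushes pairs such as "of the" out of the ranking, but whether what remains actually conveys the plot is a separate question.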

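P.S. I had originally written HTML to make tables and draw charts, but Space said the post was too long and wouldn't let me publish it... so I had to strip it down to text only.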

13 Comments

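  • You really have too much free time...
    These simple words are just the basic building blocks of sentences; the key words that actually affect comprehension may not repeat very often, e.g. YY is a very lascivious person. Your algorithm still needs work, and your PhD index needs raising.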

  • oh, is he?

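  • I'm ashamed to say I still find the French edition of Harry Potter hard going...

  • Rowling can now die content

  • to iza: you can look for the answer in his warm bed

  • In fiction, a key event is usually developed through concrete description, so the keywords won't show up at high frequency. If you want to finish in one evening, I think you'd be better off just reading a plot summary someone else has written; that would be more effective.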

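  • Would some CS or applied math PhD please teach me how to do text mining~~~ I'll keep improving this next week if I have time

  • Asking a machine to understand English is a fool's errand; it can't be done...

  • The machine learning PhD has arrived... Dr. YY just got looked down on

  • It's still easier than understanding Chinese. Captain Qi, do you have any text mining books or papers to recommend? Something you wrote yourself would be fine too. To 11 below: I'm no Dr.; I'm not even a master yet. iza, are you flying over for Thanksgiving? You could come shopping for cosmetics with me

  • This thread has turned into a PhD academic seminar

  • Your dad has long been asking me to read your blog posts aloud to him...
    Last time I read him the one about you cooking dinner for guests, and he concluded that you are extravagant and wasteful... so I haven't dared read him anything since...
    I hope that, without blocking me, you'll write a few things I can actually report back... thanks – –

  • To the commenter below... you can just find some jokes online to read to my dad... If you leak anything from my Space again I'll block you, and I won't bring you back a present either, hmph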
