Imong's E-rater Investigation Report for GMAT, GRE and TOEFL
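The first question has to be this: does the e-rater actually exist? The answer is yes. According to the literature I surveyed, ETS was already working on what it calls Automated Essay Grading in 1999, and by 2001 it was already in use in the GMAT exam. In addition, information gathered online shows that Taide Group (泰德集团), which partners with ETS in China, has brought in this electronic scoring system as an “online TOEFL essay evaluation system”, a for-profit web service for candidates preparing for the TOEFL essay. This can be confirmed at: http://www.englishtide.com/ets.asp (As for essay scoring in the actual TOEFL exam, I did not specifically collect the relevant literature, so I cannot say for certain.)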
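So what about the GRE? Based on the information published on the official ETS websites as of today, the conclusion is that GRE essays are still scored by humans. I reached this conclusion by comparing the wording of the official essay scoring descriptions on gre.org and mba.com. For the GMAT, mba.com says: “Each of your essays in the AWA section will be given two, independent ratings, one of which may be performed by E-rater®.” For the GRE, gre.org says: “Each essay receives a score from two trained readers, using a 6-point holistic scale.” The GRE side never mentions the e-rater at all. Hence my conclusion that GRE essays are currently still scored by humans.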
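The two paragraphs above cover what people probably care about most. About the e-rater itself, two points are still worth mentioning. First, ETS's E-rater has been continually refined since its debut, and the E-rater in use today improves on the original design. An obvious example: ETS's Automated Essay Grading offering now includes, besides the original E-rater, a system called Criterion, which can offer analytical feedback rather than just a score. The Score it now service that ETS currently provides for GRE essays is built on Criterion, and Taide offers both E-rater and Criterion online. Second, ETS is not the only organization working on Automated Essay Grading. Several literature reviews published in 2001 describe four Automated Essay Grading systems, of which ETS's E-rater is only one.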
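What nags at people most about the E-rater is doubt about whether the system is valid. The most extreme example is probably that some candidates have written sentences such as “I don’t want to be graded by a robot” in their GMAT essays.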
In 1999, Business Week interviewed Fred McHale, then GMAC Vice-President for Assessment & Research, and this very question came up: (http://businessweek.com/bschools/originals/bs90329.htm)
Q: What has been the biggest challenge surrounding the E-Rater since you've implemented it? Have you encountered a lot of skepticism? Are folks scratching their heads wondering how this electronic assessment software actually works, wondering if the results have any validity?
A: There has been a lot of skepticism, and it was expected. People tend to think that E-Rater is just your average grammar-checker on your word processor. But that's just not the case. All we can do is show the results.
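Indeed, all they can do is show the results. The literature available on the official ETS website shows that agreement between E-rater scores and human reader scores has always been a focus for the development team. At least three papers describe the following experiment: for a given prompt, take n essays at each human-assigned score of 1, 2, 3, 4, 5 and 6, have the E-rater score them, and study how well the scores agree. The results show that Exact Agreement and Adjacent Agreement between the E-rater and human readers account for the overwhelming majority of cases, while Disagreement is a small minority. According to the published results, agreement between the E-rater and human readers exceeds 80% at every score level and averages about 90%. Moreover, two human readers also differ from each other to some extent (there are experimental records of this too). Comparing that human-to-human difference with the difference between the E-rater and human readers, the conclusion is that the validity of E-rater scoring (the Automated Essay Score Validity that keeps appearing in the literature) can be fully assured.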
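To make the agreement statistics concrete, here is a minimal sketch of how such rates could be computed. It assumes the usual convention that exact agreement means identical scores and adjacent agreement means scores that differ by one point; the function name and the sample scores are made up for illustration and are not data from any ETS study.

```python
def agreement_rates(machine_scores, human_scores):
    """Return (exact, adjacent, disagreement) rates for paired scores.

    Assumed convention: exact = same score, adjacent = scores differ
    by exactly one point, disagreement = everything else.
    """
    if len(machine_scores) != len(human_scores) or not machine_scores:
        raise ValueError("need two non-empty score lists of equal length")
    exact = adjacent = 0
    for m, h in zip(machine_scores, human_scores):
        diff = abs(m - h)
        if diff == 0:
            exact += 1
        elif diff == 1:
            adjacent += 1
    n = len(machine_scores)
    return exact / n, adjacent / n, (n - exact - adjacent) / n


# Hypothetical scores on the 6-point scale, for illustration only.
human = [1, 2, 3, 4, 5, 6, 4, 3]
machine = [1, 3, 3, 4, 4, 6, 2, 3]
exact, adjacent, disagree = agreement_rates(machine, human)
print(exact, adjacent, disagree)  # 0.625 0.25 0.125
```

The same function can be applied to two human readers' scores, which is the comparison the validity argument above rests on.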
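Although we cannot know the details of how the E-rater evaluates and recognizes text (naturally, since this is essentially a trade secret), the literature currently available does give some hints. Take the sentence “I also assume that shrinking high school enrollment…”: at a minimum, one can extract that “also” signals a parallel argument, “that” signals a claim, and the content of the sentence involves assume, shrink, high school, enrollment… In other words, the way the E-rater works goes far beyond simply counting words or tallying word frequencies.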
Q: Differentiate between triggers and stored procedures.
A: Triggers are programs embedded within a table that are automatically invoked by updates to another table. Stored procedures are programs embedded within a table that can be called from an application program.
Syntactic Variety: …can be called from a program
…that a program can call
Synonymy: …can be invoked from a program…
Negation: …are NOT invoked by updates…
Anaphoric Reference: TRIGGERS are programs. THEY are embedded…
E-rater focuses on three general classes of essay features: discourse, indicated by various rhetorical features that are expected to occur throughout an essay; syntactic, indicated by the structure of sentences; and content, indicated by prompt-specific vocabulary expected to be present in the essay. A total of 59 features are “extractable,” but in practice usually only the most predictive features, as measured by their regression weights, are retained and used for further scoring.
The 59 features mentioned above cover a lot of ground. For syntactic variety, for example, the literature lists the following (the list is of course incomplete): number of complement clauses, subordinate clauses, infinitive clauses and relative clauses, occurrences of subjunctive modal auxiliary verbs such as would, could, should, might and may. For argument structure, the E-rater focuses on identifying parallelism, contrast, evidence, argument development and other coherence relations. As for discourse, the following passage from the literature is very illuminating:
Literature in the field of discourse analysis points out that rhetorical relations can often be identified by the occurrence of cue words and specific syntactic structures (Cohen 1984, Mann and Thompson 1988, Hovy, et al. 1992, Hirschberg and Litman 1993, Van der Linden and Martin 1995, Knott 1996). E-rater follows this approach by identifying and quantifying an essay’s use of cue words and other rhetorical structure features. For example, we adapted the conceptual framework of conjunctive relations from Quirk, et al. (1985) in which phrases such as “In summary” and “In conclusion,” are classified as conjuncts used for summarizing. E-rater identifies these phrases and others as cues for a Summary relation. Words such as “perhaps” and “possibly” are considered to be cues for a Belief relation, one used by the writer to express a belief while developing an argument in the essay. Words like “this” and “these” are often used within certain syntactic structures to indicate that the writer has not changed topics (Sidner 1986). In certain discourse contexts, structures such as infinitive clauses mark the beginning of a new argument.
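As a rough illustration of the cue-word idea in the passage above, the sketch below counts occurrences of the quoted cue phrases per relation. This is only a toy under stated assumptions: the `CUES` dictionary and `count_cues` function are hypothetical names, the lists contain only the phrases quoted above, and the syntactic context that the real system is said to use is ignored entirely.

```python
import re

# Cue phrases quoted in the passage above. The dictionary layout is a
# hypothetical choice for this sketch, not the E-rater's actual design.
CUES = {
    "Summary": ["in summary", "in conclusion"],
    "Belief": ["perhaps", "possibly"],
}


def count_cues(essay):
    """Count whole-phrase, case-insensitive cue occurrences per relation."""
    text = essay.lower()
    counts = {}
    for relation, phrases in CUES.items():
        total = 0
        for phrase in phrases:
            # \b keeps "perhaps" from matching inside a longer word.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            total += len(re.findall(pattern, text))
        counts[relation] = total
    return counts


sample = ("Perhaps the plan will work. Possibly not. "
          "In conclusion, perhaps we should wait.")
print(count_cues(sample))  # {'Summary': 1, 'Belief': 3}
```

Counts like these would be only one input among many; by themselves they say nothing about whether the argument is any good.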
As the passage above shows, by identifying features of an essay, the E-rater is fully able to make relevant judgments about it. Here is a real example. In terms of coherence, the passage below received a 6, with the comment: “The following paragraph demonstrates an example of a maximally coherent text, centering the company ’Famous name’s Baby Food’ and continuing with the same center through the entire paragraph.”
Yet another company that strives for the ”big bucks” through conventional thinking is Famous name’s Baby Food. This company does not go beyond the norm in their product line, product packaging or advertising. If they opted for an extreme market place, they would be ousted. Just look who their market is. As new parents, the Famous name customer wants tradition, quality and trust in their product of choice. Famous name knows this and gives it to them by focusing on ”all natural” ingredients, packaging that shows the happiest baby in the world and feel good commercials the exude great family values. Famous name has really stuck to the typical ways of doing things and in return has been awarded with a healthy bottom line.
Following the same mark-up conventions, we demonstrate text incoherence with an excerpt (a paragraph again) of a student essay scored 4. In this case, repeated Rough-Shift transitions are identified. Several entities are centered, opinion, success and conventional practices, none of which is linked to the previous or following discourse. This discontinuity created by the very short lived Cbs makes it hard to identify the topic of this paragraph and at the same time it is capturing the fact that the introduced centers are poorly developed.
I disagree with the opinion stated above. In order to achieve real and lasting success a person does not have to be a billionaire. And also because conventional practices and ways of thinking can help a person to become rich.
Overall, while it is largely the case that the raters were not actually counting occurrences of indicator cues representing e-rater features, they were tracing qualities that incorporate such features.
Specifically, when an essay writer would make a certain type of assertion in the essay, the raters would expect to see the associated use of certain types of syntactic structures. The absence of such syntax in such an instance would render the assertion superficial. While essays with and without such syntactic variety were both seen, clearly the essays containing the syntactic variety associated with that type of discourse were viewed by the raters as superior.
Obviously, e-rater does not read an essay, so it cannot “look for” or “evaluate” writing qualities. However, e-rater can, and does in some instances, detect evidentiary traces, the proverbial “breadcrumbs in the path,” that signal these qualities, using its own version of the characteristics.
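I have said a lot about how the E-rater's reading of an essay resembles following “breadcrumbs in the path”. So how does the E-rater manage to recognize these breadcrumbs? Clearly, without prior human tuning and input material, the E-rater could recognize nothing. In other words, the E-rater cannot judge an essay on its own; before it scores anything, there is a separate preparatory process.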
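As you may already know, for each individual prompt the E-rater has several hundred stored essays that were scored by humans beforehand (dozens at each score level). What are these essays for? Readers with a science background (esp. those who have studied chemistry) will probably get it the moment they see the word: calibration. As mentioned above, the E-rater can identify more than 50 linguistic features, but the features are not simply added up; each one carries a regression weight. Where do the regression weights come from? They have to be “calibrated” against “reference standards”, that is, established using “standard essays”!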
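The calibration idea can be sketched with plain ordinary least squares: fit one weight per feature so that the weighted sum reproduces the human scores on the pre-scored essays. This is a stand-in under stated assumptions, not the E-rater's actual procedure, which is not described here beyond the mention of regression weights. The two features (cue-word count, clause count), the function names and the training data are all hypothetical.

```python
def fit_weights(features, scores):
    """Ordinary least squares via the normal equations, with an intercept."""
    rows = [[1.0] + list(f) for f in features]  # prepend intercept term
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * s for r, s in zip(rows, scores))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(a[idx][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit weights")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict(weights, feature_vector):
    """Weighted sum of features, rounded and clamped to the 6-point scale."""
    raw = weights[0] + sum(w * x for w, x in zip(weights[1:], feature_vector))
    return min(6, max(1, round(raw)))


# Hypothetical "standard essays": (cue-word count, clause count) -> human score.
train_features = [(0, 0), (2, 0), (0, 1), (2, 1), (4, 1), (2, 3), (4, 3)]
train_scores = [1, 2, 2, 3, 4, 5, 6]

weights = fit_weights(train_features, train_scores)
print(predict(weights, (2, 2)))  # score for a new, unseen essay
```

The point of the sketch is only that the weights are determined entirely by the human-assigned scores: change the human scores on the standard essays and the machine's scoring changes with them.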
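This brings out the core fact I want to stress in this article: if the E-rater's scores do not agree well with human readers' scores, the E-rater is simply useless! In other words, the E-rater's scoring rules all derive from the scoring standards previously set for human readers and are built on them. Every one of the standards was set by people.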
That is, the E-rater must defer to people; it has to be tuned until it agrees with human readers as closely as it possibly can.
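A widely circulated post once said: “The computer rater implicitly puts pressure on the human rater. The computer rater and the human rater each score your essay, and if the results differ substantially, your essay is passed on to a third, human rater (which of course raises ETS's costs). That would be the normal course, but ETS refuses to do so, so the only outcome is that human raters will try their best to follow the computer rater's standards and rules. In other words, the computer rater's score prevails, because a standardized test like the GRE cannot tolerate subjectivity and bias. So do not try to please the human rater in the hope that he will overrule a low score from the computer rater; instead, conform to the computer rater's rules as far as you can.”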
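In response, let me restate the earlier conclusion: the E-rater's scoring rules all derive from what human readers preset. So if there is such a thing as an “E-rater strategy”, it must equally be a strategy that works on human readers, and going one step further, it is nothing more than ordinary writing strategy!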
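That is to say, any so-called “writing strategy that caters to the E-rater”, if it really does cater to the E-rater, is also catering to human readers, so there is nothing new in it; it is just the most basic, fundamental writing method. The phrase “writing strategy that caters to the E-rater” is certainly eye-catching, but the “strategy” it delivers is not the one it leads people to imagine, namely a strategy that caters to the E-rater alone and tilts the machine toward a high score, because no such strategy exists!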
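This probably differs from what some people imagine. Quite a few people may harbor the illusion that there are strategies that fool the E-rater alone; one foreign website even advertises “how to fool the E-rater”. Well, on this point, here is an example from ETS itself: ETS had a group of experts deliberately write essays to try to trick the E-rater. The result? The most successful one, a professor, obtained a positive discrepancy of 5 points (the E-rater score minus the human reader score), but “His principal strategy was simply to write several paragraphs and to repeat them (37 times, in fact!).” The strategies of the next most successful participants were similar; for example, they “attempted, alternatively, to write essays that rambled, missed the point, used faulty logic, or were haphazard in their progression, but used relevant content words, complex sentence structure, or other features valued by e-rater.” Never mind how much effort these people spent crafting essays that would trip up the E-rater: could such essays possibly pass muster with a human reader? Whether something written this way can even be called an “essay” is questionable to begin with.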
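If there really is a tip for fooling the E-rater, I am afraid the one above will disappoint you. And whatever “tips for the E-rater” can be offered do not go beyond conventional writing techniques. Nothing new.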
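And if one really must name a tip, it is probably just this: minimize grammar and spelling errors, because such errors interfere with the E-rater's recognition of higher-level features. Then again, this so-called tip hardly looks like a sophisticated “fool the E-rater” technique; it is just the most basic of basics. Perhaps that is because no machine-specific trick exists in the first place? Perhaps the E-rater does not deserve too much of our attention, and our energy is better spent on practicing basic writing skills?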
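The ETS experiment with experts trying to trick the system also shows why, for the GMAT, it is at most “one of them MAY be an e-rater”, and why it can never be two E-raters. Machines can only serve people by starting from human design; human judgment is always the most fundamental standard.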
