Posted on 2014-11-3 22:58:21
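About e-rater
Last edited by tesolchina on 2014-11-7 17:46
GRE essays are scored by a combination of human raters and machine scoring. ETS's e-rater reportedly agrees with the human scores more than 90% of the time. ETS has published several papers on how e-rater works.
When I have time, and if forum members show enough interest in my blog, I will discuss in detail how e-rater works and how we can respond to it.
Attali & Burstein (2006) describe e-rater in some detail and point out that scoring takes several groups of essay features into account: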
Grammar, Usage, Mechanics, and Style Measures
The writing analysis tools identify five main types of grammar, usage, and mechanics errors – agreement errors, verb formation errors, wrong word use, missing punctuation, and typographical errors. The approach to detecting violations of general English grammar is corpus based and statistical, and can be explained as follows. The system is trained on a large corpus of edited text, from which it extracts and counts sequences of adjacent word and part-of-speech pairs called bigrams. The system then searches student essays for bigrams that occur much less often than would be expected based on the corpus frequencies (Chodorow & Leacock, 2000).
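These include five main types of grammar, usage, and mechanics errors:
- Agreement errors (such as subject-verb agreement)
- Verb formation errors
- Wrong word use
- Missing punctuation
- Typographical (spelling) errors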
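The corpus-based idea in the quoted passage can be illustrated with a rough sketch. This is not e-rater's actual implementation: it looks at word pairs only (the paper also uses part-of-speech pairs), it replaces the "much less often than expected" test with a plain count threshold, and the tiny corpus and the names `flag_rare_bigrams` and `min_count` are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bigram_counts(tokens):
    """Count pairs of adjacent tokens (bigrams)."""
    return Counter(zip(tokens, tokens[1:]))

def flag_rare_bigrams(essay, corpus_counts, min_count=1):
    """Return essay bigrams seen fewer than min_count times in the corpus.

    Assumption: a raw count threshold stands in for the statistical
    comparison against expected frequencies described in the paper.
    """
    tokens = tokenize(essay)
    return [bg for bg in zip(tokens, tokens[1:]) if corpus_counts[bg] < min_count]

# Toy stand-in for a large corpus of edited text (hypothetical data).
corpus = "the students are happy . the student is happy . students are here ."
corpus_counts = bigram_counts(tokenize(corpus))

flagged = flag_rare_bigrams("the students is happy", corpus_counts)
print(flagged)  # [('students', 'is')]
```

With a real corpus the threshold would need to account for how common each word is on its own, otherwise every rare word gets flagged.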
The writing analysis tools also highlight aspects of style that the writer may wish to revise, such as the use of passive sentences, as well as very long or very short sentences within the essay. Another feature of undesirable style that the system detects is the presence of overly repetitious words, a property of the essay that might affect its rating of overall quality (Burstein & Wolska, 2003).
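To detect word-use errors, e-rater takes a corpus-based approach: it checks how often each pair of adjacent words in the essay occurs in a corpus. The corpus I introduced in post #61 should therefore be very helpful for fixing word-choice errors. As for style, e-rater flags sentences that are very long or very short, passive constructions, and overly repeated words. So when writing we should avoid very long or very short sentences, prefer the active voice, and vary our wording, for example by using synonyms or other referring expressions.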
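To check your own draft for the style issues mentioned above, something like the following could be a starting point. It is only a sketch: the length thresholds, the stopword list, and the regex for passives are all my assumptions, and the passive pattern (a form of "be" followed by a word ending in -ed or -en) will miss irregular participles and produce some false hits.

```python
import re
from collections import Counter

# Crude passive pattern: a form of "be" followed by a word ending in -ed/-en.
PASSIVE = re.compile(r"\b(?:is|are|was|were|be|been|being)\s+\w+(?:ed|en)\b", re.IGNORECASE)
# Small illustrative stopword list so function words are not counted as repetition.
STOPWORDS = frozenset({"the", "a", "an", "of", "to", "and", "in", "is", "are",
                       "was", "were", "it", "that", "by"})

def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def style_report(text, short=4, long=35, repeat_min=3):
    """Flag short/long sentences, likely passives, and repeated content words."""
    sentences = split_sentences(text)
    lengths = [len(s.split()) for s in sentences]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    return {
        "too_short": [s for s, n in zip(sentences, lengths) if n < short],
        "too_long": [s for s, n in zip(sentences, lengths) if n > long],
        "passive": [s for s in sentences if PASSIVE.search(s)],
        "repeated": sorted(w for w, c in counts.items() if c >= repeat_min),
    }

sample = ("The policy was criticized by experts. It failed. "
          "Experts say the policy ignores cost, and the policy ignores risk.")
report = style_report(sample)
print(report["too_short"])  # ['It failed.']
print(report["repeated"])   # ['policy']
```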
Organization and Development
Finally, the writing analysis tools provide feedback about discourse elements present or absent in the essay (Burstein, Marcu, and Knight, 2003). The discourse analysis approach is based on a linear representation of the text. It assumes the essay can be segmented into sequences of discourse elements, which include introductory material (to provide the context or set the stage), a thesis statement (to state the writer’s position in relation to the prompt), main ideas (to assert the author’s main message), supporting ideas (to provide evidence and support the claims in the main ideas, thesis, or conclusion), and a conclusion (to summarize the essay’s entire argument). In order to identify the various discourse elements, the system was trained on a large corpus of human annotated essays (Burstein, Marcu, and Knight, 2003). Figure 1 (next page) presents an example of an annotated essay.
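e-rater takes a discourse-analysis approach, assuming that an essay can be segmented into a sequence of discourse elements. Using essays that have already been annotated as training data, the system learns through machine learning to identify which sentences are thesis statements, main ideas (topic sentences), supporting ideas, conclusions, or irrelevant sentences. The specific algorithm will take further reading of the other literature.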
The overall organization score (referred to in what follows as organization) was designed for these genres of writing. It assumes a writing strategy that includes an introductory paragraph, at least a three-paragraph body with each paragraph in the body consisting of a pair of main point and supporting idea elements, and a concluding paragraph. The organization score measures the difference between this minimum five-paragraph essay and the actual discourse elements found in the essay. Missing elements could include supporting ideas for up to the three expected main points or a missing introduction, conclusion, or main point. On the other hand, identification of main points beyond the minimum three would not contribute to the score. This score is only one possible use of the identified discourse elements, but was adopted for this study.
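This means we need to write at least five paragraphs: an introduction, a conclusion, and three body paragraphs. This basically matches the 1+3 model I proposed. Of course, my model places higher demands on coherence (see post #54 in this thread).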
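The quoted passage does not give the exact formula, so the following is only a guess at what "the difference between this minimum five-paragraph essay and the actual discourse elements" could look like. The label names, the `EXPECTED` template, and the "minus one per missing element" scoring are all my assumptions.

```python
from collections import Counter

# Minimum structure described in the quoted passage: an introduction, three
# main-point/supporting-idea pairs, and a conclusion. Label names are hypothetical.
EXPECTED = {"introduction": 1, "main_point": 3, "supporting_idea": 3, "conclusion": 1}

def missing_elements(labels):
    """Count how many expected discourse elements are absent.

    `labels` holds one label per identified discourse element. Elements
    beyond the expected minimum are ignored, so extras earn nothing.
    """
    found = Counter(labels)
    return {name: max(0, need - found[name]) for name, need in EXPECTED.items()}

def organization_score(labels):
    """Hypothetical score: 0 for the full template, minus 1 per missing element."""
    return -sum(missing_elements(labels).values())

essay = ["introduction", "main_point", "supporting_idea",
         "main_point", "supporting_idea", "main_point", "conclusion"]
print(missing_elements(essay))    # one supporting_idea is missing
print(organization_score(essay))  # -1
```

Note that adding a fourth main point to a complete essay leaves the score unchanged, which matches the statement that main points beyond the minimum three do not contribute.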
The second feature derived from Criterion’s organization and development module measures the amount of development in the discourse elements of the essay and is based on their average length (referred to as development).
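This one is a bit of a trap: it seems to score on the length of the body paragraphs. But a high word count alone is not enough, because sentences unrelated to the topic will be labeled irrelevant.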
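A minimal sketch of the development feature, assuming length is measured in words and that elements labeled irrelevant are left out of the average. The second assumption comes from my own reading above, not from the quoted text, and the label names are hypothetical.

```python
def development(elements):
    """Average length, in words, of the discourse elements.

    `elements` is a list of (label, text) pairs. Elements labeled
    "irrelevant" are skipped (an assumption, not confirmed by the paper).
    """
    lengths = [len(text.split()) for label, text in elements if label != "irrelevant"]
    return sum(lengths) / len(lengths) if lengths else 0.0

elements = [
    ("main_point", "Cities should fund public transit."),
    ("supporting_idea", "Buses cut commuting costs for low income workers."),
    ("irrelevant", "My cousin owns a bus."),
]
print(development(elements))  # 6.5
```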
Lexical Complexity (2 features)
Two features in e-rater V.2 are related specifically to word-based characteristics. The first is a measure of vocabulary level (referred to as vocabulary) based on Breland, Jones, and Jenkins’ (1994) Standardized Frequency Index across the words of the essay. The second feature is based on the average word length in characters across the words in the essay (referred to as word length).
This mainly looks at the depth and length of the vocabulary. In other words, using less common words and longer words helps.
http://www.usingenglish.com/resources/text-statistics.php
This online tool can analyze the proportion of difficult words in your essay.
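If you would rather compute something similar yourself, here is a small sketch. `average_word_length` follows the word length feature as quoted. `rare_word_ratio` is only a crude stand-in for the vocabulary feature: it is not the Standardized Frequency Index, and the `COMMON` word list is a toy example I made up.

```python
import re

def words_of(text):
    """Extract alphabetic word tokens."""
    return re.findall(r"[A-Za-z]+", text)

def average_word_length(text):
    """Mean number of characters per word (the 'word length' feature)."""
    words = words_of(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0

def rare_word_ratio(text, common_words):
    """Share of words not in a list of common words (a crude vocabulary proxy)."""
    words = [w.lower() for w in words_of(text)]
    return sum(w not in common_words for w in words) / len(words) if words else 0.0

# Toy list of common words (hypothetical); a real list would hold thousands.
COMMON = {"the", "a", "is", "of", "and", "to", "in", "good", "plan"}

plain = "The plan is good"
fancy = "The proposal is meritorious"
print(average_word_length(plain), average_word_length(fancy))  # 3.25 6.0
print(rare_word_ratio(plain, COMMON), rare_word_ratio(fancy, COMMON))  # 0.0 0.5
```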
Prompt-Specific Vocabulary Usage (2 features)
E-rater evaluates the lexical content of an essay by comparing the words it contains to the words found in a sample of essays from each score category (usually six categories). It is expected that good essays will resemble each other in their word choice, as will poor essays. To do this, content vector analysis (Salton, Wong, & Yang, 1975) is used, where the vocabulary of each score category is converted to a vector whose elements are based on the frequency of each word in the sample of essays.
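These two features rely on a separate pool of sample essays for each prompt. The student's essay is converted into a vector and compared with the vectors built from the sample essays; whichever score level it most resembles is the score it gets. This sounds a bit hard to believe: do all 6-point essays really use similar words? Given how specific the GRE writing instructions are now, I think that at least for Argument tasks the vocabulary will be quite close. For a prompt about assumptions, for example, you will certainly need to use the related words. For Issue tasks it is probably hard to predict which words a 6-point essay will use, but you should at least stay on topic as much as possible, which will bring your essay closer to the samples.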
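Content vector analysis can be sketched roughly as follows. I am assuming raw word counts as the vector elements and cosine similarity as the comparison; the quoted passage only says the elements are "based on the frequency of each word", so the real weighting is probably more refined. The sample essays and the function name `closest_category` are invented for illustration.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn text into a word-frequency vector (raw counts, an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def closest_category(essay, samples_by_score):
    """Compare the essay with one pooled vector per score category."""
    essay_vec = vectorize(essay)
    sims = {score: cosine(essay_vec, vectorize(" ".join(texts)))
            for score, texts in samples_by_score.items()}
    return max(sims, key=sims.get), sims

# Hypothetical sample essays for two score categories of one prompt.
samples = {
    2: ["i think the idea is bad because it is bad"],
    6: ["the argument rests on an unstated assumption about demand",
        "this assumption weakens the evidence for the conclusion"],
}
best, sims = closest_category("the author makes an assumption about the evidence", samples)
print(best)  # 6
```

In this toy case the essay shares "assumption", "evidence", and "about" with the 6-point samples, so it lands in that category, which is the "looks like a 6, scores a 6" idea described above.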
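Overall, e-rater has very explicit requirements for essay structure: the three body paragraphs must support the thesis statement, and each topic sentence needs enough supporting detail. Word choice should be as idiomatic as possible, meaning collocations that can be found in a corpus. In addition, write to the specific requirements of the prompt, so that the words in your essay come close to those in the sample essays.
Of course, e-rater is not completely reliable, and there is no need to do anything extreme just to cater to it, because GRE essays are still read by a human rater.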
Further Reading
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. The Journal of Technology, Learning and Assessment, 4(3). Retrieved from http://napoleon.bc.edu/ojs/index.php/jtla/article/view/1650
Burstein, J. (2003). The E-rater® scoring engine: Automated essay scoring with natural language processing. Retrieved from http://psycnet.apa.org/psycinfo/2003-02475-007
Lee, Y.-W., Gentile, C., & Kantor, R. (2008). Analytic scoring of TOEFL CBT essays: Scores from humans and e-rater. ETS TOEFL Research Reports, 81. Retrieved from http://144.81.87.152/Media/Research/pdf/RR-08-01.pdf
Monaghan, W., & Bridgeman, B. (2005). E-rater as a Quality Control on Human Scores. Retrieved from http://www.researchgate.net/publ ... 49528baf70d5947.pdf
Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2002). Stumping e-rater: Challenging the validity of automated essay scoring. Computers in Human Behavior, 18(2), 103–134.