Courtney Napoles

I am a final-year PhD candidate in Computer Science at Johns Hopkins University in Baltimore, MD. My advisors are Chris Callison-Burch and Benjamin Van Durme, and I have been supported by the NSF GRFP.

My interests lie in natural language processing and machine learning, especially as they apply to educational applications such as grammatical error correction, automatic writing assessment, and investigating the impact of teacher feedback on student writing. For my dissertation, I am investigating monolingual sentence rewriting (specifically sentence compression, simplification, and grammatical error correction), and determining best practices for evaluating system output for these tasks.

Between college and grad school, I worked in the trade publishing industry, editing "helpful" non-fiction books. That experience inspired me to learn how to develop automatic, statistical methods for evaluating and improving text, and so I began a doctoral degree in computer science, specializing in natural language processing. Publishing also taught me to work to deadlines, pay attention to detail, and assess market needs so that a product is packaged for its intended audience.

The same motivation has fueled my careers in both publishing and computer science: to increase the accessibility of information that can help people and to enable people to effectively communicate. After finishing my degree, I hope to find a position at a mission-oriented company where I can use my skills to help advance education in the tech age.

Publications

Online news platforms curate high-quality content for their readers and, in many cases, users can post comments in response. While comment threads routinely contain unproductive banter, insults, or users "shouting" over each other, there are often good discussions buried among the noise. In this paper, we define a new task of identifying "good" conversations, which we call ERICs: Engaging, Respectful, and/or Informative Conversations. Our model successfully identifies ERICs posted in response to online news articles with F1 = 0.73, and in debate forums with F1 = 0.91.
@inproceedings{napoles2017automatically,
  title     = {Automatically Identifying Good Conversations Online (Yes, They Do Exist!)},
  author    = {Napoles, Courtney and Pappu, Aasish and Tetreault, Joel},
  booktitle = {Eleventh International AAAI Conference on Web and Social Media},
  year      = {2017}
}
This work presents a dataset and annotation scheme for the new task of identifying "good" conversations that occur online, which we call ERICs: Engaging, Respectful, and/or Informative Conversations. We develop a taxonomy to reflect features of entire threads and individual comments which we believe contribute to identifying ERICs; code a novel dataset of Yahoo News comment threads (2.4k threads and 10k comments) and 1k threads from the Internet Argument Corpus; and analyze the features characteristic of ERICs. This is one of the largest annotated corpora of online human dialogues, with the most detailed set of annotations. It will be valuable for identifying ERICs and other aspects of argumentation, dialogue, and discourse.
@InProceedings{napoles-EtAl:2017:LAW,
  author    = {Napoles, Courtney  and  Tetreault, Joel  and  Pappu, Aasish  and  Rosato, Enrica  and  Provenzale, Brian},
  title     = {Finding Good Conversations Online: {The Yahoo News Annotated Comments Corpus}},
  booktitle = {Proceedings of the 11th Linguistic Annotation Workshop},
  month     = {April},
  year      = {2017},
  address   = {Valencia, Spain},
  publisher = {Association for Computational Linguistics},
  pages     = {13--23},
  url       = {http://www.aclweb.org/anthology/W17-0802}
}
We present a new parallel corpus, JHU FLuency-Extended GUG corpus (JFLEG) for developing and evaluating grammatical error correction (GEC). Unlike other corpora, it represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native sounding. We describe the types of corrections made and benchmark four leading GEC systems on this corpus, identifying specific areas in which they do well and how they can improve. JFLEG fulfills the need for a new gold standard to properly assess the current state of GEC.
@InProceedings{napoles-sakaguchi-tetreault:2017:EACLshort,
  author    = {Napoles, Courtney  and  Sakaguchi, Keisuke  and  Tetreault, Joel},
  title     = {{JFLEG}: A Fluency Corpus and Benchmark for Grammatical Error Correction},
  booktitle = {Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month     = {April},
  year      = {2017},
  address   = {Valencia, Spain},
  publisher = {Association for Computational Linguistics},
  pages     = {229--234},
  url       = {http://www.aclweb.org/anthology/E17-2037}
}
Current methods for automatically evaluating grammatical error correction (GEC) systems rely on gold-standard references. However, these methods suffer from penalizing grammatical edits that are correct but not in the gold standard. We show that reference-less grammaticality metrics correlate very strongly with human judgments and are competitive with the leading reference-based evaluation metrics. By interpolating both methods, we achieve state-of-the-art correlation with human judgments. Finally, we show that GEC metrics are much more reliable when they are calculated at the sentence level instead of the corpus level. We have set up a CodaLab site for benchmarking GEC output using a common dataset and different evaluation metrics.
@InProceedings{napoles-sakaguchi-tetreault:2016:EMNLP2016,
  author    = {Napoles, Courtney  and  Sakaguchi, Keisuke  and  Tetreault, Joel},
  title     = {There's No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction},
  booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},
  month     = {November},
  year      = {2016},
  address   = {Austin, Texas},
  publisher = {Association for Computational Linguistics},
  pages     = {2109--2115},
  url       = {https://aclweb.org/anthology/D16-1228}
}
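
A minimal sketch of the interpolation and sentence-level scoring ideas described above, assuming the reference-less and reference-based scores have already been computed and scaled to [0, 1] (the numbers below are made up for illustration):

def interpolate(grammaticality, reference_score, alpha=0.5):
    # Linear interpolation of a reference-less grammaticality score with a
    # reference-based metric score for a single sentence.
    return alpha * grammaticality + (1 - alpha) * reference_score

def sentence_level_corpus_score(grammaticality_scores, reference_scores, alpha=0.5):
    # Score each sentence separately and average, rather than computing a
    # single corpus-level score.
    pairs = zip(grammaticality_scores, reference_scores)
    scores = [interpolate(g, r, alpha) for g, r in pairs]
    return sum(scores) / len(scores)

# Hypothetical per-sentence scores for a three-sentence test set.
print(sentence_level_corpus_score([0.8, 0.6, 0.9], [0.7, 0.5, 0.95]))
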
The field of grammatical error correction (GEC) has grown substantially in recent years, with research directed at both evaluation metrics and improved system performance against those metrics. One unvisited assumption, however, is the reliance of GEC evaluation on error-coded corpora, which contain specific labeled corrections. We examine current practices and show that GEC's reliance on such corpora unnaturally constrains annotation and automatic evaluation, resulting in (a) sentences that do not sound acceptable to native speakers and (b) system rankings that do not correlate with human judgments. In light of this, we propose an alternate approach that jettisons costly error coding in favor of unannotated, whole-sentence rewrites. We compare the performance of existing metrics over different gold-standard annotations, and show that automatic evaluation with our new annotation scheme has very strong correlation with expert rankings (rho = 0.82). As a result, we advocate for a fundamental and necessary shift in the goal of GEC, from correcting small, labeled error types, to producing text that has native fluency.
@article{tacl-gec-eval-2016,
  author   = {Sakaguchi, Keisuke  and Napoles, Courtney  and Post, Matt and Tetreault, Joel },
  title    = {Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality},
  journal  = {Transactions of the Association for Computational Linguistics},
  volume   = {4},
  year     = {2016},
  issn     = {2307-387X},
  url      = {https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/800},
  pages    = {169--182}
}
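
As an illustration of the correlation analysis, a minimal sketch of computing Spearman's rho between a metric's ranking of systems and a human ranking, using scipy with made-up rankings:

from scipy.stats import spearmanr

# Hypothetical rankings of five GEC systems (1 = best); not real results.
human_ranking  = [1, 2, 3, 4, 5]
metric_ranking = [2, 1, 3, 5, 4]

rho, p_value = spearmanr(human_ranking, metric_ranking)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
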
The Automated Evaluation of Scientific Writing, or AESW, is the task of identifying sentences in need of correction to ensure their appropriateness in scientific prose. The data set comes from a professional editing company, VTeX, with two aligned versions of the same text—before and after editing—and covers a variety of textual infelicities that proofreaders have edited. While previous shared tasks focused solely on grammatical errors (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013; Ng et al., 2014), this time edits cover other types of linguistic misfits as well, including those that almost certainly could be interpreted as style issues and similar “matters of opinion”. The latter arise because of different language editing traditions, experience, and the absence of uniform agreement on what “good” scientific language should look like. Initiating this task, we expected the participating teams to help identify the characteristics of “good” scientific language, and help create a consensus of which language improvements are acceptable (or necessary). Six participating teams took on the challenge.
@InProceedings{daudaravicius-EtAl:2016:BEA11,
  author    = {Daudaravicius, Vidas  and  Banchs, Rafael E.  and  Volodina, Elena  and  Napoles, Courtney},
  title     = {A Report on the Automatic Evaluation of Scientific Writing Shared Task},
  booktitle = {Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications},
  month     = {June},
  year      = {2016},
  address   = {San Diego, CA},
  publisher = {Association for Computational Linguistics},
  pages     = {53--62},
  url       = {http://www.aclweb.org/anthology/W16-0506}
}
In this work, we estimate the deterioration of NLP processing given an estimate of the amount and nature of grammatical errors in a text. From a corpus of essays written by English-language learners, we extract ungrammatical sentences, controlling the number and types of errors in each sentence. We focus on six categories of errors that are commonly made by English-language learners, and consider sentences containing one or more of these errors. To evaluate the effect of grammatical errors, we measure the deterioration of ungrammatical dependency parses using the labeled F-score, an adaptation of the labeled attachment score. We find notable differences between the influence of individual error types on the dependency parse, as well as interactions between multiple errors.
@InProceedings{napoles-cahill-madnani:2016:BEA11,
  author    = {Napoles, Courtney  and  Cahill, Aoife  and  Madnani, Nitin},
  title     = {The Effect of Multiple Grammatical Errors on Processing Non-Native Writing},
  booktitle = {Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications},
  month     = {June},
  year      = {2016},
  address   = {San Diego, CA},
  publisher = {Association for Computational Linguistics},
  pages     = {1--11},
  url       = {http://www.aclweb.org/anthology/W16-0501}
}
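
A minimal sketch of a labeled F-score over dependency parses, treating each parse as a set of (dependent, head, relation) triples; this is an illustrative reconstruction, not the exact scorer used in the paper:

def labeled_f1(gold_arcs, predicted_arcs):
    # Labeled F-score between two dependency parses, each a set of
    # (dependent_index, head_index, relation) triples. Unlike the labeled
    # attachment score, this does not assume the two parses cover exactly
    # the same tokens.
    correct = len(gold_arcs & predicted_arcs)
    if not gold_arcs or not predicted_arcs or correct == 0:
        return 0.0
    precision = correct / len(predicted_arcs)
    recall = correct / len(gold_arcs)
    return 2 * precision * recall / (precision + recall)

# Toy example: the parse of an ungrammatical sentence disagrees with the
# parse of its corrected counterpart on one arc.
gold = {(1, 2, "nsubj"), (2, 0, "root"), (3, 2, "dobj")}
pred = {(1, 2, "nsubj"), (2, 0, "root"), (3, 1, "dobj")}
print(round(labeled_f1(gold, pred), 3))  # 0.667
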
We present a simple, prepackaged solution to generating paraphrases of English sentences. We use the Paraphrase Database (PPDB) for monolingual sentence rewriting and provide machine translation language packs: prepackaged, tuned models that can be downloaded and used to generate paraphrases on a standard Unix environment. The language packs can be treated as a black box or customized to specific tasks. In this demonstration, we will explain how to use the included interactive web-based tool to generate sentential paraphrases.
@InProceedings{napoles-callisonburch-post:2016:N16-3,
  author    = {Napoles, Courtney  and  Callison-Burch, Chris  and  Post, Matt},
  title     = {Sentential Paraphrasing as Black-Box Machine Translation},
  booktitle = {Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations},
  month     = {June},
  year      = {2016},
  address   = {San Diego, California},
  publisher = {Association for Computational Linguistics},
  pages     = {62--66},
  url       = {http://www.aclweb.org/anthology/N16-3013}
}
Most recent sentence simplification systems use basic machine translation models to learn lexical and syntactic paraphrases from a manually simplified parallel corpus. These methods are limited by the quality and quantity of manually simplified corpora, which are expensive to build. In this paper, we conduct an in-depth adaptation of statistical machine translation to perform text simplification, taking advantage of large-scale paraphrases learned from bilingual texts and a small amount of manual simplifications with multiple references. Our work is the first to design automatic metrics that are effective for tuning and evaluating simplification systems, which will facilitate iterative development for this task.
@article{xu2016optimizing,
  author  = {Xu, Wei  and Napoles, Courtney  and Pavlick, Ellie  and Chen, Quanze  and Callison-Burch, Chris },
  title   = {Optimizing Statistical Machine Translation for Text Simplification},
  journal = {Transactions of the Association for Computational Linguistics},
  volume  = {4},
  year    = {2016},
  issn    = {2307-387X},
  url     = {https://transacl.org/ojs/index.php/tacl/article/view/741},
  pages   = {401--415}
}
How do we know which grammatical error correction (GEC) system is best? A number of metrics have been proposed over the years, each motivated by weaknesses of previous metrics; however, the metrics themselves have not been compared to an empirical gold standard grounded in human judgments. We conducted the first human evaluation of GEC system outputs, and show that the rankings produced by metrics such as MaxMatch and I-measure do not correlate well with this ground truth. As a step towards better metrics, we also propose GLEU, a simple variant of BLEU, modified to account for both the source and the reference, and show that it hews much more closely to human judgments.
@InProceedings{napoles-EtAl:2015:ACL-IJCNLP,
  author    = {Napoles, Courtney  and  Sakaguchi, Keisuke  and  Post, Matt  and  Tetreault, Joel},
  title     = {Ground Truth for Grammatical Error Correction Metrics},
  booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
  month     = {July},
  year      = {2015},
  address   = {Beijing, China},
  publisher = {Association for Computational Linguistics},
  pages     = {588--593},
  url       = {http://www.aclweb.org/anthology/P15-2097}
}
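
A much-simplified, unigram-only sketch of the intuition behind GLEU (reward n-grams that match the reference, penalize n-grams kept from the source even though the reference changed them); the actual metric uses higher-order n-grams and BLEU-style machinery, so treat this purely as an illustration:

from collections import Counter

def toy_gleu(source, hypothesis, reference):
    # Credit hypothesis tokens that appear in the reference, subtract credit
    # for tokens that appear in the source but not the reference (tokens the
    # system should have changed), and normalize by hypothesis length.
    # Not the official GLEU implementation.
    src, hyp, ref = Counter(source), Counter(hypothesis), Counter(reference)
    matches = sum((hyp & ref).values())
    penalties = sum((hyp & (src - ref)).values())
    return max(matches - penalties, 0) / max(len(hypothesis), 1)

source     = "she go to school yesterday".split()
hypothesis = "she went to school yesterday".split()
reference  = "she went to school yesterday".split()
print(toy_gleu(source, hypothesis, reference))  # 1.0
print(toy_gleu(source, source, reference))      # 0.6 (uncorrected error is penalized)
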
In this work, we explore applications of automatic essay scoring (AES) to a corpus of essays written by college freshmen and discuss the challenges we faced. While most AES systems evaluate highly constrained writing, we developed a system that handles open-ended, long-form writing. We present a novel corpus for this task, containing more than 3,000 essays and drafts written for a freshman writing course. We describe statistical analysis of the corpus and identify problems with automatically scoring this type of data. We then demonstrate how to overcome grader bias by using a multi-task setup, predicting scores as accurately as human graders on a different dataset. Finally, we discuss how AES can help teachers assign more uniform grades.
@InProceedings{napoles-callisonburch:2015:bea,
  author    = {Napoles, Courtney  and  Callison-Burch, Chris},
  title     = {Automatically Scoring Freshman Writing: A Preliminary Investigation},
  booktitle = {Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications},
  month     = {June},
  year      = {2015},
  address   = {Denver, Colorado},
  publisher = {Association for Computational Linguistics},
  pages     = {254--263},
  url       = {http://www.aclweb.org/anthology/W15-0629}
}
Simple Wikipedia has dominated simplification research in the past 5 years. In this opinion paper, we argue that focusing on Wikipedia limits simplification research. We back up our arguments with corpus analysis and by highlighting statements that other researchers have made in the simplification literature. We introduce a new simplification dataset that is a significant improvement over Simple Wikipedia, and present a novel quantitative-comparative approach to study the quality of simplification data resources.
@article{xu2015problems,
  author  = {Xu, Wei  and Callison-Burch, Chris  and Napoles, Courtney },
  title   = {Problems in Current Text Simplification Research: New Data Can Help},
  journal = {Transactions of the Association for Computational Linguistics},
  volume  = {3},
  year    = {2015},
  issn    = {2307-387X},
  url     = {https://transacl.org/ojs/index.php/tacl/article/view/549},
  pages   = {283--297}
}
We have created layers of annotation on the English Gigaword v.5 corpus to render it useful as a standardized corpus for knowledge extraction and distributional semantics. Most existing large-scale work is based on inconsistent corpora which often have needed to be re-annotated by research teams independently, each time introducing biases that manifest as results that are only comparable at a high level. We provide to the community a public reference set based on current state-of-the-art syntactic analysis and coreference resolution, along with an interface for programmatic access. Our goal is to enable broader involvement in large-scale knowledge-acquisition efforts by researchers that otherwise may not have had the ability to produce such a resource on their own.
@InProceedings{napoles-gormley-vandurme:2012:AKBC-WEKEX,
  author    = {Napoles, Courtney  and  Gormley, Matthew  and  Van Durme, Benjamin},
  title     = {Annotated {G}igaword},
  booktitle = {Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)},
  month     = {June},
  year      = {2012},
  address   = {Montr{\'e}al, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {95--100},
  url       = {http://www.aclweb.org/anthology/W12-3018}
}
Previous work has shown that high quality phrasal paraphrases can be extracted from bilingual parallel corpora. However, it is not clear whether bitexts are an appropriate resource for extracting more sophisticated sentential paraphrases, which are more obviously learnable from monolingual parallel corpora. We extend bilingual paraphrase extraction to syntactic paraphrases and demonstrate its ability to learn a variety of general paraphrastic transformations, including passivization, dative shift, and topicalization. We discuss how our model can be adapted to many text generation tasks by augmenting its feature set, development data, and parameter estimation routine. We illustrate this adaptation by using our paraphrase model for the task of sentence compression and achieve results competitive with state-of-the-art compression systems.
@InProceedings{ganitkevitch-EtAl:2011:EMNLP,
  author    = {Ganitkevitch, Juri  and  Callison-Burch, Chris  and  Napoles, Courtney  and  Van Durme, Benjamin},
  title     = {Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation},
  booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing},
  month     = {July},
  year      = {2011},
  address   = {Edinburgh, Scotland, UK.},
  publisher = {Association for Computational Linguistics},
  pages     = {1168--1179},
  url       = {http://www.aclweb.org/anthology/D11-1108}
}
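
The bilingual extraction builds on phrase-table pivoting, where two English phrases that translate to the same foreign phrase are treated as paraphrases of each other. A minimal sketch of the standard pivot probability, p(e2 | e1) = sum over f of p(e2 | f) p(f | e1), with made-up phrase-table fragments (an illustration of the general technique, not this paper's syntactic model):

from collections import defaultdict

def pivot_paraphrase_probs(p_e_given_f, p_f_given_e):
    # p(e2 | e1) is estimated by marginalizing over foreign pivot phrases f.
    # Inputs are nested dicts: p_e_given_f[f][e] and p_f_given_e[e][f].
    paraphrases = defaultdict(dict)
    for e1, foreign_probs in p_f_given_e.items():
        for f, p_f in foreign_probs.items():
            for e2, p_e2 in p_e_given_f.get(f, {}).items():
                if e2 != e1:
                    paraphrases[e1][e2] = paraphrases[e1].get(e2, 0.0) + p_e2 * p_f
    return paraphrases

# Toy phrase-table fragments with invented probabilities.
p_f_given_e = {"thrown into jail": {"festgenommen": 0.6, "inhaftiert": 0.4}}
p_e_given_f = {"festgenommen": {"arrested": 0.7, "thrown into jail": 0.3},
               "inhaftiert":   {"imprisoned": 0.8, "thrown into jail": 0.2}}
print(dict(pivot_paraphrase_probs(p_e_given_f, p_f_given_e)["thrown into jail"]))
# roughly {'arrested': 0.42, 'imprisoned': 0.32}
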
This work surveys existing evaluation methodologies for the task of sentence compression, identifies their shortcomings, and proposes alternatives. In particular, we examine the problems of evaluating paraphrastic compression and comparing the output of different models. We demonstrate that compression rate is a strong predictor of compression quality and that perceived improvement over other models is often a side effect of producing longer output.
@InProceedings{napoles-vandurme-callisonburch:2011:T2TW-2011,
  author    = {Napoles, Courtney  and  Van Durme, Benjamin  and  Callison-Burch, Chris},
  title     = {Evaluating Sentence Compression: Pitfalls and Suggested Remedies},
  booktitle = {Proceedings of the Workshop on Monolingual Text-To-Text Generation},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon},
  publisher = {Association for Computational Linguistics},
  pages     = {91--97},
  url       = {http://www.aclweb.org/anthology/W11-1611}
}
We present a substitution-only approach to sentence compression which “tightens” a sentence by reducing its character length. Replacing phrases with shorter paraphrases yields paraphrastic compressions as short as 60% of the original length. In support of this task, we introduce a novel technique for re-ranking paraphrases extracted from bilingual corpora. At high compression rates, paraphrastic compressions outperform a state-of-the-art deletion model in an oracle experiment. For further compression, deleting from oracle paraphrastic compressions preserves more meaning than deletion alone. In either setting, paraphrastic compression shows promise for surpassing deletion-only methods.
@InProceedings{napoles-EtAl:2011:T2TW-2011,
  author    = {Napoles, Courtney  and  Callison-Burch, Chris  and  Ganitkevitch, Juri  and  Van Durme, Benjamin},
  title     = {Paraphrastic Sentence Compression with a Character-based Metric: Tightening without Deletion},
  booktitle = {Proceedings of the Workshop on Monolingual Text-To-Text Generation},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon},
  publisher = {Association for Computational Linguistics},
  pages     = {84--90},
  url       = {http://www.aclweb.org/anthology/W11-1610}
}
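
A minimal sketch of the character-based compression rate implied above (the length of the compressed sentence as a fraction of the original's character length); the sentences are invented for illustration:

def character_compression_rate(original, compressed):
    # Lower values mean more aggressive compression; 0.60 corresponds to the
    # "60% of the original length" figure mentioned above.
    return len(compressed) / len(original)

original   = "The quick brown fox jumped over the extremely lazy sleeping dog."
compressed = "The fox jumped over the lazy dog."
print(f"{character_compression_rate(original, compressed):.0%}")  # 52%
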
Text simplification is the process of changing vocabulary and grammatical structure to create a more accessible version of the text while maintaining the underlying information and content. Automated tools for text simplification are a practical way to make large corpora of text accessible to a wider audience lacking high levels of fluency in the corpus language. In this work, we investigate the potential of Simple Wikipedia to assist automatic text simplification by building a statistical classification system that discriminates simple English from ordinary English. Most text simplification systems are based on hand-written rules (e.g., PEST (Carroll et al., 1999) and its module SYSTAR (Canning et al., 2000)), and therefore face limitations scaling and transferring across domains. The potential for using Simple Wikipedia for text simplification is significant; it contains nearly 60,000 articles with revision histories and aligned articles to ordinary English Wikipedia. Using articles from Simple Wikipedia and ordinary Wikipedia, we evaluated different classifiers and feature sets to identify the most discriminative features of simple English for use across domains. These findings help further understanding of what makes text simple and can be applied as a tool to help writers craft simple text.
@InProceedings{napoles-dredze:2010:CLW,
  author    = {Napoles, Courtney  and  Dredze, Mark},
  title     = {Learning {Simple Wikipedia}: A Cogitation in Ascertaining Abecedarian Language},
  booktitle = {Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids},
  month     = {June},
  year      = {2010},
  address   = {Los Angeles, CA, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {42--50},
  url       = {http://www.aclweb.org/anthology/W10-0406}
}
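
A minimal sketch of the kind of binary classifier involved, using scikit-learn with two toy surface features; the features, training sentences, and model choice here are illustrative placeholders rather than the setup used in the paper:

from sklearn.linear_model import LogisticRegression

def shallow_features(sentence):
    # Two toy surface cues: average word length and sentence length in words.
    words = sentence.split()
    return [sum(len(w) for w in words) / len(words), len(words)]

# Invented examples: 1 = simple English, 0 = ordinary English.
sentences = [
    ("The cat sat on the mat.", 1),
    ("Dogs are friendly animals that many people keep as pets.", 1),
    ("The feline positioned itself atop the woven floor covering.", 0),
    ("Canines are domesticated carnivorous mammals frequently kept as companions.", 0),
]
X = [shallow_features(s) for s, _ in sentences]
y = [label for _, label in sentences]

model = LogisticRegression().fit(X, y)
print(model.predict([shallow_features("The dog ran fast.")]))  # likely [1]
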

Education

  • Johns Hopkins University

    PhD Candidate, Computer Science expected 2017
    Thesis: Best Practices for Monolingual Sentence Rewriting: Generation and Evaluation
    Advisors: Chris Callison-Burch and Benjamin Van Durme

    MSE, Computer Science 2009–2012
    Thesis: Computational Approaches to Shortening and Simplifying Text
    Advisors: Chris Callison-Burch and Benjamin Van Durme

    Supported by the National Science Foundation Graduate Research Fellowship 2012–2015

  • Columbia University

    Post-baccalaureate Studies in Computer Science 2008–2009

  • Princeton University

    AB, Psychology 2001–2005
    Certificate in Linguistics
    Thesis: Conceptual Combination of Nominal Compounds by French and English Speakers

Research Experience

  • Johns Hopkins University

    PhD Researcher 2009–present
    Advisors: Chris Callison-Burch and Benjamin Van Durme
    Project: Sentential paraphrasing

  • Yahoo Research

    Research Intern Summer 2016
    Mentors: Aasish Pappu and Joel Tetreault
    Project: Automatically identifying constructive sub-dialogues

  • Educational Testing Service (ETS)

    Research Intern Summer 2015
    Mentors: Aoife Cahill and Nitin Madnani
    Project: Grammatical errors in non-native English writing and instructor feedback

  • WordNet @ Princeton Cognitive Science Laboratory

    Summer Intern Summer 2014
    Mentor: Christiane Fellbaum
    Project: Word-sense disambiguation

Other Professional Experience

  • Da Capo Press, Perseus Books Group

    Assistant Editor 2007–2008

    Editorial Assistant (Marlowe Books) 2006–2007

  • Marianne Strong Literary Agency

    Assistant Agent 2005–2006

  • Freelance Editor

    Line and developmental editing, layout, copyediting, indexing, and book packaging 2005–2009

  • Community

    • Co-organizer

      First Automatic Evaluation of Scientific Writing Shared Task (AESW), 2016

    • Reviewing

      ACL, AAAI, EACL, EMNLP, COLING, NAACL, ACL-SRW, BEA, NLP-TEA, Computational Linguistics (as secondary reviewer)

    • Outreach

      Mentor, STEM Achievement in Baltimore City Schools (SABES), 2015–2016

      Mentor, Incentive Mentoring Program, 2012–2013

      Site organizer/recruiter, North American Computational Linguistics Olympiad (NACLO), 2009–2012

      School visits and presentations with JHU Center for Educational Outreach, 2009–2012