English Style Linter

The purpose of this project is to build a linter that checks a writing sample's compliance with the guidelines in Strunk and White's The Elements of Style. It also compares a number of metrics calculated on the input text against ranges of acceptable values derived from a list of classics. It is implemented as a Node.js web app.

This project is incomplete. While the core frameworks are all in place, many independent rules remain to be implemented. The project also contains a context-free grammar parser, which currently uses a very weak set of rules. This rule set needs to be improved over time to accommodate many of the remaining style guidelines.

Logic Overview

There are four primary categories of checks in this linter:

Word Match Checks
Validate a single word token by looking for matches in a trie-style, modified deterministic finite automaton (DFA). This structure makes all the matching checks, the most common type of context-free check, run in O(word length) rather than O(number of words that trigger errors, a.k.a. "key words"). Since we expect the number of key words to be significantly higher than the length of any word run through the DFA, this should provide a significant speedup.
Context Free Checks
Validate a single word token by operating on the token. This type of check provides access to the word's part-of-speech and allows for complex operations on the word string. (e.g. error saying "house-boat" should be "houseboat")
Context Dependent Checks
Validate a sequence of word tokens. This type of check provides access to everything context free checks have, as well as the parsed grammar tree of the sentence. (e.g. an error saying 10/10/15 should be written as October 10, 2015)
Overall Metrics
Quantitative metrics used to compare overall style of the input text with examples of "good" writing.

These checks do not exist to say, definitively, that the error-tagged word is bad or wrong. Instead, they serve as guidelines for generally good writing. There are many exceptions to every rule. In fact, it would be fair to say the best writing should trigger at least a few of these errors. In E.B. White's own words:

“ Style rules of this sort are, of course, somewhat a matter of individual preference, and even the established rules of grammar are open to challenge. Professor Strunk, although one of the most inflexible and choosy of men, was quick to acknowledge the fallacy of inflexibility and the danger of doctrine. ”

— E.B. White, The Elements of Style

That said, it is still prudent to regard each warning carefully.

“ It is an old observation that the best writers sometimes disregard the rules of rhetoric. When they do so, however, the reader will usually find in the sentence some compensating merit, attained at the cost of the violation. Unless he is certain of doing as well, he will probably do best to follow the rules. ”

— William Strunk Jr., The Elements of Style

Standards for Acceptable Metric Values

We determine the acceptable range of values for each metric using those values drawn from the following list of 9 "good" texts. These are texts generally accepted to be strong writing:

Jonathan Swift - A Modest Proposal (1729)
Emily Brontë - Wuthering Heights (1847)
Charles Dickens - A Tale of Two Cities (1859)
Mark Twain - Huckleberry Finn (1884)
Jack London - The Call of the Wild (1903)
F. Scott Fitzgerald - The Great Gatsby (1925)
J.R.R. Tolkien - The Hobbit (1937)
E.B. White - Charlotte's Web (1952)
Chuck Palahniuk - Fight Club (1996)

This list was not generated using any specific formula. Rather, it was chosen informally, with the goal of covering a relatively wide range of good writing. I did, however, take care to include Charlotte's Web, so as to have a sample of E.B. White's own writing in the acceptable corpus.
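
As a rough illustration of how a range of acceptable values might be derived from the reference texts, the sketch below takes the minimum and maximum of one metric across the books and checks an input value against them. The function names and sample numbers are hypothetical, and min/max is an assumed rule; the project may define its ranges differently.

```javascript
// Hypothetical sketch: derive an acceptable range for one metric from the
// values measured on the reference texts (min/max is an assumed rule).
function acceptableRange(bookValues) {
  if (bookValues.length === 0) {
    throw new Error('need at least one reference value');
  }
  return { min: Math.min(...bookValues), max: Math.max(...bookValues) };
}

// True when the input text's metric falls inside the range (inclusive).
function isWithinRange(value, range) {
  return value >= range.min && value <= range.max;
}

// Made-up metric values for illustration only, not measured from the books.
const sampleValues = [1.2, 3.4, 0.8, 5.1];
const range = acceptableRange(sampleValues);
console.log(range, isWithinRange(2.0, range), isWithinRange(9.9, range));
```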

Current Checks and Metrics

Word Match Checks

Exclamations
All exclamation points are labeled with the error: "Do not attempt to emphasize simple statements by using a mark of exclamation. The exclamation mark is to be reserved for use after true exclamations or commands."
Parentheses
If a word token is either a left or right parenthesis, it is labeled with the error: "Enclose parenthetic expressions between commas."
First Person
Formal writing should generally be in the third person, so any first-person pronoun is labeled with the error: "Do not use the first person in formal writing."

Context Free Checks

In-Word Dashes
Whenever a word contains a hyphen, the validator checks whether removing the hyphen leaves behind an intact word. If so, the word is labeled with the error: "Do not use a hyphen between words that can be better written as one word."
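
A minimal sketch of this check, assuming the validator has some word list to consult. The `dictionary` set and the function name are hypothetical stand-ins, not the project's actual code.

```javascript
// Hypothetical sketch of the in-word hyphen check. `dictionary` stands in for
// whatever word list the linter consults; it is an assumption, not project code.
const HYPHEN_ERROR =
  'Do not use a hyphen between words that can be better written as one word.';

function checkInWordHyphen(word, dictionary) {
  if (!word.includes('-')) {
    return null; // nothing to check
  }
  const joined = word.replace(/-/g, '').toLowerCase();
  // Flag the word only if removing the hyphen leaves a known word.
  return dictionary.has(joined) ? HYPHEN_ERROR : null;
}

const dictionary = new Set(['houseboat', 'today']);
console.log(checkInWordHyphen('house-boat', dictionary)); // the error message
console.log(checkInWordHyphen('well-known', dictionary)); // null
```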

Context Dependent Checks

Singular Possessive
If a noun is followed by an apostrophe but not an "s" afterwards, it is labeled with the error: "Form the possessive singular of nouns by adding 's."
"As _ or _ than"
The validator finds sequences containing the pattern "as x or y than", where x and y are adjectives, and labels "as" with the error: "Expressions of this type should be corrected by rearranging the sentences. e.g. 'My opinion is as good or better than his.' -> 'My opinion is as good as his, if not better.'"
"As to whether"
The "as" in these expressions is labeled with the error: "Do not use 'as to whether'. 'Whether' is sufficient."
"As yet"
The "as" in these expressions is labeled with the error: "Do not use 'as yet'. 'Yet' nearly always is as good, if not better."
Oxford Comma
The Oxford comma check searches for the pattern <X> <COMMA> <X> <CC> <X>, where:
- <L> denotes a slice of a sentence that has been labeled L in its parse tree
- CC is a coordinating conjunction
- X is in the CFGParser's label space
If the pattern is found, the coordinating conjunction is labeled with the error: "In a series of three or more terms with a single conjunction, use a comma after each term except the last."
Date Format
This check ensures dates are written as Month D, YYYY or D Month YYYY (with a one- or two-digit day), and never as MM/DD/YY, MM/DD/YYYY, DD/MM/YY, or DD/MM/YYYY. Incorrectly formatted dates are labeled with the error: "You should write dates as 'January 1, 2000' or '1 January 2000'."
Omit Needless Words
This check looks for phrases that contain commonly seen unnecessary words. It labels these phrases with the error: "Vigorous writing is concise. Omit needless words by replacing <bad phrase> with <good phrase>." Though the "as to whether" and "as yet" checks logically fit here, they have been kept separate to preserve the structure of the rules as listed in The Elements of Style.
Loose Sentences
This check finds loose sentences consisting of two clauses, the second introduced by a conjunction or a relative. If three or more of these sentences occur in a row, the first word of each sentence is labeled with the error: "Avoid a succession of loose sentences."
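
Two of the context dependent checks above, the Oxford comma pattern and the date format, can be sketched roughly as follows. The flat `{ label, text }` slice shape, the `NN` label in the example, and the function names are assumptions made for illustration. The sketch also assumes that X must be the same label in all three slots. The real checks operate on the parse trees produced by CFGParser.

```javascript
// Hypothetical sketch. Each slice is { label, text }, where label is the
// parse-tree label of that slice; this flat shape is an assumption.
const OXFORD_ERROR =
  'In a series of three or more terms with a single conjunction, ' +
  'use a comma after each term except the last.';

// Returns the indices of coordinating conjunctions that complete the
// pattern <X> <COMMA> <X> <CC> <X>, with the same label X in all three slots.
function findMissingOxfordCommas(slices) {
  const hits = [];
  for (let i = 0; i + 4 < slices.length; i++) {
    const [a, comma, b, cc, c] = slices.slice(i, i + 5).map((s) => s.label);
    const isTerm = a !== 'COMMA' && a !== 'CC';
    if (isTerm && comma === 'COMMA' && cc === 'CC' && a === b && b === c) {
      hits.push(i + 3); // position of the conjunction to label
    }
  }
  return hits;
}

// Matches purely numeric dates such as 10/10/15 or 1/2/2015.
const NUMERIC_DATE = /^\d{1,2}\/\d{1,2}\/(\d{2}|\d{4})$/;

function isNumericDate(token) {
  return NUMERIC_DATE.test(token);
}

const series = [
  { label: 'NN', text: 'red' },
  { label: 'COMMA', text: ',' },
  { label: 'NN', text: 'white' },
  { label: 'CC', text: 'and' },
  { label: 'NN', text: 'blue' },
];
console.log(findMissingOxfordCommas(series)); // index of "and"
console.log(isNumericDate('10/10/15'), isNumericDate('October 10, 2015'));
```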

Overall Metrics

These are the corpus-wide metrics that I have implemented as of the first iteration of this project:

Exclamations
The percentage of sentence terminators that are exclamation points.
Average Paragraph Length
The average paragraph length in the corpus, including dialog.
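
For illustration, the two metrics above might be computed along these lines. The tokenization is deliberately naive (every period counts as a terminator, so abbreviations skew the result). Measuring paragraph length in words and separating paragraphs by blank lines are both assumptions, and the function names are hypothetical.

```javascript
// Hypothetical sketch of the two metrics; tokenization is deliberately naive.

// Percentage of sentence terminators (., !, ?) that are exclamation points.
function exclamationPercentage(text) {
  const terminators = text.match(/[.!?]/g) || [];
  if (terminators.length === 0) return 0;
  const exclamations = terminators.filter((t) => t === '!').length;
  return (100 * exclamations) / terminators.length;
}

// Average paragraph length in words, assuming blank lines separate paragraphs
// and that length is counted in words.
function averageParagraphLength(text) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);
  if (paragraphs.length === 0) return 0;
  const totalWords = paragraphs
    .map((p) => p.split(/\s+/).length)
    .reduce((sum, n) => sum + n, 0);
  return totalWords / paragraphs.length;
}

console.log(exclamationPercentage('Stop! Go on. Why? Now!'));
console.log(averageParagraphLength('One two three.\n\nFour five.'));
```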

Implementation Details

Organization

When the user enters a sample of text and clicks "Go", the linter creates an instance of InputCorpus, which performs two calculations on the raw text. First, it stores a part-of-speech tagged copy of the text. Then, it generates the context-free grammar parse tree for each sentence using the CKY algorithm. CKY is implemented in CFGParser, and its rules (in Chomsky Normal Form) are given in CFGRules.
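
To make the CKY step concrete, here is a minimal recognizer over a toy grammar in Chomsky Normal Form. It only reports whether a sentence can be derived, whereas CFGParser builds parse trees. The grammar shape, the rules, and the assumption that parsing runs over part-of-speech tags are all illustrative and not taken from CFGRules.

```javascript
// Minimal CKY recognizer over a toy grammar in Chomsky Normal Form.
// The grammar shape and rules are illustrative assumptions, not CFGRules.
const toyGrammar = {
  // Lexical rules: part-of-speech tag -> labels that can derive it.
  lexical: { DT: ['Det'], NN: ['N'], VB: ['V'] },
  // Binary rules: parent -> left right.
  binary: [
    { parent: 'NP', left: 'Det', right: 'N' },
    { parent: 'VP', left: 'V', right: 'NP' },
    { parent: 'S', left: 'NP', right: 'VP' },
  ],
};

// tags: array of POS tags for one sentence. Returns true if the start symbol
// can derive the whole tag sequence.
function ckyRecognize(tags, grammar, start = 'S') {
  const n = tags.length;
  if (n === 0) return false;
  // table[i][j] holds the set of labels spanning tags[i..j] inclusive.
  const table = Array.from({ length: n }, () =>
    Array.from({ length: n }, () => new Set())
  );
  tags.forEach((tag, i) => {
    (grammar.lexical[tag] || []).forEach((label) => table[i][i].add(label));
  });
  // Fill in longer spans from shorter ones, trying every split point k.
  for (let span = 2; span <= n; span++) {
    for (let i = 0; i + span - 1 < n; i++) {
      const j = i + span - 1;
      for (let k = i; k < j; k++) {
        for (const rule of grammar.binary) {
          if (table[i][k].has(rule.left) && table[k + 1][j].has(rule.right)) {
            table[i][j].add(rule.parent);
          }
        }
      }
    }
  }
  return table[0][n - 1].has(start);
}

console.log(ckyRecognize(['DT', 'NN', 'VB', 'DT', 'NN'], toyGrammar)); // true
```

Recording back-pointers in each table cell, rather than only the label, would be one way to recover the parse tree itself.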

This InputCorpus instance is passed to a new Validator, which runs all the checks and metric calculations. ContextFreeChecks, ContextDepChecks, and Metrics hold their respective lists of checks and metrics. WordMatchChecks is an implementation of MDFA, a modified DFA such that:

The transition function δ is trie-like, so δ(q_i, w.charAt(i)) = q_{i+1}, where w ∈ Σ*.
The input alphabet Σ is not strictly defined, because any undefined transitions are directed to a generic, simulated reject state R at runtime, such that ∀x ∈ Σ. δ(R, x) = R.
Transitions are case insensitive. Consequently, state names (given by the sequence of input characters that reaches them) are also case insensitively matched.
On acceptance, the automaton returns a list of validation errors, where acceptance means matching the input word to one on the list of words that trigger validation errors.
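
The properties above can be sketched as a small trie-backed matcher. The class and method names are hypothetical and not taken from WordMatchChecks. The reject state R is simulated by returning an empty list as soon as a transition is missing.

```javascript
// Hypothetical sketch of a trie-style DFA for word matching. Class and method
// names are illustrative; they are not taken from WordMatchChecks.
class TrieMatcher {
  constructor() {
    this.root = { next: new Map(), errors: [] };
  }

  // Register a key word and the validation error it should trigger.
  add(word, error) {
    let state = this.root;
    for (const ch of word.toLowerCase()) {
      if (!state.next.has(ch)) {
        state.next.set(ch, { next: new Map(), errors: [] });
      }
      state = state.next.get(ch);
    }
    state.errors.push(error);
  }

  // Run the word through the automaton. A missing transition acts as the
  // reject state R: once there, no further input can lead to acceptance.
  match(word) {
    let state = this.root;
    for (const ch of word.toLowerCase()) {
      state = state.next.get(ch);
      if (state === undefined) return [];
    }
    return state.errors.slice();
  }
}

const matcher = new TrieMatcher();
matcher.add('I', 'Do not use the first person in formal writing.');
console.log(matcher.match('i')); // one error: matching is case insensitive
console.log(matcher.match('it')); // [] because "it" is not a key word
```

Each lookup touches one state per character, so its cost depends on the length of the word and not on the number of registered key words.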

Finally, CorpusViewer is used to render the contents of the validated InputCorpus. In addition to rendering the input corpus and errors, CorpusViewer draws plots (with a toggle to switch between box-and-whisker and 1-D scatter) relating the metrics of the input corpus to the values extracted from the "good" texts. These distributions (and their normalizations) are calculated in BookMetricDistributions, which pulls the cached metric values of each book from the BookMetrics constants file.

Node Modules

For testing: chai, sinon, and mocha.

For UI: jquery and d3plus.

For NLP: pos.

Additional tooling: browserify.