English Style Linter
The purpose of this project is to build a linter that checks a writing
sample's compliance with the guidelines in
Strunk and White's Elements of Style.
It also compares a number of metrics calculated on the input text with
ranges of acceptable values, generated from a list of classics. It is
implemented as a Node.js web app.
This project is also incomplete. While the core frameworks are all in
place, there are still many independent rules to be implemented. This
project also contains a context-free grammar parser, which currently
uses a very weak set of rules. This needs to be improved over time to
accommodate many of the remaining style guidelines.
There are four primary categories of checks in this linter:
Word Match Checks
Validate a single word token by looking for matches in a trie-style modified deterministic finite automaton (DFA). This structure lets every matching check, the most common type of context-free check, run in O(word length) time rather than O(number of key words), where "key words" are the words that trigger errors. Since we expect the number of key words to be significantly higher than the length of any word run through the DFA, this should provide a significant speedup.
Context Free Checks
Validate a single word token by operating on the token itself. This type of check provides access to the word's part of speech and allows for complex operations on the word string (e.g. an error saying "house-boat" should be "houseboat").
Context Dependent Checks
Validate a sequence of word tokens. This type of check provides access to everything context free checks have, as well as the parsed grammar tree of the sentence (e.g. an error saying 10/10/15 should be written as October 10, 2015).
Metrics
Quantitative metrics used to compare the overall style of the input text with examples of "good" writing.
These checks do not exist to say, definitively, that the error-tagged word is bad or wrong. Instead, they serve as guidelines for generally good writing. There are many exceptions to every rule. In fact, it would be fair to say the best writing should trigger at least a few of these errors. In E.B. White's own words:
Style rules of this sort are, of course, somewhat a matter of individual preference, and even the established rules of grammar are open to challenge. Professor Strunk, although one of the most inflexible and choosy of men, was quick to acknowledge the fallacy of inflexibility and the danger of doctrine.
— E.B. White, The Elements of Style
That said, it is still prudent to regard each warning carefully.
It is an old observation that the best writers sometimes disregard the rules of rhetoric. When they do so, however, the reader will usually find in the sentence some compensating merit, attained at the cost of the violation. Unless he is certain of doing as well, he will probably do best to follow the rules.
— William Strunk Jr., The Elements of Style
Standards for Acceptable Metric Values
We determine the acceptable range of values for each metric from the
values computed on the following list of 9 "good" texts. These are texts
generally accepted to be strong writing:
- Jonathan Swift - A Modest Proposal (1729)
- Emily Bronte - Wuthering Heights (1847)
- Charles Dickens - A Tale of Two Cities (1859)
- Mark Twain - Huckleberry Finn (1884)
- Jack London - The Call of the Wild (1903)
- F. Scott Fitzgerald - The Great Gatsby (1925)
- J.R.R. Tolkien - The Hobbit (1937)
- E.B. White - Charlotte's Web (1952)
- Chuck Palahniuk - Fight Club (1996)
This list was not generated using any specific formula. Rather, it was
chosen informally with the goal of covering a relatively wide range of
good writing. I did, however, take care to include Charlotte's Web,
so as to have a sample of E.B. White's own writing in the acceptable corpus.
Current Checks and Metrics
Word Match Checks
All exclamation points are labeled with the error: "Do not attempt to emphasize simple statements
by using a mark of exclamation. The exclamation mark is to be reserved for use after true exclamations or commands."
If a word token is either a left or right parenthesis, it is labeled with the error: "Enclose parenthetic expressions between commas."
Formal writing is generally expected to be in the third person, so any first-person pronoun is labeled with the error:
"Do not use the first person in formal writing."
Context Free Checks
Whenever a word contains a hyphen, the validator checks whether removing the hyphen leaves behind an intact word. If so, it is
labeled with the error: "Do not use a hyphen between words that can be better written as one word."
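The hyphen check can be sketched as follows. This is a minimal illustration, not the project's code: the `KNOWN_WORDS` set and the `checkHyphen` function are hypothetical names, and the tiny word list stands in for whatever dictionary the linter actually consults.

```javascript
// Hypothetical stand-in for a real dictionary lookup (assumption).
const KNOWN_WORDS = new Set(["houseboat", "today", "tonight", "bedroom"]);

const HYPHEN_ERROR =
  "Do not use a hyphen between words that can be better written as one word.";

// Returns the error message if removing the hyphens leaves an intact word,
// otherwise null.
function checkHyphen(word) {
  if (!word.includes("-")) return null;
  const joined = word.replace(/-/g, "").toLowerCase();
  return KNOWN_WORDS.has(joined) ? HYPHEN_ERROR : null;
}

console.log(checkHyphen("house-boat")); // the hyphen error message
console.log(checkHyphen("well-known")); // null ("wellknown" is not a word here)
```

The quality of this check depends entirely on the word list: a compound missing from the list is silently accepted.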
Context Dependent Checks
If a noun is followed by an apostrophe but not an "s" afterwards, it is labeled with the error: "Form the possessive singular of nouns by adding 's."
"As _ or _ than"
The validator finds sequences containing the pattern "as x or y than", where x and y are adjectives, and labels "as" with the error:
Expressions of this type should be corrected by rearranging the sentences. e.g.
"My opinion is as good or better than his." -> "My opinion is as good as his, if not better."
"As to whether"
The "as" in these expressions is labeled with the error: "Do not use "as to whether". "Whether" is sufficient".
"As yet"
The "as" in these expressions is labeled with the error: "Do not use "as yet". "Yet" nearly always is as good, if not better".
The Oxford comma check searches for the pattern <X> <COMMA> <X> <CC> <X> where:
- <L> denotes a slice of a sentence that has been labeled L in its parse tree
- CC is a coordinating conjunction
- X is in the CFGParser's label space
If the pattern is found, the coordinating conjunction is labeled with the error: "In a series of three or more terms with a single conjunction, use a comma after each term except the last."
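As a rough illustration of the Oxford comma pattern, the sketch below scans a flat list of labels for X, COMMA, X, CC, X. It assumes the three X slices share the same label and that the labels of sibling slices are already available as an array; both are simplifications, and `findMissingOxfordComma` is a hypothetical name rather than part of the project.

```javascript
const OXFORD_ERROR =
  "In a series of three or more terms with a single conjunction, " +
  "use a comma after each term except the last.";

// `labels` is a hypothetical flat list of parse-tree labels for adjacent
// sentence slices, e.g. ["NP", "COMMA", "NP", "CC", "NP"].
// Returns the indices of the coordinating conjunctions to be flagged.
function findMissingOxfordComma(labels) {
  const hits = [];
  for (let i = 0; i + 4 < labels.length; i++) {
    const [a, comma, b, cc, c] = labels.slice(i, i + 5);
    if (comma === "COMMA" && cc === "CC" && a === b && b === c) {
      hits.push(i + 3); // position of the CC inside `labels`
    }
  }
  return hits;
}

// "red, white and blue" lacks the comma before "and".
console.log(findMissingOxfordComma(["NP", "COMMA", "NP", "CC", "NP"])); // [ 3 ]
```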
This check ensures dates are in the format Month DD/D, YYYY or DD/D Month YYYY, but never MM/DD/YY, MM/DD/YYYY, DD/MM/YY, or DD/MM/YYYY. Incorrectly formatted dates are labeled with the error: "You should write dates as 'January 1, 2000' or '1 January 2000'."
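A simplified version of the date check can be written with a regular expression over raw text. This is only a sketch under that assumption: the actual check is context dependent and works on word tokens, and the names `NUMERIC_DATE` and `findNumericDates` are hypothetical.

```javascript
const DATE_ERROR =
  "You should write dates as 'January 1, 2000' or '1 January 2000'.";

// One or two digits, a slash, one or two digits, a slash, then a two- or
// four-digit year. Covers MM/DD/YY, MM/DD/YYYY, DD/MM/YY and DD/MM/YYYY.
const NUMERIC_DATE = /\b\d{1,2}\/\d{1,2}\/(?:\d{4}|\d{2})\b/g;

// Returns one entry per numeric date found in `text`.
function findNumericDates(text) {
  return [...text.matchAll(NUMERIC_DATE)].map((m) => ({
    index: m.index,
    match: m[0],
    error: DATE_ERROR,
  }));
}

console.log(findNumericDates("The draft is due 10/10/15."));
console.log(findNumericDates("The draft is due October 10, 2015.")); // []
```

The pattern does not try to decide whether a match is month-first or day-first, since both orders are flagged.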
Omit Needless Words
This check looks for phrases that contain commonly seen unnecessary words. It labels these phrases with the error: "Vigorous writing is concise. Omit needless words by replacing <bad phrase> with <good phrase>." Though the "as to whether" and "as yet" checks logically fit here, they have been kept separate to preserve the structure of the rules as listed in The Elements of Style.
This check finds loose sentences consisting of two clauses, the second introduced by a conjunction or a relative. If three or more of these sentences occur in a row, the first word of each sentence is labeled with the error: "Avoid a succession of loose sentences."
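The "three or more in a row" part of this check amounts to finding runs in a sequence of flags. The sketch below assumes each sentence has already been classified as loose or not (the classification itself needs the parse tree and is not shown); `flagLooseRuns` is a hypothetical name.

```javascript
// `isLoose[i]` is true when sentence i was classified as a loose sentence.
// Returns the indices of all sentences that belong to a run of at least
// `minRun` consecutive loose sentences.
function flagLooseRuns(isLoose, minRun = 3) {
  const flagged = [];
  let start = 0;
  while (start < isLoose.length) {
    if (!isLoose[start]) {
      start++;
      continue;
    }
    // Extend the run as far as it goes.
    let end = start;
    while (end < isLoose.length && isLoose[end]) end++;
    if (end - start >= minRun) {
      for (let i = start; i < end; i++) flagged.push(i);
    }
    start = end;
  }
  return flagged;
}

console.log(flagLooseRuns([true, true, false, true, true, true])); // [ 3, 4, 5 ]
```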
These are the corpus-wide metrics that I have implemented as of the first iteration of this project:
The percentage of sentence terminators that are exclamation points.
Average Paragraph Length
The average paragraph length in the corpus, including dialog.
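Both metrics can be approximated directly from raw text. The sketch below makes several assumptions that the description above does not settle: paragraphs are separated by blank lines, paragraph length is measured in words, and any run of `.`, `!` or `?` counts as one sentence terminator. The function names are hypothetical.

```javascript
// Percentage of sentence terminators that contain an exclamation point.
function exclamationPercentage(text) {
  const terminators = text.match(/[.!?]+/g) || [];
  if (terminators.length === 0) return 0;
  const exclamations = terminators.filter((t) => t.includes("!")).length;
  return (100 * exclamations) / terminators.length;
}

// Average number of words per paragraph (assumed unit: words).
function averageParagraphLength(text) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  if (paragraphs.length === 0) return 0;
  const totalWords = paragraphs.reduce(
    (sum, p) => sum + p.split(/\s+/).length,
    0
  );
  return totalWords / paragraphs.length;
}

const sample = "Stop! I mean it.\n\nWhy not? Fine.";
console.log(exclamationPercentage(sample)); // 25
console.log(averageParagraphLength(sample)); // 3.5
```

Abbreviations such as "Dr." are counted as terminators here, so a tokenizer-based version would be more accurate.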
Upon entering a sample of text and clicking "Go", the linter creates an instance of InputCorpus, which performs two calculations on the raw text. First, it stores a part-of-speech tagged copy of the text. Then, it generates the context-free grammar parse trees for each sentence using the CKY algorithm. CKY is implemented in CFGParser and its rules (in Chomsky Normal Form) are given in CFGRules.
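For readers unfamiliar with CKY, here is a small recognizer over a toy grammar in Chomsky Normal Form. It is a sketch only: the grammar, the names `BINARY_RULES`, `LEXICAL_RULES` and `ckyRecognize` are made up for illustration and are not the project's CFGParser or CFGRules, and it only reports whether a sentence is derivable instead of building the parse tree.

```javascript
// Toy CNF grammar (hypothetical). Binary rules: [parent, left, right].
const BINARY_RULES = [
  ["S", "NP", "VP"],
  ["NP", "DT", "NN"],
  ["VP", "VB", "NP"],
];
// Lexical rules: word -> labels that can produce it.
const LEXICAL_RULES = new Map([
  ["the", ["DT"]],
  ["dog", ["NN"]],
  ["cat", ["NN"]],
  ["chased", ["VB"]],
]);

// Returns true if `words` can be derived from the `start` symbol.
function ckyRecognize(words, start = "S") {
  const n = words.length;
  if (n === 0) return false;
  // table[i][j] holds every label that spans words[i..j] inclusive.
  const table = Array.from({ length: n }, () =>
    Array.from({ length: n }, () => new Set())
  );
  // Spans of length 1 come from the lexical rules.
  for (let i = 0; i < n; i++) {
    for (const label of LEXICAL_RULES.get(words[i].toLowerCase()) || []) {
      table[i][i].add(label);
    }
  }
  // Longer spans combine two adjacent shorter spans at split point k.
  for (let span = 2; span <= n; span++) {
    for (let i = 0; i + span - 1 < n; i++) {
      const j = i + span - 1;
      for (let k = i; k < j; k++) {
        for (const [parent, left, right] of BINARY_RULES) {
          if (table[i][k].has(left) && table[k + 1][j].has(right)) {
            table[i][j].add(parent);
          }
        }
      }
    }
  }
  return table[0][n - 1].has(start);
}

console.log(ckyRecognize(["The", "dog", "chased", "the", "cat"])); // true
console.log(ckyRecognize(["dog", "the", "chased"])); // false
```

A full parser would store back-pointers (the split point and the two child labels) in each table cell so that the tree can be reconstructed afterwards.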
This InputCorpus instance is passed to a new Validator, which runs all the checks and metric calculations. ContextFreeChecks, ContextDepChecks, and Metrics hold their respective lists of checks and metrics. WordMatchChecks is an implementation of MDFA, a modified DFA such that:
The transition function δ is trie-like, so δ(q_i, w.charAt(i)) = q_{i+1}, where w ∈ Σ*.
The input alphabet Σ is not strictly defined, because any undefined transition is directed at runtime to a generic, simulated reject state R such that ∀x ∈ Σ, δ(R, x) = R.
Transitions are case insensitive. Consequently, state names (given by the sequence of input characters that reaches them) are also matched case insensitively.
It returns a list of validation errors on acceptance, where acceptance means matching the input word to one on the list of words that trigger validation errors.
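The properties above can be illustrated with a small trie-backed matcher. This is a sketch, not the project's WordMatchChecks: the class name `WordMatcher` and its `add` and `match` methods are hypothetical, and the reject state R is simulated by returning early once a transition is undefined.

```javascript
class WordMatcher {
  constructor() {
    // Each state has outgoing transitions and the errors it accepts with.
    this.root = { next: new Map(), errors: [] };
  }

  // Registers a key word and the validation error it should trigger.
  add(word, error) {
    let state = this.root;
    for (const ch of word.toLowerCase()) {
      if (!state.next.has(ch)) {
        state.next.set(ch, { next: new Map(), errors: [] });
      }
      state = state.next.get(ch);
    }
    state.errors.push(error);
  }

  // Follows one transition per character, so the cost depends on the word
  // length and not on the number of registered key words.
  match(word) {
    let state = this.root;
    for (const ch of word.toLowerCase()) {
      state = state.next.get(ch);
      // Undefined transition: simulated reject state R, which is absorbing.
      if (state === undefined) return [];
    }
    return [...state.errors];
  }
}

const matcher = new WordMatcher();
matcher.add("I", "Do not use the first person in formal writing.");
matcher.add("(", "Enclose parenthetic expressions between commas.");

console.log(matcher.match("i"));  // first-person error (case insensitive)
console.log(matcher.match("It")); // [] because "it" is not a key word
```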
Finally, CorpusViewer is used to render the contents of the validated InputCorpus. In addition to rendering the input corpus and errors, CorpusViewer draws plots (with a toggle to switch between box-and-whisker and 1-D scatter) relating the metrics of the input corpus to the values extracted from the "good" texts. These distributions (and their normalizations) are calculated in BookMetricDistributions, which pulls the cached metric values of each book from the BookMetrics constants file.