Visual Studio Code: Spellchecking with LanguageTool

[Image by Wilhelm Gunkel (https://unsplash.com/photos/black-and-white-typewriter-on-brown-wooden-table-di8ognBauG0)]

Introduction

Visual Studio Code (VS Code) is one of my favorite tools for day-to-day use. It is an open-source, cross-platform, and lightweight source code editor that is highly customizable and extensible. While it is language-agnostic, with the right extensions it can easily rival purpose-built IDEs. Further features worth mentioning are an integrated terminal, built-in Git support, and a debugger for various programming languages. Moreover, VS Code supports remote development on containers, remote machines, or the Windows Subsystem for Linux (WSL).

Apart from all of these features, the coolest thing to me is how the developers introduce new releases to users: The way the release notes are presented actually makes it fun to try new features!

Spellchecking $\LaTeX$ with LanguageTool

Thanks to the $\LaTeX$-Workshop and L$\TeX$ extensions, I switched to VS Code for all of my $\LaTeX$ needs a long time ago. $\LaTeX$-Workshop brings the core features of $\LaTeX$ typesetting to VS Code, including building projects by means of various recipes (i.e., sequences of commands such as pdflatex -> biber -> 2*pdflatex), forward and reverse SyncTeX, and IntelliSense. L$\TeX$ provides enhanced grammar and spell checking by using LanguageTool.
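As an illustration, a recipe such as the pdflatex -> biber -> 2*pdflatex sequence can be declared in the VS Code settings.json. The following is a minimal sketch, assuming the standard $\LaTeX$-Workshop tools/recipes settings; the exact tool arguments may need adapting to your setup:

```json
{
  // Tools are the individual commands a recipe can chain together.
  "latex-workshop.latex.tools": [
    {
      "name": "pdflatex",
      "command": "pdflatex",
      "args": ["-synctex=1", "-interaction=nonstopmode", "%DOC%"]
    },
    {
      "name": "biber",
      "command": "biber",
      "args": ["%DOCFILE%"]
    }
  ],
  // A recipe runs its tools in order; pdflatex appears twice at the end
  // so that cross-references and the bibliography are resolved.
  "latex-workshop.latex.recipes": [
    {
      "name": "pdflatex -> biber -> 2*pdflatex",
      "tools": ["pdflatex", "biber", "pdflatex", "pdflatex"]
    }
  ]
}
```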

Background

An $n$-gram is a contiguous sequence of $n$ words in a text. For example, the sentence I went to their house consists of five 1-grams, four 2-grams (I went, went to, …), and three 3-grams (…, to their house). LanguageTool can leverage large $n$-gram datasets to detect errors within so-called predefined confusion pairs, i.e., pairs of words that are likely to be confused with each other, such as their and there.
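To make this concrete, here is a small, self-contained Python sketch (not part of LanguageTool) that extracts the $n$-grams of the example sentence:

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return all contiguous n-grams (as strings) of a whitespace-tokenized text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I went to their house"
print(ngrams(sentence, 1))  # ['I', 'went', 'to', 'their', 'house']: five 1-grams
print(ngrams(sentence, 2))  # ['I went', 'went to', 'to their', 'their house']: four 2-grams
print(ngrams(sentence, 3))  # ['I went to', 'went to their', 'to their house']: three 3-grams
```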

LanguageTool can additionally (or alternatively) use a Word2vec neural network model, which performs spellchecking based on word embeddings, i.e., vector representations of words. In my personal experience, however, this approach is less reliable and produces a high false-positive rate.
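For intuition: a word embedding maps each word to a vector such that semantically related words end up close together. The toy Python sketch below uses made-up 3-dimensional vectors (real Word2vec embeddings have hundreds of dimensions learned from data) and compares words by cosine similarity, the kind of signal a model could use to decide which member of a confusion pair fits a context:

```python
import math

# Made-up toy embeddings for illustration only; not real Word2vec vectors.
embeddings = {
    "brakes": [0.90, 0.10, 0.20],
    "breaks": [0.20, 0.80, 0.30],
    "stop":   [0.85, 0.15, 0.25],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# With these toy vectors, "brakes" is far more similar to "stop" than "breaks" is.
print(cosine_similarity(embeddings["brakes"], embeddings["stop"]))
print(cosine_similarity(embeddings["breaks"], embeddings["stop"]))
```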

Installation

These datasets are huge (approximately 8 GB) and therefore not included in the default installation. Up-to-date $n$-gram and Word2vec data are available at https://languagetool.org/download/ngram-data/ and https://languagetool.org/download/word2vec/.

Once the $n$-gram or Word2vec data have been downloaded and placed on the local machine, all that remains is to update the corresponding paths in the VS Code preferences: Ltex > Additional Rules: Language Model and Ltex > Additional Rules: Word2 Vec Model.
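Equivalently, the paths can be set directly in settings.json. A minimal sketch, assuming the LTeX setting keys below; the paths are placeholders that must point to wherever the data was extracted:

```json
{
  // Placeholder paths: replace with the directories holding the downloaded data.
  "ltex.additionalRules.languageModel": "/path/to/ngrams/",
  "ltex.additionalRules.word2VecModel": "/path/to/word2vec/"
}
```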

If everything worked correctly, then given the sentence Don’t forget to put on the breaks., VS Code should report that ‘breaks’ (interruptions) seems less likely than ‘brakes’ (a mechanical device for stopping motion).