= Java autosuggest engine for ANTLR4 grammars Oran Epelbaum :groupid: com.intigua :version: 0.0.1-SNAPSHOT :source-highlighter: pygments
To get started, have a look at the unit tests.
image:https://circleci.com/gh/oranoran/antlr4-autosuggest.svg?style=shield&circle-token=:circle-token[link="https://circleci.com/gh/oranoran/antlr4-autosuggest"] image:https://codecov.io/gh/oranoran/antlr4-autosuggest/branch/master/graph/badge.svg[link="https://codecov.io/gh/oranoran/antlr4-autosuggest"] image:https://api.codacy.com/project/badge/Grade/496e60ce37e64674a32551590a8135e9[link="https://www.codacy.com/app/oranoran/antlr4-autosuggest?utm_source=github.com&utm_medium=referral&utm_content=oranoran/antlr4-autosuggest&utm_campaign=Badge_Grade"]
== Getting antlr-autosuggest
For now, you need to download and build the library from source code. We intend to publish into Maven Central soon.
A - Generating suggestions for a generated grammar
LexerAndParserFactory lexerAndParserfactory = new ReflectionLexerAndParserFactory( MyGeneratedLexer.class, MyGeneratedParser.class); Collection<String> suggestions = new AutoComplete(lexerAndParserFactory, "Complete this tex").suggestCompletions();
== How It Works ANTLR4 grammars get translated into two engines: the lexer and the parser. Before we go into auto-suggestions, let's quickly review what these engines do and how they interact.
The low-level piece is the lexer, configured (suprisingly) by the "lexer rules" part of the grammar. That's the rules at the bottom of a .g4 grammar file, that have all-uppercase names.
The lexer also uses tokens defined directly inside parser rules, so if you have parser rules like
myrule: 'AA' | 'BB', the lexer will care about the
'BB' parts of this rule, but not about other parts.
The lexer is in charge of breaking the input text into a list of word-like sequences known as tokens. Tokens are similar to words in a sentence - they are building blocks used to build larger constructs. To get to the next level of understanding the text, we need to look at the relationship between tokens. That is where the parser comes in.
The parser part of ANTLR4 is configured (again, surprise!) by the "parser rules" part of the grammar. That's the rules at the top, that have lowercase names. Parser rules define how tokens may be combined and ordered, and what is the role of each token in the combination. Parser rules can also refer to other parser rules to create complex, hierarchical structures, similar to phrases that combine to form a sentence.
To process a text, ANTLR4 first runs it through the lexer to cut it into tokens, and then feeds the resulting tokens into a parser. The parser can then describe, using callbacks or a tree walk, the structure of the text - including the role of each token in the text.
Now that we've covered the basics of ANTLR4 language processing, let's go back to autosuggest. The core idea is to process the input text, see where it breaks, and then backtrack to the last valid token. From there, the parser and lexer state machines are rolled forward to produce valid completion suggestions. Here is a more detailed breakdown of the process:
. Run the text through a lexer, and see if the entire text can be lexed successfully. For example, if we get an input like
Please complete this tex, the English lexer would fail at
tex, because that's just part of a word. So the autosuggest engine needs to kick in after the tokens
tex part if not important right now, we'll get back to it later. By the way, if the lexer can parse the whole text, that's also fine. This means the autosuggest engine should kick in after the very last token.
. Now run the tokens from the previous step through the parser's internal state machine, known as the ATN. The ATN is really the heart of the ANTLR4 runtime. It's a simple state machine that describes, given a current state, what are the possible next tokens, and the next state to go to for each allowed token. +
For example, at the start of a sentence (let's call the start of a sentence "state 0"), the token
Please is allowed. When
Please is received, it transitions the ATN to a new state (let's call it "state 1"). In this state, the parser is now expecting a verb token like
complete. After receiving such a verb token, it will move again to a new state, where it will expect a noun again, such as
text, or maybe a pronoun such as
this. And so on. What we care about for the purpose of autosuggest, is the final parser state after consuming all the tokens from the last step. This will tell us what we're really interested in - which tokens can come next after the current text.
. Now comes the magic ingredient: The lexer also has an ATN of its own, which it uses to read a stream of individual characters and turn them into tokens. The lexer ATN and the parser ATN have an overlap - meaning some states appear in both ATNs. Think of this as the states where one token ends and another one starts - both the parser and the lexer care about token beginning and end, though for different purposes. The parser uses these states to look at token relationships, while the lexer uses them to identify token boundaries. Anyway, the autosuggest engine now runs through the lexer's ATN, starting at the last parser state, to figure out what tokens can come next. These are going to be the auto-suggest results.
. Now back to the partial token
tex we dropped ealier. It's not enough for the auto-suggest results to be valid tokens for their place, they also need to complete what has already been typed. So if the lexer analysis yielded completions like
text matches the
tex partial token, and so all the other options are dropped. In fact, these irrelevent tokens are already dropped during the lexer ATN processing.
. The final step is to validate found suggestions against the parser's ATN.. The lexer knows almost nothing about token order and relationships, and it's possible that the suggestions built from its ATN will include valid tokens that don't actually make sense given previous tokens. For example,
Please complete this textile is technically a valid completion that the lexer might suggest, but it doesn't make any sense. So we run the parser one last time on each of the suggested completions, and filter out those suggestions that make the parser fail.
There are some more subtlties to the algorithm, such as avoiding infinite loops while traversing the lexer ATN, in case there are rules with wildcards. The entire code is not long and was written in "clean code" style using test-driven development (TDD), so it's recommended to just read it to get the full details.
. Clone the antlr-autosuggest repository.
. Install Java 8 and Maven if not already available
. Run the command
== Credits Written by Oran Epelbaum at Intigua. When starting to write this, studied the following blog posts (though much has changed since):
- https://blog.rapid7.com/2015/06/29/how-to-implement-antlr4-autocomplete/[RAPID7]: How to Implement ANTLR4 Autocomplete
- https://tomassetti.me/autocompletion-editor-antlr/[TOMASETTI]: Building autocompletion for an editor based on ANTLR by Federico Tomassetti