How to add support for a language to loccount

What we can support

Syntactically, most computer languages fall into one of a few major families: C-like, Algol-like, Pascal-like, Lisp-like, and assembler-like. In loccount all of these are handled by a common parser, a DFSA (deterministic finite-state automaton) that knows about comment boundaries, string-literal delimiters, and a few related things.
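To make the idea concrete, here is a minimal sketch of such a state machine in Python. It is not loccount's actual parser; the function name count_sloc and its parameters are invented for illustration, and it assumes C-style comment delimiters, double-quoted strings with backslash escapes, and no linefeeds inside strings.

```python
def count_sloc(text, winged="//", block=("/*", "*/")):
    """Count lines that hold at least one character of code.

    Hypothetical helper: a tiny state machine with three states
    (code, block comment, string). Not loccount's real parser.
    """
    state = "code"
    sloc = 0
    for line in text.splitlines():
        has_code = False
        i = 0
        while i < len(line):
            if state == "block":
                end = line.find(block[1], i)
                if end < 0:
                    break  # comment continues on the next line
                i = end + len(block[1])
                state = "code"
            elif state == "string":
                has_code = True
                if line[i] == "\\":
                    i += 2  # skip the escaped character
                    continue
                if line[i] == '"':
                    state = "code"
                i += 1
            elif line.startswith(winged, i):
                break  # rest of the line is a winged comment
            elif line.startswith(block[0], i):
                state = "block"
                i += len(block[0])
            else:
                if line[i] == '"':
                    state = "string"
                    has_code = True
                elif not line[i].isspace():
                    has_code = True
                i += 1
        if state == "string":
            state = "code"  # assumption: strings may not span lines
        if has_code:
            sloc += 1
    return sloc
```

Note that the string state is what keeps a comment leader inside a string literal from being mistaken for a real comment, which is why several of the questions below ask about string syntax.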

If your language is in this group, answering about 20 questions will be sufficient to tell loccount how to count source lines (SLOC) and maybe logical lines (LLOC).

A very few languages - mostly very old or very academic ones - have syntax that is peculiar; that is, the generic parser in loccount can’t handle it gracefully. Some notable examples are Algol 68, APL, and INTERCAL. If you want support for a peculiar language you will probably have to write your own parser, or extend the existing one in a non-trivial way. The rest of this document assumes your language is not peculiar.

Key language features

To support your (non-peculiar) language we need to know the following things about it:

What file extension(s) does it use?
If "loccount -s" shows that this extension is already in use, what pattern-match can be performed on a program source to check that it is in your language?
Does the language have block comments? If so, what are the start and end delimiters of a block comment? Are these delimiters required to start at beginning of line?
Do block comments nest? In most versions of Pascal, comments nest, so this is legal:
```
(* This is (* a legal *) block comment *)
```
On the other hand, in C block comments do not nest, so this fails:
```
/* this will throw a syntax error /*  */ <- here after the first end delimiter */
```
The most common block-comment delimiters are /* and */. Pascal-style languages, including the ML family, use (* *). Lisps may use #! !#.
Does the language have winged comments, starting with a delimiter and ending with the following linefeed? If so, what is the winged-comment starter? Common ones are "//", "#", ";", and "--". (We need to know this so we can avoid false-matching on unbalanced string delimiters.)
If the language is interpreted and the interpreter can appear in a Unix hashbang line, does the interpreter have an alternate or variant name that should get its own line item in a LOC report?
Does the language permit a leading hashbang line even though its native winged-comment leader is not #?
The program assumes that ordinary strings are delimited by ASCII double quotes. (It’s important to know this so we don’t false-match on block-comment leaders in strings.) Are linefeeds permitted in strings, or does this raise a syntax error?
Does the language have single quotes as character literals or alternate string delimiters? If so, do they permit embedded newlines?
Does this language use the C backslash convention for escaping string quotes and comment delimiters in string literals? If so, and it has single-quote strings, does the backslash convention also apply in those?
Does the language have an explicit statement terminator? Usually if there is such a character it is an ASCII semicolon. If there is a statement terminator (or separator) we can report LLOC as well as SLOC.
Some languages allow explicit termination with ";" but do not require it, because the compiler can deduce statement boundaries from ends of lines and other syntactic information. Go is a notable example. If this is true of your language, please say so.
If there is no statement terminator, does the language have a statement separator? (That is like a statement terminator except that it is not required before an end of block.) If present, as it is in Pascal and other Algol-like languages, this is usually ";". Note that in languages with this quirk LLOC is reported but may be somewhat under- or over-counted, depending on whether end-of-block itself requires a following semicolon.
Does the language have explicit delimiter tokens for start and end of block? In C-like languages these are "{" and "}"; in (other) Algol- and Pascal-like languages they are usually "begin" and "end".
Does the language support Python-style multiline string literals with triple single quotes, or triple double quotes or both? If so, describe what escape conventions (if any) these support, and whether they may contain single string delimiters.
Does your language use the Pascal convention of { } as additional block comment delimiters?
Does your language have Python-like staircase syntax with significant indents?
Is your language an assembler? This implies winged comments led with ";", "#", or "*".
Does the language have a syntax like the Perl or Ruby __END__ marker that terminates code interpretation in a source file?
Does your language have regular-expression literals? If so, are they begun by a ~/ or a / alone, or by some completely different syntax?
If the language has any other syntactic quirks that could interfere with line counting, please specify them. Unusual forms of block commenting such as Perl-like here-docs qualify. So do additional block-comment or string-literal syntaxes that don’t quite fit a standard model. Any kind of multiline literal is especially likely to be a trouble spot we should know about.
We need a pointer to a definitive reference on the language’s syntax.
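The nesting question above matters because it changes where a block comment ends. The following sketch, assuming a hypothetical helper named strip_block_comments that is not part of loccount, tracks comment depth and shows the difference on the two examples given earlier. It ignores string literals to stay short.

```python
def strip_block_comments(text, start, end, nesting):
    """Return text with block comments removed.

    Hypothetical helper; `nesting` stands in for the answer to the
    "do block comments nest?" question. String literals are ignored
    here to keep the sketch short.
    """
    out = []
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith(start, i) and (depth == 0 or nesting):
            depth += 1  # entering a comment, or a nested one
            i += len(start)
        elif depth > 0 and text.startswith(end, i):
            depth -= 1  # leaving one level of comment
            i += len(end)
        else:
            if depth == 0:
                out.append(text[i])  # ordinary code character
            i += 1
    return "".join(out)


# Pascal-style, nesting allowed: the whole comment disappears.
print(strip_block_comments("x (* This is (* a legal *) block comment *) y",
                           "(*", "*)", True))
# C-style, no nesting: the comment ends at the first end delimiter,
# leaving a stray end delimiter behind in the code.
print(strip_block_comments("a /* b /* c */ d */ e", "/*", "*/", False))
```

In the second call the leftover text after the first end delimiter is treated as code, which is why a wrong answer to the nesting question would skew the counts.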

Note: Most of these features are interpreted now. A few aren’t yet but probably will be in the future.
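As an illustration of why the statement-terminator question leads to an LLOC figure, here is a rough sketch that counts terminators falling outside strings and winged comments. The name count_lloc and its defaults are assumptions made for this example, not loccount's code, and block comments are deliberately left out.

```python
def count_lloc(text, terminator=";", winged="//"):
    """Count single-character statement terminators outside
    double-quoted strings and winged comments.

    Hypothetical helper; block comments are not handled.
    """
    lloc = 0
    for line in text.splitlines():
        in_string = False
        i = 0
        while i < len(line):
            ch = line[i]
            if in_string:
                if ch == "\\":
                    i += 2  # skip the escaped character
                    continue
                if ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif line.startswith(winged, i):
                break  # rest of the line is a comment
            elif ch == terminator:
                lloc += 1
            i += 1
    return lloc
```

For a language with a separator rather than a terminator, the same count can come out slightly low or high, as noted in the question about separators.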

Other useful features

More patterns for recognizing generated files that should be ignored are always useful for reducing noise in the line counts.
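As a rough sketch of what such a pattern can look like, the following Python checks the first few lines of a file for marker comments. The pattern list and the name looks_generated are illustrative assumptions, not the patterns loccount ships with; the first pattern follows the Go convention for marking generated code.

```python
import re

# Illustrative markers only; these are assumptions, not loccount's list.
GENERATED_MARKERS = [
    # Go convention for generated source files.
    re.compile(r"^// Code generated .* DO NOT EDIT\.$", re.MULTILINE),
    # Looser catch-all for "Generated by ..." banners.
    re.compile(r"\bgenerated (automatically )?by\b", re.IGNORECASE),
]


def looks_generated(text, head_lines=10):
    """Return True if a marker appears in the first head_lines lines."""
    head = "\n".join(text.splitlines()[:head_lines])
    return any(p.search(head) for p in GENERATED_MARKERS)
```

Restricting the search to the head of the file is a design choice in this sketch: it keeps the check cheap and avoids matching the marker text where it merely appears in a string or in documentation further down.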

Submitting a patch

The languages this program supports are described in traits.json, a JSON file that is part of the source distribution. Comments in the file, set off with a double slash, describe the semantics of the JSON payload.

The best way to get support is to send a patch to the maintainer that provides two things: a modification to the JSON traits file and a test source file. Don’t forget to run "make testbuild" after you’ve installed the test source in tests/ but before you commit and push.

The test program can be as simple as "Hello World" but should contain at least one block comment, one winged comment, and one statement. It should exhibit string syntax as well. If the language has odd syntactic quirks, an example of each is a good idea to include.

Ideally, an early comment in the example will contain a line that looks something like this:

// FooLang: SLOC=18 LLOC=12

Adjust the name, numbers and comment leader as required. If LLOC is not a meaningful concept for your language, put a zero there.
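A line in this shape is easy to check mechanically. The sketch below shows one way to pull the expected counts out of a test file; the regular expression and the name expected_counts are assumptions for illustration, and the exact format the project's test machinery accepts may differ. It also assumes the language name contains no spaces.

```python
import re

# Assumed shape of the header line: "<Name>: SLOC=<n> LLOC=<n>"
EXPECT_RE = re.compile(r"(?P<name>\S+): SLOC=(?P<sloc>\d+) LLOC=(?P<lloc>\d+)")


def expected_counts(text):
    """Return (name, sloc, lloc) from the first matching line, or None."""
    m = EXPECT_RE.search(text)
    if m is None:
        return None
    return m.group("name"), int(m.group("sloc")), int(m.group("lloc"))


print(expected_counts("// FooLang: SLOC=18 LLOC=12\n"))
```

Because the comment leader is not part of the pattern, the same check works whether the line starts with //, #, ;, or --.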

To write your patch, you will need to answer the feature questions above, then read the header comment in the traits.json file to work out how to turn that information into declarations and code.

We prefer patches to be submitted as MRs in the Git repository, but will accept them by email.

Requesting support

To request support, open a bug on the project’s issue tracker. The bug should answer all the questions given above and include either the source of a test load or a pointer to test source.