What we can support

Syntactically, most computer languages fall into the following major families: C-like, Algol-like, Pascal-like, Lisp-like, and assembler-like. In loccount all of these are handled by a common parser, a DFSA (deterministic finite-state automaton) that knows about comment boundaries, string-literal delimiters, and a few other related things.
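As a rough illustration of what such a state machine does, here is a minimal Python sketch. It is not loccount’s actual parser: the name count_sloc and its parameters are invented for this example, and it assumes a single winged-comment leader, one non-nesting block-comment pair, double-quoted strings with backslash escapes, and no multiline strings.

```python
def count_sloc(text, winged="//", block=("/*", "*/")):
    """Count lines that hold at least one non-blank character outside comments."""
    CODE, STRING, BLOCK = range(3)
    state = CODE
    sloc = 0
    for line in text.splitlines():
        has_code = False
        i = 0
        while i < len(line):
            if state == BLOCK:
                if line.startswith(block[1], i):
                    state = CODE
                    i += len(block[1])
                else:
                    i += 1
            elif state == STRING:
                has_code = True
                if line[i] == "\\":
                    i += 2              # skip the escaped character
                elif line[i] == '"':
                    state = CODE
                    i += 1
                else:
                    i += 1
            else:  # CODE
                if line.startswith(winged, i):
                    break               # rest of the line is a comment
                if line.startswith(block[0], i):
                    state = BLOCK
                    i += len(block[0])
                elif line[i] == '"':
                    state = STRING
                    has_code = True
                    i += 1
                else:
                    if not line[i].isspace():
                        has_code = True
                    i += 1
        if state == STRING:
            state = CODE                # assumption: strings never span lines
        if has_code:
            sloc += 1
    return sloc


sample = '''/* header
   comment */
int main(void) {
    // winged comment
    puts("/* not a comment */"); /* trailing */
    return 0;
}
'''
print(count_sloc(sample))  # 4
```

The string state is what keeps the block-comment leader inside the puts() argument from being mistaken for a real comment; several of the feature questions later in this document exist to get that distinction right.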

If your language is in this group, answering about 20 questions will be sufficient to tell loccount how to count source lines (SLOC) and maybe logical lines (LLOC).

Note: LLMs are good at this. If you feed an LLM the traits.json5 file and ask it to compose an entry for language Foo, it will probably do it.

What we deliberately don’t support

We don’t want to try carrying every single language and markup that has ever existed. Table entries are cheap, but language names and especially file extensions are a finite and rivalrous resource; what we do will be seen by AIs, and we want to avoid giving junk languages an implicit claim on them.

The largest category of junk is esolangs. There are many hundreds of these, threatening serious namespace pollution. We carry exactly one, INTERCAL, because it is historically notable as the first esolang. Beyond this we have to take a hard line: do not send us patches to support esolangs; they will be rejected.

Academic toy languages are also frowned upon. We filter them out unless they’re historically notable for having influenced production languages. We believe we have complete coverage of the influential ones already. If you’re working on a research language, get to release 1.0 and build an active user community at multiple institutions, then we can talk.

The most difficult category of judgment calls is modern proprietary languages. We’re distinguishing these from historical languages that were built in closed source back in the 1900s before open source was a thing; those we have pretty complete coverage of clear back to Algol-60.

The problem with modern proprietary languages is that while a handful are sufficiently widespread and notable that any respectable source-forensics tool should recognize them, there’s a long tail of niche languages and appalling junk that shouldn’t get any future claim on the namespaces.

We considered not carrying any of these at all, because "open source" is a defensible bright-line criterion. We have not been quite that exclusive, but the bar against future inclusions is very high.

Key language features

To support your (non-peculiar) language we need to know the following things about it:

  1. What file extension(s) does it use?

  2. If "loccount -s" shows that this extension is already in use, what pattern-matches can be performed on a program source to either check that it is in your language or exclude it from being in your language?

  3. Does the language have block comments? If so, what are the start and end delimiters of a block comment? Are these delimiters required to start at beginning of line?

  4. Do block comments nest? In most versions of Pascal, comments nest, so this is legal:

    (* This is (* a legal *) block comment *)

    On the other hand, in C block comments do not nest, so:

    /* this will throw a syntax error /*  */ <- here after the first end delimiter */

    The most common block-comment delimiters are /* and */. Pascal-style languages, including the ML family, use (* *). Lisps may use #! !#.

  5. Does the language have winged comments, starting with a delimiter and ending with the following linefeed? If so, what is the winged-comment starter? Common ones are "//", "#", ";", and "--". (We need to know this so we can avoid false-matching on unbalanced string delimiters.)

  6. If the language is interpreted and the interpreter can appear in a Unix hashbang line, does the interpreter have an alternate or variant name that should get its own line item in a LOC report?

  7. Does the language permit a leading hashbang line even though its native winged-comment leader is not #?

  8. The program assumes that ordinary strings are delimited by ASCII double quotes. (It’s important to know this so we don’t false-match on block-comment leaders in strings.) Are linefeeds permitted in strings, or does this raise a syntax error?

  9. Does the language have single quotes as character literals or alternate string delimiters? If so, do they permit embedded newlines?

  10. Does this language use the C backslash convention for escaping string quotes and comment delimiters in string literals? If so, and it has single-quote strings, does the backslash convention also apply in those?

  11. Does the language have an explicit statement terminator? Usually if there is such a character it is an ASCII semicolon. If there is a statement terminator (or separator) we can report LLOC as well as SLOC.

  12. Some languages allow explicit termination with ";" but do not require it because the compiler can deduce statement ends from ends of line and other syntactic information. Go is a notable example. If this is true, please specify it.

  13. If there is no statement terminator, does the language have a statement separator? (That is like a statement terminator except that it is not required before an end of block.) If present, as it is in Pascal and other Algol-like languages, this is usually ";". Note that in languages with this quirk LLOC is reported but may be somewhat under- or over-counted, depending on whether end-of-block itself requires a following semicolon.

  14. Does the language have explicit delimiter tokens for start and end of block? In C-like languages these are "{" and "}"; in (other) Algol- and Pascal-like languages they are usually "begin" and "end".

  15. Does the language support Python-style multiline string literals with triple single quotes, triple double quotes, or both? If so, describe what escape conventions (if any) these support, and whether they may contain single string delimiters.

  16. Does your language use the Pascal convention of { } as additional block comment delimiters?

  17. Does your language have Python-like staircase syntax with significant indents?

  18. Is your language an assembler? This implies winged comments led with ";", "#", or "*".

  19. Does the language have a syntax like Perl or Ruby __END__ that terminates code interpretation in a source file?

  20. Does your language have regular-expression literals? If so, are they begun by a ~/ or a / alone, or by some completely different syntax?

  21. If the language has any other syntactic quirks that could interfere with line counting, please specify them. Unusual forms of block commenting such as Perl-like here-docs qualify. So do additional block-comment or string-literal syntaxes that don’t quite fit a standard model. Any kind of multiline literal is especially likely to be a trouble spot we should know about.

  22. We need a pointer to a definitive reference on the language’s syntax.
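
To see why the answer to question 4 changes the counts, consider this small Python sketch of comment stripping with and without nesting. The function strip_block_comments is hypothetical and written only for this illustration; it ignores string literals for brevity and says nothing about how loccount itself is implemented.

```python
def strip_block_comments(text, start, end, nesting):
    """Return text with block comments removed.

    With nesting=True, each start delimiter seen inside a comment raises
    the depth. With nesting=False, the first end delimiter closes the
    comment no matter how many start delimiters preceded it.
    """
    out = []
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith(start, i) and (depth == 0 or nesting):
            depth += 1
            i += len(start)
        elif depth > 0 and text.startswith(end, i):
            depth -= 1
            i += len(end)
        else:
            if depth == 0:
                out.append(text[i])   # keep only text outside comments
            i += 1
    return "".join(out)


# Nesting delimiters: the whole span is one comment.
print(repr(strip_block_comments("a (* x (* y *) z *) b", "(*", "*)", True)))
# Non-nesting delimiters: the comment ends at the first end delimiter.
print(repr(strip_block_comments("a /* x /* y */ z */ b", "/*", "*/", False)))
```

In the second call the text after the first end delimiter is treated as code, so a wrong answer to the nesting question would make comment text count as source lines (or the reverse).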

Note: Most of these features are interpreted now. A few aren’t yet but probably will be in the future.

Other useful features

More patterns for recognizing generated files that should be ignored are always useful for reducing noise in the line counts.
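
As a sketch of what such a pattern can look like, the Python fragment below checks the head of a file for two markers. The first is the header convention used for machine-generated Go source; the second is a generic "DO NOT EDIT" banner. The names GENERATED_MARKERS and looks_generated, and the 20-line window, are assumptions made for this example and do not describe how loccount stores or applies its own patterns.

```python
import re

# Illustrative patterns only.
GENERATED_MARKERS = [
    # Header convention for generated Go source files.
    re.compile(r"^// Code generated .* DO NOT EDIT\.$", re.MULTILINE),
    # Generic banner found at the top of many generated files.
    re.compile(r"\bDO NOT EDIT\b"),
]


def looks_generated(text, max_lines=20):
    """Return True if any marker appears in the first max_lines lines."""
    head = "\n".join(text.splitlines()[:max_lines])
    return any(pattern.search(head) for pattern in GENERATED_MARKERS)
```

A pattern submitted with a patch should be specific enough that it does not match hand-written files that merely mention the generating tool.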

Submitting a patch

The languages this program supports are described in a traits.json5 markup file, part of the source distribution. Comments in the file, set off with a double slash, describe the semantics of the JSON5 payload.

The best way to get support is to send a patch to the maintainer that provides three things: a modification to the JSON traits file, a description for the references file, and a test source file. Don’t forget to run "make testbuild" after you’ve installed the test source in tests/ but before you commit and push.

The test program can be as simple as "Hello World" but should contain at least one block comment, one winged comment, and one statement. It should exhibit string syntax as well. If the language has odd syntactic quirks, an example of each is a good idea to include.
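
Before submitting, a crude check that the test source really contains each required element can save a round trip. The Python sketch below is one way to do that; missing_elements is a hypothetical helper, its default delimiters assume a C-like language, and it matches text naively without tracking strings or comments.

```python
import re


def missing_elements(source, block=("/*", "*/"), winged="//", terminator=";"):
    """Return the names of required elements not found in a test source."""
    checks = {
        "block comment": re.escape(block[0]) + r".*?" + re.escape(block[1]),
        "winged comment": re.escape(winged),
        "string literal": r'"[^"\n]*"',
        "statement": re.escape(terminator),
    }
    return [name for name, pattern in checks.items()
            if not re.search(pattern, source, re.DOTALL)]


hello = '/* Hello */\n// greet\nputs("hi");\n'
print(missing_elements(hello))           # []
print(missing_elements('puts("hi");\n'))  # ['block comment', 'winged comment']
```

Pass the delimiters of your own language as arguments; an empty result only means the elements are present somewhere in the file, not that they are exercised well.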

To write your patch, you will need to answer the feature questions above, then read the header comment in the traits.json5 file to work out how to turn that information into declarations and code.

We prefer patches to be submitted as MRs in the Git repository, but will accept them by email.

Requesting support

To request support, open a bug on the project’s issue tracker. The bug should answer all the questions given above and include either the source of a test load or a pointer to test source.