SYNOPSIS
loccount [-cdegijlnsuV?] [-x regexp] file-or-dir…
DESCRIPTION
This program counts physical source lines of code (SLOC) and logical lines of code (LLOC) in one or more files or directories given on the command line.
A line of code is counted in SLOC if it includes non-whitespace characters outside the scope of a comment. LLOC is counted by tallying those SLOC lines that carry statement-terminating punctuation.
LLOC reporting is not available in all supported languages, as the concept may not fit the language’s syntax (e.g. the Lisp family) or its line-termination rules would require full parsing. In these cases LLOC will always be reported as 0. On the other hand, LLOC reporting is reliably consistent in languages with C-like statement termination by semicolon.
These definitions are simplistic and arguably lead to undercounting if LLOC is being used as a complexity measure; the author considers it a particular problem that most C macro definitions won’t be counted. However, they have the advantage that they improve comparability of results across broad swathes of different languages.
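As a minimal illustration of these definitions, here is how the rules above apply to a small C fragment (the annotations reflect the definitions stated here, not output captured from the program):

    /* a comment-only line: contributes to neither count */
    int answer = 42;    /* non-whitespace outside a comment, ends in ';': SLOC and LLOC */
    #define ANSWER 42   /* SLOC, but no statement-terminating punctuation, so not LLOC */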
Certain kinds of syntactic errors in source code - notably unbalanced comment and string literal delimiters - make this program likely to produce wrong counts and spurious errors.
It is advisable to run "make clean" or equivalent in your source directory before running this program, though it knows how to detect some common kinds of generated files (such as yacc and lex output, and manual pages or HTML generated by asciidoc) and will ignore them. You can explicitly tag a file as generated by putting the string "GENERATED" somewhere in the first few lines.
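For example, a hand-written marker near the top of a file such as the following (the wording around the keyword is free-form; only the string "GENERATED" matters) will cause the file to be skipped:

    /* GENERATED - do not edit; regenerate from the master source instead. */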
Optionally, this program can perform a cost-to-replicate estimation using the COCOMO I and (if LLOC count is nonzero) COCOMO II models. It uses the "organic" profile of COCOMO, which is generally appropriate for open-source projects.
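For orientation, the textbook COCOMO I organic-mode formulas are roughly as follows (these are the standard published coefficients, quoted here as an aid to interpretation rather than taken from this program's source):

    effort   = 2.4 * (KSLOC ^ 1.05)     person-months
    schedule = 2.5 * (effort ^ 0.38)    months
    cost     = effort * assumed monthly salary * overhead factor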
SLOC/LLOC figures should be used with caution. While they do predict project costs and defect incidence reasonably well, they are not appropriate for use as 'productivity' measures; good code is often less bulky than bad code. Comparing SLOC across languages from different families (for example, Algol-descended vs. Lisp-descended) is also dubious, as these can have greatly differing complexity per line.
With these qualifications, SLOC/LLOC does have some other uses. It is quite effective for tracking changes in complexity and attack surface as a codebase evolves over time.
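One way to do that is to run the program at each release point. A sketch, assuming a Git repository whose primary branch is named main:

    for tag in $(git tag); do
        git checkout -q "$tag"
        printf '%s ' "$tag"
        loccount -j .
    done
    git checkout -q main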
All languages in common use on Unix-like operating systems are supported. For a full list of supported languages, run "loccount -s"; "loccount -l" lists languages for which LLOC computation is available.
The program also emits counts for build recipes - Makefiles, autoconf specifications, scons recipes, and waf scripts. Generated Makefiles are recognized and ignored. An installed copy of waf and any waf build directory is ignored, but a wscript file is not.
Counts for the configuration languages JSON, YAML, TOML, and INI are reported.
The program emits counts for well-known documentation markups as well, including man-page, asciidoc, Markdown, TeX, and others. There is no equivalent of LLOC for these. The -n option disables this feature.
PostScript is a special case. It is usually generated from some other markup and thus not source code, but not always. This program looks for "!PS-Adobe" early in the file as an indication that it was generated, and ignores such files.
Languages are recognized by file extension or filename pattern; executable filenames without an extension are mined for #! lines identifying an interpreter. Files that cannot be classified in this way are skipped, but a list of files skipped in this way is available with the -u option.
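For example, an extensionless executable script whose first line is

    #!/bin/sh

would be classified by its interpreter (here, as shell) rather than by extension.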
Some file types are identified and silently skipped without being reported by -u; these include symlinks, .o, .a, and .so object files, various kinds of image and audio files, and the .pyc/.pyo files produced by the Python interpreter. All files and directories named with a leading dot are also silently skipped (in particular, this ignores metadata associated with version-control systems).
LIMITATIONS
There are some sources of error and confusion that no amount of clever code in this program can abolish.
One has to do with comment nesting in Pascal. ISO 7185:1990, the standard for the language, specifies that comments do not nest; however, important historical and current Pascal compilers support comment nesting. This program assumes that if a block comment start appears within the scope of a block comment, the programmer is working with such a compiler and did that deliberately.
Python detection is slightly flaky. Anything with a .py extension will be classified simply as "Python", not distinguishing between Python 2 and Python 3. Python files without an extension will be correctly detected only when they have a hashbang line containing "python" or "python3"; end-of-lifed versions such as 2 and 1.5 won’t be picked up.
There is a conflict among Objective-C, MATLAB, MUMPS, and nroff/troff over the extensions .m and .mm; this may lead to misidentification of files with these extensions. To avoid problems, ensure that every MATLAB file contains at least one %-led winged comment or %{-led block comment.
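A minimal example of a .m file that follows this advice:

    % winged comment recommended above, marking this as MATLAB rather than Objective-C
    function y = square(x)
        y = x .^ 2;
    end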
What is reported as "ML" includes its dialects Caml and Ocaml, which are not readily distinguishable, but unlikely to be mixed in the same source tree. Standard ML and Concurrent ML have distinguishing file extensions and can therefore be reported separately (as "SML" and "CML" respectively).
The syntax of Algol 60 was not carefully specified. Variants exist in which keywords are distinguished from variable and function names either by being uppercase or by being quoted like strings. This program assumes an Algol dialect with all-caps unquoted keywords. The sticking point here is that COMMENT (uppercase, no quotes) is used to recognize comments.
This program assumes that Lisp and Scheme interpret backslash as C does, that is as an escape for a following string delimiter. While this is true in Common Lisp, Scheme, Emacs Lisp, and Guile, it may not be true in other Lisp dialects.
Manual pages sometimes have idiosyncratic extensions (that is, other than ".man" or a single section digit) which this program will not recognize. Older manual pages sometimes abuse nroff to achieve commenting in ways this program does not recognize, resulting in some overcounting of source lines.
The language attribution "shell" includes bash, dash, ksh, and other similar variants descended from the Bourne shell.
ECMAScript6/es6 files with a .js extension will be reported as Javascript.
OPTIONS
-?
    Display usage summary and quit.

-c
    Report COCOMO cost estimates. Use the coefficients for the "organic" project type, which fits most open-source projects. An EAF of 1.0 is assumed.

-d n
    Set debug level. At > 0, displays various progress messages. Mainly of interest to loccount developers.

-e
    Show the association between languages and file extensions.

-g
    List files normally excluded by the autogeneration filter; do not emit line counts.

-i
    Report file path, line count, and type for each individual path.

-j
    Dump SLOC and LLOC counts as self-describing JSON records for postprocessing.

-l
    List languages for which we can report LLOC and exit. Combine with -i to list languages one per line.

-n
    Do not tally documentation SLOC.

-s
    List languages for which we can report SLOC and exit.

-u
    List paths of files that could not be classified into a known source type or as autogenerated.

-x regexp
    Ignore paths matching the specified Go regular expression.

-V
    Show program version and exit.
Arguments following options may be either directories or files. Directories are recursed into. The report is generated on all paths specified on the command line.
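Some illustrative invocations using the options described above:

    loccount src/
        Count everything under src/ and report per-language totals.

    loccount -i -x '(^|/)vendor/' .
        Report per-file counts for the current tree, skipping any vendor/ subtrees.

    loccount -c .
        Add COCOMO cost-to-replicate estimates to the summary report.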
EXIT VALUES
Normally 0. 1 in -s or -e mode if a non-duplication check on file extensions or hashbangs fails.
HISTORY AND COMPATIBILITY
The algorithms in this code originated with David A. Wheeler’s sloccount utility, version 2.26 of 2004. This program is, however, faster than sloccount, and handles many languages that sloccount does not.
Generally it will produce identical SLOC figures to sloccount for a language supported by both tools; the differences in whole-tree reports will mainly be due to better detection of some files sloccount left unclassified. Notably, for individual C and Perl files you can expect both tools to produce identical SLOC. However, Python counts are different, because sloccount does not recognize and ignore single-quote multiline literals.
A few of sloccount’s tests have been simplified in cases where the complexity came from a rare or edge case that the author judges to have become extinct since 2004.
The reporting formats of loccount 2.x are substantially different from those in the 1.x versions due to absence of any LLOC fields in 1.x.
The base salary used for cost estimation will differ between these tools depending on time of last release.
BUGS
Eiffel indexing comments are counted as code, not text. (This is arguably a feature.)
Literate Haskell (.lhs) is not supported. (This is a regression from sloccount.)
LLOC counts in languages that use a semicolon as an Algol-like statement separator, rather than a terminator, will be a bit low. This group includes Algol, Pascal, Modula, Oberon, and Perl.
Dylan LOC will be a bit high due to its use of semicolon as a terminator for classes and methods as well as statements.
If a Factor program defines words containing embedded ! or ", loccount will be confused.
Fantom documentation comments (led with **) are counted as code.
Comment detection in Forth can be confused by tabs or unusual whitespace following a \ or (, or by strings containing unbalanced parens.
User-facing comment lines in Pkl are counted as code.
REPORTING BUGS
Report bugs to Eric S. Raymond <esr@thyrsus.com>.