Replace recursive depth-first help_resolve with BFS queue + fork-based
parallelism (up to num_cores workers). Workers marshal results back via
pipes; discovered subcommands are enqueued for the next wave.
Also fix usage positional extraction to match "USAGE" without colon
(Go/Cobra style), and skip empty-result check to consider positionals.
Rewrote nushell-integration.md to cover the three-strategy pipeline,
module wrapping, and the generate command. Simplified
runtime-completions.md. Added nixos.md with detailed explanation of
the build-time generation phases, module options, and troubleshooting.