Replace recursive depth-first help_resolve with BFS queue + fork-based
parallelism (up to num_cores workers). Workers marshal results back via
pipes; discovered subcommands are enqueued for the next wave.
Also fix usage positional extraction to match "USAGE" without colon
(Go/Cobra style), and skip empty-result check to consider positionals.