[PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page
Colin Watson
cjwatson at debian.org
Sat Nov 2 19:06:53 UTC 2024

On Sat, Nov 02, 2024 at 05:08:37AM -0500, G. Branden Robinson wrote:
> On GNU/Linux systems, the only man page indexer I know of is Colin
> Watson's man-db--specifically, its mandb(8) program. But it's nicely
> designed so that the "topic and summary description extraction" task is
> delegated to a standalone tool, lexgrog(1), and we can use that.
>
> $ lexgrog /tmp/proc_pid_fdinfo_mini.5
> /tmp/proc_pid_fdinfo_mini.5: parse failed
>
> Oh, damn. I wasn't expecting that. Maybe this is what defeats Michael
> Kerrisk's scraper with respect to groff's man pages.[1]

How embarrassing. Could somebody please file a bug on
https://gitlab.com/man-db/man-db/-/issues to remind me to fix that? (Of
course there'll be a lead time for fixes to get into distributions.)

> Well, I can find a silver lining here, because it gives me an even
> better reason than I had to pitch an idea I've been kicking around for a
> while. Why not enhance groff man(7) to support a mode where _it_ will
> spit out the "Name"/"NAME" section, and only that, _for_ you?
>
> This would be as easy as checking for an option, say '-d EXTRACT=Name',
> and having the package's "TH" and "SH" macro definitions divert
> (literally, with the `di` request) everything _except_ the section of
> interest to a diversion that is then never called/output. (This is
> similar to an m4 feature known as the "black hole diversion".)
>
> All of the features necessary to implement this[2] were part of troff as
> far back as the birth of the man(7) package itself. It's not clear
> to me why it wasn't done back in the 1980s.
>
> lexgrog(1) itself will of course have to stay around for years to come,
> but this could take a significant distraction off of Colin's plate--I
> believe I have seen him grumble about how much *roff syntax he has to
> parse to have the feature be workable, and that's without upstart groff
> maintainers exploring up to every boundary that existed even in 1979 and
> cheerfully exercising their findings in man pages.

lexgrog(1) is a useful (if oddly-named, sorry) debugging tool, but if
you focus on that then you'll end up with a design that's not very
useful. What really matters is indexing the whole system's manual
pages, and mandb(8) does not do that by invoking lexgrog(1) one page at
a time, but rather by running more or less the same code in-process. I
already know that getting acceptable performance for this requires care,
as illustrated by one of the NEWS entries for man-db 2.10.0:

 * Significantly improve `mandb(8)` and `man -K` performance in the common
   case where pages are of moderate size and compressed using `zlib`: `mandb
   -c` goes from 344 seconds to 10 seconds on a test system.

... so I'm prepared to bet that forking nroff one page at a time will be
unacceptably slow. (This also combines with the fact that man-db
applies some sandboxing when it's calling nroff just in case it might
happen that a moderately-sized C++ project has less than 100% perfect
security when doing text processing, which I'm sure everyone agrees
would never happen.)
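
As a rough illustration of what "running the same code in-process" looks
like, here is a deliberately naive Python sketch. It is not man-db's code
(man-db is written in C and copes with far more *roff syntax than this);
the names `extract_whatis` and `index_pages` are hypothetical, and the
parsing rule (plain text lines in the NAME section, split on ` \- `) is an
assumption chosen to keep the example short.

```python
import gzip
import re

# Matches a section heading such as `.SH NAME` or `.SH "Name"`.
SH_RE = re.compile(r'^\.\s*SH\s+"?([^"]*)"?\s*$')


def extract_whatis(source):
    """Return (names, summary) from the NAME section, or None on failure.

    Naive on purpose: lines starting with a control character are skipped
    rather than interpreted, so macro-heavy NAME sections will not parse.
    """
    in_name = False
    words = []
    for line in source.splitlines():
        heading = SH_RE.match(line)
        if heading:
            if in_name:
                break  # the next section ends the NAME section
            in_name = heading.group(1).strip().lower() == "name"
            continue
        if in_name and not line.startswith((".", "'")):
            words.append(line.strip())
    names, sep, summary = " ".join(words).partition(" \\- ")
    if not sep:
        return None
    return [name.strip() for name in names.split(",")], summary.strip()


def index_pages(paths):
    """Index many pages in one process, with no fork per page."""
    index = {}
    for path in paths:
        opener = gzip.open if str(path).endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
            index[str(path)] = extract_whatis(fh.read())
    return index
```

Everything, decompression included, happens in one process, which is the
property that matters for indexing speed. The sketch also shows the
weakness of guessing at the syntax: a NAME section whose names are written
with a macro such as `.B foo` makes `extract_whatis` return None.
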
If it were possible to run nroff over a whole batch of pages and get
output for each of them in one go, then maaaaybe. man-db would need a
reliable way to associate each line (or sometimes multiple lines) of
output with each source file, and of course care would be needed around
error handling and so on. I can see the appeal, in terms of processing
the actual language rather than a pile of hacks that try to guess what
to do with it - but on the other hand this starts to feel like a much
less natural fit for the way nroff is run in every other situation,
where you're processing one document at a time.
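
To make the batch idea concrete, here is a minimal Python sketch of the
consuming side. It assumes a purely hypothetical output convention that
nroff does not provide today: before each page's output, the formatter
would emit a marker line consisting of an ASCII record separator followed
by the source file name. The function name `split_batch_output` and the
marker are both invented for illustration.

```python
# Hypothetical marker: a line starting with ASCII RS (0x1E) names the
# source file whose output follows. No existing nroff emits this.
MARKER = "\x1e"


def split_batch_output(lines):
    """Group batch output lines by the source file they belong to.

    A file with no output still gets an entry (an empty list), so the
    caller can tell "nothing extracted" apart from "never processed".
    """
    results = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(MARKER):
            current = line[len(MARKER):]
            results[current] = []
        elif current is None:
            # Output we cannot attribute to any file: fail loudly.
            raise ValueError("output before first file marker")
        else:
            results[current].append(line)
    return results
```

Even in this toy form the error-handling questions show up: a page that
produces no output has to be distinguishable from one that was skipped,
and any line that cannot be attributed to a file has to be treated as a
failure rather than silently attached to a neighbour.
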
Cheers,
--
Colin Watson (he/him) [cjwatson at debian.org]