[PATCH i-g-t v4 1/2] lib/igt_kmemleak: library to interact with kmemleak

Wed Feb 12 10:01:57 UTC 2025

On Wed, Feb 12, 2025 at 10:18:17AM +0100, Peter Senna Tschudin wrote:
> 
> 
> On 12.02.2025 09:36, Zbigniew Kempczyński wrote:
> > On Tue, Feb 11, 2025 at 08:12:16PM +0100, Peter Senna Tschudin wrote:
> >> Hi Zbigniew,
> >>
> >> Thank you for your comments, please see my answers below.
> >>
> >> On 11.02.2025 17:17, Zbigniew Kempczyński wrote:
> >>> On Mon, Feb 03, 2025 at 10:23:00AM +0100, Peter Senna Tschudin wrote:
> >>>> Hi Kamil,
> >>>>
> >>>> Thank you for your review. Please see below.
> >>>>
> >>>> On 31.01.2025 13:34, Kamil Konieczny wrote:
> >>>>
> >>>> [...]
> >>>>
> >>>>>> +#include <stdio.h>
> >>>>>> +#include <string.h>
> >>>>>> +#include <unistd.h>
> >>>>>> +#include <fcntl.h>
> >>>>>> +#include <errno.h>
> >>>>>
> >>>>> Sort this alphabetically with exception of unistd.h - this
> >>>>> header could be first.
> >>>>
> >>>> That is picky, but ok.
> >>>>
> >>>>
> >>>>>
> >>>>>> +
> >>>>>> +#include "igt_core.h"
> >>>>>> +#include "igt_kmemleak.h"
> >>>>>> +
> >>>>>> +/* We can change the path for unit testing, see igt_kmemleak_init() */
> >>>>>> +static char igt_kmemleak_file[256] = "/sys/kernel/debug/kmemleak";
> >>>>>> +
> >>>>>> +#define MAX_WRITE_RETRIES 5
> >>>>>> +/**
> >>>>>> + * igt_kmemleak_write - Writes the buffer to the file descriptor retrying when
> >>>>>> + * possible.
> >>>>>> + * @fd: The file descriptor to write to.
> >>>>>> + * @buf: Pointer to the data to write.
> >>>>>> + * @count: Total number of bytes to write.
> >>>>>> + *
> >>>>>> + * Returns the total number of bytes written on success, or -1 on failure.
> >>>>>> + */
> >>>>>> +static bool igt_kmemleak_write(int fd, const void *buf, size_t count)
> >>>>>> +{
> >>>>>> +	const char *ptr = buf;
> >>>>>> +	size_t remaining = count;
> >>>>>> +	ssize_t written;
> >>>>>> +	int retries = 0;
> >>>>>> +
> >>>>>> +	while (remaining > 0) {
> >>>>>> +		written = write(fd, ptr, remaining);
> >>>>>> +		if (written > 0) {
> >>>>>> +			ptr += written;
> >>>>>> +			remaining -= written;
> >>>>>> +		} else if (written == -1) {
> >>>>>> +			if (errno == EINTR || errno == EAGAIN) {
> >>>>>> +				/* Retry for recoverable errors */
> >>>>>> +				if (++retries > MAX_WRITE_RETRIES) {
> >>>>>> +					igt_warn("igt_write: Exceeded retry limit\n");
> >>>>>
> >>>>> Do not use test prints in lib, especially those used by runner.
> >>>>
> >>>> Oh, I will. kmemleak is about writing to disk. If that fails, I will tell the user.
> >>>> Why does this restriction applies only to me? See:
> >>>>
> >>>> ---
> >>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools/lib$ git grep igt_warn|wc -l
> >>>> 164
> >>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools/lib$ cd ..
> >>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools$ git grep igt_warn|wc -l
> >>>> 324
> >>>> ---
> >>>>
> >>>> About 50% of all references to igt_warn are in lib, what is the problem with my code?
> >>>
> >>> I'm looking at the code and I have impression this should be part of
> >>> the runner, not igt library. Do I understand correctly that your
> >>> code will be used by the runner only and not by the tests?
> >>
> >> It is difficult to know for now. We have one test that will use the lib
> >> as is: `tests/amdgpu/amd_mem_leak.c`. But that is not the main potential
> >> use case.
> >>
> >> As I describe in the cover letter, transient leaks are kind of tricky,
> >> and we may require to request a scan from a test itself for better
> >> coverage. So in short I honestly do not know if we will have test
> >> integration in the future or not.
> > 
> > Fundamental question is - where do you want to scan/clear - on the runner
> > level or on the test? What will happen if you start to clear-scan/clear
> > both from runner (-k) and from the test itself. I've intentionally
> > added clear first to limit catched leaks to period where test was
> > executed.
> 
> I do not know know. There may be uses for both. Currently I went for the
> runner for two reasons: it is simple, and produced good results.
> 
> However for transient leaks the key seems to be timing and then igt_runner
> does not offer enough granularity. We may need clears and scans at very
> specific places in the code.
> 
> My goal is to have CI experiment with kmemleak on igt_runner, collect
> developer feedback, and then decide if the next step of trying to figure
> it out how to have a more granular test-level scan makes sense and if is
> suitable for our use case.
> 
> Detecting transient leaks is an open problem.
> 
> > 
> > On the runner level we would collect for all tests, on the test level
> > I imagine command line argument passed to the test like -k which would
> > enable the scan on the very beginning. But this would require to rewrite
> > all tests which are using pure main() instead of igt_main.*().
> > If really there's leak and someone wants to catch it, he can always
> > simply wrap around script execution by kmemleak clear-scan/clear.
> 
> Currently we have -keach and -konce with -keach scanning between each test.
> As I said, this produces results, but is not effective for transient leaks.
> 
> Again, I do not know. If the initial evaluation indicates that there may be
> a benefit on extending the tests to trigger kmemleak scans, I see no reason
> for not doing it.
> 
> 
> > 
> > Resume - I would stick to the runner level, I see no judgement to change
> > igt_core / tests to use this library directly.
> 
> Resume: I prefer to wait until the initial feedback from CI and developers.
> I will be happy to move this to runner later if we decide to not walk the
> unknown road of trying to catch as many transient leaks as we can.
> 
> That being said, let me know how to move forward.

Question is to CI folks, for me this scan should be executed as
dedicated CI run with kmemleak on. And I repeat - I don't see reason
we should add this support to the tests, as few line script which
wraps around test by kmemleak clear/scan is enough. We already have
scripts directory so it is good place for such wrapper.

--
Zbigniew

> 
> > 
> >>
> >>>
> >>> Regarding your above question, I think runner should use as much as
> >>> minimal from lib/ directory, because this is collection of functions
> >>> designed for tests, not for runner. I think generic code like data
> >>> structures (lists) are fine, but not the others. Imagine you'll
> >>> incidentally try to use igt_require() - follow the call and see
> >>> the pitfall there.
> >>
> >> Currently the list of lib includes in the runner directory is not
> >> small (I used a dumb regex that looks for includes starting with igt_):
> >>
> >> #include "igt_aux.h"
> >> #include "igt_core.h"
> >> #include "igt_facts.h"
> >> #include "igt_hook.h"
> >> #include "igt_list.h"
> >> #include "igt_taints.h"
> >> #include "igt_vec.h"
> >>
> >>
> >>>
> >>> So if my understanding is correct and this code should be part of the
> >>> runner only your test might be moved to the runner_tests.c.
> >>
> >> It is difficult to know at this stage. At least the amd test I mentioned
> >> earlier will use it.
> > 
> > I've explored runner code even if it is not in the center of my
> > interests. I've seen there're few places where it is tainted by igt
> > library code which might call logging from igt domain like igt_warn()
> > [igt_gettime()], or igt_assert() [igt_vec_elem()]. I've experimented
> > how runner code behaves with those calls, igt_warn() just logs but
> > output format differs than executor.c __logf__(). I think we should
> > be consistent in this case. Asserting __igt_fail_assert() unfortunately
> > uses igt logging but if we ever assert in the runner this will have
> > fatal consequences. IGT library contains a lot of asserts, so functions
> > from them should be meticulously reviewed before using in the runner.
> 
> I see, but there is a very important detail here. Enabling kmemleak scans
> come at a very high runtime cost. The slowdown is expected to be 4X. So
> my judgement tells me that if CI is enduring the pains of a 4X slowdown
> to get kmemleaks results, the minimum we can do is tell them when
> something goes wrong.
> 
> On top of that, those write errors are rare, very rare I would say. I can
> use any other function you want. I am just not ok to fail silently in this
> case.
> 
> A big thank you for your comments.
> 
> > 
> > --
> > Zbigniew
> > 
> >>
> >>>
> >>> BTW please rebase on top of current master, there were changes in
> >>> the runner/settings.c file that affects this series.
> >>
> >> As soon as I am confident this will be merged, I surely will.
> >>
> >> Thanks!
> >>
> >>>
> >>> --
> >>> Zbigniew
> >>>
> >>>>
> >>>>
> >>>>>
> >>>>>> +					return false;
> >>>>>> +				}
> >>>>>> +				continue;
> >>>>>> +			} else {
> >>>>>> +				/* Log unrecoverable error */
> >>>>>> +				igt_warn("igt_write: unrecoverable write error");
> >>>>>
> >>>>> Same here.
> >>>> We should not remove this one.
> >>>>
> >>>>>
> >>>>>> +				return false;
> >>>>>> +			}
> >>>>>> +		} else if (written == 0) {
> >>>>>> +			if (++retries > MAX_WRITE_RETRIES) {
> >>>>>> +				igt_warn("igt_write: Exceeded retry limit\n");
> >>>>>
> >>>>> Same here.
> >>>> We should not remove this one.
> >>>>
> >>>> [...]
> >>>>
> >>>>>> +	FILE *fp;
> >>>>>
> >>>>> Why FILE* and fopen? Could you just use open/read?
> >>>>
> >>>> Why not?
> >>>>
> >>>> ---
> >>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools$ git grep fopen|wc -l
> >>>> 97
> >>>> ---
> >>>>
> >>>>
> >>>> [...]
> >>>>
> >>>>>> +bool igt_kmemleak_init(const char *unit_test_kmemleak_file)
> >>>>>> +{
> >>>>>> +	FILE *fp;
> >>>>>
> >>>>> Same here, open/read.
> >>>>
> >>>> Same here, why not?
> >>>> ---
> >>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools$ git grep fopen|wc -l
> >>>> 97
> >>>> ---
> >>>>
> >>>> [...]
> >>>>
> >>>>>> +#define KMEMLEAKRESFILENAME "kmemleak.txt"
> >>>>>
> >>>>> imho better: IGT_KMEMLEAK_RESFILENAME
> >>>>
> >>>> It reads better indeed. Thanks.
> >>>>
> >>>> [...]
> >>
>