[PATCH i-g-t v4 1/2] lib/igt_kmemleak: library to interact with kmemleak

Wed Feb 12 09:18:17 UTC 2025

On 12.02.2025 09:36, Zbigniew Kempczyński wrote:
> On Tue, Feb 11, 2025 at 08:12:16PM +0100, Peter Senna Tschudin wrote:
>> Hi Zbigniew,
>>
>> Thank you for your comments, please see my answers below.
>>
>> On 11.02.2025 17:17, Zbigniew Kempczyński wrote:
>>> On Mon, Feb 03, 2025 at 10:23:00AM +0100, Peter Senna Tschudin wrote:
>>>> Hi Kamil,
>>>>
>>>> Thank you for your review. Please see below.
>>>>
>>>> On 31.01.2025 13:34, Kamil Konieczny wrote:
>>>>
>>>> [...]
>>>>
>>>>>> +#include <stdio.h>
>>>>>> +#include <string.h>
>>>>>> +#include <unistd.h>
>>>>>> +#include <fcntl.h>
>>>>>> +#include <errno.h>
>>>>>
>>>>> Sort this alphabetically with exception of unistd.h - this
>>>>> header could be first.
>>>>
>>>> That is picky, but ok.
>>>>
>>>>
>>>>>
>>>>>> +
>>>>>> +#include "igt_core.h"
>>>>>> +#include "igt_kmemleak.h"
>>>>>> +
>>>>>> +/* We can change the path for unit testing, see igt_kmemleak_init() */
>>>>>> +static char igt_kmemleak_file[256] = "/sys/kernel/debug/kmemleak";
>>>>>> +
>>>>>> +#define MAX_WRITE_RETRIES 5
>>>>>> +/**
>>>>>> + * igt_kmemleak_write - Writes the buffer to the file descriptor retrying when
>>>>>> + * possible.
>>>>>> + * @fd: The file descriptor to write to.
>>>>>> + * @buf: Pointer to the data to write.
>>>>>> + * @count: Total number of bytes to write.
>>>>>> + *
>>>>>> + * Returns the total number of bytes written on success, or -1 on failure.
>>>>>> + */
>>>>>> +static bool igt_kmemleak_write(int fd, const void *buf, size_t count)
>>>>>> +{
>>>>>> +	const char *ptr = buf;
>>>>>> +	size_t remaining = count;
>>>>>> +	ssize_t written;
>>>>>> +	int retries = 0;
>>>>>> +
>>>>>> +	while (remaining > 0) {
>>>>>> +		written = write(fd, ptr, remaining);
>>>>>> +		if (written > 0) {
>>>>>> +			ptr += written;
>>>>>> +			remaining -= written;
>>>>>> +		} else if (written == -1) {
>>>>>> +			if (errno == EINTR || errno == EAGAIN) {
>>>>>> +				/* Retry for recoverable errors */
>>>>>> +				if (++retries > MAX_WRITE_RETRIES) {
>>>>>> +					igt_warn("igt_write: Exceeded retry limit\n");
>>>>>
>>>>> Do not use test prints in lib, especially those used by runner.
>>>>
>>>> Oh, I will. kmemleak is about writing to disk. If that fails, I will tell the user.
>>>> Why does this restriction applies only to me? See:
>>>>
>>>> ---
>>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools/lib$ git grep igt_warn|wc -l
>>>> 164
>>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools/lib$ cd ..
>>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools$ git grep igt_warn|wc -l
>>>> 324
>>>> ---
>>>>
>>>> About 50% of all references to igt_warn are in lib, what is the problem with my code?
>>>
>>> I'm looking at the code and I have impression this should be part of
>>> the runner, not igt library. Do I understand correctly that your
>>> code will be used by the runner only and not by the tests?
>>
>> It is difficult to know for now. We have one test that will use the lib
>> as is: `tests/amdgpu/amd_mem_leak.c`. But that is not the main potential
>> use case.
>>
>> As I describe in the cover letter, transient leaks are kind of tricky,
>> and we may require to request a scan from a test itself for better
>> coverage. So in short I honestly do not know if we will have test
>> integration in the future or not.
> 
> Fundamental question is - where do you want to scan/clear - on the runner
> level or on the test? What will happen if you start to clear-scan/clear
> both from runner (-k) and from the test itself. I've intentionally
> added clear first to limit catched leaks to period where test was
> executed.

I do not know know. There may be uses for both. Currently I went for the
runner for two reasons: it is simple, and produced good results.

However for transient leaks the key seems to be timing and then igt_runner
does not offer enough granularity. We may need clears and scans at very
specific places in the code.

My goal is to have CI experiment with kmemleak on igt_runner, collect
developer feedback, and then decide if the next step of trying to figure
it out how to have a more granular test-level scan makes sense and if is
suitable for our use case.

Detecting transient leaks is an open problem.

> 
> On the runner level we would collect for all tests, on the test level
> I imagine command line argument passed to the test like -k which would
> enable the scan on the very beginning. But this would require to rewrite
> all tests which are using pure main() instead of igt_main.*().
> If really there's leak and someone wants to catch it, he can always
> simply wrap around script execution by kmemleak clear-scan/clear.

Currently we have -keach and -konce with -keach scanning between each test.
As I said, this produces results, but is not effective for transient leaks.

Again, I do not know. If the initial evaluation indicates that there may be
a benefit on extending the tests to trigger kmemleak scans, I see no reason
for not doing it.

> 
> Resume - I would stick to the runner level, I see no judgement to change
> igt_core / tests to use this library directly.

Resume: I prefer to wait until the initial feedback from CI and developers.
I will be happy to move this to runner later if we decide to not walk the
unknown road of trying to catch as many transient leaks as we can.

That being said, let me know how to move forward.

> 
>>
>>>
>>> Regarding your above question, I think runner should use as much as
>>> minimal from lib/ directory, because this is collection of functions
>>> designed for tests, not for runner. I think generic code like data
>>> structures (lists) are fine, but not the others. Imagine you'll
>>> incidentally try to use igt_require() - follow the call and see
>>> the pitfall there.
>>
>> Currently the list of lib includes in the runner directory is not
>> small (I used a dumb regex that looks for includes starting with igt_):
>>
>> #include "igt_aux.h"
>> #include "igt_core.h"
>> #include "igt_facts.h"
>> #include "igt_hook.h"
>> #include "igt_list.h"
>> #include "igt_taints.h"
>> #include "igt_vec.h"
>>
>>
>>>
>>> So if my understanding is correct and this code should be part of the
>>> runner only your test might be moved to the runner_tests.c.
>>
>> It is difficult to know at this stage. At least the amd test I mentioned
>> earlier will use it.
> 
> I've explored runner code even if it is not in the center of my
> interests. I've seen there're few places where it is tainted by igt
> library code which might call logging from igt domain like igt_warn()
> [igt_gettime()], or igt_assert() [igt_vec_elem()]. I've experimented
> how runner code behaves with those calls, igt_warn() just logs but
> output format differs than executor.c __logf__(). I think we should
> be consistent in this case. Asserting __igt_fail_assert() unfortunately
> uses igt logging but if we ever assert in the runner this will have
> fatal consequences. IGT library contains a lot of asserts, so functions
> from them should be meticulously reviewed before using in the runner.

I see, but there is a very important detail here. Enabling kmemleak scans
come at a very high runtime cost. The slowdown is expected to be 4X. So
my judgement tells me that if CI is enduring the pains of a 4X slowdown
to get kmemleaks results, the minimum we can do is tell them when
something goes wrong.

On top of that, those write errors are rare, very rare I would say. I can
use any other function you want. I am just not ok to fail silently in this
case.

A big thank you for your comments.

> 
> --
> Zbigniew
> 
>>
>>>
>>> BTW please rebase on top of current master, there were changes in
>>> the runner/settings.c file that affects this series.
>>
>> As soon as I am confident this will be merged, I surely will.
>>
>> Thanks!
>>
>>>
>>> --
>>> Zbigniew
>>>
>>>>
>>>>
>>>>>
>>>>>> +					return false;
>>>>>> +				}
>>>>>> +				continue;
>>>>>> +			} else {
>>>>>> +				/* Log unrecoverable error */
>>>>>> +				igt_warn("igt_write: unrecoverable write error");
>>>>>
>>>>> Same here.
>>>> We should not remove this one.
>>>>
>>>>>
>>>>>> +				return false;
>>>>>> +			}
>>>>>> +		} else if (written == 0) {
>>>>>> +			if (++retries > MAX_WRITE_RETRIES) {
>>>>>> +				igt_warn("igt_write: Exceeded retry limit\n");
>>>>>
>>>>> Same here.
>>>> We should not remove this one.
>>>>
>>>> [...]
>>>>
>>>>>> +	FILE *fp;
>>>>>
>>>>> Why FILE* and fopen? Could you just use open/read?
>>>>
>>>> Why not?
>>>>
>>>> ---
>>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools$ git grep fopen|wc -l
>>>> 97
>>>> ---
>>>>
>>>>
>>>> [...]
>>>>
>>>>>> +bool igt_kmemleak_init(const char *unit_test_kmemleak_file)
>>>>>> +{
>>>>>> +	FILE *fp;
>>>>>
>>>>> Same here, open/read.
>>>>
>>>> Same here, why not?
>>>> ---
>>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools$ git grep fopen|wc -l
>>>> 97
>>>> ---
>>>>
>>>> [...]
>>>>
>>>>>> +#define KMEMLEAKRESFILENAME "kmemleak.txt"
>>>>>
>>>>> imho better: IGT_KMEMLEAK_RESFILENAME
>>>>
>>>> It reads better indeed. Thanks.
>>>>
>>>> [...]
>>