[PATCH i-g-t v4 1/2] lib/igt_kmemleak: library to interact with kmemleak

Wed Feb 12 13:12:44 UTC 2025

On 12.02.2025 11:01, Zbigniew Kempczyński wrote:
> On Wed, Feb 12, 2025 at 10:18:17AM +0100, Peter Senna Tschudin wrote:
>>
>>
>> On 12.02.2025 09:36, Zbigniew Kempczyński wrote:
>>> On Tue, Feb 11, 2025 at 08:12:16PM +0100, Peter Senna Tschudin wrote:
>>>> Hi Zbigniew,
>>>>
>>>> Thank you for your comments, please see my answers below.
>>>>
>>>> On 11.02.2025 17:17, Zbigniew Kempczyński wrote:
>>>>> On Mon, Feb 03, 2025 at 10:23:00AM +0100, Peter Senna Tschudin wrote:
>>>>>> Hi Kamil,
>>>>>>
>>>>>> Thank you for your review. Please see below.
>>>>>>
>>>>>> On 31.01.2025 13:34, Kamil Konieczny wrote:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>> +#include <stdio.h>
>>>>>>>> +#include <string.h>
>>>>>>>> +#include <unistd.h>
>>>>>>>> +#include <fcntl.h>
>>>>>>>> +#include <errno.h>
>>>>>>>
>>>>>>> Sort this alphabetically with exception of unistd.h - this
>>>>>>> header could be first.
>>>>>>
>>>>>> That is picky, but ok.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> +#include "igt_core.h"
>>>>>>>> +#include "igt_kmemleak.h"
>>>>>>>> +
>>>>>>>> +/* We can change the path for unit testing, see igt_kmemleak_init() */
>>>>>>>> +static char igt_kmemleak_file[256] = "/sys/kernel/debug/kmemleak";
>>>>>>>> +
>>>>>>>> +#define MAX_WRITE_RETRIES 5
>>>>>>>> +/**
>>>>>>>> + * igt_kmemleak_write - Writes the buffer to the file descriptor retrying when
>>>>>>>> + * possible.
>>>>>>>> + * @fd: The file descriptor to write to.
>>>>>>>> + * @buf: Pointer to the data to write.
>>>>>>>> + * @count: Total number of bytes to write.
>>>>>>>> + *
>>>>>>>> + * Returns the total number of bytes written on success, or -1 on failure.
>>>>>>>> + */
>>>>>>>> +static bool igt_kmemleak_write(int fd, const void *buf, size_t count)
>>>>>>>> +{
>>>>>>>> +	const char *ptr = buf;
>>>>>>>> +	size_t remaining = count;
>>>>>>>> +	ssize_t written;
>>>>>>>> +	int retries = 0;
>>>>>>>> +
>>>>>>>> +	while (remaining > 0) {
>>>>>>>> +		written = write(fd, ptr, remaining);
>>>>>>>> +		if (written > 0) {
>>>>>>>> +			ptr += written;
>>>>>>>> +			remaining -= written;
>>>>>>>> +		} else if (written == -1) {
>>>>>>>> +			if (errno == EINTR || errno == EAGAIN) {
>>>>>>>> +				/* Retry for recoverable errors */
>>>>>>>> +				if (++retries > MAX_WRITE_RETRIES) {
>>>>>>>> +					igt_warn("igt_write: Exceeded retry limit\n");
>>>>>>>
>>>>>>> Do not use test prints in lib, especially those used by runner.
>>>>>>
>>>>>> Oh, I will. kmemleak is about writing to disk. If that fails, I will tell the user.
>>>>>> Why does this restriction applies only to me? See:
>>>>>>
>>>>>> ---
>>>>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools/lib$ git grep igt_warn|wc -l
>>>>>> 164
>>>>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools/lib$ cd ..
>>>>>> psennats at friendship7-home:~/UPSTREAM/igt-gpu-tools$ git grep igt_warn|wc -l
>>>>>> 324
>>>>>> ---
>>>>>>
>>>>>> About 50% of all references to igt_warn are in lib, what is the problem with my code?
>>>>>
>>>>> I'm looking at the code and I have impression this should be part of
>>>>> the runner, not igt library. Do I understand correctly that your
>>>>> code will be used by the runner only and not by the tests?
>>>>
>>>> It is difficult to know for now. We have one test that will use the lib
>>>> as is: `tests/amdgpu/amd_mem_leak.c`. But that is not the main potential
>>>> use case.
>>>>
>>>> As I describe in the cover letter, transient leaks are kind of tricky,
>>>> and we may require to request a scan from a test itself for better
>>>> coverage. So in short I honestly do not know if we will have test
>>>> integration in the future or not.
>>>
>>> Fundamental question is - where do you want to scan/clear - on the runner
>>> level or on the test? What will happen if you start to clear-scan/clear
>>> both from runner (-k) and from the test itself. I've intentionally
>>> added clear first to limit catched leaks to period where test was
>>> executed.
>>
>> I do not know know. There may be uses for both. Currently I went for the
>> runner for two reasons: it is simple, and produced good results.
>>
>> However for transient leaks the key seems to be timing and then igt_runner
>> does not offer enough granularity. We may need clears and scans at very
>> specific places in the code.
>>
>> My goal is to have CI experiment with kmemleak on igt_runner, collect
>> developer feedback, and then decide if the next step of trying to figure
>> it out how to have a more granular test-level scan makes sense and if is
>> suitable for our use case.
>>
>> Detecting transient leaks is an open problem.
>>
>>>
>>> On the runner level we would collect for all tests, on the test level
>>> I imagine command line argument passed to the test like -k which would
>>> enable the scan on the very beginning. But this would require to rewrite
>>> all tests which are using pure main() instead of igt_main.*().
>>> If really there's leak and someone wants to catch it, he can always
>>> simply wrap around script execution by kmemleak clear-scan/clear.
>>
>> Currently we have -keach and -konce with -keach scanning between each test.
>> As I said, this produces results, but is not effective for transient leaks.
>>
>> Again, I do not know. If the initial evaluation indicates that there may be
>> a benefit on extending the tests to trigger kmemleak scans, I see no reason
>> for not doing it.
>>
>>
>>>
>>> Resume - I would stick to the runner level, I see no judgement to change
>>> igt_core / tests to use this library directly.
>>
>> Resume: I prefer to wait until the initial feedback from CI and developers.
>> I will be happy to move this to runner later if we decide to not walk the
>> unknown road of trying to catch as many transient leaks as we can.
>>
>> That being said, let me know how to move forward.
> 
> Question is to CI folks, for me this scan should be executed as
> dedicated CI run with kmemleak on. And I repeat - I don't see reason
> we should add this support to the tests, as few line script which
> wraps around test by kmemleak clear/scan is enough. We already have
> scripts directory so it is good place for such wrapper.

So I am moving the code to the runner directory, and using printf() instead
of igt_warn(). Any other request for the series?

[...]