[systemd-devel] [PATCH v3 2/3] cxgb4: use module_long_probe_init()

Sat Aug 16 22:02:41 PDT 2014

On Fri, Aug 15, 2014 at 11:15:00AM -0700, Casey Leedom wrote:
>
> On 08/14/2014 12:53 PM, Luis R. Rodriguez wrote:
>> On Thu, Aug 14, 2014 at 09:42:49AM -0700, Casey Leedom wrote:
>>> On 08/13/2014 04:33 PM, Anish Bhatt wrote:
>>>> Adding Casey who's actually incharge of this code and missing from the CC list
>>>    Thanks Anish!
>>>
>>>    As I mentioned to Anish, there are fundamentally two problems here in the
>>> time being consumed by the cxgb4 PCI probe() function:
>>>
>>>   1. When various firmware files aren't present, request_firmware()
>>>      can take a very long time.  This is easily solved by using
>>>      request_firmware_direct() and I certainly have no objection to that.
>> I sent a patch for this a while ago, since there is no objection if
>> you'd like to apply the patch:
>>
>> http://patchwork.ozlabs.org/patch/363756/
>
>   I think I'll leave that to Hariprasad since he's been doing all the 
> direct kernel.org work recently.  Also, I honestly don't understand the 
> differences between request_firmware() and request_firmware_direct().  It 
> sounds like request_firmware() will invoke some user-level process if it 
> can't find the specified firmware.  What that user-level process will do is 
> unknown and of unknown value (and undocumented as far as I can tell).

The userspace helper ended up being a bad idea so its being phased out
and will eventually be removed. udev used to be the default helper
and it would write firmware using the sysfs interface, and it would
have a timeout of 60 seconds. Other drivers could provide their own
userspace helper and last I checked we only had one Dell device driver
left which still required this kernel feature. The goal is to phase
that out and simply nuke it, in favor for the direct kernel firmware
loading, the kernel essentially opens the file and reads its contents
directly form the filesystem, this removes the silly timeout. udev
is part of systemd and a patch in May tried to remove the udev firmware
loader [0], it however was decided it won't be until the dell rbu
driver gets fixed to not require kernel side support for the userspace
firmware helper [1]. I would not be surprised if this happens soon and
you should expect Linux distributions avoiding enabling the kernel
CONFIG_FW_LOADER_USER_HELPER, as it essentially has an incurred
built-in timeout of 60 seconds every single time request_firmware()
is used. request_firmware_direct() enables device drivers that know
that the firmwares being requested are optional to skip the usermode
helper and therefore skips the 60 second timeout all together.
Once the usermode helper is removed request_firmware() will behave
the same as request_firmware_direct() with the only difference
being a print of failure if the firmware was expected to be present.
Right now using request_firmware_direct() will help speed things up
for drivers that have optional firmware as the usermode helper is
still present, that's all.

[0] http://lists.freedesktop.org/archives/systemd-devel/2014-May/019587.html
[1] http://lists.freedesktop.org/archives/systemd-devel/2014-May/019631.html

>> Apart from that you also want to use asynch firmware loading but
>> to use that properly (I also had sent some basic initial patches
>> for asynch firmware loading but without moving out other logic
>> yet) you want to also let driver initalization complete
>> asynchronously later.
>
>   So it looked like the module_init() function is what the various patches 
> have moved into an asynchronous thread, right? 

Its a temporary hack to fix drivers which are taking over 30 seconds on
init, but it also helps us annotate drivers that are not adhering to the
new 30 second requirement imposed on us by systemd and now supported
as reasonable kernel policy. I would in no way consider it asynchronous
functionality, you should call it a hack and annotation of broken
drivers with long probes that need work.

> That should address the 
> worry that a failing PCI probe() wouldn't be able to return the failure 
> status to the PCI infrastructure (and ensure that the PCI remove() function 
> wouldn't be erroneously called for a device which failed probe()).

Right, that should not happen with this hack, the only difference
here is that loading the driver will return immediately and return
a success, while behidn the scenes it goes on running the driver's
own old probe. For this reason the module loading will return
sucessfully immediately unless there is no memory for the driver,
for example. This is acceptable given that driver's init sequence
should initialize the device driver, probing for devices should be
handled asynchronously behind the scenes. Proper asynchronous probing
will vary depending on the subsystem and architecture of the driver,
using asynchronous firmware loading is just one of the strategies
drivers should be using. Your driver obviously needs much more work.

>>>   2. When there are multiple adapters present in a system which
>>>      need firmware downloaded, each one individually may not take
>>>      a ton of time but together they can exceed simple Module Load
>>>      Timeouts.  There's not a simple answer here.
>> I had originally recommended to write your own platform driver for
>> this and have each port probe but Greg has provided the last advice
>> for this on the patch I sent to add deferred probe support from
>> init, his recommendation was for you to write your own bus code for
>> the driver:
>>
>> http://www.spinics.net/lists/linux-scsi/msg76695.html
>
>   I'm unfamiliar with the term "platform driver"

drivers/base/platform.c
include/linux/platform_device.h

> but the "probe()" is of 
> the adapter, not ports.  Our driver attaches to a single PCI Physical 
> Function and manages all adapter resources (including instantiating 
> multiple Network Devices) from that single PF.  This allows the driver to 
> manage adapter resources as a allocation pools, etc.

The idea is to have each network device be initialized through an internal
bus logic.

>   The probe() function verifies that the PCI Device is what we want to 
> manage and performs all adapter initialization.  This has been what "Device 
> Probe" routine have been doing for more than forty years (predating UNIX).  
> Some OSes have separately abstracted the concept of Device Initialization 
> but UNIX/Linux have mostly stayed with the single Device Probe function ... 
>  So I'm not sure where people are asking us to move the Device 
> Initialization code to ...

There's a difference between driver initialization and device driver probing.
On Linux when device drivers have autoprobe enabled (most) the driver's own
probe will be run if a device is found to have a match right after the driver
is initialization is completed, this in turn means that *both* init and probe
will be called upon driver initialization right now.

You therefore are right to yell bloody murder as cxgb4_init_module() doesn't
do anything other than register the PCI driver with pci_driver_register().
Its your driver's PCI probe which takes eons.

pci_driver_register() --> bus_add_driver() --> driver_attach()

Now, one could argue and IMHO it would be reasonable that the driver's
probing should be dealt with separately, it at least seems that's what
we want, given that its not driver's init which is long, its the driver's
own probe which we wish to start behaving asynchronously. We could however
generalize offloading probe on a kthread as well, but its not clear to me
if this is desirable, Greg? I tried this on my system and it did seem to
help with boot time, systemd-analyze yield this before and after:

Startup finished in 4.271s (kernel) + 2.547s (initrd) + 30.868s (userspace) = 37.686s
Startup finished in 3.483s (kernel) + 2.455s (initrd) + 31.580s (userspace) = 37.519s

A patch for this is supplied at the bottom in case folks would like to try.
I did run into an issue with my built in keyboard not working though but
didn't find why, a USB keyboard however did the trick.

>   For us, this Device Probe adapter initialization includes downloading the 
> main chip firmware (if needed when the adapter contains old unsupported 
> firmware) and possibly external PHY firmware as well for 10Gb/s "BT" cards 
> ...  Even without the request_firmware() user-level process issues, simply 
> FLASH'ing both firmware images on the BT cards can take quite a while.

Yup, yeah this is what Greg recommended could be dealt with through a device
specific bus.

>>>    Part of the problem here is that it's a Module Load Timeout instead of a
>>> per-device Probe Timeout.
>> Seems like you can fix this with a bus driver.
>
>   I apologize, again I'm unfamiliar with this term with respect to Linux.  
> (Not surprising since I'm still a neophyte with regard to Linux.)

drivers/ssb/ is an example but perhaps Greg can provide a better more
suitable example ?

>>
>>>    Part of the problem is that the current
>>> architecture has Device Probe happening out of the Module Initialization
>>> when we call pci_register_driver() with our PCI Device ID Table.
>>>
>>>    Running the Device Probes asynchronously has been discussed but that has
>>> the problem that it's then impossible to return the Device Probe Status.
>>> This is a problem for Driver Fallback and, if the probe fails, we're not
>>> supposed to call the Device Remove function. To make this work, the
>>> synchronous/asynchronous boundary would really need to be up in the PCI
>>> Infrastructure layer so the Device Probe status could be captured in the
>>> normal logic.  This would be a moderately large change there ...
>> Some maintainers consider most of the work to get what you need done
>> simple, I've tried to explain it ain't so, so glad you provided a bit
>> of details here. To be clear its not just about asynch firmware loading,
>> you need a bit more work. Can you evaluate using a bus driver?
>
>   Hhmmm, a second reference to this term.  Obviously I need a pointer here 
> and an understanding of how a Bus Driver differs from "Normal Drivers" ...

Sure, start looking at ssb and I can see if there are other examples.

>>>    Deferring the Device Initialization till the first "ifup" has also been
>>> discussed and is certainly possible, though a moderately large
>>> architectural change to every driver which needs it.  It also has the
>>> unfortunate effect of introducing random large delays directly on user
>>> commands.  From a User Experience perspective I would tend to want such
>>> large delays in the Device Probe
>> You should just use asynch firmware loading there and only once your
>> driver is done loading firmware start exposing the device(s) as you
>> see fit with your bus driver.
>
>   The problem here is that our Device Initialization depends intimately on 
> the main firmware.  Our T3 chip also had firmware but with T4/T5 (and 
> beyond) we've moved a lot more functionality into the main firmware — 
> including Chip initialization — which has allowed us to simplify the Host 
> Drivers and also hid most of the chip version differences from the Host 
> Driver.

OK. Regardless you should use asynch firmware loading.

>>> .  But that's something that really
>>> deserves a real User Interaction study rather than throwing a dart.
>>>
>>>    On the whole, I think that introducing these Module Load Timeouts hasn't
>>> been well thought out with respect to the repercussions and I'd be more
>>> inclined to back that out till a well thought out design is developed.  But
>>> I'm here for the discussion.
>> The way that the 30 second timeout was introduced as a new driver
>> initialization requirement was certainly not ideal specially since
>> the respective systemd patch that intended to trigger the SIGKILL on
>> kmod module loading only took effect once kernel commit 786235ee
>> went in about a year later, and since the original systemd commit
>> was only addressing asynchronous firmware loading as a possible
>> issue that drivers may need to fix. The cxgb4 driver is a good
>> example that needs quite a bit of more work. Regardless systemd
>> folks are right -- but again, having this be introduced as a new
>> requirement that otherwise simply kills drivers seems a bit too
>> aggressive specially if its killing boot on some systems due to
>> delays on storage drivers. What's done is done -- and we need to
>> move on. We already reviewed twice now reverting 786235ee and that
>> won't happen, as a compromise we're looking for an easy agreeable
>> general driver work around that would both circumvent the issue
>> and let us easily grep for broken drivers. The deferred probe trick
>> was the first approach and this series addresses the more agreeable
>> solution. This 2 line patch then is what we are looking as work
>> around until your driver gets properly fixed.
>>
>> Apart from these kernel changes there are systemd changes we've
>> looked at modifying, Hannes' patch 9719859c07a, now merged upstream on
>> systemd lets you override the timeout value through the kernel command
>> line. This will only help for all systems if you use a high enough
>> large timeout value, or on a case by case basis for each system.
>> I recently proposed replacing a kill for a warn only for udev
>> kmod built in commands, that's unacceptable for systemd's architecture
>> though so the last thing I proposed instead to use *for now* is a
>> multiplier for each different type of udev built-in command and
>> for kmod we'd use a high enough value, the timeout therefore would
>> be really large for module loading for now, but we'd still want to
>> collect logs of drivers taking long to probe. That's still being
>> discussed [0] but my hope is that with this series and that other
>> systemd discussion we'll have covered both areas affected and have
>> a good strategy to move forward with this new driver requirement.
>>
>> [0] http://comments.gmane.org/gmane.comp.sysutils.systemd.devel/21689
>
>   Okay, obviously a lot of history here.  I think I'm stuck with the basic 
> desire to return a Device Probe failure if the adapter fails initialization 
> but also being told that we can't take "too long" performing the Device 
> Initialization portion of the Device Probe ...  Where should Device 
> Initialization be performed if not in the Device Probe routine?

Yeah its a good point, its important we highlight that all along its
not driver's init sequence that is taking long, its the probe that
is taking long and its because we trigger that right after we
initialize the driver, immediately if autoprobe is enabled on the
bus. Regardless though things like asynch firmware loading is
something folks should strive for -- that no one is going to argue
with, its the other aspects of probing that you also need to work
on such as a bus driver but -- now that we've reached this
clarification I do wonder if it makes sense to *not* have drivers
probe calls spawned immediately and instead have them run through
kthreads. Example patch on how to do that below.

  Luis

---
 drivers/base/dd.c      | 20 +++++++++++++++++++-
 include/linux/device.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index e4ffbcf..b0f54db 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -482,6 +482,12 @@ static int __driver_attach(struct device *dev, void *data)
 	return 0;
 }
 
+int driver_attach_all(void *arg)
+{
+	struct device_driver *drv = (struct device_driver *) arg;
+	return bus_for_each_dev(drv->bus, NULL, drv, __driver_attach);
+}
+
 /**
  * driver_attach - try to bind driver to devices.
  * @drv: driver.
@@ -493,7 +499,13 @@ static int __driver_attach(struct device *dev, void *data)
  */
 int driver_attach(struct device_driver *drv)
 {
-	return bus_for_each_dev(drv->bus, NULL, drv, __driver_attach);
+	drv->attach_thread = kthread_create(driver_attach_all, drv, "%s_%s",
+					    "driver_attach", drv->name);
+	if (IS_ERR(drv->attach_thread))
+		return PTR_ERR(drv->attach_thread);
+	get_task_struct(drv->attach_thread);
+	wake_up_process(drv->attach_thread);
+	return 0;
 }
 EXPORT_SYMBOL_GPL(driver_attach);
 
@@ -562,6 +574,11 @@ void driver_detach(struct device_driver *drv)
 {
 	struct device_private *dev_prv;
 	struct device *dev;
+	int err;
+
+	err = kthread_stop(drv->attach_thread);
+	if (err)
+		return;
 
 	for (;;) {
 		spin_lock(&drv->p->klist_devices.k_lock);
@@ -586,4 +603,5 @@ void driver_detach(struct device_driver *drv)
 			device_unlock(dev->parent);
 		put_device(dev);
 	}
+	put_task_struct(drv->attach_thread);
 }
diff --git a/include/linux/device.h b/include/linux/device.h
index dc7c0ba7..fa2bced 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -235,6 +235,7 @@ struct device_driver {
 
 	bool suppress_bind_attrs;	/* disables bind/unbind via sysfs */
 
+	struct task_struct *attach_thread;
 	const struct of_device_id	*of_match_table;
 	const struct acpi_device_id	*acpi_match_table;
 
-- 
2.0.3