[PATCH] accel/qaic: Add crashdump to Sahara

Bjorn Andersson bjorn.andersson at oss.qualcomm.com
Tue Oct 15 19:04:09 UTC 2024


On Tue, Oct 15, 2024 at 12:34:29PM -0600, Jeffrey Hugo wrote:
> On 10/14/2024 3:52 PM, Bjorn Andersson wrote:
> > On Wed, Sep 18, 2024 at 09:52:54AM -0600, Jeffrey Hugo wrote:
[..]
> > >   struct sahara_context {
> > >   	struct sahara_packet		*tx[SAHARA_NUM_TX_BUF];
> > >   	struct sahara_packet		*rx;
> > > -	struct work_struct		work;
> > > +	struct work_struct		fw_work;
> > > +	struct work_struct		dump_work;
> > >   	struct mhi_device		*mhi_dev;
> > >   	const char			**image_table;
> > >   	u32				table_size;
> > >   	u32				active_image_id;
> > >   	const struct firmware		*firmware;
> > > +	u64				dump_table_address;
> > > +	u64				dump_table_length;
> > > +	size_t				rx_size;
> > > +	size_t				rx_size_requested;
> > > +	void				*mem_dump;
> > > +	size_t				mem_dump_sz;
> > > +	struct sahara_dump_table_entry	*dump_image;
> > > +	u64				dump_image_offset;
> > > +	void				*mem_dump_freespace;
> > > +	u64				dump_images_left;
> > 
> > That's a lot of images; and it's just a natural number. How about
> > "unsigned int" instead to convey that the size "doesn't matter"?
> 
> Hmm, this is derived from a 64-bit value that comes from the device.  If we
> downgrade this to uint (32-bit) we'd need to do some overflow checking.
> Having a matched type and not worrying about the conversion makes me feel
> better.
> 
> You still prefer uint?
> 

I'm fine with this motivation.

> > 
> > > +	bool				is_mem_dump_mode;
> > >   };
> > >   static const char *aic100_image_table[] = {
> > > @@ -153,6 +236,8 @@ static void sahara_send_reset(struct sahara_context *context)
> > >   {
> > >   	int ret;
> > > +	context->is_mem_dump_mode = false;
> > > +
> > >   	context->tx[0]->cmd = cpu_to_le32(SAHARA_RESET_CMD);
> > >   	context->tx[0]->length = cpu_to_le32(SAHARA_RESET_LENGTH);
> > > @@ -186,7 +271,8 @@ static void sahara_hello(struct sahara_context *context)
> > >   	}
> > >   	if (le32_to_cpu(context->rx->hello.mode) != SAHARA_MODE_IMAGE_TX_PENDING &&
> > > -	    le32_to_cpu(context->rx->hello.mode) != SAHARA_MODE_IMAGE_TX_COMPLETE) {
> > > +	    le32_to_cpu(context->rx->hello.mode) != SAHARA_MODE_IMAGE_TX_COMPLETE &&
> > > +	    le32_to_cpu(context->rx->hello.mode) != SAHARA_MODE_MEMORY_DEBUG) {
> > >   		dev_err(&context->mhi_dev->dev, "Unsupported hello packet - mode %d\n",
> > >   			le32_to_cpu(context->rx->hello.mode));
> > >   		return;
> > > @@ -320,9 +406,70 @@ static void sahara_end_of_image(struct sahara_context *context)
> > >   		dev_dbg(&context->mhi_dev->dev, "Unable to send done response %d\n", ret);
> > >   }
> > > +static void sahara_memory_debug64(struct sahara_context *context)
> > > +{
> > > +	int ret;
> > > +
> > > +	dev_dbg(&context->mhi_dev->dev,
> > > +		"MEMORY DEBUG64 cmd received. length:%d table_address:%#llx table_length:%#llx\n",
> > > +		le32_to_cpu(context->rx->length),
> > > +		le64_to_cpu(context->rx->memory_debug64.table_address),
> > > +		le64_to_cpu(context->rx->memory_debug64.table_length));
> > > +
> > > +	if (le32_to_cpu(context->rx->length) != SAHARA_MEM_DEBUG64_LENGTH) {
> > > +		dev_err(&context->mhi_dev->dev, "Malformed memory debug64 packet - length %d\n",
> > > +			le32_to_cpu(context->rx->length));
> > 
> > Any particular reason why you choose not to attempt a reset in these two
> > error cases?
> 
> Undefined behavior.  The spec doesn't address this condition - I guess the
> assumption is that the device provides valid values.
> 
> The spec places limits on reset -
> "The host sends a reset packet to reset the target. The target services a
> reset request only if it is in a state where reset requests are valid.
> 
> If the target receives an invalid reset request, the target sends an error
> in an end of image transfer packet"
> 
> From there the spec doesn't clarify how to proceed.
> 
> So, it seems possible that we send a reset here, the device rejects it, we
> get an error back, and cannot really proceed, which feels like a "back to
> square 1" situation.  Less complex to not send a reset and end up in the
> same place.
> 

No concerns with this.

> > 
> > > +		return;
> > > +	}
> > > +
> > > +	context->dump_table_address = le64_to_cpu(context->rx->memory_debug64.table_address);
> > > +	context->dump_table_length = le64_to_cpu(context->rx->memory_debug64.table_length);
> > > +
[..]
> > >   static void sahara_processing(struct work_struct *work)
> > >   {
> > > -	struct sahara_context *context = container_of(work, struct sahara_context, work);
> > > +	struct sahara_context *context = container_of(work, struct sahara_context, fw_work);
> > >   	int ret;
> > >   	switch (le32_to_cpu(context->rx->cmd)) {
> > > @@ -338,6 +485,12 @@ static void sahara_processing(struct work_struct *work)
> > >   	case SAHARA_DONE_RESP_CMD:
> > >   		/* Intentional do nothing as we don't need to exit an app */
> > >   		break;
> > > +	case SAHARA_RESET_RESP_CMD:
> > > +		/* Intentional do nothing as we don't need to exit an app */
> > 
> > For this patch I don't have any concern, but should we do something to
> > track that we should not handle any further requests?
> 
> The spec says that the device will (warm) reset after sending this.  It is
> actually a bit of a race condition as the AIC100 implementation will send
> this packet, and then force a watchdog bite to reset.  If that watchdog
> processing is very quick, we won't even see this.
> 

I see the same race on the MSM side...

> Another spec ambigiuity on what happens if the device doesn't reset after
> this command.  Tracking this feels like complexity for little gain.
> 

Fair enough.

> > 
> > > +		break;
> > > +	case SAHARA_MEM_DEBUG64_CMD:
> > > +		sahara_memory_debug64(context);
> > > +		break;
> > >   	default:
> > >   		dev_err(&context->mhi_dev->dev, "Unknown command %d\n",
> > >   			le32_to_cpu(context->rx->cmd));
> > > @@ -350,6 +503,223 @@ static void sahara_processing(struct work_struct *work)
> > >   		dev_err(&context->mhi_dev->dev, "Unable to requeue rx buf %d\n", ret);
> > >   }
> > > +static void sahara_parse_dump_table(struct sahara_context *context)
> > > +{
[..]
> > > +	image_out_table_entry = (struct sahara_dump_table_entry *)(context->mem_dump +
> > > +								sizeof(*dump_meta));
> > 
> > Isn't this cast unnecessary?
> 
> I thought there was a compiler warning, but I'm not seeing it.  Will fix.
> 

mem_dump is void *, so you should be fine an implicit cast.

> > 
> > > +	dev_table_entry = (struct sahara_debug_table_entry64 *)(context->rx);
> > > +	for (i = 0; i < table_nents; ++i, ++image_out_table_entry, ++dev_table_entry) {
> > > +		image_out_table_entry->type = le64_to_cpu(dev_table_entry->type);
> > > +		image_out_table_entry->address = le64_to_cpu(dev_table_entry->address);
> > > +		image_out_table_entry->length = le64_to_cpu(dev_table_entry->length);
> > > +		strscpy(image_out_table_entry->description, dev_table_entry->description,
> > > +			SAHARA_TABLE_ENTRY_STR_LEN);
> > > +		strscpy(image_out_table_entry->filename,
> > > +			dev_table_entry->filename,
> > > +			SAHARA_TABLE_ENTRY_STR_LEN);
> > > +	}
> > > +
> > > +	context->mem_dump_freespace = image_out_table_entry;
> > > +
> > > +	/* Done parsing the table, switch to image dump mode */
> > > +	context->dump_table_length = 0;
> > > +
> > > +	/* Request the first chunk of the first image */
> > > +	context->dump_image = (struct sahara_dump_table_entry *)(context->mem_dump +
> > > +								sizeof(*dump_meta));
> > 
> > I would have preferred to see this (and above) written such that it's
> > explicitly clear that you're filling out an array of entries and then
> > point this to the first entry in that array.
> 
> I'm not sure I understand what you would like to see here.  Can you perhaps
> give an example?
> 

Per your devcoredump definition at the top, image_out_table_entry is an
array of struct sahara_dump_table_entry, which you fill out by sliding a
pointer starting at mem_dump + sizeof(*dump_meta).

You then have context->dump_image to be a pointer to each element in
this array, except that it's not expressed as an array...

But it took me a minute to understand that this was what the code is
doing.

If you instead wrote it as:

  for (i = 0..table_nents) {
  	image_out_table[i].foo = bar;
	...;
  }

  context->dump_image = &image_out_table[0];

(Or perhaps even make dump_image an index into image_out_table)

It would have been obvious to me when I looked at the code that you're
setting up an array and then looping over each entry in the array.


So, I don't see anything wrong with the logic, but it would have been
easier for me if the code manifested this array, as an array...

Perhaps I'm missing some detail which complicates things, as far as I
can tell the logic presented is reasonable.

Regards,
Bjorn


More information about the dri-devel mailing list