[RFC] TTY port locked problems when using PPP
mstarr at hedonline.com
Wed Jun 19 14:27:01 UTC 2019
> Date: Tue, 18 Jun 2019 14:43:10 +0200
> From: Aleksander Morgado <aleksander at aleksander.es>
> In the past months we've seen several reports of issues with the TTY
> port "forced closed" and not being able to be reopened after that and
> things like that. The root cause from this problem is that MM tries to
> reuse the TTY port after MM itself has detected a disconnection (via
> +CGEV URCs or +CGACT? polling). While trying to reuse the TTY, the
> port is flagged as disconnected and reused, and so the TTY reading
> method may end up getting a HUP because modem control lines are still
> in effect (CLOCAL settings managed by pppd).
> We have been discussing this on IRC and I think there are several ways
> to try to solve this issue.
> #1) The first way is an attempt to keep on using +CGEV and +CGACT?
> polling and then do our best to reuse the port safely after we detect
> a disconnection. This would mean that we try to reopen the port and
> apply our settings even if pppd is still using the port. Also, we
> could just ignore HUPs happening in the port in case we're using the
> port with unexpected CLOCAL settings, and we would trigger a full port
> reopen ourselves to set the settings we expect. The problem I see with
> this solution is that we truly don't know what pppd is doing at the
> same time; it may already be stopping after NM stop sees the bearer
> object disconnected in MM, but we cannot be totally sure. Maybe the
> procedure could be extended to check if other processes are using the
> port before trying to reuse it ourselves (kind of lsof), but that
> logic is not in place yet. The changes suggested for now are here:
> #2) The second way is to skip using +CGEV URCs or +CGACT polling
> completely if the data port is a TTY. This would recover the logic
> that was in place before MM 1.8.0 (+CGACT polling was added in 1.8.0
> and +CGEV URCs in 1.10.0). Without these two things, the bearer
> disconnections would exclusively happen after NM (or whatever other
> client) runs Disconnect() on the bearer object, and that would only
> happen once the port is available for MM (i.e. pppd already stopped).
> In this way, all MM/NM/pppd would be in sync again and MM would not
> get any HUP in the TTY (as CLOCAL settings would be the expected
> ones). But, we lose the ability to report the disconnection when the
> context/bearer disconnection is detected with AT commands, and we
> would rely only on pppd detecting the disconnection, as it was before
> 1.8.0. The changes suggested are here:
> #3) A third way, for which there is no suggested MR yet would be to
> keep +CGEV URCs and +CGACT polling, *but* instead of blindly reopening
> the port to reuse it right away, only emit e.g. a DBus signal or
> property update reporting that the bearer object "needs
> disconnection", and so we would leave the task of running Disconnect()
> to the upper layers (once the port is no longer used by pppd). MM
> would not automatically try to reuse the port until that Disconnect()
> is run by the client. This has the benefit that we keep all features
> in place and we're sure we'll be in sync as well, but it would require
> additional changes in NM. The new 'needs disconnection' signal could
> be treated as completely optional, if other connection managers don't
> want to use it, they can keep on relying on pppd-detected
> disconnections exclusively without any additional change.
> My opinion is a bit open on this, I liked the first way because we
> would try to be as robust as possible and totally avoid using the port
> with wrong CLOCAL settings, but I don't think we can have that
> robustness unless we're truly in sync with NM and pppd, and for that
> we need either #2 or #3. And if we target to be in sync, then I don't
> think #1 is needed, we should always try to enforce being in sync
> instead. I'd vote to have #2 in MM 1.8 and MM 1.10 and #3 for the next
> MM 1.12?
> Comments welcome!
Have you seen the following commit from NetworkManager for pppd? It has helped solve many of the issues of being in sync with pppd and MM that I have seen:
Even with that patch, there was still an occasional instance when the forced closed issue would occur, but it was much less often.
I have been adding work arounds of my own to get around this issue since MM 1.8.0. I think what you laid out using #2 for the existing versions of MM (1.8.X and 1.10.X) works for those versions since it doesn't make breaking changes.
For the next version of MM, #3 will work, but I would suggest falling back to #2 if the version of NetworkManager is before the version that added the usage of the Disconnect(). Without a check on the version or checking if the Disconnect() support is implemented in NetworkManager, then MM might make things worse in a configuration with a new MM and older NM. In this case the serial port would never recover from a data session since there is nothing to call Disconnect().
If you could implement #3 with detection of if Disconnect() is implemented, and if not fall back to #2. Then you could implement this in all currently supported versions of MM (1.8.X and 1.10.X). This would support situations where an older MM is present with a newer NM.
More information about the ModemManager-devel