Unreliable DSL (PPP) connection

I also experienced this issue during the first DSL test about half a year ago, and @roburka was also unable to set up a DSL connection on MOX with 4.0 beta... So I think this is not a unique situation and won't mark this Unconfirmed.

An ugly hack to make the router's connection stable is to include this code in /etc/rc.local:

(
	sleep 60
	/etc/init.d/network restart
) &

UPDATE: OK, unplugging and plugging back in the Ethernet cable between the modem and MOX seems to make the PPP daemon establish the connection as well.

Have you tested it with HBL as well? Does it exhibit the same behavior?

Not yet. I discovered it before we discussed the TOS 5.0 release.

@vmyslivec Can I ask you to try it on the Turris OS 5.0 release? You can find it in the HBL branch for now. It was reported to me that one of our users, who is running it on a Turris 1.1, has a similar (or even the same) issue.

I have this in To Do. But I need to go physically to a remote site as I don't want to do the upgrade remotely.

Running switch-branch hbk and upgrading to %Turris OS 5.0 does not help. What should we do next?

UPDATE: OK, I am still on %Turris OS 4.0.3. switch-branch did not finish the upgrade properly.
Stay tuned.

%Turris OS 5.0 installed and nothing changed

changed the description

From the logs:

Terminating on signal 15

Is apparently a SIGTERM for the PPPD process which may or may not be related to ^**[1]**

There is a new option, child-timeout, which sets the length of time that pppd will wait for child processes (such as the command specified with the pty option) to exit before exiting itself. It defaults to 5 seconds. After the timeout, pppd will send a SIGTERM to any remaining child processes and exit. A value of 0 means no timeout.

To get more verbose debug output for PPP, you could enable the debug flag in /etc/ppp/options (and, if convenient, set a log file path with logfile).
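A minimal sketch of that tip, assuming the options file lives at /etc/ppp/options. The helper name enable_pppd_debug and the log path used in the usage line are illustrative and do not come from this thread; debug and logfile are standard pppd options.

```shell
#!/bin/sh
# Add pppd's "debug" option to an options file and, if a log path is given,
# point "logfile" at it (rewriting an existing logfile line if there is one).
# Helper name and default path are illustrative assumptions.
enable_pppd_debug() {
	options_file="${1:-/etc/ppp/options}"
	log_path="$2"

	touch "$options_file" || return 1

	# Only append "debug" if it is not already an active (uncommented) line.
	grep -qx 'debug' "$options_file" || echo 'debug' >> "$options_file"

	if [ -n "$log_path" ]; then
		if grep -q '^logfile ' "$options_file"; then
			# Replace the existing logfile line via a temporary copy.
			sed "s|^logfile .*|logfile $log_path|" "$options_file" \
				> "$options_file.tmp" &&
				mv "$options_file.tmp" "$options_file"
		else
			echo "logfile $log_path" >> "$options_file"
		fi
	fi
}
```

Usage would be something like enable_pppd_debug /etc/ppp/options /tmp/pppd-debug.log followed by a network restart, so that pppd picks up the changed options.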

^**[1]** https://github.com/paulusmack/ppp/

Thanks for the tips.

These lines repeat in the log after reboot:

Plugin rp-pppoe.so loaded.
RP-PPPoE plugin version 3.8p compiled against pppd 2.4.7
Send PPPOE Discovery V1T1 PADI session 0x0 length 4
 dst ff:ff:ff:ff:ff:ff  src d8:58:d7:00:b4:be
 [service-name]

And when I restart the network, it goes:

Plugin rp-pppoe.so loaded.
RP-PPPoE plugin version 3.8p compiled against pppd 2.4.7
Send PPPOE Discovery V1T1 PADI session 0x0 length 4
 dst ff:ff:ff:ff:ff:ff  src d8:58:d7:00:b4:be
 [service-name]
Send PPPOE Discovery V1T1 PADI session 0x0 length 4
 dst ff:ff:ff:ff:ff:ff  src d8:58:d7:00:b4:be
 [service-name]
Recv PPPOE Discovery V1T1 PADO session 0x0 length 43
 dst d8:58:d7:00:b4:be  src 40:7c:7d:ea:db:03
 [service-name] [AC-name PA77B01PRAHVL02] [AC-cookie  c0 ff ee ba be ..]
Send PPPOE Discovery V1T1 PADR session 0x0 length 24
 dst 40:7c:7d:ea:db:03  src d8:58:d7:00:b4:be
 [service-name] [AC-cookie  c0 ff ee ba be ..]
Recv PPPOE Discovery V1T1 PADS session 0x1 length 4
 dst d8:58:d7:00:b4:be  src 40:7c:7d:ea:db:03
 [service-name]
PADS: Service-Name: ''
PPP session is 1
Connected to 40:7c:7d:ea:db:03 via interface eth0
using channel 1
Using interface pppoe-wan
Connect: pppoe-wan <--> eth0
...

So it seems pppd does not receive a reply (PADO) to its PADI from the DSLAM/BRAS (?) after reboot, for an unknown reason.
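To compare the two cases quickly, the discovery packets in the debug output can be counted. This is only a sketch: the function name pppoe_discovery_summary is made up, and the patterns simply match the "Send/Recv PPPOE Discovery" lines quoted above.

```shell
#!/bin/sh
# Summarise PPPoE discovery from pppd debug output read on stdin.
# Prints "PADI=<sent> PADO=<received> PADS=<received>".
# The function name is an illustrative assumption; the patterns match the
# "Send/Recv PPPOE Discovery" lines quoted in this thread.
pppoe_discovery_summary() {
	awk '
		/Send PPPOE Discovery .* PADI / { padi++ }
		/Recv PPPOE Discovery .* PADO / { pado++ }
		/Recv PPPOE Discovery .* PADS / { pads++ }
		END { printf "PADI=%d PADO=%d PADS=%d\n", padi, pado, pads }
	'
}
```

Fed with the debug output (for example logread | pppoe_discovery_summary, assuming the debug lines end up in the system log), a result with PADI above zero and PADO at zero corresponds to the after-reboot case, where discovery requests go out but no offer comes back.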

Is it possible that the modem thinks that the link is down or something?

Can you try just bringing the link down and up on the Omnia instead of a network restart? You know, so as not to restart the service.

The ISP's PPPoE server may fail to respond if it does not receive the discovery packet in the first place, which could be due to the physical link between modem and router being down at that point in time. PPPoE discovery is unaware of the physical link state; I am not sure how often it retries before giving up.

Is the log still showing

pppd: Terminating on signal 15

?

@kkoci if I unplug the physical cable and plug it in again, it starts working normally. I expect ip link down/ip link up would do the same. I will try it next time.

@n8v8R it's in an infinite loop. The link LEDs keep flashing on and off. I am out of ideas.

Is the log still showing

Not sure what I should do about it...

Anyway, I am waiting for the HBK build to become ready so I can test TOS 5.0.

I reckon that OpenWrt (netifd) suffers from some issue with the link state when connected to external (incl. SFP) modems/modules, which is not likely to go away with 19.07 (5.x) but might improve with the code in the development branch.

The TOS forum has various reports about link state issues with external modems (since 18.06.x), and PPPoE seems to be suffering the most, perhaps because it is unaware of the link state.

Log excerpt from my node (HBD) upon boot

kernel: [ 54.649804] mvneta f1034000.ethernet eth2: Link is Down
kernel: [ 54.668505] mvneta f1034000.ethernet eth2: configuring for 802.3z/1000base-x link mode
kernel: [ 54.668561] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control off
kernel: [ 54.674523] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
insmod: module is already loaded - ppp_generic
insmod: module is already loaded - pppox
insmod: module is already loaded - pppoe
pppd[4865]: Plugin rp-pppoe.so loaded.
pppd[4865]: RP-PPPoE plugin version 3.8p compiled against pppd 2.4.7
pppd[4865]: pppd 2.4.7 started by root, uid 0
pppd[4865]: Timeout waiting for PADO packets
pppd[4865]: Unable to complete PPPoE Discovery
pppd[4865]: Exit.
netifd: Interface 'wan' is now down
netifd: Interface 'wan' is disabled
netifd: Interface 'wan' is enabled
netifd: Interface 'wan' is setting up now
insmod: module is already loaded - slhc
insmod: module is already loaded - ppp_generic
insmod: module is already loaded - pppox
insmod: module is already loaded - pppoe
pppd[5038]: Plugin rp-pppoe.so loaded.
pppd[5038]: RP-PPPoE plugin version 3.8p compiled against pppd 2.4.7
pppd[5038]: pppd 2.4.7 started by root, uid 0
pppd[5038]: Timeout waiting for PADO packets
pppd[5038]: Unable to complete PPPoE Discovery
pppd[5038]: Exit.
netifd: Interface 'wan' is now down
netifd: Interface 'wan' is disabled
netifd: Interface 'wan' is enabled
netifd: Interface 'wan' is setting up now
insmod: module is already loaded - slhc
insmod: module is already loaded - ppp_generic
insmod: module is already loaded - pppox
insmod: module is already loaded - pppoe
pppd[5188]: Plugin rp-pppoe.so loaded.
pppd[5188]: RP-PPPoE plugin version 3.8p compiled against pppd 2.4.7
pppd[5188]: pppd 2.4.7 started by root, uid 0
pppd[5188]: PPP session is 8978
pppd[5188]: Connected to 78:ba:f9:73:f5:74 via interface eth2
pppd[5188]: Renamed interface ppp0 to pppoe-wan
pppd[5188]: Using interface pppoe-wan
pppd[5188]: Connect: pppoe-wan <--> eth2
pppd[5188]: PAP authentication succeeded
pppd[5188]: peer from calling number 78:BA:F9:73:F5:74 authorized

I reckon that OpenWrt (netifd) suffers from some issue with the link state when connected to external (incl. SFP) modems/modules, which is not likely to go away with 19.07 (5.x) but might improve with the code in the development branch.

The TOS forum has various reports about link state issues with external modems (since 18.06.x), and PPPoE seems to be suffering the most, perhaps because it is unaware of the link state.

Yep, I am suspicious of OpenWrt/netifd.

If we can isolate the issue, we should push a fix upstream or at least patch it on our side.

@kkoci FYI

ip link set down dev eth0
ip link set up dev eth0

does not help to get out of the crash loop.

@n8v8R

Is apparently a SIGTERM for the PPPD process which may or may not be related to child-timeout

I think it's not related, as this option affects only pppd's exit behavior, not connection setup.

But is the SIGTERM still exhibited or not, considering you are mentioning a "crash loop"? If it is, it would seem somewhat unique. Did you try with a 5.x medkit from scratch?

changed milestone to %Turris OS 4.0.6

changed milestone to %Turris OS 5.0

changed due date to March 13, 2020

added Doing label and removed To Do label

changed milestone to %Turris OS 5.0.1

any idea @kkoci ?

My only idea is to get some non-Turris router, install plain OpenWrt on it and test the PPP connection. If it works, then test on a Turris router with plain OpenWrt. That can help to isolate the problem.

@mmatejek I can lend you my Xiaomi router, which runs plain OpenWrt. But if you have a spare Turris Omnia router, you can flash it with plain OpenWrt as well.

Most likely the problem will be there too, but I might be wrong. The ppp package comes from OpenWrt as is, without further modifications.

Yes, this is what I mean: isolate the problem and find out whether it is an upstream issue, a Turris OS issue, or a Turris hardware issue.

I could test it on an Alix APU and a Xiaomi router (OpenWrt 19.07) to rule out a hardware issue, but I'm not very familiar with PPP and not sure whether I can create a PPP testing environment at home.

assigned to @kkoci

removed due date

`lcp-echo-adaptive`

New findings, probably related

In newer OpenWrt (TOS 4.0+), the default lcp-echo-failure and lcp-echo-interval pppd options moved from the /etc/ppp/options file to the keepalive option in the uci network config (handled via /lib/netifd/proto/ppp.sh).

This change also causes the keepalive_adaptive default value 1 (true) to be passed as the pppd option lcp-echo-adaptive, which is a Debian/OpenWrt pppd enhancement (not included in the original pppd).

This option means pppd will send LCP echo requests only if no traffic was received from the peer since the last echo request, i.e. only if the link is idle.
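To make the mapping concrete, here is a rough sketch of how a keepalive value and the keepalive_adaptive flag could translate into pppd options. It is modelled on the behaviour described above and is not copied from ppp.sh; the function name keepalive_to_pppd_opts, the "<failures> [<interval>]" value format and the fallback interval of 5 seconds are assumptions.

```shell
#!/bin/sh
# Translate a uci-style keepalive value ("<failures> [<interval>]") and an
# adaptive flag (1/0) into pppd options, printed on one line.
# Illustrative only; the real logic lives in ppp.sh and may differ.
keepalive_to_pppd_opts() {
	keepalive="$1"
	adaptive="${2:-1}"      # defaults to 1 (true), as described above

	failures="${keepalive%% *}"
	interval="${keepalive#* }"
	# No separate interval given: fall back to an assumed 5 seconds.
	[ "$interval" = "$keepalive" ] && interval=5

	opts=""
	if [ "${failures:-0}" -ge 1 ]; then
		opts="lcp-echo-failure $failures lcp-echo-interval $interval"
		[ "$adaptive" -ge 1 ] && opts="$opts lcp-echo-adaptive"
	fi
	echo "$opts"
}
```

With this sketch, keepalive_to_pppd_opts "5 10" prints the three options including lcp-echo-adaptive, while passing 0 as the second argument drops it. To check whether adaptive echo plays a role on a real router, setting keepalive_adaptive to 0 on the WAN interface in /etc/config/network (the section name wan is assumed here) should have the same effect.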

`sleep` in `ppp.sh`

Also probably related new findings.

In TOS 3.x, there was a "hack" introduced in our OpenWrt fork: openwrt@ad71556f .

This hack seems similar to my workaround above and I will further investigate the reason for this patch.

OMFG it works!

When I apply the patch mentioned above (sleep 10), PPP starts and the connection is established.

In my case, it works with sleep 5 as well. But I feel like a cargo cult member when I try to tune this "parameter". In only 1 out of about 6 test reboots, IPv6 was not established correctly. IPv4 came up and worked in all cases.

We should really find out why this sleep helps. cc @kkoci @mhrusecky

In the meantime, since it works, I'm in favor of applying that sleep to fix the issue that some of our users might have. Then we can investigate it further.

If this were to be a permanent or long-term solution, I would introduce a UCI config option to control the length of this sleep. Users who don't experience this issue may want to turn it off to speed up the Internet connection after a reboot.
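A sketch of what such a knob could look like. The option name start_delay and both function names are hypothetical; nothing like this exists in ppp.sh as far as this thread shows. The value would be read from the interface config and passed in.

```shell
#!/bin/sh
# Resolve the pre-start delay from a (hypothetical) config option value.
# Empty or non-numeric input falls back to the default; 0 disables the delay.
# "start_delay" is a made-up option name used for illustration only.
resolve_start_delay() {
	value="$1"
	default="${2:-10}"
	case "$value" in
		''|*[!0-9]*) echo "$default" ;;
		*) echo "$value" ;;
	esac
}

# Sleep for the resolved delay; a value of 0 skips the sleep entirely.
delay_before_pppd() {
	delay="$(resolve_start_delay "$1" "$2")"
	[ "$delay" -gt 0 ] && sleep "$delay"
	return 0
}
```

Users who are not affected could then set the hypothetical option to 0, and delay_before_pppd 0 returns immediately, while an unset option keeps the 10-second default discussed here.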

I think that we should look deeper into it. For now I would go with 10 seconds, as previously used, and I would include it in TOS 5.0.

I would consider that a clear hack. It evidently helps because something has to happen in the meantime, and the sleep should be replaced with at least a busy-loop wait for that event. What we have to investigate is why it helps and what exactly happens in the meantime.
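As an illustration of waiting for a condition instead of sleeping for a fixed time, here is a sketch that polls the carrier state of the underlying interface with a timeout. Waiting for carrier is purely an assumption: the thread has not established which event the sleep is covering for, and toggling the link with ip link did not help above. The function name wait_for_carrier is made up.

```shell
#!/bin/sh
# Poll a sysfs-style carrier file until it reads "1" or the timeout expires.
# Returns 0 once carrier is reported, 1 on timeout.
# Waiting for carrier is an assumption; the thread has not established
# which event the fixed sleep is actually covering for.
wait_for_carrier() {
	carrier_file="$1"       # e.g. /sys/class/net/eth0/carrier
	timeout="${2:-10}"      # seconds
	waited=0

	while [ "$waited" -lt "$timeout" ]; do
		# Reading can fail (missing file, interface administratively down);
		# errors are treated the same as "no carrier".
		state="$(cat "$carrier_file" 2>/dev/null)"
		[ "$state" = "1" ] && return 0
		sleep 1
		waited=$((waited + 1))
	done
	return 1
}
```

A call such as wait_for_carrier /sys/class/net/eth0/carrier 10 would return as soon as the link is reported up and fall through after 10 seconds otherwise, so the worst case is no slower than the fixed sleep 10.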

We will see. I would still aim for 5.0.1 or even 5.1 with a better fix, but let's prepare and hopefully merge the sleep 10 hack into 5.0 now.

Probably known, but if not, for the sake of good order: netifd also provides debug output, probably best set in /etc/init.d/network

procd_set_param command /sbin/netifd -l 5
procd_set_param stderr 1

levels enum https://git.openwrt.org/?p=project/netifd.git;a=blob;f=main.c;hb=cfccdc22ca6d8f28d70a2546a495c9ead4bbb765#l41

pppd provides a phalanx of options (https://www.freebsd.org/cgi/man.cgi?query=pppd&sektion=8&manpath=FreeBSD+4.7-RELEASE), not all of which may be supported by the OpenWrt implementation, and some options are mutually exclusive.

Those can be set in /etc/config/network via option pppd_options, e.g. option pppd_options 'passive debug kdebug 7'

Happy hunting.

changed milestone to %Turris OS 5.0

mentioned in merge request !167 (merged)

Vodafone CZ VDSL with Comtrend VR-3031eu modem in bridge mode: I’m completely unable to connect to my ISP without the mentioned workaround. I tried /etc/init.d/network restart, but even that did not help without the sleep.

Thanks for the info. Just to make it clear: does the connection work well with the mentioned sleep?

Yes, it does.
