Is Netbalancer/QoS patch included in ZS 0.1b14

This topic contains 35 replies, has 0 voices, and was last updated by  mrevjd 6 years, 1 month ago.

  • #42846

    mrevjd
    Member

    Hi All,

    Before I upgrade a few machines, I was wondering whether the QoS/Netbalancer patch/fix is included in ZS 0.1b14.

    @fulvio – Thank you for a great product.

    Thanks
    Evan

    #51537

    mrevjd
    Member

    Quick update: from my initial testing it looks as though the QoS/Netbalancer patch/fix has NOT been included in this update.
    The good news is that the current patch/fix can still be used.

    #51538

    atheling
    Member

    @mrevjd wrote:

    Quick update: from my initial testing it looks as though the QoS/Netbalancer patch/fix has NOT been included in this update.
    The good news is that the current patch/fix can still be used.

    My first cut at applying the changes in the kerbynet.cgi/scripts modified by the QoS/Netbalancer patch indicates two things:

    1) The QoS/Netbalancer patch is NOT in b14
    2) There are merge conflicts in failoverd and fw_initrules

    The conflicts in fw_initrules are straightforward. But I will have to wrap my head around how the iptables rules and ip rules work again before I can make up my mind about the conflicts in failoverd.

    End result: I don’t think it is safe to apply the b12 QoS/Netbalancer patch to b14.

    Unfortunately, I can’t say how long it will be before I can get time to look at this. Too many other irons in the fire…

    By the way, if I recall correctly, Fulvio’s hesitation on incorporating my patch into his code base was issues with falsely declaring an interface as failed. If anyone using my patch has experience with that problem perhaps we can figure out what I did wrong.

    #51539

    mrevjd
    Member

    I used the pre-patched files from a b13 installation and this seems to work
    I have only tested it briefly.

    If you need help with the patch let me know.

    Cheers
    Evan

    #51540

    atheling
    Member

    @mrevjd wrote:

    I used the pre-patched files from a b13 installation and this seems to work
    I have only tested it briefly.

    If you need help with the patch let me know.

    Cheers
    Evan

    How many and what type of WAN interfaces do you have and have you noticed issues with either detecting a WAN failure or recovering from a failure?

    The conflicting changes in failoverd have to do with the logic for issuing pings out of the desired interface. b14 uses a TOS indication on the ping packet whilst my patch classifies the ping using iptables. There are two merge conflicts in the file that look the same:

    <<<<<<< HEAD
    ip ru add tos 0x04 to ${IP[$I]} ta 1$GW
    if ping -Q 0x04 -c 1 -w $TO ${IP[$I]} ; then
    ip ru del tos 0x04 2>/dev/null
    SUCCESS[1$GW]=$I
    =======
    iptables -t mangle --flush NB_FO_PRE
    iptables -t mangle -A NB_FO_PRE -p icmp --icmp-type echo-request -d $IPP -j MARK --set-mark 1$GW
    if ping -c 1 -w $TO ${IP[$I]} ; then
    SUCCESS[$GW]=$I
    >>>>>>> nb_qos

    It has been over a year since I looked at either the ZS scripts or the documentation on Linux iptables and routing, so I’ll have to get back up to speed. But I am working my day job for pretty long hours and I have volunteer work on weekends until late spring so anything more than a few minutes is too much time for me at the moment.

    #51541

    mrevjd
    Member

    I’m using 2 ethernet WANs via DMZed routers. As I said I have only briefly tried it.
    The patch put the right packets in the QoS classifiers. I thought I’d checked the WAN failover but I will double check this.
    Thanks for the heads up.

    #51542

    Pit
    Member

    Hello Atheling,

    here are my experiences with your patch. Perhaps it helps to figure out what is wrong with failoverd.

    My System: 1 line over ETH to a router
    3 lines direct over ppp

    – Balancing does not work at all without the patch. Reply packets do not know which line the request came in on.

    – With patch: 3 ppp lines alone hang the whole system. 1 ETH line is necessary.

    – With patch and failoverd disabled in the GUI: 1 eth + 3 ppp lines work fine.
    Only IPsec connections get scrambled and have to be sent over one dedicated line by balancing rules. QoS needs a lot of CPU time but works perfectly. Tested with video streaming bound to one line and as a preferred service.

    – With patch and failoverd enabled in the GUI, and 1 eth + 3 ppp lines: the ppp lines break
    after the daily disconnect. As far as I could see, the remote station keeps receiving repeated requests to synchronise although the line is up and running. After a while the modem/remote station locks up. Only resetting the modems brings the ppp lines up again.

    – Test environment with patch and failoverd enabled and two eth lines:
    Everything works fine!
    The log files show that some code is not able to fetch the name and IP of the faulty device.
    This mainly affects the recovery message.

    beta14: The changes in failoverd and fw_initrules do not affect the patch code. Patching by hand from the .rej files should work. I will test this in a production environment.

    Regards
    Pit

    #51543

    atheling
    Member

    Thank you Pit for your observations.

    Do you have the “Immediately restart PPPoE and 3G Mobile” option on the Net Balancer page set to “yes”? If not, then I don’t see why your PPP links would behave differently than your ETH links…

    Sounds like the PPP restart logic, not touched by my patch, might need some looking into…

    #51544

    Pit
    Member

    The “Immediately restart PPPoE and 3G Mobile” option on my Net Balancer page is set to “yes”.
    I do not understand the code as a whole, but it looks as if more than one “helping hand in the background” tries to restart the lines, or there is an unintended loop, or the “Immediately restart PPPoE and 3G Mobile” option is itself illogical and redundant.

    My boxes are delivered preinstalled and the ppp lines are working a few seconds after the cables are connected. No start is necessary. With failoverd disabled, my lines have the daily disconnect and are up again after a few milliseconds without intervention. So I think failoverd will work properly with the “Immediately restart PPPoE and 3G Mobile” option set to “no”.

    The more I think about this, the more I ask myself why this option exists at all.
    It is confusing.

    I cannot understand Fulvio’s hesitation on incorporating your patch into his code base either. In my test environment there were no false failure declarations at all, and I tested for weeks. When tuning failoverd you have to bear in mind that the lines sometimes have a latency of more than 1 second. Smaller settings will of course produce false failure declarations.
    But this is not caused by your patch.

    Please be so kind as to talk with Fulvio about your patch and ask him what exactly is wrong with your code. I am sure the two of you will find a solution.

    #51545

    micampo
    Member

    Fulvio, when will you add the patch?

    I am presently on b13 with 4 lines, without QoS, and failover does not work.

    Thanks

    #51546

    atheling
    Member

    @micampo wrote:

    Fulvio, when will you add the patch?

    I am presently on b13 with 4 lines, without QoS, and failover does not work.

    Thanks

    I’ve just posted a status update with links to a new version of the patch at http://www.zeroshell.net/eng/forum/viewtopic.php?p=9460#9460

    #51547

    mrevjd
    Member

    Hi All,

    After comparing the various files and the patches by atheling, I have managed to modify the files for b14 so they handle QoS and Netbalancing in the same way that the original b13 ones do. A lot of the files had been rewritten by fulvio to suit the new version of Zeroshell.

    I have done limited testing on the patches but they seem to work for me on 2 Ethernet based internet connections.

    These are the patched files.

    Happy for other people to try them out.

    http://dl.dropbox.com/u/28224983/Zeroshell/NB_QoS.zip

    Cheers

    #51548

    atheling
    Member

    Please forgive my long post, I’ll try to break it up into pieces so that you can skip over sections that are not of interest to you.


    The updated patch posted by mrevjd

    With the exception of changes in failoverd and new changes in nb_testfo, mrevjd’s patch matches what I have in my beta14 working area. So if you are just looking for QoS fixes and having incoming traffic to servers exit on the appropriate interface you should be fine with his patch. Or, for that matter, one of my older patches.

    I am still running beta 12 but have updated my patch for the beta 14 release and am actually running this proposed beta 14 patch on my beta 12 system. It may work for you but is untested on a beta 14 system. You can download it from:

    http://dl.dropbox.com/u/19663978/ZS_nb_qos_b14_a.zip

    It has significant changes in failoverd, including a new concept of how to detect failed links.


    When should you use net balancing

    If you can possibly use a bonded interface to handle link failover and/or load balancing, I strongly suggest you do: The Linux bonding driver is robust, reasonably well documented and well supported. It can load balance traffic on a single TCP connection which net balancing can’t do.

    However bonding requires that both ends (you and your ISP) are set up for it. That may not be the case.

    If you are putting in links between offices, you could set up an OpenVPN link, with its associated pseudo device, on each physical link and then bond those for higher throughput between offices. You would not need to coordinate with your ISP for that. I believe the documentation for this resides in these forums and on the main Zeroshell web site.


    What net balancing attempts to do and its limitations

    If, like me, you have multiple ISPs using different technologies to provide redundant Internet access then you are dealing with WAN links that cannot be bonded.

    Net balancing attempts to balance traffic at the connection level (all datagrams in the same connection go through the one interface): Any one TCP connection will be bound to a specific interface. Moreover, under many, perhaps most, situations all successive TCP connections to a specific IP address will be bound to that same interface.

    This “stickiness” where successive TCP connections to the same IP address use the same gateway is due to the Linux kernel finding the route in its routing cache. And for general use it is a good thing because otherwise you’d never be able to log into a site that uses HTTPS.

    If a WAN link goes down, then the normal routing entries that allow user traffic to go through that WAN’s interface are removed from the routing tables. This means that marking an interface as failed could remove the ability to ping through it to detect that the link has been restored.


    Linux routing background information

    Sending datagrams for a connection to a specific interface is done by marking datagrams with tags, called fwmark in the routing area and under a different name in the iptables area. Basically, iptables examines each datagram, looks at a connection table and other information, then tags the datagram with a fwmark to tell the router how to handle it.

    The routing cache, the routing policy database (RPDB) (a.k.a ip rules) and one or more routing tables are then used to send the datagram on its way through the appropriate interface.

    At this point you may want to go read the information on how Linux routing works at http://linux-ip.net/html/routing-selection.html. In particular you will want to look at “Example 4.4. Routing Selection Algorithm in Pseudo-code”.
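
A minimal sketch of how the three pieces (iptables mark, ip rule, routing table) fit together. It only prints the commands rather than running them, so it needs no root access. The mark, table number, addresses and interface name are made-up examples, not values from the Zeroshell scripts, and the function name is hypothetical.

```shell
# Dry run: print, but do not execute, the commands that tie a packet mark
# to a routing table. Every value passed in is a made-up example.

# usage: show_mark_routing MARK TABLE GATEWAY_IP DEVICE SOURCE_IP
show_mark_routing() {
    # 1) iptables tags matching datagrams with a mark (the "fwmark").
    echo "iptables -t mangle -A PREROUTING -s $5 -j MARK --set-mark $1"
    # 2) An ip rule (RPDB entry) sends datagrams carrying that mark to a table.
    echo "ip rule add fwmark $1 table $2"
    # 3) The default route in that table picks the gateway and interface.
    echo "ip route add default via $3 dev $4 table $2"
}

show_mark_routing 101 101 10.0.0.1 eth1 192.168.0.10
```

Matching on a source address is just the simplest selector to show; a real setup would more likely mark per connection.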


    The task performed by failoverd

    failoverd has a conceptually easy job: Ping a set of destination IP addresses over each WAN (gateway) link. If the link is up and the pings fail then bring the interface down. If the link is down and the pings succeed then bring the interface up.
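
The decision described above can be written down as a tiny state table. This is only a sketch of the concept: the function name, the state names and the action names are invented for illustration and do not come from failoverd itself.

```shell
# usage: failover_action LINK_STATE PING_RESULT
#   LINK_STATE  is "up" or "down" (what the balancer currently believes)
#   PING_RESULT is "ok" or "fail" (outcome of the pings over that link)
# Prints the action to take for that combination.
failover_action() {
    case "$1:$2" in
        up:fail) echo "bring-down" ;;  # link was up but pings fail
        down:ok) echo "bring-up" ;;    # link was down but pings succeed
        *)       echo "no-change" ;;   # state and ping result already agree
    esac
}

failover_action up fail
failover_action down ok
failover_action up ok
```

The hard part is not this decision but getting the pings to leave through the right link, which is what the rest of this section is about.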

    To do that failoverd has to send pings through the Linux routing logic so they are sent through specific interfaces. Reading the man pages for ping indicates you can specify the interface, but when you attempt that on a downed interface you will never succeed so you can never bring the interface up. So forget the easy way to do things…

    All of the versions of failoverd I’ve seen released in Zeroshell rely on ping’s ability to set a TOS flag on each datagram. Before a ping is sent, a rule is added to the routing policy database that directs the routing logic to send datagrams with that TOS flag to the routing table for the specific gateway. The default route in that table is then used to send the datagram on its way.
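
Spelled out as a dry run, one TOS-based probe looks roughly like this. The three commands mirror the HEAD side of the failoverd conflict quoted earlier in this thread, with `ip ru` and `ta` written out in full. The function name, the gateway number, the target address and the timeout are example values only.

```shell
# usage: tos_ping_cmds GATEWAY_NUMBER TARGET_IP TIMEOUT_SECONDS
# Prints (does not run) the commands for one TOS-routed probe.
tos_ping_cmds() {
    # Temporary RPDB rule: TOS 0x04 datagrams for the target use table 1<GW>.
    echo "ip rule add tos 0x04 to $2 table 1$1"
    # ping -Q sets the TOS byte so the probe matches the rule above.
    echo "ping -Q 0x04 -c 1 -w $3 $2"
    # Remove the rule again once the probe has finished.
    echo "ip rule del tos 0x04"
}

tos_ping_cmds 2 192.0.2.1 3
```

Note that a rule is added and deleted for every single probe.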

    However, it seems that every time you insert or remove a rule in the routing policy database, Linux flushes its routing cache. Therein lies a problem. With the cache flushed, the “stickiness” for TCP connections is reset and any transactions that assume the IP address is constant from one TCP connection to the next will fail (e.g. HTTPS to a bank web site).

    This is my issue with the stock release of failoverd. If this is not a problem for you, then you need not use a modified version of failoverd.

    But if you need HTTPS to work, a different method of directing pings to the desired WAN link is needed, one that won’t clear the routing cache.

    There is another issue: When a gateway is marked as failed the default route entry for it is removed from the normal (main and gateway specific) tables. So even if we can direct the ping to the gateway specific table it won’t go out the correct interface. I’m not sure how the original failoverd handled this as it appears this issue is in that code too.


    How this version of failoverd handles ping routing

    The solution this patch version of failoverd uses is to add some failoverd specific routing tables that are not re-written by the normal interface and traffic shaping logic. The tables are simple with one entry listing the default route. And then one ip rule per gateway is added using a fwmark different than the normal traffic categorization uses. The routing rules and tables are set up only when something changes (gateway added/removed, failoverd restarted, etc.) but not for each ping cycle.

    When a ping is to be sent, a new iptables entry is used to mark pings from the router for the desired routing table, then the ping(s) are sent. Changing the iptables entry each time the ping target gateway changes does not appear to affect the routing cache, so the connection-to-connection stickiness needed for HTTPS sessions is maintained.

    When an interface is marked down, the routing table used by failoverd for pings should not be affected so pings can still be sent through a failed gateway.
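
A rough dry-run sketch of that split between one-time setup and per-ping work. The chain name NB_FO_PRE and the iptables/ping lines follow the patch side of the conflict quoted earlier in this thread. Everything else is an assumption made for illustration: the function names, the base number 200 for failoverd-only tables and marks, and the addresses. The actual patch may number things differently, and where NB_FO_PRE is hooked into the mangle table is not shown here.

```shell
# Hypothetical numbering: failoverd-only tables and marks start at 200.
FO_BASE=200

# usage: fo_setup_cmds GATEWAY_NUMBER GATEWAY_IP DEVICE
# Printed once when gateways change, not on every ping cycle.
fo_setup_cmds() {
    fo_id=$((FO_BASE + $1))
    # Dedicated table holding nothing but this gateway's default route.
    echo "ip route add default via $2 dev $3 table $fo_id"
    # One ip rule per gateway, keyed on a mark that normal traffic never uses.
    echo "ip rule add fwmark $fo_id table $fo_id"
}

# usage: fo_ping_cmds GATEWAY_NUMBER TARGET_IP TIMEOUT_SECONDS
# Printed for each probe; only iptables changes, the ip rules stay untouched.
fo_ping_cmds() {
    fo_id=$((FO_BASE + $1))
    echo "iptables -t mangle --flush NB_FO_PRE"
    echo "iptables -t mangle -A NB_FO_PRE -p icmp --icmp-type echo-request -d $2 -j MARK --set-mark $fo_id"
    echo "ping -c 1 -w $3 $2"
}

fo_setup_cmds 1 10.0.0.1 eth1
fo_ping_cmds 1 192.0.2.1 3
```

Because the per-ping step never adds or deletes an ip rule, it avoids the operation that appears to flush the routing cache.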


    Issues and questionable logic

    failoverd does not check gateways if their default route in their gateway specific routing table does not exist. I think this check can be removed with the new logic and should make it more robust.

    Tied into this, I’ve noticed that the pppoe interface does not always add its default route back into the routing tables. The result in ad hoc testing is that sometimes the ppp routes do not come back up. I’ve seen this perhaps 10% of the time. I suspect that removing the default route check in failoverd might paper over that issue.
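
To see whether a gateway's table has lost its default route, a small check along these lines may help. It only inspects text in the format printed by `ip route show table N`. The function name is made up, and the sample route lines are illustrative rather than taken from a real Zeroshell box.

```shell
# usage: ip route show table N | has_default_route
# Returns 0 if any line of the route listing on stdin starts with "default".
has_default_route() {
    grep -Eq '^default( |$)'
}

# Example with canned input instead of a live routing table:
if printf 'default via 10.0.0.1 dev eth1\n' | has_default_route; then
    echo "default route present"
else
    echo "default route missing"
fi
```

On a live system the canned `printf` would be replaced by the `ip route show table N` call for the gateway's table.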


    Summary

    I think this patch should work on beta 12, beta 13 and beta 14.

    This patch does a lot better job than my previous one that rolled over and died if the ppp0 connection died when the ETH connection was disabled or failed.

    If you feel adventurous please give it a try and let me know how it goes. If you provide feedback please also give some indication of your network topology (number of gateways, type (pppoe, etc.)).

    Thanks!

    #51549

    micampo
    Member

    Good explanation!

    Tested on b12 and b14. I have 4 lines, all eth with fixed IP, and I am not using QoS (QoS is done on a separate bridge box). What settings do you recommend?

    Please post the download link for the patch and the installation procedure.

    Thanks

    #51550

    atheling
    Member

    @micampo wrote:

    Good explanation!

    Tested on b12 and b14. I have 4 lines, all eth with fixed IP, and I am not using QoS (QoS is done on a separate bridge box). What settings do you recommend?

    Please post the download link for the patch and the installation procedure.

    Thanks

    Link was near top of my excessively long post. Instructions on how to install are in the .zip file.

    http://dl.dropbox.com/u/19663978/ZS_nb_qos_b14_a.zip
