White paper on Coblue’s Mesh Network

Remco Bloemen

TODO: * https://aphyr.com/posts/313-strong-consistency-models * http://www.ics.forth.gr/tech-reports/2013/2013.TR439_Survey_on_Consistency_Conditions.pdf * http://www.bailis.org/papers/hat-hotos2013.pdf * http://lcamtuf.coredump.cx/afl/README.txt * https://en.wikipedia.org/wiki/Mutation_testing * https://blog.cloudflare.com/dns-parser-meet-go-fuzzer/

Goal: Get close to theoretical goodput on a 10s round trip time 90% loss connection.

Goal: Low latency.

Goal: Low overhead.

Goal: High security.

Goal: Thwart deep packet inspection. The protocol should be sufficiently obfuscated that conventional tools do not recognize the traffic.

Goal: Work in third world scenarios: http://www.spice-center.org/files/publications/LEDBAT_Performance_in_Sub-packet_Regimes.pdf

Sockets

UDP

TCP

HTTP

The HTTP socket is the fall back option. It’s goal is to work under any circumstance that HTTP requests are made, potentially through badly functioning proxy servers. It will work on the simplest proxy server that will cache every request in full, before replying to the client. As such, it does not use techniques like response chunking or WebSockets. It does use http pipelining, if available.

Requests are encoded as BMP formatted images and send using a POST request to a unique one time URL. The server delays response until it has data to send. On idle, the http connections has three requests long-polling on the server, and three slots available on the client.

To offset the large per-packet overhead and the lack of TCP congestion control if a TCP session is initiated per request (i.e. no pipelining), the packet size is set to 100 kilobytes.

The technique used is similar to BOSH. Where BOSH has at most two connections open, we have a limit of six. The steady state being three at the server and three at the client. Where BOSH is limited to at most one push per RTT, ours can do three.

Handshake

Given a datagram based Socket with the following operations:

uint maxMessageSize();
void send(ByteArray message);
ByteArray read();

Handshake:

A → B: AE ‖ msg
A ← B: BE ‖ enc[seed ‖ be AE, nonce₁](B) ‖ nonce ‖ enc[be AE ‖ b AE, nonce](msg)
A ⇉ B: enc[seed ‖ ae BE ‖ ae B, nonce₃](A) ‖ *secure channel*
A ⇆ B: *secure channel*

Secure channel:
A ⇇ B: nonce ‖ enc[ae BE ‖ ae B ‖ a BE ‖ AE, nonce](msg)
A ⇇ B: nonce ‖ enc[ae BE ‖ ae B ‖ a BE ‖ BE, nonce](msg)
seed = "\360\334\010\310\147\152\250\245\333\337\277\104\050\023\230\321"
       "\074\177\107\002\317\143\241\334\367\337\171\142\303\256\046\027"
nonce₁ = "\102\353\364\202\222\060\342\221\166\114\273\343\207\031\143\330"
         "\331\262\214\115\322\255\311\372\362\316\164\330\346\322\352\206"
nonce₃ = "\125\246\262\144\224\211\063\372\027\051\375\265\037\374\053\141"
         "\042\004\166\157\053\321\145\133\156\110\014\003\212\050\152\304"

Initiator:

  1. Send AE and repeat until a correct response is received.

Responder:

  1. Receive AE.
  2. Send response.

Proposal: If the RTT is very large, do a

B sends a recent PeerInfo as it’s first message. B can not have a send buffer at this point.

A sends a recent PeerInfo, followed by anything from it’s send buffer.

A secure channel has established authentication between two publick keys representing the two endpoints. The endpoints can then bidirectionally exchange datagrams.

It can signal readable if it has one or more messages in its receive buffer.

It can signal writeable if it has room in its send buffer for one or more messages.

The API returns the maximum size of a datagram. Write calls on larger datagrams will fail.

When writing a datagram, the API returns a sequence number. This is the number of datagrams written before it.

When reading datagrams, the API returns a sequence number. This is the same as the sequence number the write call on the remote end returned.

Although the sequence numbers correspond, read calls do not receive the datagrams in order.

Datagrams may fail to be delivered. Neither end gets notified of this. The reading end can discovered a failed transmission by a missing sequence number.

Handshake Flow Control

The handshake is critical and all packets are necessary. By protocol, a handshake packet must be received before the reply can be constructed. The protocol necessarily happens in lock-step. A natural solution is to keep broadcasting the packet until the correct response is received.

Every packet in the handshake can contain a message. The first message will have no security guarantees whatsoever. The second will have all the security guarantees, except the receiver is not authenticated. The third and last message is fully secure and can carry application data. Since it is send by the initiator, it likely contains the request the initiator wants to make to the other peer.

Every internet module must be able to forward a datagram of 68 octets without further fragmentation. This is because an internet header may be up to 60 octets, and the minimum fragment is 8 octets.

Every internet destination must be able to receive a datagram of 576 octets either in one piece or in fragments to be reassembled.

RFC 796 “Internet Protocol” page 25, September 1981.

IPv6 requires that every link in the internet have an MTU of 1280 octets or greater. On any link that cannot convey a 1280-octet packet in one piece, link-specific fragmentation and reassembly must be provided at a layer below IPv6.

RFC 2460 “Internet Protocol, Version 6 (IPv6)” section 5, December 1998.

We have to decide on an MTU for the handshake. The 68 byte value is too conservative: after subtracting the minimal 28 bytes for IPV4/UDP headers only 40 bytes are left. The largest message during the handshake is 128 bytes. To prevent fragmenting handshake messages we choose a fixed 576 byte MTU during the handshake. We rely on IP fragmentation and reassembly to handle extremely small MTUs during the handshake.

Initial flow control measurements happen during the handshake. All handshake messages include a 16 bit unique random cookie, a 16 bit time stamp in milliseconds and optionally a 16 bit cookie echo. On reception of a handshake packet the cookie will be echoed in the reply. The cookie allows the the connection to measure package loss and the round-trip-time. The time stamp allows the nodes to measure clock skew.

Only the initial flow control message is in clear text, but it reveals nothing. The first cookies is entirely random. For every connection a random offset is generated and consistently added to the clock. For the first message it acts as a one-time-pad encryption.

Retransmission follows an exponential back off. The first re-transmission happens after 20 milliseconds, subsequent retransmission intervals increase by 10% to a maximum of 500 ms. If a round trip time has been measured, twice the measured round trip time is used as the initial value instead of 20 milliseconds. This transmission strategy sends out 19 packages in the first second. After 5 seconds and 36 packets it stabilizes on the 500 ms interval. After 37 seconds and 100 packets it aborts. Retransmission continues until reception is confirmed.

TODO: Check if public keys are close enough to being uniformly random.

TODO: Although all bytes are indistinguishable from random noise, the packet lengths and timings may still reveal information. We could pad the handshake packets with random bits. We could introduce jitter.

Path Maximum Transmission Size Discovery

Packetization Layer Path MTU Discovery * https://tools.ietf.org/html/rfc4821

Congestion Control Principles * https://tools.ietf.org/html/rfc2914

Byte and Packet Congestion Notification * https://tools.ietf.org/html/rfc7141

Acknowledgement

The acknowledgement scheme is similar to NR-SACK. Selective acknowledgement without re-negging. In addition, retransmissions get new sequence numbers. Sequence numbers correspond to one exact send.

uint16 timestamp, uint8 number [ uint16 seqno, uint16 timestamp ]*

A packet with data always get’s ACKed.

A packet with only ACKs gets ACKed if its ACKs have ACKed data.

Other packages do not get ACKed.

The sender retransmits package contents after 2 round trip times.

Receiving data must be idempotent.

Congestion control

Jacobson / Karels algorithm

We give every transmitted package a unique sequence number, even on retransmit. Round-trip time measurements are therefore not influenced by retransmit. We also include the remotes reception timestamps. This allows the remote to delay sending the ACK without influencing the round trip time estimate. However, if the round-trip time changes during the time an ACK is delay, the measured RTT will be an average of the old and new RTT.

Karn/Patridge algorithm Algorithm The RTT estimator R̂\hat R is updated with RTT samples RR using:

R̂i+1=(1α)R̂i+αRi \hat R_{i + 1} = (1 - \alpha ) \hat R_i + \alpha R_i

with suggests α=.1.2\alpha = .1 \ldots .2. The retransmission timeout TT is defined as

Ti+1={2Tion retransmitmin(Tmin,max(βR̂i,Tmax))otherwise T_{i+1} = \begin{cases} 2 T_i & \text{on retransmit} \\ \min (T_{\text{min}},\max(\beta \hat R_i , T_{\text{max}})) & \text{otherwise} \end{cases}

with β=1.32.0\beta = 1.3 \ldots 2.0, Tmin=1sT_{\text{min}} = 1\,\text{s} second and Tmax=60sT_{\text{max}} = 60\,\text{s}.

Jacobson/Karels algorithm:

Δi=RiR̂iR̂i+1=R̂i+δΔiσi+1=σi+δ(|Δi|σi) \begin{aligned} \Delta _i &= R_i - \hat R_i \\ \hat R_{i + 1} &= \hat R_i + \delta \Delta _i \\ \sigma _{i+1} &= \sigma _i + \delta ( |\Delta _i | - \sigma _i ) \\ \end{aligned}

Ti=μR̂i+ϕσi T_i = \mu \hat R_i + \phi \sigma _i

with μ=1\mu = 1 and ϕ=4\phi = 4.

  • http://ee.lbl.gov/papers/congavoid.pdf

LEDBAT:

  • http://www.perso.telecom-paristech.fr/~drossi/paper/rossi10icccn.pdf
  • http://tma2012.ftw.at/TMA/pres/TMA2012-talk7.pdf
  • https://www.internetsociety.org/sites/default/files/pdf/accepted/17_delay_cc_pos-v2.pdf
  • http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.296.6350&rep=rep1&type=pdf

http://netlab.caltech.edu/publications/CJindelay_ccw03.pdf

Network model

The transit time of a packet is

delay+packet size + queue sizebandwidth \text{delay} + \frac{\text{packet size } + \text{ queue size}}{\text{bandwidth}}

R=d+Qb+1bm R = d + \frac{Q}{b} + \frac{1}{b} m

We assume delay and bandwidth to be fixed. The queue size has two components, our own queued packages and other packages. The queue size contribution of other packages is modeled as a random walk.

Transients: Rerouting can suddenly change delay and bandwidth. Any algorithm should quickly recover and converge to the new parameters.

Some simplifications: package corruption is treated the same as package loss, package order is ignored.

LEDBAT

(RFC 6817)[https://tools.ietf.org/html/rfc6817] “Low Extra Delay Background Transport (LEDBAT)”

Kalman filter

Network state:

Observables:

Controls:

asd

Rules

Idea: The above can be optimized, the CMP only needs to ACK the packet containing ACKs.

When to ACK: When the expected cost of a retransmit would be greater. This cost is a function of the retransmission probability and the delay and bandwidth of the retransmit.

The sender can flag a packet as to be ACKed. Any such packet will be ACKed.

Packages with data get a 0. Packages ACKing those get a 1. Packages ACKing those get a 2. Packages with a 2 are not ACKed.

0 Request two way ACK. 1 Request one way ACK. 2 No ACK required. 3

Delayed ACK versus immediate ACK?

https://tools.ietf.org/html/rfc2581#section-4.2 TCP Delayed Acknowledgement

Packet 1 data 0 → Packet 2 data 0 → Packet 3 data 0 →

← Packet 1: 1 Ack 1,2,3

→ Packet 4 0 data + Ack 1 → Packet 5 0 data → Packet 6 0 data

← Packet 2: 1 Ack 4,5,6

→ Packet 7: 2 Ack 2

← Packet 3: 1 Ack 4,5,6

→ Packet 8: 2 Ack 2,3

How long to wait for package re-order before considering a package drop? How to retransmit segments lost in dropped packages? What if the packet is lost due to being too large? (Retransmit won’t work)

Idea 1: Retransmit the entire package using the same sequence number. Easiest to implement. Might retransmit packages that do not require it (streaming protocols). (NOTE: this blocks the line retransmitting old data while more important new data might be in the send buffer)

Idea 2: Same as idea 1, except we remove all droppable segments. (ERROR: The packet would be encrypted with the same key. It needs to be identical or it will reveal the xor of the old and new content.)

Idea 3: The freed up space is filled with new segments. (ERROR: This doesn’t work, if we then receive an ACK for the seqno. we do not know if it was for the old or new set of segments. (This is not a problem if we fill using other dropable segments))

Idea 3: Drop the seqno entirely and put the non-droppable chunks back on the send queue. (NOTE: The client might still receive the old package, and thus double receive the segments. The old and new transmission can be received in any order. Receiving chunks must be idempotent.)

No renegging: If a sequence number is acknowledged it will remain acknowledged. No renegging. Since entire datagrams are lost, a lost sequence number means loss of all segments in that sequence. http://www.eecis.udel.edu/~amer/PEL/poc/pdf/ICNP2008-natarajanNonRenegableSacks.pdf

No retransmission: Is a sequence number is lost it will not be used again. Sequence numbers emited from the sender are strictly sequentially increasing.

The sender retransmits the segment data that was lost.

Detecting Reorder versus Loss:

http://research.microsoft.com/en-us/people/lguohan/reorder-icoin04.pdf

“on average 8 of the 50 packets were identified as being out of order”

“We analyze 208 thousand connections with totally 3.3 million data pack ets, and observe that about 3.2% of all the packets are reordered.”

“Figure 5 shows the packet lag of reordering and retransmission. 86.5% of reor- dered packets left behind only 1 packet and 95.3% of reordered packets left behind within 2 packets. Packet lag of retransmission has a dispersed distribution and larger average. About 78.8% of retransmitted packets left behind with a lag of 3 or more packets”

“The common portion of packets affected by reordering is in the order of 0.01 % to 1.65 %, or 1 % to 1.5 %”

So re-ordering happens to about max 5% of packets and 95% is re-ordered within 3 packets.

https://tools.ietf.org/html/rfc5236

Long Fat Networks: 10 Gbit/s, 50 ms RTT = 59.6 MB.

Configuration Bandwidth Round-trip time
Long fat network 10 Gb it/s 50 ms 64 MB
Satellite connection 512 kb it/s 900 ms 58 kB
Voyager Deep Space Ne twork 115.2 kbit/s 36 h 1.8 GB

Assuming packets of at least 1200 bytes a 64 MB bandwidth-delay product means 55924 packets unACKed.

https://tools.ietf.org/html/rfc3393 IP Packet Delay Variation Metric for IP Performance Metrics (IPPM)

Multiplexed Sub-channels

The multiplexer transmits datagrams of arbitrary size. Datagrams are prioritized and can have optional delivery guarantees.

When the socket is writeable the multiplexer takes datagrams that need to be sent in order of priority.

It turns the datagrams into arbitrarily sized segments and constructs a maximum datagram size datagram out of them for the lower layer.

For each segment of guaranteed-delivery datagrams it stores the sequence number it was included in.

Segmentation:

uint16 channel;
uint32 datagram_size;
uint32 datagram_offset;
uint16 segment_size; // in multiples of 8 bytes
uint8[] data;

A datagram to be send is transfered in chunks.

uint16 channel
uint16 datagram_no
uint16 offset
uint16 length
uint16 channel
uint16 datagram_no
uint16 offset
uint16 length

Note: Due to packet loss and changes in available room, the ACKed information of a datagram can be complex.

Given a secure channel that provides us with out-of-order unrepeated delivery of datagrams. This means send messages can be delivered in any order, or not at all but are never repeated.

The sequence number is available to higher levels of the stack.

Bundling: The Channel does not send every datagram immediately. It waits until the event loop is idle. This is useful if efficiency is more important than latency.

Selective Acknowledgement:

There are 2¹⁶ channels numbered from zero. The zeroth channel is opened on connection establishment and speaks the control protocol.

A channel is configured by setting the protocolId

ALC operates on datagrams

Channel has:

Implementation:

For every external Peer there is a PriorityQueue.

For incoming messages there is a PriorityQueue.

Comparison with alternatives

See: https://en.wikipedia.org/wiki/Transport_layer#COMPARISON1

https://en.wikipedia.org/wiki/SCTP_packet_structure

Appendix: Clocks and time synchronization

Round Trip Time Clock:

uint32 timestamp;

https://tools.ietf.org/html/rfc1323#page-11

https://en.wikipedia.org/wiki/Jacobson/Karels_algorithm

Note: TCP seems to maintain a lot of state on the wire in stead of in the ends. Sending the timestamp for the purpose of RTT estimation is unnecessary. The remote end can’t interpret the numbers anyway.

Hybrid Logical Clock:

uint32 seconds;
uint16 fractional;
uint16 counter;

http://www.muratbuffalo.blogspot.com/2014_07_01_archive.html

http://www.cse.buffalo.edu/tech-reports/2014-04.pdf

Network Time Protocol:

uint32 seconds;
uint32 fractional;

Proposed:

uint64 seconds;
uint64 fractional;

NTP over internet has about 10 ms resolution. Under the best circumstances it can be 1 ms. Under less advantageous circumstances it can be 100 ms.

It seems that 1 ms is the maximum reasonable resolution.

If we assume an inferior clock synchronization algorithm that is able to determine the clock delta between two peers up to 10 second resolution.

The client can then sent an uint16 clock representing the least significant bits of the true time in milliseconds. This has a range of more than a minute with ms resolution.

Combined Clock and Seqno:

“TCP timestamps (RFC 1323) play a double role: they avoid ambiguities due to the 32-bit sequence number field wrapping around, and they allow more precise RTT estimation in the presence of multiple losses per RTT. With those improvements, it becomes reasonable to increase the TCP window beyond 64 kB, which can be done using the window scaling option (RFC 1323).”

Appendix: Deep Packet Inspectors

We look at the implementation of several hard to detect protocols. Comparison study

PACE: PACE

nDPI: nDPI github

libprotoident: libprotoident github

Appendix: Overhead calculation

Total handshake overhead: 259 bytes. Likely more in practice due to early retransmissions of the first handshake message until an accurate round trip time is known.

Per packet overhead: 2 byte seqno + 8 byte tag + 3 byte header + 4 byte ack = 17 bytes.