White paper on Coblue's Mesh Network
TODO:
- https://aphyr.com/posts/313-strong-consistency-models
- http://www.ics.forth.gr/tech-reports/2013/2013.TR439_Survey_on_Consistency_Conditions.pdf
- http://www.bailis.org/papers/hat-hotos2013.pdf
- http://lcamtuf.coredump.cx/afl/README.txt
- https://en.wikipedia.org/wiki/Mutation_testing
- https://blog.cloudflare.com/dns-parser-meet-go-fuzzer/
Goal: Get close to theoretical goodput on a connection with a 10 s round-trip time and 90% packet loss.
Goal: Low latency.
Goal: Low overhead.
Goal: High security.
Goal: Thwart deep packet inspection. The protocol should be sufficiently obfuscated that conventional tools do not recognize the traffic.
Goal: Work in third world scenarios: http://www.spice-center.org/files/publications/LEDBAT_Performance_in_Sub-packet_Regimes.pdf
Sockets
UDP
TCP
HTTP
The HTTP socket is the fallback option. Its goal is to work under any circumstance that HTTP requests are made, potentially through badly functioning proxy servers. It will work on the simplest proxy server that caches every request in full before replying to the client. As such, it does not use techniques like response chunking or WebSockets. It does use HTTP pipelining, if available.
Requests are encoded as BMP formatted images and sent using a POST request to a unique one-time URL. The server delays its response until it has data to send. On idle, the HTTP connection has three requests long-polling on the server, and three slots available on the client.
To offset the large per-packet overhead and the lack of TCP congestion control if a TCP session is initiated per request (i.e. no pipelining), the packet size is set to 100 kilobytes.
The technique used is similar to BOSH. Where BOSH has at most two connections open, we have a limit of six, with a steady state of three at the server and three at the client. Where BOSH is limited to at most one push per RTT, ours can do three.
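The steady-state slot bookkeeping can be sketched as follows (a hypothetical Python model; the class and method names are ours, not part of the protocol):

```python
class LongPollSlots:
    """Models the HTTP long-poll steady state: three requests parked
    (long-polling) at the server, three free slots at the client."""
    TARGET_PARKED = 3

    def __init__(self):
        self.parked = 0  # requests currently held open by the server

    def refill(self):
        """Client tops up until three requests are parked again."""
        sent = 0
        while self.parked < self.TARGET_PARKED:
            self.parked += 1
            sent += 1
        return sent

    def server_replies(self, n=1):
        """Server consumes up to n parked requests to push data down."""
        used = min(n, self.parked)
        self.parked -= used
        return used

slots = LongPollSlots()
slots.refill()                     # idle steady state: 3 parked
pushed = slots.server_replies(2)   # server pushes two responses
refilled = slots.refill()          # client immediately re-parks two
```

This is why up to three pushes per RTT are possible: the server can answer all three parked requests before the client's refills arrive.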
Handshake
Given a datagram based Socket with the following operations:
uint maxMessageSize();
void send(ByteArray message);
ByteArray read();
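A minimal Python rendering of this interface (the abstract class and the in-memory loopback are illustrative, not part of the specification):

```python
from abc import ABC, abstractmethod

class DatagramSocket(ABC):
    """The minimal datagram socket interface assumed by the handshake."""

    @abstractmethod
    def max_message_size(self) -> int: ...

    @abstractmethod
    def send(self, message: bytes) -> None: ...

    @abstractmethod
    def read(self) -> bytes: ...

class LoopbackSocket(DatagramSocket):
    """In-memory implementation, useful for testing the layers above."""
    def __init__(self):
        self.queue = []

    def max_message_size(self):
        return 576  # the handshake MTU chosen later in this document

    def send(self, message):
        self.queue.append(message)

    def read(self):
        return self.queue.pop(0)

sock = LoopbackSocket()
sock.send(b"hello")
```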
Handshake:
A → B: AE ‖ msg
A ← B: BE ‖ enc[seed ‖ be AE, nonce₁](B) ‖ nonce ‖ enc[be AE ‖ b AE, nonce](msg)
A ⇉ B: enc[seed ‖ ae BE ‖ ae B, nonce₃](A) ‖ *secure channel*
A ⇆ B: *secure channel*
Secure channel:
A ⇉ B: nonce ‖ enc[ae BE ‖ ae B ‖ a BE ‖ AE, nonce](msg)
A ⇇ B: nonce ‖ enc[ae BE ‖ ae B ‖ a BE ‖ BE, nonce](msg)
seed = "\360\334\010\310\147\152\250\245\333\337\277\104\050\023\230\321"
"\074\177\107\002\317\143\241\334\367\337\171\142\303\256\046\027"
nonce₁ = "\102\353\364\202\222\060\342\221\166\114\273\343\207\031\143\330"
"\331\262\214\115\322\255\311\372\362\316\164\330\346\322\352\206"
nonce₃ = "\125\246\262\144\224\211\063\372\027\051\375\265\037\374\053\141"
"\042\004\166\157\053\321\145\133\156\110\014\003\212\050\152\304"
- Repeated reception of the same AE.
- Other handshake packets are tied to the first one by the ephemeral key.
uint8 nonce, uint16 send
uint8 nonce, uint16 send, uint8 origNonce, uint16 received
Initiator:
- Send AE and repeat until a correct response is received.
Responder:
- Receive AE.
- Send response.
Proposal: If the RTT is very large, do a
B sends a recent PeerInfo as its first message. B cannot have a send buffer at this point.
A sends a recent PeerInfo, followed by anything from its send buffer.
A secure channel has established authentication between two public keys representing the two endpoints. The endpoints can then bidirectionally exchange datagrams.
It can signal readable if it has one or more messages in its receive buffer.
It can signal writeable if it has room in its send buffer for one or more messages.
The API returns the maximum size of a datagram. Write calls on larger datagrams will fail.
When writing a datagram, the API returns a sequence number. This is the number of datagrams written before it.
When reading datagrams, the API returns a sequence number. This is the same as the sequence number the write call on the remote end returned.
Although the sequence numbers correspond, read calls do not receive the datagrams in order.
Datagrams may fail to be delivered. Neither end gets notified of this. The reading end can discover a failed transmission by a missing sequence number.
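The missing-sequence-number rule can be sketched as a small helper (hypothetical; assumes sequence numbers start at zero):

```python
def missing_seqnos(received):
    """Given the set of sequence numbers read so far, return the gaps
    below the highest number seen. A gap is either a loss or a packet
    still in flight (reordering); only time tells them apart."""
    if not received:
        return set()
    return set(range(max(received) + 1)) - set(received)

# Writes 0..5 happened on the remote end; 2 and 4 never arrived.
gaps = missing_seqnos({0, 1, 3, 5})
```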
Handshake Flow Control
The handshake is critical and all packets are necessary. By protocol, a handshake packet must be received before the reply can be constructed. The protocol necessarily happens in lock-step. A natural solution is to keep retransmitting a packet until the correct response is received.
Every packet in the handshake can contain a message. The first message will have no security guarantees whatsoever. The second will have all the security guarantees, except that the receiver is not authenticated. The third and last message is fully secure and can carry application data. Since it is sent by the initiator, it likely contains the request the initiator wants to make to the other peer.
Every internet module must be able to forward a datagram of 68 octets without further fragmentation. This is because an internet header may be up to 60 octets, and the minimum fragment is 8 octets.
Every internet destination must be able to receive a datagram of 576 octets either in one piece or in fragments to be reassembled.
— RFC 791 “Internet Protocol” page 25, September 1981.
IPv6 requires that every link in the internet have an MTU of 1280 octets or greater. On any link that cannot convey a 1280-octet packet in one piece, link-specific fragmentation and reassembly must be provided at a layer below IPv6.
— RFC 2460 “Internet Protocol, Version 6 (IPv6)” section 5, December 1998.
We have to decide on an MTU for the handshake. The 68 byte value is too conservative: after subtracting the minimal 28 bytes for IPv4/UDP headers, only 40 bytes are left. The largest message during the handshake is 128 bytes. To prevent fragmenting handshake messages we choose a fixed 576 byte MTU during the handshake. We rely on IP fragmentation and reassembly to handle extremely small MTUs during the handshake.
Initial flow control measurements happen during the handshake. All handshake messages include a 16 bit unique random cookie, a 16 bit timestamp in milliseconds and optionally a 16 bit cookie echo. On reception of a handshake packet the cookie will be echoed in the reply. The cookie allows the connection to measure packet loss and the round-trip time. The timestamp allows the nodes to measure clock skew.
Only the initial flow control message is in clear text, but it reveals nothing. The first cookie is entirely random. For every connection a random offset is generated and consistently added to the clock. For the first message it acts as a one-time-pad encryption.
Retransmission follows an exponential back-off. The first retransmission happens after 20 milliseconds; subsequent retransmission intervals increase by 10% to a maximum of 500 ms. If a round-trip time has been measured, twice the measured round-trip time is used as the initial value instead of 20 milliseconds. This transmission strategy sends out 19 packets in the first second. After 5 seconds and 36 packets it stabilizes on the 500 ms interval. After 37 seconds and 100 packets it aborts. Retransmission continues until reception is confirmed.
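The numbers in this paragraph can be reproduced with a short sketch (Python; the parameter names are ours):

```python
def retransmit_times(first=0.020, growth=1.10, cap=0.500, abort=100):
    """Send times (seconds) of the handshake retransmission schedule:
    initial send at t=0, first retry after 20 ms, intervals growing by
    10% up to a 500 ms cap, aborting after 100 packets."""
    times, t, interval = [0.0], 0.0, first
    while len(times) < abort:
        t += interval
        times.append(t)
        interval = min(interval * growth, cap)
    return times

times = retransmit_times()
in_first_second = sum(1 for t in times if t < 1.0)  # 19 packets
total = len(times)                                  # 100 packets
abort_time = times[-1]                              # about 37 seconds
```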
TODO: Check if public keys are close enough to being uniformly random.
TODO: Although all bytes are indistinguishable from random noise, the packet lengths and timings may still reveal information. We could pad the handshake packets with random bits. We could introduce jitter.
Path Maximum Transmission Size Discovery
Packetization Layer Path MTU Discovery
- https://tools.ietf.org/html/rfc4821
Congestion Control Principles
- https://tools.ietf.org/html/rfc2914
Byte and Packet Congestion Notification
- https://tools.ietf.org/html/rfc7141
Acknowledgement
The acknowledgement scheme is similar to NR-SACK. Selective acknowledgement without re-negging. In addition, retransmissions get new sequence numbers. Sequence numbers correspond to one exact send.
uint16 timestamp, uint8 number [ uint16 seqno, uint16 timestamp ]*
A packet with data always gets ACKed.
A packet with only ACKs gets ACKed if its ACKs have ACKed data.
Other packets do not get ACKed.
The sender retransmits packet contents after 2 round-trip times.
Receiving data must be idempotent.
Congestion control
Jacobson / Karels algorithm
We give every transmitted packet a unique sequence number, even on retransmit. Round-trip time measurements are therefore not influenced by retransmits. We also include the remote's reception timestamps. This allows the remote to delay sending the ACK without influencing the round-trip time estimate. However, if the round-trip time changes while an ACK is delayed, the measured RTT will be an average of the old and new RTT.
Karn/Partridge algorithm: The RTT estimator \hat R is updated with RTT samples R using:
\hat R_{i + 1} = (1 - α) \hat Rᵢ + α Rᵢ
with suggested α = 0.1 … 0.2. The retransmission timeout T is defined as
T_{i+1} = \begin{cases} 2 Tᵢ & \text{on retransmit} \\ \min (T_{\text{max}},\max(β \hat Rᵢ , T_{\text{min}})) & \text{otherwise} \end{cases}
with β = 1.3 … 2.0, T_{\text{min}} = 1\,\text{s} and T_{\text{max}} = 60\,\text{s}.
Jacobson/Karels algorithm:
\begin{aligned} Δᵢ &= Rᵢ - \hat Rᵢ \\ \hat R_{i + 1} &= \hat Rᵢ + δ Δᵢ \\ σ_{i+1} &= σᵢ + δ( |Δᵢ| - σᵢ) \\ \end{aligned}
Tᵢ = μ \hat Rᵢ + φ σᵢ
with μ = 1 and φ = 4.
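A sketch of the Jacobson/Karels estimator defined above (Python; the gain δ is not fixed in this document, so TCP's customary 1/8 is assumed):

```python
class JacobsonKarels:
    """RTT estimator and retransmission timeout per the formulas above:
    R-hat tracks the mean RTT, sigma the mean deviation, and the
    timeout is mu * R-hat + phi * sigma with mu = 1, phi = 4."""

    def __init__(self, delta=0.125, mu=1.0, phi=4.0):
        self.delta, self.mu, self.phi = delta, mu, phi
        self.rhat = None   # smoothed RTT estimate
        self.sigma = 0.0   # smoothed mean deviation

    def sample(self, r):
        if self.rhat is None:  # first sample seeds the estimator
            self.rhat = r
            return
        d = r - self.rhat
        self.rhat += self.delta * d
        self.sigma += self.delta * (abs(d) - self.sigma)

    def timeout(self):
        return self.mu * self.rhat + self.phi * self.sigma

est = JacobsonKarels()
for r in (100.0, 100.0, 100.0):  # three identical samples, no jitter
    est.sample(r)
```

With zero jitter the deviation term vanishes and the timeout equals the RTT estimate; real traffic keeps σ positive, which pads the timeout.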
- http://ee.lbl.gov/papers/congavoid.pdf
LEDBAT:
- http://www.perso.telecom-paristech.fr/~drossi/paper/rossi10icccn.pdf
- http://tma2012.ftw.at/TMA/pres/TMA2012-talk7.pdf
- https://www.internetsociety.org/sites/default/files/pdf/accepted/17_delay_cc_pos-v2.pdf
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.296.6350&rep=rep1&type=pdf
http://netlab.caltech.edu/publications/CJindelay_ccw03.pdf
Network model
The transit time of a packet is
\text{delay} + \frac{\text{packet size } + \text{ queue size}}{\text{bandwidth}}
R = d + \frac{Q}{b} + \frac{m}{b}, with d the delay, b the bandwidth, Q the queue size and m the packet size.
- Delay-dominated: delay is the primary source of RTT.
- Bandwidth-dominated: the time to transmit a single packet dominates the RTT.
- Congested: the queue of other packets dominates the RTT.
We assume delay and bandwidth to be fixed. The queue size has two components: our own queued packets and other packets. The queue size contribution of other packets is modeled as a random walk.
Transients: Rerouting can suddenly change delay and bandwidth. Any algorithm should quickly recover and converge to the new parameters.
Some simplifications: packet corruption is treated the same as packet loss, and packet order is ignored.
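The transit-time model can be written as a small helper (the units and example numbers are ours):

```python
def transit_time(delay, bandwidth, queue, size):
    """R = d + Q/b + m/b: propagation delay, plus time to drain the
    queue ahead of us, plus our own serialization time.
    Units: seconds, bytes/second, bytes, bytes."""
    return delay + (queue + size) / bandwidth

# A 1200-byte packet on a 1 Mbit/s link (125000 B/s) with 50 ms delay,
# behind a 12500-byte queue of other traffic:
r = transit_time(0.050, 125_000, 12_500, 1_200)
```

Each regime above corresponds to one term dominating: d (delay-dominated), m/b (bandwidth-dominated), or Q/b (congested).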
LEDBAT
RFC 6817 “Low Extra Delay Background Transport (LEDBAT)”: https://tools.ietf.org/html/rfc6817
- http://bittorrent.org/beps/bep_0029.html
Kalman filter
Network state:
- Base delay
- Bandwidth
- Loss model (Gilbert-Elliot loss model, 4 parameters)
- Path maximum transmission unit
- Queue size
- Current loss state (good / bad)
- Queue fill (and hence Queue full)
- Competing transmissions
- Clock offset
Observables:
- Delay
- Loss
Controls:
- TODO
- https://lwn.net/Articles/645115/
- http://caia.swin.edu.au/cv/dahayes/content/networking2011-cdg-preprint.pdf
- http://www.wseas.us/journals/cc/cc-27.pdf
- http://www.appexnetworks.com/white-papers/ZetaTCP.pdf
Rules
- There is no ACK for packages without data.
- A seqno stays in the ACK list until a CMP has been received.
- A CMP is generated for each entry in the ACK list.
Idea: The above can be optimized, the CMP only needs to ACK the packet containing ACKs.
- There is no ACK for packages that ACK ACKs.
- A seqno stays in the ACK list until a packet containing it has been ACKed.
- A CMP is generated for each entry in the ACK list.
When to ACK: When the expected cost of a retransmit would be greater. This cost is a function of the retransmission probability and the delay and bandwidth of the retransmit.
The sender can flag a packet as to be ACKed. Any such packet will be ACKed.
Packets with data get a 0. Packets ACKing those get a 1. Packets ACKing those get a 2. Packets with a 2 are not ACKed.
- 0: Request two-way ACK.
- 1: Request one-way ACK.
- 2: No ACK required.
- 3:
- No ACK required
- Request ACK
- Request immediate ACK
Delayed ACK versus immediate ACK?
https://tools.ietf.org/html/rfc2581#section-4.2 TCP Delayed Acknowledgement
Packet 1 data 0 → Packet 2 data 0 → Packet 3 data 0 →
← Packet 1: 1 Ack 1,2,3
→ Packet 4 0 data + Ack 1 → Packet 5 0 data → Packet 6 0 data
← Packet 2: 1 Ack 4,5,6
→ Packet 7: 2 Ack 2
← Packet 3: 1 Ack 4,5,6
→ Packet 8: 2 Ack 2,3
- NOTE: Repeating until ACK is not a viable strategy. We could transmit the ACK 1 to n times. For 10% loss, two transmissions are sufficient for 99% reception. For 10% transmission success, 7 repeats are needed to get a >50% chance of reception. This, however, assumes losses are uncorrelated, which is unlikely.
- NOTE: For a given loss rate, say 10%, there will be the retransmission of the 10%, plus the retransmission of the 10% whose ACKs were lost. This works cumulatively.
- A retransmission occurs if either the packet itself or its ACK is dropped. This roughly doubles the effective retransmission rate.
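The repeat counts in the first note can be checked with a sketch (which, as the note warns, assumes independent losses):

```python
def repeats_needed(success_prob, target):
    """Transmissions of the same packet needed so that the probability
    of at least one arriving reaches `target`, assuming each
    transmission independently succeeds with `success_prob`."""
    n, miss = 0, 1.0
    # The tiny epsilon guards against floating-point equality
    # exactly at the boundary (e.g. 1 - 0.9**2 vs 0.99).
    while 1.0 - miss + 1e-12 < target:
        miss *= 1.0 - success_prob
        n += 1
    return n

two = repeats_needed(0.90, 0.99)   # 10% loss: 2 transmissions
seven = repeats_needed(0.10, 0.50) # 10% success: 7 transmissions
```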
How long to wait for packet re-ordering before considering a packet dropped? How to retransmit segments lost in dropped packets? What if the packet is lost due to being too large? (Retransmission won't work.)
Idea 1: Retransmit the entire packet using the same sequence number. Easiest to implement. Might retransmit segments that do not require it (streaming protocols). (NOTE: this blocks the line retransmitting old data while more important new data might be in the send buffer.)
Idea 2: Same as idea 1, except we remove all droppable segments. (ERROR: The packet would be encrypted with the same key. It needs to be identical or it will reveal the xor of the old and new content.)
Idea 3: The freed-up space is filled with new segments. (ERROR: This doesn't work; if we then receive an ACK for the seqno, we do not know if it was for the old or new set of segments. (This is not a problem if we fill using other droppable segments.))
Idea 4: Drop the seqno entirely and put the non-droppable segments back on the send queue. (NOTE: The client might still receive the old packet, and thus receive the segments twice. The old and new transmission can be received in any order. Receiving segments must be idempotent.)
No renegging: if a sequence number is acknowledged it will remain acknowledged. Since entire datagrams are lost, a lost sequence number means loss of all segments in that sequence. http://www.eecis.udel.edu/~amer/PEL/poc/pdf/ICNP2008-natarajanNonRenegableSacks.pdf
No retransmission: if a sequence number is lost it will not be used again. Sequence numbers emitted by the sender are strictly sequentially increasing.
The sender retransmits the segment data that was lost.
Detecting Reorder versus Loss:
http://research.microsoft.com/en-us/people/lguohan/reorder-icoin04.pdf
"on average 8 of the 50 packets were identified as being out of order"
"We analyze 208 thousand connections with totally 3.3 million data packets, and observe that about 3.2% of all the packets are reordered."
"Figure 5 shows the packet lag of reordering and retransmission. 86.5% of reordered packets left behind only 1 packet and 95.3% of reordered packets left behind within 2 packets. Packet lag of retransmission has a dispersed distribution and larger average. About 78.8% of retransmitted packets left behind with a lag of 3 or more packets"
"The common portion of packets affected by reordering is in the order of 0.01 % to 1.65 %, or 1 % to 1.5 %"
So re-ordering happens to at most about 5% of packets, and 95% of re-ordered packets are re-ordered within 3 packets.
https://tools.ietf.org/html/rfc5236
Long Fat Networks: 10 Gbit/s, 50 ms RTT = 59.6 MB.
Configuration                 Bandwidth      Round-trip time   Bandwidth-delay product
Long fat network              10 Gbit/s      50 ms             64 MB
Satellite connection          512 kbit/s     900 ms            58 kB
Voyager Deep Space Network    115.2 kbit/s   36 h              1.8 GB
Assuming packets of at least 1200 bytes a 64 MB bandwidth-delay product means 55924 packets unACKed.
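The bandwidth-delay products above can be reproduced directly (the document rounds the long-fat-network figure up to a binary 64 MB; the unACKed packet count uses that rounded figure):

```python
def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: bytes in flight to keep the pipe full."""
    return bandwidth_bps * rtt_seconds / 8

lfn = bdp_bytes(10e9, 0.050)             # 62.5 MB, quoted as ~64 MB
sat = bdp_bytes(512e3, 0.900)            # 57.6 kB, quoted as 58 kB
voyager = bdp_bytes(115.2e3, 36 * 3600)  # ~1.87 GB, quoted as 1.8 GB

# 64 MiB of data in 1200-byte packets:
packets_unacked = (64 * 2**20) // 1200   # the document's 55924
```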
https://tools.ietf.org/html/rfc3393 IP Packet Delay Variation Metric for IP Performance Metrics (IPPM)
Header
5-n byte header
uint16 milliseconds;
uint16 counter;
uint8 nacks;
uint8[] rle_acks;
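A hypothetical packing of the fixed 5-byte part of this header (big-endian is an assumption; the document does not specify a byte order):

```python
import struct

def pack_header(milliseconds, counter, nacks, rle_acks=b""):
    """Fixed 5-byte header (uint16 ms, uint16 counter, uint8 nacks)
    followed by the variable-length run-length-encoded ACKs."""
    fixed = struct.pack(">HHB", milliseconds & 0xFFFF,
                        counter & 0xFFFF, nacks & 0xFF)
    return fixed + rle_acks

hdr = pack_header(1234, 7, 0)
```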
Multiplexed Sub-channels
The multiplexer transmits datagrams of arbitrary size. Datagrams are prioritized and can have optional delivery guarantees.
When the socket is writeable the multiplexer takes datagrams that need to be sent in order of priority.
It turns the datagrams into arbitrarily sized segments and packs these into a single datagram of the maximum size accepted by the lower layer.
For each segment of guaranteed-delivery datagrams it stores the sequence number it was included in.
Segmentation:
uint16 channel;
uint32 datagram_size;
uint32 datagram_offset;
uint16 segment_size; // in multiples of 8 bytes
uint8[] data;
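A hypothetical packing of this segmentation header (big-endian assumed, as before; the fixed part is 2 + 4 + 4 + 2 = 12 bytes before the data):

```python
import struct

def pack_segment(channel, dgram_size, dgram_offset, data):
    """Segment header per the layout above; segment_size is expressed
    in multiples of 8 bytes, so the data must be 8-byte aligned."""
    assert len(data) % 8 == 0, "segment_size is in multiples of 8 bytes"
    header = struct.pack(">HIIH", channel, dgram_size,
                         dgram_offset, len(data) // 8)
    return header + data

# A 64-byte slice at offset 0 of a 4096-byte datagram on channel 1:
seg = pack_segment(1, 4096, 0, b"\x00" * 64)
```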
A datagram to be sent is transferred in chunks.
uint16 channel
uint16 datagram_no
uint16 offset
uint16 length
Note: Due to packet loss and changes in available room, the ACKed information of a datagram can be complex.
Given a secure channel that provides us with out-of-order, unrepeated delivery of datagrams. This means sent messages can be delivered in any order, or not at all, but are never repeated.
The sequence number is available to higher levels of the stack.
Bundling: The Channel does not send every datagram immediately. It waits until the event loop is idle. This is useful if efficiency is more important than latency.
Selective Acknowledgement:
- Low-latency unreliable unordered delivery
There are 2¹⁶ channels numbered from zero. The zeroth channel is opened on connection establishment and speaks the control protocol.
- Immediate Selective Acknowledge
- Unordered (but can still be fragmented)
- First fragment
- Last fragment
A channel is configured by setting the protocolId
- CONFIGURE(protocolId, channel options)
- CONFIGURE-ACK(protocolId, channel options)
- CONFIGURE-COMPLETE
ALC operates on datagrams
- Request-response
- Bi-directional Unreliable Low-Latency Datagram Stream
- Publish-subscribe
- Push-pull
Channel has:
- Priority.
- Reliable/unreliable delivery.
Implementation:
For every external Peer there is a PriorityQueue.
For incoming messages there is a PriorityQueue.
- Max one connection per peer.
- Max one thread per core.
- Each thread handles multiple connections.
- Only async IO.
Comparison with alternatives
See: https://en.wikipedia.org/wiki/Transport_layer#COMPARISON1
https://en.wikipedia.org/wiki/SCTP_packet_structure
- CurveCP
- ∅MQ
- SCTP
- MUX/SMUX http://www.w3.org/Protocols/MUX/ http://www.w3.org/TR/WD-mux
- https://tools.ietf.org/html/rfc2488 Enhancing TCP Over Satellite Channels
- https://tools.ietf.org/html/rfc2018 TCP Selective Acknowledgment Options
- https://tools.ietf.org/html/rfc1323 TCP Extensions for High Performance
- https://tools.ietf.org/html/rfc4960 Stream Control Transmission Protocol
Appendix: Clocks and time synchronization
Round Trip Time Clock:
uint32 timestamp;
https://tools.ietf.org/html/rfc1323#page-11
https://en.wikipedia.org/wiki/Jacobson/Karels_algorithm
Note: TCP seems to maintain a lot of state on the wire instead of in the endpoints. Sending the timestamp for the purpose of RTT estimation is unnecessary; the remote end can't interpret the numbers anyway.
Hybrid Logical Clock:
uint32 seconds;
uint16 fractional;
uint16 counter;
http://www.muratbuffalo.blogspot.com/2014_07_01_archive.html
http://www.cse.buffalo.edu/tech-reports/2014-04.pdf
Network Time Protocol:
uint32 seconds;
uint32 fractional;
Proposed:
uint64 seconds;
uint64 fractional;
NTP over internet has about 10 ms resolution. Under the best circumstances it can be 1 ms. Under less advantageous circumstances it can be 100 ms.
It seems that 1 ms is the maximum reasonable resolution.
Assume an inferior clock synchronization algorithm that can determine the clock delta between two peers only to within 10 seconds. The client can then send a uint16 clock representing the least significant bits of the true time in milliseconds. This has a range of more than a minute at millisecond resolution.
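Recovering the full timestamp from the 16 truncated bits can be sketched as follows (helper name ours; it works whenever the local estimate is within half the 65.536 s wrap range, which the 10 s synchronization bound comfortably guarantees):

```python
def recover_time(truncated_ms, estimate_ms):
    """Reconstruct a full millisecond timestamp from its low 16 bits,
    given an estimate of the true value that is good to well under
    half the 2^16 ms (65.536 s) wrap-around range."""
    span = 1 << 16
    base = estimate_ms - (estimate_ms % span)
    # The true value is congruent to truncated_ms mod 2^16; pick the
    # candidate closest to the estimate.
    candidates = (base - span + truncated_ms,
                  base + truncated_ms,
                  base + span + truncated_ms)
    return min(candidates, key=lambda t: abs(t - estimate_ms))

true_time = 10_000_123                    # ms, on the sender's clock
wire = true_time % (1 << 16)              # the 16 bits actually sent
recovered = recover_time(wire, true_time + 8_000)  # 8 s estimate error
```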
Combined Clock and Seqno:
"TCP timestamps (RFC 1323) play a double role: they avoid ambiguities due to the 32-bit sequence number field wrapping around, and they allow more precise RTT estimation in the presence of multiple losses per RTT. With those improvements, it becomes reasonable to increase the TCP window beyond 64 kB, which can be done using the window scaling option (RFC 1323)."
Appendix: Deep Packet Inspectors
We look at the implementation of several hard to detect protocols. Comparison study
PACE: PACE
libprotoident: libprotoident github
Appendix: Overhead calculation
- First handshake message: 56 byte public key + 2 byte cookie + 2 byte timestamp = 60 bytes
- Second handshake message: 2 * 56 byte public key + 16 byte tag + 2 byte cookie + 2 byte timestamp + 2 byte reply = 134 bytes
- Third handshake message: 56 byte public key + 16 byte tag + 6 byte cookie/reply - 13 byte saved channel overhead = 65 bytes
Total handshake overhead: 259 bytes. Likely more in practice due to early retransmissions of the first handshake message until an accurate round trip time is known.
Per packet overhead: 2 byte seqno + 8 byte tag + 3 byte header + 4 byte ack = 17 bytes.
- Not included yet: hybrid logical clocks.
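The arithmetic above can be reproduced directly:

```python
# Per-message handshake overhead, per the byte counts listed above.
handshake = [
    56 + 2 + 2,               # first: key + cookie + timestamp
    2 * 56 + 16 + 2 + 2 + 2,  # second: two keys, tag, cookie, ts, reply
    56 + 16 + 6 - 13,         # third: key, tag, cookie/reply, saved hdr
]
total_handshake = sum(handshake)  # 259 bytes

# Per-packet overhead: seqno + tag + header + ack.
per_packet = 2 + 8 + 3 + 4        # 17 bytes
```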