|
|
# Collector design draft and notes
|
|
|
|
|
|
* C99 with [libUCW](http://www.ucw.cz/libucw/doc/ucw/) and protobuf-c
|
|
|
* PLANNED: Time frames (approx. 1-300 s), soft rate-limiting
|
|
|
|
|
|
## Input
|
|
|
* Currently only PCAP
|
|
|
* Looking at libtrace and SOCK_RAW sockets
|
|
|
* Supports truncated packets (length checks in all the code)
|
|
|
|
|
|
## TCP/IP status and assumptions
|
|
|
* Accepts both IPv4 and IPv6
|
|
|
* Currently drops IPv6 with extra headers (TODO: skip them, detect fragmentation headers) (none encountered in `akuma` data)
|
|
|
* No IP fragment reconstruction
|
|
|
* Not planned (rather technical, separate for IPv4 and IPv6, ...)
|
|
|
* Opening a SOCK_RAW socket handles IP reconstruction in the kernel
|
|
|
* Should not happen too much anyway (very few requests have >100 bytes, very few responses have >1000 bytes)
|
|
|
* TCP is limited to (single request, single response) streams, TCP options accepted but ignored
|
|
|
* These short TCP connections seem to be (almost?) all the cases in "akuma" data
|
|
|
* Find out: how many long TCP conns are there?
|
|
|
* PLANNED: TCP flow reconstruction, keeping open connections (currently ignores SYN/ACK/FIN)
|
|
|
* UDP fully supported
|
|
|
* Dropping all packets with data size mismatches etc.
|
|
|
|
|
|
## DNS status
|
|
|
* Dropping packets with `OPCODE != QUERY`
|
|
|
* Store some other opcode? (IQUERY is obsolete, STATUS?)
|
|
|
* Dropping packets with QNAME length above 254 (by RFC)
|
|
|
* Only accepting packets with exactly 1 QNAME
|
|
|
* Dropping packets with "compressed" QNAME, see [RFC section](https://tools.ietf.org/html/rfc1035#section-4.1.4)
|
|
|
* Find out: are those still used?
|
|
|
* Dropping packets with the snapshot (captured part) ending before the entire DNS QNAME part (should not happen with reasonable snaplen)
|
|
|
* TODO NEXT: Actually match the queries and responses
|
|
|
|
|
|
## Output
|
|
|
* Author: Tomáš Gavenčiak, tomas.gavenciak@nic.cz
|
|
|
* Date: 22 Mar 2016
|
|
|
|
|
|
* Modular, not dependent on protobufs (Can include CBOR or other if needed.)
|
|
|
* PLANNED: separate threads for:
|
|
|
* 1x packet collection, parsing, dumping and matching responses with requests (hash table)
|
|
|
* (1+)x time frame serialization and writing (file, socket or database)
|
|
|
|
|
|
### Protobuf
|
|
|
* Implemented a message for request+response pair writing (`dnsquery.proto`)
|
|
|
* PLANNED: Configurable which attributes are included
|
|
|
### Dumping dropped packets
|
|
|
* Configurable dump/drop by category
|
|
|
* PLANNED: Rotate pcap files with time frames
|
|
|
* PLANNED: Soft rate-limiting to prevent choking

# DNS collector design draft
|
|
|
|
|
|
* Author: Tomáš Gavenčiak, tomas.gavenciak@nic.cz
|
|
|
* Date: 1st Mar 2016
|
|
|
|
|
|
## Operation and main structures
|
|
|
|
|
|
### Struct collector
|
|
|
|
|
|
Main container for a collector instance (try to avoid global state).
|
|
|
|
|
|
#### Has
|
|
|
* Configuration structure (given / loaded before init) (incl. outputs)
|
|
|
* Current and previous timeframe
|
|
|
* Queue of timeframes to write (thread safe) and writer thread(s)
|
|
|
* Basic stats on program run (time, packets collected/dropped)
|
|
|
|
|
|
#### Setup
|
|
|
Gets a configuration struct, initializes self and opens a packet capture
|
|
|
(file list or live, applying capture length, promiscuous settings and BPF filters).
|
|
|
|
|
|
#### Main thread operation
|
|
|
Main thread collects a packet from the input and parses its data (IP/UDP/DNS headers).
|
|
|
If the time is past the current timeframe, does a frame rotation (see below).
|
|
|
When the packet is invalid (malformed, unsupported network feature, ...), it is dropped and optionally dumped via one of the outputs.
|
|
|
|
|
|
#### Frame rotation
|
|
|
The timeframes are approx. 0.1-10 sec long time windows (configurable). Any response packet is matched to a request packet in the current or the previous timeframe (so a response delayed up to the frame length is always matched). When a packet beyond the current timeframe is read, the frames are rotated: the previous frame is queued for writing the frame (see below), and a new current frame is started.
|
|
If a packet arrives out of order (with time smaller than the previous packet, as in wrong ordering of PCAP files),
|
|
|
a warning is issued and it is processed as if it had the time of the last in-order packet.
|
|
|
|
|
|
#### Writer thread
|
|
|
One or more writer threads picking up timeframes from the queue and writing their packets to the outputs.
|
|
|
Destroy the packets and timeframes afterwards. If a timeframe is the last one to use an output file, that file
|
|
|
is closed.
|
|
|
|
|
|
The timeframes have to be processed in the order of creation.
|
|
|
|
|
|
#### Current state
|
|
|
* The writeout is done in the same thread.
|
|
|
* Only one output file per configured output is open.
|
|
|
* Stats to keep track of are not finalised.
|
|
|
|
|
|
### Struct config
|
|
|
|
|
|
Holds collector configuration and configured inputs and outputs.
|
|
|
Configured via [libucw configuration system](http://www.ucw.cz/libucw/doc/ucw/conf.html).
|
|
|
|
|
|
### Struct timeframe
|
|
|
|
|
|
Structure for queries within a time window (approx. 1-10 sec, configurable). Contains all requests within that window, their matching responses within that or the next timeframe, and responses within this window matched to requests of the previous one. Under memory pressure, queued timeframes may be dropped for "slow" outputs (e.g. full query dumps) and not for "fast" ones (e.g. counting-only statistics).
|
|
|
|
|
Shared state (with locks) should be accessed only a few times per timeframe, not per packet.
|
|
|
|
|
|
#### Has
|
|
|
* List of packets to write - possibly with rate-limiting per timeframe (linked list).
|
|
|
* List of dropped packets to dump - likely with rate-limiting per timeframe (linked list).
|
|
|
* Hash containing unmatched requests (by IPver, TCP/UDP, client/server port numbers, client/server IPs, DNS ID and QNAME)
|
|
|
* Possibly: a memory pool for all the packet data
|
|
|
|
|
|
#### Query hash
|
|
|
The hash is a fixed-size table of configurable order. Rationale: rehashing could cause a lot of latency in the main thread.
|
|
|
A big enough hash for the upper limit of packets in the timeframe (hard limit or just estimated) takes about 3% of the memory used by the packets themselves,
|
|
|
so a big enough table can be easily afforded within the expected memory usage.
|
|
|
|
|
|
The hash is a linked list of packets in each bucket (with the "next" ptr within the packet struct).
|
|
|
|
|
|
#### Limiting memory use
|
|
|
|
|
|
The number of requests (and unmatched responses) in the frame should be bounded by a configurable constant. This should be a soft limit (e.g. packets should be dropped more frequently when approaching the limit). When a request is accepted, its response should always be accepted.
|
|
|
|
|
|
**Estimate:** with limit 1Mq per frame, approx. 200 B/q (in memory) and 5x 1s frames in the queue, 1GB of memory should suffice for the packets.
|
|
|
|
|
|
**Question:** What to do with the (not dropped) responses to interface-dropped requests?
|
|
|
|
|
|
**Rationale:** The packets in the timeframes take up most of collector memory. Since the memory use of a single packet is bounded by the packet capture bound plus a fixed overhead, bounding the packet number per timeframe is an easy and predictable way to limit total memory use. An alternative would be a single packet budget shared by all frames, but keeping these numbers in sync between the threads adds complexity.
|
|
Another alternative is considering the total memory usage of the program. Not sure how technically viable and
|
|
|
reliable (what to measure? would such memory usage shrink on `free()`?), and might not be very predictable.
|
|
|
|
|
|
### Struct packet
|
|
|
|
|
|
Holds data about a single query packet. Uses libtrace to handle packet data management and dissection.
|
|
|
The DNS parsing is done by a simple header and QNAME label reading without compression. The remaining DNS data (e.g. resource records) is not parsed. Replies should in principle be recomputable from requests.
|
|
If it is necessary to store all the information, a full PCAP (in a separate process)
|
|
|
could be more appropriate.
|
|
|
|
|
|
#### Has
|
|
|
* Raw packet data: timestamp, real length, capture length, packet data

* Addresses, ports, transport info
|
|
|
* DNS header data, qname as a printable string (dot notation)
|
|
|
* Request may have a matching response packet. In this case the response is owned by the request
|
|
|
* (Next packet in hash bucket, next packet in timeframe)
|
|
|
|
|
|
#### Packet network features
|
|
|
Handles both IPv4 and IPv6, as well as UDP.
|
|
|
|
|
|
Does not currently handle packet defragmentation. This would be nontrivial to do right and to manage resources for, and fragmented packets appear to be rare. TCP flow could be reconstructed, but it seems less of a priority. Currently only streams of a single request followed by a single response
|
|
(not counting SYN, ACK and FIN packets) are processed, longer streams are dropped. Longer packets and
|
|
|
long-open TCP connections seem to be uncommon.
|
|
|
|
|
|
### Stats
|
|
|
|
|
|
Very basic statistics for the collector (time, dropped/read packets, dropped frames), the timeframes (dropped/read packets),
|
|
|
the outputs (dropped/read packets, dropped timeframes, written items and bytes before/after compression). Not clear what all to measure. Any DNS data statistics should be handled by an output module.
|
|
|
|
|
Currently partially implemented.
|
|
|
|
|
|
### Outputs
|
|
|
|
|
|
Each output type extends a basic output structure. This basic structure contains the current open file and filename
|
|
|
(or socket, etc.), time of opening, rotation period, compression settings, basic statistics (bytes written, frames dropped, ...) and the set of fields to write, e.g.: flags(IPv4/6,TCP/UDP) client-addr client-port server-addr server-port id qname qtype qclass.
|
|
Every output has a pathname template with strftime() replacement. An output can be compressed on the fly (which saves
|
|
|
disk space and also write time). Fast compression (LZ4, ...) is preferred.
|
|
|
|
|
|
#### Memory usage limits
|
|
|
The maximum length of the timeframe queue of every output should be bounded (and configurable).
|
|
|
When exceeded, oldest timeframe not being currently processed should be dropped.
|
|
|
Rationale: Together with timeframe size this predictably limits
|
|
|
total memory usage. Dropping data on lagging (e.g. IO-bound) outputs is preferable to dropping packets on input
|
|
|
and therefore missing them on fast (e.g. counting) outputs.
|
|
|
|
|
|
#### Disk usage limits
|
|
|
Optional. When approaching a per-output-file size limit, softly introduce query skipping.
|
|
|
|
|
|
#### CSV output
|
|
|
Optional header line, configurable separator, configurable field set.
|
|
|
Actually not much larger than Protocol Buffers when compressed (e.g. with just the very fast "lz4 -4": 33 B/query CSV, 29 B/query Protobuf).
|
|
|
Most commonly accepted format. No quoting necessary with e.g. "|" delimiter.
|
|
|
|
|
|
#### Protocol Buffer output
|
|
|
Similar to CSV, configurable field set, one length-prefixed (16-bit) protobuf message per query.
|
|
|
The `protobuf-c` library seems to use reflection when serialising rather than fully generated code (as protobuf does in C++), so the speed is not great (comparable to CSV?).
|
|
|
|
|
|
#### PCAP
|
|
|
Currently only used for dropped packets. Should be rate-limited (with softly increasing drop-rate).
|
|
|
|
|
|
#### Current state
|
|
|
Timeframes ready for output are processed immediately in the main thread (no output queue, no rate limiting).
|
|
|
|
|
|
## Inputs
|
|
|
|
|
|
The input is either a single interface, or a list of pcap files to be processed in the given order. When reading pcap files, the "current" time follows the recorded times.
|
|
|
|
|
|
For online capture, multiple interfaces are supported. Configurable promiscuous mode, BPF filtering.
|
|
|
|
|
|
**Note:** Multiple reader threads are hard to support, as the access to the query hash would have to be somehow guarded. Since the main congestion is expected to be at the outputs, this may not be a problem. If required in the future, this can be a (very) advanced feature.
|
|
|
|
|
|
[Libtrace](http://research.wand.net.nz/software/libtrace.php) is preferred to tcpdump's PCAP for the larger
|
|
|
feature set, implemented header and layer skipping, larger set of inputs (including kernel ring buffers).
|
|
|
|
|
|
#### Current state
|
|
|
libPcap is used to read pcap files; live capture could be implemented easily, but a switch to libtrace is expected (it would be just as easy to implement there, with additional benefits when parsing the layers).
|
|
|
|
|
|
## Configuration / options
|
|
|
|
|
|
Configuration is read by the [libucw configuration system](http://www.ucw.cz/libucw/doc/ucw/conf.html).
|
|
|
Configuration should allow setting predictable limits on memory usage and potentially disk usage. The amount of missed packets should not be significant relative to the frequency of the captured queries.
|
|
Supporting online reconfiguration would greatly complicate the program and potentially introduce bugs and memory leaks. A potential exception could be the BPF filter string. What would be good use-cases or easily tunable parameters?
|
|
|
|
|
|
## Language and libraries
|
|
|
|
|
|
Language standard is C99. The proposed libraries are:
|
|
|
* [Libtrace](http://research.wand.net.nz/software/libtrace.php) for packet capture, dissection and dumping.
|
|
|
* [libUCW](http://www.ucw.cz/libucw/) for configuration parsing, logging, mempools (in the future?) and some data structures (currently doubly linked lists). Replaceable but convenient.
|
|
|
* libLZ4, libgz, ... for online (de)compression of input pcaps and output files. Partially implemented separately, but also part of libtrace.
|
|
|
* protobuf-c for writing protocol buffers.

* CBOR ([libcbor](http://libcbor.org/) or another implementation) as a possible alternative output format.
|
|
|
|
|
|
## Logging and reports
|
|
|
|
|
|
Currently using libucw logging system and configured via the same config file. Includes optional log file rotation.
|
|
|
A sub-logger for potentially frequent messages with rate-limiting is also configured by default.
|
Input and output statistics should be logged (e.g. on output file rotation).
|
|
|
Statistical outputs might include summary statistics. No other reporting mechanism is currently designed.
|
|
|
|
|
|
## Questions
|
|
|
|
|
|
* Libtrace vs libPCAP.
|
|
|
Currently: libPCAP. Tomas: in favor of libtrace.
|
|
|
* One thread per output vs one writer thread.
|
|
|
Currently: No threads (WIP). Tomas: in favor of one thread per output
|
|
|
|
|
|
* Runtime control and reconfiguration - how much control is desired and useful? How to implement it? Currently: No runtime control.
|
|
|
|
|
|
* Which output modules to support? CSV, Protobuf, counting stats (DSC-like?), CBOR, ...
|
|
|
|
|
|
# Older (process again):
|
|
|
|
|
|
## TCP/IP status:
|
|
|
|
|
|
* Accepts both IPv4 and IPv6
|
|
|
* No IP fragment reconstruction
|
|
|
* Not planned (rather technical, separate for IPv4 and IPv6, ...)
|
|
|
* Should not happen too much anyway (very few requests have >100 bytes, very few responses have >1000 bytes)
|
|
|
* TCP is currently dropped ~~limited to (single request, single response) streams, TCP options accepted but skipped~~
|
|
|
* These short TCP connections seem to be (almost?) all the cases in "akuma" data
|
|
|
* PLANNED: TCP flow reconstruction, keeping open connections (currently ignores SYN/ACK/FIN)
|
|
|
* UDP fully supported
|
|
|
* Dropping all packets with data size mismatches etc. Optional PCAP dump of such packets (currently deactivated, TODO)
|
|
|
|
|
|
## DNS status
|
|
|
|
|
|
* Dropping packets with QNAME length above 254 (by RFC)
|
|
|
* Dropping packets with the snapshot (captured part) ending before the entire DNS QNAME part (should not happen with reasonable snaplen)
|
|
|
* Matching pairs on (IPver, transport, client IP, client port, server IP, server port, DNS ID, QNAME, QType, QClass)
|
|
|
* QNAME, QType, QClass might not be present in all responses (e.g. NOT_IMPL)
|
|
|
* server IP/port is redundant for ICANN, but might be useful for client watching (or upstream from a DNS cache, ...?)
|
|
|
|
|
|
## Output
|
|
|
|
|
|
* Modular, currently CSV and obsolete (but close to working) Protobuf
|
|
|
* Optional in-line compression (currently lz4), stored-field selection and time-based output file rotation (based on filename format string)
|
|
|
* Separate threads for:
|
|
|
* main thread - packet collection, parsing and matching responses with requests
|
|
|
* thread for every configured output
|
|
|
* multiple completely independent instances of the same output type can be configured
|
|
|
|
|
|
### Dumping dropped packets
|
|
|
* Configurable dump/drop by category |