1 | | -# csv |
| 1 | +# simdcsv |
2 | 2 |
3 | | -There are many kinds of CSV files; this project supports the format described |
| 3 | +simdcsv is a CSV parser that evaluates 64 bytes at a time. There are many kinds of CSV files; this project adheres to the format described |
4 | 4 | in [RFC 4180](https://www.rfc-editor.org/rfc/rfc4180.html). |
5 | 5 |
| 6 | +**Introduction** |
| 7 | + |
| 8 | +We can classify every character in a CSV file into one of four classes: COMMA, QUOTATION, NEW_LINE, or OTHER. We can build a perfect lookup table and use the NEON table-lookup intrinsic `vqtbl1q_u8` to classify 16 characters at once. Daniel Lemire calls this "vectorized classification" in the simdjson paper. [[code pointer]](https://github.com/friendlymatthew/simdcsv/blob/main/src/classifier.rs)
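To make the classification step concrete, here is a minimal scalar sketch. It is not the NEON code in the repository: it swaps the 16-lane `vqtbl1q_u8` lookup for a plain 256-entry table indexed by byte value. The names `build_table`, `CLASS_TABLE`, and `classify16` are hypothetical, the class numbers follow the example in this README, and treating only `\n` as NEW_LINE is an assumption.

```rust
// Class numbers as used in the README's example.
pub const COMMA: u8 = 0;
pub const QUOTATION: u8 = 1;
pub const NEW_LINE: u8 = 2;
pub const OTHER: u8 = 3;

/// Hypothetical lookup table: index by byte value, get that byte's class.
pub const fn build_table() -> [u8; 256] {
    let mut table = [OTHER; 256];
    table[b',' as usize] = COMMA;
    table[b'"' as usize] = QUOTATION;
    table[b'\n' as usize] = NEW_LINE; // assumption: only LF counts as NEW_LINE
    table
}

pub static CLASS_TABLE: [u8; 256] = build_table();

/// Classify 16 bytes with one table lookup per byte. This is a scalar
/// stand-in for the single vectorized lookup the README describes.
pub fn classify16(chunk: &[u8; 16]) -> [u8; 16] {
    let mut classes = [OTHER; 16];
    for (class, &byte) in classes.iter_mut().zip(chunk.iter()) {
        *class = CLASS_TABLE[byte as usize];
    }
    classes
}
```

Calling `classify16(b"aaa,bbb,ccc\n\"xyz")` yields `[3, 3, 3, 0, 3, 3, 3, 0, 3, 3, 3, 2, 1, 3, 3, 3]`.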
| 9 | + |
| 10 | +Once we classify every character, we can build a bitset for each class. We chunk through 64 characters at a time, building a `u64` for every chunk. Here is a naive case: |
| 11 | + |
| 12 | +``` |
| 13 | +[//]: # COMMA = 0, QUOTATION = 1, NEW_LINE = 2, OTHER = 3 |
| 14 | +
| 15 | +aaa,bbb,ccc |
| 16 | +33303330333 |
| 17 | +``` |
| 18 | + |
| 19 | +Then the bitsets look like: |
| 20 | + |
| 21 | +```rs |
| 22 | +comma_bitset = 0b00010001000 |
| 23 | +other_bitset = 0b11101110111 |
| 24 | +``` |
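The bitset construction can be sketched as a scalar loop. This is an illustration under assumptions, not the repository's implementation: the `Bitsets` struct and `build_bitsets` function are hypothetical names, and the layout (bit `i` describes byte `i`) is assumed because the README does not pin down the bit order.

```rust
/// Bitsets for one chunk of up to 64 bytes.
/// Assumed layout: bit `i` of each field describes byte `i` of the chunk.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Bitsets {
    pub comma: u64,
    pub quotation: u64,
    pub new_line: u64,
    pub other: u64,
}

/// Build one bitset per character class for a chunk of at most 64 bytes.
pub fn build_bitsets(chunk: &[u8]) -> Bitsets {
    assert!(chunk.len() <= 64, "a chunk holds at most 64 bytes");
    let mut sets = Bitsets::default();
    for (i, &byte) in chunk.iter().enumerate() {
        let bit = 1u64 << i;
        match byte {
            b',' => sets.comma |= bit,
            b'"' => sets.quotation |= bit,
            b'\n' => sets.new_line |= bit,
            _ => sets.other |= bit,
        }
    }
    sets
}
```

For the chunk `aaa,bbb,ccc`, `build_bitsets` returns `comma == 0b00010001000` and `other == 0b11101110111`, matching the values above.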
| 25 | + |
| 26 | +Now, we can just [count the number of leading zeros](https://doc.rust-lang.org/std/primitive.u64.html#method.leading_zeros) in the comma bitset to pull the csv entries. |
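Counting leading zeros fits a layout in which the first byte of the chunk is the most significant bit. The sketch below assumes the opposite layout (bit `i` is byte `i`, as in the later examples in this README), where `trailing_zeros` does the same job. The function name `split_fields` is hypothetical, and the sketch ignores quoting entirely.

```rust
/// Split one line into fields, given a bitset with a 1 at every comma.
/// Assumes bit `i` of `commas` corresponds to byte `i` of `line`.
pub fn split_fields<'a>(line: &'a [u8], mut commas: u64) -> Vec<&'a [u8]> {
    assert!(line.len() <= 64, "one u64 covers at most 64 bytes");
    let mut fields = Vec::new();
    let mut start = 0;
    while commas != 0 {
        // Index of the next comma is the number of zeros below it.
        let end = commas.trailing_zeros() as usize;
        fields.push(&line[start..end]);
        start = end + 1;
        // Clear the lowest set bit so the next iteration finds the next comma.
        commas &= commas - 1;
    }
    // Whatever follows the last comma is the final field.
    fields.push(&line[start..]);
    fields
}
```

With `line = b"aaa,bbb,ccc"` and `commas = 0b00010001000`, this returns the three fields `aaa`, `bbb`, and `ccc`.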
| 27 | + |
| 28 | +Bitsets are powerful when we want to check whether a symbol exists, count the number of symbols, or remove escaped symbols.
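The first two of these operations each reduce to a single integer operation on the `u64`. A minimal sketch, with hypothetical helper names:

```rust
/// True if the chunk contains at least one occurrence of the symbol.
pub fn contains_symbol(bitset: u64) -> bool {
    bitset != 0
}

/// Number of occurrences of the symbol in the chunk (a population count).
pub fn count_symbols(bitset: u64) -> u32 {
    bitset.count_ones()
}
```

For the comma bitset `0b00010001000`, `contains_symbol` returns `true` and `count_symbols` returns `2`.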
| 29 | + |
| 30 | +**Detecting Escaped Quotations and Commas** |
| 31 | + |
| 32 | +Consider the CSV row: `"aaa,norm","b""bb","ccc"`
| 33 | + |
| 34 | +In CSV, quotes are escaped by doubling them (`""`). The `""` in `b""bb` is field content, not a structural delimiter. We detect escaped pairs by finding adjacent quotes: |
| 35 | + |
| 36 | +```rs |
| 37 | +let escaped = q & (q << 1); // Find adjacent quote pairs |
| 38 | +let escaped = escaped | (escaped >> 1); // Mark both quotes in each pair |
| 39 | +let valid_quotes = q & !escaped; // Remove escaped quotes |
| 40 | +``` |
| 41 | + |
| 42 | +```rs |
| 43 | +quote_bitset   = 0b100010100110101000000001 // Quotes at positions 0, 9, 11, 13, 14, 17, 19, 23 (bit i = byte i)
| 44 | +q << 1         = 0b000101001101010000000010
| 45 | +escaped        = 0b000000000100000000000000 // Found the "" pair
| 46 | +escaped | >> 1 = 0b000000000110000000000000 // Both quotes marked
| 47 | +valid_quotes   = 0b100010100000101000000001 // Only structural quotes remain
| 48 | +``` |
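The three-line formula above can be wrapped in a function and checked against the example row. In this sketch `bitset_of` and `structural_quotes` are hypothetical names, and the layout (bit `i` is byte `i`) is an assumption.

```rust
/// Bitset with a 1 wherever `row` holds `symbol`.
/// Assumed layout: bit `i` corresponds to byte `i`.
pub fn bitset_of(row: &[u8], symbol: u8) -> u64 {
    assert!(row.len() <= 64, "one u64 covers at most 64 bytes");
    row.iter()
        .enumerate()
        .filter(|&(_, &b)| b == symbol)
        .fold(0u64, |acc, (i, _)| acc | (1u64 << i))
}

/// Drop escaped quote pairs (`""`), keeping only structural quotes.
pub fn structural_quotes(q: u64) -> u64 {
    let pairs = q & (q << 1); // second quote of each adjacent pair
    let escaped = pairs | (pairs >> 1); // both quotes of each pair
    q & !escaped
}
```

For the row `"aaa,norm","b""bb","ccc"`, `bitset_of(row, b'"')` gives `0b100010100110101000000001` and `structural_quotes` of that value gives `0b100010100000101000000001`. A run of three or more consecutive quotes produces overlapping pairs, so this formula alone would mark all of them as escaped; the sketch does not handle that case.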
| 49 | + |
| 50 | +**Marking Inside Quotations** |
| 51 | + |
| 52 | +With only structural quotes, we use parallel prefix XOR to mark all bits between quote pairs: |
| 53 | + |
| 54 | +```rs |
| 55 | +valid_quotes  = 0b100010100000101000000001 // Structural quotes only
| 56 | +inside_quotes = 0b011110011111100111111111 // 1 from each opening quote up to, but not including, its closing quote
| 57 | +``` |
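A prefix XOR can be computed with six shift-and-XOR steps on a `u64`. This is a portable sketch, not necessarily how the repository does it; simdjson, for example, gets the same result from a carry-less multiplication. The function name `prefix_xor` is hypothetical, and the layout assumes bit 0 is the first byte.

```rust
/// Running XOR: bit `i` of the result is the XOR of bits `0..=i` of `x`.
/// Applied to the structural-quote bitset, this sets every bit from an
/// opening quote up to (but not including) the matching closing quote.
pub fn prefix_xor(mut x: u64) -> u64 {
    x ^= x << 1; // each bit now covers 2 input bits
    x ^= x << 2; // 4 input bits
    x ^= x << 4; // 8 input bits
    x ^= x << 8; // 16 input bits
    x ^= x << 16; // 32 input bits
    x ^= x << 32; // all 64 input bits
    x
}
```

For `valid_quotes = 0b100010100000101000000001`, `prefix_xor` returns `0b011110011111100111111111`.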
| 58 | + |
| 59 | +Masking out commas inside quotes: |
| 60 | + |
| 61 | +```rs |
| 62 | +comma_bitset = 0b000001000000010000010000 // Commas at positions 4, 10, 18
| 63 | +valid_commas = comma_bitset & !inside_quotes
| 64 | +             = 0b000001000000010000000000 // Comma at 4 masked out
| 65 | +``` |
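Putting the steps together, the following sketch goes from a raw row to its fields. It is a scalar illustration under assumptions: all function names are hypothetical, bit `i` is taken to be byte `i`, the row must fit in 64 bytes, and surrounding quotes are left in the returned fields.

```rust
/// Bitset with a 1 wherever `row` holds `symbol` (assumed: bit i = byte i).
pub fn bitset_of(row: &[u8], symbol: u8) -> u64 {
    assert!(row.len() <= 64, "one u64 covers at most 64 bytes");
    row.iter()
        .enumerate()
        .filter(|&(_, &b)| b == symbol)
        .fold(0u64, |acc, (i, _)| acc | (1u64 << i))
}

/// Running XOR: bit `i` of the result is the XOR of bits `0..=i` of `x`.
pub fn prefix_xor(mut x: u64) -> u64 {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    x
}

/// Commas that act as field separators, i.e. those outside quoted fields.
pub fn valid_commas(row: &[u8]) -> u64 {
    let q = bitset_of(row, b'"');
    let pairs = q & (q << 1); // adjacent quote pairs
    let escaped = pairs | (pairs >> 1); // both quotes of each pair
    let inside_quotes = prefix_xor(q & !escaped);
    bitset_of(row, b',') & !inside_quotes
}

/// Split a row on its structural commas; quotes stay in the fields.
pub fn split_row(row: &[u8]) -> Vec<&[u8]> {
    let mut commas = valid_commas(row);
    let mut fields = Vec::new();
    let mut start = 0;
    while commas != 0 {
        let end = commas.trailing_zeros() as usize; // next structural comma
        fields.push(&row[start..end]);
        start = end + 1;
        commas &= commas - 1; // clear the lowest set bit
    }
    fields.push(&row[start..]);
    fields
}
```

For the example row, `valid_commas` returns `0b000001000000010000000000` and `split_row` returns the three fields `"aaa,norm"`, `"b""bb"`, and `"ccc"`.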
| 66 | + |
6 | 67 | ## Reading |
7 | 68 |
8 | 69 | https://www.rfc-editor.org/rfc/rfc4180.html<br> |