Commit ce5805f

Update README
1 parent ea07749 commit ce5805f

1 file changed

+63
-2
lines changed

README.md

Lines changed: 63 additions & 2 deletions
@@ -1,8 +1,69 @@
1-
# csv
1+
# simdcsv
22

3-
There are many kinds of CSV files; this project supports the format described
3+
simdcsv is a CSV parser that evaluates 64 bytes at a time. There are many kinds of CSV files; this project adheres to the format described
44
in [RFC 4180](https://www.rfc-editor.org/rfc/rfc4180.html).
55

6+
**Introduction**
7+
8+
Every byte in a CSV file falls into one of four classes: COMMA, QUOTATION, NEW_LINE, or OTHER. We build a perfect lookup table and use the NEON intrinsic `vqtbl1q_u8` to classify 16 bytes at once. Daniel Lemire calls this "vectorized classification" in the simdjson paper. [[code pointer]](https://github.com/friendlymatthew/simdcsv/blob/main/src/classifier.rs)
9+
10+
Once every byte is classified, we build a bitset for each class. We walk the input 64 bytes at a time and build one `u64` per class for every chunk. Here is a simple case:
11+
12+
```
13+
// COMMA = 0, QUOTATION = 1, NEW_LINE = 2, OTHER = 3
14+
15+
aaa,bbb,ccc
16+
33303330333
17+
```
18+
19+
Then the bitsets look like:
20+
21+
```rs
22+
comma_bitset = 0b00010001000
23+
other_bitset = 0b11101110111
24+
```
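
As a rough illustration of the bitset construction, here is a scalar sketch in Rust. It is not simdcsv's implementation: the `match` stands in for the `vqtbl1q_u8` lookup, and the names `Class`, `Bitsets`, and `build_bitsets` are made up for this example. It assumes byte `i` of a chunk maps to bit `63 - i` (first byte in the most significant bit) and that only `\n` counts as NEW_LINE.

```rust
/// Character classes, numbered as in the example above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Class {
    Comma = 0,
    Quotation = 1,
    NewLine = 2,
    Other = 3,
}

/// Scalar stand-in for the vectorized table lookup.
fn classify(byte: u8) -> Class {
    match byte {
        b',' => Class::Comma,
        b'"' => Class::Quotation,
        b'\n' => Class::NewLine, // assumption: '\r' is treated as OTHER here
        _ => Class::Other,
    }
}

/// One bitset per class for a chunk of at most 64 bytes.
/// Assumption: byte `i` maps to bit `63 - i`.
#[derive(Debug, Default, PartialEq, Eq)]
struct Bitsets {
    comma: u64,
    quotation: u64,
    new_line: u64,
    other: u64,
}

fn build_bitsets(chunk: &[u8]) -> Bitsets {
    assert!(chunk.len() <= 64, "a chunk holds at most 64 bytes");
    let mut sets = Bitsets::default();
    for (i, &byte) in chunk.iter().enumerate() {
        let bit = 1u64 << (63 - i);
        match classify(byte) {
            Class::Comma => sets.comma |= bit,
            Class::Quotation => sets.quotation |= bit,
            Class::NewLine => sets.new_line |= bit,
            Class::Other => sets.other |= bit,
        }
    }
    sets
}

fn main() {
    let sets = build_bitsets(b"aaa,bbb,ccc");
    // Print only the 11 bits that correspond to input bytes.
    println!("comma = {:011b}", sets.comma >> (64 - 11)); // comma = 00010001000
    println!("other = {:011b}", sets.other >> (64 - 11)); // other = 11101110111
}
```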
25+
26+
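With the first byte stored in the most significant bit, the [number of leading zeros](https://doc.rust-lang.org/std/primitive.u64.html#method.leading_zeros) in the comma bitset is the index of the next comma. Taking that count, clearing the bit, and repeating pulls out the CSV entries one by one.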
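
A minimal sketch of that loop, assuming byte `i` of the row maps to bit `63 - i` and the row has no quoted fields. `split_fields` is a hypothetical helper for this example, not a function from simdcsv, and a plain iterator builds the comma bitset in place of the vectorized classification.

```rust
/// Splits `row` (at most 64 bytes, no quoted fields) at each set bit of `commas`.
/// Assumption: byte `i` of the row maps to bit `63 - i`.
fn split_fields<'a>(row: &'a str, mut commas: u64) -> Vec<&'a str> {
    let mut fields = Vec::new();
    let mut start = 0;
    while commas != 0 {
        // Index of the next comma in the row.
        let pos = commas.leading_zeros() as usize;
        fields.push(&row[start..pos]);
        start = pos + 1;
        // Clear the bit we just consumed.
        commas &= !(1u64 << (63 - pos));
    }
    // Whatever follows the last comma is the final field.
    fields.push(&row[start..]);
    fields
}

fn main() {
    let row = "aaa,bbb,ccc";
    let commas = row
        .bytes()
        .enumerate()
        .filter(|&(_, b)| b == b',')
        .fold(0u64, |acc, (i, _)| acc | (1u64 << (63 - i)));
    println!("{:?}", split_fields(row, commas)); // ["aaa", "bbb", "ccc"]
}
```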
27+
28+
Bitsets make it cheap to check whether a symbol is present, count its occurrences, or strip out escaped symbols.
29+
30+
**Detecting Escaped Quotations and Commas**
31+
32+
Consider the CSV row: `"aaa,norm","b""bb","ccc"`
33+
34+
In CSV, quotes are escaped by doubling them (`""`). The `""` in `b""bb` is field content, not a structural delimiter. We detect escaped pairs by finding adjacent quotes:
35+
36+
```rs
37+
let escaped = q & (q << 1); // Find adjacent quote pairs
38+
let escaped = escaped | (escaped >> 1); // Mark both quotes in each pair
39+
let valid_quotes = q & !escaped; // Remove escaped quotes
40+
```
41+
42+
```rs
43+
quote_bitset             = 0b100000000101011001010001 // Quotes at positions 0, 9, 11, 13, 14, 17, 19, 23 (position 0 is the leftmost bit)
44+
q << 1                   = 0b000000001010110010100010
45+
q & (q << 1)             = 0b000000000000010000000000 // Found the "" pair
46+
escaped | (escaped >> 1) = 0b000000000000011000000000 // Both quotes marked
47+
valid_quotes             = 0b100000000101000001010001 // Only structural quotes remain
48+
```
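
The same steps can be checked against the example row with a short program. This is a sketch under the assumption that byte `i` maps to bit `63 - i`; `bitset_of` and `structural_quotes` are hypothetical names used only here.

```rust
/// Hypothetical helper: sets bit `63 - i` for every byte `i` equal to `target`.
fn bitset_of(row: &[u8], target: u8) -> u64 {
    assert!(row.len() <= 64, "a chunk holds at most 64 bytes");
    row.iter()
        .enumerate()
        .filter(|&(_, &b)| b == target)
        .fold(0u64, |acc, (i, _)| acc | (1u64 << (63 - i)))
}

/// Returns the quote bits that are not part of an adjacent `""` pair.
fn structural_quotes(q: u64) -> u64 {
    let pair_start = q & (q << 1); // first quote of each adjacent pair
    let escaped = pair_start | (pair_start >> 1); // both quotes of each pair
    q & !escaped
}

fn main() {
    let row = br#""aaa,norm","b""bb","ccc""#; // 24 bytes
    let q = bitset_of(row, b'"');
    let valid = structural_quotes(q);
    // Print only the 24 bits that correspond to input bytes.
    println!("quote_bitset = {:024b}", q >> 40);
    println!("valid_quotes = {:024b}", valid >> 40);
}
```

Note that this adjacency test treats any two neighbouring quotes as an escaped pair, so an empty quoted field (`""`) or a run of three quotes would need extra handling that is not covered here.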
49+
50+
**Marking Inside Quotations**
51+
52+
With only structural quotes, we use parallel prefix XOR to mark all bits between quote pairs:
53+
54+
```rs
55+
valid_quotes  = 0b100000000101000001010001 // Structural quotes only
56+
inside_quotes = 0b111111111001111110011110 // Prefix XOR from the left: each opening quote through the byte before its closing quote
57+
```
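
One portable way to compute a prefix XOR is a ladder of shifts and XORs. The sketch below assumes the first byte sits in the most significant bit, so the scan runs from the top bit downward and uses right shifts. The function name is made up for this example, and simdcsv may compute the mask differently (for instance with a carry-less multiply).

```rust
/// Prefix XOR scanning from the most significant bit:
/// output bit `i` is the XOR of all input bits at or above `i`.
fn prefix_xor_msb_first(mut x: u64) -> u64 {
    x ^= x >> 1;
    x ^= x >> 2;
    x ^= x >> 4;
    x ^= x >> 8;
    x ^= x >> 16;
    x ^= x >> 32;
    x
}

fn main() {
    // Structural quotes of `"aaa,norm","b""bb","ccc"` at positions
    // 0, 9, 11, 17, 19, 23, placed in the top 24 bits of the word.
    let valid_quotes: u64 = 0b100000000101000001010001 << 40;
    let inside_quotes = prefix_xor_msb_first(valid_quotes);
    println!("{:024b}", inside_quotes >> 40); // 111111111001111110011110
}
```

Each run of ones starts at an opening quote and stops just before the matching closing quote, which is enough to tell whether a comma sits inside a quoted field.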
58+
59+
Masking out commas inside quotes:
60+
61+
```rs
62+
comma_bitset = 0b000010000010000000100000 // Commas at positions 4, 10, 18
63+
valid_commas = comma_bitset & !inside_quotes
64+
             = 0b000000000010000000100000 // Comma at 4 masked out
65+
```
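
To make the masking step concrete, the sketch below applies it to the example row and lists the surviving comma positions. It assumes byte `i` maps to bit `63 - i`; the two input masks are written out by hand for `"aaa,norm","b""bb","ccc"`, and `positions` is a hypothetical helper.

```rust
/// Positions of the set bits, scanning from the most significant bit.
fn positions(mut bits: u64) -> Vec<usize> {
    let mut out = Vec::new();
    while bits != 0 {
        let pos = bits.leading_zeros() as usize;
        out.push(pos);
        bits &= !(1u64 << (63 - pos)); // clear the bit we just reported
    }
    out
}

fn main() {
    // Masks for the 24-byte row, placed in the top 24 bits of the word.
    let comma_bitset: u64 = 0b000010000010000000100000 << 40;
    let inside_quotes: u64 = 0b111111111001111110011110 << 40;

    let valid_commas = comma_bitset & !inside_quotes;

    println!("{:?}", positions(comma_bitset)); // [4, 10, 18]
    println!("{:?}", positions(valid_commas)); // [10, 18]
}
```

Only the commas at positions 10 and 18 remain, which are the two field separators of the row.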
66+
667
## Reading
768

869
https://www.rfc-editor.org/rfc/rfc4180.html<br>