I was recently working on parsing 100K CSV files and inserting them into a database. The files have a non-column-oriented header and other idiosyncrasies, so they can't easily be imported directly. They're stored on an HDD, so my first instinct was to focus on I/O: read each whole file into memory as an async operation, so that the HDD sees fewer, larger reads and other parser tasks can do work while a read is in flight. I used a fairly featureful C# CSV parsing library (CsvHelper) that does well on benchmarks [0], so I wasn't really worried about that part.
But that intuition was completely wrong. The 100K CSV files only add up to about 2GB. Despite being many small files, reading through them all is pretty fast the first time, even on Windows, and after that they're in the cache and you can ripgrep through all of them almost instantaneously. The parser library is fast because it uses runtime code generation specialized for the object type being deserialized. The overhead of allocating a bunch of complex parser and type-converter objects, reflecting over the parsed types, and generating code for a parser means that for lots of tiny files it's really slow.
I had to stop worrying about it because two minutes is fast enough for a batch import process, but it still bothers me.
Edit: CsvHelper doesn't have an API for reusing parser objects. I tested patching in a ConcurrentDictionary to cache the generated code, and it massively sped up the import. But again, it was fast enough and I couldn't let myself get nerd-sniped.
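The shape of the workaround, roughly — this is an illustrative sketch of the caching pattern, not CsvHelper's actual internals (the real patch cached the generated mapping delegates inside the library; here `Row`, `ParseFile`, and the cache contents are placeholder stand-ins):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;
using CsvHelper.Configuration;

// Placeholder record type for the rows being imported.
public record Row(string Id, string Value);

public static class Importer
{
    // One shared entry per record type, created once via GetOrAdd instead of
    // being rebuilt for every one of the 100K files. In the patched build the
    // cached value was the generated parsing code itself; a plain
    // CsvConfiguration stands in for it here.
    private static readonly ConcurrentDictionary<Type, CsvConfiguration> Cache = new();

    public static IEnumerable<Row> ParseFile(string path)
    {
        var config = Cache.GetOrAdd(typeof(Row),
            _ => new CsvConfiguration(CultureInfo.InvariantCulture));

        using var reader = new StreamReader(path);
        using var csv = new CsvReader(reader, config);
        // GetRecords<T> is where the reflection and codegen cost lives;
        // caching across CsvReader instances is what made it cheap.
        foreach (var row in csv.GetRecords<Row>())
            yield return row;
    }
}
```

GetOrAdd makes the cache safe to hit from many concurrent file-parsing tasks, which is the whole point when you're fanning out over 100K files.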
Edit 2: the import process will run in production on a server with low average load, 256GB of RAM, and ZFS with zstd compression, so the CSV files will live permanently in the page cache and the ZFS ARC. The import will probably run a few dozen times a day to catch changes. I/O is really not going to be the problem. In fact, it would probably be faster to switch to synchronous reads and drop all the async overhead. Oh well.
[0]: https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-pa...