
Say I get 10 files of 3 GB each every week, which I'm supposed to filter on a certain column using a reference index and forward to a colleague. Before filtering I also want to check what the file looks like: column names, the first few records, etc.

I can use something like the following to explore a few rows and columns:

    $ awk '{print $1,$3,$5}' file | head -10

Then I can use something like sed with the reference index to filter the file. Since I plan to repeat this with different files, a database would be time-consuming (even if I automate loading every file and querying it). Given the file size, options like R or Python would be slower than Unix commands. I can also save the set of commands as a script and share or run it whenever I need it.
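For concreteness, the reference-index filter can also be done in a single awk pass instead of sed — a sketch with made-up file names (ids.txt as the index, data.tsv as the weekly file):

```shell
# Hypothetical files: ids.txt is the reference index (one ID per line),
# data.tsv is the weekly file; keep rows whose 2nd column is in the index.
printf 'a1\na3\n' > ids.txt
printf 'x\ta1\t1\ny\ta2\t2\nz\ta3\t3\n' > data.tsv

# NR==FNR is true only while reading the first file, so the index is
# loaded into a hash; the second pass filters the data file against it.
awk -F'\t' 'NR==FNR { keep[$1]; next } $2 in keep' ids.txt data.tsv > filtered.tsv
cat filtered.tsv
```

This streams the 3 GB file once and only holds the index in memory, which is why it tends to beat loading everything into a database for one-off runs.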

If there is a better way I would be happy to learn.



I think the gain you're seeing there is because it's quicker for you to do quick, dirty ad hoc work with the shell than it is to write custom Python for each file. That totally makes sense: the work's ad hoc, so use an ad hoc tool. Python being slow and grep being a marvel of optimization doesn't really matter here, compared to the dev time you're saving.


I have been doing Python for the last few years, but went back to Perl for this sort of thing recently, alongside the Unix commands mentioned. You can start with a one-liner and, if it gets complicated, just turn it into a proper script. It's just faster when you don't know what you're dealing with yet.
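A sketch of what that one-liner stage looks like (file name and column values are made up) — perl's -lane flags give you awk-style field splitting, and the same body can later be pasted into a full script:

```shell
# Hypothetical data: tab-separated, filter on the 2nd column.
printf 'x\ta1\t1\ny\ta2\t2\n' > data.tsv

# -n loops over input lines, -a autosplits each line into @F on
# whitespace (awk-like), -l handles trailing newlines.
perl -lane 'print if $F[1] eq "a2"' data.tsv > out.txt
cat out.txt
```

When the condition outgrows one line, the `print if ...` body moves into a script with `while (<>)` and nothing else about the workflow changes.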


For this kind of thing, it's easiest to bulk-load the files into SQLite and do your exploration and early analysis in SQL.
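A minimal sketch of that workflow with the sqlite3 CLI, using a hypothetical CSV — `.import` into a nonexistent table creates it from the header row:

```shell
# Hypothetical CSV with a header row.
printf 'id,val\na1,1\na2,2\n' > data.csv

# .import creates table t, taking column names from the first row.
rm -f demo.db
sqlite3 demo.db <<'SQL'
.mode csv
.import data.csv t
SQL

# Exploration then happens in plain SQL.
sqlite3 demo.db "SELECT val FROM t WHERE id = 'a2';"
```

The upside over pure pipelines is that the loaded data sticks around in the .db file, so follow-up questions don't re-read the 3 GB source.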



