
It's weird to me that people build libraries on top of the ML stack to track provenance, when it's really the ML library's job to do that for its inputs. However, it's a right pain to build into the ML library, as it affects all the interfaces. We build data, model & evaluation provenance objects into our ML library, Tribuo (https://tribuo.org), as a first-class part of the library. You can take a provenance and emit a configuration to rerun an experiment just by querying the model object. It is built in Java though, which makes it a little easier to enforce the immutability and type safety you need in a provenance system.
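The core idea of immutable provenance objects attached to the model, from which you can emit a rerun configuration, can be sketched in Python. This is a hedged analogue of the design, not Tribuo's actual (Java) API; all class and field names here are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical frozen (immutable) provenance records, in the spirit of the
# design described above -- not Tribuo's real classes.
@dataclass(frozen=True)
class DataProvenance:
    path: str
    checksum: str

@dataclass(frozen=True)
class TrainerProvenance:
    algorithm: str
    hyperparams: tuple  # tuple of (name, value) pairs, kept immutable

@dataclass(frozen=True)
class ModelProvenance:
    data: DataProvenance
    trainer: TrainerProvenance

    def to_config(self) -> dict:
        """Emit a configuration sufficient to rerun the experiment."""
        return {
            "data": {"path": self.data.path, "checksum": self.data.checksum},
            "trainer": {"algorithm": self.trainer.algorithm,
                        "hyperparams": dict(self.trainer.hyperparams)},
        }

prov = ModelProvenance(
    DataProvenance("train.csv", "sha256:abc123"),
    TrainerProvenance("logistic_regression", (("lr", 0.1), ("epochs", 10))),
)
print(prov.to_config()["trainer"]["algorithm"])  # logistic_regression
```

Because every record is frozen, a provenance can be handed around and stored without worrying about mutation, which is the property the type system helps enforce.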

edit: I should add that I'm definitely in favour of having provenance in ML systems, and libraries layered on top are how people currently do that. It's just odd that people aren't working on adding that support directly into scikit-learn/TF/PyTorch etc.



MLflow and TFX try to add some form of provenance by polluting your code with "logging" calls. One good thing MLflow has added is auto-loggers; we also added them in our Maggy framework ( https://www.logicalclocks.com/blog/unifying-single-host-and-... ).
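The auto-logger idea is to hook the framework's training entry point so that hyperparameters are recorded without any explicit logging calls in user code. A minimal stdlib-only sketch of that pattern (the `autolog` decorator, `RUNS` store, and `TinyModel` are all hypothetical; real auto-loggers patch the framework's own classes and ship records to a tracking server):

```python
import functools
import time

RUNS = []  # stand-in for a tracking server / metadata store

def autolog(estimator_cls):
    """Wrap an estimator's fit() so every training run records its
    hyperparameters automatically -- the auto-logger idea, reduced
    to a monkey-patching sketch."""
    original_fit = estimator_cls.fit

    @functools.wraps(original_fit)
    def fit(self, *args, **kwargs):
        run = {"params": dict(getattr(self, "params", {})),
               "timestamp": time.time()}
        result = original_fit(self, *args, **kwargs)
        RUNS.append(run)  # recorded without any logging call in user code
        return result

    estimator_cls.fit = fit
    return estimator_cls

@autolog
class TinyModel:
    def __init__(self, lr=0.01):
        self.params = {"lr": lr}

    def fit(self, X, y):
        self.coef_ = sum(y) / len(y)  # trivial "training"
        return self

TinyModel(lr=0.5).fit([1, 2], [3, 4])
print(RUNS[0]["params"])  # {'lr': 0.5}
```

The user's training code stays untouched; the provenance capture lives entirely in the hook, which is what makes auto-logging attractive compared with scattering explicit logging calls.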

I totally agree that where you have framework hooks, you should have provenance. But given that there's no standard for what provenance is and no de facto open-source platform, the sklearn, TF, and PyTorch folks rightly steer clear. We've found that if you have a shared file system, you can use path-name conventions (features go in 'featurestore', training data in 'training', models in 'models', etc.) to capture a ton of provenance data.
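The path-convention approach can be sketched as a small classifier over file paths. The directory names follow the convention mentioned above; the mapping and function names are my own illustration, not any particular platform's API:

```python
from pathlib import PurePosixPath

# Infer what kind of artifact a path refers to from the directory
# conventions described above (assumed mapping, for illustration).
KIND_BY_DIR = {
    "featurestore": "feature",
    "training": "training-data",
    "models": "model",
}

def classify(path: str) -> dict:
    """Return a minimal provenance record derived purely from the path."""
    for part in PurePosixPath(path).parts:
        if part in KIND_BY_DIR:
            return {"kind": KIND_BY_DIR[part], "path": path}
    return {"kind": "unknown", "path": path}

print(classify("/data/featurestore/clicks/v3.parquet")["kind"])  # feature
print(classify("/data/models/ranker/1/model.pb")["kind"])        # model
```

Combined with file-system event notifications, a rule like this lets you attribute reads and writes to pipeline stages without instrumenting the training code at all.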



