Hacker News

Data provenance is a standard term of art in machine learning and data science, a “data 101” kind of thing, with many OSS and vendor tools built to solve provenance problems, like DVC, Pachyderm, Kubeflow, MLflow, Neptune, etc.


Worked with stats, machine learning and data science for 10+ years now. Never heard the term used until now. (That's not to say I'm not familiar with the things the term refers to; indeed, most of the intellectual frameworks I've worked with break the things that make up provenance into far more fine-grained concepts.)

Of course, I've also never heard of or touched the software you listed there either, but that may be because I don't view the data science and machine learning I'm interested in as being about specific software or vendor software...

Sounds more like database lingo to me...


It's a common term used in data governance. It's found less in the academic literature, and more in software demos and vendor brochures. You'll also hear "data lineage", which is the context in which the term arises.

"Provenance" just means where the data came from. [1]

It's one of those shibboleths and terms of art used by people in industry. If you go to trade shows you'll hear it being used -- it's worth knowing, if for nothing else than its sociological value among the data software tools crowd.

Aside: it's a little like the word "inference" being used as a verb by folks in AI (example usage: we use GPUs to speed up model "inferencing") -- in AI, to inference means to "predict". It's a term of art. If someone with a traditional statistics background went to a deep learning conference, they would likely be very confused, because in traditional statistics inference means estimating the parameters θ in a model y = f(x,θ), whereas in AI, inferencing refers to obtaining y.
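To make the two meanings concrete, here is a minimal sketch, assuming a toy one-parameter model y = θ·x fitted by least squares. The function names `estimate_theta` and `predict` and the data are made up for illustration:

```python
# Toy model: y = f(x, theta) = theta * x (an assumption for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def estimate_theta(xs, ys):
    """Statistics sense of 'inference': estimate theta from observed (x, y)."""
    # Closed-form least squares for a line through the origin.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def predict(x, theta):
    """AI sense of 'inference': compute y for a new x, with theta held fixed."""
    return theta * x

theta_hat = estimate_theta(xs, ys)  # what a statistician calls inference
y_new = predict(5.0, theta_hat)     # what a deep learning person calls inference
print(theta_hat, y_new)
```

Same model, same word, two different unknowns: θ in the first call, y in the second.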

[1] https://en.wikipedia.org/wiki/Data_lineage#Data_provenance


I've also worked as a data scientist for a few years and have never heard or used the word "provenance" in a DS context. Some people used it in the oil & gas industry when talking about where reservoir sands came from, but that usually garnered an eye-roll and a mental translation to more everyday language.


Regardless of the term chosen, the concept of 'provenance' described here is the essential purpose behind the scientific notebooks used daily by experimentalists in industry and academia. Without thoroughly recording the bases for your experiment, it almost surely will not be reproducible.

Where I work (a large pharmaceutical company), these notebooks are taken very seriously by biologists, chemists, and chemical engineers, and increasingly are shaping the mindset of our data scientists (who have yet to adopt them).

Given the longstanding practice of documenting experiment design and method, I think it's long overdue that the exploratory analysis of experiment-based data also adopt more rigorous governance, so that necessity and sufficiency are established when drawing inferences from experiments, especially when the data was not originally collected to answer the question now being posed.


It’s shocking if you’ve worked professionally in statistics and not heard about data provenance.

A few publications from the ~2011-2015 period:

http://ceur-ws.org/Vol-1558/paper37.pdf

https://ieeexplore.ieee.org/document/5739644

https://link.springer.com/chapter/10.1007/978-3-642-53974-9_...

Add a variety of additional links dating back a bit further (note the emphasis in this case on research data and tracking state of an experiment).

https://nnlm.gov/data/thesaurus/data-provenance

Data provenance is not a database / data warehouse term. It is uniquely and specifically a basic “101” concept of statistical science and ML / data science, where the custody and tracking of data are specifically tied to iterations of experiments, prototypes and research, for the sake of reproducibility.
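As a rough illustration of tying data custody to an experiment iteration, here is a minimal sketch using only the Python standard library. The record layout, the `provenance_record` and `data_unchanged` helpers, and the file name are all hypothetical, not taken from any particular tool:

```python
import hashlib
import json

def fingerprint(data: bytes) -> str:
    """Content hash that identifies the exact bytes an experiment used."""
    return hashlib.sha256(data).hexdigest()

def provenance_record(run_id, source, data, params):
    """Tie one experiment iteration to the data and settings it ran with."""
    return {
        "run_id": run_id,             # hypothetical experiment identifier
        "source": source,             # where the data came from
        "sha256": fingerprint(data),  # which exact version of it was used
        "params": params,             # settings for this iteration
    }

def data_unchanged(record, data: bytes) -> bool:
    """Check that data on hand is what the recorded run actually used."""
    return record["sha256"] == fingerprint(data)

raw = b"x,y\n1,2\n3,4\n"
record = provenance_record("run-001", "survey_2019.csv", raw, {"model": "ols"})
print(json.dumps(record, indent=2, sort_keys=True))
print(data_unchanged(record, raw))             # same bytes as the run
print(data_unchanged(record, raw + b"5,6\n"))  # data has since changed
```

Tools like the ones mentioned upthread do far more than this, but the core idea is the same: a run is only reproducible if you can say exactly which data it saw.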

If I were interviewing an experienced statistical researcher and they didn’t at least have a working knowledge of the core concepts, that would be a huge red flag.


I'm not saying it doesn't exist, but I am saying it must be jargon used within a particular community or minority subset of general stats/machine learning/AI. Honestly, I still think it's a database/enterprise term, because I've worked for our national statistics office and never heard it in the statistics community either. I have frequently heard "data lineage", however, but again, that's database/enterprise lingo: when people use that word I know immediately the background they're coming from.

Another poster mentioned vendor brochures and trade shows, which is in line with my expectations about which community it stems from, and also explains why I've never heard of it because I try to keep away from such environments these days.

Everywhere I've been the things which I take to make up "provenance" have generally been referred to under the simple label of "data quality", with separate subset definitions and measures such as timeliness, source, authority, format, history, suitability, verification, etc.

Of course, that's assuming people even worry about such things. Let's be frank: anyone who's worked with data science knows these things get shorter shrift in practice than they deserve. I'm probably among a minority of people in the real world who actually take them seriously, and I find myself on a constant crusade to remind people that just because a data point exists in a data set doesn't mean it's useful/appropriate/truthful/unbiased.

"Data quality" is a bit problematic as a label, because I can see it being used by people, from a variety of fields, who think provenance doesn't have anything to do with quality. But it is also far more popular according to historical search trends, and in my last three jobs provenance would have fallen under the data quality framework.


It’s very, very widely used jargon. I’d put “data provenance” on par with “overfitting” or “GPU model training” in terms of the high, ubiquitous place it occupies in mainstream machine learning.


Sorry, I have to disagree here. It's a term of art in some of the literature, but it's definitely not that widespread, certainly not in consumer tech data science, where I work.


I’ve worked professionally in quant finance, image processing, defense research, and several mid-to-large ecommerce and payment processor companies.

In all of them, data provenance has been a first-class consideration of machine learning and data platform teams: a day-to-day concern, baked into architecture review guidelines and production checklists and whatnot for every ML project.

In many of these companies we had teams of 20-40 ML scientists, all of whom knew about data provenance as a first class consideration in their work, had experience with it from their past jobs and academic programs, and considered it on equal footing with any aspect of data curation, model selection, model training and model serving.


I mean, I shouldn't be surprised, as given our previous interactions, I feel like you are the anti-me, in that our experiences of similar things are so wildly divergent.

Shrug, such is life I guess. That being said, I care deeply about this stuff (but didn't have a word), so perhaps it will be easier to convince people to pay attention to the data with said word.


TIL.

I’ve worked as a data engineer for the last two years and never heard of this being used in this context before.

Typically the term “data lineage” is used to mean this in my experience.

I don’t think I’ve ever been in a meeting where someone mentioned provenance except referring to a show about paintings.

Lineage isn't the same thing, being a more specific technical term referring to keeping the history of datasets and where they came from (basically), but people actually say the words “data governance” and “lineage”.


Another important use of data provenance is in GDPR. You have to be able to know the source of each piece of data you use and be able to remove it from storage and backups on request.
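A minimal sketch of why that depends on provenance, assuming each stored record carries a `source` tag and a `subject_id` (both field names, and the sample data, are made up for illustration; real erasure across storage and backups is much more involved):

```python
# In-memory stand-in for a data store; every record keeps its source.
records = [
    {"subject_id": "u1", "source": "signup_form", "value": "a@example.com"},
    {"subject_id": "u2", "source": "partner_feed", "value": "b@example.com"},
    {"subject_id": "u1", "source": "partner_feed", "value": "+1-555-0100"},
]

def sources_for(records, subject_id):
    """Answer 'where did this person's data come from?'."""
    return sorted({r["source"] for r in records if r["subject_id"] == subject_id})

def erase_subject(records, subject_id):
    """Return the store with every record about the subject removed."""
    return [r for r in records if r["subject_id"] != subject_id]

print(sources_for(records, "u1"))
remaining = erase_subject(records, "u1")
print(len(remaining))
```

Without the source and subject tags on each record, neither question can be answered reliably.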



