I'm not saying it doesn't exist, but I am saying it must be jargon used within a particular community or minority subset of general stats/machine learning/AI. Honestly, I still think it's a database/enterprise term, because I've worked for our national statistics office and never heard it in the statistics community either. I have frequently heard "data lineage", however, but again, that's database/enterprise lingo: when people use that word I know immediately the background they're coming from.
Another poster mentioned vendor brochures and trade shows, which is in line with my expectations about which community it stems from, and also explains why I've never heard of it because I try to keep away from such environments these days.
Everywhere I've been, the things I take to make up "provenance" have generally been referred to under the simple label of "data quality", with separate subset definitions and measures such as timeliness, source, authority, format, history, suitability, verification, etc.
Of course, that's assuming people even worry about such things. Let's be frank: anyone who's worked in data science knows these concerns get shorter shrift than they deserve in practice. I'm probably among a minority of people in the real world who actually take them seriously, and I find myself on a constant crusade to remind people that just because a data point exists in a data set doesn't mean it's useful/appropriate/truthful/unbiased.
"Data quality" as a term is a bit problematic, because I can see it being used by people from a variety of fields who think provenance doesn't have anything to do with quality. But it is also far more popular according to historical search trends, and in my last three jobs provenance would fall under the data quality framework.
It’s very, very widely used jargon. I’d put “data provenance” on par with “overfitting” or “GPU model training” in terms of the high, ubiquitous place it occupies in mainstream machine learning.
Sorry, I have to disagree here. It's a term of art in some of the literature, but it's definitely not that widespread, certainly not in consumer tech data science, where I work.
I’ve worked professionally in quant finance, image processing, defense research, and several mid-to-large ecommerce and payment processor companies.
In all of them, data provenance has been a first-class consideration for machine learning and data platform teams: a day-to-day concern, baked into architecture review guidelines, production checklists and whatnot for every ML project.
In many of these companies we had teams of 20-40 ML scientists, all of whom knew about data provenance as a first-class consideration in their work, had experience with it from their past jobs and academic programs, and considered it on equal footing with any aspect of data curation, model selection, model training and model serving.
I mean, I shouldn't be surprised: given our previous interactions, I feel like you are the anti-me, in that our experiences of similar things are so wildly divergent.
Shrug, such is life I guess. That being said, I care deeply about this stuff (but didn't have a word), so perhaps it will be easier to convince people to pay attention to the data with said word.