I transform hundreds of tabular sources. For cleaning and transformation, I found that only a small number of transformation types is required, and that we need to review them as a team, including the business owners. So I wrote a simple, very English-like grammar that gets translated into Polars operations under the covers in Python. It covers 98%+ of my ingestion needs, and it means we can focus as a team on the logic of the data transformations. Business users can easily make changes for the sources they manage.
One of the concepts is a “map” from old values to new values. We keep those in Excel, in Git, so that business users can edit and maintain them. Because it’s Excel, we’re careful to validate those rules on import at each run, mainly flagging where there’s been a lot of change so we can spot anything unintended. Excel makes me nervous in data processing work in general (exploration with Pivots is great, though I’ve moved to VisiData as my first tool of choice). But over years of running this way we’ve worked around Excel’s lax approach to data, such as its habit of interpreting numerical ID fields as numbers rather than strings.
For output “rendering”, because everything is in Polars, we can usually just output to CSV. We use Jinja for some funky cases.
I think most of these are extremely poor. They can only be interpreted in many cases if you already understand the data, such as by reading the table first.
Sure, but it’d be a lot more interesting and challenging to build 100 visualizations where each gives a unique insight into the same dataset. An isometric 3D bar chart is just going through the motions.
From my POV this is worth bookmarking - there are many datasets that are much clearer with one chart type or another - having 100 styles with the same data will later offer a visual index to help me decide what will best serve my needs.
My thoughts exactly! At least half of these are chart types that I've never seen before or at least would never think of using so having this reference is awesome.
I write a lot of documentation, knowing that it may be nobody else who reads it. Why? Because when I take the time to write clearly, I think clearly. It’s for my productivity and effectiveness, first.
We’ve been developing niche medical software successfully for some decades.
First, it helps that it’s niche: it avoids the “make healthcare better with electronic health records” space, which can only descend into putting a mush of text boxes on a screen and promising that AI will do… something…
Second, we listen to our clients and probe their needs. But we’re most successful when we observe our clients. When we’re not in the thick of it, we have more space to ask “does it have to be this way?” We work very hard to formulate the problem so that a piece of software is not the default solution.
Few of the pain points are “exciting” or “glamorous”. But anything that means the practitioner is spending more time with the patient is a big win, even if it means applying some very boring technology.
Good fun. I think, though, that precision in language might be a challenge. I concur with some of the previous comments. Over and above those, I was very strict about not inferring anything beyond the minimum of what was said. For example, “I was in the bedroom from 10:00 to 10:15” does not imply that I was not in the bedroom before or after that time. And “I didn’t see anyone when I arrived” only means I saw no one in the destination room, not that there wasn’t someone in the kitchen (which I must have walked through) or the source room. It’s also illogical that the murder could have happened as late as 11:15, exactly when the police arrive (unless the victim phoned it in). These rules left ambiguity.
Thanks for the feedback! I agree that the rules and explanations need to be improved. I don't like the ambiguity, and I think the deductions you made are indeed correct. I should focus on improving the language in general, since English is not my native tongue.