
>A machine learning algorithm which can be trained using SQL opens a world of possibilities. The model and the data live in the same space. This is as simple as it gets in terms of architecture. Basically, you only need a database which runs SQL.

First paragraph of the conclusion, and this very much fits with the mindset that's been growing in me in the data world over the past few years. Databases are much more powerful than we think, they're not going to go away, only get better, and having the data and the logic in the same space really removes tons of headaches. ML models, transformation of data, and generating JSON for an API can all be done within the database rather than in an outside scripting language.
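As a toy sketch of "the model and the data live in the same space": here's a one-parameter linear model fit by full-batch gradient descent, where each descent step is a single SQL UPDATE, run from Python against SQLite. All table and column names are made up for illustration.

```python
import sqlite3

# Model parameter and training data both live in the database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (x REAL, y REAL);
    CREATE TABLE params (w REAL);
    INSERT INTO data VALUES (1.0, 3.0), (2.0, 6.0), (3.0, 9.0);  -- true w = 3
    INSERT INTO params VALUES (0.0);
""")

lr = 0.05
for _ in range(200):
    # Gradient of mean squared error w.r.t. w is AVG(2 * (w*x - y) * x);
    # the correlated reference params.w reads the current parameter value.
    conn.execute("""
        UPDATE params
        SET w = w - ? * (SELECT AVG(2 * (params.w * x - y) * x) FROM data)
    """, (lr,))

w = conn.execute("SELECT w FROM params").fetchone()[0]
print(round(w, 3))  # converges to 3.0
```

The loop lives in the host language here, but each step is pure SQL; a database with recursive CTEs or procedural extensions can absorb the loop too.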

Are others seeing this? Are the current tradeoffs just that more people know python vs sql or database specific languages to where moving logic to postgres or snowflake is looked down on?



The challenge I, with limited knowledge, see with developing detailed algorithms in SQL is a lack of good testing, abstraction, and review tooling. Similarly for a lot of the user-defined-functions for the larger data warehouses (redshift, bigQuery, etc.)

dbt solves a lot of it, but I'd love to learn more about good resources for building reliable and readily-readable SQL algorithms for complex logic.


I've supported 3 different models over the years with inference implemented in SQL. The first one I inherited, and loved it so much that I implemented it twice more. Amazingly fast for TBs of data and no waiting on other tech teams.

That tooling you're describing is definitely not there. BigQuery has BQML but it's very much in its infancy. I tend to do all the modeling in Python/R on sampled data and then deploy to SQL.
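Concretely, "model in Python, inference in SQL" can be as simple as fitting coefficients on a sample in Python and baking them into a SQL expression so scoring runs inside the database. A toy sketch with SQLite; table/column names are illustrative:

```python
import sqlite3

# Fit a one-feature linear model on a tiny sample in plain Python.
sample = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.1)]
n = len(sample)
mean_x = sum(x for x, _ in sample) / n
mean_y = sum(y for _, y in sample) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in sample)
         / sum((x - mean_x) ** 2 for x, _ in sample))
intercept = mean_y - slope * mean_x

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (x REAL)")
conn.executemany("INSERT INTO features VALUES (?)", [(10.0,), (20.0,)])

# "Deploying" the model is just substituting the coefficients into SQL,
# so inference runs wherever the data lives.
scoring_sql = f"SELECT x, {slope} * x + {intercept} AS prediction FROM features"
predictions = conn.execute(scoring_sql).fetchall()
print(predictions)
```

In practice the generated expression would go into a view or scheduled query rather than a Python f-string, but the shape is the same.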


BQML should become standard.


I mean this is an opportunity right? Build those things


I agree. Databases are going to be here for a long time, and we're barely scratching the surface of making people productive with them. dbt is just the beginning.


The problem with throwing everything in a database is you end up with brittle stored procedures all over the place, which are painful to debug. There is no good support for version control or testing, which means you end up creating a dozen copies of each function named sp_v1, sp_v2, etc. It is much harder to practice iterative development, which the rest of software development implements effectively.

Also traditional relational databases have a way to go before they can support parallelized machine learning workloads. You do not have control or the ability to spin up threads or processes to boost your performance. You rely on the query processor to make those decisions, and depending on your platform the results will be mixed.


No comment re rDBs supporting parallelized ML but re stored procedures - if your workflow evolves to treat them as ‘1st class’ code assets they’ll be just the same as the rest of your code.

We always had them in version control, unit tested etc. The tools are there if you want to use them.


I'll take this as a learning opportunity. I have looked around to find a reliable framework to implement within our team and failed to find anything usable.

How do you guys manage to implement versioning and testing? If you had a new stored procedure to deploy, where do you deploy it? How do you integrate with existing applications which rely on it?


Depends on what you're using. The Java world has a bunch of JDBC wrappers for testing SPs. SQLServer GUI has unit testing tools that can also be run from command line. @dventimi has good postgres recommendation.

Version control - each SP is in a file sp_descriptive_name.sql that's on the source tree under /db.

Deployment - compiled from the command line. Deployment target managed by env variables. Systems can require odd workarounds for ensuring no downtime - this is where systems end up with sp_descriptive_name_vN with N for each version (still stored in sp_descriptive_name.sql tho) so the new SP can be loaded alongside the old version, and then application code can be updated (or config updated) to use the new SP name. Good practice to remove the old version on next release.

I think it's true that SPs introduce a compromise, but having logic operate on data without having to send data back and forth over DB connection is a pretty big win.

It can also be a good abstraction because it avoids application code needing to know (for example) which field in which table indicates a customer is active, and code can execute something like sp_deactivate_customer(cust_id) and then let the stored procedure take care of updating flags, dates etc that describe a customer's active status. ymmv depending on whether your team owns the data and database, or if that's managed separately.


For testing if I'm in PostgreSQL I use pgTAP.

For version control I use git just as I do for other program units.


Updating stored procedures was a pain in the ass last time I checked. Checking changes to them on deployed servers was also very painful.


> The problem with throwing everything in a database is you end up with brittle stored procedures all over the place, which are painful to debug. There is no good support for version control or testing...

Shell scripts with the SQL as Here-Docs, instead of stored procedures. Ordinary text files, eminently gittable. Also lets you insert varying table/column names as environment variables, instead of the regex rigmarole in TFA. (Or was that in another post on his blog? Just returned from reading several of them.) Best of both worlds.


> Databases are much more powerful than we think

The older I get the more I agree with this.

There is nothing you cannot build by combining SQL primitives. Side effects can even be introduced - on purpose - by way of UDFs that talk to the outside world.

I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients. You might think this is horrible and indeed many manifestations are. But there lurks an opportunity for incredible elegance down this path, assuming you have the patience to engage in it.


> I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients.

I did this for a side project a few months ago and even used postgrest to serve the page with correct headers for html. It felt simultaneously really cursed and obvious. Shit you could even use plv8 to run mustache or whatever in the db if you really wanted to piss people off.


I'm doing this right now with a DO droplet built with PostgreSQL, postgrest, nginx, and not much else. Do you have any tips, tricks, or blog posts you can share based on your experience? You should post it to HN. Strike while the iron's hot. With a little luck you'll hit the front page.


Nah nothing that refined. Pretty predictably supabase is doing some weird stuff along these lines and I found some abandoned and semi-active repos associated with them and people working for them that were useful examples of some things.

As for posting to HN absolutely no thanks. These people are so fucking hostile there is no accomplishment too small to tear apart for entertainment here. I have no interest in it.


disclosure: supabase ceo

plv8 doesn't have network access, but this can be done with PostgREST.

That said, we actually restricted the HTML content-type on our platform so that users cannot do this. This is because we don't want to get too deep into the "frontend" world, with ever-changing frameworks.

We've found that if we offer a small feature, our users demand feature-completeness. In this case we're better off waiting until we have solved a few of the major database tasks (branching + migrations, scale-to-zero, edge caching, etc.)

FWIW, I think it's very cool serving HTML through PostgREST. I would be pretty happy to do this personally


disclosure: Supabase user

Long time user, first time caller. Besides the DO droplet, separately I also use Supabase and I love it.

I understand completely about restricting the content type, maybe just for now. Fun fact: I work around it by serving up text/xml plus XSLT. I'm sure you don't mind.


> I work around it by serving up text/xml plus XSLT. I'm sure you don't mind.

Not at all, I hear "application/xhtml+xml" is a nice hack from some of the team


> As for posting to HN absolutely no thanks. These people are so fucking hostile there is no accomplishment too small to tear apart for entertainment here. I have no interest in it.

You make a really good point.


> Side effects can even be introduced - on purpose - by way of UDFs that talk to the outside world.

> I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients.

I built BI Web pages entirely in PL/SQL (using some Oracle modules I can't quite recall the names of) over twenty years ago. It got a bit hairy keeping the meta-recursive stuff in your head when hard-coding JavaScript into it, calling the next such page for each value on the page, for drill-down to the next level... Not fun, but quite doable.

And with a bit more modern tooling -- as I said, this was turn-of-the-century tech -- I have no doubt it could be done much more effectively and efficiently nowadays.


This PL/SQL Overlord is running banks https://www.avaloq.com/solutions/products/avaloq-core


+1, SQL is extremely elegant and composable, and is underrated.

Postgres is very powerful. After a short detour into NoSQL MongoDB land, I'm now back in MySQL/PostgreSQL territory and glad for it.

Being able to generate views and stored procedures is useful as well. Having SQL take over more tasks like ML and gradient descent does open up good possibilities.

Also, since SQL is declarative, it's rather easier to reason about than imperative scripting languages.


SQL has some positives but it is not composable. At all. This is because relations are not first-class values in SQL.


CTEs go a long way to making SQL more composable.
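For example (SQLite via Python, illustrative names), each WITH clause names an intermediate relation that later clauses can build on:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (name TEXT, birthdate TEXT);
    INSERT INTO person VALUES ('Ada', '1995-05-01'), ('Bo', '2003-02-10');
""")
# The second CTE composes with the first by name, like a local binding.
rows = conn.execute("""
    WITH persons AS (SELECT * FROM person),
         pre_2000 AS (SELECT * FROM persons WHERE birthdate < '2000-01-01')
    SELECT name FROM pre_2000
""").fetchall()
print(rows)  # [('Ada',)]
```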


Is a query not a relation?


Basically, but queries are not first class in SQL. You can't assign a query to a variable, or pass it as a parameter to a stored procedure, for example. This would make SQL composable:

    declare @x = (select * from Person)
    select Name from @x where Birthdate < '2000-01-01'


You can do composition through CTE or Table-Valued Function.


CTE and TVF still treat tables as second class citizens, so while they enable some forms of composition, they're still very restricted. This has been the consistent story with SQL, 15 ways to use queries and tables, all to work around the limitation that they are not first class values.


  with persons as (select * from Person)  
  select Name from persons  
  where Birthdate < '2000-01-01'


Where is the assignment to a variable? Where can you construct a query using a variable in table/query position? That's the whole point of being first class and composable, a query becomes like any other value so you should be able to parameterize any query by another query assigned to a variable that may have been set inside an if-statement, or accepted as a parameter to a stored procedure. You know, the same kinds of composition we see in ordinary programming languages.


  create table x as (select * from person);
  select name from x where ...;
there you go, just configure your editor to display "create table x" as "declare x = " ;)

or even a version with lazy evaluation:

  create view x as (select * from person);
  select name from x where ...;


You're still not getting it. First-class status means that anywhere a value or variable can be used, a query or table should also be able to appear, and vice versa. This means a table or query can appear as a return type, a parameter to a stored procedure or query, a variable, and so on.

SQL just does not have this, it instead has 15 different second class ways to handle tables and queries that try to make up for the fact that they are not first-class values. These include CTEs, table valued functions, views, etc.


Usually I balk at the idea of someone describing a language feature as “first class” because it seems too wishy-washy. But in this thread you’ve shown me that maybe the best definition is through “proof by negation,” by patiently responding to arguments and demonstrating why a certain usage and the ensuing restriction around it means it is not first class. Bravo!


I agree the term is often abused, but I think the wikipedia page actually does a decent job of making the notion somewhat precise, along the lines I've been arguing here:

https://en.wikipedia.org/wiki/First-class_citizen

If you want to see what queries as first-class values looks like, LINQ in .NET is pretty close. I can actually write a series of queries that build on and compose with each other, like this:

    IQueryable<Person> RunQuery(int userSelection)
    {
        var first = from x in People
                    select x;
        var second = userSelection == 1
            ? from x in first where x.Birthday > new DateTime(2000, 1, 1) select x
            : from x in first where x.Name.Contains("Jane") select x;
        return DumbJoin(first, second);
    }

    IQueryable<Person> DumbJoin(IQueryable<Person> first, IQueryable<Person> second)
    {
        return from x in second
               join y in first on y.Role equals x.Role into g
               select g;
    }
This query is nonsense, but it just shows you what composition really looks like when queries are first-class values. I wish raw SQL were like this!


> You're still not getting it. First-class status means that anywhere a value or variable can be used, a query or table should also be able to appear, and vice versa. This means a table or query can appear as a return type, a parameter to a stored procedure or query, a variable, and so on.

I doubt you could implement a query planner that would cope with that degree of flexibility. Which means you’d be forced to deal with the mechanics of the query, pushing you away from declarative SQL and into procedural and functional programming. At which point you might as well ditch SQL anyway.


Without these features, people have to resort to dynamically generated SQL queries in procedural or functional languages, which is much worse! SQL has also become significantly more complicated by adding all sorts of second-class features to get around this composability limitation (CTEs, table valued functions, views, etc.).

Besides, I don't think it would be as bad as you say. You can approach it as a simple template expansion into flat SQL queries except where a data dependency occurs, at which point template expansion proceeds in stages, one for each dependency.

LINQ on .NET provides most of the composability I'm talking about, although it has a few limitations as well. Still worlds better than raw SQL.


In PostgreSQL at least, a table can appear as a return type of a function and as a parameter to a function. That's not nothing.


What if I wrote a very long, complicated query that I'd like to test against different tables (like test tables), and let's say I can't use stored functions or procedures. How could I pass different tables to my query?


  CREATE TABLE data_a AS (SELECT 'a' AS test_case, 1 AS value);
  CREATE TABLE data_b AS (SELECT 'b' AS test_case, 2 AS value);
  CREATE VIEW data AS (SELECT * FROM data_a UNION ALL SELECT * FROM data_b);

  CREATE VIEW complicated_query AS (SELECT test_case, value+1 FROM data);

  SELECT * FROM complicated_query WHERE test_case = 'a';
  SELECT * FROM complicated_query WHERE test_case = 'b';


Nice, that is what I was looking for. Of course, it'd need to point to production data as well, so maybe test_case is null, in that case:

  CREATE TABLE data_a AS (SELECT 'a' AS test_case, 1 AS value);
  CREATE TABLE data_b AS (SELECT 'b' AS test_case, 2 AS value);
  CREATE TABLE data_prod AS (SELECT NULL AS test_case, prod_table.value FROM prod_table);

  CREATE VIEW data AS (SELECT * FROM data_a UNION ALL SELECT * FROM data_b UNION ALL SELECT * FROM data_prod);

  CREATE VIEW complicated_query AS (SELECT test_case, value+1 FROM data);

  -- when testing
  SELECT * FROM complicated_query WHERE test_case = 'a';
  SELECT * FROM complicated_query WHERE test_case = 'b';

  -- when in 'production'
  SELECT * FROM complicated_query WHERE test_case IS NULL;


You just reinvented defunctionalization, which is a transformation from a domain that has first-class values to a domain where support is only second-class. Defunctionalization is typically used in programming languages to simulate first-class functions in languages where functions are only second-class citizens, like C and Pascal.

This perfectly illustrates my point. You had to manually defunctionalize your data model and queries to support what I'm saying should be inherently part of SQL.
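For anyone unfamiliar with the term, here's defunctionalization in miniature in Python: behavior that a first-class function would carry directly is replaced by a tag plus a dispatcher, which is the same role the test_case column plays for "which table to query".

```python
# Instead of passing a function around (first-class), we pass a tag and
# dispatch on it (second-class).
def apply(tag, x):
    # The dispatcher re-supplies the behavior that a first-class function
    # would have carried with it.
    if tag == "double":
        return 2 * x
    if tag == "square":
        return x * x
    raise ValueError(tag)

print(apply("double", 5))  # 10
print(apply("square", 5))  # 25
```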


> languages where functions are only second-class citizens, like C and Pascal.

1) Only if you define Pascal as only Wirth's very first version. That changed almost immediately.

2) Only if you refuse to equate “pointer to function” with “function”. Which in C, where “everything is a pointer” (a bit like in Unix/Linux “everything is a file”), seems rather silly.


Check out dbt - it's a great tool for organizing queries and solving such patterns


If you can't use stored procedures which are good for this very case, many databases offer dynamic SQL. That might work in some cases.


That's a good point. This would rule out SQLite for me.


Dynamic SQL isn’t SQL, and it’s not relational. It’s no different from using a language like Python to generate SQL queries.


It's a little different. Anyway, this is under the constraint "no stored procedures."


TVF (Table-Valued Function) with CROSS APPLY.


Exactly; since it's declarative, the style lends itself to stored procedure calls becoming composable.


You cannot abstract over stored procedures either, so that's still not composable.


No idea what this means


Think of first-class functions. Can't do that with stored procedures, just like you can't do that with queries or tables in SQL.


Furthermore, stored procedures/functions are not queries.


Stored procedures can be relations. Queries are relations. Ergo, stored procedures can be queries.


Isn't that the point of common table expressions (CTEs)?



Postgres has read/write parallelism that can scale across millions of reads and writes; if the ML model lives in the Postgres DB, it is indeed very elegant. It reminds me of the glory days of tsearch2 for text search in Postgres: we searched our internal document repository using nothing but Postgres, and boy was it faster than Java-based search systems.


I think general programming languages are better for general programs than SQL.

Specifically they have: Type systems, compilers, debuggers, text editors, package managers, C FFI etc.

But I agree that having the data and the program in the same process has benefits.

Writing programs in SQL is one way.

Another way is to move your data to your general program with SQLite.

I like using SQL for ACID, and queries as a first filter to get the data into my program where I may do further processing with the general language.
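For example (SQLite via Python, names made up), the division of labor described above looks like this: SQL does the coarse filtering, the host language does the rest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (kind TEXT, value REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("click", 1.0), ("view", 9.0), ("click", 3.0)])

# First, filter in SQL to narrow the data...
clicks = conn.execute(
    "SELECT value FROM events WHERE kind = 'click'").fetchall()
# ...then do the further processing in the general-purpose language.
total = sum(v for (v,) in clicks)
print(total)  # 4.0
```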


Another is MS SQL Server, which lets you run .NET on the database server :D "you can author stored procedures, triggers, user-defined functions, user-defined types, and user-defined aggregates in managed code"


I have had nothing but bad experiences trying to run .NET in SSIS packages -- is there another way?


I've never had the pleasure(ha) of using SSIS, but this is the stuff that I was talking about: https://learn.microsoft.com/en-us/dotnet/framework/data/adon...


> Type systems

SQL has types

> compilers

For what specifically do you need a compiler?

> debuggers

Some tasks - like the concurrency SQL enables - are just very difficult to debug with debuggers. It would be the same with any other language. What SQL does here though is to allow you to focus on the logic, not the actual concurrency.

> text editors, package managers

I feel like these two are just for filling up the space.

> C FFI

Many SQL engines have UDFs


> Type systems

Sure SQL has types, but they are checked at runtime, not compile time. Also you cannot define function input and return arguments with types that are checked before you run the program.

> compilers

If you want efficient and/or portable code. They will check your code for type errors before you run them. They give you coding assistance in your editor.

> debuggers

Being able to break a program and see its state and function stack is useful. The quality of the tools for real languages is much better than for SQL.

I agree that databases do concurrency better than most languages with their transactions (I mentioned I would use the db for ACID).

> text editors, package managers.

Editor support of real languages is much better than SQL.

Package managers enable code re-use.

> C FFI

Take for example Python. A huge amount of the value comes from re-using C libraries that are wrapped with a nice Python API.

You might be able to do this in SQL, but you'll have to wrap the C library yourself as there is no package manager, and no one else is doing the same thing.


> > text editors, package managers.

> Editor support of real languages is much better than SQL.

So text editors suck at supporting SQL... How is that SQL’s fault?!? Go complain to the text editor authors.

> Package managers enable code re-use.

Yup. Build one for SQL, then you can re-use SQL code. Just like someone had to build one for every other language. What does this prove about SQL being inferior to other languages? A: Nothing at all.


> What does this prove about SQL being inferior to other languages

My point is that I think it is inferior for general application/business logic programs. For queries that filter large datasets to small result sets, SQL is probably better as it has the built in indexes and query planner (plus ACID).

I am pointing out that the current environment (better text editors and package managers) favours general languages, so they are a better current choice (to use in combination with SQL) over just writing everything, including application logic, in SQL.


> I think general programming languages are better for general programs than SQL. Specifically they have: Type systems, compilers, debuggers, text editors, package managers, C FFI etc.

Non sequitur. SQL is typed; SQL can be edited in any text editor; there are lots of SQL IDEs and, arguably, debuggers and package managers. Sure, the package managers are specific to each RDBMS, but so what? Npm is no use in COBOL either. And sure, the “debuggers”, to the extent they can be said to exist, are radically different from those of “conventional” – of other – programming languages. But again, so what? A Smalltalk debugger is no use to fix the output from Intel’s latest C compiler either, or vice versa.

IOW: There is no such “SQL vs programming languages” dichotomy. SQL is just another programming language, with its own strengths and weaknesses, just like all the rest of them. “The rest” are not distinguished from SQL by somehow magically all having the attributes you claim for them: Some have them, some don't; some have this but not that, some others, the other way around. Someone built all those IDEs and debuggers and package managers for (some / many / most of) those other languages; you can build them for SQL too.


Another option is to use an approach like Prevayler: https://prevayler.org/

The basic notion is you keep your data hot in RAM and manage it directly. You make every change an object (or a command), and write that out serially to a log before you execute it. That gets you the ACID guarantees but with no I/O but linear writes, so it can be extremely fast.

It only makes sense when your data fits conveniently in RAM, but that's a lot of things.
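A minimal Python sketch of the idea (illustrative only, not the actual Prevayler API): every mutation is serialized to a log before it is applied to the in-RAM state, so the state can be rebuilt by replaying the log after a restart.

```python
import json, os, tempfile

class Prevalent:
    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        if os.path.exists(log_path):            # recovery: replay the log
            with open(log_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def execute(self, command):
        with open(self.log_path, "a") as f:     # write-ahead: log first...
            f.write(json.dumps(command) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(command)                    # ...then mutate in RAM

    def _apply(self, command):
        if command["op"] == "set":
            self.state[command["key"]] = command["value"]

log = os.path.join(tempfile.mkdtemp(), "commands.log")
db = Prevalent(log)
db.execute({"op": "set", "key": "a", "value": 1})
db.execute({"op": "set", "key": "b", "value": 2})

recovered = Prevalent(log)       # simulate a restart
print(recovered.state)  # {'a': 1, 'b': 2}
```

The only I/O on the hot path is the linear append plus fsync, which is where the speed comes from.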


I don't consider querying a relational database, transforming its data, or validating its state transitions to be general programs. I consider those to be special purpose programs, which benefit from special purpose tools tailored for writing them. SQL is one such tool.


I spend most of my time in the parallel universe that is scientific computing/HPC. In this alternate reality SQL (not to mention databases) never really took off. Instead of scalable, performant databases, we have only the parallel filesystem. I'm convinced the reason contemporary scientific computing doesn't involve much SQL is sociological/path-dependency, but there are also very good technical reasons. Optimizing software in scientific computing involves two steps: 1) think hard about your problem until you can frame it as one or more matrix multiplications, and 2) plug that into a numerical linear algebra library. The SQL abstraction (in my experience) takes you very much in the opposite direction.


For sure. Anything done in SQL is running on top of a million lines or more of extremely complicated non-SQL code. If that works for a given use case, great, but if not, optimizing can get very challenging. I'd much rather deal with something closer to the metal.


Self-proclaimed database expert here. What a database is good at depends on what you're trying to get that database to do, at least in part.

Break it into pieces: elegance and efficiency. These will correspond to a logical statement of what you're trying to do, and how quickly the database will actually do it in practice.

SQL can do some nice things in areas, making it elegant in those areas. Elsewhere it can be pretty wordy and ugly.

In efficiency, it comes down largely to how the database is implemented and that also includes the capability of the optimiser. Both of these are out of your control. In my experience trying to turn a database into a number cruncher is just not going to work.

I guess that's a long way round of me saying that I don't think I agree with you!


In the data warehouse / OLAP space, I think we are heading towards a world where the underlying data storage consists of partitioned Arrow arrays, and you can run arbitrary UDFs in arbitrary languages on the (read-only) raw data with zero copy and essentially no overhead other than the language runtime itself and marshalling of the data that is emitted from the UDF.

Something like the Snowflake data storage model + DuckDB as an execution engine + a Pandas/Polars-like API.

There is no reason why we have to be stuck with "database == SQL" all the time. SQL is extremely powerful, but sometimes you need a bit more, and in that case we shouldn't be so constrained.

But in general yes, the world is gradually waking up to the idea that performance matters, and that data locality can be extremely important for performance, and that doing your computations and data processing on your warehouse data in-place is going to be a huge time and money saver in the longer term.


> Databases are much more powerful than we think

And what people think is partly a function of attitudes towards working at the DB level. I see this often with ORMs in the web dev sphere (rather than Data Science). Yes, ORMs are great, but many people rely on them to completely abstract away the database and are terrified by raw SQL or even query building. You also see it with services that abstract away the backend like PocketBase, Fireship, etc. Writing a window function or even a sub-select looks like black magic to many.

I say this after several experiences with codebases where joins and filtering were often done at the application layer and raw sql was seen as the devil.


Opposite here - don't like ORMs. Too much overhead - though I get their value.


Database first designs make a lot of sense in a lot of ways. I've worked for a company with an Oracle database that has SQL scripts that do all the validation and create text files for downstream usage. I think it makes more sense than a ton of Java, but there are pros and cons. One is that SQL is relational and the advanced stuff can be extra hard to troubleshoot if you don't have enough experience. Even those that can't code can usually understand a for loop and can think imperatively.

Unfortunately it's an expensive commercial product or I'd recommend you look at kdb+ if you work with time series data. The big banks use it and essentially put all their latest RT data into kdb+, and then can write extremely succinct queries with a SQL-like syntax, but with the ability to approach it far more programmatically than what is typically doable with something like PL/SQL. You can even write your neural network or whatever code in less than a page of code, as the language of kdb+ is extremely powerful, although also basically incomprehensible until someone puts some time into learning it. It's extremely lightweight though, so very easy to deal with in an interactive fashion.

All that to say I agree with you that it's nice to just have everything you want all in one spot rather than to deal with 4 different tools and pipelines and shared drives and so on.


Absolutely.

Consider too that PostgreSQL databases support different languages, like Python.

Loads of for-profit companies have tried to cash in on this. SAP HANA is one of the ones I've had recent experience with. It is unfortunately a poor implementation. The right architecture tends to be: put your model behind an API interface, not internal on the system. Train your model separately from production systems, and so on.

You might also be interested in checking out MLOps platforms like Kubeflow, Flyte, and others.


There is a recent trend in database research: ML in databases. Not sure how much of an impact it can make though; the sweet spot is doing relatively linear stuff, arguably just a step up from analytical functions in queries, while cutting-edge ML needs dedicated GPUs for the compute load and often uses unstructured and even binary data.


> Are the current tradeoffs just that more people know python vs sql or database specific languages to where moving logic to postgres or snowflake is looked down on?

Yes, mostly development, deployment, etc. concerns. I haven't ever seen an org that versions their SQL queries unless they are in a codebase. The environment is just unfriendly towards that type of management. Never mind testing! These things have solutions, but we haven't matured enough here, because that type of development has been happening in application code.

Also, SQL is generally more complex than application logic, because they are designed to do different things. What is a simple exercise in iteration or recursion elsewhere can easily become a bit of a headache.

Problems that can be resolved, but they are problems.


I would like for this to be the case. I was DBA of a Postgres database for LIMS almost 2 decades ago. Back then you could code functions for the database to execute on data, and it was very powerful, but also very clunky. The tools to do software development in the database were not mature at all. A lot has changed in the past 20 years and SQL has evolved. Do you think SQL will expand that much, or will there be APIs built into the database? Near-data functions are powerful and useful, but I would want my development environment to be more like version-controlled code than "built-in".

I wonder if near-data functions on small databases is the solution to the limit of statelessness that you have with functions as a service.


With “traditional” RDBMS systems, putting a lot of code in the DB led to a lot of scaling issues where you’d need gigantic machines with a lot of RAM and CPU, because the DB was doing so much work. It was expensive and clunky to get HA right.

With more modern DBs being distributed horizontally, this approach may see a rebound. The big “but” is still cost: in my experience in AWS, as an example, managed Postgres Aurora was surprisingly expensive in terms of monthly cost.


Why would running code in a database process be intrinsically more expensive than running it in some other process?


The issue is that many relational databases are not horizontally scalable, so you want to be frugal with their resources.


The resources I'm familiar with are I/O, memory, and CPU. The only one that I believe can be spared in the database by spending that resource outside the database is CPU. When the database is far from CPU-saturated and latency and throughput are determined by I/O and memory, burning CPU on some other machine that isn't the database can't possibly improve latency or throughput.


> When the database is far from saturated on CPU

The issue here is that if you scale enough to saturate the database, you'll have to rewrite essentially all your code if you're a typical CRUD webapp. Basically all of your business logic is about data retrieval. There are probably some companies that can get away with this, but it would be way too expensive for most.


If I scale up to saturate the database CPU...by doing data retrieval? Setting aside my skepticism about saturating the CPU with mere data retrieval, how is that solved by moving the data to another host's CPU, when moving the data involves the very data retrieval that's saturating the database's CPU?


If you mix in compute-heavy calculations (like stochastic gradient descent) in with your pure data-retrieval, yes you'll saturate the database's CPU, and you won't have a reasonable way to scale it.

If you do it on a host that's not a database, then you can horizontally scale it. There's a reason stateless apps are the default.
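To make the compute-heavy case concrete: the SGD-in-SQL idea from the article boils down to the gradient being an aggregate over the data, so each (full-batch, in this toy version) step is one SELECT. A sketch in Python driving an in-memory SQLite table (names and data are made up; the article's real version keeps the loop in the warehouse too):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE points (x REAL, y REAL)")
# y = 3x exactly, so the optimal weight is w = 3
con.executemany("INSERT INTO points VALUES (?, ?)",
                [(i, 3.0 * i) for i in range(1, 6)])

w, lr = 0.0, 0.01
for _ in range(200):
    # gradient of mean squared error for y ~ w*x, computed entirely in SQL
    (grad,) = con.execute(
        "SELECT AVG(2 * x * (? * x - y)) FROM points", (w,)
    ).fetchone()
    w -= lr * grad

print(round(w, 4))  # 3.0
```

Every iteration is a full aggregate scan on the database's CPU, which is exactly the saturation risk being debated here.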


Ok so if I'm not mixing in compute heavy workloads like the stochastic gradient descent described in the article, then it's less likely I'll saturate the CPU. Perhaps that won't happen at all and then I won't have to scale horizontally.

On the other hand if I'm doing stochastic gradient descent that's saturating the CPU then there's a good chance I'm doing offline training of an ML model. In that case my latency tolerances are probably much much higher. In other words, I can also avoid scaling horizontally provided I can live with longer training times. That might be a worthwhile trade-off to me given the added complexity of horizontal scaling.

Good to know!


You can easily scale your app layer linearly across multiple machines, while that is harder with previous-gen databases.


What good is that going to do if the data always resides in the database?


I agree. I took a course in databases and SQL and was blown away by its power. With CTEs and PL/SQL you can do a lot of stuff inside the database.

I played with SQLite and its JSON columns. Once you get the hang of the syntax for walking a JSON structure you can do all sorts of neat things. Doing the same thing in Python would be tedious.

And I also believe it ended up being way faster than what I did in python.
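For anyone curious, the core of SQLite's JSON support is a handful of functions: `json_extract` pulls scalars out of a document, and the table-valued `json_each` unnests arrays so you can join against them. A small sketch (schema invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (payload TEXT)")
con.execute("""INSERT INTO events VALUES
    ('{"user": "ada",  "tags": ["sql", "json"]}'),
    ('{"user": "alan", "tags": ["ml"]}')""")

# pull a scalar out of each document
users = [r[0] for r in con.execute(
    "SELECT json_extract(payload, '$.user') FROM events")]

# explode an embedded array with the json_each table-valued function
tags = [r[0] for r in con.execute(
    "SELECT j.value FROM events, json_each(events.payload, '$.tags') j")]

print(users)  # ['ada', 'alan']
print(tags)   # ['sql', 'json', 'ml']
```

The `'$.path'` syntax generalizes to nested objects and array indices (`'$.tags[0]'`), which is what makes walking deep structures in pure SQL practical.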


had a very good chat with https://postgresml.org/ last week which is focusing on bringing ML to postgres: https://youtu.be/j8hE8-jZJGU


I'm watching it, it's really good. Montana makes a great point: you can move data to the models, or move the models to the data. Data is typically larger than models, so it makes sense to go with the latter.


thanks for watching! i should really up the production quality haha but also this is what i can kinda manage with my existing workload. idk how the pro youtubers make these calls interesting


Junior developers like me were uncomfortable with SQL twenty years ago. Java ORM frameworks became popular because of the object-relational impedance mismatch. I see the same kind of sentiment nowadays among newer generations, but in Python & co.

The success of the Apache Spark engine can at least partially be attributed to

* being able to have the same expressive power as SQL but with a real Scala API (including having reusable libraries based on it)

* being able to embed it into unit tests, at the low price of ~20 seconds of additional latency to spin up a local Spark master


A DBMS is really its own operating system, usually hosted on another operating system, one that understands the hardware.

I remember one place I worked where we had several old graybeard programmers who considered the DBMS[1] the operating system. As a unix sysadmin I had some interesting discussions with them: I was always confused and confined by the DBMS, and they felt the same about unix.

1. Unidata, if you're curious: a weird multi-value (not relational) database, very vertically integrated compared to most databases today.


I've created a big system with only SQL in the past like you just said. I wouldn't do it again because of these two pain points that I had to deal with:

1. it's really hard to debug SQL queries and stored procedures (at least it was in Postgres 11)

2. when you hit a performance bottleneck, you don't have much control over it - parallelizing is hard and you have to trick the query planner to do what you want (and it doesn't work sometimes)
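On point 2, you mostly get to observe the planner rather than command it, and EXPLAIN (EXPLAIN QUERY PLAN in SQLite's dialect) is the main window into what it chose. A minimal sketch of checking whether an index actually changed the plan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")

def plan(sql):
    # each row is (id, parent, notused, detail); detail is the readable step
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan("SELECT * FROM t WHERE a = 1")
con.execute("CREATE INDEX idx_t_a ON t(a)")
after = plan("SELECT * FROM t WHERE a = 1")

print(before)  # full table scan, e.g. ['SCAN t']
print(after)   # e.g. ['SEARCH t USING INDEX idx_t_a (a=?)']
```

Postgres's EXPLAIN (ANALYZE, BUFFERS) gives far richer output, but the workflow is the same: diff the plan before and after each "trick" rather than trusting that the hint took.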


Not challenging your experience but genuinely trying to learn from it, in broad strokes what kinds of things were you doing in these stored procedures?


The problem with this approach is that:

1) No static typing

2) Updating the logic requires migrations

3) You put the logic into the place which is the hardest to scale and many times a single point of failure

4) Cannot compose the code effectively, and general verbosity

It's one of those ideas that sounds great on paper and maybe works for some smaller problems, but as you go up in complexity things get worse and worse.


It is tempting to combine a web server, a database, and some imperative language with built-in data-oriented / SQL features into a single executable, and call it an application server that communicates with the outside world using, for example, JSON-based RPC. I think there were / are some products in this area, even with a built-in IDE (like Wakanda).


Fully agree.

And using PostgREST [0] you can serve your PostgreSQL database as a REST API. And if you throw foreign data wrappers / Multicorn into the mix, you can map any other data source into your PostgreSQL DB as a table.

[0] https://postgrest.org/en/stable/


> Databases are much more powerful than we think

and data has mass. One example of bringing the work to the data is https://madlib.apache.org/ (works on Postgres and Greenplum)

[Disclaimer - former employee of Pivotal]


I have yet to see a decent IDE or system which allows great version control, unit testing and collaboration with SQL source code.

So I think a lot of the reluctance is from practical concerns.


We store individual SQL files in GitHub and keep them in separate folders.

This is very simple and scales well for our purposes.



