
GPT-4 (and Claude) are definitely the top models out there, but Llama, even the 8B, is more than capable of handling extraction like this. I've pumped absurd batches through it via vLLM.
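Roughly, the shape of it with vLLM (the model ID, prompt template, and documents here are just placeholders, not my actual pipeline):

    # Batch extraction sketch with vLLM; model and prompt are illustrative.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    params = SamplingParams(temperature=0.0, max_tokens=256)

    docs = ["...document 1...", "...document 2..."]
    prompts = [
        f"Extract the key fields from this document as JSON:\n\n{d}"
        for d in docs
    ]

    # vLLM schedules the whole list through the GPU with continuous batching.
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.outputs[0].text)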

With serverless GPUs, the cost has been basically nothing.



Can you explain a bit more about what "serverless GPUs" are exactly? Is there a specific cloud provider you're thinking of? E.g., is there a GPU product on AWS? Googling gives me SageMaker, which is perhaps what you're referring to?


There are a few companies out there that provide it, Runpod and Replicate being the two that I've used. If you've ever used AWS Lambda (or any other FaaS) it's essentially the same thing.

You ship your code as a container built around a small library they provide that lets them invoke it, and you're billed per second of execution time.
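For Runpod, assuming their Python SDK's handler pattern, the worker is roughly this shape (the input schema and handler body are placeholders for illustration):

    import runpod

    def handler(job):
        # "input" is whatever payload you submit to the endpoint.
        prompt = job["input"].get("prompt", "")
        # Placeholder for real inference; load your model at module
        # scope so it stays warm across invocations.
        return {"output": f"echo: {prompt}"}

    # Hands control to Runpod's job loop; you're billed while it runs jobs.
    runpod.serverless.start({"handler": handler})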

Like most FaaS, if your load is steady-state it's more expensive than just spinning up a GPU instance.

If your use-case is more on-demand, with a lot of peaks and troughs, it's dramatically cheaper. Particularly if your trough frequently goes to zero. Think small-scale chatbots and the like.

Runpod, for example, would cost $3.29/hr or ~$2400/mo for a single H100. I can use their serverless offering instead for $0.00155/second. I get the same H100 performance, but it's not sitting around idle (read: costing me money) all the time.
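Rough break-even math on those two rates (prices copied from above; the rest is arithmetic):

    on_demand_per_hr = 3.29                        # dedicated H100, $/hr
    serverless_per_sec = 0.00155                   # $/s of actual execution
    serverless_per_hr = serverless_per_sec * 3600  # ~$5.58/hr while running

    # Serverless is cheaper whenever your utilization is below this:
    print(f"{on_demand_per_hr / serverless_per_hr:.0%}")  # ~59%

So if the GPU would be busy less than roughly 59% of the time, serverless wins; above that, a dedicated instance is cheaper.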


You can check out this technical deep dive on serverless GPU offerings and pay-as-you-go pricing.

It includes benchmarks around cold starts, performance consistency, scalability, and cost-effectiveness for models like Llama 2 7B and Stable Diffusion across different providers: https://www.inferless.com/learn/the-state-of-serverless-gpus... It could save you months of evaluation work. Do give it a read.

P.S.: I am from Inferless.



