
I was impressed enough by replit's 2.7B model that I'm convinced it's doable. I have a 4090 and consider that the "max expected card for a consumer to own".

Also exllama doesn't support non-llama models, and the creator doesn't seem interested in adding support for wizardcoder/etc. Because of this, the alternatives are prohibitively slow for running a quantized 16B model on a 4090 (if the exllama author reads this _please_ add support for other model types!).

3B models are pretty snappy with Refact, about as fast as github copilot. The other benefit is more context space, which will be a limiting factor for 16B models.

tl;dr - I think we need ~3B models if we want any chance of consumer hardware reasonably running coding models akin to github copilot with a decent context length. And I think it's doable.



I'm fairly confident a coding-specific model could be a lot smaller - 3b should be plenty, if not 1b or less. As it stands, there are quite a few 7-13b models that can predict natural language quite well. Code seems, at its surface, a much simpler language - strict grammars, etc. - so I wouldn't think it needs to be anywhere near as large as the nlp models. Right now people are retraining nlp models to work with code, but I think the best code helper models in the future will be trained primarily on code and maybe fine-tuned on some language. I'm thinking less of a chat bot api and more of a giant leap in "intellisense" services.


> Code seems at its surface a much simpler language

When using GitHub Copilot, I often write a brief comment first and most of the time, it is able to complete my code faster than if I had written it myself. For my workflow, a good code model must therefore also be able to understand natural text well.

Although I am not sure to which degree the ability to understand natural text and the ability to generate natural text are related. Perhaps a bit of text generation capabilities can be traded off against faster execution and fewer parameters.


Understanding should be much easier, for the same reason humans (e.g. children, foreign-language learners) can always understand more than they can say: human language is fairly low-entropy, so if there's a word you don't understand, you can pick up most of the meaning from context. On the other hand, producing natural-sounding language requires knowing every single word you're going to use.


I'd really like to see smaller models trained on only one specific language, with its own language-specific tokenizer. I imagine the reduction in vocab size would translate to handling more context more easily?


I think simply having the vocab more code-friendly (e.g. codex) would make the biggest difference. Whitespace is the biggest one (afaik every space is a token), but consider how many languages contain `for(int i=0;`, `) {\n`, `} else {`, `import `, etc.
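To make the point concrete, here's a toy sketch (not a real BPE - the code-specific vocab entries below are hypothetical examples) showing how merging common code idioms into single tokens shrinks the token count versus a vocab with no code-aware entries:

```python
# Toy greedy longest-match tokenizer: at each position, take the longest
# matching vocab entry, falling back to single characters.

def tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        match = next(
            (v for v in sorted(vocab, key=len, reverse=True)
             if text.startswith(v, i)),
            text[i],  # fallback: one character per token
        )
        tokens.append(match)
        i += len(match)
    return tokens

code = "for(int i=0; i<n; i++) {"

# No code-aware entries: every character is its own token.
char_tokens = tokenize(code, set())

# Hypothetical code-aware vocab merging common idioms into single tokens.
code_vocab = {"for(int i=0;", " i<n;", " i++)", " {"}
code_tokens = tokenize(code, code_vocab)

print(len(char_tokens), len(code_tokens))  # -> 24 4
```

Real tokenizers sit between these extremes, but the direction of the effect is the same: the more code idioms the vocab captures, the more code fits in a fixed context window.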

My understanding is that a model properly trained on multiple languages will beat an expert based system. I feel like programming languages overlap, and interop with each other enough that I wouldn't want to specialize it in just one language.


There are also just far more tokens to train on if you go multi-language. I'd guess only the most popular languages would even have enough training data for a specialized version - but it would still be an interesting trade-off for certain use cases. Being able to run a local code assistant on a typescript-only project, for example, with a 32k context window would really come in handy for a lot of people. I don't know enough to understand the impact of vocab size vs context size.


It's worth noting that, from what I can tell, a model well trained on most languages would be able to learn the niche ones much more easily.

The vocab size of llama2 is 32,000. I guess I personally don't think that there's enough difference in programming languages to actually save any meaningful number of tokens considering the magnitude of the current vocab.


I wonder if you could train a model generally across a lot of languages, then specialize for a specific one with a different tokenizer / limited vocabulary? Here's the reference I've been using for llama 2 tokens:

https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4...

It looks like if you just limited it to English it'd cut the count almost in half - further limiting the vocab to a specific programming language could cut it down even more. Pure armchair theory-crafting on my part, no idea if limiting vocab is even a reasonable way to improve context handling. But it's an interesting idea - build on a base, then specialize as needed and let the user swap out the LLM on an as-needed basis (or the front-end tool could simply detect the language of the project). 3B or smaller models with very long context which excel at one specific thing could be really useful (e.g. a local code completer for English typescript projects).
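The "specialize the vocab" idea can be sketched in a few lines - keep only the vocab entries that actually occur in a target-language corpus and see how much the embedding table shrinks. Everything here (the vocab, the corpus, the 4096 width) is a made-up toy, and a real system would keep byte-level fallback tokens rather than dropping entries outright:

```python
# Hypothetical mixed-language vocab; only some entries are relevant
# to a typescript project.
full_vocab = ["def", "fn", "function", "let", "const", "=>", "lambda",
              "público", "переменная", "整数"]

ts_corpus = "const add = (a, b) => a + b; function id(x) { return x; }"

# Keep only entries that appear in the corpus.
specialized = [tok for tok in full_vocab if tok in ts_corpus]

d_model = 4096  # hypothetical embedding width
saved = (len(full_vocab) - len(specialized)) * d_model
print(specialized)  # -> ['function', 'const', '=>']
print(saved)        # parameters saved in the embedding + output layers
```

At llama2's scale (32,000 entries), the savings would mostly show up in the embedding and output-projection matrices, not in context handling per se - which is part of why it's unclear the trade-off pays off.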


replit’s model is surprisingly good at generating code, even at following complex instructions that I was sure would confuse it. I have found it’s a bit weak on code analysis, for open-ended questions like ‘is there a bug anywhere in this code?’ that GPT-4 can answer.


exLlama is not the only viable quantized backend. TVM (as used by mlc-llm) and GGML (which is used by llama.cpp) are very strong contenders.

~7B-13B models will work in 16GB of RAM, with pretty much any dGPU to help, plus context-extending tricks.
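A quick back-of-the-envelope check of that claim, assuming roughly 4-bit quantized weights (the bit width and overhead split are assumptions, not measurements):

```python
# Weight memory for an N-billion-parameter model at a given bit width.
def model_gb(params_billion, bits_per_weight=4):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for p in (7, 13):
    print(f"{p}B @ 4-bit ~= {model_gb(p):.1f} GB of weights")
# 7B -> 3.5 GB, 13B -> 6.5 GB, leaving headroom in 16 GB for the
# KV cache and activations, which grow with context length.
```

The KV cache is the part that eats the headroom at long contexts, which is why the context-extending tricks matter as much as the quantization.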

TBH I suspect Stability released a 3B model because it's cheap and quick to train. If they really wanted a good model on modest devices, they would have reused a supported architecture (like Falcon, MPT, Llama, Starcoder...) or contributed support to a good backend.

*Also, I think any PyTorch-based model is not really viable for consumer use. It's just too finicky to install and too narrow in hardware support.



