More

stephanheijl · on Dec 4, 2024

Love to see this on HN, very interesting research. I'm looking forward to the paper release and I appreciate the way that you are licensing these models. ESM 2-650M is still a solid baseline but seeing ESM C 6B outperforming it by these kinds of strides looks encouraging for the future possibilities of protein language models. Would be very interested to find out how well it performs on other benchmarks (ie ProteinGym zero-shot).

stephanheijl · on Nov 30, 2024

I would normally only buy apples around september/october in the Netherlands, trying to get them fresh from local orchards when possible. Elstar is amazing when plucked right from the tree, but becomes mealy before the end of the year IMO.

My new go to is the Magic Star variety, which has been sold as "Sprank" for the last 3 years at least in the Netherlands. These apples keep amazingly well; they ran out of stock around the summer in the last two years, but I found them delicious year round. I hope that this cultivar does not befall the same fate as the Honeycrisp, which I had the pleasure of tasting 5 years ago.

stephanheijl · on March 24, 2023

To be more exact, LoRA adds two matrices `A` and `B` to any layers that contain trainable weights. The original weights (`W_0`) have the shape `d × k`. These are frozen. Matrix `A` has dimensions `d × <rank>` (`rank` is configurable) and matrix `B` has the shape `<rank> × k`. A and B are then multiplied and added to `W_0` to get altered weights. The benefit here is that the extra matrices are small compared to `W_0`, which means less parameters need to be optimized, so less activations need to be stored in memory.

twic · on March 24, 2023

Ah, so the resulting model contains both the large matrix of original weights, and also the two small matrices of alterations? But this is smaller than the alternative of a model which contains the large matrix of original weights, and an equally large matrix of alterations.

Why is fine-tuning done with separate alterations, rather than by mutating the original weights?

arugulum · on March 24, 2023

> Why is fine-tuning done with separate alterations, rather than by mutating the original weights?

The goal of most parameter-efficient methods is to store one gold copy of the original model, and learn minor modifications/additions to the model. The easiest way to think about this is in some kind of deployment setting, where you have 1 capable model and you learn different sets of LoRA weights for different tasks and applications.

The original intent of parameter-efficient methods is to reduce the amount of storage space needed for models (do you really want to keep a whole additional copy of LLaMA for each different task?). A secondary benefit is that because you are fine-tuning a smaller number of parameters, the optimizer states (can take up to 2x the size of your model) are also heavily shrunk, which makes it more economical (memory-wise) to (parameter-efficient) fine-tune your model.

leobg · on March 25, 2023

That’s probably what OpenAI does with their custom fine tuned models, no?

stu2b50 · on March 24, 2023

> But this is smaller than the alternative of a model which contains the large matrix of original weights, and an equally large matrix of alterations.

It's actually larger. If you just have two equally large matrices of the same dimension, one original, and one of "altercations"... then you can just add them together.

> Why is fine-tuning done with separate alterations, rather than by mutating the original weights?

Then you'd have to compute the gradients for the whole network, which is very expensive when the model has 7b, 65b, 165b parameters. The intent is to make that cheaper by only computing gradients for a low rank representation of the change in the weight matrix from training.

arugulum · on March 24, 2023

>Then you'd have to compute the gradients for the whole network

You have to do that with LoRA regardless, to compute the gradients for the lowest-level LoRA weights.

gliptic · on March 24, 2023

Correct me if I'm wrong, but I think you still need to compute gradients of non-trained weights in order to compute the gradients of the LoRA weights. What you don't have to do is store and update the optimizer state for all those non-trained weights.

stu2b50 · on March 24, 2023

I mean the derivative of a constant is 0. So if all of the original weights are considered constants, then computing their gradients is trivial, since they’re just zero.

jprafael · on March 24, 2023

Computing gradients is easy/cheap. What this technique solves is that you no longer need to store the computed values of the gradient until the backpropagation phase, which saves on expensive GPU RAM, allowing you to use commodity hardware.

TuringTest · on March 24, 2023

It's larger, but there are less parameters to train for your specific use case since you are training the small matrix only, while the original ones remain unaltered.

seydor · on March 24, 2023

Can rank decomposition be used to reduce the original weight matrices as well? Or are they assumed to be compressed already?

metanonsense · on March 29, 2023

Those fully trained networks are usually considered full-rank. At least that is what they say in the LoRA paper.

grph123dot · on March 24, 2023

Your explanation is crystal clear. I suppose it works well in practice, but is there any reason it works that well?

stu2b50 · on March 24, 2023

Per the original paper, empirically it’s been found that neural network weights often have low intrinsic rank. It follows, then, that the change in the weights as you train also have low intrinsic rank, which means that you should be able represent them with a lower rank matrix.

grph123dot · on March 24, 2023

Since we are in ELI5, it seems that the concept of low rank approximation is required to understand this method.

(1) https://en.wikipedia.org/wiki/Low-rank_approximation

Edited: By the way, it seems to me that there is an error in the wikipedia page because if the Low-rank approximation takes a larger rank then the bound of the error should decrease, and in this page the error increases.

grph123dot · on March 24, 2023

>> that the change in the weights as you train also have low intrinsic rank

It seems that the initial matrix of weights has a low rank approximation A and this implies that the difference E = W - A is small, also it seems that PCA fails when E is sparse because PCA is designed to be optimum when the error is gaussian.

stu2b50 · on March 24, 2023

In terms of PCA, PCA is also quite expensive computationally. Additionally, you'd probably have to do SVD instead.

Since the weights are derived from gradient descent, yeah we don't really know what the distributions would be.

A random projection empirically works quite well for very high dimensions, and is of course very cheap computationally.

seydor · on March 24, 2023

Does this mean the matrices are highly compressible?

loxias · on March 24, 2023

kinda/yes. To translate to more intuitive concepts: the matrices don't contain much variance in as many degrees of freedom as they could.

Think of a point cloud of a piece of paper floating in the wind. It would be a 3xn list of points, but "really" it's a 2d piece of paper.

Just like I can rewrite the number 27 as 333 or 8+19 or (2^3)+(2^4)+3.. Given a single matrix one can find myriad ways to rewrite it as a sequence of matrices that have the same (or similar) numeric value, but with interesting or desirable properties. :D

My favorite example (which is used in signal processing) is to take your ugly matrix and rewrite it as a set of smaller matrices where most of the elements are zero, or a power of 2.

It turns out, computers can multiply by zeros and powers of two very fast

stephanheijl · on Jan 30, 2023

> Much too often, we are overworked and underpaid. We are in what people call a “passion industry,” one that ultimately capitalizes on our love of stories to excuse low wages and a “you better be grateful to this opportunity” attitude. We wish it was different. We’re not quite sure how to make it different.

The author identifies the situation completely accurately and is also able to see that they have no leverage whatsoever to make Harper Collins change their behavior. The fact is that doing a job that you are absolutely passionate about is something to be grateful for, and also something that is in high demand. Combined with the fact that the exact skill set required for this kind of work is currently possessed by a surplus of workers means that any position left by a striker will be rapidly fulfilled by a passionate scab.

I have no doubt that publishers will also readily take advantage of AI to make the pool of available jobs even shallower, which will decrease the viability of this movement even more.

stephanheijl · on Jan 4, 2023

Designing novel viral proteins might become trivial, but actually doing the lab work to produce them would still be a tough exercise. On the other hand, exactly by the mechanism that would give rise to such a novel pathogen, actors that do have access to large manufacturing capabilities would be able to create novel drugs rapidly or even prevantatively.

KidComputer · on Jan 4, 2023

It’s difficult, but doable and nowhere near as hard as the synthetic chemistry need to produce small molecule therapeutics to fight novel pathogens. The hardest part in my opinion would be getting accurate predictions of protein-protein binding free energies.

> create novel drugs rapidly or even prevantatively. On your final point I’m skeptical. Drugs are difficult to design because you need to account for off-target effects among other things. That’s not a concern when designing a harmful agent. Furthermore I presume one could intelligently harden the pathogen so any potential treatment might be as harmful as the pathogen itself. But that’s a strong assumption and I know of no way to formally verify it.

stephanheijl · on Oct 17, 2022

I just add them to an img folder in my Github repository and it is then served as part of the Github page. Just using `src=img/picture.jpg` works fine.

taubek · on Oct 18, 2022

I've used systems like Docusarus[1] (for building my project documentation). It is build from my github project. And I end up with URLs on my server like https://mysite.com/docs/assets/images/connect-to-app-02c6a40...

But I've never tried to link directly to source file, not builded version on my server.

[1] https://docusaurus.io/

stephanheijl · on Sept 9, 2022

This is the dockerized version of this repo: https://github.com/AbdBarho/stable-diffusion-webui-docker

stephanheijl · on Aug 19, 2021

> has less catastrophic failure modes

Given incidence of dam bursts like the 1975 Banqiao dam failure[1], with an estimate death toll of 26,000 to 240,000 people and flooding of over 12,000 square kilometers, I'm inclined to disagree with this assessment.

[1]https://en.wikipedia.org/wiki/1975_Banqiao_Dam_failure

goohle · on Aug 19, 2021

Can you compare number of deaths because of dam failure with number of deaths because of flood and no dam to protect from flood? I.e. should we stop to build dams for flood protection because of telegraph failure at Banqiao dam?

stephanheijl · on Aug 19, 2021

I am not claiming we should not have dams, or refrain from using dams to provide hydroelectricity. My point of contention is the assertion that hydro energy has less catastrophic failure modes than nuclear (presumably fission). Clearly the failure modes for hydro dams are at least on par with those of nuclear installations, given that we have historic evidence that they can inflict tens of thousands of casualties.

Hypothetical scenarios can be constructed for both methods of power generation (What if Braidwood plant melts down, somehow killing the 5 million people living within a 50 mile radius? What if the Three Gorges Dam busts, inundating an area inhabited by 600 million people?[1]) Either way, it is far from obvious to me that hydro has the superior safety profile, especially when their fatality rates are on the same order of magnitude even when the Banqiao incident is removed[2]*.

[1] https://web.archive.org/web/20210620174812/https://www.japan... p/opinion/2020/09/01/commentary/world-commentary/big-china-disaster/

[2] https://ourworldindata.org/safest-sources-of-energy

* Sovacool et al. (2016), the source of the data in [2] includes a hydro fatality rate per tWh 2.4x larger than nuclear when Banqiao is included.

heavenlyblue · on Aug 19, 2021

"We should not do tooth fillings because tooth fillings fail and then you're in pain"

stephanheijl · on July 22, 2021

I'm impressed and grateful that DeepMind released this resource, this will save a lot of compute from labs trying to replicate an entire exome for themselves. While some structures look great, there are still some misses here. Important structures like BRCA1 (a well-studied breast cancer associated protein) are just structures for the BRCT and RING domains surrounded by a low-confidence string of amino acids, likely shaped to be globular: https://alphafold.ebi.ac.uk/entry/P38398

Maybe I was wrong for expecting the impossible here, but I was excited to see this specific structure and it appears that there is still work to do. Nevertheless, kudos to Deepmind on their amazing achievement and contributions to the field!

cing · on July 22, 2021

Everything between the BRCT and RING domains of BRCA1 is an intrinsically unstructured region which DeepMind correctly predicts, https://pubmed.ncbi.nlm.nih.gov/15571721/

Another famous one would be R-domain of CFTR, which was not resolved in experimental structure determination, and AlphaFold models correctly show disorder there. Nothing to be done in those cases except perform molecular simulation or other experiments to assess dynamic ensembles, https://alphafold.ebi.ac.uk/entry/P13569

maga · on July 22, 2021

A curious non-biologist here: how valuable are these low confidence predictions for biologists? In other words, is it hard to predict but easy to check situation as with, say, prime numbers in mathematics?

toufka · on July 22, 2021

The medium-confidence predictions are great for grounding or sourcing intuition. If you're trying to divide up a protein for an experiment and you have to choose where to divy it up - you'd like to use even a bad prediction to help weight an otherwise completely random approach. AND there are great methods to help with this, but they're often custom, time-consuming, and out-of-field for most. So being able to very quickly spot-check using a uniform state-of-the art, for any arbitrary protein, makes it actually pretty useful for certain kinds of pre-experimental guidance.

devindotcom · on July 22, 2021

Some are valuable for the reasons the other person responding noted, but some of the low confidence predictions may also be high confidence predictions of a disordered class of protein that doesn't have a standard rest state. So it's useful work one way or the other.

stephanheijl · on April 13, 2021

You linked an income inequality graph, which is distinct from wealth. If you look at the list on Wikipedia [1] and sort by Wealth Gini (2019) you will find that the Netherlands are #1.

[1] https://en.wikipedia.org/wiki/List_of_countries_by_wealth_eq...

Bootvis · on April 13, 2021

Any explanation for the jump from 2018 to 2019?

stephanheijl · on April 13, 2021

Based on the Global Wealth Book 2018 and 2019 from Credit Suisse, comparing Table 3-1 in both books, it appears that the number of adults in "under 10.000" range increased significantly in 2019. This is likely not a real change, but rather a change of methodology or data sources. As far as I can tell however, this is not detailed in the text. I have directed an email to Credit Suisse on this, as it's a rather interesting piece of data.

Bootvis · on April 13, 2021

Quite interesting, curious what they have to say. I'd be glad if you can put me in cc (e-mail in profile).

mdorazio · on April 13, 2021

Ah you’re right. I misread the parent comment. Thanks.