LoRA (Low-Rank Adaptation), as explained in the paper "LoRA: Low-Rank Adaptation of Large Language Models", hypothesizes that instead of fully fine-tuning a large model for a downstream task, we can learn low-rank decomposition matrices that update the parameters only partially, since not all parameters contribute significantly in a huge model. This way, a model with 175B parameters can be fine-tuned for a specialized task with limited hardware resources and without additional inference latency.

"LoRA allows us to train every dense layer in a neural network indirectly by injecting and optimizing rank decomposition matrices of the dense layer's update instead, while keeping the original matrices frozen."

Understanding the Low-Rank Updates

This technique reduces the hardware resources needed for a downstream task, but where should we apply LoRA, and to which weights: the attention projections, the MLP, or both?

Q. Which Weight Matrices in the Transformer Should We Apply LoRA to?

We only consider the weight matrices in the self-attention module. Under a fixed budget of about 18M trainable parameters across all 96 layers of GPT-3, we set r = 8 when adapting a single type of attention weight (Wq, Wk, or Wv) and r = 4 when adapting two types together (e.g., Wq and Wk, or Wq and Wv). Note that putting all the parameters in ∆Wq or ∆Wk alone results in significantly lower performance, while adapting both Wq and Wv yields the best result. This suggests that even a rank of four captures enough information in ∆W, so it is preferable to adapt more weight matrices than to adapt a single type of weight with a larger rank.

Q. What is the Optimal Rank r for LoRA?

We adapt both Wq and Wv, since they performed best in the previous experiment, as well as Wq alone for comparison. LoRA already performs competitively with a very small r (more so for {Wq, Wv} than for Wq alone). This suggests the update matrix ∆W could have a very small "intrinsic rank". To further support this finding, we check the overlap of the subspaces learned with different choices of r and with different random seeds. We argue that increasing r does not cover more meaningful subspaces, which suggests that a low-rank adaptation matrix is sufficient.

Q. How Does the Adaptation Matrix ∆W Compare to W?

We further investigate the relationship between ∆W and W. In particular, does ∆W highly correlate with W? (Or, mathematically, is ∆W mostly contained in the top singular directions of W?) Also, how "large" is ∆W compared to the corresponding directions in W? This can shed light on the underlying mechanism of adapting pre-trained language models. First, ∆W has a stronger correlation with W than a random matrix does, indicating that ∆W amplifies some features that are already in W. Second, instead of repeating the top singular directions of W, ∆W only amplifies directions that are not emphasized in W.
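To make the rank decomposition concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer, assuming the usual formulation h = W0x + BAx with the update scaled by α/r as described in the paper. The class name LoRALinear and the initialization details are illustrative choices, not code from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update ΔW = B @ A."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        # Pre-trained weight W (out x in) is kept frozen during adaptation
        # (randn here is just a stand-in for loading pre-trained weights).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Low-rank factors: A is (r x in), B is (out x r).
        # A gets a small random init, B starts at zero so that ΔW = BA = 0 at the start.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        # The paper scales the low-rank update by alpha / r.
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen_out = x @ self.weight.T                   # original frozen path
        lora_out = (x @ self.lora_A.T) @ self.lora_B.T   # low-rank update ΔW = B @ A
        return frozen_out + self.scaling * lora_out


# Toy usage: only the low-rank factors are trainable.
layer = LoRALinear(in_features=768, out_features=768, r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 4 * (768 + 768) = 6144
```

Each adapted matrix adds only r × (d_in + d_out) trainable parameters, so for GPT-3 175B (d_model = 12288), adapting Wq and Wv with r = 4 in all 96 layers gives roughly 2 × 96 × 4 × (12288 + 12288) ≈ 18.9M parameters, consistent with the ~18M budget mentioned above.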
Conclusion:

Fine-tuning enormous language models is prohibitively expensive in terms of both the hardware requirement and the storage/switching cost of hosting multiple instances. We propose LoRA, an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining model quality. Importantly, it allows for quick task-switching when deployed as a service by sharing the vast majority of the model parameters. While we focused on the Transformer, the proposed principles are generally applicable to any neural network with dense layers, and LoRA can potentially work in tandem with other fine-tuning techniques. In the future, we hope to explore tuning only some layers or adding adversarial training. Finally, the rank-deficiency of ∆W suggests that W could be rank-deficient as well, which can also serve as a source of inspiration for future work.