AdaMix is a parameter-efficient fine-tuning (PEFT) method for large language models that outperforms both full fine-tuning and existing PEFT approaches like LoRA and adapters. By using a mixture of adaptation modules with stochastic routing and merging, AdaMix trains only 0.1–0.2% of parameters while maintaining the same computational cost as baseline PEFT methods. This innovation dramatically reduces storage needs and boosts performance across NLU and NLG tasks, making it one of the most effective fine-tuning techniques to date.

How to Improve AI Models While Training Only 0.1% of Parameters

2025/10/01 15:00

:::info Authors:

(1) Yaqing Wang, Purdue University (wang5075@purdue.edu);

(2) Sahaj Agarwal, Microsoft (sahagar@microsoft.com);

(3) Subhabrata Mukherjee, Microsoft Research (submukhe@microsoft.com);

(4) Xiaodong Liu, Microsoft Research (xiaodl@microsoft.com);

(5) Jing Gao, Purdue University (jinggao@purdue.edu);

(6) Ahmed Hassan Awadallah, Microsoft Research (hassanam@microsoft.com);

(7) Jianfeng Gao, Microsoft Research (jfgao@microsoft.com).

:::

Abstract and 1. Introduction

2. Background

    2.1 Mixture-of-Experts

    2.2 Adapters

3. Mixture-of-Adaptations

    3.1 Routing Policy

    3.2 Consistency regularization

    3.3 Adaptation module merging and 3.4 Adaptation module sharing

    3.5 Connection to Bayesian Neural Networks and Model Ensembling

4. Experiments

    4.1 Experimental Setup

    4.2 Key Results

    4.3 Ablation Study

5. Related Work

6. Conclusions

7. Limitations

Acknowledgment and References

Appendix

A. Few-shot NLU Datasets

B. Ablation Study

C. Detailed Results on NLU Tasks

D. Hyper-parameter

Abstract

Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks requires updating hundreds of millions to billions of parameters, and storing a large copy of the PLM weights for every task, resulting in increased costs for storing, sharing, and serving the models. To address this, parameter-efficient fine-tuning (PEFT) techniques were introduced, where small trainable components are injected into the PLM and updated during fine-tuning. We propose AdaMix as a general PEFT method that tunes a mixture of adaptation modules – given the underlying PEFT method of choice – introduced in each Transformer layer while keeping most of the PLM weights frozen. For instance, AdaMix can leverage a mixture of adapters like Houlsby (Houlsby et al., 2019) or a mixture of low-rank decomposition matrices like LoRA (Hu et al., 2021) to improve downstream task performance over the corresponding PEFT methods for fully supervised and few-shot NLU and NLG tasks. Further, we design AdaMix such that it matches the computational cost and number of tunable parameters of the underlying PEFT method. By tuning only 0.1–0.2% of PLM parameters, we show that AdaMix outperforms SOTA parameter-efficient fine-tuning and full model fine-tuning for both NLU and NLG tasks. Code and models are made available at https://aka.ms/AdaMix.

1 Introduction

Standard fine-tuning of large pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020; Raffel et al., 2019) to downstream tasks requires updating all model parameters. Given the ever-increasing size of PLMs (e.g., 175 billion parameters for GPT-3 (Brown et al., 2020) and 530 billion parameters for MT-NLG (Smith et al., 2022)), even the fine-tuning step becomes expensive, as it requires storing a full copy of model weights for every task.

Figure 1: Performance of different parameter-efficient fine-tuning methods on the GLUE development set with a RoBERTa-large encoder, following a setup similar to (Houlsby et al., 2019) for fair comparison. We report the performance of Pfeiffer (Pfeiffer et al., 2021), Houlsby (Houlsby et al., 2019), and LoRA (Hu et al., 2021) with their default number of fine-tuned parameters as well as with the number of fine-tuned parameters used in AdaMix with a mixture of adaptations. The red dashed line shows the performance of full model fine-tuning.

To address these challenges, recent works have developed parameter-efficient fine-tuning (PEFT) techniques. These approaches typically underperform standard full model fine-tuning, but significantly reduce the number of trainable parameters. There are many varieties of PEFT methods, including prefix-tuning (Li and Liang, 2021) and prompt-tuning (Lester et al., 2021), which condition frozen language models via natural language task descriptions; low-dimensional projections using adapters (Houlsby et al., 2019; Pfeiffer et al., 2020, 2021); and, more recently, low-rank approximation (Hu et al., 2021). Figure 1 shows the performance of some popular PEFT methods with varying numbers of tunable parameters. We observe a significant performance gap with respect to full model fine-tuning, where all PLM parameters are updated.

In this paper, we present AdaMix, a mixture of adaptation modules approach, and show that it outperforms SOTA PEFT methods and also full model fine-tuning while tuning only 0.1–0.2% of PLM parameters.

In contrast to traditional PEFT methods that use a single adaptation module in every Transformer layer, AdaMix uses several adaptation modules that learn multiple views of the given task. In order to design this mixture of adaptations, we take inspiration from sparsely-activated mixture-of-experts (MoE) models. In traditional dense models (e.g., BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020)), all model weights are activated for every input example. MoE models induce sparsity by activating only a subset of the model weights for each incoming input.

Consider adapters (Houlsby et al., 2019), one of the most popular PEFT techniques, to illustrate our method. A feedforward layer (FFN) is introduced to down-project the hidden representation to a low dimension d (also called the bottleneck dimension), followed by another FFN that projects back up to match the dimensionality of the next layer. Instead of using a single adapter, we introduce multiple project-up and project-down FFNs in each Transformer layer. We route input examples to one of the project-up and one of the project-down FFNs, resulting in the same computational cost (FLOPs) as using a single adapter. For methods like LoRA (Hu et al., 2021), which decompose the update to the pre-trained weights into low-rank matrices (A and B), we introduce multiple low-rank decompositions and route input examples to them in the same way as for adapters.
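To make the mixture concrete, below is a minimal sketch, not the authors' released implementation, of a mixture of bottleneck adapters with stochastic routing in PyTorch. The names (MixtureOfAdapters, num_modules, bottleneck_dim) are illustrative, and routing is simplified to one random choice per forward pass rather than per instance.

```python
import torch
import torch.nn as nn


class MixtureOfAdapters(nn.Module):
    """M bottleneck adapters; during training, one project-down and one
    project-up FFN are sampled at random for each forward pass (a per-batch
    simplification of per-instance stochastic routing)."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int, num_modules: int = 4):
        super().__init__()
        self.down = nn.ModuleList(
            [nn.Linear(hidden_dim, bottleneck_dim) for _ in range(num_modules)]
        )
        self.up = nn.ModuleList(
            [nn.Linear(bottleneck_dim, hidden_dim) for _ in range(num_modules)]
        )
        self.act = nn.GELU()
        self.num_modules = num_modules

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Stochastic routing: only one down- and one up-projection are
            # active, so FLOPs per pass match a single adapter.
            i = int(torch.randint(self.num_modules, (1,)))
            j = int(torch.randint(self.num_modules, (1,)))
        else:
            # At inference, the modules are merged into a single one (see the
            # merging sketch later in this section), so index 0 suffices.
            i = j = 0
        return hidden + self.up[j](self.act(self.down[i](hidden)))
```

Because only one project-down and one project-up FFN fire per pass, the forward cost matches that of a single adapter regardless of the number of modules M.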

We discuss different routing mechanisms and show that stochastic routing yields good performance while eliminating the need to introduce any additional parameters for module selection. To alleviate the training instability that may arise from the randomness of selecting different adaptation modules at different training steps, we leverage consistency regularization and the sharing of adaptation modules during stochastic routing.
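The consistency term can be sketched as a symmetric KL penalty between two stochastic forward passes of the same batch, as below. This assumes a classification head; model, batch, labels, and alpha are placeholder names, and the exact form of the regularizer here is an illustration rather than the paper's verbatim objective.

```python
import torch
import torch.nn.functional as F


def consistency_loss(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    # Symmetric KL divergence between the predictive distributions obtained
    # from two forward passes with different randomly selected modules.
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)
    kl_qp = F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)
    return 0.5 * (kl_pq + kl_qp)


def training_step(model, batch, labels, alpha: float = 1.0) -> torch.Tensor:
    logits_a = model(batch)  # first stochastic route
    logits_b = model(batch)  # second stochastic route (different modules)
    task_loss = F.cross_entropy(logits_a, labels)
    return task_loss + alpha * consistency_loss(logits_a, logits_b)
```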

The introduction of multiple adaptation modules results in an increased number of adaptation parameters. This does not increase computational cost but does increase storage cost. To address this, we develop a merging mechanism that combines the weights of the different adaptation modules into a single module in each Transformer layer. This allows us to keep the number of adaptation parameters the same as that of a single adaptation module. Our merging mechanism is inspired by weight averaging in model soups (Wortsman et al., 2022) and MultiBERTs (Sellam et al., 2022). Weight averaging of models with different random initializations has been shown to improve model performance in recent works (Matena and Raffel, 2021; Neyshabur et al., 2020; Frankle et al., 2020), which show that the optimized models lie in the same basin of the error landscape. While the above works are geared towards fine-tuning independent models, we extend this idea to parameter-efficient fine-tuning with randomly initialized adaptation modules and a frozen language model.
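Under the same assumptions as the MixtureOfAdapters sketch above, merging can be illustrated as a simple element-wise average of the M project-down and project-up weights into a single module; whether plain averaging exactly matches the paper's merging recipe should be treated as an assumption of this sketch.

```python
import torch


@torch.no_grad()
def merge_adaptation_modules(mix) -> None:
    """Collapse all adaptation modules of a MixtureOfAdapters into module 0
    by averaging their weights, so only single-module parameters are stored."""
    for bank in (mix.down, mix.up):
        avg_weight = torch.stack([m.weight for m in bank]).mean(dim=0)
        avg_bias = torch.stack([m.bias for m in bank]).mean(dim=0)
        bank[0].weight.copy_(avg_weight)
        bank[0].bias.copy_(avg_bias)
    # After merging, inference uses bank[0] only (the eval branch above),
    # and the remaining modules can be discarded before saving.
```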

Overall, our work makes the following contributions:

(a) We develop a new method, AdaMix, as a mixture of adaptations for parameter-efficient fine-tuning (PEFT) of large language models. Given any PEFT method of choice, like adapters and low-rank decompositions, AdaMix improves downstream task performance over the underlying PEFT method.

(b) AdaMix is trained with stochastic routing and adaptation module merging to retain the same computational cost (e.g., FLOPs, #tunable adaptation parameters) and benefits of the underlying PEFT method. To better understand how AdaMix works, we demonstrate its strong connections to Bayesian Neural Networks and model ensembling.

(c) By tuning only 0.1–0.2% of a pre-trained language model's parameters, AdaMix is the first PEFT method to outperform full model fine-tuning for all NLU tasks on GLUE, and it outperforms other competing methods for NLG and few-shot NLU tasks.

Practical benefits of PEFT methods. The most significant benefit of PEFT methods comes from the reduction in memory and storage usage. For a Transformer, the VRAM consumption can be significantly reduced as we do not need to keep track of optimizer states for the frozen parameters. PEFT methods also allow multiple tasks to share the same copy of the full (frozen) PLM. Hence, the storage cost for introducing a new task can be reduced by up to 444x (from 355MB to 0.8MB with a RoBERTa-large encoder in our setting).

We present background on Mixture-of-Experts (MoE) and adapters in Section 2.


2 Background

2.1 Mixture-of-Experts


Figure 2: Mixture-of-Adaptations (AdaMix) with adapters (Houlsby et al., 2019) as the underlying PEFT mechanism. For illustration, we show M = 4 adaptation modules, each consisting of feedforward-up (FFN_U) and feedforward-down (FFN_D) projection matrices. The block shown above for one Transformer layer is repeated across all layers. AdaMix stochastically routes instances from an input batch via randomly selected adaptation modules, so that FLOPs match those of a single module, and uses consistency regularization and parameter sharing. Adaptation merging (Figure 4) collapses the multiple modules to match single-module parameters in each layer.

Figure 3: Conventional adapter design in the standard Transformer architecture.


2.2 Adapters

The predominant methodology for task adaptation is to tune all of the trainable parameters of the PLM for every task. This raises significant resource challenges both during training and deployment. A recent study (Aghajanyan et al., 2021) shows that PLMs have a low intrinsic dimension, and that fine-tuning within this low-dimensional subspace can match the performance of tuning the full parameter space.

To adapt PLMs for downstream tasks with a small number of parameters, adapters (Houlsby et al., 2019) have recently been introduced as an alternative approach for lightweight tuning.
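For reference, the conventional adapter design of Figure 3 can be sketched as a bottleneck FFN with a residual connection, as below. This is a minimal PyTorch illustration with assumed names (BottleneckAdapter, the choice of GELU as the nonlinearity), not code from the paper's release.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """A single conventional adapter: down-project to the bottleneck
    dimension d, apply a nonlinearity, project back up, add a residual."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down to d
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
        self.act = nn.GELU()                               # nonlinearity (assumed GELU)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(self.act(self.down(hidden)))
```

Only the adapter parameters are updated during fine-tuning; the surrounding PLM weights stay frozen (e.g., by setting requires_grad to False on the backbone).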


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
