Introduction
Transformers and Large Language Models (LLMs) have taken the world by storm since their introduction in the field of Natural Language Processing (NLP). Since then, the field has evolved quickly, with innovations and research that make these LLMs more efficient. These include LoRA (Low-Rank Adaptation), Flash Attention, Quantization, and the recent approach of merging notable LLMs. In this guide, we will look at a new approach to merging LLMs, introduced by Upstage AI with its Solar 10.7B model.
Learning Objectives
- Understand the unique architecture of Solar 10.7B and its innovative “depth up-scaling”
- Explore the model’s pre-training process and the diverse data it consumes
- Analyze the impressive performance benchmarks of Solar 10.7B across different NLP tasks
- Compare and contrast Solar 10.7B with other notable LLMs, like Mixtral MoE
- Learn how to access and work with Solar 10.7B for your projects
This article was published as a part of the Data Science Blogathon.
What is SOLAR 10.7B?
Upstage AI introduced the new 10.7 billion parameter model, SOLAR 10.7B. This model is the result of merging two copies of a 7 billion parameter model built on the Llama 2 architecture and then continuing to pretrain the merged model. The unique aspect of this merge is a new approach called Depth Up-Scaling (DUS), in contrast to Mixtral, which employs a mixture of experts.
The new 10.7B model outperformed Mistral 7B and Qwen 14B. An instruction-tuned version called SOLAR 10.7B Instruct has also been released, and upon its release it topped the leaderboard, surpassing both Qwen 72B and the Mixtral 8x7B Large Language Model. Despite being a 10.7 billion parameter model, SOLAR was able to outperform LLMs that are multiple times its size.
What is Depth Up Scaling?
Let’s understand how SOLAR 10.7B was formed. It all starts with a single Base Model. Upstage chose the 32-layer Llama 2 architecture for its Base Model because of its wide base of Open Source contributors. A copy of this Base Model was then created.
We then have two Base Models. As for the weights, Upstage took the pretrained weights from Mistral 7B, because it was performing the best at that time. Now we start the depthwise scaling. Each of the Base Models contains 32 layers. From these, we remove m layers (m = 8 for SOLAR 10.7B): the final m layers from the original model and the first m layers from its copy. This leaves 24 layers in each of them. Then we merge these two models.
The two trimmed models are concatenated to form the scaled model, which now contains 48 layers. The scaled model initially performs poorly because of the merging. Hence the scaled model undergoes continued pretraining. Depthwise Scaling followed by continued pretraining together make up Depth Up-Scaling (DUS).
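The layer bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of which Base Model layers end up in the scaled model; the function name `depthwise_scale` is hypothetical, and no weights are involved.

```python
def depthwise_scale(n_layers: int, m: int) -> list[int]:
    """Return the Base Model layer indices used by the scaled model, bottom to top."""
    if not 0 <= m < n_layers:
        raise ValueError("m must satisfy 0 <= m < n_layers")
    original = list(range(0, n_layers - m))  # original model without its final m layers
    duplicate = list(range(m, n_layers))     # copy without its first m layers
    return original + duplicate              # stack the copy on top of the original


layers = depthwise_scale(n_layers=32, m=8)
print(len(layers))    # 48
print(layers[22:26])  # [22, 23, 8, 9] -> the seam between the two copies
```

The seam is where layer 23 of the original feeds into layer 8 of the copy, which is the mismatch that continued pretraining has to smooth out.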
Training the SOLAR 10.7B
The scaled model needs further pretraining because of the drop in performance caused by merging. The makers said that the performance rose quickly with continued pretraining. The fine-tuning that followed involved two stages.
The first stage was Instruction Fine-Tuning. In this type of fine-tuning, the model is trained on datasets of instructions and responses so that it learns to follow instructions. The process used popular Open Source datasets such as Alpaca-GPT4 and OpenOrca. The paper noted that only a subset of these datasets was used to fine-tune the merged model. Along with the Open Source data, Upstage also trained it on some closed-source math data.
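To make the idea of an instruction dataset concrete, here is a minimal sketch of turning one Alpaca-style record into a single training string. The record is made up for illustration, the helper `format_example` is hypothetical, and the use of the "### User / ### Assistant" template (the one shown later in this guide for inference) during training is an assumption rather than something this article confirms.

```python
def format_example(instruction: str, response: str, context: str = "") -> str:
    """Join an instruction (plus optional input text) and its response into one string."""
    user_turn = instruction if not context else f"{instruction}\n\n{context}"
    return f"### User:\n{user_turn}\n\n### Assistant:\n{response}"


# Made-up record in the Alpaca layout (instruction / input / output)
record = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
text = format_example(record["instruction"], record["output"], record["input"])
print(text)
```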
In the second stage, Alignment Tuning is performed. Here we take the stage-one fine-tuned model and further fine-tune it to be more aligned with humans or powerful AIs like GPT-4. This was done through Direct Preference Optimization (DPO), a technique similar in purpose to RLHF (Reinforcement Learning from Human Feedback).
In Direct Preference Optimization, we have a dataset containing three columns: a prompt, a preferred answer, and a rejected answer. This is used to train the model to favor the preferred answers over the rejected ones. As in the instruction fine-tuning stage, mostly Open Source datasets are used here.
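As a rough sketch of what DPO optimizes, the snippet below computes the standard DPO loss for a single preference pair from summed log-probabilities. The example record, the log-probability values, and `beta=0.1` are all illustrative assumptions; a real setup would obtain the log-probabilities from the model being trained and from a frozen reference model.

```python
import math

# One made-up row with the three columns described above
pair = {
    "prompt": "What is 2 + 2?",
    "chosen": "2 + 2 equals 4.",
    "rejected": "2 + 2 equals 5.",
}


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given the log-probability of each answer under both models."""
    chosen_margin = policy_chosen - ref_chosen        # how much more the policy likes the chosen answer
    rejected_margin = policy_rejected - ref_rejected  # same for the rejected answer
    logits = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logits))              # equals -log(sigmoid(logits))


# If the policy is identical to the reference, the loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# If the policy has shifted toward the chosen answer, the loss is lower
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the trained model raises the probability of the preferred answer relative to the rejected one, measured against the reference model.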
Evaluation and Benchmark Results
The Hugging Face OpenLLM Leaderboard uses several benchmarks to evaluate the capabilities of Large Language Models (LLMs). Each benchmark assesses different aspects of an LLM’s performance:
- ARC (AI2 Reasoning Challenge): This benchmark tests an LLM’s ability to answer elementary-level science questions, providing insights into the model’s understanding and reasoning of scientific concepts.
- MMLU (Massive MultiTask Language Understanding): MMLU is a diverse benchmark that covers 57 different tasks, including questions related to basic mathematics, history, law, computer science, and others. It evaluates the LLM’s ability to process and understand information across multiple disciplines.
- HellaSwag: Aimed at testing an LLM’s commonsense reasoning, HellaSwag challenges models to apply everyday logic to a variety of scenarios, assessing their ability to make intuitive judgments similar to human thought processes.
- Winogrande: Similar to HellaSwag, this benchmark focuses on commonsense reasoning, but with different nuances. It requires LLMs to demonstrate a sophisticated level of understanding and logical reasoning.
- TruthfulQA: TruthfulQA evaluates the accuracy and reliability of information provided by LLMs. It includes questions from different areas including science, law, politics, and more, testing the model’s ability to generate truthful and factual responses.
- GSM8K: Specifically designed to test math abilities, GSM8K includes multi-step math problems that need logical reasoning and computational thinking, testing an LLM’s problem-solving skills in mathematics.
The base SOLAR 10.7B model outperformed models like Mistral 7B Instruct v0.2 and Qwen 14B. The Instruct version of SOLAR 10.7B was even able to beat very large models like Mixtral 8x7B, Qwen 72B, and Falcon 180B. It was ahead of all these models on the ARC and TruthfulQA benchmarks.
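The leaderboard ranks models by the plain average of these six benchmark scores, which is the H6 score mentioned in the conclusion. A minimal sketch of that calculation follows; the numbers are round placeholders chosen only to make the arithmetic easy to check and are not real results for any model.

```python
H6_TASKS = ("ARC", "HellaSwag", "MMLU", "TruthfulQA", "Winogrande", "GSM8K")


def h6_score(scores: dict[str, float]) -> float:
    """Average the six leaderboard benchmark scores."""
    missing = [task for task in H6_TASKS if task not in scores]
    if missing:
        raise KeyError(f"missing benchmark scores: {missing}")
    return sum(scores[task] for task in H6_TASKS) / len(H6_TASKS)


# Placeholder values, NOT measured results
placeholder = {
    "ARC": 70.0, "HellaSwag": 80.0, "MMLU": 60.0,
    "TruthfulQA": 70.0, "Winogrande": 80.0, "GSM8K": 60.0,
}
print(h6_score(placeholder))  # 70.0
```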
Getting Started with SOLAR 10.7B
The SOLAR 10.7B model is readily available on the HuggingFace Hub and works with the transformers library. Quantized versions of SOLAR 10.7B are available as well. In this section, we will download a quantized version, give the model different tasks, and look at the output it generates.
To test the quantized version of SOLAR 10.7B, we will work with the llama-cpp-python library, which lets us run quantized Large Language Models from Python. For this demo, we will use the free version of Google Colab.
Download the Package
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python
!pip3 install huggingface-hub
- CMAKE_ARGS="-DLLAMA_CUBLAS=on" and FORCE_CMAKE=1 build llama-cpp-python with CUDA support, so that it can use the Nvidia GPU available in the free Colab version
- Then we install the llama-cpp-python package through pip3
- We also install huggingface-hub, with which we will download the quantized SOLAR 10.7B model
To work with the SOLAR 10.7B model, we need to first download the quantized version of it. To download it, we will run the following code:
from huggingface_hub import hf_hub_download
# specifying the model name
model_name = "TheBloke/SOLAR-10.7B-Instruct-v1.0-GGUF"
# specifying the type of quantization of the model
model_file = "solar-10.7b-instruct-v1.0.Q2_K.gguf"
# download the model by specifying the model name and quantized model name
model_path = hf_hub_download(model_name, filename=model_file)
Working with Hugging Face Hub
Here, we work with the huggingface_hub library to download the quantized model. For this, we import hf_hub_download and prepare the following values
- model_name: This is the repository of the model that we wish to download. Here we wish to download the SOLAR 10.7B Instruct GGUF model
- model_file: Here we tell which quantized file we want to download. Here we will download the 2-bit quantized version (Q2_K) of SOLAR 10.7B Instruct
- We then pass these values to hf_hub_download, which downloads the specified file. After downloading, it returns the local path where the model is stored
- The returned path is saved in the model_path variable
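If you want to try a different quantization level, the download step can be wrapped in a small helper. This is a sketch under one assumption: that the other quantized files in the repository follow the same naming pattern as the Q2_K file used above (for example a `Q4_K_M` suffix). Check the repository's file list before relying on it. The helper names `gguf_filename` and `download` are hypothetical.

```python
REPO_ID = "TheBloke/SOLAR-10.7B-Instruct-v1.0-GGUF"


def gguf_filename(quant: str) -> str:
    """Build the file name for a quantization level (naming pattern is assumed)."""
    return f"solar-10.7b-instruct-v1.0.{quant}.gguf"


def download(quant: str = "Q2_K") -> str:
    """Download one quantized file and return its local path (needs network access)."""
    from huggingface_hub import hf_hub_download  # imported here so the helper above works offline
    return hf_hub_download(repo_id=REPO_ID, filename=gguf_filename(quant))


print(gguf_filename("Q2_K"))  # solar-10.7b-instruct-v1.0.Q2_K.gguf
```

Higher-bit files are larger and usually more accurate, so the right choice depends on how much GPU memory is available.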
Now, we can load this model through the llama-cpp-python library. The code for loading the model is shown below
from llama_cpp import Llama
llm = Llama(
    model_path=model_path,
    n_ctx=512,         # context window size in tokens (prompt + generated text)
    n_threads=8,       # the number of CPU threads to use
    n_gpu_layers=110,  # how many layers of the model to offload to the GPU
)
Import the Llama Class
We import the Llama class from llama_cpp, which takes in the following parameters
- model_path: The path where our model is stored. We got this path from the previous step
- n_ctx: The context length for the model, which covers both the prompt and the generated tokens. For now, we are providing 512 tokens
- n_threads: The number of threads to be used by the Llama class. For now, we pass 8, because we have a 4-core CPU where each core can run 2 threads simultaneously
- n_gpu_layers: We set this if a GPU is available, which it is because we are working with the free Colab. We pass 110, which is more than the number of layers in the model, so the entire model is offloaded to the GPU and no part of it runs from system RAM
- Finally, we create an object from this Llama class and assign it to the variable llm
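Rather than hard-coding the thread count, the arguments can be derived from the machine the notebook is running on. The sketch below only builds the keyword arguments and does not load a model; the helper name `llama_kwargs` is hypothetical, and the value 110 is simply the one used in this guide.

```python
import os


def llama_kwargs(model_path: str, n_ctx: int = 512, use_gpu: bool = True) -> dict:
    """Build keyword arguments for llama_cpp.Llama based on the current machine."""
    logical_cpus = os.cpu_count() or 1  # cpu_count() can return None on some platforms
    return {
        "model_path": model_path,
        "n_ctx": n_ctx,
        "n_threads": logical_cpus,
        "n_gpu_layers": 110 if use_gpu else 0,  # 0 keeps every layer on the CPU
    }


print(llama_kwargs("model.gguf"))
```

The result can then be passed on as `Llama(**llama_kwargs(model_path))`.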
Running this code will load the SOLAR 10.7B quantized model onto the GPU and set the appropriate context length. Now, it’s time to perform some inferences on this model. For this, we work with the below code
output = llm(
    "### User:\nWho are you?\n\n### Assistant:",  # user prompt wrapped in the chat template
    max_tokens=512,  # the maximum number of output tokens to generate
    stop=["</s>"],   # the token which tells the LLM to stop
)
print(output['choices'][0]['text'])  # llm generated text
Infer the Model
To run inference, we pass the following parameters to the LLM:
- Prompt/chat template: This is the template needed to chat with the model. The above-mentioned template (### User:\n{user_prompt}\n\n### Assistant:) is the one that works for the SOLAR 10.7B model. In the template, the text after User is the user prompt, and the model’s generation follows Assistant
- max_tokens: The maximum number of tokens that the Large Language Model can output for a given prompt. For now, we are limiting it to 512 tokens
- stop: The stop token. It tells the Large Language Model to stop generating further tokens. For SOLAR 10.7B, the stop token is </s>
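Since every call repeats the same template and arguments, they can be wrapped in two small helpers. This is a sketch: `build_prompt` and `ask` are hypothetical names, and `fake_llm` is a stand-in that returns a response of the same shape, so the snippet runs without a model. With a real model, pass the `llm` object created earlier instead.

```python
def build_prompt(user_prompt: str) -> str:
    """Wrap a user message in the chat template used in this guide."""
    return f"### User:\n{user_prompt}\n\n### Assistant:"


def ask(llm, user_prompt: str, max_tokens: int = 512) -> str:
    """Send one templated prompt to the model and return only the generated text."""
    output = llm(build_prompt(user_prompt), max_tokens=max_tokens, stop=["</s>"])
    return output["choices"][0]["text"].strip()


def fake_llm(prompt, max_tokens, stop):
    """Stand-in for a loaded model; returns a fixed reply in the same dict layout."""
    return {"choices": [{"text": " (stub reply)"}]}


print(ask(fake_llm, "Who are you?"))  # (stub reply)
```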
Running this will store the result in the output variable. The result has the same structure as an OpenAI API response. Hence we can access the generated text through the given print statement, just as we would with an OpenAI response. The generated output can be seen below
The generated text reads well, without major grammatical mistakes. Let’s test the common sense of the model with the following prompts
output = llm(
    "### User:\nHow many eggs can a monkey lay in its lifetime?\n\n### Assistant:",
    max_tokens=512,
    stop=["</s>"],
)
print(output['choices'][0]['text'])
output = llm(
    "### User:\nHow many smartphones can a human eat?\n\n### Assistant:",
    max_tokens=512,
    stop=["</s>"],
)
print(output['choices'][0]['text'])
Here we see two examples related to common sense, and SOLAR 10.7B handles them very well. The model delivered the right answers along with some useful content. Let’s test the math and reasoning abilities of the model with the following prompts
output = llm(
    "### User:\nLook at this series: 80, 10, 70, 15, 60, ... "
    "What number should come next?\n\n### Assistant:",
    max_tokens=512,
    stop=["</s>"],
)
print(output['choices'][0]['text'])
output = llm(
    "### User:\nJohn runs faster than Ken. Magnus runs faster than John. "
    "Does Ken run faster than Magnus?\n\n### Assistant:",
    max_tokens=512,
    stop=["</s>"],
)
print(output['choices'][0]['text'])
For the given example prompts, SOLAR 10.7B generated good responses. It answered the mathematical and logical reasoning questions correctly, as well as the questions related to common sense. Overall, we can conclude that the SOLAR 10.7B Large Language Model generates good responses.
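When judging the model's replies to the two reasoning prompts, it helps to have the expected answers worked out independently. The sketch below does that in plain Python. It assumes the series is two interleaved arithmetic progressions (80, 70, 60 and 10, 15), which is the usual reading of this puzzle, and the function names are hypothetical.

```python
def next_in_interleaved(series: list[int]) -> int:
    """Continue a series made of two interleaved arithmetic progressions."""
    position = len(series)         # index of the term we want to predict
    sub = series[position % 2::2]  # the progression that term belongs to
    step = sub[1] - sub[0]
    return sub[-1] + step


def is_faster(a: str, b: str, facts: set) -> bool:
    """True if 'a is faster than b' follows from the facts by transitivity."""
    frontier, seen = [a], set()
    while frontier:
        current = frontier.pop()
        for winner, loser in facts:
            if winner == current and loser not in seen:
                if loser == b:
                    return True
                seen.add(loser)
                frontier.append(loser)
    return False


print(next_in_interleaved([80, 10, 70, 15, 60]))  # 20

facts = {("John", "Ken"), ("Magnus", "John")}     # (faster, slower) pairs from the prompt
print(is_faster("Magnus", "Ken", facts))          # True
print(is_faster("Ken", "Magnus", facts))          # False
```

So a correct reply should give 20 for the series and answer that Ken does not run faster than Magnus.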
SOLAR 10.7B vs Mixtral MoE
Mixtral 8x7B MoE was created by Mistral AI with the Mixture of Experts architecture. In brief, each Transformer layer of Mixtral replaces its single feed-forward network with 8 feed-forward networks called experts, which is why Mixtral 8x7B is considered to have 8 experts. Every time the model processes a token, a gating mechanism selects only 2 of these 8 experts. The 2 selected experts process the token, and their outputs are combined to produce the result. So there is a fair amount of complexity involved in this approach, where we have to replace the feed-forward layers with experts and introduce a gating mechanism that selects between them.
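The gating idea can be illustrated with a toy example in plain Python. This is a sketch of top-2 routing in general, not Mixtral's actual code: the "experts" are simple functions that multiply by a constant, the router logits are made up, and `top2_moe` is a hypothetical name.

```python
import math


def top2_moe(x, router_logits, experts):
    """Route input x to the 2 highest-scoring experts and mix their outputs."""
    top2 = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over the two selected logits only
    peak = max(router_logits[i] for i in top2)
    exps = {i: math.exp(router_logits[i] - peak) for i in top2}
    total = sum(exps.values())
    mixed = sum(exps[i] / total * experts[i](x) for i in top2)
    return mixed, top2


experts = [lambda x, k=k: x * k for k in range(8)]  # toy experts: expert k multiplies by k
logits = [0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0]   # made-up router scores
y, chosen = top2_moe(1.0, logits, experts)
print(chosen, y)  # [3, 5] 4.0
```

Only two of the eight experts run for a given input, which is why a Mixture of Experts model uses far fewer parameters per token than its total size suggests.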
The SOLAR 10.7B model from Upstage, in contrast, leverages the Depth Up-Scaling method. In Depth Up-Scaling, we simply remove some of the final layers from a Base Model and the same number of starting layers from its copy. Then we merge the models by stacking one on top of the other. With a modest amount of further training, the merged model shows rapid growth in performance. Here we do not replace the existing layers with other layers, and we do not need a gating mechanism. Overall, Depth Up-Scaling is a simple and effective way to merge models without added complexity.
Comparing performance, SOLAR 10.7B, built with Depth Up-Scaling by just combining two 7 billion parameter models, was able to clearly outperform Mixtral 8x7B, which is a far larger model. This suggests that a simple merging method can be as effective as a more complex one like Mixture of Experts.
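A quick parameter count shows why stacking two trimmed 7 billion parameter copies gives roughly 10.7 billion parameters rather than 14 billion: only the Transformer layers are duplicated, and 16 of the 64 layers are dropped. The sketch below assumes the publicly documented Mistral 7B configuration (hidden size 4096, feed-forward size 14336, 32 attention heads, 8 key/value heads, vocabulary of 32000, untied embeddings). These values are not taken from this article.

```python
def decoder_params(n_layers, hidden=4096, intermediate=14336,
                   n_heads=32, n_kv_heads=8, vocab=32000):
    """Approximate parameter count of a Llama-style decoder (config values are assumed)."""
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q and o, plus k and v projections
    mlp = 3 * hidden * intermediate                        # gate, up and down projections
    norms = 2 * hidden                                     # two norm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * vocab * hidden                        # input embedding + output head
    return n_layers * per_layer + embeddings + hidden      # + final norm


print(f"{decoder_params(32) / 1e9:.2f}B")  # 7.24B  (32-layer Base Model)
print(f"{decoder_params(48) / 1e9:.2f}B")  # 10.73B (48-layer scaled model)
```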
Limitations and Considerations
- Hyperparameter Exploration: A crucial limitation is the insufficient exploration of hyperparameters in the DUS approach. Due to hardware limitations, 8 layers were removed from each of the two copies of the Base Model without verifying whether this number is optimal for getting the best performance. Future work aims to address this with more rigorous experiments and analysis.
- Computational Demands: The model needs a huge amount of computational resources for training and inference. This could limit its usage, mainly for those with limited computational capabilities.
- Biases in Training Data: Like all machine learning models, it is susceptible to biases present in the training data, potentially leading to skewed outcomes in certain scenarios.
- Environmental Impact: The energy consumption necessary for training and operating the model poses environmental concerns, highlighting the importance of sustainable AI development.
- Model’s Broader Implications: While the model shows improved performance in following instructions, it still requires task-specific fine-tuning for optimal performance in specialized applications. This fine-tuning process is resource-intensive and may not always be effective.
Conclusion
In this guide, we have taken a look at the recently released SOLAR 10.7 billion parameter model by Upstage AI. Upstage AI has taken a new approach to merging and scaling models. The paper used a new approach called Depth Up-Scaling to merge two copies of a 7 billion parameter Llama 2 architecture model by removing some of the final transformer layers from one and some of the starting layers from the other. Afterward, the model was fine-tuned on Open Source datasets and tested on the OpenLLM Leaderboard, achieving the highest H6 score and topping the leaderboard.
Key Takeaways
- SOLAR 10.7B introduces Depth Up-Scaling, a unique merging approach, challenging traditional methods and showing the advancements in model architecture
- With only 10.7 billion parameters, SOLAR 10.7B surpasses Mistral 7B and Qwen 14B, and the SOLAR 10.7B Instruct version topped the leaderboard ahead of much larger models
- The two-stage fine-tuning process involving Instruction and Alignment Tuning ensures the model’s adaptability to different tasks, making it very good at following instructions and aligning with human preferences
- SOLAR 10.7B excels across diverse benchmarks, thus showing its competence in tasks ranging from Basic Mathematics and language understanding to commonsense reasoning and truthfulness evaluation
- Readily available on the HuggingFace Hub, SOLAR 10.7B provides developers and researchers with an efficient and accessible tool for language-processing applications
- You can fine-tune the model using the regular methods employed for fine-tuning large language models. For instance, you can utilize the Supervised Fine-tuning Trainer (SFTTrainer) from Hugging Face’s TRL library to fine-tune the SOLAR 10.7B model.
Frequently Asked Questions
A. SOLAR 10.7B is a 10.7 billion parameter model by Upstage AI, utilizing a unique merging technique called Depth Up-Scaling. It distinguishes itself by outperforming larger LLMs and showcasing advancements in merging models.
A. Depthwise Scaling involves two base models. The process involves directly merging these two base models by stacking them on top of one another. Before the merging takes place, the initial layers from one model and the final layers from the other model are removed.
A. SOLAR 10.7B undergoes a two-stage fine-tuning process. Instruction fine-tuning involves training the model on datasets emphasizing instruction-following. Alignment tuning refines the model’s alignment with human preferences using a technique called Direct Preference Optimization (DPO).
A. SOLAR 10.7B excels across various benchmarks, including ARC (AI2 Reasoning Challenge), MMLU (Massive MultiTask Language Understanding), HellaSwag, Winogrande, TruthfulQA, and GSM8K. It achieves high scores, demonstrating its versatility in handling different language tasks.
A. SOLAR 10.7B surpasses models like Mistral 7B and Qwen 14B. The Instruct version even competes with and outperforms much larger models, including Mixtral 8x7B and Qwen 72B, on various benchmarks.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.