Garbled output from model in Unity #178
Comments
Sorry for seeing this issue late, and I'm glad you've got it working. In fact, none of the main contributors of LLamaSharp knows much about Unity, which sometimes confuses us. In this case, is there anything else you need to do to run LLamaSharp with Unity compared to running it in a .NET Core app? We would appreciate it if you would like to help with documentation about how to use it in Unity ^_^ |
Do you have any more information on what kind of exception is happening? (e.g. is it an |
Sorry for the late reply, but I could not get it working... I have narrowed the error down to the function I specified in the post, though! I also think the issue could be due to the dotnet environment Unity is using, as I tested the examples in Visual Studio and they were working. It may also be due to the changes I made.
I'll check what the exception is and let you know... |
Here is what I found in the editor logs:
|
I've done some googling on this error:
It seems to indicate something very wrong inside the Mono runtime; all the mentions of it I could find were associated with things like buggy alpha versions of the editor, a corrupt install, etc. The error further down:
Is just a generic error indicating that something tried to access memory it shouldn't. It's probably a symptom of the first issue. |
I see. If it is a problem with Mono, I can build it with IL2CPP. I'm also updating to the latest version of Unity, so if it was a corrupt install, that should fix it. |
It still crashes the editor. I'll try building the project with IL2CPP and see. |
@Uralstech @martindevans @AsakusaRinne I am having the same issues and I found fixes to some:

Unity crashes
This can be fixed by compiling older llama.cpp revisions that are closer in time to the current LLamaSharp release. I also noticed that when linking to llama.cpp master, the ModelParams class gets mapped incorrectly (e.g. context size gets mapped to the CUDA device id in my case).

Garbled output
I was able to generate correct output by using

Posting the MonoBehaviour to test this below.
LLamaSharpTestScript.cs:

using UnityEngine;
using LLama;
using LLama.Native;
using LLama.Common;
using System.Collections.Generic;
using System;
using System.Runtime.InteropServices;
using System.Linq;
using System.Text;
public class LLamaSharpTestScript : MonoBehaviour
{
[ContextMenu("Generate Test")]
public void GenerateTest()
{
string modelPath = Application.streamingAssetsPath + "/llama/mistral-7b-v0.1.Q4_K_M.gguf";
var prompt = "### Human: Hi!\n### Assistant: Hello, how can I help you?\n### Human: say 'this is a test'\n### Assistant:";
// Load a model
var parameters = new ModelParams(modelPath)
{
ContextSize = 1024,
Seed = 1337,
GpuLayerCount = 5,
};
var inferenceParams = new InferenceParams()
{
TokensKeep = 128,
MaxTokens = 32,
TopK = -1,
Temperature = 0,
};
using var model = LLamaWeights.LoadFromFile(parameters);
var context = new LLamaContext(model, parameters);
var generatedWithPast = GenerateFromNative(context, prompt, 32, 1, -1, 0.0f, 2);
var generated = GenerateNoPastNaive(context, prompt, 32, -1, 0.0f, 2);
context.Dispose();
var instructExecutor = new StatelessExecutor(model, parameters);
var sb = new StringBuilder();
foreach (var tok in instructExecutor.Infer(prompt, inferenceParams: inferenceParams))
{
sb.Append(tok);
}
Debug.Log($"Generated with past: {generatedWithPast}\n Generated without past: {generated}\n Generated with executor: {sb.ToString()}");
}
private string GenerateFromNative(
LLamaContext context,
string prompt, int maxNewTokens = 1024,
int window = 1, int topK = -1,
float temperature = 1, int eosToken = 2
)
{
var encoded = context.Tokenize(prompt);
var idsArray = encoded.Select(x => (uint)x).ToArray();
int n_past = 0;
var generated = new List<uint>();
var idsLast = idsArray.Last();
var idsArrayWithoutLast = idsArray.Take(idsArray.Length - 1).ToArray();
// Consume prompt tokens
for (var cur = 0; cur < idsArrayWithoutLast.Length; cur += window)
{
var windowIds = idsArrayWithoutLast.Skip(cur).Take(window).ToArray();
var _ = ComputeLogits(context, windowIds, n_past);
n_past += windowIds.Length;
}
// Generate one-by-one until EOS token
for (var cur = 0; cur < maxNewTokens; cur++)
{
var inpIds = new uint[] { idsLast };
var logits = ComputeLogits(context, inpIds, n_past);
var nextToken = Sample(logits, topK, temperature);
if (nextToken == eosToken)
{
break;
}
generated.Add(nextToken);
n_past += 1;
}
return context.DeTokenize(generated.Select(x => (int)x).ToArray());
}
private string GenerateNoPastNaive(
LLamaContext context,
string prompt, int maxNewTokens = 1024,
int topK = -1, float temperature = 0,
int eosToken = 2
)
{
var encoded = context.Tokenize(prompt);
var idsArray = encoded.Select(x => (uint)x).ToArray();
var generated = new List<uint>();
for (int cur = 0; cur < maxNewTokens; cur++)
{
var logits = ComputeLogits(context, idsArray);
var nextToken = Sample(logits, topK, temperature);
if (nextToken == eosToken)
{
break;
}
generated.Add(nextToken);
idsArray = idsArray.Append(nextToken).ToArray();
}
return context.DeTokenize(generated.Select(x => (int)x).ToArray());
}
private uint Sample(float[] logits, int topK = -1, float temperature = 1)
{
var probs = Softmax(logits);
var topKProbs = probs.Select((x, i) => new { x, i }).OrderByDescending(x => x.x).Take(topK > 0 ? topK : probs.Length);
if (temperature == 0)
{
return (uint)topKProbs.First().i;
}
var topKProbsArray = topKProbs.Select(x => x.x).ToArray();
var topKProbsSum = topKProbsArray.Sum();
var topKProbsNormalized = topKProbsArray.Select(x => x / topKProbsSum).ToArray();
var topKProbsCumSum = topKProbsNormalized.Select((x, i) => topKProbsNormalized.Take(i + 1).Sum()).ToArray();
var random = UnityEngine.Random.value;
var index = Array.FindIndex(topKProbsCumSum, x => x > random);
return (uint)topKProbs.ElementAt(index).i;
}
private float[] ComputeLogits(LLamaContext context, uint[] idsArray, int n_past = 0)
{
var ids = MemoryMarshal.Cast<uint, int>(new ReadOnlySpan<uint>(idsArray));
var ok = context.NativeHandle.Eval(ids, n_past, 8);
if (ok)
{
var logits = context.NativeHandle.GetLogits();
var logitsArray = logits.ToArray();
return logitsArray;
}
throw new Exception("Eval failed");
}
private float[] Softmax(float[] logits)
{
var max = logits.Max();
var exps = logits.Select(x => Math.Exp(x - max));
var sum = exps.Sum();
var softmax = exps.Select(x => (float)(x / sum));
return softmax.ToArray();
}
}

Output is:
Am I using n_past correctly? |
Each version of LLamaSharp works with exactly one version of llama.cpp. The llama.cpp API isn't stable, so there's absolutely no flexibility in terms of using other versions without reviewing and fixing the corresponding C# code! If you're not using the correct llama.cpp version, that could cause all kinds of weird behaviour or, if you're lucky, a hard crash. |
Where can I find the exact version from the release? |
The commit hashes are listed in the readme. If you're using an unreleased version, you should look back in history to the last time the DLL files were changed; that commit should mention the commit hash (if not, have a look at the corresponding PR notes). |
I will try running this with the correct version tomorrow. Also, on a side note, the behavior with the official LLamaSharp.Backend.CPU is identical. |
The issue persists on v0.6.0 with the llama.cpp commit from the readme :( |
Hello! I just tried your script in Unity, without changing anything, and it gives proper replies now! Thanks! But, when I increase the |
Wait, even when using
|
|
I am sorry, I am being dumb. Everything works on v0.6.0. It's just that my script was buggy, and I was using the native handle instead of

@Uralstech, here is the updated MonoBehaviour that works (ignore ModelParams, they are probably wrong):

using UnityEngine;
using System.Collections.Generic;
using System;
using System.Linq;
using LLama;
using LLama.Native;
using LLama.Common;
public class LLamaSharpTestScript : MonoBehaviour
{
[ContextMenu("Generate Test")]
public void GenerateTest()
{
string modelPath = Application.streamingAssetsPath + "/llama/mistral-7b-v0.1.Q4_K_M.gguf";
var prompt = "### Human: Hi!\n### Assistant: Hello, how can I help you?\n### Human: say 'this is a test'\n### Assistant:";
// Load a model
var parameters = new ModelParams(modelPath)
{
ContextSize = 32768,
Seed = 1337,
GpuLayerCount = 16,
BatchSize = 128
};
using var model = LLamaWeights.LoadFromFile(parameters);
var context = new LLamaContext(model, parameters);
var generatedWithPast = GenerateFromNative(context, prompt, 32, -1, 0.0f);
// var generated = GenerateNoPastNaive(context, prompt, 32, -1, 0.0f, 2);
Debug.Log($"Generated with past: {generatedWithPast}");
}
private string GenerateFromNative(
LLamaContext context,
string prompt, int maxNewTokens = 1024, int topK = -1,
float temperature = 1
)
{
var idsArray = context.Tokenize(prompt);
int n_past = 0;
float[] logits;
var generated = new List<int>();
var idsLast = idsArray.Last();
var idsArrayWithoutLast = idsArray.Take(idsArray.Length - 1).ToArray();
(logits, n_past) = ComputeLogits(context, idsArrayWithoutLast, n_past);
// Generate one-by-one until EOS token
for (var cur = 0; cur < maxNewTokens; cur++)
{
var inpIds = new int[] { idsLast };
(logits, n_past) = ComputeLogits(context, inpIds, n_past);
var nextToken = Sample(logits, topK, temperature);
if (nextToken == NativeApi.llama_token_eos(context.NativeHandle))
{
break;
}
idsLast = nextToken;
generated.Add(nextToken);
}
return context.DeTokenize(generated.Select(x => (int)x).ToArray());
}
private int Sample(float[] logits, int topK = -1, float temperature = 1)
{
var probs = Softmax(logits);
var topKProbs = probs.Select((x, i) => new { x, i }).OrderByDescending(x => x.x).Take(topK > 0 ? topK : probs.Length);
if (temperature == 0)
{
return topKProbs.First().i;
}
var topKProbsArray = topKProbs.Select(x => x.x).ToArray();
var topKProbsSum = topKProbsArray.Sum();
var topKProbsNormalized = topKProbsArray.Select(x => x / topKProbsSum).ToArray();
var topKProbsCumSum = topKProbsNormalized.Select((x, i) => topKProbsNormalized.Take(i + 1).Sum()).ToArray();
var random = UnityEngine.Random.value;
var index = Array.FindIndex(topKProbsCumSum, x => x > random);
return topKProbs.ElementAt(index).i;
}
private (float[] logits, int past_tokens) ComputeLogits(LLamaContext context, int[] idsArray, int n_past = 0)
{
var newPastTokens = context.Eval(idsArray, n_past);
var logits = context.NativeHandle.GetLogits();
var logitsArray = logits.ToArray();
return (logitsArray, newPastTokens);
}
private float[] Softmax(float[] logits)
{
var max = logits.Max();
var exps = logits.Select(x => Math.Exp(x - max));
var sum = exps.Sum();
var softmax = exps.Select(x => (float)(x / sum));
return softmax.ToArray();
}
}

I will double-check with the LLamaSharp executors later. Thanks for the help! |
Just a kind request: is anyone here willing to write a short blog/document about using LLamaSharp in Unity? We have little experience with Unity and are not fully aware of the gap between Unity and the .NET Core runtime, so there is no documentation about deploying on Unity yet. However, according to the issues, Unity is one of the most important application areas. I'd appreciate it if anyone could help. :) |
Hello, I periodically check on how the project is going, and thought I'd join in later, once the version changes less often. I have a number of questions.
As I understand it, this is independent of Microsoft and can be used as a source of semantic hashes on its own? Am I understanding this correctly? If so, then by the nature of the embedding vector, the relevance function will be the cosine distance between the vectors.

I wrote all this to make it clearer that what you are doing is best suited for creating web servers that solve specific AI problems. What is needed for Unity is a set of basic, non-asynchronous functions that make it possible to build your own components and pipelines within Unity. For example, one question is still unclear to me: can I use the same model in both embedding and inference mode without reloading it? Is it possible to change the operating mode without reloading?

All this suggests that it might be better to have some kind of simplified branch of your project for Unity. The full project simply includes parts that are unnecessary there and will only create additional problems. Most often in Unity, these will be tasks of creating game logic and content, which still cannot be executed asynchronously in the true sense, since they are a dependent series of requests. There is asynchrony, but it is just processing that doesn't block the Windows message loop. The whole point is to use these features natively in Unity, not on a server: there is no point in a local server, and accessing external services causes problems with reliability or with payment. The goal is to get a working mechanism on small models.

Although I understand the theory, I have not studied the current implementations in detail, so my reasoning may be completely wrong! What do you think about it? |
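To make the cosine-distance idea concrete, here is a minimal, self-contained C# sketch; it assumes the embedding vectors have already been obtained from whatever embedding call the library exposes:

using System;

public static class SemanticDistance
{
    // Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
    // Values close to 1 mean the embedded texts are semantically close.
    public static float Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embedding vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        if (normA == 0 || normB == 0)
            return 0f;

        return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB)));
    }
}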
@Xsanf Hey, thank you very much for taking the time out of caring for the children to provide these suggestions :)
When starting this project, what I pursued was letting other C# developers deploy LLMs with less code, which I think attracted many developers without much knowledge of the internal mechanisms of LLaMA. However, as you said, it sacrificed some flexibility, even though the native APIs are all public now. There are two questions I want to ask about further:
It's a good idea. I didn't notice that using
The semantic-kernel is supported via an extension package named
I think this is the key point for us to better support Unity. Do you think the current release mode (main package + backend package + extension package) makes it difficult for Unity users on the side of
I think one model for both embedding and inference mode is supported now because of the introduction of

Thanks again for your suggestions. I hope applications in Unity can be better supported in the future. :) |
I haven't tried this but yes I think you should be able to load one set of weights and use it for everything.
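As a rough sketch of what sharing one set of weights could look like, using only calls that already appear in the scripts above (whether this is the intended pattern for embedding-style use as well is an assumption, not something verified here):

using LLama;
using LLama.Common;

// Sketch: load the weights once, then create independent contexts from them.
// ModelParams / LLamaWeights / LLamaContext usage mirrors the earlier scripts in this thread.
var parameters = new ModelParams("path/to/model.gguf") { ContextSize = 1024 };
using var weights = LLamaWeights.LoadFromFile(parameters);

using var chatContext = new LLamaContext(weights, parameters);  // used for inference
using var embedContext = new LLamaContext(weights, parameters); // could be reserved for embedding-style use

// Each context has its own KV cache, so evaluating tokens in one
// does not disturb the state of the other.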
LLamaSharp is basically split into multiple layers at the moment. From your description it sounds like you want to use the first 2 layers.

Lowest Level
Resources which need to be freed (e.g. a context) will be wrapped in a handle (e.g.

Middle Level
This level provides more idiomatic C# objects which represent the low-level llama.cpp capabilities, e.g.

Top Level
Executors, text transforms, history transforms etc. all exist at the highest level. They exist to make it as easy as possible to assemble a system that responds to text prompts without the user having to understand all of the deeper llama.cpp implementation details.
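To illustrate the layering with calls that already appear in this thread (treat this as a sketch, not the exact current API):

using System.Text;
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/model.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = new LLamaContext(weights, parameters);

// Lowest level: raw native handle, caller tracks n_past and does its own sampling,
// e.g. context.NativeHandle.Eval(...) and context.NativeHandle.GetLogits().
var tokens = context.Tokenize("### Human: Hi!\n### Assistant:");

// Middle level: idiomatic wrappers over the same capability,
// e.g. context.Eval(tokens, 0) returning the updated past-token count (see the second script above).

// Top level: executors hide evaluation and sampling entirely.
// (Synchronous Infer mirrors the first script above; newer versions may expose async variants, see below.)
var executor = new StatelessExecutor(weights, parameters);
var sb = new StringBuilder();
foreach (var piece in executor.Infer("Hi!", inferenceParams: new InferenceParams { MaxTokens = 16 }))
    sb.Append(piece);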
We recently decided to make all the high-level executors async. That allows any IO that's needed, such as loading/saving state, to be async. Possibly more importantly, it means that the executor can yield while the LLM is being evaluated, instead of blocking (which would be disastrous in Unity, causing dropped frames). As far as I'm aware it should all work inside Unity (although not with the native Unity Job system; you'd have to build your own executor using the lower-layer components for that). |
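A minimal sketch of what consuming an async executor from a MonoBehaviour might look like; InferAsync returning an async stream of text pieces is an assumption about the async API described above, so names and signatures may differ between versions:

using System.Text;
using UnityEngine;
using LLama;
using LLama.Common;

public class AsyncChatBehaviour : MonoBehaviour
{
    private StatelessExecutor _executor; // assumed to be created elsewhere (e.g. in Awake)

    // async void is fine for a fire-and-forget Unity entry point; the frame loop keeps
    // running while the model evaluates instead of stalling on a blocking Infer call.
    public async void RunPrompt(string prompt)
    {
        var sb = new StringBuilder();
        // Assumption: the async executor exposes InferAsync as IAsyncEnumerable<string>.
        await foreach (var piece in _executor.InferAsync(prompt, new InferenceParams { MaxTokens = 64 }))
        {
            sb.Append(piece);
        }
        Debug.Log(sb.ToString());
    }
}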
AsakusaRinne and martindevans, let me clarify again. I'm not talking about the LLamaSharp project; I'm talking specifically about the specifics of Unity. And once again I will confirm my opinion: you are doing everything right within the framework of LLamaSharp. Your entire approach is appropriate to the task. Everything is separated and can be used independently for the user's specific purposes. I'll try to answer point by point.
No, the presence of asynchronous functions does not interfere with development. If you remember in my example (based on version 0.3), I was using your asynchronous function, streaming output via yield
The issue is that this is only useful for chat mode. In Unity, most often this will be game logic, where streaming output is useless, so it will be a non-asynchronous call placed on a separate thread. Most often, this requires forming the context from different sources: information about characters and objects in the scene, the history of interactions, etc. You need to collect context from different sources in the game while minimizing its size. This applies not only to games, but to any intelligent agents. But this is not the task of your project, since it depends strongly on what kind of architecture the user creates.
The question is not entirely the right one; preferences will differ. While I prefer to build the basic mechanics myself, since I understand well what I am doing, someone else may prefer to use Microsoft's semantic kernel, because it is easier to find usage examples for it.
Once again, you are fine. You allow users not to use what they consider unnecessary, and in Unity itself this can be used. The only question is that most often this will be needed only to implement chat, and that is not the main task in Unity. Therefore, if problems with porting to Unity arise with precisely such parts, it would probably be more reasonable to ignore them and simply not port them to Unity, and keep a separate shortened version for Unity which simply won't contain the problematic but optional parts. In Unity, the task is game logic (agent logic), and that means inference, embedding, and vector comparison. The implementation of semantic functions through text substitutions is generally best left entirely to the Unity user, since only they understand exactly how they want to form the context.
I haven't checked it yet, but judging by the description, this is what I need. I will definitely try out the 0.6 version in Unity later. As I wrote, I have the grandchildren with me)) Right now one is sitting here with a runny nose, and tomorrow they will bring another one))
Yes, you are absolutely right. I use both the native API itself and the abstractions over context and parameters. It is very convenient. I used your top-level abstractions as a simple chat example and it all works (I hope this hasn't changed in 0.6). So this does not interfere with Unity itself; it's just that in some cases it is redundant.
Once again, your decisions are absolutely clear and justified for LLamaSharp. Unity has its own methods for non-blocking execution, but your implementation does not interfere with them in any way. The main thing is to ensure the availability, at a low level, of three capabilities: inference, embedding (semantic hash), and calculation of semantic distance. That is absolutely enough to implement any processing of intelligent agents. Somehow I forgot to mention, behind all these discussions: although coroutines are more common in Unity, async/await and the

I don't know if this is appropriate here, but I haven't had anyone to discuss this with for quite some time)). I'll just express my opinion, formed over my 66 years, most of which I have worked in one way or another on algorithms close to the topic of AI.

The brain is a parallel correlator of signals in a window of events, displaying, with its characteristic distortions, the world here and now in the form of entities (concepts) it has identified, represented through the strength of connections between elements. It does not have individual thoughts because it does not have a serialized thread. In the process of evolution, for the exchange of signals between social animals, it created a mechanism for serializing its parallel state into a stream of symbols, based on attention, highlighting part of the overall state. As this evolved, it allowed speech and serialized thinking to emerge.

In linguistics there are the concepts of concept and denotation. A denotatum is a label attached to a concept. A concept is a group of states that conveys its semantics through connections with other concepts (a semantic network). The personality is a serializer that can express, as a sequence of denotations, a parallel representation of the concepts highlighted by attention. The personality does not make decisions; it only bears witness to them, translating a deep parallel structure into a flow of denotations (thoughts). Language is a serialized stream of denotations created by a person based on attention and a system of concepts that form a semantic network.

This is why it is so important to take into account that English is a projective language, and prediction of the next token relies entirely on the left context. This is the GPT model. A huge number of languages are non-projective, and GPT cannot take into account the influence of the right context. Bidirectional BERT is inconvenient for generation, and so far the most likely candidate for a generative model that takes into account bidirectional context is T5. So pay attention to T5 whenever possible, especially if your language is non-projective. Non-projective languages also contain a projective part, so GPT will extract something from them, but T5 will extract more concepts. This is the first successful attempt to create non-life (an imitation) which, in principle, is difficult to distinguish from life. |
Now
On my personal side, I'm very much willing to have such a deep discussion of it. I was a researcher in the area of image processing while studying for a master's degree, and am now working on AI infrastructure (mainly distributed training systems and high-performance inference). I can't agree more that

It's late at night here and I have work tomorrow, so my reply is limited. For the subsequent discussion, could you please open a discussion in this repo, or send an email to me (AsakusaRinne@gmail.com)? I'm not sure if other people in this issue would find the messages disturbing, though I think it's really a good topic. :) |
I've tested LLamaSharp v0.6.0 on Unity with @eublefar's new script and am getting 15-17 seconds for a reply, which is good for my laptop. One thing about LLamaSharp's Unity compatibility: Unity is on C# 9, so there are some errors when first importing into Unity:
public readonly struct LoraAdapter
{
public readonly string Path;
public readonly float Scale;
public LoraAdapter(string path, float scale)
{
Path = path;
Scale = scale;
}
}

Unity is targeting NetStandard 2.1, so I get these errors:
#if !NETSTANDARD2_0 && !NETSTANDARD2_1
// Try to check the size without enumerating the entire IEnumerable. This may not be able to get the count,
// in which case we'll have to check later
if (data.TryGetNonEnumeratedCount(out var dataCount) && dataCount > size)
throw new ArgumentException($"The max size set for the quene is {size}, but got {dataCount} initial values.");
#endif
#if NETSTANDARD2_0 || NETSTANDARD2_1
public static void EnsureCapacity<T>(this List<T> list, int capacity)
{
if (list.Capacity < capacity)
list.Capacity = capacity;
}
#endif

In general, Unity support will improve greatly if LLamaSharp targets NetStandard 2.1, even if it is in addition to NetStandard 2.0.
I can make a small document about the changes I made to LLamaSharp to make it work in Unity. I am pretty good with Unity and C#, but I have no idea how LLaMA CPP works... I can help in any way possible! Also, somewhat unrelated to this issue: would LLamaSharp work on Android, and is there any plan to have an official runtime for Android? I have yet to try building LLaMA CPP for Android and using it in LLamaSharp. I am mainly interested in using LLamaSharp to run inference on a LLaMA model for chat in Android apps. |
@Uralstech Many thanks for your reply! We'll fix the compatibility problem you listed soon!
I think adding an extra runtime starting from the next release is okay if it helps with compatibility with Unity.
I'd really appreciate that! A blog/document about how to make LLamaSharp work with Unity is enough. I think being stuck at the first step would discourage lots of users.
I'm pretty sure that llama.cpp works on Android, but I'm not sure if it's okay with the dotnet runtime. In fact I'm mainly a cpp developer and C# is my interest, so I don't have an idea whether the same LLamaSharp code could work on Android. Recently I'm also doing some work on Android (see faster-rwkv, which supports Android inference). Therefore I think I could help with the step of building llama.cpp, but I still need some knowledge about dotnet apps on the Android platform. |
You can use the precompiled DLLs from NuGet, and then there is no need to modify the code. I also use NuGetForUnity for dependencies, but it should be trivial to just find and add the DLLs from those packages yourself.
I can clean up a demo project and write a README on how to start with LLAMASharp in Unity, if you want. I also switched to using

But one big feature I don't know how to implement is support for multiple sequences. As far as I understand, the native API owns the KV cache and even has support for different sequence ids, but it's unclear how to use it from the high-level API.
The closest thing I found is |
I'd definitely recommend taking this approach. It's going to be a lot easier than trying to maintain a fork of LLamaSharp which tries to pull it back 2 entire language versions!
Looks like we have a "hole" in our version support, since all of the compatibility shims are written in

Edit: If you want to do something a little extra while doing this, it might be good to add something like this to ensure our
I don't have much experience working on Android, but I would expect LLamaSharp to work on Android. We already support ARM64 for MacOS, which doesn't require any architecture/platform specific code (just compatible binaries).
I've recently been working on this, check out the new
I think that's the best option with LLamaSharp right now, but it is an expensive operation because it's a pretty big chunk of data! LLamaSharp itself isn't adding any overhead here, we're just allocating a big block of memory and asking llama.cpp to copy the state data into it. Batched decoding will definitely be better. |
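For reference, a sketch of the save/restore workaround being discussed; GetState/LoadState on LLamaContext are assumptions about the state API (names may differ by version), and as noted each snapshot copies the whole KV cache, so it is expensive:

using LLama;

// Sketch: emulate two independent "sequences" on a single context by snapshotting
// the entire context state between them. Expensive, since each snapshot copies the
// full KV cache; batched decoding with native sequence ids would avoid this.
void AlternateConversations(LLamaContext context)
{
    // ... evaluate some tokens for conversation A ...
    var stateA = context.GetState();   // assumed API: snapshot of KV cache + past tokens

    // ... evaluate some tokens for conversation B on the same context ...
    var stateB = context.GetState();

    context.LoadState(stateA);          // assumed API: resume conversation A
    // ... continue A ...

    context.LoadState(stateB);          // resume conversation B
}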
@AsakusaRinne, Here is the demo project https://github.com/eublefar/LLAMASharpUnityDemo btw. |
From what I understand, I cannot add my own build of LLaMA CPP without using the source version of LLamaSharp - I am mainly interested in running a model on Android, and as there is no official backend for Android, I plan to build LLaMA CPP from source with Android support and use it in LLamaSharp.
I have forked the project! I'll update it with the changes as soon as possible. |
I think you can, though. You just need to clone the LLAMASharp project separately and cross-compile it for the arm64 target (or whatever architecture you are targeting), the same way you would do for llama.cpp. |
If you want to use a custom compiled DLL, just don't install a backend package (e.g.

If you do that, ensure that you use exactly the correct commit from llama.cpp! There's absolutely no compatibility from version to version. |
@eublefar Thank you very much for the demo. As I said, I'm not a big expert in Unity, so my example was just a provocation which, I hoped, would attract the attention of stronger developers. Unity is an extremely interesting segment for LLMs. It is a huge relief that there is now no need to restart the project; everything shuts down normally. Based on your explanation, I rebuilt the project for the GPU. Everything works very quickly. I'll try to figure out how this version works. |
Thank you very much for your work! I've added the demo to the README. :) |
I have made a pull request regarding the preprocessor directives for targeting newer versions of .NET Standard! I have only made the changes in the root LLamaSharp project, as I haven't explored the other parts. |
Hi everyone, I really like the project and I want to contribute what I can. Can we make a to-do list on the GitHub page? @eublefar @Uralstech |
Of course, you're always welcome! I just made a project which contains a TODO list; please refer to the LLamaSharp Dev Project. You could also join our Discord and I'll invite you to the dev channel. |
I'll close this one now, since it seems like the discussion is over. Feel free to re-open this or open new issues of course :) |
Hi! I was trying to get the latest version of LLamaSharp working in Unity.
This is my script:
And this is my output:
I am on Unity 2022.3, .NET Standard 2.1 and am using TheBloke/llama-2-7B-Guanaco-QLoRA-GGUF as it is listed under 'Verified Model Resources' in the readme. My code is mostly derived from ChatSessionWithRoleName.cs.
I did make a few changes in the LLamaSharp code -

LLama\Exceptions\GrammarFormatExceptions.cs

I changed EncodingExtensions.cs, LLamaBeamsState.cs, LLamaBeamView.cs and NativeApi.BeamSearch.cs, which were using Microsoft.Extensions.Logging; and ILogger. I replaced the ILogger instances with UnityEngine's Debug.Log(), Debug.LogWarning() and Debug.LogError().

I replaced #if NETSTANDARD2_0 and #if !NETSTANDARD2_0 with #if !NETSTANDARD2_1 and #if NETSTANDARD2_1 respectively, as I believe they are compatible.

In LLamaContext.cs, I replaced var last_n_array = lastTokens.TakeLast(last_n_repeat).ToArray(); with var last_n_array = IEnumerableExtensions.TakeLast(lastTokens, last_n_repeat).ToArray(); as I was getting the error: The call is ambiguous between the following methods or properties: 'System.Linq.Enumerable.TakeLast<TSource>(System.Collections.Generic.IEnumerable<TSource>, int)' and 'LLama.Extensions.IEnumerableExtensions.TakeLast<T>(System.Collections.Generic.IEnumerable<T>, int)'.

Please do tell me if I did something wrong.
Thanks in advance!

Edit
If I ever decrease ContextSize from 1024 or increase MaxTokens to above ~150, Unity just crashes. I have narrowed the crash down to a function in SafeLLamaContextHandle.cs.

Update
I downloaded LLamaSharp again and compiled the example project (LLama.Examples) with the same llama-2-7b-guanaco-qlora.Q2_K.gguf model and it works! So my issue has something to do with my changes or with the C# environment that Unity uses.