
This article walks you through designing and implementing a production-grade RAG solution, with a focus on practical implementation patterns and architectural considerations.
Understanding the Technology Stack
Foundry Local: Edge AI Powered by ONNX Runtime
Foundry Local brings efficient, secure, and scalable AI model inference to local devices. Built on ONNX Runtime, it offers several key advantages for enterprise applications:
Hardware abstraction: supports multiple execution providers, including NVIDIA CUDA, AMD, Qualcomm, and Intel, for optimized performance.
Model flexibility: the local gateway implements the same /v1/chat/completions route as OpenAI, so you can point existing Python or JS clients at base_url=manager.endpoint and they work seamlessly (see the sketch after this list).
Privacy-first architecture: all processing happens on-device; sensitive data never leaves it.
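To make the compatibility point concrete, here is a minimal sketch that calls the local gateway over raw HTTP. The port (5273) and model ID are not stated here in the original; they are the values assumed later in Step 3:

using System.Text;

// Any OpenAI-compatible client works; plain HttpClient is enough to demonstrate the route.
using var http = new HttpClient();
var body = """
{
  "model": "qwen2.5-0.5b-instruct-generic-gpu",
  "messages": [{ "role": "user", "content": "Say hello from the edge." }]
}
""";
var response = await http.PostAsync(
    "http://localhost:5273/v1/chat/completions",
    new StringContent(body, Encoding.UTF8, "application/json"));
Console.WriteLine(await response.Content.ReadAsStringAsync());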
Semantic Kernel: The AI Orchestration Layer
Semantic Kernel is a lightweight, open-source development kit that lets you easily build AI agents and integrate the latest AI models into C#, Python, or Java code. It serves as the middle layer between your application logic and AI models, providing:
Model-agnostic design: switch between different large language models with minimal code changes.
Plugin architecture: extend functionality with custom functions and tools.
Memory management: built-in support for semantic memory and vector stores.
RAG Architecture Overview
Combining Foundry Local with Semantic Kernel produces a high-performance, privacy-first, scalable local RAG architecture:
Every component, from document processing to response generation, runs entirely within the local environment.
Implementation Guide
Prerequisites and Environment Setup
Make sure your development environment meets the following requirements:
.NET 8.0 or later
Docker (for running the Qdrant vector store)
Qdrant
Foundry Local installed
Visual Studio Code + .NET Extension Pack
Step 1: Install and Set Up Foundry Local
Install and configure Foundry Local:
# List available models
foundry model list

# Download a suitable model (e.g., qwen2.5-0.5b-instruct)
foundry model download qwen2.5-0.5b-instruct-generic-cpu

# Start the service
foundry service start
Foundry Local automatically downloads the model variant best suited to your system's hardware and software configuration (for example, the CUDA build of a model on machines with an NVIDIA GPU).
Step 2: Configure the Qdrant Vector Store with Docker
Deploy Qdrant via Docker:
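A typical invocation maps both the REST port (6333) and the gRPC port (6334) that the .NET client in Step 4 connects to; the storage volume name here is illustrative:

# Run Qdrant locally, persisting data in a named Docker volume
docker run -p 6333:6333 -p 6334:6334 \
    -v qdrant_storage:/qdrant/storage \
    qdrant/qdrant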
Step 3: Configure Semantic Kernel with Local Models
A key challenge in building a fully local RAG solution is handling embeddings. Most examples we have seen in recent months demonstrate RAG by calling cloud services such as OpenAI or Azure Search. For use cases where data must stay on-premises, that approach may not be viable.
The solution is to configure Semantic Kernel to use Foundry Local for chat completion and an ONNX-based embedding model for text vectorization.
// Initialize the Semantic Kernel builder for local AI orchestration
var builder = Kernel.CreateBuilder();

// Configure the Foundry Local chat completion service.
// This connects to the local Foundry service running on port 5273,
// which exposes OpenAI-compatible API endpoints for seamless integration.
builder.AddOpenAIChatCompletion(
    modelId: "qwen2.5-0.5b-instruct-generic-gpu",  // Model identifier matching Foundry Local
    endpoint: new Uri("http://localhost:5273/v1"), // Local Foundry endpoint
    apiKey: "",                                    // No API key needed for local service
    serviceId: "qwen2.5-0.5b");                    // Service identifier for kernel resolution

// Configure the local ONNX embedding model for text vectorization.
// These models run entirely offline for privacy-preserving embeddings.
var embeddingModelPath = "Your Jinaai jina-embeddings-v2-base-en onnx model path";
var vocabPath = "Your Jinaai jina-embeddings-v2-base-en vocab file path";

// Add BERT-based ONNX embedding generation.
// This enables local text-to-vector conversion without cloud dependencies.
builder.AddBertOnnxTextEmbeddingGeneration(embeddingModelPath, vocabPath);

// Build the configured kernel instance
var kernel = builder.Build();
Step 4: Implement Vector Store Operations
The VectorStoreService class provides a robust interface for managing document embeddings in Qdrant. It handles collection initialization, vector storage, and similarity search, the operations that form the core foundation of our RAG system.
using Qdrant.Client;
using Qdrant.Client.Grpc;

public class VectorStoreService
{
    private readonly QdrantClient _client;
    private readonly string _collectionName;

    /// <summary>Initializes a new instance of the VectorStoreService</summary>
    /// <param name="endpoint">Qdrant server endpoint (e.g., http://localhost:6334)</param>
    /// <param name="apiKey">API key for authentication (empty for local deployment)</param>
    /// <param name="collectionName">Name of the vector collection to manage</param>
    public VectorStoreService(string endpoint, string apiKey, string collectionName)
    {
        _client = new QdrantClient(new Uri(endpoint));
        _collectionName = collectionName;
    }

    /// <summary>
    /// Initializes the vector collection with the specified dimensions.
    /// Creates a new collection if it doesn't exist, otherwise uses the existing one.
    /// </summary>
    /// <param name="vectorSize">Embedding vector dimensions (default: 768 for most BERT models)</param>
    public async Task InitializeAsync(int vectorSize = 768)
    {
        try
        {
            // Attempt to get existing collection info
            await _client.GetCollectionInfoAsync(_collectionName);
        }
        catch
        {
            // Create a new collection with cosine similarity for semantic search
            await _client.CreateCollectionAsync(_collectionName, new VectorParams
            {
                Size = (ulong)vectorSize,
                Distance = Distance.Cosine // Cosine similarity works well for text embeddings
            });
        }
    }

    /// <summary>Stores or updates a vector embedding with associated metadata</summary>
    /// <param name="id">Unique identifier for the vector point</param>
    /// <param name="embedding">Vector embedding of the text chunk</param>
    /// <param name="metadata">Associated metadata (document ID, chunk text, etc.)</param>
    public async Task UpsertAsync(string id, ReadOnlyMemory<float> embedding, Dictionary<string, object> metadata)
    {
        // Create a point structure for Qdrant storage
        var point = new PointStruct
        {
            Id = new PointId { Uuid = id },
            Vectors = embedding.ToArray(),
            Payload = { }
        };

        // Convert metadata to a Qdrant-compatible format
        foreach (var kvp in metadata)
        {
            point.Payload[kvp.Key] = kvp.Value switch
            {
                string s => s,
                int i => i,
                bool b => b,
                _ => kvp.Value.ToString() ?? string.Empty
            };
        }

        // Store the vector point in the collection
        await _client.UpsertAsync(_collectionName, new[] { point });
    }

    /// <summary>Performs similarity search to find relevant document chunks</summary>
    /// <param name="queryEmbedding">Vector embedding of the user query</param>
    /// <param name="limit">Maximum number of results to return</param>
    /// <returns>List of scored points ordered by similarity</returns>
    public async Task<List<ScoredPoint>> SearchAsync(ReadOnlyMemory<float> queryEmbedding, int limit = 3)
    {
        var searchResult = await _client.SearchAsync(_collectionName, queryEmbedding.ToArray(), limit: (ulong)limit);
        return searchResult.ToList();
    }
}
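As a quick usage sketch, the endpoint and collection name below are the same illustrative values used later in Step 7; jina-embeddings-v2-base-en produces 768-dimensional vectors, matching the default:

var store = new VectorStoreService("http://localhost:6334", apiKey: "", collectionName: "demodocs");
await store.InitializeAsync(vectorSize: 768); // matches the embedding model's output dimensions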
Step 5: Build the RAG Query Pipeline
The RagQueryService orchestrates the complete RAG workflow, from query vectorization to context retrieval to response generation. It demonstrates the power of combining local embeddings with Foundry Local's chat completion capabilities:
using Microsoft.Extensions.AI;
using Microsoft.SemanticKernel.ChatCompletion;

public class RagQueryService
{
    private readonly IEmbeddingGenerator<string, Embedding<float>> _embeddingService;
    private readonly IChatCompletionService _chatService;
    private readonly VectorStoreService _vectorStoreService;

    /// <summary>Initializes the RAG query service with required dependencies</summary>
    public RagQueryService(
        IEmbeddingGenerator<string, Embedding<float>> embeddingService,
        IChatCompletionService chatService,
        VectorStoreService vectorStoreService)
    {
        _embeddingService = embeddingService;
        _chatService = chatService;
        _vectorStoreService = vectorStoreService;
    }

    /// <summary>Processes a user question through the complete RAG pipeline</summary>
    /// <param name="question">User's natural language question</param>
    /// <returns>AI-generated answer based on retrieved context</returns>
    public async Task<string> QueryAsync(string question)
    {
        // Step 1: Convert the user question into a vector embedding.
        // This embedding will be used for similarity search in the vector store.
        var queryEmbeddingResult = await _embeddingService.GenerateAsync(question);
        var queryEmbedding = queryEmbeddingResult.Vector;

        // Step 2: Perform semantic search to find the most relevant document chunks.
        // Retrieve the top 5 most similar chunks based on cosine similarity.
        var searchResults = await _vectorStoreService.SearchAsync(queryEmbedding, limit: 5);

        // Step 3: Extract and concatenate text content from the search results.
        // This forms the context that will inform the AI's response.
        string contextText = "";
        foreach (var result in searchResults)
        {
            if (result.Payload.TryGetValue("text", out var text))
            {
                contextText += text.ToString() + " ";
            }
        }

        // Step 4: Construct a prompt that combines the question with the retrieved context.
        // This prompt guides the AI to answer based on the specific context.
        var prompt = $@"Based on the question: '{question}', please provide a comprehensive answer using the following context. Optimize and simplify the content for clarity:

Context: {contextText}";

        // Step 5: Create chat history with a system instruction and the user prompt
        var chatHistory = new ChatHistory();
        chatHistory.AddSystemMessage(
            "You are a helpful assistant that answers questions based on the provided context. " +
            "Use only the information from the context to answer questions accurately.");
        chatHistory.AddUserMessage(prompt);

        // Step 6: Generate a streaming response using Foundry Local.
        // Stream the response for a better user experience.
        var fullMessage = string.Empty;
        await foreach (var chatUpdate in _chatService.GetStreamingChatMessageContentsAsync(chatHistory, cancellationToken: default))
        {
            if (chatUpdate.Content is { Length: > 0 })
            {
                fullMessage += chatUpdate.Content;
            }
        }

        return fullMessage.Length > 0
            ? fullMessage
            : "I couldn't generate a response based on the available context.";
    }
}
Step 6: Document Ingestion and Text Chunking
The DocumentIngestionService handles the critical document preprocessing that RAG depends on. It preserves contextual continuity through intelligent text chunking with overlap, and generates embeddings to support efficient semantic search. For example, with 300-word chunks and a 60-word overlap, successive chunks start every 240 words, so a 700-word document yields three chunks covering words 1-300, 241-540, and 481-700:
using Microsoft.Extensions.AI;

public class DocumentIngestionService
{
    private readonly IEmbeddingGenerator<string, Embedding<float>> _embeddingService;
    private readonly VectorStoreService _vectorStoreService;

    /// <summary>Initializes the document ingestion service</summary>
    public DocumentIngestionService(
        IEmbeddingGenerator<string, Embedding<float>> embeddingService,
        VectorStoreService vectorStoreService)
    {
        _embeddingService = embeddingService;
        _vectorStoreService = vectorStoreService;
    }

    /// <summary>Processes a document by chunking text and storing embeddings</summary>
    /// <param name="documentPath">File path to the document to process</param>
    /// <param name="documentId">Unique identifier for tracking the document</param>
    public async Task IngestDocumentAsync(string documentPath, string documentId)
    {
        // Read the entire document content
        var content = await File.ReadAllTextAsync(documentPath);

        // Split the document into manageable chunks with overlap for context preservation.
        // 300 words per chunk with a 60-word overlap ensures semantic continuity.
        var chunks = ChunkText(content, chunkSize: 300, overlap: 60);

        // Process each chunk individually
        for (int i = 0; i < chunks.Count; i++)
        {
            var chunk = chunks[i];

            // Generate a vector embedding for the text chunk
            var embeddingResult = await _embeddingService.GenerateAsync(chunk);
            var embedding = embeddingResult.Vector;

            // Store the chunk embedding with comprehensive metadata
            await _vectorStoreService.UpsertAsync(
                id: Guid.NewGuid().ToString(),
                embedding: embedding,
                metadata: new Dictionary<string, object>
                {
                    ["document_id"] = documentId,    // Links chunk to original document
                    ["chunk_index"] = i,             // Maintains chunk order
                    ["text"] = chunk,                // Stores original text for retrieval
                    ["document_path"] = documentPath // Tracks source file location
                });
        }
    }

    /// <summary>
    /// Implements intelligent text chunking with configurable overlap.
    /// Overlap ensures that context spanning chunk boundaries is preserved.
    /// </summary>
    /// <param name="text">Text content to chunk</param>
    /// <param name="chunkSize">Number of words per chunk</param>
    /// <param name="overlap">Number of overlapping words between chunks</param>
    /// <returns>List of text chunks with preserved context</returns>
    private List<string> ChunkText(string text, int chunkSize, int overlap)
    {
        var chunks = new List<string>();
        var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);

        // Create overlapping chunks to maintain context continuity
        for (int i = 0; i < words.Length; i += chunkSize - overlap)
        {
            // Extract words for this chunk, respecting boundaries
            var chunkWords = words.Skip(i).Take(chunkSize).ToArray();
            var chunk = string.Join(" ", chunkWords);
            chunks.Add(chunk);

            // Stop if we've processed all words
            if (i + chunkSize >= words.Length) break;
        }

        return chunks;
    }
}
Step 7: Orchestrate the Complete RAG Application
This final step shows how to assemble all the components into a working RAG application. The code walks through the full flow from service initialization, through document processing, to query execution:
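The listing below picks up at Step 2, so the chat and embedding services it references first need to be resolved from the kernel built in Step 3. A minimal sketch, assuming the Microsoft.Extensions.AI-style IEmbeddingGenerator registration used above (the exact embedding abstraction depends on the connector version you have installed):

// Step 1 (sketch): resolve the services registered on the kernel in Step 3
var chatService = kernel.GetRequiredService<IChatCompletionService>();
var embeddingService = kernel.GetRequiredService<IEmbeddingGenerator<string, Embedding<float>>>();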
// Step 2: Initialize the vector store service.
// Connect to the local Qdrant instance running on port 6334.
// The collection "demodocs" will store our document embeddings.
var vectorStoreService = new VectorStoreService(
    endpoint: "http://localhost:6334",
    apiKey: "", // No API key needed for local Qdrant
    collectionName: "demodocs");

// Step 3: Initialize the vector collection.
// This creates the collection if it doesn't exist, with the proper embedding dimensions.
await vectorStoreService.InitializeAsync();

// Step 4: Create service instances for document processing and querying
var documentIngestionService = new DocumentIngestionService(embeddingService, vectorStoreService);
var ragQueryService = new RagQueryService(embeddingService, chatService, vectorStoreService);

// Step 5: Ingest a sample document into the RAG system.
// Replace with your actual document path and provide a unique document ID.
var filePath = "./foundry-local-architecture.md";
var documentId = "foundry-architecture-doc";

// Process the document: chunk text, generate embeddings, and store them in the vector database
await documentIngestionService.IngestDocumentAsync(filePath, documentId);

// Step 6: Test the RAG system with a sample query
var question = "What's Foundry Local?";

// Execute the complete RAG pipeline:
// 1. Convert the question to an embedding
// 2. Search for relevant document chunks
// 3. Generate a contextual response using Foundry Local
var answer = await ragQueryService.QueryAsync(question);

// Step 7: Display the result
Console.WriteLine($"Question: {question}");
Console.WriteLine($"Answer: {answer}");
Key integration points:
Service resolution: the kernel automatically resolves the configured chat and embedding services.
Vector store management: proper initialization ensures the collection exists with the correct dimensions.
Error handling: the system gracefully handles missing collections and connection issues.
Scalability: the pattern supports multiple documents and concurrent queries (see the sketch below).
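To illustrate the concurrency point, multiple questions can be fanned out against the same service instances. This is a sketch, not part of the original walkthrough; the questions are hypothetical:

// Run several queries concurrently against the shared RAG pipeline
var questions = new[] { "What's Foundry Local?", "Which execution providers are supported?" };
var answers = await Task.WhenAll(questions.Select(q => ragQueryService.QueryAsync(q)));
for (int i = 0; i < questions.Length; i++)
    Console.WriteLine($"{questions[i]} -> {answers[i]}");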
Conclusion
Building RAG applications with Semantic Kernel and Foundry Local provides a solid foundation for privacy-conscious, cost-effective AI solutions. This architecture lets organizations harness powerful language model capabilities while retaining full control over their data and infrastructure.
Combining Semantic Kernel's orchestration capabilities with Foundry Local's edge-optimized inference creates a production-grade platform that scales from development to enterprise deployment. As the local AI ecosystem matures, this approach positions organizations to take advantage of emerging capabilities while meeting privacy and security requirements.
By applying the patterns and practices in this guide, development teams can build sophisticated RAG applications that meet enterprise performance requirements without compromising data privacy or operating costs.
卢建晖
Senior Cloud Advocate at Microsoft
Focused on AI + Data; author of the Phi-3 Cookbook, which has drawn more than 700,000 reads.