
This article walks you through designing and implementing a production-grade RAG solution, with a focus on practical implementation patterns and architectural considerations.
Understanding the Technology Stack
Foundry Local: Edge AI Powered by ONNX Runtime
Foundry Local brings efficient, secure, and scalable AI model inference to local devices. Built on ONNX Runtime, it offers several key advantages for enterprise applications:
Hardware abstraction: supports multiple execution providers, including NVIDIA CUDA, AMD, Qualcomm, and Intel, for optimized performance.
Model flexibility: the local gateway implements the same /v1/chat/completions route as OpenAI, so you can point existing Python or JS clients at base_url=manager.endpoint and they work seamlessly (see the sketch after this list).
Privacy-first architecture: all processing happens on-device; sensitive data never leaves it.
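To make the compatibility point concrete, here is a minimal sketch that calls the local gateway over raw HTTP. The port (5273) and model ID are not stated here in the original; they are the values assumed later in Step 3:

using System.Text;

// Any OpenAI-compatible client works; plain HttpClient is enough to demonstrate the route.
using var http = new HttpClient();
var body = """
{
  "model": "qwen2.5-0.5b-instruct-generic-gpu",
  "messages": [{ "role": "user", "content": "Say hello from the edge." }]
}
""";
var response = await http.PostAsync(
    "http://localhost:5273/v1/chat/completions",
    new StringContent(body, Encoding.UTF8, "application/json"));
Console.WriteLine(await response.Content.ReadAsStringAsync());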
Semantic Kernel: The AI Orchestration Layer
Semantic Kernel is a lightweight, open-source development kit that lets you easily build AI agents and integrate the latest AI models into C#, Python, or Java code. It serves as the middle layer between your application logic and AI models, providing:
Model-agnostic design: switch between different large language models with minimal code changes.
Plugin architecture: extend functionality with custom functions and tools.
Memory management: built-in support for semantic memory and vector stores.
RAG Architecture Overview
Combining Foundry Local with Semantic Kernel produces a high-performance, privacy-first, scalable local RAG architecture:
Every component, from document processing to response generation, runs entirely within the local environment.
Implementation Guide
Prerequisites and Environment Setup
Make sure your development environment meets the following requirements:
.NET 8.0 or later
Docker (for running the Qdrant vector store)
Qdrant
Foundry Local installed
Visual Studio Code + .NET Extension Pack
Step 1: Install and Set Up Foundry Local
Install and configure Foundry Local:
# List available models
foundry model list

# Download a suitable model (e.g., qwen2.5-0.5b-instruct)
foundry model download qwen2.5-0.5b-instruct-generic-cpu

# Start the service
foundry service start
Foundry Local automatically downloads the model variant best suited to your system's hardware and software configuration (for example, the CUDA build of a model on machines with an NVIDIA GPU).
Step 2: Configure the Qdrant Vector Store with Docker
Deploy Qdrant via Docker:
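A typical invocation maps both the REST port (6333) and the gRPC port (6334) that the .NET client in Step 4 connects to; the storage volume name here is illustrative:

# Run Qdrant locally, persisting data in a named Docker volume
docker run -p 6333:6333 -p 6334:6334 \
    -v qdrant_storage:/qdrant/storage \
    qdrant/qdrant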
Step 3: Configure Semantic Kernel with Local Models
A key challenge in building a fully local RAG solution is handling embeddings. Most examples we have seen in recent months demonstrate RAG by calling cloud services such as OpenAI or Azure Search. For use cases where data must stay on-premises, that approach may not be viable.
The solution is to configure Semantic Kernel to use Foundry Local for chat completion and an ONNX-based embedding model for text vectorization.
// Initialize the Semantic Kernel builder for local AI orchestration
var builder = Kernel.CreateBuilder();

// Configure the Foundry Local chat completion service.
// This connects to the local Foundry service running on port 5273,
// which exposes OpenAI-compatible API endpoints for seamless integration.
builder.AddOpenAIChatCompletion(
    modelId: "qwen2.5-0.5b-instruct-generic-gpu",  // Model identifier matching Foundry Local
    endpoint: new Uri("http://localhost:5273/v1"), // Local Foundry endpoint
    apiKey: "",                                    // No API key needed for local service
    serviceId: "qwen2.5-0.5b");                    // Service identifier for kernel resolution

// Configure the local ONNX embedding model for text vectorization.
// These models run entirely offline for privacy-preserving embeddings.
var embeddingModelPath = "Your Jinaai jina-embeddings-v2-base-en onnx model path";
var vocabPath = "Your Jinaai jina-embeddings-v2-base-en vocab file path";

// Add BERT-based ONNX embedding generation.
// This enables local text-to-vector conversion without cloud dependencies.
builder.AddBertOnnxTextEmbeddingGeneration(embeddingModelPath, vocabPath);

// Build the configured kernel instance
var kernel = builder.Build();
Step 4: Implement Vector Store Operations
The VectorStoreService class provides a robust interface for managing document embeddings in Qdrant. It handles collection initialization, vector storage, and similarity search, the operations that form the core foundation of our RAG system.
using Qdrant.Client;
using Qdrant.Client.Grpc;

public class VectorStoreService
{
    private readonly QdrantClient _client;
    private readonly string _collectionName;

    /// <summary>Initializes a new instance of the VectorStoreService</summary>
    /// <param name="endpoint">Qdrant server endpoint (e.g., http://localhost:6334)</param>
    /// <param name="apiKey">API key for authentication (empty for local deployment)</param>
    /// <param name="collectionName">Name of the vector collection to manage</param>
    public VectorStoreService(string endpoint, string apiKey, string collectionName)
    {
        _client = new QdrantClient(new Uri(endpoint));
        _collectionName = collectionName;
    }

    /// <summary>
    /// Initializes the vector collection with the specified dimensions.
    /// Creates a new collection if it doesn't exist, otherwise uses the existing one.
    /// </summary>
    /// <param name="vectorSize">Embedding vector dimensions (default: 768 for most BERT models)</param>
    public async Task InitializeAsync(int vectorSize = 768)
    {
        try
        {
            // Attempt to get existing collection info
            await _client.GetCollectionInfoAsync(_collectionName);
        }
        catch
        {
            // Create a new collection with cosine similarity for semantic search
            await _client.CreateCollectionAsync(_collectionName, new VectorParams
            {
                Size = (ulong)vectorSize,
                Distance = Distance.Cosine // Cosine similarity works well for text embeddings
            });
        }
    }

    /// <summary>Stores or updates a vector embedding with associated metadata</summary>
    /// <param name="id">Unique identifier for the vector point</param>
    /// <param name="embedding">Vector embedding of the text chunk</param>
    /// <param name="metadata">Associated metadata (document ID, chunk text, etc.)</param>
    public async Task UpsertAsync(string id, ReadOnlyMemory<float> embedding, Dictionary<string, object> metadata)
    {
        // Create a point structure for Qdrant storage
        var point = new PointStruct
        {
            Id = new PointId { Uuid = id },
            Vectors = embedding.ToArray(),
            Payload = { }
        };

        // Convert metadata to a Qdrant-compatible format
        foreach (var kvp in metadata)
        {
            point.Payload[kvp.Key] = kvp.Value switch
            {
                string s => s,
                int i => i,
                bool b => b,
                _ => kvp.Value.ToString() ?? string.Empty
            };
        }

        // Store the vector point in the collection
        await _client.UpsertAsync(_collectionName, new[] { point });
    }

    /// <summary>Performs similarity search to find relevant document chunks</summary>
    /// <param name="queryEmbedding">Vector embedding of the user query</param>
    /// <param name="limit">Maximum number of results to return</param>
    /// <returns>List of scored points ordered by similarity</returns>
    public async Task<List<ScoredPoint>> SearchAsync(ReadOnlyMemory<float> queryEmbedding, int limit = 3)
    {
        var searchResult = await _client.SearchAsync(_collectionName, queryEmbedding.ToArray(), limit: (ulong)limit);
        return searchResult.ToList();
    }
}
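As a quick usage sketch, the endpoint and collection name below are the same illustrative values used later in Step 7; jina-embeddings-v2-base-en produces 768-dimensional vectors, matching the default:

var store = new VectorStoreService("http://localhost:6334", apiKey: "", collectionName: "demodocs");
await store.InitializeAsync(vectorSize: 768); // matches the embedding model's output dimensions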
Step 5: Build the RAG Query Pipeline
The RagQueryService orchestrates the complete RAG workflow, from query vectorization to context retrieval to response generation. It demonstrates the power of combining local embeddings with Foundry Local's chat completion capabilities:
using Microsoft.Extensions.AI;
using Microsoft.SemanticKernel.ChatCompletion;

public class RagQueryService
{
    private readonly IEmbeddingGenerator<string, Embedding<float>> _embeddingService;
    private readonly IChatCompletionService _chatService;
    private readonly VectorStoreService _vectorStoreService;

    /// <summary>Initializes the RAG query service with required dependencies</summary>
    public RagQueryService(
        IEmbeddingGenerator<string, Embedding<float>> embeddingService,
        IChatCompletionService chatService,
        VectorStoreService vectorStoreService)
    {
        _embeddingService = embeddingService;
        _chatService = chatService;
        _vectorStoreService = vectorStoreService;
    }

    /// <summary>Processes a user question through the complete RAG pipeline</summary>
    /// <param name="question">User's natural language question</param>
    /// <returns>AI-generated answer based on retrieved context</returns>
    public async Task<string> QueryAsync(string question)
    {
        // Step 1: Convert the user question into a vector embedding.
        // This embedding will be used for similarity search in the vector store.
        var queryEmbeddingResult = await _embeddingService.GenerateAsync(question);
        var queryEmbedding = queryEmbeddingResult.Vector;

        // Step 2: Perform semantic search to find the most relevant document chunks.
        // Retrieve the top 5 most similar chunks based on cosine similarity.
        var searchResults = await _vectorStoreService.SearchAsync(queryEmbedding, limit: 5);

        // Step 3: Extract and concatenate text content from the search results.
        // This forms the context that will inform the AI's response.
        string contextText = "";
        foreach (var result in searchResults)
        {
            if (result.Payload.TryGetValue("text", out var text))
            {
                contextText += text.ToString() + " ";
            }
        }

        // Step 4: Construct a prompt that combines the question with the retrieved context.
        // This prompt guides the AI to answer based on the specific context.
        var prompt = $@"Based on the question: '{question}', please provide a comprehensive answer using the following context. Optimize and simplify the content for clarity:

Context: {contextText}";

        // Step 5: Create chat history with a system instruction and the user prompt
        var chatHistory = new ChatHistory();
        chatHistory.AddSystemMessage(
            "You are a helpful assistant that answers questions based on the provided context. " +
            "Use only the information from the context to answer questions accurately.");
        chatHistory.AddUserMessage(prompt);

        // Step 6: Generate a streaming response using Foundry Local.
        // Stream the response for a better user experience.
        var fullMessage = string.Empty;
        await foreach (var chatUpdate in _chatService.GetStreamingChatMessageContentsAsync(chatHistory, cancellationToken: default))
        {
            if (chatUpdate.Content is { Length: > 0 })
            {
                fullMessage += chatUpdate.Content;
            }
        }

        return fullMessage.Length > 0
            ? fullMessage
            : "I couldn't generate a response based on the available context.";
    }
}
Step 6: Document Ingestion and Text Chunking
The DocumentIngestionService handles the critical document preprocessing that RAG depends on. It preserves contextual continuity through intelligent text chunking with overlap, and generates embeddings to support efficient semantic search. For example, with 300-word chunks and a 60-word overlap, successive chunks start every 240 words, so a 700-word document yields three chunks covering words 1-300, 241-540, and 481-700:
using Microsoft.Extensions.AI;

public class DocumentIngestionService
{
    private readonly IEmbeddingGenerator<string, Embedding<float>> _embeddingService;
    private readonly VectorStoreService _vectorStoreService;

    /// <summary>Initializes the document ingestion service</summary>
    public DocumentIngestionService(
        IEmbeddingGenerator<string, Embedding<float>> embeddingService,
        VectorStoreService vectorStoreService)
    {
        _embeddingService = embeddingService;
        _vectorStoreService = vectorStoreService;
    }

    /// <summary>Processes a document by chunking text and storing embeddings</summary>
    /// <param name="documentPath">File path to the document to process</param>
    /// <param name="documentId">Unique identifier for tracking the document</param>
    public async Task IngestDocumentAsync(string documentPath, string documentId)
    {
        // Read the entire document content
        var content = await File.ReadAllTextAsync(documentPath);

        // Split the document into manageable chunks with overlap for context preservation.
        // 300 words per chunk with a 60-word overlap ensures semantic continuity.
        var chunks = ChunkText(content, chunkSize: 300, overlap: 60);

        // Process each chunk individually
        for (int i = 0; i < chunks.Count; i++)
        {
            var chunk = chunks[i];

            // Generate a vector embedding for the text chunk
            var embeddingResult = await _embeddingService.GenerateAsync(chunk);
            var embedding = embeddingResult.Vector;

            // Store the chunk embedding with comprehensive metadata
            await _vectorStoreService.UpsertAsync(
                id: Guid.NewGuid().ToString(),
                embedding: embedding,
                metadata: new Dictionary<string, object>
                {
                    ["document_id"] = documentId,    // Links chunk to original document
                    ["chunk_index"] = i,             // Maintains chunk order
                    ["text"] = chunk,                // Stores original text for retrieval
                    ["document_path"] = documentPath // Tracks source file location
                });
        }
    }

    /// <summary>
    /// Implements intelligent text chunking with configurable overlap.
    /// Overlap ensures that context spanning chunk boundaries is preserved.
    /// </summary>
    /// <param name="text">Text content to chunk</param>
    /// <param name="chunkSize">Number of words per chunk</param>
    /// <param name="overlap">Number of overlapping words between chunks</param>
    /// <returns>List of text chunks with preserved context</returns>
    private List<string> ChunkText(string text, int chunkSize, int overlap)
    {
        var chunks = new List<string>();
        var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);

        // Create overlapping chunks to maintain context continuity
        for (int i = 0; i < words.Length; i += chunkSize - overlap)
        {
            // Extract words for this chunk, respecting boundaries
            var chunkWords = words.Skip(i).Take(chunkSize).ToArray();
            var chunk = string.Join(" ", chunkWords);
            chunks.Add(chunk);

            // Stop if we've processed all words
            if (i + chunkSize >= words.Length) break;
        }

        return chunks;
    }
}
Step 7: Orchestrate the Complete RAG Application
This final step shows how to assemble all the components into a working RAG application. The code walks through the full flow from service initialization, through document processing, to query execution:
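The listing below picks up at Step 2, so the chat and embedding services it references first need to be resolved from the kernel built in Step 3. A minimal sketch, assuming the Microsoft.Extensions.AI-style IEmbeddingGenerator registration used above (the exact embedding abstraction depends on the connector version you have installed):

// Step 1 (sketch): resolve the services registered on the kernel in Step 3
var chatService = kernel.GetRequiredService<IChatCompletionService>();
var embeddingService = kernel.GetRequiredService<IEmbeddingGenerator<string, Embedding<float>>>();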
// Step 2: Initialize the vector store service.
// Connect to the local Qdrant instance running on port 6334.
// The collection "demodocs" will store our document embeddings.
var vectorStoreService = new VectorStoreService(
    endpoint: "http://localhost:6334",
    apiKey: "", // No API key needed for local Qdrant
    collectionName: "demodocs");

// Step 3: Initialize the vector collection.
// This creates the collection if it doesn't exist, with the proper embedding dimensions.
await vectorStoreService.InitializeAsync();

// Step 4: Create service instances for document processing and querying
var documentIngestionService = new DocumentIngestionService(embeddingService, vectorStoreService);
var ragQueryService = new RagQueryService(embeddingService, chatService, vectorStoreService);

// Step 5: Ingest a sample document into the RAG system.
// Replace with your actual document path and provide a unique document ID.
var filePath = "./foundry-local-architecture.md";
var documentId = "foundry-architecture-doc";

// Process the document: chunk text, generate embeddings, and store them in the vector database
await documentIngestionService.IngestDocumentAsync(filePath, documentId);

// Step 6: Test the RAG system with a sample query
var question = "What's Foundry Local?";

// Execute the complete RAG pipeline:
// 1. Convert the question to an embedding
// 2. Search for relevant document chunks
// 3. Generate a contextual response using Foundry Local
var answer = await ragQueryService.QueryAsync(question);

// Step 7: Display the result
Console.WriteLine($"Question: {question}");
Console.WriteLine($"Answer: {answer}");
Key integration points:
Service resolution: the kernel automatically resolves the configured chat and embedding services.
Vector store management: proper initialization ensures the collection exists with the correct dimensions.
Error handling: the system gracefully handles missing collections and connection issues.
Scalability: the pattern supports multiple documents and concurrent queries (see the sketch below).
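To illustrate the concurrency point, multiple questions can be fanned out against the same service instances. This is a sketch, not part of the original walkthrough; the questions are hypothetical:

// Run several queries concurrently against the shared RAG pipeline
var questions = new[] { "What's Foundry Local?", "Which execution providers are supported?" };
var answers = await Task.WhenAll(questions.Select(q => ragQueryService.QueryAsync(q)));
for (int i = 0; i < questions.Length; i++)
    Console.WriteLine($"{questions[i]} -> {answers[i]}");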
Conclusion
Building RAG applications with Semantic Kernel and Foundry Local provides a solid foundation for privacy-conscious, cost-effective AI solutions. This architecture lets organizations harness powerful language model capabilities while retaining full control over their data and infrastructure.
Combining Semantic Kernel's orchestration capabilities with Foundry Local's edge-optimized inference creates a production-grade platform that scales from development to enterprise deployment. As the local AI ecosystem matures, this approach positions organizations to take advantage of emerging capabilities while meeting privacy and security requirements.
By applying the patterns and practices in this guide, development teams can build sophisticated RAG applications that meet enterprise performance requirements without compromising data privacy or operating costs.
卢建晖
Senior Cloud Advocate at Microsoft
Focused on AI + Data; author of the Phi-3 Cookbook, which has drawn more than 700,000 reads.