【ChatGPT 】如何使用自定义知识库构建自己的自定义ChatGPT

developer.chat

9 May 2023

SEO Title

How To Build Your Own Custom ChatGPT With Custom Knowledge Base

1.通过Prompt Engineering提供数据

在我们讨论如何扩展ChatGPT之前，让我们看看如何手动扩展ChatGPT以及存在哪些问题。扩展ChatGPT的传统方法是通过即时工程(prompt engineering)。

这很简单，因为ChatGPT是上下文感知的。首先，我们需要通过在实际问题之前附加原始文档内容来与ChatGPT进行交互。

I will ask you questions based on the following content:
- Start of Content-
Your very long text to give ChatGPT context
- End of Content-

这种方法的问题在于，该模型的上下文有限；它只能接受GPT-3的大约4097个令牌。这种方法很快就会遇到麻烦，因为总是要粘贴内容也是一个非常手动、乏味的过程。

想象一下，有数百个PDF文档要注入到ChatGPT中。你很快就会遇到付费墙问题。您可能认为GPT-4是GPT-3的继任者。它于2023年3月14日刚刚推出，可以处理25000个单词，大约是GPT-3处理图像的八倍，并且可以处理比GPT-3.5更细微的指令。这仍然存在数据输入限制这一基本问题。我们如何绕过这些限制？我们可以利用一个名为LlamaIndex的Python库。

2.使用LlamaIndex扩展ChatGPT（GPT索引）

LlamaIndex，也称为GPT索引，是一个提供中央接口以将LLM与外部数据连接的项目。是的，你没看错。使用LlamaIndex，我们可以构建如下图所示的内容：

Custom data sources feeding into ChatGPT

LlamaIndex将您现有的数据源和类型与可用的数据连接器连接起来，例如（API、PDF、文档、SQL等）。它通过为结构化和非结构化数据提供索引，使您能够使用LLM。这些指数通过去除典型的样板和痛点来促进上下文学习：以可访问的方式保存上下文，以便快速插入。

当上下文太大时，处理提示限制（GPT-3 Davinci的4096个令牌限制和GPT-4的8000个令牌限制）变得更容易访问，并通过为用户提供与索引交互的方式来解决文本拆分问题。LlamaInde还抽象了从文档中提取相关部分并将其提供给提示的过程。

如何添加自定义数据源

在本节中，我们将使用GPT“text-davinci-003”和LlamaIndex基于预先存在的文档创建问答聊天机器人。

先决条件

在我们开始之前，请确保您可以访问以下内容：

Python≥3.7安装在您的机器上
OpenAI API密钥，可在OpenAI网站上找到。您可以使用您的Gmail帐户进行单次登录。

一些Word文档上传到您的谷歌文档中。LlamaIndex支持许多不同的数据源。在本教程中，我们将演示谷歌文档。

工作原理

使用LlamaIndex创建文档数据索引。
使用自然语言搜索索引。
LlamaIndex将检索相关片段，并将其传递给GPT提示符。LlamaIndex将把原始文档数据转换为便于查询的矢量化索引。它将利用该索引根据查询和数据的匹配程度来查找最相关的部分。然后，信息将加载到提示中，提示将发送给GPT，以便GPT拥有回答您的问题所需的背景。
在那之后，您可以询问ChatGPT，给定上下文中的提要。

为您的Python项目创建一个新文件夹，您可以调用mychatbot，最好使用虚拟环境或conda环境。

我们需要首先安装依赖关系库。方法如下：

pip install openai
pip install llama-index
pip install google-auth-oauthlib

接下来，我们将导入Python中的库，并在新的main.py文件中设置OpenAI API密钥。

# Import necessary packages
import os
import pickle

from google.auth.transport.requests import Request

from google_auth_oauthlib.flow import InstalledAppFlow
from llama_index import GPTSimpleVectorIndex, download_loader

os.environ['OPENAI_API_KEY'] = 'SET-YOUR-OPEN-AI-API-KEY'

在上面的片段中，为了清晰起见，我们显式地设置了环境变量，因为LlamaIndex包隐式地要求访问OpenAI。在典型的生产环境中，您可以将密钥放入环境变量、保险库或您的基础设施可以访问的任何机密管理服务中。

让我们构造一个函数来帮助我们根据我们的谷歌帐户进行身份验证，以发现谷歌文档。

def authorize_gdocs():
    google_oauth2_scopes = [
        "https://www.googleapis.com/auth/documents.readonly"
    ]
    cred = None
    if os.path.exists("token.pickle"):
        with open("token.pickle", 'rb') as token:
            cred = pickle.load(token)
    if not cred or not cred.valid:
        if cred and cred.expired and cred.refresh_token:
            cred.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file("credentials.json", google_oauth2_scopes)
            cred = flow.run_local_server(port=0)
        with open("token.pickle", 'wb') as token:
            pickle.dump(cred, token)

要启用Google Docs API并在Google控制台中获取凭据，可以执行以下步骤：

转到谷歌云控制台网站（Console.Cloud.Google.com）。
如果你还没有创建一个新项目。您可以通过单击顶部导航栏中的“选择项目”下拉菜单并选择“新建项目”来完成此操作。按照提示为项目命名并选择要与其关联的组织。
创建项目后，请从顶部导航栏的下拉菜单中进行选择。
从左侧菜单转到“API和服务”部分，然后单击页面顶部的“+ENABLE APIs and Services”按钮。
在搜索栏中搜索“Google Docs API”并从结果列表中选择它。
单击“启用”按钮为您的项目启用API。
单击OAuth同意屏幕菜单，创建并为您的应用程序命名，例如“mychatbot”，然后输入支持电子邮件，保存并添加范围。

您还必须添加测试用户，因为此谷歌应用程序尚未获得批准。这可以是你自己的电子邮件。

然后，您需要为您的项目设置凭据以使用API。要执行此操作，请转到左侧菜单中的“凭据”部分，然后单击“创建凭据”。选择“OAuth客户端ID”，然后按照提示设置凭据。

设置好凭据后，您可以下载JSON文件并将其存储在应用程序的根目录中，如下所示：

Example folder structure with google credentials in root

设置凭据后，可以从Python项目访问Google Docs API。

转到您的谷歌文档，打开其中一些文档，然后获得可以在浏览器URL栏中看到的唯一id，如下图所示：

复制gdoc ID并将它们粘贴到下面的代码中。您可以有N个gdocs进行索引，这样ChatGPT就可以对您的自定义知识库进行上下文访问。我们将使用LlamaIndex库中的GoogleDocsReader插件来加载您的文档。

# function to authorize or download latest credentials 
authorize_gdocs()

# initialize LlamaIndex google doc reader 
GoogleDocsReader = download_loader('GoogleDocsReader')

# list of google docs we want to index 
gdoc_ids = ['1ofZ96nWEZYCJsteRfqik_xNQTGFHtnc-7cYrf0dMPKQ']

loader = GoogleDocsReader()

# load gdocs and index them 
documents = loader.load_data(document_ids=gdoc_ids)
index = GPTSimpleVectorIndex(documents)

LlamaIndex有各种数据连接器，涵盖Notion、Obsidian、Reddit、Slack等服务。您可以在此处找到可用数据连接器的压缩列表。

如果您希望动态保存和加载索引，可以使用以下函数调用。这将加快从预先保存的索引中提取数据的过程，而不是对外部源进行API调用。

# Save your index to a index.json file
index.save_to_disk('index.json')
# Load the index from your saved index.json file
index = GPTSimpleVectorIndex.load_from_disk('index.json')

通过运行下面的代码可以查询索引并获得响应。代码可以很容易地扩展为rest API，该API连接到UI，您可以在UI中通过GPT接口与自定义数据源进行交互。

# Querying the index
while True:
    prompt = input("Type prompt...")
    response = index.query(prompt)
    print(response)

考虑到我们有一个谷歌文档，里面有关于我的详细信息，如果你在谷歌上公开搜索，这些信息就很容易获得。

我们将首先直接与vanilla ChatGPT交互，看看它在不注入自定义数据源的情况下生成了什么输出。

这有点令人失望！让我们再试一次。

INFO:google_auth_oauthlib.flow:"GET /?state=oz9XY8CE3LaLLsTxIz4sDgrHha4fEJ&code=4/0AWtgzh4LlIfmCMEa0t36dse_xoS0fXFeEWKHFiouzTvz4Qwr7T2Pj6anb-GiZ__Wg-hBBg&scope=https://www.googleapis.com/auth/documents.readonly HTTP/1.1" 200 65
INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 175 tokens
Type prompt...who is timothy mugayi hint he is a writer on medium

INFO:root:> [query] Total LLM token usage: 300 tokens
INFO:root:> [query] Total embedding token usage: 14 tokens
Timothy Mugayi is an Engineering Manager at OVO (PT Visionet Internasional), a subsidiary of GRAB. He is also an avid writer on medium.com who writes on technical topics covering python and freelancing side hustling for programmers. Timothy has been coding for over 15 years, building enterprise solutions for large cooperations. During his free time, he enjoys mentoring and coaching.
last_token_usage=300
Type prompt...

Type prompt...Given you know who timothy mugayi is write an interesting introduction about him

Timothy Mugayi is an experienced and accomplished professional with a wealth of knowledge in engineering, coding, and mentoring. He is currently an Engineering Manager at OVO, a subsidiary of GRAB, and has been coding for over 15 years, building enterprise solutions for large cooperations. In his free time, Timothy enjoys writing on technical topics such as Python and freelancing side hustling for programmers on medium.com, as well as mentoring and coaching. With his impressive background and expertise, Timothy is a valuable asset to any organization.
last_token_usage=330

它现在可以使用新的自定义数据源推断答案，从而准确地生成以下输出。

我们可以更进一步。

Type prompt...Write a cover letter for timothy mugayi for an upwork python project to build a custom ChatGPT bot with access to external data sources
INFO:root:> [query] Total LLM token usage: 436 tokens
INFO:root:> [query] Total embedding token usage: 30 tokens

Dear [Hiring Manager],

I am writing to apply for the Python project to build a custom ChatGPT bot with access to external data sources. With over 15 years of experience in coding and building enterprise solutions for large corporations, I am confident that I am the ideal candidate for this position.

I am currently an Engineering Manager at OVO (PT Visionet Internasional), a subsidiary of GRAB. I have extensive experience in Python and have been writing on technical topics covering Python and freelancing side hustling for programmers on medium.com. I am also an avid mentor and coach, and I believe that my experience and skillset make me the perfect candidate for this project.

I am confident that I can deliver a high-quality product that meets the requirements of the project. I am also available to discuss the project further and answer any questions you may have.

Thank you for your time and consideration.

Sincerely,
Timothy Mugayi
last_token_usage=436
Type prompt...

LlamaIndex将在内部接受您的提示，在索引中搜索相关的块，然后将您的提示和相关的块传递给ChatGPT模型。上面的程序演示了LlamaIndex和GPT在回答问题时的基本首次使用。然而，你可以做的还有很多。只有在配置LlamaIndex以使用不同的大型语言模型（LLM）、为各种活动使用不同类型的索引或以编程方式用新索引更新旧索引时，你的创造力才会受到限制。

以下是显式更改LLM模型的示例。这一次，我们使用了另一个与LlamaIndex捆绑在一起的Python包langchain。

from langchain import OpenAI
from llama_index import LLMPredictor, GPTSimpleVectorIndex, PromptHelper

...

# define anoter LLM explicitly
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))

# define prompt configuraiton
# set maximum input size
max_input_size = 4096
# set number of output tokens
num_output = 256
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

index = GPTSimpleVectorIndex(
    documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper
)

如果你想密切关注你的OpenAI免费或付费信用，你可以导航到OpenAI仪表板，检查还剩多少信用。

创建索引、插入索引和查询索引都将使用令牌。因此，在构建自定义机器人程序时，确保您输出用于跟踪目的的令牌使用情况始终很重要。

last_token_usage = index.llm_predictor.last_token_usage

print(f"last_token_usage={last_token_usage}")

最后的想法

ChatGPT与LlamaIndex相结合可以帮助构建一个定制的ChatGPT聊天机器人，该聊天机器人可以根据自己的文档来源推断知识。虽然ChatGPT和其他LLM非常强大，但扩展LLM模型提供了更精细的体验，并为构建对话式聊天机器人打开了可能性，该聊天机器人可用于构建真实的业务用例，如客户支持帮助，甚至垃圾邮件分类器。如果我们可以提供实时数据，我们可以评估在一定时期内训练的ChatGPT模型的一些局限性。

有关完整的源代码，您可以参考

GitHub repo.

文章链接

https://developer.chat/how-build-your-own-custom-chatgpt-custom-knowledge-base

登录发表评论

【ChatGPT 】如何使用自定义知识库构建自己的自定义ChatGPT

category

1.通过Prompt Engineering提供数据

2.使用LlamaIndex扩展ChatGPT（GPT索引）

如何添加自定义数据源

先决条件

工作原理

最后的想法

标签

Search

category

1.通过Prompt Engineering提供数据

2.使用LlamaIndex扩展ChatGPT（GPT索引）

如何添加自定义数据源

先决条件

工作原理

最后的想法

标签