在本地上下文数据库中使用 Vanna¶
通常使用我们的托管服务更容易开始使用 Vanna。但是,如果您想在本地运行 Vanna,可以通过使用 ChromaDB 运行本地上下文数据库来实现。本 Notebook 将引导您完成相关步骤。
输入 [ ]
%pip install 'vanna[chromadb,snowflake,openai]'
输入 [1]
from vanna.local import LocalContext_OpenAI
配置 OpenAI API 密钥¶
如果您使用的是 Azure OpenAI,除了 API 密钥之外,还需要指定一些额外的参数
输入 [2]
# Use the regular OpenAI API
vn = LocalContext_OpenAI({"api_key": "sk-..."})
# Use the Azure OpenAI API
vn = LocalContext_OpenAI(config={
"api_type": "azure",
"api_base": "https://...",
"api_version": "2023-05-15",
"engine": "YOUR_ENGINE_HERE",
"api_key": "sk-..."
})
使用示例 SQL 查询进行训练¶
最简单的入门方法是粘贴一个 SQL 查询,让 Vanna 基于它进行训练。这将创建一个新的上下文数据库,并在其上训练一个模型。然后您可以使用该模型生成 SQL 查询。
输入 [3]
vn.train(sql="""
SELECT c.c_name as customer_name,
sum(l.l_extendedprice * (1 - l.l_discount)) as total_sales
FROM snowflake_sample_data.tpch_sf1.lineitem l join snowflake_sample_data.tpch_sf1.orders o
ON l.l_orderkey = o.o_orderkey join snowflake_sample_data.tpch_sf1.customer c
ON o.o_custkey = c.c_custkey
GROUP BY customer_name
ORDER BY total_sales desc limit 5;
""")
Using model gpt-3.5-turbo for 147.5 tokens (approx) Question generated with sql: What are the top 5 customers in terms of total sales? Adding SQL...
连接到数据库¶
这是用于运行生成的 SQL 查询的数据库。这里我们连接到 Snowflake。
输入 [ ]
vn.connect_to_snowflake(account='my-account', username='my-username', password='my-password', database='my-database')
开始提问¶
这里有一个自动反馈循环,当您提出的问题生成了可以执行的 SQL 时,它会自动添加到上下文数据库中并用于训练模型。
输入 [5]
vn.ask("What are the top 10 customers by sales?")
Number of requested results 10 is greater than number of elements in index 1, updating n_results = 1
Using model gpt-3.5-turbo for 167.0 tokens (approx) SELECT c.c_name as customer_name, sum(l.l_extendedprice * (1 - l.l_discount)) as total_sales FROM snowflake_sample_data.tpch_sf1.lineitem l join snowflake_sample_data.tpch_sf1.orders o ON l.l_orderkey = o.o_orderkey join snowflake_sample_data.tpch_sf1.customer c ON o.o_custkey = c.c_custkey GROUP BY customer_name ORDER BY total_sales desc limit 10;
客户名称 | 总销售额 | |
---|---|---|
0 | 客户#000143500 | 6757566.0218 |
1 | 客户#000095257 | 6294115.3340 |
2 | 客户#000087115 | 6184649.5176 |
3 | 客户#000131113 | 6080943.8305 |
4 | 客户#000134380 | 6075141.9635 |
5 | 客户#000103834 | 6059770.3232 |
6 | 客户#000069682 | 6057779.0348 |
7 | 客户#000102022 | 6039653.6335 |
8 | 客户#000098587 | 6027021.5855 |
9 | 客户#000064660 | 5905659.6159 |
Using model gpt-3.5-turbo for 259.25 tokens (approx) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
训练计划¶
训练计划是提取数据库元数据用于训练的好方法。您可以指定要过滤的数据库和模式,它将尝试从查询历史中检索过去的查询,并从信息模式中检索表名和列名。
输入 [6]
training_plan = vn.get_training_plan_snowflake(filter_databases=['SNOWFLAKE_SAMPLE_DATA'], filter_schemas=['TPCH_SF1'])
training_plan
Trying query history Using model gpt-3.5-turbo for 147.25 tokens (approx) Using model gpt-3.5-turbo for 168.25 tokens (approx) Using model gpt-3.5-turbo for 147.0 tokens (approx) Using model gpt-3.5-turbo for 147.25 tokens (approx) Trying INFORMATION_SCHEMA.DATABASES Trying INFORMATION_SCHEMA.COLUMNS for SNOWFLAKE_SAMPLE_DATA
输出[6]
Train on SQL: What are the top 10 customers based on their total sales? Train on SQL: What are the top 10 countries with the highest total sales? Train on SQL: What are the top 5 customers based on their total sales? Train on SQL: What are the top 15 customers based on their total sales? Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 CUSTOMER Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 SUPPLIER Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 LINEITEM Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 PARTSUPP Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 PART Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 ORDERS Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 REGION Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 NATION
输入 [7]
vn.train(plan=training_plan)
提更多问题!¶
现在您拥有了一个完全训练好的模型,您可以继续提问,它将继续学习和改进。
输入 [ ]
vn.ask()