在本地上下文数据库中使用 Vanna¶

通常使用我们的托管服务更容易开始使用 Vanna。但是，如果您想在本地运行 Vanna，可以通过使用 ChromaDB 运行本地上下文数据库来实现。本 Notebook 将引导您完成相关步骤。

输入 [ ]

%pip install 'vanna[chromadb,snowflake,openai]'

输入 [1]

from vanna.local import LocalContext_OpenAI

配置 OpenAI API 密钥¶

如果您使用的是 Azure OpenAI，除了 API 密钥之外，还需要指定一些额外的参数

输入 [2]

# Use the regular OpenAI API
vn = LocalContext_OpenAI({"api_key": "sk-..."})

# Use the Azure OpenAI API
vn = LocalContext_OpenAI(config={
    "api_type": "azure",
    "api_base": "https://...",
    "api_version": "2023-05-15",
    "engine": "YOUR_ENGINE_HERE",
    "api_key": "sk-..."
})

使用示例 SQL 查询进行训练¶

最简单的入门方法是粘贴一个 SQL 查询，让 Vanna 基于它进行训练。这将创建一个新的上下文数据库，并在其上训练一个模型。然后您可以使用该模型生成 SQL 查询。

输入 [3]

vn.train(sql="""
SELECT c.c_name as customer_name,
       sum(l.l_extendedprice * (1 - l.l_discount)) as total_sales
FROM   snowflake_sample_data.tpch_sf1.lineitem l join snowflake_sample_data.tpch_sf1.orders o
        ON l.l_orderkey = o.o_orderkey join snowflake_sample_data.tpch_sf1.customer c
        ON o.o_custkey = c.c_custkey
GROUP BY customer_name
ORDER BY total_sales desc limit 5;
""")

Using model gpt-3.5-turbo for 147.5 tokens (approx)
Question generated with sql: What are the top 5 customers in terms of total sales? 
Adding SQL...

连接到数据库¶

这是用于运行生成的 SQL 查询的数据库。这里我们连接到 Snowflake。

输入 [ ]

vn.connect_to_snowflake(account='my-account', username='my-username', password='my-password', database='my-database')

开始提问¶

这里有一个自动反馈循环，当您提出的问题生成了可以执行的 SQL 时，它会自动添加到上下文数据库中并用于训练模型。

输入 [5]

vn.ask("What are the top 10 customers by sales?")

Number of requested results 10 is greater than number of elements in index 1, updating n_results = 1

Using model gpt-3.5-turbo for 167.0 tokens (approx)
SELECT c.c_name as customer_name,
       sum(l.l_extendedprice * (1 - l.l_discount)) as total_sales
FROM   snowflake_sample_data.tpch_sf1.lineitem l join snowflake_sample_data.tpch_sf1.orders o
        ON l.l_orderkey = o.o_orderkey join snowflake_sample_data.tpch_sf1.customer c
        ON o.o_custkey = c.c_custkey
GROUP BY customer_name
ORDER BY total_sales desc limit 10;

	客户名称	总销售额
0	客户#000143500	6757566.0218
1	客户#000095257	6294115.3340
2	客户#000087115	6184649.5176
3	客户#000131113	6080943.8305
4	客户#000134380	6075141.9635
5	客户#000103834	6059770.3232
6	客户#000069682	6057779.0348
7	客户#000102022	6039653.6335
8	客户#000098587	6027021.5855
9	客户#000064660	5905659.6159

Using model gpt-3.5-turbo for 259.25 tokens (approx)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

No description has been provided for this image

训练计划¶

训练计划是提取数据库元数据用于训练的好方法。您可以指定要过滤的数据库和模式，它将尝试从查询历史中检索过去的查询，并从信息模式中检索表名和列名。

输入 [6]

training_plan = vn.get_training_plan_snowflake(filter_databases=['SNOWFLAKE_SAMPLE_DATA'], filter_schemas=['TPCH_SF1'])
training_plan

Trying query history
Using model gpt-3.5-turbo for 147.25 tokens (approx)
Using model gpt-3.5-turbo for 168.25 tokens (approx)
Using model gpt-3.5-turbo for 147.0 tokens (approx)
Using model gpt-3.5-turbo for 147.25 tokens (approx)
Trying INFORMATION_SCHEMA.DATABASES
Trying INFORMATION_SCHEMA.COLUMNS for SNOWFLAKE_SAMPLE_DATA

输出[6]

Train on SQL:  What are the top 10 customers based on their total sales?
Train on SQL:  What are the top 10 countries with the highest total sales?
Train on SQL:  What are the top 5 customers based on their total sales?
Train on SQL:  What are the top 15 customers based on their total sales?
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 CUSTOMER
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 SUPPLIER
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 LINEITEM
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 PARTSUPP
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 PART
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 ORDERS
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 REGION
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 NATION

输入 [7]

vn.train(plan=training_plan)

提更多问题！¶

现在您拥有了一个完全训练好的模型，您可以继续提问，它将继续学习和改进。

输入 [ ]

vn.ask()