Intelligent Data Querying in Snowflake Cortex
Unlocking the Power of RAG in Data Management
USEReady’s Guide to RAG in Snowflake Cortex
In today’s data-centric world, the ability to efficiently manage and extract insights from vast datasets is critical. Snowflake Cortex, an AI and machine learning (ML) platform within the Snowflake Data Cloud, streamlines this process by integrating ML functionalities directly into data workflows.
In this blog, we explore how Snowflake Cortex enables Retrieval-Augmented Generation (RAG) for intelligent data querying. We’ll cover setting up a database for document management, building a vector store, and leveraging Cortex’s NLP capabilities to enhance querying and analysis. Traditional approaches, which demand complex SQL and deep knowledge of the underlying schema, often exclude non-technical users and make retrieval from large datasets inefficient. Our solution employs RAG techniques and Snowflake Cortex models to simplify querying, improving accuracy and scalability as datasets grow.
Snowflake Cortex – A Quick Intro
Snowflake Cortex is an advanced machine learning (ML) service integrated within the Snowflake Data Cloud, introduced to enhance data processing and analytics capabilities. Snowflake Cortex offers both Large Language Models (LLMs) and Embedding Models, enabling users to perform sophisticated data analysis and generate insights directly within the Snowflake ecosystem.
Large Language Models (LLMs)
LLMs in Snowflake Cortex are designed to understand and generate human-like text based on vast amounts of training data. These models can interpret complex queries, generate natural language responses, and perform a variety of natural language processing (NLP) tasks such as text summarization, translation, and sentiment analysis.
Embedding Models
Embedding models in Snowflake Cortex transform data into numerical representations (embeddings) that capture semantic relationships between different pieces of information. These embeddings are crucial for tasks like similarity search, clustering, and recommendation systems.
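For intuition, embeddings place semantically related text close together, so phrases with similar meaning score high on cosine similarity. A minimal sketch using Cortex’s EMBED_TEXT_768 and Snowflake’s VECTOR_COSINE_SIMILARITY function (the phrases are illustrative):
```sql
-- Semantically related phrases yield a similarity close to 1
SELECT VECTOR_COSINE_SIMILARITY(
         SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'cloud data warehouse'),
         SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'analytics database in the cloud')
       ) AS similarity;
```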
Our Solution Overview
Setting Up the Environment
Step 1
Data Collection and Local Setup: Begin by collecting the documents you intend to process. For this example, assume we have a set of PDFs. We’ll organize these documents and prepare them for processing within Snowflake.
Step 2
Creating a Database and Schema: To manage the documents and associated functions, create a new database and schema in Snowflake:
```sql
CREATE DATABASE CC_QUICKSTART_CORTEX_DOCS;
CREATE SCHEMA DATA;
```
This structure will store the PDFs, the functions that process them, and the table that will hold text embeddings.
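The statements that follow assume this database and schema are the active session context; if you reconnect in a new session, you can restore the context explicitly:
```sql
USE DATABASE CC_QUICKSTART_CORTEX_DOCS;
USE SCHEMA DATA;
```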
Processing and Storing Document Data
Step 3
Table Function for PDF Processing: We need a table function that can read the PDF documents and split them into chunks for further processing. This function uses the PyPDF2 and LangChain Python libraries to parse the PDFs and generate text chunks.
Step 4
Creating a Stage and Uploading Documents: Create a stage for uploading your documents:
```sql
CREATE OR REPLACE STAGE docs
ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
DIRECTORY = (ENABLE = true);
```
Upload the documents to this staging area through the Snowflake UI or CLI by navigating to your database and schema, selecting the “Stages” section, and uploading the PDFs.
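If you prefer the CLI, SnowSQL’s PUT command uploads local files to the stage; the local path below is illustrative. Keeping AUTO_COMPRESS off leaves the PDFs readable as-is, and refreshing the stage makes the new files visible to its directory table:
```sql
-- Upload local PDFs to the stage (path is illustrative)
PUT file:///tmp/docs/*.pdf @docs AUTO_COMPRESS = FALSE;

-- Refresh the directory table so the new files are listed
ALTER STAGE docs REFRESH;
```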
Detailed Implementation for RAG on Cortex
Implementing RAG with Snowflake Cortex for PDFs & Structured Data
Integrating LLMs with structured data and PDFs in Snowflake using Cortex involves a systematic approach to store, process, and utilize document embeddings. Here’s how to achieve this:
Step 1
Creating the Storage Table
The first step is to establish a storage table in Snowflake designed to hold document chunks along with their vector embeddings. This table will serve as the foundation for building a vector store:
```sql
CREATE OR REPLACE TABLE DOCS_CHUNKS_TABLE (
RELATIVE_PATH VARCHAR(16777216), -- Relative path to the PDF file
SIZE NUMBER(38,0), -- Size of the PDF
FILE_URL VARCHAR(16777216), -- URL for the PDF
SCOPED_FILE_URL VARCHAR(16777216), -- Scoped URL for secure access
CHUNK VARCHAR(16777216), -- Text chunk from the PDF
CHUNK_VEC VECTOR(FLOAT, 768) -- 768-dimensional embedding vector
);
```
This table schema includes fields for file paths, sizes, URLs, and embeddings, making it suitable for storing both the metadata and the processed content of PDFs.
Step 2
Processing PDFs and Storing Embeddings
To populate the `DOCS_CHUNKS_TABLE`, you’ll need to process the PDFs to extract text chunks and generate embeddings using Snowflake Cortex. Here’s a step-by-step guide:
- Define a Function for PDF Processing:
A Python user-defined table function (UDTF) can be employed to read each PDF and split its text into chunks; the embeddings are generated from those chunks in the insert step that follows.
Below is an example of such a function:
```sql
CREATE OR REPLACE FUNCTION pdf_text_chunker(file_url STRING)
RETURNS TABLE (chunk VARCHAR)
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
HANDLER = 'pdf_text_chunker'
PACKAGES = ('snowflake-snowpark-python', 'PyPDF2', 'pandas', 'langchain')
AS
$$
from snowflake.snowpark.types import StringType, StructField, StructType
from langchain.text_splitter import RecursiveCharacterTextSplitter
from snowflake.snowpark.files import SnowflakeFile
import pandas as pd
import PyPDF2
import io
import logging
class pdf_text_chunker:
def read_pdf(self, file_url: str) -> str:
logger = logging.getLogger("udf_logger")
logger.info(f"Opening file {file_url}")
with SnowflakeFile.open(file_url, 'rb') as f:
reader = PyPDF2.PdfReader(f)
text = ""
for page in reader.pages:
text += page.extract_text()
return text
def process(self, file_url: str):
text = self.read_pdf(file_url)
text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = 4000,   # adjust chunk size as needed
            chunk_overlap = 400, # overlap preserves context across chunk boundaries
length_function = len
)
chunks = text_splitter.split_text(text)
df = pd.DataFrame(chunks, columns=['chunk'])
yield from df.itertuples(index=False, name=None)
$$;
```
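Before wiring the function into an insert, you can exercise it on a single staged file; the file name here is illustrative:
```sql
SELECT chunk
FROM TABLE(pdf_text_chunker(build_scoped_file_url(@docs, 'sample.pdf')))
LIMIT 5;
```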
- Insert Processed Data into the Storage Table:
The function yields text chunks for each PDF, and `SNOWFLAKE.CORTEX.EMBED_TEXT_768` generates an embedding for each chunk; both are inserted into the `DOCS_CHUNKS_TABLE`.
Here’s how to execute this:
```sql
INSERT INTO DOCS_CHUNKS_TABLE (relative_path, size, file_url, scoped_file_url, chunk, chunk_vec)
SELECT relative_path,
size,
file_url,
       build_scoped_file_url(@docs, relative_path) AS scoped_file_url,
func.chunk AS chunk,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', func.chunk) AS chunk_vec
FROM directory(@docs),
     TABLE(pdf_text_chunker(build_scoped_file_url(@docs, relative_path))) AS func;
```
This query reads documents from the stage’s directory table, processes each PDF to extract chunks, and generates 768-dimensional embeddings for each chunk using Snowflake Cortex.
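As a quick sanity check after the load, you can confirm that each document produced chunks:
```sql
-- One row per document with its chunk count
SELECT relative_path, COUNT(*) AS chunk_count
FROM DOCS_CHUNKS_TABLE
GROUP BY relative_path
ORDER BY chunk_count DESC;
```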
Step 3
Processing Structured Data (CSV Files)
Similar to PDFs, structured data like CSV files can be processed to generate embeddings. Here’s how:
- Define a Function for CSV Processing:
Create a UDF to handle CSV files, extract relevant text, and split it into chunks.
For instance:
```sql
CREATE OR REPLACE FUNCTION csv_text_chunker(file_url STRING, text_column STRING)
RETURNS TABLE (chunk VARCHAR)
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
HANDLER = 'csv_text_chunker'
PACKAGES = ('snowflake-snowpark-python', 'pandas', 'langchain')
AS
$$
import pandas as pd
import logging
from langchain.text_splitter import RecursiveCharacterTextSplitter
from snowflake.snowpark.files import SnowflakeFile
import io
class Reader:
def read_csv(self, file_url: str, text_column: str = None) -> str:
logger = logging.getLogger("udf_logger")
logging.basicConfig(level=logging.INFO)
logger.info(f"Opening file {file_url}")
buffer = SnowflakeFile.open(file_url)
df = pd.read_csv(buffer)
if text_column and text_column in df.columns:
text = " ".join(df[text_column].astype(str).tolist())
else:
text = " ".join(df.select_dtypes(include=['object', 'number']).astype(str).apply(" ".join, axis=1).tolist())
return text
reader = Reader()
class csv_text_chunker:
def process(self, file_url: str, text_column: str):
text = reader.read_csv(file_url, text_column)
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=4000,
chunk_overlap=400,
length_function=len
)
chunks = text_splitter.split_text(text)
df = pd.DataFrame(chunks, columns=['chunk'])
yield from df.itertuples(index=False, name=None)
$$;
```
- Insert Structured Data into the Storage Table:
Use a similar approach to insert processed CSV data into the `DOCS_CHUNKS_TABLE`:
```sql
INSERT INTO DOCS_CHUNKS_TABLE (relative_path, size, file_url, scoped_file_url, chunk, chunk_vec)
SELECT relative_path,
size,
file_url,
build_scoped_file_url(@docs, relative_path) AS scoped_file_url,
func.chunk AS chunk,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', func.chunk) AS chunk_vec
FROM directory(@docs),
     TABLE(csv_text_chunker(build_scoped_file_url(@docs, relative_path), 'text')) AS func;
```
This command processes each CSV file (here assuming the content lives in a column named `text`), extracts chunks of text, and generates the necessary embeddings.
Step 4
Querying and Utilizing the Embedded Data
Once the data is processed and stored, you can query the `DOCS_CHUNKS_TABLE` to retrieve text chunks and their embeddings.
For instance:
```sql
SELECT relative_path, size, chunk, chunk_vec FROM DOCS_CHUNKS_TABLE LIMIT 5;
```
This allows you to integrate the stored embeddings with LLMs for various applications such as document retrieval, summarization, or question-answering tasks using the vector embeddings stored in Snowflake.
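To close the RAG loop, embed the user’s question, retrieve the most similar chunks, and pass them as context to a completion model. A minimal end-to-end sketch against the table populated above; the question text, prompt wording, and LIMIT are illustrative choices:
```sql
WITH question AS (
    SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2',
           'What is the refund policy?') AS qvec
),
top_chunks AS (
    -- Retrieve the three chunks most similar to the question
    SELECT chunk
    FROM DOCS_CHUNKS_TABLE, question
    ORDER BY VECTOR_COSINE_SIMILARITY(chunk_vec, question.qvec) DESC
    LIMIT 3
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'snowflake-arctic',
    'Answer the question using only the context below.\n\nContext:\n' ||
    (SELECT LISTAGG(chunk, '\n\n') FROM top_chunks) ||
    '\n\nQuestion: What is the refund policy?'
) AS answer;
```
Tuning the number of retrieved chunks trades answer completeness against prompt size and cost.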
Snowflake Cortex Strengths
Snowflake Cortex integrates advanced machine learning and AI functionalities within the Snowflake Data Cloud, allowing seamless data and ML workflows. Its headline capability is a set of LLM functions callable directly from SQL, covered below.
Leveraging Large Language Models (LLMs) within Snowflake Cortex
Snowflake Cortex offers several LLM-based functions, available as SQL functions and in Python, to enhance natural language processing capabilities directly within Snowflake:
- COMPLETE: Generates text completions based on a given prompt.
- EMBED_TEXT_768: Produces a 768-dimensional text embedding.
- EMBED_TEXT_1024: Generates a 1024-dimensional text embedding.
- EXTRACT_ANSWER: Extracts answers from unstructured data based on a given question.
- SENTIMENT: Analyzes sentiment, returning a score from -1 (negative) to 1 (positive).
- SUMMARIZE: Provides a summary of the given text.
- TRANSLATE: Translates text between supported languages.
Using the Functions
Text Completion:
```sql
SELECT SNOWFLAKE.CORTEX.COMPLETE('snowflake-arctic', 'What are large language models?');
```
Text Embedding:
```sql
SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'hello world');
```
Answer Extraction:
```sql
SELECT SNOWFLAKE.CORTEX.EXTRACT_ANSWER(review_content, 'What dishes does this review mention?')
FROM reviews LIMIT 10;
```
Sentiment Analysis:
```sql
SELECT SNOWFLAKE.CORTEX.SENTIMENT(review_content), review_content FROM reviews LIMIT 10;
```
Text Summarization:
```sql
SELECT SNOWFLAKE.CORTEX.SUMMARIZE(review_content) FROM reviews LIMIT 10;
```
Text Translation:
```sql
SELECT SNOWFLAKE.CORTEX.TRANSLATE(review_content, 'en', 'de') FROM reviews LIMIT 10;
```
Advantages and Disadvantages of LLM Functions
Advantages
- Integration: Directly within Snowflake, minimizing data movement and enhancing security.
- Scalability: Utilizes Snowflake’s scalable infrastructure.
- Ease of Use: SQL-based interface accessible to users familiar with SQL.
- Efficiency: Streamlines workflows by reducing reliance on external tools.
Disadvantages
- Cost: Extensive use of LLM functions may incur additional costs.
- Latency: Real-time processing can introduce latency for complex tasks.
- Complexity: Requires understanding of model outputs and fine-tuning.
- Model Limitations: Performance varies based on task and data quality.
Conclusion
Harnessing Snowflake Cortex’s RAG capabilities transforms intelligent data querying within the Snowflake ecosystem. By combining functions like text completion, embedding, and sentiment analysis, users can perform advanced NLP tasks without leaving their data platform, driving more effective data-driven decisions and operational efficiency. Applications range from automating customer support and enhancing data analysis to improving multilingual communication. This approach simplifies user interaction, improves query accuracy, and scales efficiently as data grows, demonstrating how AI-driven methods can streamline data access and analysis.
Additional Resources: Snowflake Cortex Documentation