Harnessing Advanced RAG for Intelligent Data Querying within Snowflake Cortex

Blog | September 6, 2024 | By Alagappan Ramanathan, Sainadh Thikkireddi

In today’s data-centric world, the ability to efficiently manage and extract insights from vast datasets is critical. Snowflake Cortex, an AI and machine learning (ML) platform within the Snowflake Data Cloud, streamlines this process by integrating ML functionalities directly into data workflows.  

In this blog, we explore how Snowflake Cortex's advanced Retrieval-Augmented Generation (RAG) enables intelligent data querying. We'll cover setting up a database for document management, building a vector store, and leveraging Cortex's NLP capabilities to enhance data querying and analysis. Traditional approaches, which demand complex SQL queries and deep database knowledge, often shut out non-technical users and make retrieval from large datasets inefficient. Our solution employs RAG techniques and Snowflake Cortex models to simplify querying, improving accuracy and scalability as datasets grow.



Snowflake Cortex – A Quick Intro

Snowflake Cortex is an advanced machine learning (ML) service integrated within the Snowflake Data Cloud, introduced to enhance data processing and analytics capabilities. It offers both Large Language Models (LLMs) and embedding models, enabling users to perform sophisticated data analysis and generate insights directly within the Snowflake ecosystem.


Large Language Models (LLMs)

LLMs in Snowflake Cortex are designed to understand and generate human-like text based on vast amounts of training data. These models can interpret complex queries, generate natural language responses, and perform a variety of natural language processing (NLP) tasks such as text summarization, translation, and sentiment analysis. 

Advantages of LLMs

  • Contextual Understanding: LLMs can comprehend the context of queries, providing more accurate and relevant responses.
  • Automation: They can automate tasks such as report generation, data annotation, and content creation.
  • User Engagement: They enhance user interfaces with conversational AI, making data interaction more intuitive.

Sample Use Cases

  • Customer Support: Automating responses to customer queries based on historical interaction data.
  • Content Generation: Creating reports or summaries from raw data automatically.
  • Data Insights: Interpreting complex data patterns and trends in natural language for non-technical users.


Embedding Models

Embedding models in Snowflake Cortex transform data into numerical representations (embeddings) that capture semantic relationships between different pieces of information. These embeddings are crucial for tasks like similarity search, clustering, and recommendation systems. 
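
As a quick illustration of embedding-based similarity, the snippet below compares two short phrases (both made up for this example) using Snowflake's VECTOR_COSINE_SIMILARITY function; semantically related texts score closer to 1:

```sql
-- Compare two ad-hoc phrases by cosine similarity of their embeddings.
SELECT VECTOR_COSINE_SIMILARITY(
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'canceled my order'),
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'requested a refund')
       ) AS similarity;
```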

Advantages of Embedding Models

  • Efficient Data Retrieval: Enable quick and relevant searches through large datasets by comparing embeddings.
  • Enhanced Analysis: Facilitate clustering and grouping of similar data points, aiding in pattern recognition.
  • Personalization: Improve recommendation systems by understanding user preferences and behaviors through embeddings.

Sample Use Cases

  • Recommendation Engines: Suggesting products or content to users based on their past interactions and preferences.
  • Document Search: Enhancing search capabilities in document databases by finding semantically similar documents.
  • Fraud Detection: Identifying anomalous patterns in transaction data to detect fraudulent activities.


Our Solution Overview

Setting Up the Environment

Step 1

Data Collection and Local Setup: Begin by collecting the documents you intend to process. For this example, assume we have a set of PDFs. We’ll organize these documents and prepare them for processing within Snowflake. 

Step 2

Creating a Database and Schema: To manage the documents and associated functions, create a new database and schema in Snowflake:

```sql
CREATE DATABASE CC_QUICKSTART_CORTEX_DOCS;
CREATE SCHEMA DATA;
``` 

This structure will store the PDFs, the functions that process them, and the table that will hold text embeddings. 
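
The statements in the rest of this walkthrough assume this database and schema are the active session context. If you reconnect in a new session, you can restore the context explicitly, as in this small convenience sketch:

```sql
USE DATABASE CC_QUICKSTART_CORTEX_DOCS;
USE SCHEMA DATA;
```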

Processing and Storing Document Data 

Step 3

Table Function for PDF Processing: We need a table function that reads the PDF documents and splits them into chunks for further processing. This function uses the PyPDF2 and Langchain Python libraries to parse the PDFs and generate text chunks; its full definition appears in the detailed implementation section below.

Step 4

Creating a Stage and Uploading Documents: Create a stage for uploading your documents: 

```sql
CREATE OR REPLACE STAGE docs
ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
DIRECTORY = (ENABLE = true);
```

Upload the documents to this staging area through the Snowflake UI or CLI by navigating to your database and schema, selecting the “Stages” section, and uploading the PDFs.
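
If you prefer the command line, the upload from SnowSQL might look like the sketch below; the local file path is an illustrative assumption:

```sql
-- Upload local PDFs to the internal stage (path is illustrative).
PUT file:///tmp/docs/*.pdf @docs AUTO_COMPRESS = FALSE;
-- Refresh the stage's directory table so DIRECTORY(@docs) sees the new files.
ALTER STAGE docs REFRESH;
```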


Detailed Implementation for RAG on Cortex 

Implementing RAG with Snowflake Cortex for PDFs & Structured Data 

Integrating LLMs with structured data and PDFs in Snowflake using Cortex involves a systematic approach to store, process, and utilize document embeddings. Here’s how to achieve this:

Step 1

Creating the Storage Table 

The first step is to establish a storage table in Snowflake designed to hold document chunks along with their vector embeddings. This table will serve as the foundation for building a vector store:

```sql
CREATE OR REPLACE TABLE DOCS_CHUNKS_TABLE ( 
    RELATIVE_PATH VARCHAR(16777216), -- Relative path to the PDF file 
    SIZE NUMBER(38,0),               -- Size of the PDF 
    FILE_URL VARCHAR(16777216),      -- URL for the PDF 
    SCOPED_FILE_URL VARCHAR(16777216), -- Scoped URL for secure access 
    CHUNK VARCHAR(16777216),         -- Text chunk from the PDF 
    CHUNK_VEC VECTOR(FLOAT, 768)     -- 768-dimensional embedding vector 
);
```

This table schema includes fields for file paths, sizes, URLs, and embeddings, making it suitable for storing both the metadata and the processed content of PDFs. 

Step 2

Processing PDFs and Storing Embeddings 

To populate the `DOCS_CHUNKS_TABLE`, you’ll need to process the PDFs to extract text chunks and generate embeddings using Snowflake Cortex. Here’s a step-by-step guide: 

  • Define a Function for PDF Processing:
    A Python user-defined table function (UDTF) can be employed to read each PDF and split its text into chunks; the embeddings are added later, at insert time.

Below is an example of such a function:

  ```sql
  CREATE OR REPLACE FUNCTION pdf_text_chunker(file_url STRING)
  RETURNS TABLE (chunk VARCHAR)
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.9'
  HANDLER = 'pdf_text_chunker' 
  PACKAGES = ('snowflake-snowpark-python', 'PyPDF2', 'pandas', 'langchain')
  AS 
  $$ 
  from langchain.text_splitter import RecursiveCharacterTextSplitter
  from snowflake.snowpark.files import SnowflakeFile
  import pandas as pd
  import PyPDF2
  import logging
 
  class pdf_text_chunker: 

      def read_pdf(self, file_url: str) -> str:
          logger = logging.getLogger("udf_logger")
          logger.info(f"Opening file {file_url}") 

          with SnowflakeFile.open(file_url, 'rb') as f:
              reader = PyPDF2.PdfReader(f)
              text = ""
              for page in reader.pages:
                  text += page.extract_text() or ""  # extract_text() can return None

          return text

      def process(self, file_url: str):
          text = self.read_pdf(file_url)           

          text_splitter = RecursiveCharacterTextSplitter(
              chunk_size = 4000,     # adjust the chunk size as needed
              chunk_overlap = 400,   # overlap preserves context across chunk boundaries
              length_function = len
          )

          chunks = text_splitter.split_text(text)
          df = pd.DataFrame(chunks, columns=['chunk'])          

          yield from df.itertuples(index=False, name=None)

  $$;
  ```
  • Insert Processed Data into the Storage Table:
    The function splits each PDF into text chunks; the embeddings are generated at insert time with SNOWFLAKE.CORTEX.EMBED_TEXT_768 and written to the `DOCS_CHUNKS_TABLE`.

Here’s how to execute this:

  ```sql
  INSERT INTO DOCS_CHUNKS_TABLE (relative_path, size, file_url, scoped_file_url, chunk, chunk_vec)
      SELECT relative_path,
             size,
             file_url,
             build_scoped_file_url(@docs, relative_path) AS scoped_file_url,
             func.chunk AS chunk,
             SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', func.chunk) AS chunk_vec
      FROM directory(@docs),
           TABLE(pdf_text_chunker(build_scoped_file_url(@docs, relative_path))) AS func;
  ```

This query reads documents from the specified directory, processes each PDF to extract chunks, and generates 768-dimensional embeddings for each chunk using Snowflake Cortex. 
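
As an optional sanity check, you can verify how many chunks each document produced before moving on:

```sql
-- Chunks per source document; empty or unparsed files stand out here.
SELECT relative_path, COUNT(*) AS n_chunks
FROM DOCS_CHUNKS_TABLE
GROUP BY relative_path
ORDER BY n_chunks DESC;
```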

Step 3

Processing Structured Data (CSV Files) 

Similar to PDFs, structured data like CSV files can be processed to generate embeddings. Here’s how: 

  • Define a Function for CSV Processing:
    Create a Python table function (UDTF) to read CSV files, extract the relevant text, and split it into chunks.

For instance:

  ```sql
  CREATE OR REPLACE FUNCTION csv_text_chunker(file_url STRING, text_column STRING)
  RETURNS TABLE (chunk VARCHAR)
  LANGUAGE PYTHON 
  RUNTIME_VERSION = '3.9' 
  HANDLER = 'csv_text_chunker'
  PACKAGES = ('snowflake-snowpark-python', 'pandas', 'langchain')
  AS
  $$
  import pandas as pd
  import logging
  from langchain.text_splitter import RecursiveCharacterTextSplitter
  from snowflake.snowpark.files import SnowflakeFile

  class Reader:
      def read_csv(self, file_url: str, text_column: str = None) -> str:
          logger = logging.getLogger("udf_logger")
          logging.basicConfig(level=logging.INFO)
          logger.info(f"Opening file {file_url}")
           

          buffer = SnowflakeFile.open(file_url)
          df = pd.read_csv(buffer)           

          if text_column and text_column in df.columns:
              text = " ".join(df[text_column].astype(str).tolist())
          else:
              text = " ".join(df.select_dtypes(include=['object', 'number']).astype(str).apply(" ".join, axis=1).tolist()) 
          return text 

  reader = Reader()

  class csv_text_chunker:

      def process(self, file_url: str, text_column: str):
          text = reader.read_csv(file_url, text_column)

          text_splitter = RecursiveCharacterTextSplitter(
              chunk_size=4000,
              chunk_overlap=400,
              length_function=len
          )

          chunks = text_splitter.split_text(text) 

          df = pd.DataFrame(chunks, columns=['chunk'])

          yield from df.itertuples(index=False, name=None)
  $$;
  ```
  • Insert Structured Data into the Storage Table:
    Use a similar approach to insert processed CSV data into the `DOCS_CHUNKS_TABLE`:
```sql
INSERT INTO DOCS_CHUNKS_TABLE (relative_path, size, file_url, scoped_file_url, chunk, chunk_vec)
      SELECT relative_path,
             size,
             file_url,
             build_scoped_file_url(@docs, relative_path) AS scoped_file_url,
             func.chunk AS chunk,
             SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', func.chunk) AS chunk_vec
      FROM directory(@docs),
           TABLE(csv_text_chunker(build_scoped_file_url(@docs, relative_path), 'text')) AS func;
```

This command processes each CSV file, extracts chunks of text, and generates the necessary embeddings. The second argument ('text' here) names the CSV column containing the free text, so substitute your own column name.

Step 4

Querying and Utilizing the Embedded Data 

Once the data is processed and stored, you can query the `DOCS_CHUNKS_TABLE` to retrieve text chunks and their embeddings.

For instance:

```sql
SELECT relative_path, size, chunk, chunk_vec FROM DOCS_CHUNKS_TABLE LIMIT 5;
```

This allows you to integrate the stored embeddings with LLMs for various applications such as document retrieval, summarization, or question-answering tasks using the vector embeddings stored in Snowflake.
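
To close the RAG loop, retrieval and generation can be combined in a single statement. The sketch below is illustrative rather than a verbatim pipeline from this setup: the question text and the 'mistral-7b' model choice are assumptions. It embeds the question, ranks chunks with VECTOR_COSINE_SIMILARITY, and passes the top matches to COMPLETE:

```sql
-- Minimal RAG sketch: embed the question, retrieve the closest chunks,
-- and ask an LLM to answer from that context only.
WITH question AS (
    SELECT 'What topics do these documents cover?' AS q_text,
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2',
               'What topics do these documents cover?') AS q_vec
),
top_chunks AS (
    SELECT d.chunk, q.q_text
    FROM DOCS_CHUNKS_TABLE d, question q
    ORDER BY VECTOR_COSINE_SIMILARITY(d.chunk_vec, q.q_vec) DESC
    LIMIT 3
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
           'mistral-7b',
           'Answer using only the context below.\nContext:\n' ||
           LISTAGG(chunk, '\n---\n') || '\nQuestion: ' || MAX(q_text)
       ) AS answer
FROM top_chunks;
```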


Snowflake Cortex Strengths

Snowflake Cortex integrates advanced machine learning and AI functionalities within the Snowflake Data Cloud, allowing seamless data and ML workflows. Its key features include:

  • Unified Data and ML Platform: Combines data storage, processing, and ML capabilities.
  • Scalability: Leverages Snowflake's cloud infrastructure for scalable operations.
  • Integrated ML Workflows: Integrates with popular ML frameworks for comprehensive ML model lifecycle management.
  • Security and Governance: Ensures data privacy and compliance within the ML lifecycle.


Leveraging Large Language Models (LLMs) within Snowflake Cortex 

Snowflake Cortex offers several LLM-based functions, available as SQL functions and in Python, to enhance natural language processing capabilities directly within Snowflake: 

  • COMPLETE: Generates text completions based on a given prompt.
  • EMBED_TEXT_768: Produces a 768-dimensional text embedding.
  • EMBED_TEXT_1024: Generates a 1024-dimensional text embedding.
  • EXTRACT_ANSWER: Extracts answers from unstructured data based on a given question.
  • SENTIMENT: Analyzes sentiment, returning a score from -1 (negative) to 1 (positive).
  • SUMMARIZE: Provides a summary of the given text.
  • TRANSLATE: Translates text between supported languages.

Using the Functions

Text Completion:

```sql
SELECT SNOWFLAKE.CORTEX.COMPLETE('snowflake-arctic', 'What are large language models?');
```

Text Embedding: 

```sql
SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'hello world');
``` 

Answer Extraction: 

```sql
SELECT SNOWFLAKE.CORTEX.EXTRACT_ANSWER(review_content, 'What dishes does this review mention?')
FROM reviews LIMIT 10;
```

Sentiment Analysis:

```sql
SELECT SNOWFLAKE.CORTEX.SENTIMENT(review_content), review_content FROM reviews LIMIT 10;
```

Text Summarization:

```sql
SELECT SNOWFLAKE.CORTEX.SUMMARIZE(review_content) FROM reviews LIMIT 10;
```

Text Translation:

```sql
SELECT SNOWFLAKE.CORTEX.TRANSLATE(review_content, 'en', 'de') FROM reviews LIMIT 10;
```

Advantages and Disadvantages of LLM Functions 

Advantages

  • Integration: Directly within Snowflake, minimizing data movement and enhancing security.
  • Scalability: Utilizes Snowflake’s scalable infrastructure.
  • Ease of Use: SQL-based interface accessible to users familiar with SQL.
  • Efficiency: Streamlines workflows by reducing reliance on external tools.

Disadvantages

  • Cost: Extensive use of LLM functions may incur additional costs.
  • Latency: Real-time processing can introduce latency for complex tasks.
  • Complexity: Requires understanding of model outputs and fine-tuning.
  • Model Limitations: Performance varies based on task and data quality.

Conclusion

Harnessing Snowflake Cortex's advanced RAG capabilities transforms intelligent querying of both structured and unstructured data within the Snowflake ecosystem. By integrating functions like text completion, embedding, and sentiment analysis, users can seamlessly perform advanced NLP tasks, driving more effective data-driven decisions and operational efficiencies. Applications range from automating customer support and enhancing data analysis to improving multilingual communication. This approach simplifies user interaction, improves query accuracy, and scales efficiently, demonstrating the potential of AI-driven methods to streamline data access and analysis.

Additional Resources: Snowflake Cortex Documentation

Alagappan Ramanathan
About the Author
Senior Data Scientist with over 5 years of experience, specializing in AutoML, NLP, NLG, Computer Vision, Large Language Models (LLMs), and Gen AI. Recognized for implementing multiple data science solutions to drive innovation. Proven leadership in steering data science initiatives across Pharma, Finance, and Retail sectors, utilizing advanced techniques for strategic insights. Proficient in cloud platforms like AWS and Azure, ensuring seamless project deployments. Demonstrated ability to leverage the power of LLMs for comprehensive and impactful data science solutions. Committed to pushing the boundaries of innovation through a holistic approach that integrates both proprietary and open-source technologies.
Alagappan Ramanathan, Sr. Data Scientist – Decision Intelligence | USEReady
Sainadh Thikkireddi
About the Author
Sainadh Thikkireddi is a data scientist with over 4 years of experience crafting AI solutions, from OCR engines to NLP-enabled RPA and anomaly detection. His journey so far has spanned diverse projects, bringing him a wealth of experience in LLMs, MLOps, AWS, and other tech on the bleeding edge of data innovation. Sainadh thrives on exploring the wild frontiers of AI and making intricate ideas accessible to others.
Sainadh Thikkireddi, ML Engineer – Decision Intelligence | USEReady