Advanced Analytics, Artificial Intelligence
Extracting Insights from Sports News Articles Using Snowflake, llama-index, and OpenAI

In today’s world, staying updated with the latest news and information is crucial. One of the best sources for real-time news is the internet, where we can access a wide range of articles, blogs, and other content. However, manually sifting through all this information can be a daunting task. This is where the power of large language models comes into play.
In this blog post, we will show you how to connect to Snowflake, a cloud-based data warehousing platform, along with OpenAI’s LLM (Large Language Model) and the llama-index library, to efficiently query and analyze sports news articles from 2022. Alternatively, you can load up any data source from any database as well as point llama-index to a collection of PDFs, flat files, JSON, etc.
Setting the Stage
Before we dive into the technical details, let’s outline the components we’ll be using for this task:
- Snowflake Database: Snowflake is a cloud-based data warehousing platform that allows us to store and query large datasets efficiently.
- OpenAI’s LLM: OpenAI provides a powerful language model that can understand and generate human-like text, which is crucial for natural language processing tasks.
- llama-index: llama-index is a library that facilitates the creation of a VectorStoreIndex. This index enables us to ask natural language questions of a collection of articles and obtain relevant answers.
The Python Script
Here’s a Python script that demonstrates the entire process:
from __future__ import absolute_import import os import openai import logging import sys import llama_index import pandas import snowflake.connector from llama_index.schema import Document from llama_index.readers.database import DatabaseReader from llama_index import VectorStoreIndex from llama_index import download_loader from llama_index import StorageContext, load_index_from_storage ## either one of these works to set your OPENAI key os.environ["OPENAI_API_KEY"] = "your_key" openai.api_key = "your_key" ## set up connection to Snowflake con = snowflake.connector.connect( user='username', password='password', account='snowflakeaccount.us-east-1', database='db', schema='ARTICLE_WAREHOUSE' ) ## query the articles we want to add to our index cs = con.cursor() sql = "SELECT article_text FROM articles WHERE article_text <> '' AND article_text IS NOT NULL and article_text <> 'None'" cs.execute(sql) df = cs.fetch_pandas_all() # Create an empty list documents =[] # Iterate over each row for index, rows in df.iterrows(): # append the list to the final list documents.append(Document(text=rows.ARTICLE_TEXT)) index = VectorStoreIndex.from_documents(documents) # Persist index to disk for later index.storage_context.persist("article_index") # Rebuild storage context, this is only necessary to load from disk the next time you want to run queries # storage_context = StorageContext.from_defaults(persist_dir="article_index") # Load index from the storage context # new_index = load_index_from_storage(storage_context) # new_query_engine = new_index.as_query_engine() ## Ask any questions you'd like and get a response query_engine = index.as_query_engine() response = query_engine.query("Which soccer team is the most successful in the United States?") print(textwrap.fill(str(response), 100))
The Process Explained
Now, let’s break down the steps of this script:
- We set our OpenAI API key to access OpenAI’s LLM. This key allows us to make natural language queries to the model.
- We establish a connection to the Snowflake database, providing the necessary credentials, account details, database name, and schema.
- We retrieve a collection of sports news articles from the Snowflake database and create a list of documents.
- We use these documents to create a VectorStoreIndex using llama-index. This index organizes and stores the information, making it ready for querying.
- The VectorStoreIndex is persisted to disk, ensuring that we can access it later without reprocessing the data.
- We load the index from storage and set up the query engine.
- Finally, we pose a natural language question to the query engine and retrieve the response.
Below is a video of the query engine in action via a Shiny for Python app.
Conclusion
We’ve explored an approach that utilizes Snowflake, OpenAI’s LLM, and llama-index to query and analyze sports news articles from 2022. This process allows us to efficiently extract valuable insights and answers from a vast collection of articles. The combination of cloud-based data warehousing, language models, and indexing libraries offers a powerful solution for information retrieval and analysis.
With this approach, you can adapt and extend the script to perform various tasks, from analyzing trends in your own internal data to answering specific questions about different domains. The future of LLMs is certainly exciting and we’re looking forward to seeing all of the technological advances made with them.
•