This quickstart guides users through processing XML data using Snowflake's Snowpark XML.
Snowflake Notebooks and a US Treasury dataset downloaded from Investments of Federal Government Accounts in Federal Securities are used to demonstrate Snowpark XML capabilities.
Snowpark XML enables scalable, Spark-like processing of XML files using Snowpark's Python API and UDTF-based backend.
Users can leverage Snowpark XML capabilities to:
rowTag
.session.read.option(...).xml(...)
).rowTag
.VARIANT
, enabling easier querying.session.read.option(...).xml(...)
).This solution harnesses the power of Snowflake Snowpark XML to load, extract and render financial data in XML format.
Snowflake Notebooks serve as our interactive workspace. Through the skillful application of advanced Snowpark XML capabilities, we demonstrate how to process XML to support multiple use cases in financial services, including asset valuation, risk analysis, or regulatory reporting.
Login to your Snowflake account. Navigate to Worksheets, click "+" in the top-right corner to create a new Worksheet and choose "SQL Worksheet". Name it "Snowpark XML Processing - Setup".
Paste the following block to your worksheet.
USE ROLE ACCOUNTADMIN;
CREATE DATABASE IF NOT EXISTS XML_PROCESSING;
USE DATABASE XML_PROCESSING;
CREATE SCHEMA IF NOT EXISTS NOTEBOOKS;
CREATE WAREHOUSE IF NOT EXISTS XML_WH WAREHOUSE_SIZE=XSMALL INITIALLY_SUSPENDED=TRUE;
USE WAREHOUSE XML_WH;
CREATE FILE FORMAT IF NOT EXISTS my_xml_format
TYPE = XML
COMPRESSION = NONE;
CREATE STAGE IF NOT EXISTS XML_PROCESSING.DATA.XMLDATA_STAGE
DIRECTORY = (ENABLE = TRUE)
FILE_FORMAT = my_xml_format
ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
You will create Snowflake objects (database, schema, warehouse) executing the above SQL block from your worksheet.
You can execute all at once " > Run All" or, one at the time, by selecting each statement and clicking on the "Run" button.
We are using the Investments of Federal Government Accounts in Federal Securities dataset from the US Treasury.
To download it, scroll down to "Data Preview", set the following and click "Download XML file":
A file "MTS_AgcyInvest_20150331_20250331.xml" (13.7MB) will be downloaded to your local "Downloads" folder.
Login to your Snowflake Account and from the left navigation menu select Data / Database "XML_PROCESSING" database, "DATA" schema and "XMLDATA_STAGE" stage
Click "+Files" in the top-right corner and "Browse" then select the downloaded file from your local /Downloads folder
Hit "Upload" and the file will be uploaded to the internal stage.
Select "Projects" / "Notebooks" from the left navigation menu. Click "+ Notebook" from the top-right corner.
Complete the opened form:
Click "Create" and a default notebook with 3 cells will be created. Remove those three cells from the right corner of each cell by choosing "Delete". Repeat it for each cell.
Click "+ Markdown" and paste to the cell the following code block:
# Loading XML Data from internal stage
Efficient ingestion of large or multiple XML files using rowTag.
Click "+ Python" and paste to the cell the following code block:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
Click "+ Python" and paste to the cell the following code block:
stage_file_path = f'@XML_PROCESSING.DATA.XMLDATA_STAGE/MTS_AgcyInvest_20150331_20250331.xml'
df = session.read.option("rowTag", "row").xml(stage_file_path)
# df.show()
df.write.mode("overwrite").save_as_table("XML_PROCESSING.DATA.MTS_AgcyInvest")
# print(df.queries)
Click "+ SQL" and paste to the cell the following code block:
SELECT COUNT(*) FROM XML_PROCESSING.DATA.MTS_AgcyInvest;
This returns the total number of rows loaded.
This quickstart is focused on how to load XML data. Once loaded, XML data can be also processed with Snowpark or SQL XML functions like CHECK_XML, PARSE_XML, XMLGET, TO_XML. The output could be data ready for regulatory reporting.
Click "+ SQL" and paste to the cell the following code block:
SELECT
GET("'data-element'", 'record_calendar_month')::STRING AS record_calendar_month,
GET("'data-element'", 'record_calendar_day')::STRING AS record_calendar_day,
GET("'data-element'", 'begin_year_acct_bal_amt')::STRING AS begin_year_acct_bal_amt,
GET("'data-element'", 'begin_month_acct_bal_amt')::STRING AS begin_month_acct_bal_amt,
GET("'data-element'", 'close_month_acct_bal_amt')::STRING AS close_month_acct_bal_amt ,
GET("'data-element'", 'current_month_net_txn_amt')::STRING AS current_month_net_txn_amt,
GET("'data-element'", 'data_type_cd')::STRING AS data_type_cd,
GET("'data-element'", 'fytd_net_txn_amt')::STRING AS fytd_net_txn_amt,
GET("'data-element'", 'line_code_nbr')::STRING AS line_code_nbr,
GET("'data-element'", 'parent_id')::STRING AS parent_id,
GET("'data-element'", 'print_order_nbr')::STRING AS print_order_nbr,
GET("'data-element'", 'prior_fytd_net_txn_amt')::STRING AS prior_fytd_net_txn_amt,
GET("'data-element'", 'sequence_level_nbr')::STRING AS sequence_level_nbr,
GET("'data-element'", 'sequence_number_cd')::STRING AS sequence_number_cd,
GET("'data-element'", 'src_line_nbr')::STRING AS src_line_nbr,
GET("'data-element'", 'table_nbr')::STRING AS table_nbr
FROM XML_PROCESSING.DATA.MTS_AgcyInvest
WHERE GET("'data-element'", 'classification_desc') = 'Federal Housing Finance Agency'
AND GET("'data-element'", 'record_calendar_year') = '2016'
ORDER BY GET("'data-element'", 'record_date') ASC;
You got a notebook!!! Before running the notebook, install the latest snowpark-python package. In the top-right corner, click "Packages" and in "Find Packages" search window, type snowflake-snowpark-python" and choose "Latest".
You can run all cells using "Run all" from the top-right corner. You can run cells, one by one, using "run this cell only" for each cell.
You did it! You have successfully completed the "Modernize XML Processing for Financial Services with Snowpark XML" quickstart with notebooks and Snowpark XML.
session.read.option(...).xml(...)
).