
Conversation


@samredai samredai commented Sep 22, 2025

Summary

BaseQueryServiceClient

Issue #1499 describes the motivation behind this PR. Instead of requiring a query service that matches a REST specification, this offers an injection point for custom implementations embedded within the main DJ server. The query service client was already an injected dependency, but this adds an abstract base class for query clients that the existing query service client wraps. Other query client implementations can likewise extend the base class and be injected for use within the main server. Some examples of what can be done with this (a hedged sketch follows the list):

  • A query client that directly calls a vendor database (this PR adds an implementation that uses snowflake-connector-python, for example)
  • A query client that calls a DJ REST spec query service (the original and now default configuration)
  • A query client that calls some other non-DJ REST spec query service
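
To make the injection point concrete, here is a minimal hedged sketch of a custom implementation; the get_columns_for_table method name and signature are illustrative assumptions (the thread below confirms only that every method takes an optional request_headers):

from abc import ABC, abstractmethod
from typing import Optional

class BaseQueryServiceClient(ABC):
    """Abstract base class that all query client implementations extend."""

    @abstractmethod
    def get_columns_for_table(  # hypothetical method name
        self,
        catalog: str,
        schema: str,
        table: str,
        request_headers: Optional[dict] = None,
    ) -> list:
        """Return column names and types for a table."""

class MyVendorQueryClient(BaseQueryServiceClient):
    """A custom client that could be injected in place of the HTTP one."""

    def get_columns_for_table(self, catalog, schema, table, request_headers=None):
        # Call the vendor database directly and convert its column
        # types to DJ column type objects.
        raise NotImplementedError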

Query Client Configuration

This PR also adds a configuration framework for the query client. The implementation is chosen using a specific string value for QueryClientConfig.type, or provided via the environment variable QUERY_CLIENT__TYPE. The other parameters required by that specific client type can also be passed in via environment variables. For example, here's a Snowflake configuration (a sketch of a matching settings object follows the exports):

export QUERY_CLIENT__TYPE=snowflake
export QUERY_CLIENT__CONNECTION__ACCOUNT=FOOACCOUNT
export QUERY_CLIENT__CONNECTION__USER=$SNOWFLAKE_USER
export QUERY_CLIENT__CONNECTION__PASSWORD=$SNOWFLAKE_PASSWORD
export QUERY_CLIENT__CONNECTION__WAREHOUSE=FOOWAREHOUSE
export QUERY_CLIENT__CONNECTION__DATABASE=FOODB
export QUERY_CLIENT__CONNECTION__SCHEMA=FOOTABLE
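
For context, here is a minimal sketch of how such double-underscore environment variables could map onto a nested settings object, assuming pydantic-settings; the actual config classes in this PR may differ:

from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class SnowflakeConnection(BaseModel):
    account: str = ""
    user: str = ""
    password: str = ""
    warehouse: str = ""
    database: str = ""
    # A "schema" field is omitted here; that name needs special handling
    # because it can clash with pydantic internals.

class QueryClientConfig(BaseSettings):
    # QUERY_CLIENT__TYPE selects the implementation, and
    # QUERY_CLIENT__CONNECTION__* fills the nested connection model.
    model_config = SettingsConfigDict(
        env_prefix="QUERY_CLIENT__",
        env_nested_delimiter="__",
    )
    type: str = "http"
    connection: SnowflakeConnection = SnowflakeConnection()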

The Snowflake implementation in this PR only implements getting column types for a table, which is the minimum requirement for the query service and allows registering tables to model on top of in DJ. It seems to be working as expected, but could probably use more validation of the column type conversions (mapping Snowflake column types to the corresponding DJ column type objects).
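
For reference, a hedged sketch of the core lookup using snowflake-connector-python; the function name and the surrounding DJ integration are assumptions:

import snowflake.connector

def get_snowflake_columns(conn_params: dict, table: str) -> list[tuple[str, str]]:
    """Fetch (column name, Snowflake type) pairs for a table."""
    # `table` should be a validated identifier before interpolation.
    with snowflake.connector.connect(**conn_params) as conn:
        cursor = conn.cursor()
        cursor.execute(f"DESCRIBE TABLE {table}")
        # Each row begins with the column name and its Snowflake type,
        # e.g. ("ID", "NUMBER(38,0)"); these would then be mapped to
        # DJ column type objects.
        return [(row[0], row[1]) for row in cursor.fetchall()]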

Test Plan

Added tests, configured a snowflake account, and registered some tables in the DJ UI.

Deployment Plan

netlify bot commented Sep 22, 2025

Deploy Preview for thriving-cassata-78ae72 canceled.

🔨 Latest commit: c4555b1
🔍 Latest deploy log: https://app.netlify.com/projects/thriving-cassata-78ae72/deploys/68f64c3c99886e00085fef0b


# Query service
# Query service url (only used with "http" query client config)
query_service: Optional[str] = None
Contributor Author


I'm keeping this here for now so as not to break any existing server configurations, but eventually we can migrate this to use a QueryClientConfig of type "http" that provides the URL as a param. That way, no matter what kind of query client you're using, you go through the same query client config logic.
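
A hedged sketch of what that future configuration could look like; the url parameter name is an assumption:

# Hypothetical future shape: the query service URL moves into the
# query client config instead of being a standalone setting.
query_client_config = QueryClientConfig(type="http", url="http://queryservice:8001")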

@samredai samredai force-pushed the query-client-dependency branch 6 times, most recently from 879847a to df2653b on September 30, 2025 13:31
@samredai samredai force-pushed the query-client-dependency branch 8 times, most recently from 5be6dc1 to 7127591 on October 9, 2025 02:34
@samredai samredai changed the title DRAFT - Allow configurable embedded query client (Issue #1499) Allow configurable embedded query client (Issue #1499) Oct 9, 2025
@samredai samredai marked this pull request as ready for review October 9, 2025 17:05
Member

@agorajek agorajek left a comment


Hey, this is awesome! I do have one question about what you think we should do with the legacy QueryServiceClient code. I posed this question inline while reading the http implementation, but feel free to reply wherever.

cc @shangyian

from datajunction_server.models.partition import PartitionBackfill
from datajunction_server.models.query import QueryCreate, QueryWithResults
from datajunction_server.query_clients.base import BaseQueryServiceClient
from datajunction_server.service_clients import QueryServiceClient
Member


Would it make sense to delete the code of QueryServiceClient sooner or later? Wondering if you are perhaps thinking of doing this as a next step, or not doing it at all, and why?

Contributor


Wouldn't deleting the legacy QueryServiceClient break backwards compatibility for existing setups?

Contributor Author


Yeah, that's exactly it: I wanted to keep this as obviously backwards compatible as possible until we can migrate internally to the new config structure. Once we do, it will be an easy follow-up to update this. What I'm planning is basically to rename QueryServiceClient -> HttpQueryServiceClient and make it inherit from BaseQueryServiceClient.
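
A purely illustrative sketch of that rename:

# Hypothetical follow-up: the legacy client becomes one implementation
# among several, inheriting from the new abstract base class.
class HttpQueryServiceClient(BaseQueryServiceClient):
    """Formerly QueryServiceClient; talks to a DJ REST-spec query service."""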

@vcknorket

Curious: instead of making credentials static, can it just be a payload / be more dynamic? This allows each query to be evaluated by the query engine (e.g. Trino in our case). @samredai

@samredai samredai force-pushed the query-client-dependency branch from 7127591 to db84f33 on October 14, 2025 19:46
@samredai
Contributor Author

> Curious: instead of making credentials static, can it just be a payload / be more dynamic? This allows each query to be evaluated by the query engine (e.g. Trino in our case). @samredai

Can you clarify what you mean by static? Do you pass credentials to Trino as part of the headers? In that case, every method on the abstract base class BaseQueryServiceClient takes an optional request_headers object, so you can pass it in each time you call a method.
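
For example, a hedged usage sketch; the header values and the get_columns_for_table method are illustrative assumptions (X-Trino-User is a common Trino header):

# Per-request credentials passed as headers rather than static config.
headers = {
    "X-Trino-User": "user@example.com",  # hypothetical values
    "Authorization": "Bearer <jwt>",
}
columns = query_client.get_columns_for_table(
    "catalog", "schema", "table", request_headers=headers,
)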

@vcknorket

> Curious: instead of making credentials static, can it just be a payload / be more dynamic? This allows each query to be evaluated by the query engine (e.g. Trino in our case). @samredai

> Can you clarify what you mean by static? Do you pass credentials to Trino as part of the headers? In that case, every method on the abstract base class BaseQueryServiceClient takes an optional request_headers object, so you can pass it in each time you call a method.

Yes, we pass them as part of the headers, since the actual security evaluation is done by Trino. So yes, we need to pass headers carrying the user's email and their Trino password/JWT so they can be forwarded to our custom query service that does the evaluation.

Member

@agorajek agorajek left a comment


Hey @samredai - this is fantastic. I made a few comments inline. Most of them are minor, but one would be nice to address before this PR lands: I noticed you exposed the generic materialize method. I wonder if we should hide it for now and only expose the materialize_cube one, for easier feature management in the future.

COPY . /code
RUN pip install --no-cache-dir --upgrade -r /code/requirements/docker.txt
RUN pip install -e .
RUN pip install -e .[all]
Member


Btw, what's the difference with [all]?

Contributor


This is basically saying to install the optional packages defined in pyproject.toml under:

all = [
    "snowflake-connector-python>=3.0.0",
]

request_headers=request_headers,
)

def materialize(
Member


Since we started leaning towards cube materialization only (for the time being), let's maybe remove this call for now, from here and from the base class? It will be easier to add it back later than to remove it in the future.

Contributor

@shangyian shangyian left a comment


I like the overall idea! I'm also wondering if we should move towards an approach where the engines you can register in DJ have an engine_type that corresponds to one of the supported query engines on the query client (and there can be as many as users need or have implemented). Then no one is really tied down to a single query engine, and we'll always pick the right query engine depending on what they've chosen.

Configuration for query service clients.
"""

# Type of query client: 'http', 'snowflake', 'bigquery', 'databricks', 'trino', etc.
Contributor


Is it the case that most use cases will only use a single query engine? Or is it reasonable to say that people can mix and match query engines, which can be determined based on the type of engine that's registered in our engines table?

Contributor Author

@samredai samredai Oct 20, 2025


I think mixing and matching is reasonable; the question I have, though, is whether we need to support that in these basic single-engine query clients. I doubt we can standardize how the specific query engine would be chosen: your idea of looking at the engine type makes sense, but I could see someone else wanting it to be driven by a request header. Is it reasonable for the OSS recommendation to be: create a simple wrapper QueryServiceClient implementation with whatever query routing logic someone would like? They could have a SuperQueryClient that makes the decision and, based on that decision, instantiates and uses the right query service client. Something like: if engine: bigquery is in the request header, use a BigQueryClient; if engine: databricks is in the request header, use a DatabricksQueryClient; etc.
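
A hedged sketch of that wrapper idea; every class besides BaseQueryServiceClient is hypothetical:

class SuperQueryClient(BaseQueryServiceClient):
    """Routes each call to an engine-specific client chosen from a request header."""

    def __init__(self):
        self._clients = {
            "bigquery": BigQueryClient(),        # hypothetical implementations
            "databricks": DatabricksQueryClient(),
        }

    def _pick(self, request_headers):
        engine = (request_headers or {}).get("engine", "bigquery")
        return self._clients[engine]

    def get_columns_for_table(self, catalog, schema, table, request_headers=None):
        return self._pick(request_headers).get_columns_for_table(
            catalog, schema, table, request_headers=request_headers,
        )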
