
Conversation


@samredai samredai commented Sep 22, 2025

Summary

BaseQueryServiceClient

Issue #1499 describes the motivation behind this PR. Instead of requiring a query service that matches a REST specification, this offers an injection point for custom implementations embedded within the main DJ server. The query service client was already an injected dependency, but this adds an abstract base class for query clients that the existing query service client wraps. Other query client implementations can likewise extend the base class and be injected for use within the main server. Some examples of what can be done with this (a hedged sketch follows the list):

  • A query client that directly calls a vendor database (this PR adds an implementation that uses snowflake-connector-python, for example)
  • A query client that calls a DJ REST spec query service (the original and now default configuration)
  • A query client that calls some other non-DJ REST spec query service
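
To make the injection point concrete, here is a minimal hedged sketch of a custom implementation; the get_columns_for_table method name and signature are illustrative assumptions (the thread below confirms only that every method takes an optional request_headers):

from abc import ABC, abstractmethod
from typing import Optional

class BaseQueryServiceClient(ABC):
    """Abstract base class that all query client implementations extend."""

    @abstractmethod
    def get_columns_for_table(  # hypothetical method name
        self,
        catalog: str,
        schema: str,
        table: str,
        request_headers: Optional[dict] = None,
    ) -> list:
        """Return column names and types for a table."""

class MyVendorQueryClient(BaseQueryServiceClient):
    """A custom client that could be injected in place of the HTTP one."""

    def get_columns_for_table(self, catalog, schema, table, request_headers=None):
        # Call the vendor database directly and convert its column
        # types to DJ column type objects.
        raise NotImplementedError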

Query Client Configuration

This PR also adds a configuration framework for the query client. The implementation is chosen using a specific string value for QueryClientConfig.type, or provided via the environment variable QUERY_CLIENT__TYPE. The other parameters required by that specific client type can also be passed in via environment variables. For example, here's a Snowflake configuration (a sketch of a matching settings object follows the exports):

export QUERY_CLIENT__TYPE=snowflake
export QUERY_CLIENT__CONNECTION__ACCOUNT=FOOACCOUNT
export QUERY_CLIENT__CONNECTION__USER=$SNOWFLAKE_USER
export QUERY_CLIENT__CONNECTION__PASSWORD=$SNOWFLAKE_PASSWORD
export QUERY_CLIENT__CONNECTION__WAREHOUSE=FOOWAREHOUSE
export QUERY_CLIENT__CONNECTION__DATABASE=FOODB
export QUERY_CLIENT__CONNECTION__SCHEMA=FOOTABLE
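
For context, here is a minimal sketch of how such double-underscore environment variables could map onto a nested settings object, assuming pydantic-settings; the actual config classes in this PR may differ:

from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class SnowflakeConnection(BaseModel):
    account: str = ""
    user: str = ""
    password: str = ""
    warehouse: str = ""
    database: str = ""
    # A "schema" field is omitted here; that name needs special handling
    # because it can clash with pydantic internals.

class QueryClientConfig(BaseSettings):
    # QUERY_CLIENT__TYPE selects the implementation, and
    # QUERY_CLIENT__CONNECTION__* fills the nested connection model.
    model_config = SettingsConfigDict(
        env_prefix="QUERY_CLIENT__",
        env_nested_delimiter="__",
    )
    type: str = "http"
    connection: SnowflakeConnection = SnowflakeConnection()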

The Snowflake implementation in this PR only implements getting column types for a table, which is the minimum requirement for the query service and allows registering tables to model on top of in DJ. It seems to be working as expected, but could probably use more validation of the column type conversions (mapping Snowflake column types to the corresponding DJ column type objects).
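
For reference, a hedged sketch of the core lookup using snowflake-connector-python; the function name and the surrounding DJ integration are assumptions:

import snowflake.connector

def get_snowflake_columns(conn_params: dict, table: str) -> list[tuple[str, str]]:
    """Fetch (column name, Snowflake type) pairs for a table."""
    # `table` should be a validated identifier before interpolation.
    with snowflake.connector.connect(**conn_params) as conn:
        cursor = conn.cursor()
        cursor.execute(f"DESCRIBE TABLE {table}")
        # Each row begins with the column name and its Snowflake type,
        # e.g. ("ID", "NUMBER(38,0)"); these would then be mapped to
        # DJ column type objects.
        return [(row[0], row[1]) for row in cursor.fetchall()]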

Test Plan

Added tests, configured a snowflake account, and registered some tables in the DJ UI.

Deployment Plan

netlify bot commented Sep 22, 2025

Deploy Preview for thriving-cassata-78ae72 canceled.

🔨 Latest commit: c4555b1
🔍 Latest deploy log: https://app.netlify.com/projects/thriving-cassata-78ae72/deploys/68f64c3c99886e00085fef0b


# Query service
# Query service url (only used with "http" query client config)
query_service: Optional[str] = None
Contributor Author


I'm keeping this here for now so as not to break any existing server configurations, but eventually we can migrate this to use a QueryClientConfig of type "http" that provides the URL as a param. That way, no matter what kind of query client you're using, you go through the same query client config logic.
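
A hedged sketch of what that future configuration could look like; the url parameter name is an assumption:

# Hypothetical future shape: the query service URL moves into the
# query client config instead of being a standalone setting.
query_client_config = QueryClientConfig(type="http", url="http://queryservice:8001")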

@samredai samredai force-pushed the query-client-dependency branch 6 times, most recently from 879847a to df2653b on September 30, 2025 13:31
@samredai samredai force-pushed the query-client-dependency branch 8 times, most recently from 5be6dc1 to 7127591 on October 9, 2025 02:34
@samredai samredai changed the title DRAFT - Allow configurable embedded query client (Issue #1499) Allow configurable embedded query client (Issue #1499) Oct 9, 2025
@samredai samredai marked this pull request as ready for review October 9, 2025 17:05
Member

@agorajek agorajek left a comment


Hey, this is awesome! I do have one question about what you think we should do with the legacy QueryServiceClient code. I posed this question inline while reading the http implementation, but feel free to reply wherever.

cc @shangyian

from datajunction_server.models.partition import PartitionBackfill
from datajunction_server.models.query import QueryCreate, QueryWithResults
from datajunction_server.query_clients.base import BaseQueryServiceClient
from datajunction_server.service_clients import QueryServiceClient
Member


Would it make sense to delete the code of QueryServiceClient sooner or later? Wondering if you are perhaps thinking of doing this as a next step, or not doing it at all, and why?

Contributor


Wouldn't deleting the legacy QueryServiceClient break backwards compatibility for existing setups?

Contributor Author


Yeah, that's exactly it: I wanted to keep this as obviously backwards compatible as possible until we can migrate internally to the new config structure. Once we do, it will be an easy follow-up to update this. What I'm planning is basically to rename QueryServiceClient -> HttpQueryServiceClient and make it inherit from BaseQueryServiceClient.
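
A purely illustrative sketch of that rename:

# Hypothetical follow-up: the legacy client becomes one implementation
# among several, inheriting from the new abstract base class.
class HttpQueryServiceClient(BaseQueryServiceClient):
    """Formerly QueryServiceClient; talks to a DJ REST-spec query service."""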

@vcknorket

Curious: instead of making credentials static, can it just be a payload / be more dynamic? This allows each query to be evaluated by the query engine (e.g. Trino in our case). @samredai

@samredai samredai force-pushed the query-client-dependency branch from 7127591 to db84f33 on October 14, 2025 19:46
@samredai
Contributor Author

> Curious: instead of making credentials static, can it just be a payload / be more dynamic? This allows each query to be evaluated by the query engine (e.g. Trino in our case). @samredai

Can you clarify what you mean by static? Do you pass credentials to Trino as part of the headers? In that case, every method on the abstract base class BaseQueryServiceClient takes an optional request_headers object, so you can pass it in each time you call a method.
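
For example, a hedged usage sketch; the header values and the get_columns_for_table method are illustrative assumptions (X-Trino-User is a common Trino header):

# Per-request credentials passed as headers rather than static config.
headers = {
    "X-Trino-User": "user@example.com",  # hypothetical values
    "Authorization": "Bearer <jwt>",
}
columns = query_client.get_columns_for_table(
    "catalog", "schema", "table", request_headers=headers,
)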

@vcknorket

> Curious: instead of making credentials static, can it just be a payload / be more dynamic? This allows each query to be evaluated by the query engine (e.g. Trino in our case). @samredai

> Can you clarify what you mean by static? Do you pass credentials to Trino as part of the headers? In that case, every method on the abstract base class BaseQueryServiceClient takes an optional request_headers object, so you can pass it in each time you call a method.

Yes, we pass them as part of the headers, since the actual security evaluation is done by Trino. So yes, we need to pass headers carrying the user's email and their Trino password/JWT so they can be forwarded to our custom query service that does the evaluation.

Member

@agorajek agorajek left a comment


Hey @samredai - this is fantastic. I made a few comments inline. Most of them are minor, but one would be nice to address before this PR lands: I noticed you exposed the generic materialize method. I wonder if we should hide it for now and only expose the materialize_cube one, for easier feature management in the future.

COPY . /code
RUN pip install --no-cache-dir --upgrade -r /code/requirements/docker.txt
RUN pip install -e .
RUN pip install -e .[all]
Member


Btw, what's the difference with [all]?

Contributor


This is basically saying to install the optional packages defined in pyproject.toml under:

all = [
    "snowflake-connector-python>=3.0.0",
]

request_headers=request_headers,
)

def materialize(
Member


Since we started leaning towards cube materialization only (for the time being), let's maybe remove this call for now, from here and from the base class? It will be easier to add it back later than to remove it in the future.

Contributor

@shangyian shangyian left a comment


I like the overall idea! I'm also wondering if we should move towards an approach where the engines you can register in DJ have an engine_type that corresponds to one of the supported query engines on the query client (and there can be as many as users need or have implemented). Then no one is really tied down to a single query engine, and we'll always pick the right query engine depending on what they've chosen.

Configuration for query service clients.
"""

# Type of query client: 'http', 'snowflake', 'bigquery', 'databricks', 'trino', etc.
Contributor


Is it the case that most use cases will only use a single query engine? Or is it reasonable to say that people can mix and match query engines, which can be determined based on the type of engine that's registered in our engines table?

Contributor Author

@samredai samredai Oct 20, 2025


I think mixing and matching is reasonable; the question I have, though, is whether we need to support that in these basic single-engine query clients. I doubt we can standardize how the specific query engine would be chosen: your idea of looking at the engine type makes sense, but I could see someone else wanting it to be driven by a request header. Is it reasonable for the OSS recommendation to be: create a simple wrapper QueryServiceClient implementation with whatever query routing logic someone would like? They could have a SuperQueryClient that makes the decision and, based on that decision, instantiates and uses the right query service client. Something like: if engine: bigquery is in the request header, use a BigQueryClient; if engine: databricks is in the request header, use a DatabricksQueryClient; etc.
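
A hedged sketch of that wrapper idea; every class besides BaseQueryServiceClient is hypothetical:

class SuperQueryClient(BaseQueryServiceClient):
    """Routes each call to an engine-specific client chosen from a request header."""

    def __init__(self):
        self._clients = {
            "bigquery": BigQueryClient(),        # hypothetical implementations
            "databricks": DatabricksQueryClient(),
        }

    def _pick(self, request_headers):
        engine = (request_headers or {}).get("engine", "bigquery")
        return self._clients[engine]

    def get_columns_for_table(self, catalog, schema, table, request_headers=None):
        return self._pick(request_headers).get_columns_for_table(
            catalog, schema, table, request_headers=request_headers,
        )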
