-
Notifications
You must be signed in to change notification settings - Fork 18
Allow configurable embedded query client (Issue #1499) #1504
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for thriving-cassata-78ae72 canceled.
|
|
|
||
| # Query service | ||
| # Query service url (only used with "http" query client config) | ||
| query_service: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm keeping this here for now so as not to break any existing server configuraitons, but eventually we can migrate this to use a QueryClientConfig of type "http" and provides the url as a param. That way no matter what kind of query client you're using you go through the same query client config logic.
879847a to
df2653b
Compare
5be6dc1 to
7127591
Compare
agorajek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey this is awesome! I do have one question about what do you think we should do in terms of the legacy QueryServiceClient code. I posed this question in-line when I was reading the http implementation, but feel free to reply wherever.
cc @shangyian
| from datajunction_server.models.partition import PartitionBackfill | ||
| from datajunction_server.models.query import QueryCreate, QueryWithResults | ||
| from datajunction_server.query_clients.base import BaseQueryServiceClient | ||
| from datajunction_server.service_clients import QueryServiceClient |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to delete the code of QueryServiceClient sooner or later? Wondering if you are perhaps thinking of doing this as the next step or not doing at all and why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would deleting the legacy QueryServiceClient keep it it backwards compatible for existing setups?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's exactly it, I wanted to keep this as obviously backwards compatible as possible until we can migrate internally to the new config structure. Once we do that, it will be an easy follow-up to update this and what I'm planning to do is basically just rename QueryServiceClient -> HttpQueryServiceClient and make it inherit from BaseQueryServiceClient.
|
Curious..instead of making credentials static... can it just be a payload / be more dynamic? this allows each query to be evaluated by the query engine (e.g. trino in our case) @samredai |
7127591 to
db84f33
Compare
Can you clarify what you mean by static? Do you pass in credentials to trino as part of the headers? In that case every method on the abstract base class |
Yes. We pass as part of headers as the actual security evaluation is done by trino. So...yes..we need to pass headers that have users email and their trino pwd/jwt so it can be passed on to our custom query service that can do the evaluations |
agorajek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @samredai - this is fantastic. I made few comments in-line. Most of them minor but one would be nice to address before you push this PR. I noticed you exposed the generic materialize method. I wonder if we should hide it for now and only expose the materialize_cube one - for easier feature management in the future.
| COPY . /code | ||
| RUN pip install --no-cache-dir --upgrade -r /code/requirements/docker.txt | ||
| RUN pip install -e . | ||
| RUN pip install -e .[all] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, what's the difference with [all] ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically saying to install the optional packages defined in pyproject.toml under:
all = [
"snowflake-connector-python>=3.0.0",
]
| request_headers=request_headers, | ||
| ) | ||
|
|
||
| def materialize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we started leaning towards cube materialization only (for the time being), let's maybe remove this call for now, from here and from the based class? It will be easier to add it back than to remove in the future.
shangyian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the overall idea! I'm also wondering if we should move towards an approach where the engines you can register into DJ have an engine_type, which corresponds to one of the supported query engines on the query client (and there can be as many as users need or have implemented). Then no one is really tied down to a single query engine, but we'll always pick the right query engine depending on what they've chosen.
| Configuration for query service clients. | ||
| """ | ||
|
|
||
| # Type of query client: 'http', 'snowflake', 'bigquery', 'databricks', 'trino', etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it the case that most use cases will only use a single query engine? Or is it reasonable to say that people can mix and match query engines, which can be determined based on the type of engine that's registered in our engines table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think mixing and matching is reasonable, the question I have though is do we need to support that with these basic single engine query clients? I doubt we can standardize how the specific query engine would be chosen. Your idea of looking at the engine type make sense but I could see someone else wanting it to be driven by a request header. Is it reasonable for the OSS recommendation to be create a simple wrapper QueryServiceClient implementation that has whatever query routing logic someone would like? They could have a SuperQueryClient that makes the decision and based on that decision just instantiates and uses the right query service client. Something like if engine: bigquery is in the request header, use a BigQueryClient, if engine: databricks is in the request header use a DatabricksQueryClient, etc.
Co-authored-by: Olek Górajek <[email protected]>
Co-authored-by: Olek Górajek <[email protected]>
Summary
BaseQueryServiceClient
Issue #1499 describes the motivation behind this PR. Instead of requiring a query service matching a REST specification, this offers an injection point for custom implementations that's embedded within the main DJ server. The query service client was already an injected dependency, but this adds a query client abstract base class that the existing query service client wraps. This allows other query client implementations to also wrap the base class and be injected for use within the main server. Some examples of what can be done with this:
Query Client Configuration
Another thing this adds is a configuration framework for configuring the query client. The implementation is chosen using a specific string value for
QueryClientConfig.typeor provided via an environment variableQUERY_CLIENT__TYPE. The other params required by that specific client type can also be passed in via environment variables. For example, here's a snowflake configuration:The snowflake implementation in this PR only implements getting column types for a table which is the minimum requirement for the query service and allows for registering tables to model on top of in DJ. It seems to be working as expected but could probably use some more validation of the column type conversions (converting snowflake column types to the corresponding DJ column type object).
Test Plan
Added tests, configured a snowflake account, and registered some tables in the DJ UI.
make checkpassesmake testshows 100% unit test coverageDeployment Plan