Commit 5545bde

Merge pull request #881 from airweave-ai/feat/sharepoint-and-airtable
Feat/sharepoint and airtable
2 parents 8d61e1f + e628d40 commit 5545bde

31 files changed: +4598 -76 lines changed

.cursor/rules/connector-development-end-to-end.mdc

Lines changed: 172 additions & 13 deletions
@@ -19,6 +19,100 @@ Your task is to write the code. The human will handle testing and running comman
1919

2020
---
2121

22+
## Important Guidelines
23+
24+
These are the most common mistakes when building connectors:
25+
26+
### 1. Make Entities Information-Rich (Embeddable Fields)
27+
28+
**Rule:** Mark ~70% of entity fields as `embeddable=True`
29+
30+
**Why:** Without `embeddable=True`, fields are only keyword-searchable, not semantically searchable. Users won't be able to find relevant data.
31+
32+
**What to mark embeddable:**
33+
- ✅ All text content (descriptions, notes, comments, body)
34+
- ✅ All names and titles
35+
- ✅ All people (assignees, authors, owners, members)
36+
- ✅ All status/metadata (status, priority, tags, labels)
37+
- ✅ All timestamps (created_at, modified_at, due_dates)
38+
39+
**What NOT to mark embeddable:**
40+
- ❌ Internal IDs (entity_id, external_id, database IDs)
41+
- ❌ Binary metadata (sizes, checksums, mime_types)
42+
43+
**Bad Example:**
44+
```python
# Avoid: Sparse entity - users can't search by anything except name
class TaskEntity(ChunkEntity):
    name: str = AirweaveField(..., embeddable=True)
    description: str = Field(...)  # Should be embeddable
    assignee: Dict = Field(...)  # Should be embeddable
```
51+
52+
**Good Example:**
53+
```python
# Better: Information-rich - users can search everything
class TaskEntity(ChunkEntity):
    name: str = AirweaveField(..., embeddable=True)
    description: str = AirweaveField(..., embeddable=True)
    assignee: Dict = AirweaveField(..., embeddable=True)
    status: str = AirweaveField(..., embeddable=True)
    external_id: str = Field(...)  # ID correctly not embeddable
```
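The ~70% rule can be checked mechanically. The sketch below is a rough illustration only: it uses standard-library dataclasses, and the `embeddable()` and `plain()` helpers are hypothetical stand-ins for `AirweaveField(..., embeddable=True)` and `Field(...)`, not Airweave's API. It counts how many fields of an entity carry the embeddable flag.

```python
from dataclasses import dataclass, field, fields


def embeddable(default=None):
    """Hypothetical stand-in for AirweaveField(..., embeddable=True)."""
    return field(default=default, metadata={"embeddable": True})


def plain(default=None):
    """Hypothetical stand-in for a plain Field(...)."""
    return field(default=default, metadata={"embeddable": False})


@dataclass
class TaskEntity:
    # Mirrors the "good" example: everything but the ID is embeddable.
    name: str = embeddable("")
    description: str = embeddable("")
    assignee: dict = embeddable()
    status: str = embeddable("")
    external_id: str = plain("")


def embeddable_ratio(entity_cls) -> float:
    """Return the fraction of fields flagged as embeddable."""
    all_fields = fields(entity_cls)
    flagged = [f for f in all_fields if f.metadata.get("embeddable")]
    return len(flagged) / len(all_fields)


ratio = embeddable_ratio(TaskEntity)
print(f"{ratio:.0%} of TaskEntity fields are embeddable")
```

With real entities the same idea would need to read whatever metadata `AirweaveField` actually stores, which this sketch does not assume.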
62+
63+
### 2. Test Entity Types Your Source Actually Implements
64+
65+
**Rule:** Your Monke tests should create and verify the entity types that your source actually yields
66+
67+
**Why:** Untested entity types may break in production without detection.
68+
69+
**Important:** Only test entities that your source implementation yields. You don't need to test every theoretically possible entity type from the API—just the ones your connector actually implements.
70+
71+
**How to verify:**
72+
1. Open your source: `backend/airweave/platform/sources/{short_name}.py`
73+
2. Find all `yield` statements in `generate_entities()`
74+
3. List the entity types your source ACTUALLY yields (e.g., Task, Comment, File)
75+
4. Your `bongos/{short_name}.py::create_entities()` should create at least one of each yielded type
76+
5. Your `create_entities()` should return descriptors for all yielded types
77+
78+
**Example:** If your SharePoint source only yields `ListItem`, `Page`, and `DriveItem` entities (not `User`, `Group`, `Site`), then your Monke bongo only needs to create those three types—not the entire SharePoint API surface.
79+
80+
**Bad Example:**
81+
```python
# Avoid: Only creates tasks, ignores comments and files
async def create_entities(self):
    all_entities = []
    for i in range(self.entity_count):
        task = await self._create_task(...)
        all_entities.append(task)
    # Source yields comments and files, but we don't test them
    return all_entities
```
90+
91+
**Good Example:**
92+
```python
# Better: Creates all entity types from source
async def create_entities(self):
    all_entities = []
    for i in range(self.entity_count):
        # Create parent
        task = await self._create_task(...)
        all_entities.append(task)

        # Create comments (source yields them)
        for j in range(2):
            comment = await self._create_comment(task["id"], ...)
            all_entities.append(comment)

        # Create file (source yields them)
        file = await self._upload_file(task["id"], ...)
        all_entities.append(file)

    return all_entities  # Returns tasks, comments, AND files
```
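A small guard can catch the "bad example" case before a test run. The following is a minimal sketch, assuming each descriptor returned by `create_entities()` is a dict with a `"type"` key; that key name and the type labels are illustrative assumptions, not a documented Monke convention.

```python
from typing import Any, Dict, Iterable, List, Set


def missing_entity_types(
    descriptors: Iterable[Dict[str, Any]], yielded_types: Set[str]
) -> Set[str]:
    """Return yielded entity types that no test descriptor covers."""
    created = {d["type"] for d in descriptors}
    return yielded_types - created


# Types the (hypothetical) source yields.
yielded = {"task", "comment", "file"}

# Descriptors as a bongo might return them; the file descriptor is missing.
descriptors: List[Dict[str, Any]] = [
    {"type": "task", "id": "t1"},
    {"type": "comment", "id": "c1"},
]

print(sorted(missing_entity_types(descriptors, yielded)))  # ['file']
```

An empty result means every yielded type has at least one test entity.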
111+
112+
---
115+
22116
## Phase 1: Research & Planning
23117

24118
### Step 1: Understand the API
@@ -106,12 +200,42 @@ Decide:
106200
3. Add **nested entities** (comments, messages)
107201
4. Add **file entities** if the API supports attachments
108202

203+
**Make entities information-rich with embeddable fields**
204+
109205
**Key principles:**
110-
- Use `AirweaveField(..., embeddable=True)` for searchable text
206+
- **USE `AirweaveField(..., embeddable=True)` FOR ~70% OF FIELDS** - this is what makes entities searchable!
207+
- Mark ALL user-visible content as `embeddable=True`:
208+
- Text content (descriptions, notes, comments, body)
209+
- Names and titles
210+
- People (assignees, authors, owners)
211+
- Status/metadata (status, priority, tags)
212+
- Timestamps (created_at, modified_at, due_dates)
213+
- Only use `Field()` without embeddable for:
214+
- Internal IDs (entity_id, external_id)
215+
- Binary metadata (sizes, checksums, mime_types)
111216
- Always include `created_at` and `modified_at` with proper flags
112-
- Use `Field(...)` for non-searchable metadata
113217
- Inherit from `ChunkEntity` or `FileEntity`
114218

219+
**Anti-pattern to avoid:**
220+
```python
# ❌ BAD: Sparse entity with only name embeddable
class TaskEntity(ChunkEntity):
    name: str = AirweaveField(..., embeddable=True)
    description: str = Field(...)  # ❌ Should be embeddable!
    assignee: Dict = Field(...)  # ❌ Should be embeddable!
```
227+
228+
**Good pattern:**
229+
```python
# ✅ GOOD: Information-rich entity with most fields embeddable
class TaskEntity(ChunkEntity):
    name: str = AirweaveField(..., embeddable=True)
    description: str = AirweaveField(..., embeddable=True)  # ✅
    assignee: Dict = AirweaveField(..., embeddable=True)  # ✅
    status: str = AirweaveField(..., embeddable=True)  # ✅
    external_id: str = Field(...)  # ✅ ID not embeddable
```
238+
115239
**Example structure:**
116240

117241
```python
@@ -411,26 +535,45 @@ async def generate_comment(model: str, token: str) -> dict:
411535

412536
**Reference:** See `monke-testing-guide.mdc` Part 1
413537

414-
**This is the most critical file.** It must create ALL entity types.
538+
**The bongo must create ALL entity types that your source connector yields.**
539+
540+
**Before starting:**
541+
1. Open `backend/airweave/platform/sources/{short_name}.py`
542+
2. Find `generate_entities()` method
543+
3. List EVERY entity type that is yielded:
544+
```python
# Example from your source:
yield WorkspaceEntity(...)  # ← Must create in tests
yield ProjectEntity(...)  # ← Must create in tests
yield TaskEntity(...)  # ← Must create in tests
yield CommentEntity(...)  # ← Must create in tests
yield FileEntity(...)  # ← Must create in tests
```
552+
4. Your `create_entities()` MUST create instances of ALL these types
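Listing the yielded types by hand is error-prone, so the steps above can be approximated with Python's standard `ast` module. This is a sketch under assumptions: `SAMPLE_SOURCE` is a made-up source file, and the helper only detects yields of the form `yield SomeEntity(...)`. A source that builds the entity first and then yields a variable would need a manual check.

```python
import ast
from typing import Set

# Made-up source text standing in for backend/airweave/platform/sources/{short_name}.py
SAMPLE_SOURCE = '''
class ExampleSource:
    async def generate_entities(self):
        yield WorkspaceEntity(name="w")
        async for task in self._tasks():
            yield TaskEntity(name=task["name"])
            yield CommentEntity(body="...")

    async def _tasks(self):
        yield {"name": "t"}
'''


def yielded_entity_types(source_code: str, method: str = "generate_entities") -> Set[str]:
    """Collect class names called directly in `yield` statements of `method`."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source_code)):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if not (is_func and node.name == method):
            continue
        for inner in ast.walk(node):
            if (
                isinstance(inner, ast.Yield)
                and isinstance(inner.value, ast.Call)
                and isinstance(inner.value.func, ast.Name)
            ):
                found.add(inner.value.func.id)
    return found


print(sorted(yielded_entity_types(SAMPLE_SOURCE)))
```

The resulting set is the list that `create_entities()` has to cover.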
415553

416554
**Implementation order:**
417555

418556
1. Create the class skeleton
419557
2. Implement `_ensure_workspace()` and `_ensure_project()` helpers
420-
3. Implement `create_entities()` - **MUST create all entity types**
558+
3. **Implement `create_entities()` - MUST create ALL entity types (not just tasks!)**
421559
4. Implement `update_entities()`
422560
5. Implement `delete_specific_entities()`
423561
6. Implement `delete_entities()`
424562
7. Implement `cleanup()`
425563
8. Add rate limiting and error handling
426564

427-
**Critical: Test ALL Entity Types**
565+
**Test ALL Entity Types**
566+
567+
**Validation before proceeding:**
568+
- [ ] Count: How many entity types does your source yield?
569+
- [ ] Count: How many entity types does `create_entities()` create?
570+
- [ ] These numbers MUST match (excluding workspace/project if they're not stored)
428571

429572
```python
430573
async def create_entities(self) -> List[Dict[str, Any]]:
431574
"""Create comprehensive test entities.
432575

433-
CRITICAL: Must create instances of EVERY entity type that
576+
Must create instances of EVERY entity type that
434577
the source connector syncs.
435578
"""
436579
all_entities = []
@@ -454,7 +597,7 @@ async def create_entities(self) -> List[Dict[str, Any]]:
454597
})
455598

456599
# ==========================================
457-
# CRITICAL: Create child entities
600+
# Create child entities
458601
# ==========================================
459602

460603
# Create 2 comments per task
@@ -568,21 +711,31 @@ Wait for human feedback. If tests fail, review the error logs and fix the code.
568711
### Checklist
569712

570713
**Source Connector:**
571-
- [ ] All entity types are implemented
714+
- [ ] All entity types are implemented in `entities/{short_name}.py`
715+
- [ ] **Semantically relevant entity fields use `AirweaveField(..., embeddable=True)`**
716+
- [ ] All text content fields are embeddable
717+
- [ ] All people fields are embeddable
718+
- [ ] All status/metadata fields are embeddable
719+
- [ ] All timestamps are embeddable
720+
- [ ] Only IDs and binary metadata use `Field()` without embeddable
721+
- [ ] Verified: ~70% of fields are embeddable
572722
- [ ] All entities have `created_at` or `modified_at` timestamps
573723
- [ ] Token refresh is properly handled
574-
- [ ] Rate limiting is implemented
724+
- [ ] Rate limiting is implemented (if API requires it)
575725
- [ ] Pagination is handled correctly
576726
- [ ] Errors are handled gracefully (don't fail entire sync)
577727
- [ ] Breadcrumbs track entity hierarchy
578728
- [ ] File entities use `process_file_entity()`
579729
- [ ] OAuth config is in `dev.integrations.yaml`
580730

581731
**Monke Tests:**
582-
- [ ] Bongo creates ALL entity types (not just tasks)
732+
- [ ] **Bongo creates ALL entity types from source**
733+
- [ ] Listed all entity types yielded in source's `generate_entities()`
734+
- [ ] Confirmed `create_entities()` creates each type
735+
- [ ] Entity type count matches between source and tests
583736
- [ ] Each entity has unique verification token
584737
- [ ] Tokens are embedded in searchable content
585-
- [ ] Generation schemas defined for all types
738+
- [ ] Generation schemas defined for all entity types
586739
- [ ] Test config has comprehensive test flow
587740
- [ ] All entity types are verified after sync
588741
- [ ] Update flow tests incremental sync
@@ -687,7 +840,7 @@ Reference the Asana source as an example: @asana.py
687840
```
688841
Now implement the Monke tests.
689842

690-
CRITICAL: The bongo MUST create instances of EVERY entity type, including
843+
The bongo MUST create instances of EVERY entity type, including
691844
comments and files, not just the top-level tasks.
692845

693846
Start with:
@@ -707,12 +860,18 @@ Building a complete connector requires:
707860

708861
1. **Research:** Understand the API, entity hierarchy, auth, and rate limits
709862
2. **Entities:** Define schemas for ALL entity types with proper timestamps
863+
- **Mark ~70% of fields as `embeddable=True` for semantic search**
710864
3. **Source:** Implement hierarchical entity generation with token refresh
711865
4. **Testing:** Create Monke tests that verify EVERY entity type
866+
- **Test ALL entity types your source yields, not just tasks**
712867
5. **Validation:** Run E2E tests and verify all entities appear in search
713868
6. **Refinement:** Debug issues, optimize performance, handle edge cases
714869

715-
**The key to success:** Test comprehensively. Don't just test tasks—test comments, files, and all nested entities. If your connector syncs it, your tests should verify it.
870+
**The two keys to success:**
871+
872+
1. **Information-Rich Entities**: Mark most fields as `embeddable=True` so users can semantically search your data. Sparse entities with only names embeddable are barely useful.
873+
874+
2. **Comprehensive Testing**: Don't just test tasks—test comments, files, and all nested entities. If your connector syncs it, your tests must verify it. Count entity types in your source's `generate_entities()` and match that count in your Monke tests.
716875

717876
---
718877

.cursor/rules/integrations-yaml.mdc

Lines changed: 29 additions & 1 deletion
@@ -20,12 +20,15 @@ integrations:
2020
content_type: "application/x-www-form-urlencoded"
2121
client_credential_location: "body" # or "header"
2222
scope: "scope1 scope2 scope3" # service-specific permissions
23+
requires_pkce: false # optional, default false. Set true for PKCE-required providers (e.g., Airtable)
2324
additional_frontend_params: # optional service-specific parameters
2425
param1: "value1"
2526
param2: "value2"
2627
```
2728

28-
### Example Integration
29+
### Example Integrations
30+
31+
**Standard OAuth (Gmail):**
2932
```yaml
3033
gmail:
3134
auth_type: "oauth2_with_refresh"
@@ -42,6 +45,31 @@ gmail:
4245
prompt: "consent"
4346
```
4447

48+
**OAuth with PKCE (Airtable):**
49+
```yaml
airtable:
  oauth_type: "with_refresh"
  url: "https://airtable.com/oauth2/v1/authorize"
  backend_url: "https://airtable.com/oauth2/v1/token"
  grant_type: "authorization_code"
  client_id: "your-client-id"
  client_secret: "your-client-secret"
  content_type: "application/x-www-form-urlencoded"
  client_credential_location: "header"
  scope: "schema.bases:read data.records:read"
  requires_pkce: true  # PKCE (Proof Key for Code Exchange) prevents authorization code interception
```
62+
63+
**PKCE Flow Details:**
64+
When `requires_pkce: true`, the OAuth flow includes:
65+
1. System generates a `code_verifier` (random string) during authorization
66+
2. Computes `code_challenge` = BASE64URL(SHA256(code_verifier)) and sends it in the auth URL
67+
3. Stores `code_verifier` in `init_session.overrides` for later retrieval
68+
4. During token exchange, sends the original `code_verifier` to prove authenticity
69+
5. Provider verifies that BASE64URL(SHA256(code_verifier)) matches the original `code_challenge`
70+
71+
This prevents authorization code interception attacks by ensuring the token exchange request comes from the same client that initiated authorization.
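The verifier and challenge computation can be sketched with the Python standard library, following the S256 method from RFC 7636. The function names are illustrative and not taken from the Airweave codebase, and how the verifier is stored in `init_session.overrides` is not shown.

```python
import base64
import hashlib
import secrets


def make_code_verifier(num_bytes: int = 32) -> str:
    """Generate a random URL-safe verifier (43 characters for 32 bytes)."""
    return secrets.token_urlsafe(num_bytes)


def make_code_challenge(code_verifier: str) -> str:
    """Derive the S256 challenge: base64url(SHA-256(verifier)) without '=' padding."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = make_code_challenge(verifier)

# The authorization URL carries the challenge; the token request later carries the verifier.
auth_params = {"code_challenge": challenge, "code_challenge_method": "S256"}
token_params = {"code_verifier": verifier}
print(auth_params, token_params)
```

The provider recomputes the challenge from the `code_verifier` it receives at token exchange and rejects the request if the two values differ.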
72+
4573
## Folder Structure
4674
The files appear to be part of a structured monorepo with:
4775
