@@ -19,6 +19,100 @@ Your task is to write the code. The human will handle testing and running comman
1919
2020---
2121
22+ ## Important Guidelines
23+
24+ These are the most common mistakes when building connectors:
25+
26+ ### 1. Make Entities Information-Rich (Embeddable Fields)
27+
28+ **Rule:** Mark roughly 70% of entity fields as `embeddable=True`.
29+
30+ **Why:** Without `embeddable=True`, fields are only keyword-searchable, not semantically searchable. Users who phrase a query differently from the stored text won't find the relevant data.
31+
32+ **What to mark embeddable:**
33+ - ✅ All text content (descriptions, notes, comments, body)
34+ - ✅ All names and titles
35+ - ✅ All people (assignees, authors, owners, members)
36+ - ✅ All status/metadata (status, priority, tags, labels)
37+ - ✅ All timestamps (created_at, modified_at, due_dates)
38+
39+ **What NOT to mark embeddable:**
40+ - ❌ Internal IDs (entity_id, external_id, database IDs)
41+ - ❌ Binary metadata (sizes, checksums, mime_types)
42+
43+ **Bad Example:**
44+ ```python
45+ # Avoid: Sparse entity - users can't search by anything except name
46+ class TaskEntity(ChunkEntity):
47+     name: str = AirweaveField(..., embeddable=True)
48+     description: str = Field(...)  # Should be embeddable
49+     assignee: Dict = Field(...)  # Should be embeddable
50+ ```
51+
52+ **Good Example:**
53+ ```python
54+ # Better: Information-rich - users can search everything
55+ class TaskEntity(ChunkEntity):
56+     name: str = AirweaveField(..., embeddable=True)
57+     description: str = AirweaveField(..., embeddable=True)
58+     assignee: Dict = AirweaveField(..., embeddable=True)
59+     status: str = AirweaveField(..., embeddable=True)
60+     external_id: str = Field(...)  # ID correctly not embeddable
61+ ```
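
The ~70% rule can be checked mechanically. The sketch below is only an illustration: it does not use `AirweaveField` (how that helper stores the flag is not shown in this guide), so it stands in stdlib `dataclasses` metadata for the `embeddable` flag. `TaskSketch` and `embeddable_ratio` are hypothetical names.

```python
from dataclasses import dataclass, field, fields


def embeddable_ratio(entity_cls) -> float:
    """Return the fraction of fields whose metadata marks them embeddable."""
    all_fields = fields(entity_cls)
    flagged = [f for f in all_fields if f.metadata.get("embeddable", False)]
    return len(flagged) / len(all_fields)


@dataclass
class TaskSketch:
    # Hypothetical stand-in for TaskEntity; metadata mimics embeddable=True.
    name: str = field(default="", metadata={"embeddable": True})
    description: str = field(default="", metadata={"embeddable": True})
    assignee: dict = field(default_factory=dict, metadata={"embeddable": True})
    status: str = field(default="", metadata={"embeddable": True})
    external_id: str = field(default="")  # ID: deliberately not embeddable


ratio = embeddable_ratio(TaskSketch)
print(f"{ratio:.0%} of fields are embeddable")  # prints "80% of fields are embeddable"
```

A real check would need to read the flag from wherever the entity schema actually keeps it; the ratio logic stays the same.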
62+
63+ ### 2. Test Entity Types Your Source Actually Implements
64+
65+ **Rule:** Your Monke tests should create and verify every entity type that your source actually yields.
66+
67+ **Why:** Untested entity types may break in production without detection.
68+
69+ **Important:** Only test entities that your source implementation yields. You don't need to test every theoretically possible entity type from the API—just the ones your connector actually implements.
70+
71+ **How to verify:**
72+ 1. Open your source: `backend/airweave/platform/sources/{short_name}.py`
73+ 2. Find all `yield` statements in `generate_entities()`
74+ 3. List the entity types your source ACTUALLY yields (e.g., Task, Comment, File)
75+ 4. Your `bongos/{short_name}.py::create_entities()` should create at least one of each yielded type
76+ 5. Your `create_entities()` should return descriptors for all yielded types
77+
78+ **Example:** If your SharePoint source only yields `ListItem`, `Page`, and `DriveItem` entities (not `User`, `Group`, `Site`), then your Monke bongo only needs to create those three types—not the entire SharePoint API surface.
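
Steps 2 and 3 above can be partly automated. The following is a rough sketch using the stdlib `ast` module; `yielded_entity_types` and `SAMPLE` are hypothetical, and the scan only catches direct `yield SomeEntity(...)` calls. Entities yielded through a variable or a helper generator would still need a manual read of the source.

```python
import ast


def yielded_entity_types(source_code: str, func_name: str = "generate_entities") -> set:
    """Collect class names constructed directly in `yield X(...)` inside func_name."""
    tree = ast.parse(source_code)
    found = set()
    for node in ast.walk(tree):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name == func_name:
            for inner in ast.walk(node):
                if isinstance(inner, ast.Yield) and isinstance(inner.value, ast.Call):
                    callee = inner.value.func
                    if isinstance(callee, ast.Name):
                        found.add(callee.id)
    return found


# Hypothetical source file contents, used only to exercise the scan.
SAMPLE = '''
class DemoSource:
    async def generate_entities(self):
        yield TaskEntity(name="t")
        for c in []:
            yield CommentEntity(body="c")
        yield FileEntity(name="f")
'''

print(sorted(yielded_entity_types(SAMPLE)))
# prints ['CommentEntity', 'FileEntity', 'TaskEntity']
```

In practice you would pass the text of `backend/airweave/platform/sources/{short_name}.py` instead of `SAMPLE`.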
79+
80+ **Bad Example:**
81+ ```python
82+ # Avoid: Only creates tasks, ignores comments and files
83+ async def create_entities(self):
84+     for i in range(self.entity_count):
85+         task = await self._create_task(...)
86+         all_entities.append(task)
87+     # Source yields comments and files, but we don't test them
88+     return all_entities
89+ ```
90+
91+ **Good Example:**
92+ ```python
93+ # Better: Creates all entity types from source
94+ async def create_entities(self):
95+     for i in range(self.entity_count):
96+         # Create parent
97+         task = await self._create_task(...)
98+         all_entities.append(task)
99+
100+         # Create comments (source yields them)
101+         for j in range(2):
102+             comment = await self._create_comment(task["id"], ...)
103+             all_entities.append(comment)
104+
105+         # Create file (source yields them)
106+         file = await self._upload_file(task["id"], ...)
107+         all_entities.append(file)
108+
109+     return all_entities  # Returns tasks, comments, AND files
110+ ```
111+
112+ ---
115+
22116## Phase 1: Research & Planning
23117
24118### Step 1: Understand the API
@@ -106,12 +200,42 @@ Decide:
1062003. Add **nested entities** (comments, messages)
1072014. Add **file entities** if the API supports attachments
108202
203+ **Make entities information-rich with embeddable fields.**
204+
109205**Key principles:**
110- - Use `AirweaveField(..., embeddable=True)` for searchable text
206+ - **Use `AirweaveField(..., embeddable=True)` for ~70% of fields.** This is what makes entities semantically searchable.
207+ - Mark ALL user-visible content as `embeddable=True`:
208+   - Text content (descriptions, notes, comments, body)
209+   - Names and titles
210+   - People (assignees, authors, owners)
211+   - Status/metadata (status, priority, tags)
212+   - Timestamps (created_at, modified_at, due_dates)
213+ - Only use `Field()` without embeddable for:
214+   - Internal IDs (entity_id, external_id)
215+   - Binary metadata (sizes, checksums, mime_types)
111216- Always include `created_at` and `modified_at` with proper flags
112- - Use `Field(...)` for non-searchable metadata
113217- Inherit from `ChunkEntity` or `FileEntity`
114218
219+ **Anti-pattern to avoid:**
220+ ```python
221+ # ❌ BAD: Sparse entity with only name embeddable
222+ class TaskEntity(ChunkEntity):
223+     name: str = AirweaveField(..., embeddable=True)
224+     description: str = Field(...)  # ❌ Should be embeddable!
225+     assignee: Dict = Field(...)  # ❌ Should be embeddable!
226+ ```
227+
228+ **Good pattern:**
229+ ```python
230+ # ✅ GOOD: Information-rich entity with most fields embeddable
231+ class TaskEntity(ChunkEntity):
232+     name: str = AirweaveField(..., embeddable=True)
233+     description: str = AirweaveField(..., embeddable=True)  # ✅
234+     assignee: Dict = AirweaveField(..., embeddable=True)  # ✅
235+     status: str = AirweaveField(..., embeddable=True)  # ✅
236+     external_id: str = Field(...)  # ✅ ID not embeddable
237+ ```
238+
115239**Example structure:**
116240
117241```python
@@ -411,26 +535,45 @@ async def generate_comment(model: str, token: str) -> dict:
411535
412536**Reference:** See `monke-testing-guide.mdc` Part 1
413537
414- **This is the most critical file.** It must create ALL entity types.
538+ **The bongo must create ALL entity types that your source connector yields.**
539+
540+ **Before starting:**
541+ 1. Open `backend/airweave/platform/sources/{short_name}.py`
542+ 2. Find the `generate_entities()` method
543+ 3. List EVERY entity type that is yielded:
544+ ```python
545+ # Example from your source:
546+ yield WorkspaceEntity(...) # ← Must create in tests
547+ yield ProjectEntity(...) # ← Must create in tests
548+ yield TaskEntity(...) # ← Must create in tests
549+ yield CommentEntity(...) # ← Must create in tests
550+ yield FileEntity(...) # ← Must create in tests
551+ ```
552+ 4. Your `create_entities()` MUST create instances of ALL these types
415553
416554**Implementation order:**
417555
4185561. Create the class skeleton
4195572. Implement `_ensure_workspace()` and `_ensure_project()` helpers
420- 3. Implement `create_entities()` - ** MUST create all entity types**
558+ 3. **Implement `create_entities()` - MUST create ALL entity types (not just tasks!)**
4215594. Implement `update_entities()`
4225605. Implement `delete_specific_entities()`
4235616. Implement `delete_entities()`
4245627. Implement `cleanup()`
4255638. Add rate limiting and error handling
426564
427- **Critical: Test ALL Entity Types**
565+ **Test ALL Entity Types**
566+
567+ **Validation before proceeding:**
568+ - [ ] Count: How many entity types does your source yield?
569+ - [ ] Count: How many entity types does `create_entities()` create?
570+ - [ ] These numbers MUST match (excluding workspace/project if they're not stored)
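
The count check above amounts to a set difference. This is a minimal sketch under assumptions: the descriptor key `"type"` and the type names are hypothetical, since the exact descriptor shape returned by `create_entities()` is not fixed here. The `ignore` argument covers the workspace/project exclusion.

```python
from typing import Any, Dict, Iterable, List, Set


def missing_entity_types(
    yielded: Iterable[str],
    created: List[Dict[str, Any]],
    ignore: Iterable[str] = (),
) -> Set[str]:
    """Return yielded entity types that no created descriptor covers."""
    created_types = {descriptor["type"] for descriptor in created}
    return set(yielded) - created_types - set(ignore)


# Hypothetical data: the type names and the "type" key are illustrative only.
yielded = {"workspace", "task", "comment", "file"}
created = [
    {"type": "task", "id": "t1"},
    {"type": "comment", "id": "c1"},
]

gaps = missing_entity_types(yielded, created, ignore={"workspace"})
print(sorted(gaps))  # prints ['file']
```

An empty result means every yielded type is exercised; anything left over is an entity type the bongo still needs to create.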
428571
429572```python
430573async def create_entities(self) -> List[Dict[str, Any]]:
431574 """Create comprehensive test entities.
432575
433- CRITICAL: Must create instances of EVERY entity type that
576+ Must create instances of EVERY entity type that
434577 the source connector syncs.
435578 """
436579 all_entities = []
@@ -454,7 +597,7 @@ async def create_entities(self) -> List[Dict[str, Any]]:
454597 })
455598
456599 # ==========================================
457- # CRITICAL: Create child entities
600+ # Create child entities
458601 # ==========================================
459602
460603 # Create 2 comments per task
@@ -568,21 +711,31 @@ Wait for human feedback. If tests fail, review the error logs and fix the code.
568711### Checklist
569712
570713**Source Connector:**
571- - [ ] All entity types are implemented
714+ - [ ] All entity types are implemented in `entities/{short_name}.py`
715+ - [ ] **Semantically relevant entity fields use `AirweaveField(..., embeddable=True)`**
716+   - [ ] All text content fields are embeddable
717+   - [ ] All people fields are embeddable
718+   - [ ] All status/metadata fields are embeddable
719+   - [ ] All timestamps are embeddable
720+   - [ ] Only IDs and binary metadata use `Field()` without embeddable
721+   - [ ] Verified: ~70% of fields are embeddable
572722- [ ] All entities have `created_at` or `modified_at` timestamps
573723- [ ] Token refresh is properly handled
574- - [ ] Rate limiting is implemented
724+ - [ ] Rate limiting is implemented (if API requires it)
575725- [ ] Pagination is handled correctly
576726- [ ] Errors are handled gracefully (don't fail entire sync)
577727- [ ] Breadcrumbs track entity hierarchy
578728- [ ] File entities use `process_file_entity()`
579729- [ ] OAuth config is in `dev.integrations.yaml`
580730
581731**Monke Tests:**
582- - [ ] Bongo creates ALL entity types (not just tasks)
732+ - [ ] **Bongo creates ALL entity types from source**
733+ - [ ] Listed all entity types yielded in source's `generate_entities()`
734+ - [ ] Confirmed `create_entities()` creates each type
735+ - [ ] Entity type count matches between source and tests
583736- [ ] Each entity has unique verification token
584737- [ ] Tokens are embedded in searchable content
585- - [ ] Generation schemas defined for all types
738+ - [ ] Generation schemas defined for all entity types
586739- [ ] Test config has comprehensive test flow
587740- [ ] All entity types are verified after sync
588741- [ ] Update flow tests incremental sync
@@ -687,7 +840,7 @@ Reference the Asana source as an example: @asana.py
687840```
688841Now implement the Monke tests.
689842
690- CRITICAL: The bongo MUST create instances of EVERY entity type, including
843+ The bongo MUST create instances of EVERY entity type, including
691844comments and files, not just the top-level tasks.
692845
693846Start with:
@@ -707,12 +860,18 @@ Building a complete connector requires:
707860
7088611. **Research:** Understand the API, entity hierarchy, auth, and rate limits
7098622. **Entities:** Define schemas for ALL entity types with proper timestamps
863+    - **Mark ~70% of fields as `embeddable=True` for semantic search**
7108643. **Source:** Implement hierarchical entity generation with token refresh
7118654. **Testing:** Create Monke tests that verify EVERY entity type
866+    - **Test ALL entity types your source yields, not just tasks**
7128675. **Validation:** Run E2E tests and verify all entities appear in search
7138686. **Refinement:** Debug issues, optimize performance, handle edge cases
714869
715- **The key to success:** Test comprehensively. Don't just test tasks—test comments, files, and all nested entities. If your connector syncs it, your tests should verify it.
870+ **The two keys to success:**
871+
872+ 1. **Information-Rich Entities**: Mark most fields as `embeddable=True` so users can semantically search your data. Sparse entities with only names embeddable are barely useful.
873+
874+ 2. **Comprehensive Testing**: Don't just test tasks—test comments, files, and all nested entities. If your connector syncs it, your tests must verify it. Count entity types in your source's `generate_entities()` and match that count in your Monke tests.
716875
717876---
718877