
Commit 46d7a4d

Merge pull request #39 from akvo/consistent-metrics-order
Consistent metrics order
2 parents: c1db729 + f64acd3

6 files changed: 41 additions, 35 deletions

.github/workflows/test.yml

Lines changed: 4 additions & 0 deletions

@@ -37,6 +37,10 @@ jobs:
           # Test backend health
           curl -v http://localhost:80/api/health

+      - name: Run backend tests
+        run: |
+          ./backend/test.sh
+
       - name: Show logs if failure
         if: failure()
         run: docker compose logs

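The new step simply shells out to ./backend/test.sh right after the existing curl health check. As a purely hypothetical illustration of the kind of check such a script could drive, here is a standalone Python probe against the same endpoint the workflow already curls; only the URL comes from the workflow above, everything else is assumed:

# check_health.py - illustrative sketch, not part of this commit.
import sys
import urllib.request

HEALTH_URL = "http://localhost:80/api/health"  # same endpoint the workflow curls

def main() -> int:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
            if resp.status == 200:
                print("backend healthy")
                return 0
            print(f"unexpected status: {resp.status}")
    except OSError as exc:  # connection refused, timeout, DNS failure, ...
        print(f"health check failed: {exc}")
    return 1

if __name__ == "__main__":
    sys.exit(main())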
backend/RAG_evaluation/README.md

Lines changed: 8 additions & 8 deletions

@@ -387,14 +387,14 @@ BROWSER_SLOW_MO=1000
 - Reference answers properly entered and used

 **8 Metrics Verified:**
-1. Faithfulness
-2. Answer Relevancy
-3. Context Precision Without Reference
-4. Context Relevancy
-5. Answer Similarity 📚
-6. Answer Correctness 📚
-7. Context Precision 📚
-8. Context Recall 📚
+1.🧠 Faithfulness
+2.🧠 Context Relevancy
+3.Answer Relevancy
+4.🧠 Context Precision Without Reference
+5.🧠📚 Context Recall
+6.🧠📚 Context Precision
+7.📚 Answer Similarity
+8.📚 Answer Correctness

 *(📚 = Reference-based metrics requiring reference answers)*

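For reference, the two markers in the list above describe two overlapping sets of metrics. The sketch below is purely illustrative and only restates the README legend as Python data, using the snake_case keys that appear in the code changes further down:

# Illustrative restatement of the legend above; not part of this commit.
REFERENCE_BASED = {            # 📚 metrics: require a reference answer
    'context_recall', 'context_precision',
    'answer_similarity', 'answer_correctness',
}
CONTEXT_DEPENDENT = {          # 🧠 metrics: require retrieved context
    'faithfulness', 'context_relevancy',
    'context_precision_without_reference',
    'context_recall', 'context_precision',
}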
backend/RAG_evaluation/streamlit_app/components/metrics_explanation.py

Lines changed: 8 additions & 8 deletions

@@ -109,13 +109,13 @@ def get_metric_description(metric_name: str) -> str:
     """
     descriptions = {
         'faithfulness': "How well grounded the response is in the retrieved context",
-        'answer_relevancy': "How relevant the response is to the original query",
         'context_relevancy': "How relevant the retrieved context is to the query",
+        'answer_relevancy': "How relevant the response is to the original query",
         'context_precision_without_reference': "Precision of context retrieval without reference answers",
-        'answer_similarity': "Semantic similarity between generated and reference answers",
-        'answer_correctness': "Factual accuracy against reference answers",
+        'context_recall': "How well retrieved contexts cover the reference answer",
         'context_precision': "More accurate precision using reference answers",
-        'context_recall': "How well retrieved contexts cover the reference answer"
+        'answer_similarity': "Semantic similarity between generated and reference answers",
+        'answer_correctness': "Factual accuracy against reference answers"
     }

     return descriptions.get(metric_name, "Evaluation metric")
@@ -154,13 +154,13 @@ def get_metric_help_text(metric_name: str) -> str:
     """
     help_texts = {
         'faithfulness': "Measures how well the generated answer is supported by the retrieved context. Higher scores indicate better factual consistency.",
-        'answer_relevancy': "Evaluates how well the answer addresses the original question. Higher scores indicate more relevant responses.",
         'context_relevancy': "Assesses the relevance of retrieved context to the query. Higher scores indicate better context retrieval.",
+        'answer_relevancy': "Evaluates how well the answer addresses the original question. Higher scores indicate more relevant responses.",
         'context_precision_without_reference': "Measures precision of context retrieval without requiring reference answers. Higher scores indicate more precise retrieval.",
-        'answer_similarity': "Compares semantic similarity between generated and reference answers. Higher scores indicate closer alignment.",
-        'answer_correctness': "Evaluates factual accuracy against reference answers. Higher scores indicate better correctness.",
+        'context_recall': "Measures how well retrieved contexts cover information in the reference answer. Higher scores indicate better coverage.",
         'context_precision': "More accurate precision measurement using reference answers for comparison. Higher scores indicate better precision.",
-        'context_recall': "Measures how well retrieved contexts cover information in the reference answer. Higher scores indicate better coverage."
+        'answer_similarity': "Compares semantic similarity between generated and reference answers. Higher scores indicate closer alignment.",
+        'answer_correctness': "Evaluates factual accuracy against reference answers. Higher scores indicate better correctness."
     }

     return help_texts.get(metric_name, "RAGAS evaluation metric")

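Because get_metric_description and get_metric_help_text are plain dict lookups, the displayed order is set entirely by whoever iterates the metric names. A minimal usage sketch, assuming import paths that mirror the file layout shown in this commit (streamlit_app.constants and streamlit_app.components.metrics_explanation):

# Sketch only: the import paths are inferred from the file tree, not confirmed.
from streamlit_app.constants import ALL_METRICS
from streamlit_app.components.metrics_explanation import (
    get_metric_description,
    get_metric_help_text,
)

# Iterating ALL_METRICS keeps the UI text in the same canonical order
# that the constants, README and e2e test now share.
for metric in ALL_METRICS:
    print(f"{metric}: {get_metric_description(metric)}")
    print(f"  help: {get_metric_help_text(metric)}")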
backend/RAG_evaluation/streamlit_app/constants.py

Lines changed: 15 additions & 15 deletions

@@ -19,16 +19,16 @@
 # Metric categories
 BASIC_METRICS: List[str] = [
     'faithfulness',
+    'context_relevancy',
     'answer_relevancy',
-    'context_precision_without_reference',
-    'context_relevancy'
+    'context_precision_without_reference'
 ]

 REFERENCE_METRICS: List[str] = [
+    'context_recall',
+    'context_precision',
     'answer_similarity',
-    'answer_correctness',
-    'context_precision',
-    'context_recall'
+    'answer_correctness'
 ]

 ALL_METRICS: List[str] = BASIC_METRICS + REFERENCE_METRICS
@@ -189,24 +189,24 @@
 SHORT_METRICS_EXPLANATIONS: Dict[str, str] = {
     'reference_free': """
 **Reference-Free Metrics** (work without reference answers):
-- **Faithfulness** 🧠: How well grounded the response is in the retrieved context
+- **🧠 Faithfulness**: How well grounded the response is in the retrieved context
+- **🧠 Context Relevancy**: How relevant the retrieved context is to the query
 - **Answer Relevancy**: How relevant the response is to the original query
-- **Context Relevancy** 🧠: How relevant the retrieved context is to the query
-- **Context Precision Without Reference** 🧠: Precision of context retrieval without reference answers
+- **🧠 Context Precision Without Reference**: Precision of context retrieval without reference answers

 **Reference-Based Metrics** (require reference answers for comparison):
-- **Answer Similarity** 📚: Semantic similarity between generated and reference answers
-- **Answer Correctness** 📚: Factual accuracy against reference answers
-- **Context Precision** 🧠📚: More accurate precision using reference answers
-- **Context Recall** 🧠📚: How well retrieved contexts cover the reference answer
+- **🧠📚 Context Recall**: How well retrieved contexts cover the reference answer
+- **🧠📚 Context Precision**: More accurate precision using reference answers
+- **📚 Answer Similarity**: Semantic similarity between generated and reference answers
+- **📚 Answer Correctness**: Factual accuracy against reference answers

 🧠 = Context-dependent | 📚 = Reference-based | *All metrics range from 0.0 to 1.0, with higher scores indicating better performance.*
 """,
     'basic_only': """
 **Context-dependent metrics** 🧠 require retrieved context/documents:
-- **Faithfulness**: How well grounded the response is in the retrieved context
-- **Context Relevancy**: How relevant the retrieved context is to the query
-- **Context Precision Without Reference**: Precision of context retrieval without reference answers
+- **🧠 Faithfulness**: How well grounded the response is in the retrieved context
+- **🧠 Context Relevancy**: How relevant the retrieved context is to the query
+- **🧠 Context Precision Without Reference**: Precision of context retrieval without reference answers

 **Response-only metrics** evaluate the generated response quality:
 - **Answer Relevancy**: How relevant the response is to the original query

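Since ALL_METRICS is just the concatenation of the two lists, every consumer that iterates it inherits the canonical order. A tiny, hypothetical guard against the order drifting again might look like this (the import path is assumed from the file layout; the expected order itself is taken from the diff above):

# Illustrative sanity check; not part of this commit.
from streamlit_app.constants import ALL_METRICS  # path assumed

CANONICAL_ORDER = [
    'faithfulness',
    'context_relevancy',
    'answer_relevancy',
    'context_precision_without_reference',
    'context_recall',
    'context_precision',
    'answer_similarity',
    'answer_correctness',
]

assert ALL_METRICS == CANONICAL_ORDER, "metric order drifted from the canonical order"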
backend/RAG_evaluation/tests/test_eight_metrics_e2e.py

Lines changed: 3 additions & 3 deletions

@@ -326,13 +326,13 @@ async def _verify_eight_metrics(self, page):
         # Search for actual display names as they appear in the Streamlit UI
         expected_metrics = [
             ('faithfulness', 'Faithfulness'),
-            ('answer_relevancy', 'Answer Relevancy'),
             ('context_relevancy', 'Context Relevancy'),
+            ('answer_relevancy', 'Answer Relevancy'),
             ('context_precision_without_reference', 'Context Precision Without Reference'),
+            ('context_recall', 'Context Recall'),
             ('context_precision', 'Context Precision'),
             ('answer_similarity', 'Answer Similarity'),
-            ('answer_correctness', 'Answer Correctness'),
-            ('context_recall', 'Context Recall')
+            ('answer_correctness', 'Answer Correctness')
         ]

         found_metrics = []

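The loop that fills found_metrics falls outside this hunk. As a self-contained sketch of the kind of matching it presumably performs, the snippet below checks which display names occur in captured page text; only the expected_metrics pairs come from the test, the helper and sample text are assumptions:

# Standalone sketch: which metric display names appear in some page text?
expected_metrics = [
    ('faithfulness', 'Faithfulness'),
    ('context_relevancy', 'Context Relevancy'),
    ('answer_relevancy', 'Answer Relevancy'),
    ('context_precision_without_reference', 'Context Precision Without Reference'),
    ('context_recall', 'Context Recall'),
    ('context_precision', 'Context Precision'),
    ('answer_similarity', 'Answer Similarity'),
    ('answer_correctness', 'Answer Correctness'),
]

def find_metrics(page_text: str) -> list[str]:
    """Return the metric keys whose display names appear in the page text."""
    return [key for key, display in expected_metrics if display in page_text]

# Example with stand-in page text:
sample = "Faithfulness 0.91 | Context Relevancy 0.78 | Answer Relevancy 0.85"
print(find_metrics(sample))  # ['faithfulness', 'context_relevancy', 'answer_relevancy']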
backend/entrypoint.sh

Lines changed: 3 additions & 1 deletion

@@ -4,7 +4,9 @@
 set -e

 echo "Waiting for MySQL..."
-while ! nc -z db 3306; do
+DB_HOST=${MYSQL_SERVER:-db}
+DB_PORT=${MYSQL_PORT:-3306}
+while ! nc -z $DB_HOST $DB_PORT; do
   sleep 1
 done
 echo "MySQL started"

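For readers less familiar with nc -z: it exits successfully as soon as a TCP connection to the given host and port can be opened, so the loop above simply polls until MySQL is reachable, now with the host and port taken from MYSQL_SERVER and MYSQL_PORT instead of being hard-coded. A rough Python equivalent of the same loop, with the same environment-variable defaults, purely for illustration:

# Illustrative Python equivalent of the entrypoint's wait loop.
import os
import socket
import time

db_host = os.environ.get("MYSQL_SERVER", "db")        # mirrors ${MYSQL_SERVER:-db}
db_port = int(os.environ.get("MYSQL_PORT", "3306"))   # mirrors ${MYSQL_PORT:-3306}

print("Waiting for MySQL...")
while True:
    try:
        with socket.create_connection((db_host, db_port), timeout=1):
            break  # the port is accepting connections, like a successful nc -z
    except OSError:
        time.sleep(1)
print("MySQL started")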