
Commit 8751f05

Merge pull request #80 from stanford-oval/dev-python-pkg

Wrap the project as a python package and support pip install

2 parents f9ffc0d + 3fa0e0e · commit 8751f05

27 files changed (+346, −243 lines)

README.md

Lines changed: 97 additions & 71 deletions
```diff
@@ -6,15 +6,16 @@
 
 <p align="center">
 | <a href="http://storm.genie.stanford.edu"><b>Research preview</b></a> | <a href="https://arxiv.org/abs/2402.14207"><b>Paper</b></a> | <a href="https://storm-project.stanford.edu/"><b>Website</b></a> |
 </p>
 
 **Latest News** 🔥
 
+- [2024/07] You can now install our package with `pip install knowledge-storm`!
 - [2024/07] We add `VectorRM` to support grounding on user-provided documents, complementing the existing support for search engines (`YouRM`, `BingSearch`). (check out [#58](https://github.com/stanford-oval/storm/pull/58))
 - [2024/07] We release demo light for developers: a minimal user interface built with the Streamlit framework in Python, handy for local development and demo hosting (check out [#54](https://github.com/stanford-oval/storm/pull/54))
 - [2024/06] We will present STORM at NAACL 2024! Find us at Poster Session 2 on June 17 or check our [presentation material](assets/storm_naacl2024_slides.pdf).
-- [2024/05] We add Bing Search support in [rm.py](src/rm.py). Test STORM with `GPT-4o` - we now configure the article generation part in our demo using `GPT-4o` model.
-- [2024/04] We release refactored version of STORM codebase! We define [interface](src/interface.py) for STORM pipeline and reimplement STORM-wiki (check out [`src/storm_wiki`](src/storm_wiki)) to demonstrate how to instantiate the pipeline. We provide API to support customization of different language models and retrieval/search integration.
+- [2024/05] We add Bing Search support in [rm.py](knowledge_storm/rm.py). Test STORM with `GPT-4o` - we now configure the article generation part in our demo using the `GPT-4o` model.
+- [2024/04] We release a refactored version of the STORM codebase! We define an [interface](knowledge_storm/interface.py) for the STORM pipeline and reimplement STORM-wiki (check out [`knowledge_storm/storm_wiki`](knowledge_storm/storm_wiki)) to demonstrate how to instantiate the pipeline. We provide an API to support customization of different language models and retrieval/search integration.
 
 ## Overview [(Try STORM now!)](https://storm.genie.stanford.edu/)
```
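The headline change is the pip-installable package. As a quick post-install sanity check, the public names used throughout this commit's diffs should be importable straight from the installed package; note the `VectorRM` import path is an inference from the component list in the updated README, not something this diff shows directly:

```python
# Quick sanity check after `pip install knowledge-storm`.
# All names below appear in this commit's diffs; the VectorRM import path is
# inferred from the README's retrieval component list (an assumption).
from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
from knowledge_storm.lm import OpenAIModel, AzureOpenAIModel, ClaudeModel
from knowledge_storm.rm import YouRM, BingSearch, VectorRM

print(STORMWikiRunner.__module__)  # should resolve from site-packages, not a local ./src
```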

````diff
@@ -46,25 +47,89 @@ Based on the separation of the two stages, STORM is implemented in a highly modu
 
 
 
-## Getting started
+## Installation
 
-### 1. Setup
 
-Below, we provide a quick start guide to run STORM locally.
+To install the `knowledge-storm` library, run `pip install knowledge-storm`.
 
+You can also install from source, which allows you to modify the behavior of the STORM engine directly.
 1. Clone the git repository.
-```shell
-git clone https://github.com/stanford-oval/storm.git
-cd storm
-```
+   ```shell
+   git clone https://github.com/stanford-oval/storm.git
+   cd storm
+   ```
 
 2. Install the required packages.
    ```shell
    conda create -n storm python=3.11
    conda activate storm
    pip install -r requirements.txt
    ```
-3. Set up OpenAI API key (if you want to use OpenAI models to power STORM) and [You.com search API](https://api.you.com/) key. Create a file `secrets.toml` under the root directory and add the following content:
+
+
+## API
+
+The STORM knowledge curation engine is defined as a simple Python `STORMWikiRunner` class.
+
+Since STORM works at the information curation layer, you need to set up an information retrieval module and a language model module to create a `STORMWikiRunner` instance. Here is an example using the You.com search engine and OpenAI models:
+```python
+import os
+from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
+from knowledge_storm.lm import OpenAIModel
+from knowledge_storm.rm import YouRM
+
+lm_configs = STORMWikiLMConfigs()
+openai_kwargs = {
+    'api_key': os.getenv("OPENAI_API_KEY"),
+    'temperature': 1.0,
+    'top_p': 0.9,
+}
+# STORM is a LM system, so different components can be powered by different models
+# to reach a good balance between cost and quality.
+# As a good practice, choose a cheaper/faster model for `conv_simulator_lm`, which is
+# used to split queries and synthesize answers in the conversation.
+# Choose a more powerful model for `article_gen_lm` to generate verifiable text with citations.
+gpt_35 = OpenAIModel(model='gpt-3.5-turbo', max_tokens=500, **openai_kwargs)
+gpt_4 = OpenAIModel(model='gpt-4o', max_tokens=3000, **openai_kwargs)
+lm_configs.set_conv_simulator_lm(gpt_35)
+lm_configs.set_question_asker_lm(gpt_35)
+lm_configs.set_outline_gen_lm(gpt_4)
+lm_configs.set_article_gen_lm(gpt_4)
+lm_configs.set_article_polish_lm(gpt_4)
+# Check out the STORMWikiRunnerArguments class for more configurations.
+engine_args = STORMWikiRunnerArguments(...)
+rm = YouRM(ydc_api_key=os.getenv('YDC_API_KEY'), k=engine_args.search_top_k)
+runner = STORMWikiRunner(engine_args, lm_configs, rm)
+```
+
+Currently, our package supports:
+- `OpenAIModel`, `AzureOpenAIModel`, `ClaudeModel`, `VLLMClient`, `TGIClient`, `TogetherClient`, `OllamaClient` as language model components
+- `YouRM`, `BingSearch`, `VectorRM` as retrieval module components
+
+:star2: **PRs for integrating more language models into [knowledge_storm/lm.py](knowledge_storm/lm.py) and search engines/retrievers into [knowledge_storm/rm.py](knowledge_storm/rm.py) are highly appreciated!**
+
````
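Because the retriever is just a constructor argument to `STORMWikiRunner`, swapping search backends is a local change. A sketch continuing the example above; the keyword names below are assumptions modeled on the `YouRM` call, so check `knowledge_storm/rm.py` and `STORMWikiRunnerArguments` for the real signatures:

```python
import os
from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner
from knowledge_storm.rm import BingSearch

# Continuing the example above (lm_configs as constructed there).
# `output_dir` and `bing_search_api_key` are assumed parameter names modeled
# on the surrounding examples; consult the real signatures before relying on them.
engine_args = STORMWikiRunnerArguments(output_dir='./results')
rm = BingSearch(bing_search_api_key=os.getenv('BING_SEARCH_API_KEY'), k=engine_args.search_top_k)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
```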
````diff
+The `STORMWikiRunner` instance can be invoked with the simple `run` method:
+```python
+topic = input('Topic: ')
+runner.run(
+    topic=topic,
+    do_research=True,
+    do_generate_outline=True,
+    do_generate_article=True,
+    do_polish_article=True,
+)
+runner.post_run()
+runner.summary()
+```
+- `do_research`: if True, simulate conversations with different perspectives to collect information about the topic; otherwise, load the results.
+- `do_generate_outline`: if True, generate an outline for the topic; otherwise, load the results.
+- `do_generate_article`: if True, generate an article for the topic based on the outline and the collected information; otherwise, load the results.
+- `do_polish_article`: if True, polish the article by adding a summarization section and (optionally) removing duplicate content; otherwise, load the results.
````
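Because every stage is independently toggleable and falls back to loading prior results, an expensive research pass can be reused across iterations. A sketch continuing the `runner` from the example above; it assumes cached intermediates are resolved from the runner's output directory, which the flag descriptions imply but do not spell out:

```python
# Staged reruns using the flags documented above (assumption: disabled stages
# load their cached results from the runner's output directory).
topic = 'knowledge curation'

# First pass: run the full pipeline and persist every intermediate result.
runner.run(topic=topic, do_research=True, do_generate_outline=True,
           do_generate_article=True, do_polish_article=True)

# Later pass: reuse the cached research and outline, regenerate only the
# article stages (useful when iterating on the generation LMs).
runner.run(topic=topic, do_research=False, do_generate_outline=False,
           do_generate_article=True, do_polish_article=True)
runner.post_run()
```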
````diff
+
+
+## Quick Start with Example Scripts
+
+We provide scripts in our [examples folder](examples) as a quick start to run STORM with different configurations.
+
+**To run STORM with `gpt` family models with default configurations:**
+1. We suggest using `secrets.toml` to set up the API keys. Create a file `secrets.toml` under the root directory and add the following content:
 ```shell
 # Set up OpenAI API key.
 OPENAI_API_KEY="your_openai_api_key"
````
````diff
@@ -77,74 +142,31 @@ Below, we provide a quick start guide to run STORM locally.
 # Set up You.com search API key.
 YDC_API_KEY="your_youcom_api_key"
 ```
+2. Run the following command.
+   ```
+   python examples/run_storm_wiki_gpt.py \
+       --output-dir $OUTPUT_DIR \
+       --retriever you \
+       --do-research \
+       --do-generate-outline \
+       --do-generate-article \
+       --do-polish-article
+   ```
````
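Outside the example scripts, the same `secrets.toml` can be loaded with the helper the scripts themselves use (visible in the file diffs below). The environment-variable behavior is an inference from the scripts' subsequent `os.getenv` calls, not documented behavior:

```python
import os
from knowledge_storm.utils import load_api_key

# Load keys from secrets.toml. Judging by the os.getenv() calls that follow
# this line in the example scripts, the helper appears to export each entry
# as an environment variable (an inference, not a documented guarantee).
load_api_key(toml_file_path='secrets.toml')
print(bool(os.getenv('OPENAI_API_KEY')))  # True once the key is loaded
```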
````diff
 
+**To run STORM using your favorite language models or grounding on your own corpus:** Check out [examples/README.md](examples/README.md).
 
-### 2. Running STORM-wiki locally
-
-**To run STORM with `gpt` family models with default configurations**: Make sure you have set up the OpenAI API key and run the following command.
-
-```
-python examples/run_storm_wiki_gpt.py \
-    --output-dir $OUTPUT_DIR \
-    --retriever you \
-    --do-research \
-    --do-generate-outline \
-    --do-generate-article \
-    --do-polish-article
-```
-- `--do-research`: if True, simulate conversation to research the topic; otherwise, load the results.
-- `--do-generate-outline`: If True, generate an outline for the topic; otherwise, load the results.
-- `--do-generate-article`: If True, generate an article for the topic; otherwise, load the results.
-- `--do-polish-article`: If True, polish the article by adding a summarization section and (optionally) removing duplicate content.
-
-
-We provide more example scripts under [`examples`](examples) to demonstrate how you can run STORM using your favorite language models or grounding on your own corpus.
-
-
-## Customize STORM
 
-### Customization of the Pipeline
+## Customization of the Pipeline
 
-Besides running scripts in `examples`, you can customize STORM based on your own use case. STORM engine consists of 4 modules:
+If you have installed the source code, you can customize STORM based on your own use case. The STORM engine consists of 4 modules:
 
 1. Knowledge Curation Module: Collects a broad coverage of information about the given topic.
 2. Outline Generation Module: Organizes the collected information by generating a hierarchical outline for the curated knowledge.
 3. Article Generation Module: Populates the generated outline with the collected information.
 4. Article Polishing Module: Refines and enhances the written article for better presentation.
 
-The interface for each module is defined in `src/interface.py`, while their implementations are instantiated in `src/storm_wiki/modules/*`. These modules can be customized according to your specific requirements (e.g., generating sections in bullet point format instead of full paragraphs).
-
-:star2: **You can share your customization of `Engine` by making PRs to this repo!**
-
-### Customization of Retriever Module
-
-As a knowledge curation engine, STORM grabs information from the Retriever module. The Retriever modules are implemented in [`src/rm.py`](src/rm.py). Currently, STORM supports the following retrievers:
+The interface for each module is defined in `knowledge_storm/interface.py`, while their implementations are instantiated in `knowledge_storm/storm_wiki/modules/*`. These modules can be customized according to your specific requirements (e.g., generating sections in bullet point format instead of full paragraphs).
 
````
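To make that customization hook concrete, here is a hypothetical sketch of a bullet-point article generator. The base-class name follows the interface file named above, but the method name, signature, and helper accessors are all assumptions, not the package's confirmed API:

```python
# Hypothetical sketch only. The method name, signature, and the outline /
# information_table accessors are assumptions; consult knowledge_storm/interface.py
# and knowledge_storm/storm_wiki/modules/ for the real definitions.
from knowledge_storm.interface import ArticleGenerationModule  # class name assumed


class BulletPointArticleGeneration(ArticleGenerationModule):
    """Generate each section as bullet points instead of full paragraphs."""

    def generate_article(self, topic, information_table, outline, **kwargs):
        sections = []
        for heading in outline.get_headings():           # accessor assumed
            facts = information_table.retrieve(heading)  # accessor assumed
            bullets = '\n'.join(f'- {fact}' for fact in facts)
            sections.append(f'## {heading}\n{bullets}')
        return '\n\n'.join(sections)
```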

```diff
-- `YouRM`: You.com search engine API
-- `BingSearch`: Bing Search API
-- `VectorRM`: a retrieval model that retrieves information from user provide corpus
-
-:star2: **PRs for integrating more search engines/retrievers are highly appreciated!**
-
-### Customization of Language Models
-
-STORM provides the following language model implementations in [`src/lm.py`](src/lm.py):
-
-- `OpenAIModel`
-- `ClaudeModel`
-- `VLLMClient`
-- `TGIClient`
-- `TogetherClient`
-
-:star2: **PRs for integrating more language model clients are highly appreciated!**
-
-:bulb: **For a good practice,**
-
-- choose a cheaper/faster model for `conv_simulator_lm` which is used to split queries, synthesize answers in the conversation.
-- if you need to conduct the actual writing step, choose a more powerful model for `article_gen_lm`. Based on our experiments, weak models are bad at generating text with citations.
-- for open models, adding one-shot example can help it better follow instructions.
-
-Please refer to the scripts in the [`examples`](examples) directory for concrete guidance on customizing the language model used in the pipeline.
 
```
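The deleted :bulb: guidance lives on in the new API's comments: pair a cheap, fast model with the conversation simulator and a strong model with article generation. A sketch of such a mixed configuration; the `ClaudeModel` keyword arguments are assumptions modeled on the `OpenAIModel` calls in this diff, so check `knowledge_storm/lm.py` for the real signature:

```python
import os
from knowledge_storm import STORMWikiLMConfigs
from knowledge_storm.lm import OpenAIModel, ClaudeModel

lm_configs = STORMWikiLMConfigs()
# Cheap/fast model for the conversation-simulation side of the pipeline.
conv_lm = OpenAIModel(model='gpt-3.5-turbo', max_tokens=500,
                      api_key=os.getenv('OPENAI_API_KEY'), temperature=1.0, top_p=0.9)
# Stronger model for citation-bearing writing. ClaudeModel's kwargs here are
# assumptions mirroring OpenAIModel's; verify against knowledge_storm/lm.py.
article_lm = ClaudeModel(model='claude-3-opus-20240229', max_tokens=3000,
                         api_key=os.getenv('ANTHROPIC_API_KEY'))
lm_configs.set_conv_simulator_lm(conv_lm)
lm_configs.set_question_asker_lm(conv_lm)
lm_configs.set_outline_gen_lm(article_lm)
lm_configs.set_article_gen_lm(article_lm)
lm_configs.set_article_polish_lm(article_lm)
```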

```diff
 
 ## Replicate NAACL2024 result
```

```diff
@@ -157,7 +179,7 @@ Please switch to the branch `NAACL-2024-code-backup`
 
 The FreshWiki dataset used in our experiments can be found in [./FreshWiki](FreshWiki).
 
-Run the following commands under [./src](src).
+Run the following commands under [./knowledge_storm](knowledge_storm).
 
 #### Pre-writing Stage
 For batch experiment on FreshWiki dataset:
```
```diff
@@ -196,7 +218,7 @@ python -m scripts.run_writing --input-source console --engine gpt-4 --do-polish-
 The generated article will be saved in `{output_dir}/{topic}/storm_gen_article.txt` and the references corresponding to citation index will be saved in `{output_dir}/{topic}/url_to_info.json`. If `--do-polish-article` is set, the polished article will be saved in `{output_dir}/{topic}/storm_gen_article_polished.txt`.
 
 ### Customize the STORM Configurations
-We set up the default LLM configuration in `LLMConfigs` in [src/modules/utils.py](src/modules/utils.py). You can use `set_conv_simulator_lm()`,`set_question_asker_lm()`, `set_outline_gen_lm()`, `set_article_gen_lm()`, `set_article_polish_lm()` to override the default configuration. These functions take in an instance from `dspy.dsp.LM` or `dspy.dsp.HFModel`.
+We set up the default LLM configuration in `LLMConfigs` in [knowledge_storm/modules/utils.py](knowledge_storm/modules/utils.py). You can use `set_conv_simulator_lm()`, `set_question_asker_lm()`, `set_outline_gen_lm()`, `set_article_gen_lm()`, `set_article_polish_lm()` to override the default configuration. These functions take in an instance from `dspy.dsp.LM` or `dspy.dsp.HFModel`.
 
 
 ### Automatic Evaluation
```
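For the legacy NAACL-2024 branch described here, the override hooks take `dspy` LM instances directly. A sketch under stated assumptions: the `LLMConfigs` import path follows that branch's `src/modules/utils.py` layout, and `dspy.OpenAI` stands in for any `dspy.dsp.LM` subclass:

```python
import dspy
from modules.utils import LLMConfigs  # import path assumed from the branch's src/ layout

# Swap the article-generation LM for any dspy.dsp.LM (or dspy.dsp.HFModel)
# instance, per the paragraph above. dspy.OpenAI is one such wrapper.
llm_configs = LLMConfigs()
llm_configs.set_article_gen_lm(dspy.OpenAI(model='gpt-4', max_tokens=700))
```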
```diff
@@ -224,7 +246,11 @@ For rubric grading, we use the [prometheus-13b-v1.0](https://huggingface.co/prom
 
 </details>
 
-## Contributions
+## Roadmap & Contributions
+Our team is actively working on:
+1. Human-in-the-Loop Functionalities: Supporting user participation in the knowledge curation process.
+2. Information Abstraction: Developing abstractions for curated information to support presentation formats beyond the Wikipedia-style report.
+
 If you have any questions or suggestions, please feel free to open an issue or pull request. We welcome contributions to improve the system and the codebase!
 
 Contact person: [Yijia Shao](mailto:[email protected]) and [Yucheng Jiang](mailto:[email protected])
```

examples/run_storm_wiki_claude.py

Lines changed: 5 additions & 7 deletions

```diff
@@ -17,14 +17,12 @@
 """
 
 import os
-import sys
 from argparse import ArgumentParser
 
-sys.path.append('./src')
-from lm import ClaudeModel
-from rm import YouRM, BingSearch
-from storm_wiki.engine import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
-from utils import load_api_key
+from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
+from knowledge_storm.lm import ClaudeModel
+from knowledge_storm.rm import YouRM, BingSearch
+from knowledge_storm.utils import load_api_key
 
 
 def main(args):
@@ -116,4 +114,4 @@ def main(args):
     parser.add_argument('--remove-duplicate', action='store_true',
                         help='If True, remove duplicate content from the article.')
 
-    main(parser.parse_args())
+    main(parser.parse_args())
```

(The final `-`/`+` pair is textually identical; the change is a newline-at-end-of-file fix.)

examples/run_storm_wiki_gpt.py

Lines changed: 19 additions & 15 deletions

```diff
@@ -20,38 +20,42 @@
 """
 
 import os
-import sys
 from argparse import ArgumentParser
 
-sys.path.append('./src')
-from lm import OpenAIModel
-from rm import YouRM, BingSearch
-from storm_wiki.engine import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
-from utils import load_api_key
+from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
+from knowledge_storm.lm import OpenAIModel, AzureOpenAIModel
+from knowledge_storm.rm import YouRM, BingSearch
+from knowledge_storm.utils import load_api_key
 
 
 def main(args):
     load_api_key(toml_file_path='secrets.toml')
     lm_configs = STORMWikiLMConfigs()
     openai_kwargs = {
         'api_key': os.getenv("OPENAI_API_KEY"),
-        'api_provider': os.getenv('OPENAI_API_TYPE'),
         'temperature': 1.0,
         'top_p': 0.9,
-        'api_base': os.getenv('AZURE_API_BASE'),
-        'api_version': os.getenv('AZURE_API_VERSION'),
     }
 
+    ModelClass = OpenAIModel if os.getenv('OPENAI_API_TYPE') == 'openai' else AzureOpenAIModel
+    # If you are using Azure service, make sure the model name matches your own deployed model name.
+    # The default name here is only used for demonstration and may not match your case.
+    gpt_35_model_name = 'gpt-3.5-turbo' if os.getenv('OPENAI_API_TYPE') == 'openai' else 'gpt-35-turbo'
+    gpt_4_model_name = 'gpt-4o'
+    if os.getenv('OPENAI_API_TYPE') == 'azure':
+        openai_kwargs['api_base'] = os.getenv('AZURE_API_BASE')
+        openai_kwargs['api_version'] = os.getenv('AZURE_API_VERSION')
+
     # STORM is a LM system so different components can be powered by different models.
     # For a good balance between cost and quality, you can choose a cheaper/faster model for conv_simulator_lm
     # which is used to split queries, synthesize answers in the conversation. We recommend using stronger models
     # for outline_gen_lm which is responsible for organizing the collected information, and article_gen_lm
     # which is responsible for generating sections with citations.
-    conv_simulator_lm = OpenAIModel(model='gpt-3.5-turbo', max_tokens=500, **openai_kwargs)
-    question_asker_lm = OpenAIModel(model='gpt-3.5-turbo', max_tokens=500, **openai_kwargs)
-    outline_gen_lm = OpenAIModel(model='gpt-4-0125-preview', max_tokens=400, **openai_kwargs)
-    article_gen_lm = OpenAIModel(model='gpt-4-0125-preview', max_tokens=700, **openai_kwargs)
-    article_polish_lm = OpenAIModel(model='gpt-4-0125-preview', max_tokens=4000, **openai_kwargs)
+    conv_simulator_lm = ModelClass(model=gpt_35_model_name, max_tokens=500, **openai_kwargs)
+    question_asker_lm = ModelClass(model=gpt_35_model_name, max_tokens=500, **openai_kwargs)
+    outline_gen_lm = ModelClass(model=gpt_4_model_name, max_tokens=400, **openai_kwargs)
+    article_gen_lm = ModelClass(model=gpt_4_model_name, max_tokens=700, **openai_kwargs)
+    article_polish_lm = ModelClass(model=gpt_4_model_name, max_tokens=4000, **openai_kwargs)
 
     lm_configs.set_conv_simulator_lm(conv_simulator_lm)
     lm_configs.set_question_asker_lm(question_asker_lm)
@@ -122,4 +126,4 @@ def main(args):
     parser.add_argument('--remove-duplicate', action='store_true',
                         help='If True, remove duplicate content from the article.')
 
-    main(parser.parse_args())
+    main(parser.parse_args())
```
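The provider switch this diff adds keys everything off `OPENAI_API_TYPE`. A self-contained sketch of how the selection logic above resolves, with the class and model names echoed as strings purely for illustration:

```python
import os

def resolve_models(provider: str) -> tuple[str, str, str]:
    """Mirror the selection logic this commit adds to run_storm_wiki_gpt.py."""
    os.environ['OPENAI_API_TYPE'] = provider
    model_class = 'OpenAIModel' if os.getenv('OPENAI_API_TYPE') == 'openai' else 'AzureOpenAIModel'
    gpt_35 = 'gpt-3.5-turbo' if os.getenv('OPENAI_API_TYPE') == 'openai' else 'gpt-35-turbo'
    gpt_4 = 'gpt-4o'  # same default on both providers; must match your Azure deployment name
    return model_class, gpt_35, gpt_4

print(resolve_models('openai'))  # ('OpenAIModel', 'gpt-3.5-turbo', 'gpt-4o')
print(resolve_models('azure'))   # ('AzureOpenAIModel', 'gpt-35-turbo', 'gpt-4o')
```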
