Skip to content

Commit

Permalink
Merge pull request #375 from wangxinbiao/main
Browse files Browse the repository at this point in the history
fix:optimization of QA splitting
  • Loading branch information
nkwangleiGIT authored Dec 19, 2023
2 parents 7a8279f + cc220f5 commit 016c6dd
Show file tree
Hide file tree
Showing 26 changed files with 1,555 additions and 910 deletions.
26 changes: 2 additions & 24 deletions data-processing/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -10,35 +10,13 @@ RUN export DEBIAN_FRONTEND=noninteractive \
&& ln -fs /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
&& dpkg-reconfigure --frontend noninteractive tzdata \
&& apt-get install -y python3-distutils curl python3-pip \
&& apt-get install -y wget
&& apt-get install -y wget \
&& apt-get install -y opencc

RUN wget https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-3.5.0/zh_core_web_sm-3.5.0-py3-none-any.whl -O /tmp/zh_core_web_sm-3.5.0-py3-none-any.whl \
&& pip3 install /tmp/zh_core_web_sm-3.5.0-py3-none-any.whl -i https://pypi.org/simple \
&& rm /tmp/zh_core_web_sm-3.5.0-py3-none-any.whl


ENV MINIO_ACCESSKEY=minio_accesskey
ENV MINIO_SECRETKEY=minio_secretkey
ENV MINIO_API_URL=localhost:9000
ENV MINIO_SECURE=False
ENV MINIO_DATASET_PREFIX=dataset

ENV LLM_USE_TYPE=xxxxx
ENV LLM_QA_RETRY_COUNT=xxxxx
ENV OPEN_AI_DEFAULT_KEY=xxxxx
ENV OPEN_AI_DEFAULT_BASE_URL=xxxxx
ENV OPEN_AI_DEFAULT_MODEL=xxxxx
ENV ZHIPUAI_API_KEY=xxxxx

ENV KNOWLEDGE_CHUNK_SIZE=500
ENV KNOWLEDGE_CHUNK_OVERLAP=50

ENV PG_HOST=localhost
ENV PG_PORT=5432
ENV PG_USER=postgres
ENV PG_PASSWORD=xxxxx
ENV PG_DATABASE=data_process

EXPOSE 28888

ADD . /arcadia_app/
Expand Down
123 changes: 123 additions & 0 deletions data-processing/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
# Data Processing

## Current Version Main Features

Data Processing is used for data processing through MinIO, databases, Web APIs, etc. The data types handled include:
- txt
- json
- doc
- html
- excel
- csv
- pdf
- markdown
- ppt

### Current Text Type Processing

The data processing process includes: cleaning abnormal data, filtering, de-duplication, and anonymization.

## Design

![Design](../assets/data_process.drawio.png)

## Local Development
### Software Requirements

Before setting up the local data-process environment, please make sure the following software is installed:

- Python 3.10.x

### Environment Setup

Install the Python dependencies in the requirements.txt file

### Running

Run the server.py file in the data_manipulation directory

# isort
isort is a tool for sorting imports alphabetically within your Python code. It helps maintain a consistent and clean import order.

## install
```shell
pip install isort
```

## isort a file
```shell
isort server.py
```

## isort a directory
```shell
isort data_manipulation
```


# config.yml
## dev phase
The example config.yml is as the following:
```yaml
minio:
access_key: '${MINIO_ACCESSKEY: hpU4SCmj5jixxx}'
secret_key: '${MINIO_SECRETKEY: xxx}'
api_url: '${MINIO_API_URL: 172.22.96.136.nip.io}'
secure: '${MINIO_SECURE: True}'
dataset_prefix: '${MINIO_DATASET_PREFIX: dataset}'

llm:
qa_retry_count: '${LLM_QA_RETRY_COUNT: 100}'

knowledge:
chunk_size: '${KNOWLEDGE_CHUNK_SIZE: 500}'
chunk_overlap: '${KNOWLEDGE_CHUNK_OVERLAP: 50}'

backendPg:
host: '${PG_HOST: localhost}'
port: '${PG_PORT: 5432}'
user: '${PG_USER: postgres}'
password: '${PG_PASSWORD: 123456}'
database: '${PG_DATABASE: arcadia}'

kubernetes:
default_config: '${DEFAULT_CONFIG: arcadia-config}'
pod_namespace: '${POD_NAMESPACE: arcadia}'
```
\${MINIO_ACCESSKEY: hpU4SCmj5jixxx}
MINIO_ACCESSKEY is the environment variable name.
hpU4SCmj5jixxx is the default value if the environment variable is not set.
## release phase
The example config.yml is as the following:
```yaml
minio:
access_key: 'hpU4SCmj5jixxx'
secret_key: 'minio_sk'
api_url: '172.22.96.136.nip.io'
secure: 'True'
dataset_prefix: 'dataset'

llm:
qa_retry_count: '100'

knowledge:
chunk_size: '500'
chunk_overlap: '50'

backendPg:
host: 'localhost'
port: '5432'
user: 'admin'
password: '123456'
database: 'arcadia'

kubernetes:
default_config: 'arcadia-config'
pod_namespace: 'arcadia'
```
In the K8s, you can use the config map to point to the /arcadia_app/data_manipulation/config.yml file.
Loading

0 comments on commit 016c6dd

Please sign in to comment.