Commit b47d2a8

Merge branch 'main' into stable

breakthewall committed Jan 15, 2025
2 parents 1413635 + 6ea2189

Showing 24 changed files with 629 additions and 438 deletions.
7 changes: 7 additions & 0 deletions environment.yaml
@@ -9,3 +9,10 @@ dependencies:
- openpyxl
- statsmodels
- matplotlib-base
- scikit-learn
- seaborn
- xgboost
- scipy
- ipython
- ipywidgets
- glob2
122 changes: 122 additions & 0 deletions icfree/learner/README.md
@@ -0,0 +1,122 @@
# Configuration Documentation

## Dataframe Structure

### Rows
- Each row in the dataframe represents an experiment, with feature values X (e.g., concentrations/volumes of cell-free mixes) followed by labels y (e.g., yields).
- **Example**: Row 0 contains the features and labels for Experiment 1.

### Columns
- **Feature Columns (X):**
- The first set of columns contains feature values. These are numerical or categorical variables used as input for modeling.
- **Example**: `feature1`, `feature2`, ..., `featureN`
- **Label Columns (y):**
- The subsequent columns represent the labels or target variables for the same experiments, i.e., model output.
- **Example**: `label1`, `label2`, ..., `labelM`

### Key Characteristics
- **Shape**: `(n_experiments, n_features + n_labels)`
- `n_experiments`: The number of rows (experiments).
- `n_features`: The number of feature columns (X).
- `n_labels`: The number of label columns (y).

**Example**:
| feature1 | feature2 | feature3 | label1 | label2 |
|----------|----------|----------|--------|--------|
| 0.1 | 1.5 | 0.7 | 0 | 1 |
| 0.3 | 1.7 | 0.5 | 1 | 0 |
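
For clarity, here is a minimal sketch of how such a dataframe splits into X and y (assuming `pandas`; column names are illustrative, not the package API):

```python
import pandas as pd

# Illustrative dataframe matching the example table above.
df = pd.DataFrame({
    "feature1": [0.1, 0.3],
    "feature2": [1.5, 1.7],
    "feature3": [0.7, 0.5],
    "label1": [0, 1],
    "label2": [1, 0],
})

label_cols = ["label1", "label2"]   # the y columns (cf. `name_list` below)
y = df[label_cols].to_numpy()       # shape: (n_experiments, n_labels)
X = df.drop(columns=label_cols)     # shape: (n_experiments, n_features)
```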

---

## Configuration Variables Documentation

### `data_folder: str`
- **Required**: Yes
- **Description**: The path to the folder containing the data files.
- **Example**: `"data/top50"`

### `parameter_file: str`
- **Required**: Yes
- **Description**: The path to the file containing the parameter values for the experiments.
- **Example**: `"data/param.tsv"`

### `output_folder: str`
- **Required**: Yes
- **Description**: The path to the folder where the output files will be saved.
- **Example**: `"output"`

### `name_list: str`
- **Required**: No
- **Description**: A comma-separated string of column names identifying the label (y) columns. It is split into a list of strings and used to separate the y columns from the feature (X) columns.
- **Example**: `Yield1,Yield2,Yield3,Yield4,Yield5`

### `test: bool`
- **Required**: No
- **Description**: A flag that enables model validation; validation is not required for running the active learning loop. If not set, the validation step is skipped.
- **Example**: `--test`

### `nb_rep: int`
- **Required**: No
- **Description**: The number of test repetitions used to validate model behavior. In each repetition, 80% of the data is randomly assigned to training and 20% to testing.
- **Example**: `100`

### `flatten: bool`
- **Required**: No
- **Description**: A flag indicating whether to flatten the y data. If set, each repetition of the same experiment is treated independently, so identical X values with different y outputs are modeled as separate points; otherwise, y is averaged across repetitions and only the average is modeled (see the sketch below).
- **Example**: `--flatten`
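
A minimal sketch of the two behaviors, using hypothetical yields with up to three repetitions per experiment (`np.nanmean` mirrors how the script averages repetitions):

```python
import numpy as np

# Hypothetical y: 2 experiments x 3 repetitions (NaN = missing repetition).
y = np.array([[0.8, 0.9, np.nan],
              [0.4, 0.5, 0.6]])

# --flatten: every (X, y_rep) pair becomes its own sample.
y_flat = y[~np.isnan(y)]        # -> [0.8, 0.9, 0.4, 0.5, 0.6]

# default: one averaged sample per experiment.
y_avg = np.nanmean(y, axis=1)   # -> [0.85, 0.5]
```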

### `seed: int`
- **Required**: No
- **Description**: The random seed value used for reproducibility in random operations.
- **Example**: `85`

### `nb_new_data_predict: int`
- **Required**: No
- **Description**: The number of new candidate data points sampled from the space of all possible combinations.
- **Example**: `1000`

### `nb_new_data: int`
- **Required**: No
- **Description**: The number of new data points selected from the generated candidates. These are the points to be labeled in the next active learning round. `nb_new_data_predict` must be greater than `nb_new_data` for the selection to be meaningful.
- **Example**: `50`

### `parameter_step: int`
- **Required**: No
- **Description**: The step size used to decrement the maximum predefined concentration sequentially. For example, if the maximum concentration is `max`, the sequence of concentrations is calculated as: `max - 1 * parameter_step`, `max - 2 * parameter_step`, `max - 3 * parameter_step`, and so on. Each concentration is a candidate for experimental testing. Smaller steps result in more possible combinations to sample (see the sketch below).
- **Example**: `10`
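
A short sketch of the candidate sequence this implies, with a hypothetical maximum of 100 and a step of 10 (whether `max` itself is included depends on the implementation):

```python
# Hypothetical values; the real maxima come from the parameter file.
max_conc, parameter_step = 100, 10
candidates = [max_conc - k * parameter_step
              for k in range(max_conc // parameter_step + 1)]
# -> [100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
```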

### `n_group: int`
- **Required**: No
- **Description**: Parameter for the cluster margin algorithm, specifying the number of groups into which generated data will be clustered.
- **Example**: `15`

### `km: int`
- **Required**: No
- **Description**: Parameter for the cluster margin algorithm, specifying the number of data points kept in the first selection. Ensure `nb_new_data_predict > km > ks`.
- **Example**: `50`

### `ks: int`
- **Required**: No
- **Description**: Parameter for the cluster margin algorithm, specifying the number of data points kept in the second selection. It plays a role similar to `nb_new_data` (see the sketch below).
- **Example**: `20`
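
To make the `n_group`/`km`/`ks` interplay concrete, here is a schematic sketch of a two-stage cluster-margin selection. Names, shapes, and the clustering method are illustrative assumptions, not the package's actual implementation:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(85)
X_new = rng.random((1000, 3))      # nb_new_data_predict candidate points
uncertainty = rng.random(1000)     # e.g., predictive std from a GP

n_group, km, ks = 15, 50, 20
# Stage 1: keep the km most uncertain candidates.
first = np.argsort(uncertainty)[-km:]
# Cluster the retained candidates into n_group groups.
labels = AgglomerativeClustering(n_clusters=n_group).fit_predict(X_new[first])
# Stage 2: round-robin across clusters until ks points are chosen.
chosen = []
rank = 0
while len(chosen) < ks:
    for g in range(n_group):
        members = first[labels == g]
        if rank < len(members) and len(chosen) < ks:
            chosen.append(members[rank])
    rank += 1
selected = np.array(chosen)        # ks indices into X_new
```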

### `plot: bool`
- **Required**: No
- **Description**: A flag to indicate whether to generate all plots for analysis visualization.
- **Example**: `--plot`

### `save_plot: bool`
- **Required**: No
- **Description**: A flag to indicate whether to save all generated plots.
- **Example**: `--save_plot`

### `verbose: bool`
- **Required**: No
- **Description**: A flag to indicate whether to print all messages to the console.
- **Example**: `--verbose`
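
Putting it together: assuming the module is invoked via its `__main__.py` (i.e., `python -m icfree.learner`), a call using the example values above might look like:

```bash
python -m icfree.learner \
  --data_folder data/top50 \
  --parameter_file data/param.tsv \
  --output_folder output \
  --name_list Yield1,Yield2,Yield3,Yield4,Yield5 \
  --test --nb_rep 100 --seed 85 \
  --nb_new_data_predict 1000 --nb_new_data 50 \
  --parameter_step 10 --n_group 15 --km 50 --ks 20 \
  --plot --save_plot --verbose
```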

---

## Example Configuration File

[config.csv](config.csv)
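
The contents of `config.csv` are not reproduced here, but `csv_to_dict` in `__main__.py` reads it as a two-column, headerless CSV mapping parameter names to values, so a hypothetical file would look like:

```csv
data_folder,data/top50
parameter_file,data/param.tsv
output_folder,output
name_list,"Yield1,Yield2,Yield3,Yield4,Yield5"
seed,85
```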
254 changes: 254 additions & 0 deletions icfree/learner/__main__.py
@@ -0,0 +1,254 @@
import argparse
import os
import re
from os import path as os_path
import warnings

import numpy as np
import pandas as pd
from sklearn.exceptions import ConvergenceWarning
from sklearn.preprocessing import MaxAbsScaler

# Wildcard import provides the learner helpers used below (BayesianModels,
# split_and_flatten, import_parameter, plotting utilities, plt, kernels, ...).
from icfree.learner.library import *

warnings.filterwarnings("ignore", category=ConvergenceWarning)


def csv_to_dict(file_path):
    # Read a two-column, headerless CSV mapping parameter names to values.
    data = pd.read_csv(file_path, header=None)
    return dict(zip(data.iloc[:, 0], data.iloc[:, 1]))

def parse_readme(file_path):
    with open(file_path, 'r') as file:
        content = file.read()

    # Regular expression to extract parameter sections of the form:
    # ### `name: type` / - **Required**: ... / - **Description**: ... / - **Example**: ...
    param_regex = r"### `(?P<name>\w+): (?P<type>\w+)`(\n- \*\*Required\*\*: (?P<required>.+?))*\n- \*\*Description\*\*: (?P<description>.+?)(?:\n- \*\*Example\*\*: (?P<example>.+?))?\n"
    matches = re.finditer(param_regex, content, re.DOTALL)

    params_dict = {}
    for match in matches:
        name = match.group('name')
        params_dict[name] = {
            'required': match.group('required') if match.groupdict().get('required') else None,
            'type': match.group('type'),
            # Escape '%' so argparse does not treat it as a format specifier.
            'description': match.group('description').strip().replace('%', '%%'),
            'example': match.group('example').strip() if match.groupdict().get('example') else None
        }
        if params_dict[name]['example']:
            params_dict[name]['example'] = params_dict[name]['example'].replace('`', '')

    return params_dict

def string_to_type(type_string):
    # Map type names found in the README to Python types.
    type_mapping = {
        'str': str,
        'int': int,
        'float': float
    }
    return type_mapping.get(type_string, str)  # Default to str if type is unknown

def parse_arguments():
    parser = argparse.ArgumentParser(description="Script for active learning and model training.")

    cur_folder = os_path.join(
        os_path.dirname(
            os_path.dirname(__file__)
        ),
        'learner'
    )
    readme_path = os_path.join(cur_folder, 'README.md')
    # config_path = os_path.join(cur_folder, 'config.csv')

    # params = csv_to_dict(config_path)
    doc = parse_readme(readme_path)

    # Build the argparse interface from the parameter sections of the README.
    for arg in doc:
        required = (doc[arg]['required'] or '').lower() == 'yes'
        arg_type = string_to_type(doc[arg]['type'])
        arg_help = doc[arg]['description']
        default = doc[arg]['example']
        if doc[arg]['type'] == 'bool':
            parser.add_argument(f'--{arg}', help=arg_help, action='store_true')
        else:
            if not required:
                arg_help += f" (Default: {default})"
            parser.add_argument(f'--{arg}', required=required, type=arg_type, help=arg_help, default=default)

    args = parser.parse_args()

    # Convert comma-separated lists to actual Python lists
    args.name_list = args.name_list.split(',')

    return args

def save_and_or_plot(plot, save_plot, outfile=None, verbose=False):
    if save_plot:
        if verbose:
            # Print a saving message, with a check mark at the end of the line when done
            print(f"Saving plot to {outfile}...", end=' ')
        plt.savefig(outfile)
        if verbose:
            print_CheckMark()
    if plot:
        plt.show()
    plt.close()

def print_pending_step(msg):
    print(f"\033[1m{msg}...\033[0m", end=' ', flush=True)

def print_OK():
    print('\033[1m\033[92mOK\033[0m')

def print_CheckMark():
    print(u'\033[92m\N{check mark}\033[0m')

def main():
    args = parse_arguments()

    data_folder = args.data_folder
    name_list = args.name_list
    parameter_file = args.parameter_file
    nb_rep = args.nb_rep
    flatten = args.flatten
    seed = args.seed
    nb_new_data_predict = args.nb_new_data_predict
    nb_new_data = args.nb_new_data
    parameter_step = args.parameter_step
    test = args.test
    output_folder = args.output_folder
    n_group = args.n_group
    ks = args.ks
    km = args.km
    plot = args.plot
    save_plot = args.save_plot
    verbose = args.verbose

    # Build the candidate concentration grid from the parameter file.
    element_list, element_max, sampling_condition = import_parameter(parameter_file, parameter_step)

    print_pending_step("Importing data")
    data, size_list = import_data(data_folder, verbose)
    if len(size_list) == 0:
        print("No data found")
        print("Exiting...")
        raise SystemExit(1)
    print_OK()
    print_pending_step("Checking data")
    check_column_names(data, element_list, verbose)
    print_OK()

    # Split the dataframe into features (X) and labels (y); repetitions in y
    # are summarized by their mean and standard deviation.
    no_element = len(element_list)
    y = np.array(data[name_list])
    y_mean = np.nanmean(y, axis=1)
    y_std = np.nanstd(y, axis=1)
    X = data.iloc[:, 0:no_element]

    # Gaussian-process hyperparameter grid: constant * Matern kernel plus white noise.
    params = {
        'kernel': [
            C() * Matern(length_scale=10, nu=2.5) + WhiteKernel(noise_level=1e-3, noise_level_bounds=(1e-3, 1e1))
        ],
        # 'alpha': [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5]
        'alpha': [0.05]
    }

    print_pending_step("Formatting data")
    X_train, X_test, y_train, y_test = split_and_flatten(X, y, ratio=0, flatten=flatten)
    scaler = MaxAbsScaler()
    X_train_norm = scaler.fit_transform(X_train)
    print_OK()
    print_pending_step("Creating the model")
    model = BayesianModels(n_folds=10, model_type='gp', params=params)
    print_OK()
    print_pending_step("Training the model")
    model.train(X_train_norm, y_train, verbose=verbose)
    print_OK()

    if test:
        print_pending_step("Testing the model")
        best_param = {'alpha': [model.best_params['alpha']], 'kernel': [model.best_params['kernel']]}
        res = []
        for i in range(nb_rep):
            # Fresh 80/20 split per repetition to probe model stability.
            X_train, X_test, y_train, y_test = split_and_flatten(X, y, ratio=0.2, flatten=flatten)

            scaler = MaxAbsScaler()
            X_train_norm = scaler.fit_transform(X_train)
            X_test_norm = scaler.transform(X_test)

            eva_model = BayesianModels(model_type='gp', params=best_param)
            eva_model.train(X_train_norm, y_train, verbose=False)
            y_pred, std_pred = eva_model.predict(X_test_norm)
            res.append(r2_score(y_test, y_pred))

        plt.hist(res, bins=20, color='orange')
        plt.title(f'Histogram of R2 for different testing subsets, median = {np.median(res):.2f}', size=12)
        print_OK()

    print_pending_step("Predicting new samples to test")
    X_new = sampling_without_repeat(sampling_condition, num_samples=nb_new_data_predict, existing_data=X_train, seed=seed)
    X_new_norm = scaler.transform(X_new)
    y_pred, std_pred = model.predict(X_new_norm)
    clusters = cluster(X_new_norm, n_group)

    # Rank candidates by expected improvement over the best observed yield.
    ei = expected_improvement(y_pred, std_pred, max(y_train))
    if verbose:
        print("EI: ", end='')
    ei_top, y_ei, ratio_ei, ei_cluster = find_top_elements(X_new, y_pred, clusters, ei, km, return_ratio=True, verbose=verbose)
    ei_top_norm = scaler.transform(ei_top)
    print_OK()

    print_pending_step("Saving results")
    # Create the output folder if it does not exist
    if not os_path.exists(output_folder):
        os.makedirs(output_folder)

    if plot or save_plot:
        title = plot_selected_point(y_pred, std_pred, y_ei, 'EI selected')
        save_and_or_plot(plot, save_plot, os_path.join(output_folder, title), verbose)

        size_list.append(nb_new_data)
        y_mean = np.append(y_mean, y_ei)
        title = plot_each_round(y_mean, size_list, True)
        save_and_or_plot(plot, save_plot, os_path.join(output_folder, title), verbose)

        title = plot_train_test(X_train_norm, ei_top_norm, element_list)
        save_and_or_plot(plot, save_plot, os_path.join(output_folder, title), verbose)

        title = plot_heatmap(ei_top_norm, y_ei, element_list, 'EI')
        save_and_or_plot(plot, save_plot, os_path.join(output_folder, title), verbose)

    X_ei = pd.DataFrame(ei_top, columns=element_list)
    outfile = os_path.join(output_folder, 'next_sampling_ei' + str(km) + '.csv')
    if verbose:
        print(f"Saving next sampling points to {outfile}...", end=' ')
    X_ei.to_csv(outfile, index=False)
    if verbose:
        print_CheckMark()
    print_OK()


if __name__ == "__main__":
    main()