Merge branch 'main' into stable

brsynth · Jan 15, 2025 · b47d2a8
2 parents 1413635 + 6ea2189
Showing 24 changed files with 629 additions and 438 deletions.
diff --git a/environment.yaml b/environment.yaml
@@ -9,3 +9,10 @@ dependencies:
   - openpyxl
   - statsmodels
   - matplotlib-base
+  - scikit-learn
+  - seaborn
+  - xgboost
+  - scipy
+  - ipython
+  - ipywidgets
+  - glob2
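A quick way to confirm that an environment built from this file exposes the seven newly added packages is sketched below. It only uses the standard library. The mapping from package names to import names reflects the usual module names (for example `scikit-learn` is imported as `sklearn` and `ipython` as `IPython`); the helper name `missing_packages` is hypothetical and not part of the repository.

```python
from importlib.util import find_spec

# Package name as listed in environment.yaml -> module name used at import time
NEW_DEPENDENCIES = {
    "scikit-learn": "sklearn",
    "seaborn": "seaborn",
    "xgboost": "xgboost",
    "scipy": "scipy",
    "ipython": "IPython",
    "ipywidgets": "ipywidgets",
    "glob2": "glob2",
}


def missing_packages(dependencies):
    """Return the package names whose module cannot be located (hypothetical helper)."""
    return [pkg for pkg, module in dependencies.items() if find_spec(module) is None]


missing = missing_packages(NEW_DEPENDENCIES)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Running it inside the activated environment lists any package that still needs to be installed.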
diff --git a/icfree/learner/README.md b/icfree/learner/README.md
@@ -0,0 +1,122 @@
+# Configuration Documentation
+
+## Dataframe Structure
+
+### Rows
+- Each row in the dataframe represents one experiment: the feature values X (e.g., concentrations or volumes of cell-free mixes) followed by the labels y (e.g., yields).
+- **Example**: Row 0 contains the features and labels for Experiment 1.
+
+### Columns
+- **Feature Columns (X):**
+  - The first set of columns contains feature values. These are numerical or categorical variables used as input for modeling.
+  - **Example**: `feature1`, `feature2`, ..., `featureN`
+- **Label Columns (y):**
+  - The subsequent columns represent the labels or target variables for the same experiments, i.e., model output.
+  - **Example**: `label1`, `label2`, ..., `labelM`
+
+### Key Characteristics
+- **Shape**: `(n_experiments, n_features + n_labels)`
+  - `n_experiments`: The number of rows (experiments).
+  - `n_features`: The number of feature columns (X).
+  - `n_labels`: The number of label columns (y).
+
+**Example**:
+| feature1 | feature2 | feature3 | label1 | label2 |
+|----------|----------|----------|--------|--------|
+|   0.1    |   1.5    |   0.7    |   0    |   1    |
+|   0.3    |   1.7    |   0.5    |   1    |   0    |
+
+---
+
+## Configuration Variables Documentation
+
+### `data_folder: str`
+- **Required**: Yes
+- **Description**: The path to the folder containing the data files.
+- **Example**: `"data/top50"`
+
+### `parameter_file: str`
+- **Required**: Yes
+- **Description**: The path to the file containing the parameter values for the experiments.
+- **Example**: `"data/param.tsv"`
+
+### `output_folder: str`
+- **Required**: Yes
+- **Description**: The path to the folder where the output files will be saved.
+- **Example**: `"output"`
+
+### `name_list: str`
+- **Required**: No
+- **Description**: A comma-separated string of the column names that contain labels (y). It is converted to a list of strings and used to separate the y columns from the remaining X feature columns.
+- **Example**: `Yield1,Yield2,Yield3,Yield4,Yield5`
+
+### `test: bool`
+- **Required**: No
+- **Description**: A flag that enables model validation. Validation is not required to run the active learning loop; if the flag is not set, the validation step is skipped.
+- **Example**: `--test`
+
+### `nb_rep: int`
+- **Required**: No
+- **Description**: The number of test repetitions used to validate model behavior. In each repetition, a random 80% of the data is used for training and the remaining 20% for testing.
+- **Example**: `100`
+
+### `flatten: bool`
+- **Required**: No
+- **Description**: A flag to indicate whether to flatten the y data. If set, each repetition of an experiment is treated as an independent sample, so the same X values can appear several times with different y outputs. Otherwise, y is averaged across repetitions and the model is trained on the averages only.
+- **Example**: `--flatten`
+
+### `seed: int`
+- **Required**: No
+- **Description**: The random seed value used for reproducibility in random operations.
+- **Example**: `85`
+
+### `nb_new_data_predict: int`
+- **Required**: No
+- **Description**: The number of new data points sampled from all possible cases.
+- **Example**: `1000`
+
+### `nb_new_data: int`
+- **Required**: No
+- **Description**: The number of new data points selected from the generated ones. These are the data points labeled after active learning loops. `nb_new_data_predict` must be greater than `nb_new_data` to be meaningful.
+- **Example**: `50`
+
+### `parameter_step: int`
+- **Required**: No
+- **Description**: The step size used to decrement the maximum predefined concentration sequentially. For example, if the maximum concentration is `max`, the sequence of concentrations is calculated as: `max - 1 * parameter_step`, `max - 2 * parameter_step`, `max - 3 * parameter_step`, and so on. Each concentration is a candidate for experimental testing. Smaller steps result in more possible combinations to sample.
+- **Example**: `10`
+
+### `n_group: int`
+- **Required**: No
+- **Description**: Parameter for the cluster margin algorithm, specifying the number of groups into which generated data will be clustered.
+- **Example**: `15`
+
+### `km: int`
+- **Required**: No
+- **Description**: Parameter for the cluster margin algorithm, specifying the number of data points kept in the first selection. Ensure `nb_new_data_predict > km > ks`.
+- **Example**: `50`
+
+### `ks: int`
+- **Required**: No
+- **Description**: Parameter for the cluster margin algorithm, specifying the number of data points kept in the second selection. Its role is similar to that of `nb_new_data`.
+- **Example**: `20`
+
+### `plot: bool`
+- **Required**: No
+- **Description**: A flag to indicate whether to generate all plots for analysis visualization.
+- **Example**: `--plot`
+
+### `save_plot: bool`
+- **Required**: No
+- **Description**: A flag to indicate whether to save all generated plots.
+- **Example**: `--save_plot`
+
+### `verbose: bool`
+- **Required**: No
+- **Description**: A flag to indicate whether to print all messages to the console.
+- **Example**: `--verbose`
+
+---
+
+## Example Configuration File
+
+[config.csv](config.csv)
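The README's dataframe layout and its `name_list` and `flatten` options can be illustrated with a small standard-library sketch. This is not the repository's `split_and_flatten` function, whose implementation is not shown in this diff. The helper names `split_features_labels` and `flatten_or_average` are hypothetical, and skipping NaN replicates is an assumption chosen to mirror the `np.nanmean` call in `__main__.py`.

```python
import math
from statistics import mean


def split_features_labels(rows, label_names):
    """Split each row into (features, labels) using the label column names."""
    X, Y = [], []
    for row in rows:
        X.append([value for key, value in row.items() if key not in label_names])
        Y.append([row[key] for key in label_names])
    return X, Y


def flatten_or_average(X, Y, flatten):
    """Build (features, y) samples from replicated measurements.

    flatten=True:  one sample per replicate, so identical X rows repeat.
    flatten=False: one sample per experiment, with y averaged over replicates.
    NaN replicates are skipped (an assumption, not confirmed by the source).
    """
    X_out, y_out = [], []
    for features, labels in zip(X, Y):
        valid = [value for value in labels if not math.isnan(value)]
        if not valid:
            continue  # no usable measurement for this experiment
        if flatten:
            for value in valid:
                X_out.append(features)
                y_out.append(value)
        else:
            X_out.append(features)
            y_out.append(mean(valid))
    return X_out, y_out


# Two experiments, two features, two replicated yields (one missing)
rows = [
    {"feature1": 0.1, "feature2": 1.5, "Yield1": 2.0, "Yield2": 4.0},
    {"feature1": 0.3, "feature2": 1.7, "Yield1": 1.0, "Yield2": float("nan")},
]
name_list = "Yield1,Yield2".split(",")  # same comma-separated form as `name_list`

X, Y = split_features_labels(rows, name_list)
X_flat, y_flat = flatten_or_average(X, Y, flatten=True)
X_avg, y_avg = flatten_or_average(X, Y, flatten=False)
print(len(X_flat), y_flat)
print(len(X_avg), y_avg)
```

With `flatten=True` the model sees every replicate and can learn the measurement noise; with `flatten=False` it sees one averaged target per experiment.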
diff --git a/icfree/learner/__main__.py b/icfree/learner/__main__.py
@@ -0,0 +1,254 @@
+import argparse
+import os  # needed for os.makedirs() in main()
+import re
+from os import path as os_path
+import numpy as np
+import pandas as pd
+from sklearn.exceptions import ConvergenceWarning
+from sklearn.preprocessing import MaxAbsScaler
+import warnings
+from icfree.learner.library import *
+
+warnings.filterwarnings("ignore", category=ConvergenceWarning)
+
+
+def csv_to_dict(file_path):
+    """Read a headerless two-column CSV file into a {key: value} dictionary."""
+    data = pd.read_csv(file_path, header=None)
+    # Column 0 holds the parameter names, column 1 the values
+    param_dict = dict(zip(data.iloc[:, 0], data.iloc[:, 1]))
+    return param_dict
+
+def parse_readme(file_path):
+    with open(file_path, 'r') as file:
+        content = file.read()
+
+    # Regular expression to extract parameter sections
+    param_regex = r"### `(?P<name>\w+): (?P<type>\w+)`(\n- \*\*Required\*\*: (?P<required>.+?))*\n- \*\*Description\*\*: (?P<description>.+?)(?:\n- \*\*Example\*\*: (?P<example>.+?))?\n"
+    matches = re.finditer(param_regex, content, re.DOTALL)
+
+    params_dict = {}
+    for match in matches:
+        name = match.group('name')
+        params_dict[name] = {
+            'required': match.group('required') if match.groupdict().get('required') else None,
+            'type': match.group('type'),
+            'description': match.group('description').strip().replace('%', '%%'),
+            'example': match.group('example').strip() if match.groupdict().get('example') else None
+        }
+        params_dict[name]['example'] = params_dict[name]['example'].replace('`', '') if params_dict[name]['example'] else None
+
+    return params_dict
+
+def string_to_type(type_string):
+    # Supported types in Python
+    type_mapping = {
+        'str': str,
+        'int': int,
+        'float': float
+    }
+    return type_mapping.get(type_string, str)  # Default to str if type is unknown
+
+def parse_arguments():
+    parser = argparse.ArgumentParser(description="Script for active learning and model training.")
+
+    # Folder that contains this module (icfree/learner) and its README.md
+    cur_folder = os_path.dirname(os_path.abspath(__file__))
+    readme_path = os_path.join(cur_folder, 'README.md')
+    # config_path = os_path.join(cur_folder, 'config.csv')
+
+    # params = csv_to_dict(config_path)
+    doc = parse_readme(readme_path)
+
+    # Create one command-line argument per parameter documented in the README
+    for arg in doc:
+        # A section without a "Required" line is treated as optional
+        required = (doc[arg]['required'] or '').strip().lower() == 'yes'
+        arg_type = string_to_type(doc[arg]['type'])
+        arg_help = doc[arg]['description']
+        default = doc[arg]['example']
+        if doc[arg]['type'] == 'bool':
+            parser.add_argument(f'--{arg}', help=arg_help, action='store_true')
+        else:
+            if not required:
+                arg_help += f" (Default: {default})"
+            parser.add_argument(f'--{arg}', required=required, type=arg_type, help=arg_help, default=default)
+
+    # parser.add_argument('--data_folder', required=doc['data_folder']['required'].lower()=='yes', type=eval(doc['data_folder']['type']), help=doc['data_folder']['description'].replace('%', '%%'))
+    # parser.add_argument('--parameter_file', required=doc['parameter_file']['required'].lower()=='yes', type=eval(doc['parameter_file']['type']), help=doc['parameter_file']['description'].replace('%', '%%'))
+    # parser.add_argument('--output_folder', required=doc['output_folder']['required'].lower()=='yes', type=eval(doc['output_folder']['type']), help=doc['output_folder']['description'].replace('%', '%%'))
+    # parser.add_argument('--name_list', required=doc['name_list']['required'].lower()=='yes', type=eval(doc['name_list']['type']), default=doc['name_list']['example'].replace('`', ''), help=doc['name_list']['description'].replace('%', '%%')+f" (default: {doc['name_list']['example']})")
+    # parser.add_argument('--nb_rep', required=doc['nb_rep']['required'].lower()=='yes', type=eval(doc['nb_rep']['type']), default=doc['nb_rep']['example'].replace('`', ''), help=doc['nb_rep']['description'].replace('%', '%%'))
+    # parser.add_argument('--flatten', required=doc['flatten']['required'].lower()=='yes', help=doc['flatten']['description'].replace('%', '%%'), action='store_true')
+    # parser.add_argument('--seed', required=doc['seed']['required'].lower()=='yes', type=eval(doc['seed']['type']), default=doc['seed']['example'].replace('`', ''), help=doc['seed']['description'].replace('%', '%%'))
+    # parser.add_argument('--nb_new_data_predict', required=doc['nb_new_data_predict']['required'].lower()=='yes', type=eval(doc['nb_new_data_predict']['type']), default=doc['nb_new_data_predict']['example'].replace('`', ''), help=doc['nb_new_data_predict']['description'].replace('%', '%%'))
+    # parser.add_argument('--nb_new_data', required=doc['nb_new_data']['required'].lower()=='yes', type=eval(doc['nb_new_data']['type']), default=doc['nb_new_data']['example'].replace('`', ''), help=doc['nb_new_data']['description'].replace('%', '%%'))
+    # parser.add_argument('--parameter_step', required=doc['parameter_step']['required'].lower()=='yes', type=eval(doc['parameter_step']['type']), default=doc['parameter_step']['example'].replace('`', ''), help=doc['parameter_step']['description'].replace('%', '%%'))
+    # parser.add_argument('--test', required=doc['test']['required'].lower()=='yes', help=doc['test']['description'].replace('%', '%%'), action='store_true')
+    # parser.add_argument('--n_group', required=doc['n_group']['required'].lower()=='yes', type=eval(doc['n_group']['type']), default=doc['n_group']['example'].replace('`', ''), help=doc['n_group']['description'].replace('%', '%%'))
+    # parser.add_argument('--ks', required=doc['ks']['required'].lower()=='yes', type=eval(doc['ks']['type']), default=doc['ks']['example'].replace('`', ''), help=doc['ks']['description'].replace('%', '%%'))
+    # parser.add_argument('--km', required=doc['km']['required'].lower()=='yes', type=eval(doc['km']['type']), default=doc['km']['example'].replace('`', ''), help=doc['km']['description'].replace('%', '%%'))
+    # parser.add_argument('--plot', required=doc['plot']['required'].lower()=='yes', help=doc['plot']['description'].replace('%', '%%'), action='store_true')
+    # parser.add_argument('--save_plot', required=doc['save_plot']['required'].lower()=='yes', help=doc['save_plot']['description'].replace('%', '%%'), action='store_true')
+
+    args = parser.parse_args()
+
+    # Convert comma-separated lists to actual Python lists
+    args.name_list = args.name_list.split(',')
+
+    return args
+
+def save_and_or_plot(plot, save_plot, outfile=None, verbose=False):
+    if save_plot:
+        if verbose:
+            # Print message of saving with OK at the end of the line when it's done
+            print(f"Saving plot to {outfile}...", end=' ')
+        plt.savefig(outfile)
+        if verbose:
+            print_CheckMark()
+    if plot:
+        plt.show()
+    plt.close()
+
+def print_pending_step(msg):
+    # Callers already end msg with "...", so no extra ellipsis is appended here
+    print(f"\033[1m{msg}\033[0m", end=' ', flush=True)
+
+def print_OK():
+    print('\033[1m\033[92mOK\033[0m')
+
+def print_CheckMark():
+    print(u'\033[92m\N{check mark}\033[0m')
+
+def main():
+    args = parse_arguments()
+
+    data_folder = args.data_folder
+    name_list = args.name_list
+    parameter_file = args.parameter_file
+    nb_rep = args.nb_rep
+    flatten = args.flatten
+    seed = args.seed
+    nb_new_data_predict = args.nb_new_data_predict
+    nb_new_data = args.nb_new_data
+    parameter_step = args.parameter_step
+    test = args.test
+    output_folder = args.output_folder
+    n_group = args.n_group
+    ks = args.ks
+    km = args.km
+    plot = args.plot
+    save_plot = args.save_plot
+    verbose = args.verbose
+
+    # Proceed with the rest of the script logic
+    element_list, element_max, sampling_condition = import_parameter(parameter_file, parameter_step)
+
+    print_pending_step("Importing data...")
+    data, size_list = import_data(data_folder, verbose)
+    if len(size_list) == 0:
+        print("No data found")
+        print("Exiting...")
+        exit()
+    print_OK()
+    print_pending_step("Checking data...")
+    check_column_names(data, element_list, verbose)
+    print_OK()
+
+    no_element = len(element_list)
+    y = np.array(data[name_list])
+    y_mean = np.nanmean(y, axis = 1)
+    y_std = np.nanstd(y, axis = 1)
+    X = data.iloc[:,0:no_element]
+
+    params = {'kernel': [
+                    C()*Matern(length_scale=10, nu=2.5)+ WhiteKernel(noise_level=1e-3, noise_level_bounds=(1e-3, 1e1))
+                ],
+    #            'alpha':[0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5]}
+                'alpha':[0.05]}
+
+    print_pending_step("Formatting data...")
+    X_train, X_test, y_train, y_test = split_and_flatten(X, y, ratio = 0, flatten = flatten)
+    scaler = MaxAbsScaler()
+    X_train_norm = scaler.fit_transform(X_train)
+    print_OK()
+    print_pending_step("Creating the model...")
+    model = BayesianModels(n_folds= 10, model_type = 'gp', params=params)
+    print_OK()
+    print_pending_step("Training the model...")
+    model.train(X_train_norm, y_train, verbose = verbose)
+    print_OK()
+
+    if test:
+        print_pending_step("Testing the model...")
+        best_param = {'alpha': [model.best_params['alpha']],'kernel': [model.best_params['kernel']]}
+        res = []
+        for i in range(nb_rep):
+            # Use dedicated names so the split and scaler fitted here do not
+            # overwrite X_train, y_train and scaler, which are reused after this block
+            X_tr, X_te, y_tr, y_te = split_and_flatten(X, y, ratio = 0.2, flatten = flatten)
+
+            eva_scaler = MaxAbsScaler()
+            X_tr_norm = eva_scaler.fit_transform(X_tr)
+            X_te_norm = eva_scaler.transform(X_te)
+
+            eva_model = BayesianModels(model_type ='gp', params=best_param)
+            eva_model.train(X_tr_norm, y_tr, verbose = False)
+            y_pred, std_pred = eva_model.predict(X_te_norm)
+            res.append(r2_score(y_te, y_pred))
+
+        plt.hist(res, bins = 20, color='orange')
+        plt.title(f'Histogram of R2 for different testing subset, median= {np.median(res):.2f}', size = 12)
+        print_OK()
+
+    print_pending_step("Predicting new samples to test...")
+    X_new = sampling_without_repeat(sampling_condition, num_samples=nb_new_data_predict, existing_data=X_train, seed=seed)
+    X_new_norm = scaler.transform(X_new)
+    y_pred, std_pred = model.predict(X_new_norm)
+    clusters = cluster(X_new_norm, n_group)
+
+    ei = expected_improvement(y_pred, std_pred, max(y_train))
+    if verbose:
+        print("EI: ", end='')
+    ei_top, y_ei, ratio_ei, ei_cluster = find_top_elements(X_new, y_pred, clusters, ei, km, return_ratio=True, verbose=verbose)
+    ei_top_norm = scaler.transform(ei_top)
+    print_OK()
+
+    print_pending_step("Saving results...")
+    # Create outfolder if it does not exist
+    if not os_path.exists(output_folder):
+        os.makedirs(output_folder)
+
+    if plot or save_plot:
+        title = plot_selected_point(y_pred, std_pred, y_ei, 'EI selected')
+        save_and_or_plot(plot, save_plot, os_path.join(output_folder, title), verbose)
+
+        size_list.append(nb_new_data)
+        y_mean = np.append(y_mean, y_ei)
+        title = plot_each_round(y_mean, size_list, True)
+        save_and_or_plot(plot, save_plot, os_path.join(output_folder, title), verbose)
+
+        title = plot_train_test(X_train_norm, ei_top_norm, element_list)
+        save_and_or_plot(plot, save_plot, os_path.join(output_folder, title), verbose)
+
+        title = plot_heatmap(ei_top_norm, y_ei, element_list, 'EI')
+        save_and_or_plot(plot, save_plot, os_path.join(output_folder, title), verbose)
+
+    X_ei = pd.DataFrame(ei_top, columns=element_list)
+    outfile = os_path.join(output_folder, 'next_sampling_ei'+ str(km) + '.csv')
+    if verbose:
+        print(f"Saving next sampling points to {outfile}...", end=' ')
+    X_ei.to_csv(outfile, index=False)
+    if verbose:
+        print_CheckMark()
+    print_OK()
+
+
+if __name__ == "__main__":
+    main()
+
+
+
+
+
+
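In `main()`, candidate points are scored with `expected_improvement(y_pred, std_pred, max(y_train))`, a function imported from `icfree.learner.library` whose body is not part of this diff. The sketch below shows the textbook expected improvement for maximization under a Gaussian predictive distribution, which may differ from the library's version (for instance, it has no exploration offset). It uses only the standard library, and the sample values are made up for illustration.

```python
import math


def expected_improvement(mean, std, best):
    """Textbook EI for one candidate when maximizing.

    EI = (mean - best) * Phi(z) + std * phi(z), with z = (mean - best) / std,
    where Phi and phi are the standard normal CDF and PDF.
    """
    if std <= 0.0:
        # No predictive uncertainty: the improvement is deterministic
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


# Illustrative predictions for three candidates (made-up numbers)
y_pred = [1.0, 2.0, 2.0]
std_pred = [0.5, 0.0, 1.0]
best_so_far = 2.0  # plays the role of max(y_train)

scores = [expected_improvement(m, s, best_so_far) for m, s in zip(y_pred, std_pred)]
# Candidate indices ordered from highest to lowest EI
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

The third candidate has the same predicted mean as the second but a larger standard deviation, so it ranks first: EI rewards uncertainty when the predicted mean alone does not beat the current best.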