Commit
v0.2.8
wilsonjr committed Jun 9, 2023
2 parents 8d60b80 + ee0e576 commit ad8467b
Showing 23 changed files with 2,476 additions and 1,471 deletions.
7 changes: 7 additions & 0 deletions .gitignore
@@ -13,3 +13,10 @@ evaluation/comparison-techniques/*
evaluation/comparison-drill/*
evaluation/comparison-np/*
.DS_Store
reinstall.sh
*.npy
test.*
.vscode
.ipynb_checkpoints
datasets/*
*.csv
15 changes: 7 additions & 8 deletions Dockerfile
@@ -1,4 +1,5 @@
FROM python:3.8-slim
# FROM python:3.8-slim
FROM --platform=linux/amd64 python:3.8-slim

RUN apt-get update && \
apt-get install -y build-essential && \
@@ -9,19 +10,17 @@ RUN apt-get update && \
# Install miniconda
ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
/bin/bash ~/miniconda.sh -b -p /opt/conda
/bin/bash ~/miniconda.sh -b -f -p /usr/local/

# Put conda in path so we can use conda activate
ENV PATH=$CONDA_DIR/bin:$PATH
RUN pip install --upgrade pip

RUN conda install numpy
RUN conda install scipy
RUN conda install scikit-learn
RUN conda install eigen
RUN conda install pybind11
RUN conda config --add channels conda-forge

RUN conda install humap

COPY . /app
WORKDIR /app

RUN python setup.py build_ext -I/opt/conda/include/eigen3 install
RUN python minimal_test.py
76 changes: 14 additions & 62 deletions README.md
@@ -14,17 +14,17 @@
.. |conda_downloads| image:: https://anaconda.org/conda-forge/humap/badges/downloads.svg
.. _conda_downloads: https://anaconda.org/conda-forge/humap

.. image:: images/fmnist-cover.png
.. image:: images/humap-2M.gif
:alt: HUMAP exploration on Fashion MNIST dataset

=====
HUMAP
=====

Hierarchical Manifold Approximation and Projection (HUMAP) is a technique based on `UMAP <https://github.com/lmcinnes/umap/>`_ for hierarchical non-linear dimensionality reduction. HUMAP allows to:
Hierarchical Manifold Approximation and Projection (HUMAP) is a technique based on `UMAP <https://github.com/lmcinnes/umap/>`_ for hierarchical dimensionality reduction. HUMAP allows to:


1. Focus on important information while reducing the visual burden when exploring whole datasets;
1. Focus on important information while reducing the visual burden when exploring huge datasets;
2. Drill-down the hierarchy according to information demand.

The details of the algorithm can be found in our paper on `ArXiv <https://arxiv.org/abs/2106.07718>`_. This repository also features a C++ UMAP implementation.
@@ -34,7 +34,7 @@ The details of the algorithm can be found in our paper on `ArXiv <https://arxiv.
Installation
-----------

HUMAP was written in C++ for performance purposes, and it has an intuitive Python interface. It depends upon common machine learning libraries, such as ``scikit-learn`` and ``NumPy``. It also needs the ``pybind11`` due to the interface between C++ and Python.
HUMAP was written in C++ for performance purposes, and provides an intuitive Python interface. It depends upon common machine learning libraries, such as ``scikit-learn`` and ``NumPy``. It also needs the ``pybind11`` due to the interface between C++ and Python.


Requirements:
@@ -60,9 +60,9 @@ Alternatively (and preferable), you can use conda to install:
conda install humap
**For Windows users**:
**If using pip**:

If using *pip* The `Eigen <https://eigen.tuxfamily.org/>`_ library does not have to be installed. Just add the files to C:\\Eigen or use the manual installation to change Eigen location.
HUMAP depends on `Eigen <https://eigen.tuxfamily.org/>`_. Thus, make it sure to place the headers in **/usr/local/include** if using Unix or **C:\\Eigen** if using Windows.

**Manual installation**:

@@ -81,7 +81,7 @@ For manually installing HUMAP, download the project and proceed as follows:
Usage examples
--------------

HUMAP package follows the same idea of sklearn classes, in which you need to fit and transform data.
The simplest usage of HUMAP is as it follows:

**Fitting the hierarchy**

@@ -93,66 +93,18 @@ HUMAP package follows the same idea of sklearn classes, in which you need to fit
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
hUmap = humap.HUMAP()
# build a hierarchy with three levels
hUmap = humap.HUMAP([0.2, 0.2])
hUmap.fit(X, y)
.. image:: images/mnist_top.png
:alt: HUMAP embedding of top-level MNIST digits

By now, you can control six parameters related to the hierarchy construction and the embedding performed by UMAP.

- ``levels``: Controls the number of hierarchical levels + the first one (whole dataset). This parameter also controls how many data points are in each hierarchical level. The default is ``[0.2, 0.2]``, meaning the HUMAP will produce three levels: The first one with the whole dataset, the second one with 20% of the first level, and the third with 20% of the second level.

- ``n_neighbors``: This parameter controls the number of neighbors for approximating the manifold structures. Larger values produce embedding that preserves more of the global relations. In HUMAP, we recommend and set the default value to be ``100``.

- ``min_dist``: This parameter, used in UMAP dimensionality reduction, controls the allowance to cluster data points together. According to UMAP documentation, larger values allow evenly distributed embeddings, while smaller values encode the local structures better. We set this parameter as ``0.15`` as default.

- ``knn_algorithm``: Controls which knn approximation will be used, in which ``NNDescent`` is the default. Another option is ``ANNOY`` or ``FLANN`` if you have Python installations of these algorithms at the expense of slower run-time executions than NNDescent.

- ``init``: Controls the method for initing the low-dimensional representation. We set ``Spectral`` as default since it yields better global structure preservation. You can also use ``random`` initialization.

- ``verbose``: Controls the verbosity of the algorithm.


**Embedding a hierarchical level**

After fitting the dataset, you can generate the embedding for a hierarchical level by specifying the level.

.. code:: python
embedding_l2 = hUmap.transform(2)
y_l2 = hUmap.labels(2)
Notice that the ``.labels()`` method only works for levels equal or greater than one.


**Drilling down the hierarchy by embedding a subset of data points based on indices**

.. image:: images/example_drill.png
:alt: Embedding data subsets throughout HUMAP hierarchy

When interested in a set of data samples, HUMAP allows for drilling down the hierarchy for those samples.


.. code:: python
embedding, y, indices = hUmap.transform(2, indices=indices_of_interest)
This method returns the ``embedding`` coordinates, the labels (``y``), and the data points' ``indices`` in the current level. Notice that the current level is now level 1 since we used the hierarchy level ``2`` for drilling down operation.


**Drilling down the hierarchy by embedding a subset of data points based on labels**

You can apply the same concept as above to embed data points based on labels.

.. code:: python
embedding, y, indices = hUmap.transform(2, indices=np.array([4, 9]), class_based=True)
# embed level 2
embedding2 = hUmap.transform(2)
Refer to *notebooks/* for complete examples.

**C++ UMAP implementation**

You can also fit a one-level HUMAP hierarchy, which essentially corresponds to a UMAP projection.
You can also fit a one-level HUMAP hierarchy, which essentially fits UMAP projection.

.. code:: python
@@ -181,7 +133,7 @@ Please, use the following reference to cite HUMAP in your work:
License
-------

HUMAP follows the 3-clause BSD license and it uses the open-source NNDescent implementation from `EFANNA <https://github.com/ZJULearning/efanna>`_. It also uses a C++ implementation of `UMAP <http://github.com/lmcinnes/umap>`_ for embedding hierarchy levels; this project would not be possible without UMAP's fantastic technique and package.
HUMAP follows the 3-clause BSD license and it uses the open-source NNDescent implementation from `EFANNA <https://github.com/ZJULearning/efanna>`_. It also uses a C++ implementation of `UMAP <http://github.com/lmcinnes/umap>`_ for embedding hierarchy levels.

E-mail me (wilson_jr at outlook.com) if you like to contribute.
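
The README text removed in this commit describes the default ``levels=[0.2, 0.2]`` as producing three levels: the whole dataset, then 20% of the first level, then 20% of the second. A minimal sketch of that arithmetic follows. It assumes each level size is the previous size times the fraction, rounded to the nearest integer; the rounding rule and the helper name `level_sizes` are assumptions for illustration and are not taken from the HUMAP sources.

```python
def level_sizes(n_points, levels=(0.2, 0.2)):
    """Return the number of points per hierarchy level, starting with the full dataset.

    Hypothetical helper: HUMAP's own selection of points per level may round differently.
    """
    sizes = [n_points]
    for fraction in levels:
        # Each level keeps `fraction` of the points of the level before it.
        sizes.append(round(sizes[-1] * fraction))
    return sizes


if __name__ == "__main__":
    # 70,000 points is the size of the MNIST dataset used in the README example.
    print(level_sizes(70000))
```

Under these assumptions the number of levels is always `len(levels) + 1`, which matches `self.n_levels = len(levels)+1` in `humap/humap.py`.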

Binary file added dump.rdb
Binary file not shown.
37 changes: 29 additions & 8 deletions humap/humap.py
@@ -6,9 +6,10 @@
import numpy as np

from scipy.optimize import curve_fit

from sklearn.utils import check_array

import logging

class HUMAP(object):
"""
Class for wrapping the pybind11 interface of HUMAP C++ implementation
@@ -34,19 +35,24 @@ class HUMAP(object):
* FLANN (Python instalation required)
init (str): (optional, default 'Spectral')
init (str): (optional, default 'Random')
Initialization method for the low dimensional embedding. Options include:
* Spectral
* Spectral
* random
reproducible (bool): (optional, default 'False')
If the results among different runs need to be reproducible. It affects the runtime execution.
verbose (bool): (optional, default True)
verbose (bool): (optional, default False)
Controls logging.
"""
def __init__(self, levels=np.array([0.2, 0.2]), n_neighbors=100, min_dist=0.15, knn_algorithm='NNDescent', init="Spectral", verbose=True, reproducible=False):
def __init__(self, levels=np.array([0.2, 0.2]), n_neighbors=100, min_dist=0.15, knn_algorithm='NNDescent', init="Random", verbose=False, reproducible=False):

if init != 'Random':
logging.warn("Sorry, only Random initialization is available at this time.")
init = 'Random'

self.levels = levels
self.n_levels = len(levels)+1
self.n_neighbors = n_neighbors
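
The constructor change above forces `init` to `'Random'` and warns when anything else is requested. The same fallback can be sketched in isolation as below. The function name `resolve_init` and the logger name are hypothetical, and the sketch uses `logging.warning`, since `logging.warn` (used in the commit) is a deprecated alias for it.

```python
import logging

logger = logging.getLogger("humap_init_sketch")  # hypothetical logger name


def resolve_init(init="Random"):
    """Return the initialization that would actually be used.

    Mirrors the fallback in HUMAP.__init__: any value other than 'Random'
    triggers a warning and is replaced by 'Random'.
    """
    if init != "Random":
        # logging.warning is the documented spelling; logging.warn is a deprecated alias.
        logger.warning("Only Random initialization is available; ignoring init=%r.", init)
        return "Random"
    return init


if __name__ == "__main__":
    print(resolve_init("Spectral"))
```

Note that the `UMAP` subclass in this commit still passes `init="Spectral"` by default, so constructing it with default arguments would go through this warning path.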
@@ -199,7 +205,8 @@ def transform(self, level, **kwargs):

y = self.h_umap.get_labels_selected()
indices_cluster = self.h_umap.get_indices_selected()
return [embedding, y, indices_cluster]
indices_fixed = self.h_umap.get_indices_fixed()
return [embedding, y, indices_cluster, indices_fixed]

except:
raise TypeError("Accepted parameters: indices and class_based.")
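
With this hunk the drill-down branch of `transform` returns four items instead of three, so a caller that unpacks three values (as the removed README example did) would raise a `ValueError`. A minimal sketch of a caller-side helper that tolerates both shapes is shown below. The name `unpack_drilldown` is hypothetical, and the sketch only assumes the list layouts visible in the diff.

```python
def unpack_drilldown(result):
    """Unpack the drill-down output of transform() from either release.

    Older layout: [embedding, y, indices_cluster]
    Newer layout: [embedding, y, indices_cluster, indices_fixed]
    """
    if len(result) == 4:
        embedding, labels, indices_cluster, indices_fixed = result
    elif len(result) == 3:
        embedding, labels, indices_cluster = result
        indices_fixed = None  # not reported by the older layout
    else:
        raise ValueError("expected 3 or 4 items, got %d" % len(result))
    return embedding, labels, indices_cluster, indices_fixed


if __name__ == "__main__":
    # Placeholder values stand in for the arrays returned by HUMAP.
    print(unpack_drilldown(["emb", "y", "idx"]))
    print(unpack_drilldown(["emb", "y", "idx", "fixed"]))
```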
@@ -274,6 +281,20 @@ def set_info_file(self, info_file=""):
def set_n_epochs(self, epochs):
self.h_umap.set_n_epochs(epochs)

def get_knn(self, level):

if level < 0 or level >= self.n_levels:
raise ValueError("level must be in [0, n_levels-1]")
else:
return self.h_umap.get_knn(level)

def get_knn_distances(self, level):

if level < 0 or level >= self.n_levels:
raise ValueError("level must be in [0, n_levels-1]")
else:
return self.h_umap.get_knn_dists(level)


def influence(self, level):
r"""
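
Both new accessors, `get_knn` and `get_knn_distances`, repeat the same range check on `level` before delegating to the C++ object. A minimal sketch of that check as a standalone function is given below. The name `check_level` is hypothetical and the sketch does not call into HUMAP itself.

```python
def check_level(level, n_levels):
    """Return `level` if it lies in [0, n_levels - 1], otherwise raise ValueError."""
    if level < 0 or level >= n_levels:
        raise ValueError("level must be in [0, n_levels-1]")
    return level


if __name__ == "__main__":
    # With the default levels=[0.2, 0.2], humap.py sets n_levels = len(levels) + 1 = 3.
    n_levels = len([0.2, 0.2]) + 1
    for candidate in (0, 2, 3):
        try:
            print(candidate, "ok", check_level(candidate, n_levels))
        except ValueError as err:
            print(candidate, "rejected:", err)
```

Factoring the check out this way is one option for keeping the two accessors consistent; the commit itself inlines the condition in each method.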
@@ -343,11 +364,11 @@ class UMAP(HUMAP):
* Spectral
* random
verbose (bool): (optional, default True)
verbose (bool): (optional, default False)
Controls logging.
"""
def __init__(self, n_neighbors=100, min_dist=0.15, knn_algorithm='NNDescent', init="Spectral", verbose=True, reproducible=False):
def __init__(self, n_neighbors=100, min_dist=0.15, knn_algorithm='NNDescent', init="Spectral", verbose=False, reproducible=False):
super().__init__(np.array([]), n_neighbors, min_dist, knn_algorithm, init, verbose, reproducible)

def fit_transform(self, X):
Binary file added images/humap-2M.gif
4 changes: 4 additions & 0 deletions minimal_test.py
@@ -0,0 +1,4 @@
import humap


print("imported :)")
@@ -126,7 +126,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.7.3 ('base')",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -140,7 +140,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.8.13"
},
"vscode": {
"interpreter": {
@@ -185,7 +185,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.7.3 ('base')",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -199,7 +199,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.8.13"
},
"vscode": {
"interpreter": {