Feature request : Enhanced Geospatial and Temporal Search #740

Open
nicolas-geysse opened this issue Jun 25, 2024 · 2 comments

Comments

@nicolas-geysse

Here's a plan to enhance TxtAI with geospatial and temporal search capabilities:

1. Extend indexing for geospatial data:

  • Use GeoPandas for geospatial data handling, as it integrates well with NetworkX.
  • Implement a GeospatialGraph class that extends TxtAI's existing Graph:
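import geopandas as gpd
from txtai.graph import Graph

# Sketch only: the no-argument constructor and Graph.add_node follow this
# proposal's naming and have not been checked against txtai's Graph API.
class GeospatialGraph(Graph):
    def __init__(self):
        super().__init__()
        # Collect rows in a list; DataFrame.append was removed in pandas 2.0
        self.rows = []

    def add_node(self, node_id, geometry, **attr):
        super().add_node(node_id, **attr)
        self.rows.append({'node_id': node_id, 'geometry': geometry})

    @property
    def gdf(self):
        # Build the GeoDataFrame on demand from the collected rows
        return gpd.GeoDataFrame(self.rows, columns=['node_id', 'geometry'], geometry='geometry')

    def spatial_query(self, geometry, predicate='intersects'):
        # Look up the predicate by name, e.g. GeoSeries.intersects or GeoSeries.within
        gdf = self.gdf
        mask = getattr(gdf.geometry, predicate)(geometry)
        return gdf[mask]['node_id'].tolist()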

2. Implement temporal search functionalities:

  • Use pandas for temporal data handling, as it's already part of TxtAI's ecosystem.
  • Extend the Graph class to include temporal attributes:
import pandas as pd

class TemporalGraph(Graph):
    def __init__(self):
        super().__init__()
        # Maps node id -> timestamp so that queries can return node ids
        self.timestamps = {}

    def add_node(self, node_id, timestamp, **attr):
        super().add_node(node_id, **attr)
        self.timestamps[node_id] = pd.Timestamp(timestamp)

    def temporal_query(self, start_time, end_time):
        # Return the ids of nodes whose timestamp lies in [start_time, end_time]
        return [node_id for node_id, ts in self.timestamps.items()
                if start_time <= ts <= end_time]

3. Integrate with existing semantic search:

  • Create a combined SpatioTemporalSemanticGraph class:
from txtai.embeddings import Embeddings

class SpatioTemporalSemanticGraph(GeospatialGraph, TemporalGraph):
    def __init__(self):
        super().__init__()
        self.embeddings = Embeddings()

    def add_node(self, node_id, geometry, timestamp, text, **attr):
        # Pass timestamp by keyword so it travels through GeospatialGraph.add_node
        # to TemporalGraph.add_node along the method resolution order
        super().add_node(node_id, geometry, timestamp=timestamp, **attr)
        # upsert appends to the index; index() would rebuild it on every call
        self.embeddings.upsert([(node_id, text, None)])

    def search(self, query, geometry=None, start_time=None, end_time=None, limit=10):
        results = self.embeddings.search(query, limit)

        # Filters run after the semantic search, so fewer than `limit` results may remain
        if geometry is not None:
            spatial_results = set(self.spatial_query(geometry))
            results = [r for r in results if r[0] in spatial_results]

        if start_time is not None and end_time is not None:
            temporal_results = set(self.temporal_query(start_time, end_time))
            results = [r for r in results if r[0] in temporal_results]

        return results

This implementation:

  1. Uses GeoPandas for geospatial indexing, which is compatible with NetworkX.
  2. Utilizes pandas for temporal indexing, which is already part of TxtAI's ecosystem.
  3. Applies spatial and temporal filters on top of TxtAI's existing semantic search results.
  4. Provides a simple interface for combined spatio-temporal-semantic queries.

To use this enhanced graph:

import pandas as pd
from shapely.geometry import Point

graph = SpatioTemporalSemanticGraph()
graph.add_node("1", Point(0, 0), pd.Timestamp("2023-01-01"), "Sample text")
results = graph.search("sample",
                       geometry=Point(0, 0).buffer(1),
                       start_time=pd.Timestamp("2022-01-01"),
                       end_time=pd.Timestamp("2024-01-01"))

This approach extends TxtAI's capabilities while maintaining simplicity and integration with its existing ecosystem.
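
One consequence of filtering after the semantic search is that a query can return fewer than `limit` hits, or none, even when matching nodes exist further down the ranking. Below is a minimal sketch of one possible workaround, over-fetching and then filtering, using only the standard library. The `ranked` list stands in for semantic search output, and `filtered_search`, `allowed` and `factor` are hypothetical names, not part of txtai:

```python
def filtered_search(ranked, allowed, limit=10, factor=5):
    """Return up to `limit` (id, score) pairs whose id is in `allowed`.

    ranked:  (id, score) pairs sorted by descending score, standing in
             for the output of a semantic search
    allowed: set of ids that passed the spatial/temporal filters
    factor:  how many extra candidates to inspect per requested result
    """
    candidates = ranked[: limit * factor]
    return [(uid, score) for uid, score in candidates if uid in allowed][:limit]


# Stand-in for semantic search results, best match first
ranked = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.5)]
allowed = {"d", "e"}

# Filtering only the top 2 hits finds nothing
naive = [r for r in ranked[:2] if r[0] in allowed]

# Over-fetching before filtering recovers the matching nodes
better = filtered_search(ranked, allowed, limit=2)
```

The over-fetch factor is a trade-off: a larger value inspects more candidates per query, while a smaller one risks missing matches that rank low semantically.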


@nicolas-geysse
Author

A variant more closely tied to attributes and types, following the "Extension of Data Sources" point and focusing on integrating geospatial and temporal data while preserving attributes and types:

Extension of Data Sources
• Addition: Data import with attribute and type preservation
• Addition: Support for geospatial and temporal data
• Libraries: qwikidata, geopandas, pandas
• Benefits: Enrichment of graphs with structured and geotemporal data

Implementation:

  1. Extend TxtAI's Graph class to support geospatial and temporal data:
from txtai.graph import Graph
import geopandas as gpd
import pandas as pd
from qwikidata.linked_data_interface import get_entity_dict_from_api

# Sketch only: self.graph is assumed to be the underlying NetworkX graph;
# this attribute name has not been checked against txtai's Graph class.
class EnhancedGraph(Graph):
    def __init__(self):
        super().__init__()
        # Rows are collected in lists; DataFrame.append was removed in pandas 2.0
        self.geo_rows = []
        self.temporal_rows = []

    def add_geospatial_node(self, node_id, geometry, **attrs):
        self.graph.add_node(node_id, geometry=geometry, **attrs)
        self.geo_rows.append({**attrs, 'node_id': node_id, 'geometry': geometry})

    def add_temporal_node(self, node_id, timestamp, **attrs):
        self.graph.add_node(node_id, timestamp=timestamp, **attrs)
        self.temporal_rows.append({**attrs, 'node_id': node_id, 'timestamp': timestamp})

    def import_wikidata(self, entity_id):
        entity_dict = get_entity_dict_from_api(entity_id)
        node_id = entity_dict['id']
        # 'claims' maps each property id to a list of statements;
        # keep the first statement that carries a value
        attrs = {}
        for prop, statements in entity_dict['claims'].items():
            for statement in statements:
                if 'datavalue' in statement['mainsnak']:
                    attrs[prop] = statement['mainsnak']['datavalue']['value']
                    break
        self.graph.add_node(node_id, **attrs)
        return node_id

    def to_geopandas(self):
        if not self.geo_rows:
            return gpd.GeoDataFrame(columns=['node_id', 'geometry'], geometry='geometry')
        return gpd.GeoDataFrame(self.geo_rows, geometry='geometry')

    def to_temporal_pandas(self):
        return pd.DataFrame(self.temporal_rows)
  2. Implement methods to import and integrate different data types:
    def import_geojson(self, file_path):
        gdf = gpd.read_file(file_path)
        for idx, row in gdf.iterrows():
            # Drop the geometry column so it is not passed twice
            self.add_geospatial_node(idx, row.geometry, **row.drop('geometry').to_dict())

    def import_temporal_csv(self, file_path, timestamp_col, node_id_col):
        df = pd.read_csv(file_path, parse_dates=[timestamp_col])
        for idx, row in df.iterrows():
            # Pass the id and timestamp explicitly and the remaining columns as attributes
            attrs = row.drop([timestamp_col, node_id_col]).to_dict()
            self.add_temporal_node(row[node_id_col], row[timestamp_col], **attrs)

    def spatial_query(self, geometry):
        gdf = self.to_geopandas()
        return gdf[gdf.intersects(geometry)]

    def temporal_query(self, start_time, end_time):
        df = self.to_temporal_pandas()
        if df.empty:
            return df
        mask = (df['timestamp'] >= start_time) & (df['timestamp'] <= end_time)
        return df.loc[mask]
  3. Usage example:
from shapely.geometry import box

graph = EnhancedGraph()

# Import geospatial data
graph.import_geojson("cities.geojson")

# Import temporal data
graph.import_temporal_csv("events.csv", timestamp_col="event_date", node_id_col="event_id")

# Import Wikidata
node_id = graph.import_wikidata("Q64")

# Perform spatial and temporal queries
# Illustrative bounding box (min lon, min lat, max lon, max lat); replace with the area of interest
some_polygon = box(13.0, 52.3, 13.8, 52.7)
cities_in_area = graph.spatial_query(some_polygon)
events_in_timeframe = graph.temporal_query(pd.Timestamp("2023-01-01"), pd.Timestamp("2023-12-31"))

# Convert to GeoDataFrame or DataFrame for further analysis
gdf = graph.to_geopandas()
temporal_df = graph.to_temporal_pandas()

This implementation enhances TxtAI's graph capabilities by:

  1. Supporting geospatial and temporal data alongside the existing graph structure.
  2. Preserving attributes and types when importing data from various sources.
  3. Providing methods to query and analyze the data based on spatial and temporal criteria.
  4. Integrating with external data sources like Wikidata.
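
For the Wikidata import, it helps to keep in mind that in Wikidata's entity JSON the `claims` field is a mapping from property id to a list of statements, not a flat list. Below is a small sketch of flattening it into node attributes. It runs on a hand-written, made-up sample instead of an API response, and the `claims_to_attrs` helper is a hypothetical name used only for illustration:

```python
def claims_to_attrs(entity):
    """Flatten Wikidata-style claims into {property id: first available value}."""
    attrs = {}
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            # "novalue" and "somevalue" snaks carry no datavalue and are skipped
            if "datavalue" in snak:
                attrs[prop] = snak["datavalue"]["value"]
                break
    return attrs


# Made-up sample shaped like an entity document (not real API output)
sample = {
    "id": "Q64",
    "claims": {
        "P31": [
            {"mainsnak": {"snaktype": "value", "property": "P31",
                          "datavalue": {"value": {"id": "Q515"},
                                        "type": "wikibase-entityid"}}}
        ],
        "P576": [
            {"mainsnak": {"snaktype": "novalue", "property": "P576"}}
        ],
    },
}

attrs = claims_to_attrs(sample)
```

Values are often nested dictionaries (entity references, quantities, times), so a real import would likely need a per-datatype conversion before storing them as node attributes.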

Regarding the initial type problem:
This implementation indirectly addresses the initial type problem by providing a more robust framework for handling different types of data, including the ability to preserve and query based on node types and attributes. While it doesn't directly solve the specific issue of adding a 'type' attribute to nodes, it provides a flexible structure that can easily accommodate such attributes and more complex data types.

The approach is well-integrated with TxtAI's ecosystem, extending its Graph class and using compatible libraries like geopandas and pandas. It also leverages NetworkX's underlying graph structure while adding geospatial and temporal capabilities on top of it.


@yukiman76
Contributor

How about something like this?

import uuid
from txtai.embeddings import Embeddings
from datetime import datetime, timedelta


class TemporalEmbeddings(Embeddings):
    """
    Extends txtai Embeddings to handle temporal data.
    """

    def get_dt_object(self, date_string):
        return datetime.strptime(date_string, "%Y-%m-%dT%H:%M:%SZ")

    def __init__(self, config=None, models=None, **kwargs):
        super().__init__(config, models, **kwargs)

    def index(self, documents):
        # Materialize the input so it can be validated and then indexed
        documents = list(documents)
        # Every document must carry a non-empty doc_date (a missing key counts as empty)
        for document in documents:
            if not document.get("doc_date"):
                raise ValueError("All documents must have a doc_date field")

        # Call parent class indexing
        super().index(documents)

    def search(
        self,
        text,
        limit=3,
        min_score=0.40,
        start_date=None,
        end_date=None,
    ):
        if not start_date:
            # Default to one year ago, as an ISO 8601 string to match the stored doc_date values
            start_date = (datetime.now() - timedelta(days=365)).isoformat()

        if not end_date:
            end_date = datetime.now().isoformat()

        query = """SELECT id, text, score, doc_date
            FROM txtai
            WHERE similar(:x)
            AND score > :s
            AND doc_date > :sdt
            AND doc_date <= :edt
            """

        docs = super().search(
            query,
            limit=limit,
            parameters={"x": text,
                        "s": min_score, 
                        "sdt": start_date, 
                        "edt": end_date},
        )

        return docs

    def search_direct(self, query, limit=None, weights=None, index=None, parameters=None, graph=False):
        """simple way to bypass the search construct override, for direct control of query"""
        docs = super().search(
            query=query, 
            limit=limit, 
            weights=weights, 
            index=index, 
            parameters=parameters, 
            graph=graph
        )

        return docs

if __name__ == "__main__":
    embeddings = TemporalEmbeddings(
        hybrid=True,
        autoid="uuid5",
        path="intfloat/e5-large",
        instructions={"query": "query: ", "data": "passage: "},
        content=True,
        gpu=True,
    )

    # Example documents with timestamps
    documents = [
        {
            "id": str(uuid.uuid4()),
            "text": "The latest AI breakthrough in natural language processing",
            "doc_date": (datetime.now() - timedelta(days=22)).isoformat(),
        },
        {
            "id": str(uuid.uuid4()),
            "text": "Historical overview of machine learning developments",
            "doc_date": (datetime.now() - timedelta(days=20)).isoformat(),
        },
        {
            "id": str(uuid.uuid4()),
            "text": "Future predictions for artificial intelligence",
            "doc_date": (datetime.now() - timedelta(days=3)).isoformat(),
        },
    ]
    approximate_year_ago = datetime.now() - timedelta(days=365)
    # Index documents with temporal information
    embeddings.index(documents)
    print(embeddings.count())
    print(embeddings.search("artificial"))
    results = embeddings.search(
        "machine",
        min_score=0.6, 
        start_date=approximate_year_ago.isoformat()
    )
    print("Search results:", results)
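
The date filter above depends on `doc_date` values being stored and compared as ISO 8601 strings, which sort chronologically as long as they share one format. Below is a minimal sketch of that comparison using the standard library's `sqlite3` directly. The `docs` table is a hypothetical stand-in, not txtai's internal schema:

```python
import sqlite3
from datetime import datetime, timedelta

# Fixed "now" so the example is reproducible
now = datetime(2024, 6, 25, 12, 0, 0)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT, doc_date TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("old", (now - timedelta(days=400)).isoformat()),
        ("recent", (now - timedelta(days=20)).isoformat()),
        ("newest", (now - timedelta(days=3)).isoformat()),
    ],
)

# Both bounds are ISO strings, so SQLite compares text with text
start = (now - timedelta(days=365)).isoformat()
end = now.isoformat()
rows = conn.execute(
    "SELECT id FROM docs WHERE doc_date > :sdt AND doc_date <= :edt ORDER BY doc_date",
    {"sdt": start, "edt": end},
).fetchall()
ids = [row[0] for row in rows]
```

Passing one bound as a `datetime` object and the other as a string would mix types in the comparison, which is why both bounds are converted with `isoformat()` here.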
