Feature request : Enhanced Geospatial and Temporal Search #740
A (possibly) more closely related approach to attributes and types: extension of data sources. Implementation:
from txtai.graph import Graph
import geopandas as gpd
import pandas as pd
from qwikidata.linked_data_interface import get_entity_dict_from_api


class EnhancedGraph(Graph):
    # NOTE: the constructor call and the self.graph attribute (a NetworkX graph)
    # are assumptions of this proposal; adapt them to the txtai Graph API in use.
    def __init__(self):
        super().__init__()
        self.gdf = gpd.GeoDataFrame()
        self.temporal_data = pd.DataFrame()

    def add_geospatial_node(self, node_id, geometry, **attrs):
        self.graph.add_node(node_id, geometry=geometry, **attrs)
        # DataFrame.append was removed in pandas 2.0: build a one-row frame and concat
        row = gpd.GeoDataFrame([{'node_id': node_id, **attrs}], geometry=[geometry])
        self.gdf = row if self.gdf.empty else pd.concat([self.gdf, row], ignore_index=True)

    def add_temporal_node(self, node_id, timestamp, **attrs):
        self.graph.add_node(node_id, timestamp=timestamp, **attrs)
        row = pd.DataFrame([{'node_id': node_id, 'timestamp': timestamp, **attrs}])
        self.temporal_data = row if self.temporal_data.empty else pd.concat([self.temporal_data, row], ignore_index=True)

    def import_wikidata(self, entity_id):
        entity_dict = get_entity_dict_from_api(entity_id)
        node_id = entity_dict['id']
        # 'claims' maps each property id to a list of claims; keep the first value per property
        attrs = {prop: claims[0]['mainsnak']['datavalue']['value']
                 for prop, claims in entity_dict['claims'].items()
                 if claims and 'datavalue' in claims[0]['mainsnak']}
        self.graph.add_node(node_id, **attrs)
        return node_id

    def to_geopandas(self):
        return self.gdf

    def to_temporal_pandas(self):
        return self.temporal_data

    def import_geojson(self, file_path):
        gdf = gpd.read_file(file_path)
        for idx, row in gdf.iterrows():
            # Drop the geometry column so it is not passed twice
            self.add_geospatial_node(idx, row.geometry, **row.drop('geometry').to_dict())

    def import_temporal_csv(self, file_path, timestamp_col, node_id_col):
        df = pd.read_csv(file_path, parse_dates=[timestamp_col])
        for idx, row in df.iterrows():
            # Pass the remaining columns as attributes; id and timestamp are passed explicitly
            attrs = row.drop([node_id_col, timestamp_col]).to_dict()
            self.add_temporal_node(row[node_id_col], row[timestamp_col], **attrs)

    def spatial_query(self, geometry):
        # Rows whose geometry intersects the query geometry
        return self.gdf[self.gdf.intersects(geometry)]

    def temporal_query(self, start_time, end_time):
        # Rows whose timestamp lies in [start_time, end_time]
        mask = (self.temporal_data['timestamp'] >= start_time) & (self.temporal_data['timestamp'] <= end_time)
        return self.temporal_data.loc[mask]


graph = EnhancedGraph()
# Import geospatial data
graph.import_geojson("cities.geojson")
# Import temporal data
graph.import_temporal_csv("events.csv", timestamp_col="event_date", node_id_col="event_id")
# Import Wikidata
node_id = graph.import_wikidata("Q64")
# Perform spatial and temporal queries
# (some_polygon is a shapely geometry supplied by the caller; it is not defined in this post)
cities_in_area = graph.spatial_query(some_polygon)
events_in_timeframe = graph.temporal_query(pd.Timestamp("2023-01-01"), pd.Timestamp("2023-12-31"))
# Convert to GeoDataFrame or DataFrame for further analysis
gdf = graph.to_geopandas()
temporal_df = graph.to_temporal_pandas()

This implementation enhances TxtAI's graph capabilities by:
Regarding the initial type problem: the approach fits into TxtAI's ecosystem by extending its Graph class and using compatible libraries such as geopandas and pandas. It builds on the underlying NetworkX graph structure and adds geospatial and temporal capabilities on top of it.
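The idea of layering spatial and temporal lookups over node attributes can be shown without any third-party dependency. This is only a minimal sketch, not txtai's API: `AttributeGraph`, `within_bbox`, and `between` are hypothetical names, the sample nodes are made up, and a bounding box stands in for the arbitrary geometries that geopandas would handle.

```python
from datetime import datetime


class AttributeGraph:
    """Toy node store: node id -> attribute dict (hypothetical, not a txtai class)."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def within_bbox(self, min_lon, min_lat, max_lon, max_lat):
        """Return ids of nodes whose (lon, lat) point lies inside the box."""
        return [
            node_id
            for node_id, attrs in self.nodes.items()
            if "lon" in attrs and "lat" in attrs
            and min_lon <= attrs["lon"] <= max_lon
            and min_lat <= attrs["lat"] <= max_lat
        ]

    def between(self, start, end):
        """Return ids of nodes whose timestamp falls in [start, end]."""
        return [
            node_id
            for node_id, attrs in self.nodes.items()
            if "timestamp" in attrs and start <= attrs["timestamp"] <= end
        ]


# Made-up sample nodes: two places and two events
graph = AttributeGraph()
graph.add_node("berlin", lon=13.40, lat=52.52)
graph.add_node("paris", lon=2.35, lat=48.86)
graph.add_node("launch", timestamp=datetime(2023, 6, 1))
graph.add_node("retro", timestamp=datetime(2021, 3, 15))

central_europe = graph.within_bbox(5.0, 45.0, 20.0, 55.0)
events_2023 = graph.between(datetime(2023, 1, 1), datetime(2023, 12, 31))
```

Nodes that lack coordinates or a timestamp are simply skipped by the corresponding query, so geospatial and temporal nodes can live in the same graph.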
How about something like this:

import uuid
from datetime import datetime, timedelta

from txtai.embeddings import Embeddings


class TemporalEmbeddings(Embeddings):
    """
    Extends txtai Embeddings to handle temporal data
    """

    def get_dt_object(self, date_string):
        return datetime.strptime(date_string, "%Y-%m-%dT%H:%M:%SZ")

    def __init__(self, config=None, models=None, **kwargs):
        super().__init__(config, models, **kwargs)

    def index(self, documents):
        # we need to make sure each document has a timestamp
        for document in documents:
            # a missing key and an empty string are both rejected
            if not document.get("doc_date"):
                raise ValueError("All documents must have a doc_date field")
        # Call parent class indexing
        super().index(documents)

    def search(
        self,
        text,
        limit=3,
        min_score=0.40,
        start_date=None,
        end_date=None,
    ):
        # doc_date is stored as an ISO 8601 string, so both bounds are ISO strings too
        if not start_date:
            # we will default to 1 year ago
            start_date = (datetime.now() - timedelta(days=365)).isoformat()
        if not end_date:
            end_date = datetime.now().isoformat()
        query = """SELECT id, text, score, doc_date
        FROM txtai
        WHERE similar(:x)
        AND score > :s
        AND doc_date > :sdt
        AND doc_date <= :edt
        """
        docs = super().search(
            query,
            limit=limit,
            parameters={"x": text,
                        "s": min_score,
                        "sdt": start_date,
                        "edt": end_date},
        )
        return docs

    def search_direct(self, query, limit=None, weights=None, index=None, parameters=None, graph=False):
        """simple way to bypass the search construct override, for direct control of query"""
        docs = super().search(
            query=query,
            limit=limit,
            weights=weights,
            index=index,
            parameters=parameters,
            graph=graph
        )
        return docs


if __name__ == "__main__":
    embeddings = TemporalEmbeddings(
        hybrid=True,
        autoid="uuid5",
        path="intfloat/e5-large",
        instructions={"query": "query: ", "data": "passage: "},
        content=True,
        gpu=True,
    )
    # Example documents with timestamps
    documents = [
        {
            "id": str(uuid.uuid4()),
            "text": "The latest AI breakthrough in natural language processing",
            "doc_date": (datetime.now() - timedelta(days=22)).isoformat(),
        },
        {
            "id": str(uuid.uuid4()),
            "text": "Historical overview of machine learning developments",
            "doc_date": (datetime.now() - timedelta(days=20)).isoformat(),
        },
        {
            "id": str(uuid.uuid4()),
            "text": "Future predictions for artificial intelligence",
            "doc_date": (datetime.now() - timedelta(days=3)).isoformat(),
        },
    ]
    # 365 days (the original 3365 looks like a typo given the variable name)
    approximate_year_ago = datetime.now() - timedelta(days=365)
    # Index documents with temporal information
    embeddings.index(documents)
    print(embeddings.count())
    print(embeddings.search("artificial"))
    results = embeddings.search(
        "machine",
        min_score=0.6,
        start_date=approximate_year_ago.isoformat()
    )
    print("Search results:", results)
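The date-range clause works because ISO 8601 strings in the same format sort in chronological order, so the bounds need to be ISO strings like the stored `doc_date` values. As a rough illustration only, the sketch below runs the same `:sdt` / `:edt` filter against a plain in-memory `sqlite3` table; the `docs` table, the fixed `now` value, and the sample rows are assumptions for the example and are not part of txtai.

```python
import sqlite3
from datetime import datetime, timedelta

# Fixed "now" so the example is reproducible (assumption for illustration)
now = datetime(2024, 1, 31, 12, 0, 0)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT, text TEXT, doc_date TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        ("a", "recent note", (now - timedelta(days=3)).isoformat()),
        ("b", "older note", (now - timedelta(days=20)).isoformat()),
        ("c", "ancient note", (now - timedelta(days=400)).isoformat()),
    ],
)

# Same shape as the clause above: exclusive start, inclusive end,
# with both bounds passed as ISO 8601 strings
rows = conn.execute(
    "SELECT id FROM docs WHERE doc_date > :sdt AND doc_date <= :edt ORDER BY doc_date",
    {"sdt": (now - timedelta(days=365)).isoformat(), "edt": now.isoformat()},
).fetchall()
ids = [row[0] for row in rows]
conn.close()
```

Only the rows dated within the last 365 days survive the filter; the 400-day-old row is excluded.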
Here's a plan to enhance TxtAI with geospatial and temporal search capabilities:
1. Extend indexing for geospatial data:
2. Implement temporal search functionalities:
3. Integrate with existing semantic search:
This implementation:
To use this enhanced graph:
This approach extends TxtAI's capabilities while maintaining simplicity and integration with its existing ecosystem.
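One way the three steps of the plan could fit together is to post-filter scored search hits by distance and date. The following is a hedged sketch under stated assumptions, not an existing txtai feature: `Hit`, `haversine_km`, and `filter_hits` are hypothetical names, the hits are made-up sample values, and a radius around a point stands in for richer geometries.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


@dataclass
class Hit:
    """A scored search result with a location and a date (hypothetical shape)."""
    id: str
    score: float
    lat: float
    lon: float
    date: datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def filter_hits(hits, center, radius_km, start, end, min_score=0.0):
    """Keep hits inside the radius and the time window, best score first."""
    kept = [
        h for h in hits
        if h.score >= min_score
        and start <= h.date <= end
        and haversine_km(center[0], center[1], h.lat, h.lon) <= radius_km
    ]
    return sorted(kept, key=lambda h: h.score, reverse=True)


# Made-up hits, as if returned by a semantic search
hits = [
    Hit("berlin-2023", 0.71, 52.52, 13.40, datetime(2023, 5, 1)),
    Hit("potsdam-2023", 0.83, 52.40, 13.06, datetime(2023, 8, 9)),
    Hit("paris-2023", 0.90, 48.86, 2.35, datetime(2023, 7, 4)),
    Hit("berlin-2019", 0.95, 52.52, 13.40, datetime(2019, 2, 2)),
]

nearby = filter_hits(
    hits,
    center=(52.52, 13.40),
    radius_km=50,
    start=datetime(2023, 1, 1),
    end=datetime(2023, 12, 31),
)
result_ids = [h.id for h in nearby]
```

Filtering after the semantic search keeps the ranking logic untouched, at the cost of possibly returning fewer results than the requested limit; pushing the filters into the query, as in the SQL variant earlier in the thread, avoids that.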
Citations:
[1] https://networkx.org/documentation/stable/auto_examples/geospatial/index.html
[2] https://networkx.org/documentation/stable/auto_examples/geospatial/extended_description.html
[3] geopandas/geopandas#1592
[4] https://napo.github.io/geospatial_course_unitn/lessons/05-street-network-analysis
[5] https://pypi.org/project/networkx-temporal/
[6] https://www.timescale.com/blog/tools-for-working-with-time-series-analysis-in-python/
[7] https://pythongis.org/part1/chapter-03/nb/03-temporal-data.html
[8] https://github.com/MaxBenChrist/awesome_time_series_in_python
[9] https://unit8co.github.io/darts/
[10] https://www.timescale.com/blog/how-to-work-with-time-series-in-python/
[11] https://github.com/sacridini/Awesome-Geospatial
[12] https://www.mdpi.com/1999-4893/10/2/37