Support better integration with Map-reduce #556
James,
We have APIs to access metadata: the standard JDBC DatabaseMetaData interface. My preference would be to expose access to these when running map-reduce. I agree that for (b), the CSV bulk loader only solves a subset of the problem. One thing that might help is making it easy to create built-in functions. Also, using Pig instead of Map-reduce may be a better model. I agree that for (c), we can't rely on holding everything in memory.
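For example, a Phoenix connection answers the standard JDBC metadata calls. A minimal sketch (the JDBC URL and the table name MY_TABLE are placeholders):

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ListColumns {
    public static void main(String[] args) throws Exception {
        // Embedded Phoenix driver; "localhost" stands in for the ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            DatabaseMetaData md = conn.getMetaData();
            // Standard JDBC metadata call: list the columns of a table.
            try (ResultSet rs = md.getColumns(null, null, "MY_TABLE", null)) {
                while (rs.next()) {
                    System.out.println(rs.getString("COLUMN_NAME")
                            + " " + rs.getString("TYPE_NAME"));
                }
            }
        }
    }
}
```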
Hi James,
For (a), we just need to hand a Scan instance to the job via TableMapReduceUtil, which we can construct using the builder class (a sketch follows below). Regards
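A sketch of that job setup using plain HBase TableMapReduceUtil; the table name, mapper, and row-key range are placeholders:

```java
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobSetup {
    // Placeholder mapper; a real one would decode the Phoenix-encoded cells.
    public static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    }

    public static void configure(Job job, byte[] startRow, byte[] stopRow)
            throws Exception {
        Scan scan = new Scan();
        scan.setStartRow(startRow);  // row-key range computed elsewhere
        scan.setStopRow(stopRow);
        scan.setCaching(500);
        scan.setCacheBlocks(false);  // recommended for MR jobs over HBase
        TableMapReduceUtil.initTableMapperJob("MY_TABLE", scan, MyMapper.class,
                ImmutableBytesWritable.class, Result.class, job);
    }
}
```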
To construct the row key for a table, you can do the following:
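A minimal sketch of that approach, assuming the pre-Apache (com.salesforce.phoenix) client API; the lookup chain getPMetaData().getSchema(...).getTable(...) and PTable.newKey(...) are assumptions about that codebase, not a verified recipe:

```java
import java.sql.DriverManager;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import com.salesforce.phoenix.jdbc.PhoenixConnection;
import com.salesforce.phoenix.schema.PTable;

public class RowKeyBuilder {
    public static byte[] buildRowKey(String schemaName, String tableName,
            byte[][] pkValues) throws Exception {
        // Unwrap the embedded driver's connection to reach the client metadata.
        PhoenixConnection pconn = DriverManager
                .getConnection("jdbc:phoenix:localhost")
                .unwrap(PhoenixConnection.class);
        PTable table = pconn.getPMetaData().getSchema(schemaName).getTable(tableName);
        // Each pkValues[i] must already be encoded with the PDataType of the
        // corresponding PK column, in primary-key order.
        ImmutableBytesWritable keyPtr = new ImmutableBytesWritable();
        table.newKey(keyPtr, pkValues);  // assembles the composite row key
        return keyPtr.copyBytes();
    }
}
```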
Not sure I follow about forming the start/stop key of the Scan. Is this based on a SQL query? There's quite a bit of code in Phoenix that figures this out. For (b), it's fine to initialize a connection in the reducer, as our driver is embedded. All connections in the same JVM share the same HConnection between the HBase client and server, so creating a new connection is just a few object instantiations. One other call you'll likely need to make is the following:
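A sketch of that call, assuming the MetaDataClient.updateCache(...) API of the codebase at the time:

```java
import java.sql.Connection;
import com.salesforce.phoenix.jdbc.PhoenixConnection;
import com.salesforce.phoenix.schema.MetaDataClient;

public class CacheRefresher {
    public static void refreshTable(Connection conn, String schemaName,
            String tableName) throws Exception {
        PhoenixConnection pconn = conn.unwrap(PhoenixConnection.class);
        // Pull the latest table definition from SYSTEM.TABLE into the
        // client-side cache so the PTable used for row-key encoding is current.
        new MetaDataClient(pconn).updateCache(schemaName, tableName);
    }
}
```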
This will populate the client-side cache that you're using to retrieve the PTable from the server-side definition in the SYSTEM.TABLE. If you know that the schema of the table won't change while your map-reduce job is running, this can be a one-time call. Otherwise, you'd need to do it prior to forming each row key to ensure you're working with the latest table definition.
@mravi - did anything come out of your effort here?
Idea from the mailing list here and here: support a SELECT query to determine the output of the Map function and/or an UPSERT SELECT in the Reduce function.
Given that Phoenix already parallelizes your query in much the same way a Map-reduce job would, it's unclear to me what benefit this functionality would provide. Would the SELECT statement run over only a single region's worth of data? What can you do with Map-reduce that you can't do with an UPSERT SELECT? I'm sure there are plenty of things, but it would be good to list the top ten to identify what's missing from Phoenix.
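For comparison, a minimal sketch of an aggregating UPSERT SELECT issued through the embedded JDBC driver; the tables and columns are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class UpsertSelectExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            // Phoenix parallelizes this across regions much like an MR job:
            // the SELECT side scans and aggregates server-side, and the
            // results are written back through the same embedded client.
            conn.createStatement().executeUpdate(
                    "UPSERT INTO DAILY_TOTALS(EVENT_DAY, TOTAL) " +
                    "SELECT TRUNC(EVENT_TIME, 'DAY'), SUM(AMOUNT) " +
                    "FROM EVENTS GROUP BY TRUNC(EVENT_TIME, 'DAY')");
            conn.commit();  // autocommit is off by default in Phoenix
        }
    }
}
```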