-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Table writer 6: create directory if not exists #3414
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@@ -28,6 +30,10 @@ FileSink::FileSink( | |||
const MetricsLogPtr& metricLogger, | |||
IoStatistics* stats) | |||
: DataSink{name, metricLogger, stats} { | |||
auto dir = fs::path(name).parent_path(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we use Velox filesystem here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @xiaoxmeng , I find DataSink abstraction and Velox FileSystem abstraction are kept separate now, ex., we register DataSink independently,
velox/velox/dwio/common/DataSink.h
Line 200 in d38a666
#define VELOX_STATIC_REGISTER_DATA_SINK(function) \ |
Current Velox has separate implementation of local FileSink and WarmStorageBlockSink, https://fburl.com/code/usn9qvb8. Each calls their corresponding "file system" APIs for file ops. I think we should follow this pattern.
As you might know, the reason we call Velox FileSystem interface is to auto dispatch to the right "file system" APIs based on the file path prefix. Here we're already with the local file DataSink, so we don't need the FileSystem interface for the Polymorphism
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we use the same set of interface to access file system. My understanding is that local FileSink and WarmStorageBlockSink could be a layer on top of file system abstraction. Ideally, we could have one DataSink object built on top of different file system interfaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gggrace14 Can we use Velox filesystem interface to access local file system instead of directly calling posix APIs. The same for ws based data sink? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gggrace14 Would you add a test and update PR description to provide some more details about the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gggrace14 Is FileSink a test-only class? If so, it would be cleaner to move it into test namespace and document accordingly.
Hi @mbasmanova , nope, FileSink is prod class and being called by table writing on a local Presto server. |
@gggrace14 Ge, thank you for clarifying. I see that FileSink is an implementation of DataSink interface for the local file system. I guess it would be clearer if it was called LocalFileSink :-) |
@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova Hi Masha, good suggestion. I refactored to rename FileSink to LocalFileSink, to make the name more clear (and consistent w/ the class LocalFileSystem that we have). It touches several files, so will wait for the CicleCi tests to be all green. There're many deps on FileSink in fbcode. Also adapting them |
@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
1 similar comment
@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gggrace14 Looks good to me % some questions below.
|
||
namespace facebook::velox::dwio::common { | ||
|
||
TEST(LocalFileSinkTests, create) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: LocalFileSinkTests -> LocalFileSinkTest and the file name should be LocalFileSinkTest.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised
auto subdir = fs::path(); | ||
for (auto i = 0; i < 4; i++) { | ||
subdir = | ||
subdir / boost::uuids::to_string(boost::uuids::random_generator()()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test would be more readable if only top-level directory name is auto-generated, but the rest of the directories and the file name are hard-coded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Revised.
} | ||
auto fileName = boost::uuids::to_string(boost::uuids::random_generator()()); | ||
|
||
auto filePath = root / subdir / fileName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps, add a check that root / subdir doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add it.
DataSink::create(fmt::format("file:{}", filePath.string())); | ||
localFileSink->close(); | ||
|
||
EXPECT_TRUE(fs::exists(filePath.string())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make sure the file is deleted at the end of the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The root is created by TempDirectoryPath::create()
, and I see the destructor of TempDirectoryPath is like
velox/velox/exec/tests/utils/TempDirectoryPath.cpp
Lines 30 to 32 in 05ad3d0
TempDirectoryPath::~TempDirectoryPath() { | |
boost::filesystem::remove_all(path.c_str()); | |
} |
Also TempDirectoryPath::create() calls mkdtemp() underlying
velox/velox/exec/tests/utils/TempDirectoryPath.h
Lines 41 to 43 in 05ad3d0
static std::string createTempDirectory() { | |
char path[] = "/tmp/velox_test_XXXXXX"; | |
const char* tempDirectoryPath = mkdtemp(path); |
So my understanding is the root directory would be deleted when the variable leaves its scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto root = fs::path(TempDirectoryPath::create()->path);
This code creates directory at 'root' path and deletes it (because the result of TempDirectoryPath::create() is not stored in a variable and goes out or scope after the call.). After this call, I expect root to not exist. Hence, it is not clear which code will take care of deleting the directory at the end of the test. Would you double check that the directory is indeed deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised, and checked the directory gets deleted after the unit test finishes.
I see what you mean, and you're right. As I didn't store in a variable TempDirectoryPath::create() but stored ->path, the destructor ~TempDirectoryPath() is called right away after this line. So no directory deletion will be called after the test finishes.
@@ -270,12 +270,12 @@ class HiveConnectorFactory : public ConnectorFactory { | |||
"hive-hadoop2"; | |||
|
|||
HiveConnectorFactory() : ConnectorFactory(kHiveConnectorName) { | |||
dwio::common::FileSink::registerFactory(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why do we need to register local file sink in the Hive Connector? Shouldn't the application using Hive connector register whatever file system it is going to use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My best guess is it just tries to make sure at least one DataSink, the LocalFileSink, is registered, as Velox itself also needs it for, ex., unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like it is not HiveConnector's job to setup things for unit tests. It would be nice to remove this code in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will follow up with a separate PR and move it to the right place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gggrace14 Looks good % a couple of nits and a question about cleaning up test directory after the test.
LocalFileSink::registerFactory(); | ||
|
||
auto root = fs::path(TempDirectoryPath::create()->path); | ||
auto filePath = root / "test_dir_xxx/test_dir_yyy/test_dir_zzz/test_file"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "xxx/yyy/zzz/test_file.ext" might be easier to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised
auto root = fs::path(TempDirectoryPath::create()->path); | ||
auto filePath = root / "test_dir_xxx/test_dir_yyy/test_dir_zzz/test_file"; | ||
|
||
ASSERT_FALSE(fs::exists((filePath.string()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: redundant parentheses around filePath.string()
@@ -270,12 +270,12 @@ class HiveConnectorFactory : public ConnectorFactory { | |||
"hive-hadoop2"; | |||
|
|||
HiveConnectorFactory() : ConnectorFactory(kHiveConnectorName) { | |||
dwio::common::FileSink::registerFactory(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like it is not HiveConnector's job to setup things for unit tests. It would be nice to remove this code in a follow-up PR.
DataSink::create(fmt::format("file:{}", filePath.string())); | ||
localFileSink->close(); | ||
|
||
EXPECT_TRUE(fs::exists(filePath.string())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto root = fs::path(TempDirectoryPath::create()->path);
This code creates directory at 'root' path and deletes it (because the result of TempDirectoryPath::create() is not stored in a variable and goes out or scope after the call.). After this call, I expect root to not exist. Hence, it is not clear which code will take care of deleting the directory at the end of the test. Would you double check that the directory is indeed deleted?
Summary: If the parent directory of the file that a LocalFileSink writes to does not exist, create the directory first when creating the LocalFileSink. Rename FileSink to LocalFileSink, to be more clear. LocalFileSink is one subclass of the interface dwio::common::DataSink that implements a local file system based data sink for a dwio writer. This is needed by a partitioned table writer, which calls on the LocalFileSink to write to files under partition subdirectories that are created during query execution. Pull Request resolved: facebookincubator#3414 Test Plan: Imported from GitHub, without a `Test Plan:` line. buck build //fb_presto_cpp:main buck test velox/dwio/common/tests: Reviewed By: mbasmanova Differential Revision: D41723673 Pulled By: gggrace14 fbshipit-source-id: 6aab5e9504f6b4c284f26c85eb664dd415d5db7c
This pull request was exported from Phabricator. Differential Revision: D41723673 |
@gggrace14 merged this pull request in 8127e86. |
If the parent directory of the file that a LocalFileSink writes to
does not exist, create the directory first when creating
the LocalFileSink.
Rename FileSink to LocalFileSink, to be more clear.
LocalFileSink is one subclass of the interface
dwio::common::DataSink that implements a local file
system based data sink for a dwio writer.
This is needed by a partitioned table writer, which calls on
the LocalFileSink to write to files under partition subdirectories
that are created during query execution.