Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Table writer 6: create directory if not exists #3414

Closed
wants to merge 1 commit into from

Conversation

gggrace14
Copy link
Contributor

@gggrace14 gggrace14 commented Dec 5, 2022

If the parent directory of the file that a LocalFileSink writes to
does not exist, create the directory first when creating
the LocalFileSink.

Rename FileSink to LocalFileSink, to be more clear.
LocalFileSink is one subclass of the interface
dwio::common::DataSink that implements a local file
system based data sink for a dwio writer.

This is needed by a partitioned table writer, which calls on
the LocalFileSink to write to files under partition subdirectories
that are created during query execution.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 5, 2022
@netlify
Copy link

netlify bot commented Dec 5, 2022

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 7fa3d5d
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/639aa3afba23900009328859

@facebook-github-bot
Copy link
Contributor

@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -28,6 +30,10 @@ FileSink::FileSink(
const MetricsLogPtr& metricLogger,
IoStatistics* stats)
: DataSink{name, metricLogger, stats} {
auto dir = fs::path(name).parent_path();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we use Velox filesystem here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @xiaoxmeng , I find DataSink abstraction and Velox FileSystem abstraction are kept separate now, ex., we register DataSink independently,

#define VELOX_STATIC_REGISTER_DATA_SINK(function) \
.
Current Velox has separate implementation of local FileSink and WarmStorageBlockSink, https://fburl.com/code/usn9qvb8. Each calls their corresponding "file system" APIs for file ops. I think we should follow this pattern.

As you might know, the reason we call Velox FileSystem interface is to auto dispatch to the right "file system" APIs based on the file path prefix. Here we're already with the local file DataSink, so we don't need the FileSystem interface for the Polymorphism

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we use the same set of interface to access file system. My understanding is that local FileSink and WarmStorageBlockSink could be a layer on top of file system abstraction. Ideally, we could have one DataSink object built on top of different file system interfaces?

Copy link
Contributor

@xiaoxmeng xiaoxmeng Dec 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gggrace14 Can we use Velox filesystem interface to access local file system instead of directly calling posix APIs. The same for ws based data sink? Thanks!

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gggrace14 Would you add a test and update PR description to provide some more details about the change?

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gggrace14 Is FileSink a test-only class? If so, it would be cleaner to move it into test namespace and document accordingly.

@gggrace14
Copy link
Contributor Author

@gggrace14 Is FileSink a test-only class? If so, it would be cleaner to move it into test namespace and document accordingly.

Hi @mbasmanova , nope, FileSink is prod class and being called by table writing on a local Presto server.

@mbasmanova
Copy link
Contributor

@gggrace14 Ge, thank you for clarifying. I see that FileSink is an implementation of DataSink interface for the local file system. I guess it would be clearer if it was called LocalFileSink :-)

@facebook-github-bot
Copy link
Contributor

@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@gggrace14
Copy link
Contributor Author

gggrace14 commented Dec 8, 2022

@gggrace14 Ge, thank you for clarifying. I see that FileSink is an implementation of DataSink interface for the local file system. I guess it would be clearer if it was called LocalFileSink :-)

@mbasmanova Hi Masha, good suggestion. I refactored to rename FileSink to LocalFileSink, to make the name more clear (and consistent w/ the class LocalFileSystem that we have).

It touches several files, so will wait for the CicleCi tests to be all green.

There're many deps on FileSink in fbcode. Also adapting them

@facebook-github-bot
Copy link
Contributor

@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@gggrace14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gggrace14 Looks good to me % some questions below.


namespace facebook::velox::dwio::common {

TEST(LocalFileSinkTests, create) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: LocalFileSinkTests -> LocalFileSinkTest and the file name should be LocalFileSinkTest.cpp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised

auto subdir = fs::path();
for (auto i = 0; i < 4; i++) {
subdir =
subdir / boost::uuids::to_string(boost::uuids::random_generator()());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test would be more readable if only top-level directory name is auto-generated, but the rest of the directories and the file name are hard-coded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Revised.

}
auto fileName = boost::uuids::to_string(boost::uuids::random_generator()());

auto filePath = root / subdir / fileName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps, add a check that root / subdir doesn't exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add it.

DataSink::create(fmt::format("file:{}", filePath.string()));
localFileSink->close();

EXPECT_TRUE(fs::exists(filePath.string()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make sure the file is deleted at the end of the test?

Copy link
Contributor Author

@gggrace14 gggrace14 Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root is created by TempDirectoryPath::create(), and I see the destructor of TempDirectoryPath is like

TempDirectoryPath::~TempDirectoryPath() {
boost::filesystem::remove_all(path.c_str());
}

Also TempDirectoryPath::create() calls mkdtemp() underlying

static std::string createTempDirectory() {
char path[] = "/tmp/velox_test_XXXXXX";
const char* tempDirectoryPath = mkdtemp(path);

So my understanding is the root directory would be deleted when the variable leaves its scope.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto root = fs::path(TempDirectoryPath::create()->path);

This code creates directory at 'root' path and deletes it (because the result of TempDirectoryPath::create() is not stored in a variable and goes out or scope after the call.). After this call, I expect root to not exist. Hence, it is not clear which code will take care of deleting the directory at the end of the test. Would you double check that the directory is indeed deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised, and checked the directory gets deleted after the unit test finishes.

I see what you mean, and you're right. As I didn't store in a variable TempDirectoryPath::create() but stored ->path, the destructor ~TempDirectoryPath() is called right away after this line. So no directory deletion will be called after the test finishes.

@@ -270,12 +270,12 @@ class HiveConnectorFactory : public ConnectorFactory {
"hive-hadoop2";

HiveConnectorFactory() : ConnectorFactory(kHiveConnectorName) {
dwio::common::FileSink::registerFactory();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why do we need to register local file sink in the Hive Connector? Shouldn't the application using Hive connector register whatever file system it is going to use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My best guess is it just tries to make sure at least one DataSink, the LocalFileSink, is registered, as Velox itself also needs it for, ex., unit tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like it is not HiveConnector's job to setup things for unit tests. It would be nice to remove this code in a follow-up PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will follow up with a separate PR and move it to the right place.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gggrace14 Looks good % a couple of nits and a question about cleaning up test directory after the test.

LocalFileSink::registerFactory();

auto root = fs::path(TempDirectoryPath::create()->path);
auto filePath = root / "test_dir_xxx/test_dir_yyy/test_dir_zzz/test_file";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "xxx/yyy/zzz/test_file.ext" might be easier to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised

auto root = fs::path(TempDirectoryPath::create()->path);
auto filePath = root / "test_dir_xxx/test_dir_yyy/test_dir_zzz/test_file";

ASSERT_FALSE(fs::exists((filePath.string())));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: redundant parentheses around filePath.string()

@@ -270,12 +270,12 @@ class HiveConnectorFactory : public ConnectorFactory {
"hive-hadoop2";

HiveConnectorFactory() : ConnectorFactory(kHiveConnectorName) {
dwio::common::FileSink::registerFactory();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like it is not HiveConnector's job to setup things for unit tests. It would be nice to remove this code in a follow-up PR.

DataSink::create(fmt::format("file:{}", filePath.string()));
localFileSink->close();

EXPECT_TRUE(fs::exists(filePath.string()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto root = fs::path(TempDirectoryPath::create()->path);

This code creates directory at 'root' path and deletes it (because the result of TempDirectoryPath::create() is not stored in a variable and goes out or scope after the call.). After this call, I expect root to not exist. Hence, it is not clear which code will take care of deleting the directory at the end of the test. Would you double check that the directory is indeed deleted?

Summary:
If the parent directory of the file that a LocalFileSink writes to
does not exist, create the directory first when creating
the LocalFileSink.

Rename FileSink to LocalFileSink, to be more clear.
LocalFileSink is one subclass of the interface
dwio::common::DataSink that implements a local file
system based data sink for a dwio writer.

This is needed by a partitioned table writer, which calls on
the LocalFileSink to write to files under partition subdirectories
that are created during query execution.

Pull Request resolved: facebookincubator#3414

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

buck build //fb_presto_cpp:main
buck test velox/dwio/common/tests:

Reviewed By: mbasmanova

Differential Revision: D41723673

Pulled By: gggrace14

fbshipit-source-id: 6aab5e9504f6b4c284f26c85eb664dd415d5db7c
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D41723673

@facebook-github-bot
Copy link
Contributor

@gggrace14 merged this pull request in 8127e86.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants