Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removes DrupalUrlAlias #47 and optimizes maps index #73

Closed
wants to merge 15 commits into from

Conversation

sirMackk
Copy link
Contributor

@sirMackk sirMackk commented May 8, 2014

Finishes #47.

This PR removes DrupalUrlAlias. I tried a couple of things, but the most robust way turned out to be to simply moving the url_alias.dst column into the node table, thereby making dst lookups faster and easier and reducing the amount joins one has to do. It also slimmed down the code a bit here and there.

A second, smaller part involves a minor optimization of the maps index that tangents off #33. It involves optimizing the view rendering time and in my dev environment it went from ~800ms to ~130ms.

Because this involves a central table, I'd advise checking this code in development mode using a copy of the production database to see if there are any issues. As far as I was able to check, things look to be working correctly :).

@jywarren
Copy link
Member

jywarren commented May 8, 2014

Luckily we now have a development server with a fresh copy of the live DB
at dev.publiclab.org. I'm testing some tag consolidation stuff on it soon
but will also try your code out on a branch and run tests, once I'm done
with that. Excited!

On Thu, May 8, 2014 at 2:26 PM, sirMackk notifications@github.com wrote:

Finishes #47 #47.

This PR removes DrupalUrlAlias. I tried a couple of things, but the most
robust way turned out to be to simply moving the url_alias.dst column into
the node table, thereby making dst lookups faster and easier and reducing
the amount joins one has to do. It also slimmed down the code a bit here
and there.

A second, smaller part involves a minor optimization of the maps index
that tangents off #33 #33. It
involves optimizing the view rendering time and in my dev environment it
went from ~800ms to ~130ms.

Because this involves a central table, I'd advise checking this code in
development mode using a copy of the production database to see if there
are any issues. As far as I was able to check, things look to be working

correctly :).

You can merge this Pull Request by running

git pull https://github.com/sirMackk/plots2 master

Or view, comment on, or merge it at:

#73
Commit Summary

  • Removes DrupalUrlAlias, adds dummy maps to seed file and optimizes
    maps
  • Removes DrupalUrlAlias.rb file

File Changes

Patch Links:


Reply to this email directly or view it on GitHubhttps://github.com//pull/73
.

@sirMackk
Copy link
Contributor Author

sirMackk commented May 9, 2014

That's great news! Looking forward to the results and keeping my fingers crossed.

@jywarren
Copy link
Member

Hi, we've had some trouble with the testing server DNS and haven't been
able to run these tests, unfortunately. I'm hoping to resolve it soon;
sorry for the delay!

On Fri, May 9, 2014 at 7:11 AM, sirMackk notifications@github.com wrote:

That's great news! Looking forward to the results and keeping my fingers
crossed.


Reply to this email directly or view it on GitHubhttps://github.com//pull/73#issuecomment-42655374
.

@sirMackk
Copy link
Contributor Author

No worries at all, I've slowed down somewhat due to getting ready to move back home to NYC 👍. I'll be back in the game shortly :).

@jywarren
Copy link
Member

Hi, sirMack - we are still debugging the big migration in the pipeline at jywarren#181, but that should be done soon (takes a while to reload the db each time). Then this is up at bat! I was wondering if you could change the "dst" column to a more legible name like "path" -- which may require renaming all instances of Node.path to something else for the time being, that function conflicts with the table name (see https://github.com/publiclab/plots2/blob/master/app/models/drupal_node.rb#L337)

Alternatively now that I look at it, we might just be able to drop the path method, since your new table would just replace it. But the method has a leading slash, where DrupalUrlAlias.dst does not. So could you add leading slashes when you migrate? That could alleviate the need for sirMackk@c9c0a68#diff-506c305b074506147da2da7a9d9d6dcfR332

Sorry we let this go a little while and it's no longer automerging :-( my bad, due to the trouble getting the test server going.

@sirMackk
Copy link
Contributor Author

Sure thing - I'll change the the dst column to path and arrange so that it gets rid of the path function. I'll also hammer it into shape so that it automerges with the master branch 👍 . Thanks for the tips!

@sirMackk
Copy link
Contributor Author

sirMackk commented Jun 5, 2014

I changed the .dst to path as you suggested and made sure the migration adds a slash. A few minor tweaks and it looks to be working well - I tested out the wiki pages, revision pages, maps pages and a few others. I also pulled in from publiclab/plots2 and merged, so it should be merging cleanly into master.

I think that should do it :).

Crosses fingers the migration works correctly

@jywarren
Copy link
Member

jywarren commented Jun 6, 2014

OK, super, once we get the tests running after #87 we'll
try this one out too. Some of the key paths which we should look out for
are:

/about
/wiki/es/cedro-grassroots-mapping-curriculum-es

and some of the legacy ones which we're trying to continue supporting... i
think these are dealt with in the routes.rb and legacy_controller.rb, but:

/tool/balloon-mapping
/place/gulf-coast

I guess in theory we could just be storing the "slug" -- the /wiki/foo or
/notes/author/01-01-2014/foo but let's think about that later. This will be
such a great optimization as it is.

On Thu, Jun 5, 2014 at 4:54 PM, sirMackk notifications@github.com wrote:

I changed the .dst to path as you suggested and made sure the migration
adds a slash. A few minor tweaks and it looks to be working well - I tested
out the wiki pages, revision pages, maps pages and a few others. I also
pulled in from publiclab/plots2 and merged, so it should be merging cleanly
into master.

I think that should do it :).

Crosses fingers the migration works correctly


Reply to this email directly or view it on GitHub
#73 (comment).

@sirMackk
Copy link
Contributor Author

sirMackk commented Jun 6, 2014

Thanks for the pointers, I'll check out the language related one today as I haven't done that yet, whereas the others resolve to wiki pages, and I'll see if I can make anything break that shouldn't :).

@jywarren
Copy link
Member

jywarren commented Jun 6, 2014

Super. If you'd like to send me and
stefan.unterhauser@organizers.publiclab.org your public key, we can give
you access to the test server to try it out.

On Fri, Jun 6, 2014 at 10:56 AM, sirMackk notifications@github.com wrote:

Thanks for the pointers, I'll check out the language related one today as
I haven't done that yet, whereas the others resolve to wiki pages, and I'll
see if I can make anything break that shouldn't :).


Reply to this email directly or view it on GitHub
#73 (comment).

@jywarren
Copy link
Member

Oops, should've commented here; i'll repeat:

http://publiclab.org/wiki/foldable-mini-spectrometer and http://publiclab.org/wiki/foldable-spec lead to the same page. Maybe there's some easy way to do this, where we create a new node but with status = 3 or something, which triggers a redirect?

@sirMackk
Copy link
Contributor Author

Thanks for pointing this out! I've been working on getting the migrations to work with the db on the test server. I've encountered some bugs in my code when using the real db, so I've been working on fixing those :).

Do you think there's a lot of urls pointing to the same nodes? I like the idea of using redirects to keep the clone-nodes leaner, could definitely go for that.

@jywarren
Copy link
Member

I think maybe a dozen or so. They were all created manually and in many
cases we could simply make a new node containing a link to the one people
will be looking for in the node body.
On Jun 12, 2014 6:10 PM, "sirMackk" notifications@github.com wrote:

Thanks for pointing this out! I've been working on getting the migrations
to work with the db on the test server. I've encountered some bugs in my
code when using the real db, so I've been working on fixing those :).

Do you think there's a lot of urls pointing to the same nodes? I like the
idea of using redirects to keep the clone-nodes leaner, could definitely go
for that.


Reply to this email directly or view it on GitHub
#73 (comment).

@btbonval
Copy link
Member

What if we moved all slugs to their own relation. It would contain the fields node_id, slug with unique constraint only on slug, primary key and index on node_id, slug.

If we did it this way, there would be a one-to-many mapping between node-to-slug. One node could have many slugs, but one slug could only belong to one node.

Migrating would be easy: iterate through nodes, put node's id and slug in the new table. For the few by-hand cases, we could update those with some quick inserts. Then we have to rewrite the code to use the slug relation instead of the slug field.

@sirMackk
Copy link
Contributor Author

@jywarren That sounds great, the duplicate nodes wouldn't almost any overhead.

@btbonval Wait, that sounds exactly like the url_alias table this PR tries to remove! With the src column storing the node id and the dst column storing the slug.

@btbonval
Copy link
Member

@sirMackk Yeah that might be. I haven't been following the motivation for this ticket.

When @jywarren said he wants a mapping from many slugs to one page, then apparently that's what url_alias table did.

Perhaps the right approach would be to remove the slug field from the node tables entirely, leave url_alias, but ensure the database constraints I mentioned earlier: index on node, slug and unique on slug.

Keeping slug information in multiple places is a Drupal hack that doesn't make good use of the database. Question is whether we fix it to support 1:1 mapping (use only slug field in node) or 1:N mapping (use only url_alias).

@sirMackk
Copy link
Contributor Author

Having only about a dozen duplicate nodes in a 1:1 mapping wouldn't hurt the database much and would avoid an extra query. The url_alias table was had two separate indexes: id,slug,language and id,node,language so changing it to index on node, slug and making slug unique is an awesome idea for the 1:N mapping.

However, as you say, the real question is whether to go with 1:1 or 1:N, especially taking into account possible future features? I'll hold off on this for a bit and focus on the consolidate_tags migration fix until I get the green light.

@btbonval
Copy link
Member

@sirMackk A relation table like url_alias doesn't need an id. The relation is precisely defined by the combination of slug and node_id. However, it is super common of web developers to add auto incrementing ids on all tables, when they should only do so on first order object tables (as opposed to second order tables which relate first order tables).

It's kind of a pet peeve of mine when people add ids where they aren't needed, but that's what I get for being in webdev ;)

Yeah let's wait to see whether @jywarren thinks we're okay to stick with 1:1 or if we should do 1:N properly.

@jywarren
Copy link
Member

Hi, all -- sorry to be gone from the discussion; I think the advantages of
a 1:1 mapping and the vast majority of nodes having only one slug currently
is too good to pass up. We also don't have any interface currently for
users to make aliases anyways, and I think it's acceptable to just create
new nodes for the multiple-slug nodes with an HTML/Markdown link that
redirect people is fine for the ~12 nodes that have this. Thanks for being
thorough on this you guys.

On Fri, Jun 13, 2014 at 10:09 AM, Bryan Bonvallet notifications@github.com
wrote:

@sirMackk https://github.com/sirMackk A relation table like url_alias
doesn't need an id. The relation is precisely defined by the combination of
slug and node_id. However, it is super common of web developers to add
auto incrementing ids on all tables, when they should only do so on first
order object tables (as opposed to second order tables which relate first
order tables).

It's kind of a pet peeve of mine when people add ids where they aren't
needed, but that's what I get for being in webdev ;)

Yeah let's wait to see whether @jywarren https://github.com/jywarren
thinks we're okay to stick with 1:1 or if we should do 1:N properly.


Reply to this email directly or view it on GitHub
#73 (comment).

@sirMackk
Copy link
Contributor Author

@btbonval Thanks for the lesson! Rails migrations add an auto-incremented id column by default if I remember correctly. Then it plays well with the default rails methods like .find. But experience has shown me time and time again that the rails way is not always the best way or even the good way and more often than not one has to override the rails magic like in this case. I'll keep this in mind in the future :)!

@jywarren Sounds good - I'll then continue on cleaning up the migrations to finally get them working alright. Learned a thing or two with working with pretty big databases along the way. Off we go!

@btbonval
Copy link
Member

If there's a simple parameter to disable adding an id, use that. Rails is
meant for rails programmers. When you're working in open source with a
community as devoted to convention as the Rails community is, it can
actually be a worse thing to write efficient code because it flies in the
face of what the community knows and understands. It's important to include
the community as best as possible in open source projects, even if it means
doing things that violate your innermost principles.

Just like smoking in high school.

On Fri, Jun 13, 2014 at 4:00 PM, sirMackk notifications@github.com wrote:

@btbonval https://github.com/btbonval Thanks for the lesson! Rails
migrations add an auto-incremented id column by default if I remember
correctly. Then it plays well with the default rails methods like .find.
But experience has shown me time and time again that the rails way is
not always the best way or even the good way and more often than not
one has to override the rails magic like in this case. I'll keep this in
mind in the future :)!

@jywarren https://github.com/jywarren Sounds good - I'll then continue
on cleaning up the migrations to finally get them working alright. Learned
a thing or two with working with pretty big databases along the way. Off we
go!


Reply to this email directly or view it on GitHub
#73 (comment).

@jywarren
Copy link
Member

Hi, SirMackk - are you still actively working on the test server, or can I test out a new commit there in a separate branch? If you want to commit your work to a new branch, i can just do my work in a separate branch if you havent run the migration itself yet. My commit is to do with email notifications so it shouldn't conflict at all.

@sirMackk
Copy link
Contributor Author

sirMackk commented Jul 3, 2014

Hi!
I ironed out some of the last fixes and successfully ran the migrations on the test server. The url_alias table is no more! Furthermore, any duplicate urls, about 200 of them using the May db dump, are now nodes of the "redirect" type, so if anyone hits one of those addresses - they get 301'd to the new one. Hopefully this means that google will mark these as moved and slowly stop indexing them in favor of the correct ones.

The code is live on the test server and looks to be working good. However, the more eyes, the less bugs, right?

@jywarren
Copy link
Member

jywarren commented Jul 3, 2014

wow cool -- yes, and perhaps we can consult a list of the likely
duplicates. I'll take some for a spin tomorrow, i hope.

A quick check of a couple show that https://test.publiclab.org/lists
doesn't seem to work -- shows a blank page "edited 44 years ago"? Maybe
back in May we hadn't yet created that short link...

but "/licenses" works like a charm.

/kits works
/about works
/wiki/about seems to show the same page as /about -- is that expected
behavior?

On Thu, Jul 3, 2014 at 7:18 PM, sirMackk notifications@github.com wrote:

Hi!
I ironed out some of the last fixes and successfully ran the migrations on
the test server. The url_alias table is no more! Furthermore, any duplicate
urls, about 200 of them using the May db dump, are now nodes of the
"redirect" type, so if anyone hits one of those addresses - they get 301'd
to the new one. Hopefully this means that google will mark these as moved
and slowly stop indexing them in favor of the correct ones.

The code is live on the test server and looks to be working good. However,
the more eyes, the less bugs, right?


Reply to this email directly or view it on GitHub
#73 (comment).

@sirMackk
Copy link
Contributor Author

sirMackk commented Jul 4, 2014

Thanks for catching those - I found the culprits, changed them, tested on the test server and now things are looking good:

  1. /wiki/about and /about don't open the same node
  2. /lists redirect correctly as well
  3. Duplicate nodes are marked with a the word DUPLICATE in the path column - using the database on the test server, there were around 20 duplicate nodes, most of them classified as spam. I think these would have to be checked out by hand.
  4. Nodes, that are meant to redirect duplicate urls are marked as redirect|<nid> in the type column. I think that eventually google will stop indexing those urls and they could be removed. Don't know how soon though.

@jywarren
Copy link
Member

jywarren commented Jul 5, 2014

Exciting. I'm traveling right now but should have a chance to look at these
in a day or two. I'll log into test server and hand check the duplicates w
rails console.

On Jul 4, 2014 5:49 PM, "sirMackk" notifications@github.com wrote:

Thanks for catching those - I found the culprits, changed them, tested on
the test server and now things are looking good:

/wiki/about and /about don't open the same node
/lists redirect correctly as well
Duplicate nodes are marked with a the word DUPLICATE in the path column -
using the database on the test server, there were around 20 duplicate
nodes, most of them classified as spam. I think these would have to be
checked out by hand.
Nodes, that are meant to redirect duplicate urls are marked as
redirect| in the type column. I think that eventually google will stop
indexing those urls and they could be removed. Don't know how soon though.


Reply to this email directly or view it on GitHub.

@jywarren
Copy link
Member

Back in town, cranking away at catching up!!!

OK, all the examples I could think of are working properly on test.publiclab.org. I think we should merge it and wait for complaints, if any... can you think of anything else, or any way we'd lose data in the process? Are we ready to go?

@sirMackk
Copy link
Contributor Author

The only thing that comes to mind would be losing certain paths to some nodes, since I didn't manually check every node url, but that can be easily fixed manually after a complaint or something in the logs. Then just gotta check out those DUPLICATE nodes - from the database dump that's on the test server, it seems like almost all are spam except 2-3 of them.

Being cautious, I'd make a backup first and if time is scarce - you could just backup the node and url_alias tables since backing them up and restoring them should take under 10 minutes (something I found out during testing on the test server).

So.... I'd say we're ready for launch, Houston!

@jywarren
Copy link
Member

Merged and pushed live! Thanks again SirMackk and please report any errors or problems on the site here, folks -- feel free to reopen!

@jywarren jywarren closed this Jul 20, 2014
@jywarren
Copy link
Member

Actually one extra thing -- i couldn't find the db dump of Duplicates, and couldn't find any querying around a bit. We have backups, of course, nightly. Can you post the DUPLICATES you found, and I can just look them over? Thanks!

@sirMackk
Copy link
Contributor Author

Whoooo!

That's interesting - maybe they were removed before the migration so the migration didn't pick 'em and mark 'em? I'll check what I had on the test db first thing later today as I'm attending HOPE X atm :).

@jywarren
Copy link
Member

Oh cool lots of my friends are too! Say hi to Nadya peek!
On Jul 20, 2014 12:24 PM, "sirMackk" notifications@github.com wrote:

Whoooo!

That's interesting - maybe they were removed before the migration so the
migration didn't pick 'em and mark 'em? I'll check what I had on the test
db first thing later today as I'm attending HOPE X atm :).


Reply to this email directly or view it on GitHub
#73 (comment).

@sirMackk
Copy link
Contributor Author

Will do if I run into her! :)

@sirMackk
Copy link
Contributor Author

I ran a quick query to get all the dupes on the test database and here's what comes up (ie. where path like "DUPLICATE%"):

| nid   | title                                                         |
+-------+---------------------------------------------------------------+
|     2 | Gulf Coast                                                    |
|    22 | Balloon & Kite Mapping                                        |
|    37 | Lima                                                          |
|    65 | Providence                                                    |
|    90 | Guides                                                        |
|   178 | Somerville, Massachusetts                                     |
|   589 | Thermal Camera: Arduino + MLX90614 IR Thermometer             |
|  5719 | Bayou Safe Bay. Oil Spill?                                    |
|  6963 | Air Quality Class 7                                           |
|  6968 | Air Quality Class 6                                           |
|  9246 | Denzel Washington's film Flight premieres in Hollywood        |
|  9251 | Fashion Rio Summer 2013                                       |
|  9252 | Does your driveway require a makeover                         |
|  9253 | 1000 on summer time fashion                                   |
|  9285 | Does your driveway require a makeover                         |
|  9287 | Does your driveway require a makeover                         |
|  9288 | Does your driveway require a makeover                         |
|  9289 | DESIGNER Shoes Low-cost Related Articles                      |
|  9292 | Does your driveway require a makeover                         |
|  9647 | Does your driveway require a makeover                         |
|  9648 | Does your driveway require a makeover                         |
|  9651 | Does your driveway require a makeover                         |
|  9653 | Does your driveway require a makeover                         |
|  9654 | Does your driveway require a makeover                         |
|  9659 | 1000 on summer time fashion                                   |
|  9660 | Corsa is balck and vibrant                                    |
|  9661 | Does your driveway require a makeover                         |
|  9662 | Does your driveway require a makeover                         |
|  9663 | Corsa is balck and vibrant                                    |
|  9664 | Does your driveway require a makeover                         |
|  9665 | Does your driveway require a makeover                         |
|  9667 | Fashion Rio Summer 2013                                       |
|  9668 | Does your driveway require a makeover                         |
|  9669 | Does your driveway require a makeover                         |
|  9671 | Corsa is balck and vibrant                                    |
|  9672 | Does your driveway require a makeover                         |
|  9673 | Does your driveway require a makeover                         |
|  9674 | Does your driveway require a makeover                         |
|  9675 | Does your driveway require a makeover                         |
|  9676 | Does your driveway require a makeover                         |
| 10232 | hometesting-environmental-estrogens-bibliography|Bibliography |
+-------+---------------------------------------------------------------+

@btbonval
Copy link
Member

Oh man, I'm glad you ran that. I really didn't want to lose the driveway makeover page. Let's make sure that is preserved, folks. I got 24 driveways to makeover and 24 notes to help me do so.

@jywarren
Copy link
Member

Hm, so that means we have two nodes which have the path /wiki/gulf-coast
and so on? But with unique nids? We'll need to look at everything up to "Air
Quality Class 6" -- but maybe we can just delete them? Worth a try on the
test server anyways, and we can test each of the "real" nodes to be sure no
content is actually gone. Odd.

On Sun, Jul 20, 2014 at 11:21 PM, Bryan Bonvallet notifications@github.com
wrote:

Oh man, I'm glad you ran that. I really didn't want to lose the driveway
makeover page. Let's make sure that is preserved, folks. I got 24 driveways
to makeover and 24 notes to help me do so.


Reply to this email directly or view it on GitHub
#73 (comment).

@sirMackk
Copy link
Contributor Author

It means that when the migration was scanning the url_alias table, it happened on two rows with the exact same dst column leading to two different nodes. Rails was using only one - I can't remember if the first or last, but the most up to date one. So the these unloved nodes, whom no one could reach, instead of getting a real-deal working path, got a path that marks them as DUPLICATE instead.

Deleting them shouldn't affect anything since they weren't accessible before the migration, but I left them just in case it turns out we'd prefer to keep and expose Gulf Coast with nid 2 instead of with nid 403 (just made that number up). Deleting them won't affect the real nodes too, so we're good in that department.

@jywarren
Copy link
Member

Let's leave them there, or if someone wants to delete those with nids > 6968
but < 10232 (which are spammy), that's fine. They don't take up much space.

On Tue, Jul 22, 2014 at 6:59 AM, sirMackk notifications@github.com wrote:

It means that when the migration was scanning the url_alias table, it
happened on two rows with the exact same dst column leading to two
different nodes. Rails was using only one - I can't remember if the first
or last, but the most up to date one. So the these unloved nodes, whom no
one could reach, instead of getting a real-deal working path, got a path
that marks them as DUPLICATE instead.

Deleting them shouldn't affect anything since they weren't accessible
before the migration, but I left them just in case it turns out we'd prefer
to keep and expose Gulf Coast with nid 2 instead of with nid 403 (just made
that number up). Deleting them won't affect the real nodes too, so we're
good in that department.


Reply to this email directly or view it on GitHub
#73 (comment).

@sirMackk
Copy link
Contributor Author

Sounds pragmatic and good! Things have been pretty hectic on my end, but when they settle down, I'll try to check out some issues :)

@jywarren
Copy link
Member

Just a weird thing - I see that on https://test.publiclab.org/wiki/balloon-mapping we see an old page, last edit 3 yrs ago, while on https://publiclab.org/wiki/balloon-mapping we see an up-to-date wiki page. Did someone import an old db?

@sirMackk
Copy link
Contributor Author

Oh man, gotta sit down and look at the code, I remember I changed a good deal of how node finds things and this is probably it. So I'm sure it's my code in drupal_node and not the db.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants