Comments on: Data Coding 101 – Intro to Bash – ep6 (last episode)
https://data36.com/command-line-data-awk-sed-join-date/
Learn Data Science the Hard Way!
Last updated: Mon, 04 Apr 2022 09:36:58 +0000

By: Web Scraping Tutorial – episode #1 – Scraping a Webpage (with Bash) (Wed, 05 Feb 2020 23:39:00 +0000)
[…] Introduction to Bash episode #6 […]

By: Scraping Multiple Webpages with For Loops (Web Scraping Tutorial) (Wed, 05 Feb 2020 23:35:45 +0000)
[…] Introduction to Bash episode #6 […]

By: Tomi Mester (Thu, 08 Feb 2018 17:18:41 +0000)
In reply to Balint.

hi Balint,

spot on!

Though my goal was to make the task a bit more difficult than that. 🙂

(Assuming you’re talking about the “Test Yourself” section, see this part:
“For the sake of practicing, let’s say, you can only use these four columns:
1st column: Year
2nd column: Month
3rd column: Day
15th column: ArrDelay” 😉
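With only Year, Month and Day available, the weekday has to be derived rather than read from column 4 — for instance with GNU date’s -d option. A minimal sketch (the sample date 2007-01-08 is just an illustration, not from the dataset):

```shell
# Derive the weekday name from separate Year/Month/Day fields.
# Requires GNU date; LC_ALL=C keeps the weekday name in English.
year=2007; month=01; day=08
weekday=$(LC_ALL=C date -d "${year}-${month}-${day}" +%A)
echo "$weekday"   # Monday
```

From there the pipeline can filter on the derived name instead of a numeric DayOfWeek column.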

Cheers,
Tomi

By: Balint (Thu, 08 Feb 2018 09:35:39 +0000)
Hey,

I was wondering if it wouldn’t be easier to use column 4 as the weekday? So basically:

cut columns 4 and 15,
use awk to filter on the first column where it’s 1 (indicating Monday),
sum up column 15.
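In shell terms, that idea could look something like the sketch below. The file sample.csv and its 15-column layout are invented here to mimic the airline dataset, with DayOfWeek assumed in column 4 and ArrDelay in column 15:

```shell
# Toy stand-in for the airline data: 15 comma-separated fields per line,
# DayOfWeek in field 4, ArrDelay in field 15 (other fields are filler).
printf '2007,1,8,1,x,x,x,x,x,x,x,x,x,x,10\n'  >  sample.csv
printf '2007,1,9,2,x,x,x,x,x,x,x,x,x,x,99\n'  >> sample.csv
printf '2007,1,15,1,x,x,x,x,x,x,x,x,x,x,5\n'  >> sample.csv

# cut fields 4 and 15, keep Mondays (DayOfWeek == 1), sum the delays:
total=$(cut -d, -f4,15 sample.csv | awk -F, '$1 == 1 {sum += $2} END {print sum}')
echo "$total"   # 15  (10 + 5)
```

After cut, awk sees only two fields, so the filter is on $1 and the sum on $2.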

By: Tomi Mester (Tue, 10 Oct 2017 15:30:02 +0000)
In reply to Nishant.

Hey Nishant!
Thanks! Yeah, I meant something like “same as the -d option was for cut, for instance”.
But you’re right, it’s much easier to understand with the example of sort!
So I have updated the article!

Thanks for the input! And glad you like the blog! 😉

Cheers,
Tomi

By: Nishant (Tue, 10 Oct 2017 10:52:33 +0000)
Hi

A small mistake (perhaps):
Excerpt from your article above:
…”define this value with the -t option (same as it was with cut for instance)”…

I suppose you meant ‘sort’ instead of ‘cut’.

Thanks
Nishant

P.S.: Needless to say, your blog has been very helpful in getting me started with bash programming. Keep it up! 🙂

By: Variables, if statements and while loops in bash (Data Coding 101) (Mon, 02 Oct 2017 14:12:45 +0000)
[…] pet-project and learn more and more by doing and less and less by reading articles! Wanna continue? This way please! If you want to get notified about my upcoming articles, videos or webinars, subscribe to my […]

By: Tomi Mester (Tue, 18 Apr 2017 16:30:11 +0000)
In reply to Chris.

hey Chris,

thanks for the comment!
Hm, now you’ve made me think.
Let me check this with some other files myself, and if I find the same as you, I’ll definitely update my article!

I guess the difference here comes from the different hardware capacities. E.g. in my tutorials I use a very, very cheap computer with 512 MB of memory, a 20 GB SSD and only 1 CPU.
But let me check!

Cheers,
Tomi

By: Chris (Wed, 12 Apr 2017 18:23:41 +0000)
Hi Tomi,

This is a nice series of techniques on basic data processing on the Linux command line, and I’ll definitely point my students to it. However, I was surprised to see that you recommend the use of cut + awk rather than using awk alone when summing up a column of numbers. I’ve always used awk only under this circumstance, and never experienced the inefficiency that you mentioned. To check this out I thought I’d give it a try on some random files I have lying around. Here I’m adding up field #14 (in these files it’s the number of records in the line, so mostly in the range 1-48) in 27 CSV files with a combined total of more than 3 million lines.
This is how long it takes with cut and awk:

$ time cut -d, -f14 MDMR-*.DAT | awk '{sum += $1}END{print sum}'
40376443

real 1m3.617s
user 1m4.436s
sys 0m0.619s

and here it is all done in awk:

$ time awk -F, '{sum += $14}END{print sum}' MDMR-*.DAT
40376443

real 0m13.106s
user 0m12.361s
sys 0m0.743s

So it takes almost 6 times longer when using cut.

In case it matters, which I don’t think it does, I’m using GNU Awk 3.1.5 and cut from GNU coreutils 5.97.

Filtering records purely in awk (as in /re/{…}) vs using grep and awk is a different story, however, as grep is super well optimised for this particular task.
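To make that contrast concrete, here is a toy sketch of the two filtering styles on an invented file (airports.csv and its contents are made up for illustration; both pipelines should give the same sum):

```shell
# Two-column toy file: origin airport, delay in minutes.
printf 'ORD,10\nJFK,20\nORD,5\n' > airports.csv

# grep filters the rows, awk does the arithmetic:
with_grep=$(grep '^ORD' airports.csv | awk -F, '{sum += $2} END {print sum}')

# the same filter expressed as an awk /re/ pattern, no grep needed:
awk_only=$(awk -F, '/^ORD/ {sum += $2} END {print sum}' airports.csv)

echo "$with_grep" "$awk_only"   # 15 15
```

Which one is faster will depend on the data and the implementations involved, as the timing discussion above suggests; grep’s matching engine is heavily optimised, while the pure-awk form saves a process and a pipe.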
