Here's an easy, if not always precise, way to remember:
* Hyphens connect things, such as compound words: double-decker, cut-and-dried, 212-555-5555.
* EN dashes make a range between things: Boston–San Francisco flight, 10–20 years: both connect not only the endpoints, but define that all the space between is included. (Compare the last usage with the phone number example under Hyphens.)
* EM dashes break things, such as sentences or thoughts: 'What the—!'; A paragraph should express one idea—but rules are made to be broken.
Unicode has the original ASCII hyphen-minus (U+002d), as well as a dedicated hyphen (U+2010), other functional hyphens such as soft and non-breaking hyphens, and a dedicated minus sign (U+2212), and some variations of minus such as subscript, superscript, etc.
There's also the figure dash "‒" (U+2012), essentially a hyphen-minus that's the same width as numbers and used aesthetically for typesetting, afaik. And don't overlook two-em-dashes "⸺" and three-em-dashes "⸻" and horizontal bars "―", the latter used like quotation marks!
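For anyone who wants to check these code points rather than trust a font, here is a minimal Python sketch using the standard `unicodedata` module. The `DASHES` list and the `describe` helper are my own names for illustration. The characters are written as escapes so that copy/paste normalization can't silently change them.

```python
import unicodedata

# Code points mentioned above, as escapes so they survive copy/paste.
DASHES = [
    "\u002d",  # the ASCII character
    "\u2010", "\u2011", "\u00ad",  # dedicated, non-breaking, and soft hyphens
    "\u2012", "\u2013", "\u2014", "\u2015",  # figure, en, em, horizontal bar
    "\u2212",  # dedicated minus sign
    "\u2e3a", "\u2e3b",  # two-em and three-em dashes
]


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in DASHES:
    print(describe(ch))
```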
> EM dashes break things, such as sentences or thoughts
Some style guides recommend "space, en dash, space" for this, and I prefer that myself – mainly because some software doesn't treat em dashes correctly as word separators for double click selection purposes.
For example, I'm pretty sure that at least some Kindle models would highlight both the word before and after the em dash when selecting one of them, which makes using the dictionary very annoying.
It's actually only your post that made me realize people don't normally put spaces around em dash. In French, Russian and a bunch of other languages proper typesetting is to use em dash as a standard dash character, and you always put spaces around them. So I did it in English as well, for many years now.
(I also now looked up and found out that in Spanish, apparently, you are supposed to put space only on one side of the dash, when used as a direct speech separator.)
I also put spaces around em dashes. It looks wrong—subtly wrong—to me to have the words glued together around the dash. It looks right — completely right — to me to have the dash standing on its own, as if it was a word in its own right.
The reason not to do this is observable in your post on my phone. The spaces cause the word wrapping algorithm to leave a dangling dash at the end of the line which looks ugly. Omitting spaces prevents the word break.
I mentioned that as an advantage in one of my other comments. An advantage both ways, because it depends on preference. I have the same preference as hansvm: I would rather see the dangling dash at the end of the line, so I prefer putting spaces around the dashes. Having the entire word-dash-word structure move to the next line feels ugly to me. As with most things, de gustibus non est disputandum. (And also, quidquid Latine dictum sit altum videtur).
It's the dangling dash at the beginning of the line that gets me. I see a lot of word break algorithms, including the one WebKit (and I suspect Blink) uses, which are happy to break "foo—bar" on either side of the em dash.
Funny, I'd rather have the break at the start or end of the emdash-implied break than just before or after it, not having to mentally handle some single dangling word divorced from its compatriots.
> The reason not to do this is observable in your post on my phone. The spaces cause the word wrapping algorithm to leave a dangling dash at the end of the line which looks ugly. Omitting spaces prevents the word break.
That's an interesting practicality but I don't think it's the cause of the rule: The rule probably long predates automated line breaking. Also, I think automatic line breaking will break compound words at the hyphen; it doesn't require spaces (which is also obvious from a software development point of view: the logic is relatively simple either way):
Lorem ipsum dolor sit amet, consectetur adipiscing double-
decker lorem ipsum dolor sit amet, consectetur ...
Ironically, on my phone the only line that ends with an em dash has no spaces in it.
If you want to not have a line break, you shouldn't rely on arbitrary behavior. You should use non-breaking characters like non-breaking spaces and word joiners.
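A rough sketch of what that could look like in Python, assuming you post-process the text yourself. The function names are hypothetical, and whether a given layout engine honors U+2060 (word joiner) or U+00A0 (no-break space) is something to test rather than assume.

```python
WORD_JOINER = "\u2060"
NBSP = "\u00a0"
EM_DASH = "\u2014"


def glue_em_dashes(text: str) -> str:
    """Put a word joiner on both sides of every em dash in the text."""
    return text.replace(EM_DASH, WORD_JOINER + EM_DASH + WORD_JOINER)


def nbsp_before_spaced_dash(text: str) -> str:
    """Replace the ordinary space before a spaced em dash with a no-break space."""
    return text.replace(" " + EM_DASH + " ", NBSP + EM_DASH + " ")


print(repr(glue_em_dashes("foo\u2014bar")))
print(repr(nbsp_before_spaced_dash("foo \u2014 bar")))
```

The second helper is aimed at the "dash dangling at the start of a line" complaint: the dash stays attached to the preceding word, while the space after it remains a normal break opportunity.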
To each their own: fully agreed, even though our tastes differ. I will mention one advantage of the spaces-around-dashes method: word wrap with default settings will break on the spaces around the dashes so that the entire word one, dash, word two combo doesn't end up pulled onto the next line as a whole unit. Whereas the advantage of the no-spaces method that you prefer is that word wrap will pull the entire word one, dash, word two combo onto the next line as a whole unit.
Why yes, I did list the opposite behavior as an advantage of each. Because that, too, is up to individual preference. :-)
That depends on the layout engine, I believe. Just tried it in Firefox (on macOS; not sure if it uses Core Text or something custom there), and it does sometimes break around the em dash in "foo—bar" style, not just "foo – bar" style.
I've definitely noticed the behavior you describe on some layout engines, too, and it's another reason why I personally prefer "foo – bar" style.
I've wondered about this for similar reasons. I usually omit the spaces but as I said in an earlier post I'll sometimes include them when I think the typography calls for it or when I want to add extra emphasis.
I've come to the conclusion it boils down to which style manual one follows. I've taken a careful look at numbers of high-end books which no doubt have been carefully typeset and I've found EM dashes with and without spaces.
It seems there is no definitive rule but I might be wrong.
For what it's worth, I was in the last class in my high school to learn typing on IBM Selectric typewriters. We were taught to type two spaces, two hyphens, then two spaces. Incidentally, we were taught two spaces after periods and colons. To this day, I find it hard to read text that doesn't have proper spacing after periods. (HTML and WYSIWYG word processors handle formatting, but e.g. fixed-font text editors don't)
It's funny that people think that conventions for typewritten text built around the limitations of typewriters define what is “proper” in environments where typewriters and their limitations are not involved.
Yes, this always grinds my gears too. There is already a slightly larger space after periods in contemporary typefaces.
The old typewriter typefaces were monospaced, i.e., every character was the same width, but this is no longer the case. Virtually all typefaces today are proportionally spaced, not monospaced. So it’s redundant to leave extra room after periods.
What does this have to do with what I wrote? I said nothing of the sort. In fact, I explicitly pointed out that HTML and WYSIWYG word processors address it automatically.
I was under the impression that you do "-" for hyphen, "--" for En dash, and "---" for Em dash. IIRC, LaTeX (or maybe the editor, it has been some time) even helpfully changes that for you to the correct dash.
> I was under the impression that you do "-" for hyphen, "--" for En dash, and "---" for Em dash. IIRC, LaTeX (or maybe the editor, it has been some time) even helpfully changes that for you to the correct dash.
The conversion of '--' to an en dash and '---' to an em dash is done by the TeX compiler, and appears in the rendered file, but I think that most TeX editors don't change the TeX code itself. (This is distinct from XeTeX-based compilers, which can handle non-ASCII Unicode characters like the em dash '—' directly in the source.)
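To illustrate the mapping only (this is not how TeX implements it; TeX does it through font ligatures at typesetting time), here is a naive Python sketch. The function name is made up, and it ignores verbatim blocks, math mode, and comments, where a real converter would have to leave hyphens alone.

```python
def tex_dashes_to_unicode(source: str) -> str:
    """Replace '---' with an em dash and '--' with an en dash.

    The longer run is replaced first so that '---' is not consumed
    as '--' followed by a stray '-'.
    """
    return source.replace("---", "\u2014").replace("--", "\u2013")


print(tex_dashes_to_unicode("pages 10--20"))
print(tex_dashes_to_unicode("one idea---but rules are made to be broken"))
```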
(I think that the article's point is that, in some fonts, -- (two hyphens) is literally the (approximate) size of an em dash, not that it is always understood as meaning an em dash. At least in my font, --- (three hyphens) is far too long to literally look like an em dash:
---
--
—
–
(in order, three hyphens, two hyphens, em dash, en dash).)
British typesetting style is a little different from US style in the way dashes are presented. In the UK, you might see a thin-space--en-dash---thin-space where a US typesetter would use an em-dash. Typewriter style generally follows book style. Since typesetters no longer use an extra space after punctuation, it's vestigial in typing.
How so? One is the only way to approximate an en or em dash on a typewriter or in a charset that doesn’t have one, the other seems like a workaround of a typesetting bug at best.
-, --, --- is, IIRC, how it is done in LaTeX and would be exceedingly simple to do on a typewriter. That being said, to break up sentences I use " -- " because I think it looks nicer than "---". I'll go now ;)
LaTeX is a markup language though, not ASCII art. I can get behind two dashes as a substitute if no en dash is available, but three seems too much and looks like halfway to a horizontal line to me ;)
TeX puts more space after periods/fullstops (which is why you're supposed to do special markup or other measures to mark '.' in the middle of sentences which aren't sentence-enders (e.g. like e.g.)). But it's generally smaller than the equivalent of two manual spaces.
(A nice thing in (La)TeX is that one could follow the "two spaces after a full-stop" rule, which then has the advantage of being an explicit marking for sentence boundaries (which your editor might be able to navigate; Emacs has a convention of assuming two spaces after a sentence-ending '.'), but then the TeX typesetting will take care of making it look right. I lost the habit of actually doing this, for better or worse, except when flycheck/checkdoc/package-linter.el makes me do it for docstrings.)
I used to feel similarly. Now I find the double space a visual distraction that doesn't in any way improve readability.
The effect of the double space is, I suspect, a product of the reader's expectations: if you expect it, its absence creates mental work, detracting from readability; if you don't expect it, its presence is what creates mental work.
Hard habit to break. I learned it so long ago too.
Haha I learned to type organically, and it was only in my mid-40s that I retrained myself to type the correct way. It took something like 40 hours of practice on keybr.com before I could get close enough to my regular typing speed, such that I could switch over to the 'correct' method without it impacting my work.
Retraining myself to stop doing double-spaces took maybe a week.
> Some style guides recommend "space, en dash, space" for this
The last paragraph of the article also addressed the subjective nature of spacing around the em dash:
> Spacing around an em dash varies. Most newspapers insert a space before and after the dash, and many popular magazines do the same, but most books and journals omit spacing, closing whatever comes before and after the em dash right up next to it.
As far as the selection detail, did you mean that you replace an em dash used like a comma or parenthesis with spaces and an en dash for specific highlight performance issues? Surely the spaces and an em dash would alleviate the selection highlight behavior and not muddy the waters of when to use an em vs. an en dash?
> Spacing around an em dash varies. Most newspapers insert a space before and after the dash, and many popular magazines do the same, but most books and journals omit spacing, closing whatever comes before and after the em dash right up next to it.
It's funny that they omit to mention the possibility of setting it off with a thin space ' ' or hair space ' ' (those are the thin-space and hair-space Unicode characters, though they show up full width for me), which I thought was preferred typographic practice.
(On Googling, maybe the reason that they don't mention it is that I was imagining it; I can't find any evidence for my belief.)
> those are the thin-space and hair-space Unicode characters, though they show up full width for me
Interestingly, at least in my browser and when grabbing the direct link to the comment with curl, both show up as byte 0x20. Perhaps the comment submission handler, or even the browser, collapsed your more specific U+2009 (thin) and U+200A (hair) spaces into the regular U+0020 space?
> Interestingly, at least in my browser and when grabbing the direct link to the comment with curl, both show up as byte 0x20. Perhaps the comment submission handler, or even the browser, collapsed your more specific U+2009 (thin) and U+200A (hair) spaces into the regular U+0020 space?
Probably! I think HN strips out emoji; maybe it just takes the safest approach and strips out all non-white-listed Unicode.
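For reference, this is roughly what you would expect to see on the wire if the special spaces had survived. A small Python sketch (the helper name `utf8_hex` is mine; `bytes.hex` with a separator needs Python 3.8 or later):

```python
THIN_SPACE = "\u2009"
HAIR_SPACE = "\u200a"


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of a string as space-separated hex bytes."""
    return ch.encode("utf-8").hex(" ")


print("thin space:", utf8_hex(THIN_SPACE))
print("hair space:", utf8_hex(HAIR_SPACE))
print("plain space:", utf8_hex(" "))
```

Both special spaces encode to three bytes in UTF-8, so seeing a single 0x20 in the fetched page means something replaced them before the page was served.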
Company I used to work for used AP for things like press releases and, I think, official blog posts and Chicago plus a couple different tech style guides for everything else.
Basically, we didn’t like some things in AP but we wanted to make it easy for journalists to copy/paste.
The good thing about style guides is that they’re guides, not laws :)
That’s one thing I really like about English: There’s no central authority decreeing what’s right and what’s wrong top down, and it feels like there is some room for individual preferences and experimentation.
Very refreshing, compared to e.g. German, which has more than one semi-official authority gate keeping “correctness” in speech and writing.
A semicolon connects, whereas an em-dash creates more of a pause and therefore separates. In addition, em-dashes can be used in pairs to create a parenthesis, which semicolons can’t. I think with time you will appreciate the difference.
Dashes surround a sub-clause - something like this - which is like a parenthetical addition to a sentence that could stand alone without it; semi-colons (';') connect a further sentence or part of one where perhaps a full-stop and additional word could have been. They also sometimes separate list items following a colon, especially if the things listed are longer sentences perhaps themselves containing commas that'd otherwise be ambiguous.
Em dashes are very similar to semicolons. You use em dashes if your related sentence is in the middle of another sentence, and semicolons if it's at the end.
They're frequently used in skilled and professional grade writing.
So as not to mislead anyone, the parent is mostly incorrect:
Here's an example sentence: Semicolons must have independent clauses—phrases that could form a full sentence on their own—on both sides of them; they are essentially alternatives for periods. Em dashes don't require independent clauses on either side.
In the italicized sentence,
* phrases that could form a full sentence on their own is not an independent clause but is valid between em dashes. on both sides of them, after the em dashes, is also not an independent clause. (The em dashes function like commas or parentheses here.)
* The parts before and after the semicolon are independent clauses. You could replace the semicolon with a period and you'd have perfectly valid grammar. I just chose to connect the two sentences a bit more.
I don't know if you can use em dashes as the parent comment describes, connecting three independent clauses:
* My favorite fruit is peaches—they are very sweet—I eat them all summer.
I think the above is wrong; it should be one of the following:
* My favorite fruit is peaches—they are very sweet—and I eat them all summer.: The last section is a dependent clause made by "and", not an independent clause.
* My favorite fruit is peaches—they are very sweet; I eat them all summer.: On both sides of the semicolon are independent clauses; I could replace the semicolon with a period.
Maybe there are examples I'm not thinking of? I infer that the rule might be that the punctuation following the em-dashed clauses should be the punctuation that would have been used without the em-dashed clause, but that's based on very limited evidence.
Many people don't use semicolons (;) in English but many do, and they are certainly part of correct grammar.
Semicolons are generally alternatives to periods, when you want more connection between the two sentences. Like periods, semicolons must have two full sentences—that is, what could be full sentences—on either side of them; the potential 'full sentences' are properly called independent clauses. (A dependent clause needs the rest of the sentence to form valid grammar; it can't function on its own. For example, in this paragraph's first sentence, when you want more connection between the two sentences is a dependent clause. Often they follow commas.)
Another use of semicolons is for lists in a paragraph where one of the list items has a comma in it (similar to the parsing problem for CSVs where some records contain commas): I only like wine; beer, but only ales; and orange juice.
> Unicode has the original ASCII hyphen-minus (U+002d), as well as a dedicated hyphen (U+2010), other functional hyphens…
Which can be fun when parsing CSV files from various sources. I've hit numbers with U+2010 or others where you would expect a hyphen-minus. Presumably someone² has copied a negative number from a document where one of the alternate symbols was used, and pasted it into everyone's favourite data-mangler¹ which interpreted it as a string, and so on down the chain.
--------
[1] Excel. Sometimes a joy, sometimes the bane of my existence.
[2] It is surprising, horrifying even, how much manual manipulation of data goes on in banking, where you might naturally assume everything is more automated these days. Sometimes a laborious manual process done regularly is seen as cheaper than paying for it to be automated…
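One defensive approach to the CSV problem above is to map the look-alike characters to the ASCII hyphen-minus before parsing. A minimal Python sketch, where the set of look-alikes and the name `parse_number` are my own assumptions rather than anything from a particular library:

```python
# Characters that sometimes stand in for a minus sign in pasted data.
# Which ones to accept is a judgment call; this list is an assumption.
MINUS_LOOKALIKES = "\u2010\u2011\u2012\u2013\u2212"
_TO_HYPHEN_MINUS = str.maketrans({ch: "-" for ch in MINUS_LOOKALIKES})


def parse_number(field: str) -> float:
    """Parse a numeric CSV field, tolerating non-ASCII minus look-alikes."""
    cleaned = field.strip().translate(_TO_HYPHEN_MINUS)
    return float(cleaned)


print(parse_number("\u20101234.5"))  # U+2010 HYPHEN used as a minus
print(parse_number("\u221242"))      # U+2212 MINUS SIGN
```

Silently accepting an en dash as a minus can hide a real data problem (it might have been a range), so in practice it may be better to log every substitution.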
G. Branden Robinson swears by U+2010 for hyphens in groff's Unicode output [0], but I see it as a hypercorrection. The most common convention by far (among authors who use Unicode and care about dashes) is to use U+002D for hyphens and U+2212 for minus signs. Not even the Unicode Consortium uses U+2010 for hyphens in its documents, and I'm not aware of any major organization that does.
As far as appearance goes, almost all fonts I've looked at make U+2010 identical to U+002D (i.e., they don't put any 'minus' into the 'hyphen-minus'), but a few make U+2010 a smidgeon shorter.
Intl.NumberFormat also prefers it, but then you can't paste negative numbers into most financial software, calculators, spreadsheets. Even back into inputs on the same webpage, if it does custom number parsing. Even though <input type=number> accepts U+2212 as a minus, it turns it into a regular minus when you spin it down to -2.
It looks much better though and more visible: −1 vs -1. I wish hyphen was a separate symbol from the ASCII start, or that monospace fonts didn't tend to shorten "-", because it makes little sense in monospace anyway.
— In the context of automatic text processing, it unambiguously indicates the function of a hyphen, as opposed to a minus
— Fonts can choose to make the hyphen-minus a bit wider than a regular hyphen, to accommodate the usage as a minus sign. In that case, U+2010 would be typographically more appropriate for a hyphen, similar to how U+2212 usually is typographically more appropriate for a minus sign.
The visual style of hyphen-minus depends on the font. Some fonts display it more like a minus, others like a hyphen. So if you care about distinguishing hyphen and minus, it makes sense to use the dedicated hyphen and minus, and not use hyphen-minus at all.
It's infuriating that people are drawing this conclusion. LLMs pick up on em dash usage because professional and skilled writers use em dashes. They're a consistently useful, if niche, part of the literary toolkit.
But, no, now it's a problem because the majority of people's experience with writing is graded essays. And because LLMs emulate professionals, it's now a red flag if students write too much like professionals. What a joke.
Ha, good point, and an interesting question: What kinds of dashes did Dickinson intend?
It's a hard one to answer: We could look at published Emily Dickinson books from the time, but did Dickinson really pay that close attention to or have that much control over the type?
We could look at Dickinson's actual personal documents, but if they were handwritten, distinguishing dashes could be difficult even if there was intention there.
Fortunately we have troves of her handwritten documents; all of her poems were first printed posthumously. To me, she's using the punctuation as pacing or tonal markers as opposed to ligatures ("I'll clutch— and clutch— " vs "I'll clutch-and clutch-"). Many publishers style these marks as longer than normal m-dashes for that reason, which makes sense seeing as they are rarely used as asides.
Em-dashes have been the norm in every Dickinson poem I read, and I think it might have derived from the preferences of Victorian publishers, who I understand loved those long dashes.
I imagine it would have been up to the typesetter to make the call. The conventions for dash usage are fairly straightforward. You use em-dashes for asides, en dashes for ranges, and hyphens for most other cases. It's easy to figure out the right character from context (apart from en ranges vs hyphen ranges).
You want Robert Bringhurst, poet and typographic nerd. He gives them special withering attention in his Elements of Typographic Style. I think he referred to them as Victorian excrescences?
However, this is the kind of rule that "existed" for a while and most likely will go away, as most people can't be bothered with the difference and it all looks similar anyway.
Or maybe, who knows, it will keep going because ChatGPT knows it.
That doesn't seem to be an array at all, if the idea is to check whether a number is within a range. Seems like an interesting data type though, a combination of a range data type and a map/associative array.
I was thinking of a sparse array but any name will do. obj[~42] ?
One may have a bunch of key ranges, each associated with a value, or one may have a key that should be "rounded" to the nearest key, or want to retrieve the one below or above it.
It feels like something basic enough to have in a language and I found it oddly complicated to write myself. Comparing it with all values doesn't seem like a very good solution.
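For what it's worth, a sorted list of keys plus binary search avoids comparing against every value. Here is a minimal Python sketch using the standard `bisect` module; the `RangeMap` class and its method names are hypothetical, and each key is treated as the start of a range that runs up to the next key.

```python
import bisect


class RangeMap:
    """Maps sorted start keys to values; each range runs up to the next start."""

    def __init__(self, starts_and_values):
        pairs = sorted(starts_and_values, key=lambda pair: pair[0])
        self._starts = [start for start, _ in pairs]
        self._values = [value for _, value in pairs]

    def floor(self, key):
        """Return the value whose start is the largest one <= key."""
        i = bisect.bisect_right(self._starts, key) - 1
        if i < 0:
            raise KeyError(key)
        return self._values[i]

    def nearest(self, key):
        """Return the value whose start is closest to key (ties go to the lower start)."""
        if not self._starts:
            raise KeyError(key)
        i = bisect.bisect_left(self._starts, key)
        if i == 0:
            return self._values[0]
        if i == len(self._starts):
            return self._values[-1]
        before, after = self._starts[i - 1], self._starts[i]
        return self._values[i] if after - key < key - before else self._values[i - 1]


bands = RangeMap([(0, "low"), (40, "mid"), (100, "high")])
print(bands.floor(42), bands.nearest(95))
```

Lookups are O(log n). Inserting new keys into the sorted lists is O(n), so a balanced tree would be the better fit if the map changes often.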
Re last paragraph: dashes, etc. are confusing for perhaps most of us who aren't, say, typesetters, myself included. I use EM dashes a lot usually without a space between words and sometimes with spaces when I think the typography calls for it—or for extra emphasis.
Essentially, most of us guess the rules and often this doesn't matter much but it can in certain circumstances.
For example, consider machine conversion/transliteration. The ASCII dash is often used as a substitute for the Unicode minus sign because it's easy to select [it's my usual practice], and anyway many don't know there is an actual difference. Whilst a human will usually know the difference by its use or context, a machine may take the literal interpretation, which could lead to, say, a numerical calculation error.
This problem has annoyed me for a long while. Why is it that word processors and editors do not highlight these characters and query whether the usage is correct? Surely this ought not to be that difficult.
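As a rough idea of what such a check could look like, here is a small Python sketch that flags hyphen-minus characters in contexts where another character might have been intended. The three patterns and their messages are my own guesses at useful heuristics, not rules from any style guide, and they will produce false positives (phone numbers, for instance), which may be part of why editors don't do this by default.

```python
import re

# Each entry: (pattern, question to ask the writer). All heuristics are assumptions.
CHECKS = [
    (re.compile(r"(?<=\d)-(?=\d)"),
     "hyphen-minus between digits: range (U+2013) or subtraction (U+2212)?"),
    (re.compile(r"(?<!\S)-(?=\d)"),
     "hyphen-minus before a number: minus sign (U+2212)?"),
    (re.compile(r"(?<=\s)--?(?=\s)"),
     "spaced hyphen(s): en dash (U+2013) or em dash (U+2014)?"),
]


def flag_dashes(text: str):
    """Return sorted (position, question) pairs for suspicious hyphen-minus uses."""
    hits = []
    for pattern, question in CHECKS:
        for match in pattern.finditer(text):
            hits.append((match.start(), question))
    return sorted(hits)


for position, question in flag_dashes("It was -5 degrees for 10-20 days -- brr."):
    print(position, question)
```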
Another example is Roman numerals. The average person will enter say an uppercase 'I' for the Roman numeral one. Here's a typical example which is incorrect:
WWII
Here I entered the normal ASCII 'I' because it was too involved to find the correct Unicode character for Roman numeral one.
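The dedicated characters do exist, and Python's standard `unicodedata` module can show how they relate to the plain letters. One caveat, as far as I know: the Unicode Standard itself says that composing Roman numerals from ordinary Latin letters is preferable for most purposes, with the dedicated code points there mainly for compatibility with East Asian character sets, so "WWII" typed with two capital I's may not be incorrect after all. The helper name `fold_compat` below is mine.

```python
import unicodedata

ROMAN_TWO = "\u2161"  # a single code point, not two letters


def fold_compat(text: str) -> str:
    """Apply NFKC normalization, which replaces compatibility characters
    such as U+2161 with their plain-letter equivalents."""
    return unicodedata.normalize("NFKC", text)


print(unicodedata.name(ROMAN_TWO))
print(fold_compat("WW" + ROMAN_TWO) == "WWII")
```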
I'd like to know what others who are in typography, machine learning etc. think about this, and why WP programs and editors don't have simple ergonomics that allow for easy selection of the correct character.
† On a related matter, you'll note I've used single quotes whereas mmooss uses double quotes. This tells me that mmooss is likely in the US whereas I'm not. Again, this is not really a major problem for humans but it can be in transliteration, etc. Also, it's unclear (at least to me) what the default is for quoting quotes, i.e.: "" versus "' (right, I've refrained from using triple quotes).
Again, this seems country specific with I believe the US favoring double followed by single. Even when these rules are defined do people strictly adhere to them?
This one is U+4E00, CJK Unified Ideograph-4E00. So it's a common character between Chinese, Japanese, and Korean. This should be "one" in all three. And it does technically look a little different than a dash: https://unicodeplus.com/U+4E00
And this is different from Japanese's chuuonpu (U+30FC) which is a vowel elongation mark, and it's rendered horizontally or vertically depending on whether the text direction is horizontal or vertical, respectively.
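Since the two look alike in many fonts, the Unicode properties are a more reliable way to tell them apart from a dash. A short Python sketch with the standard `unicodedata` module (the `profile` helper is a name I made up):

```python
import unicodedata

CJK_ONE = "\u4e00"
CHOONPU = "\u30fc"
EM_DASH = "\u2014"


def profile(ch: str):
    """Return (name, general category) for a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)


for ch in (CJK_ONE, CHOONPU, EM_DASH):
    print(f"U+{ord(ch):04X}", *profile(ch))

# The ideograph also carries a numeric value; the other two do not.
print(unicodedata.numeric(CJK_ONE))
```

The general categories differ: the ideograph is a letter (`Lo`), the prolonged sound mark is a modifier letter (`Lm`), and the em dash is dash punctuation (`Pd`).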