Golang: Counting Digits and Letters in a String

恋喵大鲤鱼 2022-06-05

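1. Requirements

This post records how our project enforces a length limit on user-generated (UGC) text.

For various reasons, most products cap the length of the text a user can enter.

Product positioning: Twitter's 140-character limit, for example, keeps content concise and easy to read;
Reading experience: too much text causes reading fatigue, and a suitable length makes reading more comfortable;
Technology and cost: unbounded UGC content brings potential problems, such as higher storage cost and lower retrieval efficiency.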

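Our own project is a scenario in which users publish posts. The product team gave the following requirements:

Post title: at most 25 characters;
Post body: at most 1500 characters;
Definition of a character: 1 Chinese character counts as 1, 1 Emoji counts as 1, and 2 digits/English letters count as 1.

Normally a Chinese character, an Emoji, a digit, and an English letter are each a separate character. Because 2 digits/letters count as 1 here, we cannot simply convert the string to []rune and take its length. Instead, we count the digits and English letters, halve that count, and add the number of remaining characters to get the length. The key to meeting the requirement is therefore counting the digits and English letters in the user's input.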
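The rule above can be sketched as follows. This is a minimal sketch, not the project's actual code: the names countASCIIAlnum, WeightedLen, and WithinLimit are hypothetical, and two points the requirement leaves open are filled with assumptions. An odd number of digits/letters is rounded up, and every non-alphanumeric rune counts as 1, so an Emoji made of several code points would count more than once.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countASCIIAlnum counts ASCII digits and English letters byte by byte.
func countASCIIAlnum(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		if ('0' <= c && c <= '9') || ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') {
			n++
		}
	}
	return n
}

// WeightedLen (hypothetical helper) returns the length of s under the product
// rule: every non-alphanumeric rune counts as 1, two digits/letters count as 1.
func WeightedLen(s string) int {
	alnum := countASCIIAlnum(s)
	others := utf8.RuneCountInString(s) - alnum
	// Assumption: an odd number of digits/letters is rounded up.
	return others + (alnum+1)/2
}

// WithinLimit (hypothetical helper) reports whether s fits within limit.
func WithinLimit(s string, limit int) bool {
	return WeightedLen(s) <= limit
}

func main() {
	title := "108条梁山man"
	fmt.Println(WeightedLen(title)) // 6: 3 Chinese characters + 6 digits/letters / 2
	fmt.Println(WithinLimit(title, 25))
}
```

With such a helper, the title check becomes WithinLimit(title, 25) and the body check WithinLimit(body, 1500).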

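2. Implementation

In Golang there are generally two approaches.

2.1 ASCII code values

The ASCII code values of digits and English letters are known (48-57 for digits, 65-90 for uppercase letters, 97-122 for lowercase letters), so traversing the original string is enough to count them.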

// GetAlphanumericNumByASCII returns the number of ASCII digits and English
// letters in s, comparing each byte against ASCII code values.
func GetAlphanumericNumByASCII(s string) int {
	num := 0
	for i := 0; i < len(s); i++ {
		switch {
		case 48 <= s[i] && s[i] <= 57: // digits '0'-'9'
			fallthrough
		case 65 <= s[i] && s[i] <= 90: // uppercase letters 'A'-'Z'
			fallthrough
		case 97 <= s[i] && s[i] <= 122: // lowercase letters 'a'-'z'
			num++
		}
	}
	return num
}

// Alternatively:
// GetAlphanumericNumByASCIIV2 returns the number of ASCII digits and English
// letters in s, iterating rune by rune instead of byte by byte.
func GetAlphanumericNumByASCIIV2(s string) int {
	num := 0
	for _, c := range s {
		switch {
		case '0' <= c && c <= '9': // digits
			fallthrough
		case 'a' <= c && c <= 'z': // lowercase letters
			fallthrough
		case 'A' <= c && c <= 'Z': // uppercase letters
			num++
		}
	}
	return num
}

2.2 Regular expressions

We can use the Golang standard library package regexp to count the substrings that match a given expression.

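import "regexp"

// GetAlphanumericNumByRegExp returns the number of digits and English letters
// in s using regular expressions.
// Note: both patterns are compiled on every call.
func GetAlphanumericNumByRegExp(s string) int {
	rNum := regexp.MustCompile(`\d`)          // matches one ASCII digit
	rLetter := regexp.MustCompile(`[a-zA-Z]`) // matches one English letter
	return len(rNum.FindAllString(s, -1)) + len(rLetter.FindAllString(s, -1))
}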

We can write unit tests to verify that the three functions above are correct.

package string

import "testing"

// alnumCases is shared by the three tests below: every implementation must
// return the same count for the same input.
var alnumCases = []struct {
	name string
	s    string
	want int
}{
	{name: "contains digits", s: "108条梁山好汉", want: 3},
	{name: "contains letters", s: "一百条梁山man", want: 3},
	{name: "contains digits and letters", s: "108条梁山man", want: 6},
}

func TestGetAlphanumericNumByASCII(t *testing.T) {
	for _, tt := range alnumCases {
		t.Run(tt.name, func(t *testing.T) {
			if got := GetAlphanumericNumByASCII(tt.s); got != tt.want {
				t.Errorf("GetAlphanumericNumByASCII() = %v, want %v", got, tt.want)
			}
		})
	}
}

func TestGetAlphanumericNumByASCIIV2(t *testing.T) {
	for _, tt := range alnumCases {
		t.Run(tt.name, func(t *testing.T) {
			if got := GetAlphanumericNumByASCIIV2(tt.s); got != tt.want {
				t.Errorf("GetAlphanumericNumByASCIIV2() = %v, want %v", got, tt.want)
			}
		})
	}
}

func TestGetAlphanumericNumByRegExp(t *testing.T) {
	for _, tt := range alnumCases {
		t.Run(tt.name, func(t *testing.T) {
			if got := GetAlphanumericNumByRegExp(tt.s); got != tt.want {
				t.Errorf("GetAlphanumericNumByRegExp() = %v, want %v", got, tt.want)
			}
		})
	}
}

Run the command go test main/string, where main/string is the path of the package containing the unit tests. The output is:

ok main/string 0.355s

All tests pass.

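3. Performance comparison

Both approaches above can count the digits and English letters in a string, so which one should we use?

Functionally they are equivalent, so let's compare their performance.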

func BenchmarkGetAlphanumericNumByASCII(b *testing.B) {
	for n := 0; n < b.N; n++ {
		GetAlphanumericNumByASCII("108条梁山man")
	}
}
func BenchmarkGetAlphanumericNumByASCIIV2(b *testing.B) {
	for n := 0; n < b.N; n++ {
		GetAlphanumericNumByASCIIV2("108条梁山man")
	}
}
func BenchmarkGetAlphanumericNumByRegExp(b *testing.B) {
	for n := 0; n < b.N; n++ {
		GetAlphanumericNumByRegExp("108条梁山man")
	}
}

Run the benchmarks above. The command and its output are as follows:

go test -bench=. -benchmem main/string

goos: windows
goarch: amd64
pkg: main/string
cpu: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
BenchmarkGetAlphanumericNumByASCII-8 89540210 12.67 ns/op 0 B/op 0 allocs/op
BenchmarkGetAlphanumericNumByASCIIV2-8 63227778 19.11 ns/op 0 B/op 0 allocs/op
BenchmarkGetAlphanumericNumByRegExp-8 465954 2430 ns/op 1907 B/op 27 allocs/op
PASS
ok main/string 3.965s

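The result is startling. The regular-expression implementation is concise, but it allocates memory many times per call (27 allocs/op) and is far slower than the ASCII approach: about 190 times slower than the byte-wise version (2430 ns/op versus 12.67 ns/op). Part of this cost comes from compiling both patterns on every call. For performance, the ASCII approach is therefore recommended for counting digits and letters.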
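If a regular expression is still preferred for readability, one common mitigation is to compile the pattern once at package level rather than on every call. The sketch below is an illustration only: the names alnumRE and CountAlnumPrecompiled are hypothetical, and this variant was not part of the benchmark above, so no timing is claimed for it. It relies on the fact that \d in Go's regexp package matches only the ASCII digits 0-9, so a single character class covers both original patterns.

```go
package main

import (
	"fmt"
	"regexp"
)

// alnumRE is compiled once, when the package is initialized, instead of on
// every call. The single class [0-9a-zA-Z] replaces the two original patterns.
var alnumRE = regexp.MustCompile(`[0-9a-zA-Z]`)

// CountAlnumPrecompiled (hypothetical name) counts ASCII digits and English
// letters in s. FindAllStringIndex returns match positions only, so no
// substring is built for each match.
func CountAlnumPrecompiled(s string) int {
	return len(alnumRE.FindAllStringIndex(s, -1))
}

func main() {
	fmt.Println(CountAlnumPrecompiled("108条梁山man")) // 6
}
```

Even with this change, the regexp engine does more work per character than a plain comparison loop, so the ASCII approach is likely to remain the faster choice.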

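The ASCII approach can traverse the string in two ways: byte by byte, or rune by rune. The latter has to decode UTF-8 into runes, so it is somewhat slower (19.11 ns/op versus 12.67 ns/op). Byte-wise traversal is also safe here, because every byte of a multi-byte UTF-8 character is 0x80 or higher and can never be mistaken for an ASCII digit or letter. Byte-wise traversal is recommended.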
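To illustrate why the two traversals give the same count, here is a small sketch that counts the same strings both ways. The helper names isASCIIAlnum, countByBytes, and countByRunes are hypothetical and exist only for this comparison.

```go
package main

import "fmt"

// isASCIIAlnum reports whether c is an ASCII digit or English letter.
func isASCIIAlnum(c byte) bool {
	return ('0' <= c && c <= '9') || ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
}

// countByBytes indexes the string directly, so it sees raw UTF-8 bytes.
func countByBytes(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if isASCIIAlnum(s[i]) {
			n++
		}
	}
	return n
}

// countByRunes ranges over the string, which decodes UTF-8 into runes.
func countByRunes(s string) int {
	n := 0
	for _, r := range s {
		// Only runes below 0x80 are ASCII; check before converting to byte.
		if r < 0x80 && isASCIIAlnum(byte(r)) {
			n++
		}
	}
	return n
}

func main() {
	for _, s := range []string{"108条梁山man", "go😀1", "一百条梁山好汉"} {
		// len(s) is the byte length, which differs from the rune count
		// whenever s contains multi-byte characters.
		fmt.Println(len(s), countByBytes(s), countByRunes(s))
	}
}
```

The byte length differs between the strings, but the two counts always agree, because no byte of a multi-byte character falls in the ASCII range.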

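4. Summary

This post presented two ways to count the digits and letters in a string:

ASCII code values.
Regular expressions.

For performance, the ASCII approach with byte-wise traversal is recommended.

In addition, the source code for the two approaches and three implementations given here is available in the open-source library go-huge-util and can be imported and used directly.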

package main

import (
	"fmt"

	huge "github.com/dablelv/go-huge-util"
)

func main() {
	fmt.Println(huge.GetAlphanumericNumByASCII("108条梁山man"))   // 6
	fmt.Println(huge.GetAlphanumericNumByASCIIV2("108条梁山man")) // 6
	fmt.Println(huge.GetAlphanumericNumByRegExp("108条梁山man"))  // 6
}
